Update readme

README.md CHANGED
@@ -145,6 +145,8 @@ With Phi-4-multimodal-instruct, a single new open model has been trained across

It is anticipated that Phi-4-multimodal-instruct will greatly benefit app developers and various use cases. The enthusiastic support for the Phi-4 series is greatly appreciated. Feedback on Phi-4 is welcomed and crucial to the model's evolution and improvement. Thank you for being part of this journey!

## Model Quality

+<details>
+<summary>Click to view details</summary>

To understand its capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of benchmarks using an internal benchmark platform (see Appendix A for benchmark methodology). Users can refer to the Phi-4-Mini-Instruct model card for details of the language benchmarks. A high-level overview of model quality on representative speech and vision benchmarks:
@@ -262,6 +264,7 @@ BLINK is an aggregated benchmark with 14 visual tasks that humans can solve very



+</details>

## Usage
@@ -474,6 +477,23 @@ print(f'>>> Response\n{response}')

More inference examples can be found [**here**](https://huggingface.co/microsoft/Phi-4-multimodal-instruct/blob/main/sample_inference_phi4mm.py).

+### vLLM inference
+
+Users can start a server with this command:
+
+```bash
+python -m vllm.entrypoints.openai.api_server --model 'microsoft/Phi-4-multimodal-instruct' --dtype auto --trust-remote-code --max-model-len 131072 --enable-lora --max-lora-rank 320 --lora-extra-vocab-size 0 --limit-mm-per-prompt audio=3,image=3 --max-loras 2 --lora-modules speech=<path to speech lora folder> vision=<path to vision lora folder>
+```
+
+The speech LoRA and vision LoRA folders are inside the Phi-4-multimodal-instruct folder downloaded by vLLM; you can also use the following script to find them:
+
+```python
+from huggingface_hub import snapshot_download
+model_path = snapshot_download(repo_id="microsoft/Phi-4-multimodal-instruct")
+speech_lora_path = model_path + "/speech-lora"
+vision_lora_path = model_path + "/vision-lora"
+```
+
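As a minimal sketch of how a client might call the server launched above, the snippet below uses the `openai` Python package against vLLM's OpenAI-compatible endpoint. It assumes the server is running locally on vLLM's default port 8000, that the vision adapter was registered under the name `vision` via `--lora-modules` as in the command above, and it uses a hypothetical placeholder image URL:

```python
# Minimal sketch: query the vLLM OpenAI-compatible server launched above.
# Assumes the default endpoint http://localhost:8000/v1 and a vision LoRA
# registered as "vision" via --lora-modules.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="vision",  # route the request through the vision LoRA adapter
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                # Placeholder URL for illustration; replace with a real image
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

Requesting `model="speech"` instead would route the request through the speech adapter registered with the same flag.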
## Training

### Fine-tuning