Update readme

README.md CHANGED
@@ -145,6 +145,8 @@ With Phi-4-multimodal-instruct, a single new open model has been trained across

It is anticipated that Phi-4-multimodal-instruct will greatly benefit app developers and various use cases. The enthusiastic support for the Phi-4 series is greatly appreciated. Feedback on Phi-4 is welcomed and crucial to the model's evolution and improvement. Thank you for being part of this journey!

## Model Quality

+<details>
+<summary>Click to view details</summary>

To understand its capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of benchmarks using an internal benchmark platform (see Appendix A for benchmark methodology). Users can refer to the Phi-4-Mini-Instruct model card for details of the language benchmarks. A high-level overview of model quality on representative speech and vision benchmarks:
@@ -262,6 +264,7 @@ BLINK is an aggregated benchmark with 14 visual tasks that humans can solve very



+</details>

## Usage
@@ -474,6 +477,23 @@ print(f'>>> Response\n{response}')

More inference examples can be found [**here**](https://huggingface.co/microsoft/Phi-4-multimodal-instruct/blob/main/sample_inference_phi4mm.py).

+### vLLM inference
+
+Users can start a server with this command:
+
+```bash
+python -m vllm.entrypoints.openai.api_server --model 'microsoft/Phi-4-multimodal-instruct' --dtype auto --trust-remote-code --max-model-len 131072 --enable-lora --max-lora-rank 320 --lora-extra-vocab-size 0 --limit-mm-per-prompt audio=3,image=3 --max-loras 2 --lora-modules speech=<path to speech lora folder> vision=<path to vision lora folder>
+```
+
+The speech LoRA and vision LoRA folders are inside the Phi-4-multimodal-instruct folder downloaded by vLLM; you can also use the following script to find them:
+
+```python
+from huggingface_hub import snapshot_download
+model_path = snapshot_download(repo_id="microsoft/Phi-4-multimodal-instruct")
+speech_lora_path = model_path + "/speech-lora"
+vision_lora_path = model_path + "/vision-lora"
+```
+
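As a minimal sketch of how a client might call the server launched above, the snippet below uses the `openai` Python package against vLLM's OpenAI-compatible endpoint. It assumes the server is running locally on vLLM's default port 8000, that the vision adapter was registered under the name `vision` via `--lora-modules` as in the command above, and it uses a hypothetical placeholder image URL:

```python
# Minimal sketch: query the vLLM OpenAI-compatible server launched above.
# Assumes the default endpoint http://localhost:8000/v1 and a vision LoRA
# registered as "vision" via --lora-modules.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="vision",  # route the request through the vision LoRA adapter
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                # Placeholder URL for illustration; replace with a real image
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

Requesting `model="speech"` instead would route the request through the speech adapter registered with the same flag.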
## Training

### Fine-tuning