Update README.md
README.md CHANGED
@@ -1,199 +1,161 @@
## Evaluation
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]

---
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct/blob/main/LICENSE
language:
- ja
- en
tags:
- vila
- nvila
- conversational
- multimodal
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
- Efficient-Large-Model/paligemma-siglip-so400m-patch14-448
pipeline_tag: image-text-to-text
---

# Heron-NVILA-Lite-1B-hf

Heron-NVILA-Lite-1B-hf is a vision language model trained for Japanese, based on the [NVILA](https://arxiv.org/abs/2412.04468)-Lite architecture.
This implementation is compatible with [transformers.AutoModelForImageTextToText](https://huggingface.co/docs/transformers/main/en/tasks/image_text_to_text).

## Model Overview

* **Developer**: [Turing Inc.](https://www.turing-motors.com/)
* **Vision Encoder**: [paligemma-siglip-so400m-patch14-448](https://huggingface.co/Efficient-Large-Model/paligemma-siglip-so400m-patch14-448)
* **Projector**: mlp_downsample_2x2_fix
* **LLM**: [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)
* **Supported Languages**: Japanese, English
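
As a quick way to confirm how these components are wired together in the released checkpoint, you can print the model configuration. The sketch below only fetches the config (no weights); the exact names of the sub-configs it prints are not guaranteed here.

```python
from transformers import AutoConfig

MODEL_NAME = "turing-motors/Heron-NVILA-Lite-1B-hf"

# Download only the configuration and print it; it should describe the
# vision encoder, projector, and LLM listed above (field names may vary).
config = AutoConfig.from_pretrained(MODEL_NAME, trust_remote_code=True)
print(config)
```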

## Setup

```bash
# We have developed this implementation with `transformers==4.57.1`. Other versions of transformers may also work, but we have not tested them.
pip install "transformers[torch]==4.57.1" pillow
```
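
If you want to confirm what was installed, a minimal check (in Python, since `torch` is pulled in via `transformers[torch]`):

```python
import torch
import transformers

# Print the installed versions; the snippets in this card were written against transformers 4.57.1.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
```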

## Usage

```python
from transformers import AutoConfig, AutoModelForImageTextToText, AutoProcessor

MODEL_NAME = "turing-motors/Heron-NVILA-Lite-1B-hf"

# Option 1: build the model from its config (weights are randomly initialized; useful for inspecting the architecture)
config = AutoConfig.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModelForImageTextToText.from_config(config, trust_remote_code=True)

# Option 2: load the pretrained weights directly
model = AutoModelForImageTextToText.from_pretrained(MODEL_NAME, trust_remote_code=True, device_map="auto")

# load the processor
processor = AutoProcessor.from_pretrained(MODEL_NAME, trust_remote_code=True)

# show the chat template
print(processor.tokenizer.chat_template)

def generate_content(content: str, images: list | None = None, **kwargs) -> str:
    conversation = [{"role": "user", "content": content}]
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    encoding = processor(text=text, images=images, return_tensors="pt").to(model.device)
    output = model.generate(**encoding, **kwargs)
    return processor.decode(output[0, len(encoding["input_ids"][0]):], skip_special_tokens=True)

# Example: generate from text only
response = generate_content("こんにちは")  # "Hello"
print(response)
print("---" * 40)

# Example: generate from text + image
from PIL import Image
import requests
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
response = generate_content("<image>\n画像を説明してください。", images=[image])  # "Please describe the image."
print(response)
print("---" * 40)

# Example: generate with an explicit GenerationConfig
from transformers import GenerationConfig
generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.5,
    "do_sample": True,
}
generation_config = GenerationConfig(**generation_config)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
response = generate_content(
    "<image>\n画像を説明してください。",
    images=[image],
    generation_config=generation_config
)
print(response)
print("---" * 40)

# Example: generate from interleaved text and images (text + image + text + image + text)
url_list = [
    "https://images.unsplash.com/photo-1694831404826-3400c48c188d?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
    "https://images.unsplash.com/photo-1693240876439-473af88b4ed7?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
]
images = [
    Image.open(requests.get(url, stream=True).raw).convert("RGB") for url in url_list
]
response = generate_content("".join([
    "<image>\n",
    "これは日本の画像です",  # "This is an image of Japan"
    "<image>\n",
    "これはオーストリアの画像です",  # "This is an image of Austria"
    "各画像の違いを説明して"]), images)  # "Explain how the images differ"
print(response)
print("---" * 40)
```
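
For interactive use you may prefer to stream tokens as they are generated. The sketch below reuses `model` and `processor` from the snippet above and relies on `transformers.TextStreamer`; we have not verified streaming with this checkpoint, so treat it as an untested convenience.

```python
import requests
from PIL import Image
from transformers import TextStreamer

# Reuses `model` and `processor` defined in the Usage snippet above.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Print tokens to stdout as they are produced instead of waiting for the full output.
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
conversation = [{"role": "user", "content": "<image>\n画像を説明してください。"}]  # "Please describe the image."
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
encoding = processor(text=text, images=[image], return_tensors="pt").to(model.device)
_ = model.generate(**encoding, streamer=streamer, max_new_tokens=256)
```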

## Training Summary

| Stage | Training | Data Sources | Samples |
|--------|-------------------------------|-------------------------------|-------------|
| Stage1 | Projector | [Japanese image text pairs](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-japanese-image-text-pairs), [LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) | 1.1M |
| Stage2 | Projector, LLM | Filtered [MOMIJI](https://huggingface.co/datasets/turing-motors/MOMIJI) (CC-MAIN-2024-46, CC-MAIN-2024-51, CC-MAIN-2025-05) | 13M |
| | | [Japanese image text pairs (subset)](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-japanese-image-text-pairs), [Japanese interleaved data (subset)](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-japanese-interleaved-data), [mmc4-core (subset)](https://github.com/allenai/mmc4), [coyo-700m (subset)](https://huggingface.co/datasets/kakaobrain/coyo-700m), [wikipedia_ja](https://huggingface.co/datasets/turing-motors/Wikipedia-Vision-JA), [llava_pretrain_ja](https://huggingface.co/datasets/turing-motors/LLaVA-Pretrain-JA), [stair_captions](http://captions.stair.center/) | 20M |
| Stage3 | Vision Encoder, Projector, LLM | [llava-instruct-v1_5-en-subset-358k](https://huggingface.co/datasets/llm-jp/llava-instruct-v1_5-en-subset-358k), [llava-instruct-ja](https://huggingface.co/datasets/llm-jp/llava-instruct-ja), [japanese-photos-conv](https://huggingface.co/datasets/llm-jp/japanese-photos-conversation), [ja-vg-vqa](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation), [synthdog-ja (subset)](https://huggingface.co/datasets/naver-clova-ix/synthdog-ja), [ai2d](https://huggingface.co/datasets/lmms-lab/ai2d), [synthdog-en](https://huggingface.co/datasets/naver-clova-ix/synthdog-en), [sherlock](https://github.com/allenai/sherlock) | 1.1M |

## Evaluation

We evaluated the model with [llm-jp-eval-mm](https://github.com/llm-jp/llm-jp-eval-mm). Scores for models other than Heron-NVILA-Lite and Sarashina2-Vision-14B are taken from the [llm-jp-eval-mm leaderboard](https://llm-jp.github.io/llm-jp-eval-mm/) (as of March 2025) and the [Asagi website](https://uehara-mech.github.io/asagi-vlm?v=1). The Heron-NVILA-Lite models and Sarashina2-Vision-14B were scored here with an LLM-as-a-judge using "gpt-4o-2024-05-13", whereas the [official blog](https://www.sbintuitions.co.jp/blog/entry/2025/03/17/111703) evaluated Sarashina2-Vision-14B with "gpt-4o-2024-08-06"; because the evaluation conditions differ, the Sarashina2-Vision-14B results should be treated as reference only.
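
For reference, LLM-as-a-judge scoring asks the judge model to grade a generated answer against a reference answer. The sketch below is only a simplified illustration of that idea using the `openai` client; the prompt, scale, and aggregation are hypothetical and are not the ones used by llm-jp-eval-mm.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(question: str, reference: str, answer: str) -> str:
    # Hypothetical grading prompt; llm-jp-eval-mm defines its own prompts and scoring.
    prompt = (
        "Rate the answer to the following Japanese VQA question on a 1-5 scale.\n"
        f"Question: {question}\nReference: {reference}\nAnswer: {answer}\n"
        "Reply with a single integer."
    )
    response = client.chat.completions.create(
        model="gpt-4o-2024-05-13",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```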

| Model | LLM Size | Heron-Bench overall LLM (%) | JA-VLM-Bench-In-the-Wild LLM (/5.0) | JA-VG-VQA-500 LLM (/5.0) |
|--------------------------------|----------|------------------------------|-------------------------------------|--------------------------|
| **Heron-NVILA-Lite-1B** | 0.5B | 45.9 | 2.92 | 3.16 |
| **[Heron-NVILA-Lite-2B](https://huggingface.co/turing-motors/Heron-NVILA-Lite-2B)** | 1.5B | 52.8 | 3.52 | 3.50 |
| **[Heron-NVILA-Lite-15B](https://huggingface.co/turing-motors/Heron-NVILA-Lite-15B)** | 14B | 59.6 | 4.2 | 3.82 |
| [LLaVA-CALM2-SigLIP](https://huggingface.co/cyberagent/llava-calm2-siglip) | 7B | 43.3 | 3.15 | 3.21 |
| [Llama-3-EvoVLM-JP-v2](https://huggingface.co/SakanaAI/Llama-3-EvoVLM-JP-v2) | 8B | 39.3 | 2.92 | 2.96 |
| [VILA-jp](https://huggingface.co/llm-jp/llm-jp-3-vila-14b) | 13B | 57.2 | 3.69 | 3.62 |
| [Asagi-14B](https://huggingface.co/MIL-UT/Asagi-14B) | 13B | 55.8 | 3.44 | 3.84 |
| [Sarashina2-Vision-14B](https://huggingface.co/sbintuitions/sarashina2-vision-14b) | 13B | 50.9 | 4.1 | 3.43 |
| [Qwen2-VL 7B Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) | 7B | 55.5 | 3.61 | 3.6 |
| GPT-4o | - | 87.6 | 3.85 | 3.58 |

## Risks and Limitations

This model is experimental and has not been thoroughly calibrated for ethical compliance or legal standards. Caution is advised for sensitive applications.

## License

- Model weights are licensed under [Apache License 2.0](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct/blob/main/LICENSE).
- Users must comply with the [OpenAI terms of use](https://openai.com/policies/terms-of-use) due to the inclusion of GPT-4-generated synthetic data.

## Acknowledgements

This model is based on results obtained from a project, JPNP20017, subsidized by the New Energy and Industrial Technology Development Organization (NEDO).

We acknowledge the use of the following open-source repositories:

- [VILA](https://github.com/NVlabs/VILA)
- [llm-jp-eval-mm](https://github.com/llm-jp/llm-jp-eval-mm)