Update README.md
README.md
CHANGED
@@ -28,6 +28,8 @@ Usually, in LLaVA models, we generate N embeddings for the image, which we then
for one image, we create K<<N tokens for M<N parts of the image (crops)? It would allow us to get visual information from small parts of the image and not inflate the
number of image "tokens" too much. I called this method multi-crop LLaVA (MC-LLaVA).

+You can read more about the model in the [blog post](https://huggingface.co/blog/visheratin/vlm-resolution-curse).
+
MC-LLaVA-3b was fine-tuned from [Phi-2 merge](vince62s/phi-2-psy) using vision tower from
[SigLIP 400M](https://huggingface.co/google/siglip-so400m-patch14-384).
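
The multi-crop idea described in the hunk above (K<<N tokens for each of M crops of one image) can be sketched in a few lines. This is an illustrative toy, not the repository's actual code: the 2x2 grid split, the average-pooling token resampler, and the patch counts are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def crop_image(image: torch.Tensor, grid: int = 2) -> list:
    """Split a (C, H, W) image tensor into grid x grid equal crops."""
    c, h, w = image.shape
    ch, cw = h // grid, w // grid
    return [
        image[:, i * ch:(i + 1) * ch, j * cw:(j + 1) * cw]
        for i in range(grid)
        for j in range(grid)
    ]

def compress_to_k_tokens(patch_embeddings: torch.Tensor, k: int) -> torch.Tensor:
    """Reduce N patch embeddings of shape (N, D) to K << N tokens via adaptive average pooling."""
    # (N, D) -> (1, D, N) so we pool along the patch dimension, then reshape back to (K, D).
    pooled = F.adaptive_avg_pool1d(patch_embeddings.t().unsqueeze(0), k)
    return pooled.squeeze(0).t()

# Toy numbers: a 2x2 grid of crops, each crop encoded into N=729 patch embeddings of
# width 1152 (roughly the SigLIP-so400m geometry), compressed to K=32 tokens per crop.
image = torch.rand(3, 384, 384)
crops = crop_image(image, grid=2)
fake_patch_embeddings = [torch.rand(729, 1152) for _ in crops]  # stand-in for the vision tower output
visual_tokens = torch.cat([compress_to_k_tokens(e, k=32) for e in fake_patch_embeddings], dim=0)
print(visual_tokens.shape)  # torch.Size([128, 1152]) -> 4 crops x 32 tokens each
```

With M crops of K tokens each, the language model sees M*K visual "tokens" instead of N per full image, which is what keeps the sequence length manageable.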
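
For the two base components named at the end of the hunk, a minimal sketch of loading them with `transformers` could look like the following. The exact classes MC-LLaVA-3b wires together and whether `trust_remote_code` is needed for the Phi-2 merge are assumptions; see the model card for the real loading and inference code.

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    SiglipImageProcessor,
    SiglipVisionModel,
)

# Language backbone: the Phi-2 merge referenced above.
# trust_remote_code may or may not be required depending on your transformers version.
lm = AutoModelForCausalLM.from_pretrained("vince62s/phi-2-psy", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("vince62s/phi-2-psy")

# Vision tower: SigLIP so400m at 384px resolution.
vision_tower = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")
image_processor = SiglipImageProcessor.from_pretrained("google/siglip-so400m-patch14-384")
```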