Add pipeline tag and library name #2
opened by nielsr (HF Staff)

README.md CHANGED
@@ -1,8 +1,9 @@
 ---
 license: apache-2.0
+pipeline_tag: image-text-to-text
+library_name: transformers
 ---
 
-
 # Leopard-LLaVA
 
 [Paper](https://arxiv.org/abs/2410.01744) | [Github](https://github.com/tencent-ailab/Leopard) | [Models-LLaVA](https://huggingface.co/wyu1/Leopard-LLaVA) | [Models-Idefics2](https://huggingface.co/wyu1/Leopard-Idefics2)
@@ -17,8 +18,6 @@ and resolutions of the input images. Experiments across a wide range of benchmar
 
 For Leopard-LLaVA, we use SigLIP-SO-400M with 364 × 364 image resolutions as the visual encoder, as it supports a larger resolution than the commonly used 224 × 224 resolution CLIP visual encoder. Each image is encoded into a sequence of 26 × 26 = 676 visual features with a patch size of 14. With the visual feature pixel shuffling strategy, each image is further processed into a sequence of 169 visual features. We limit the maximum number of images (M) in each sample to 50, which produces up to 8,450 visual features in total. Following LLaVA, we adopt a two-layer MLP as the visual-language connector. We use LLaMA-3.1 as the language model.
 
-
-
 ## Citation
 ```
 @article{jia2024leopard,
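
The per-image token budget described in the README follows directly from the stated numbers: a 364 × 364 input with patch size 14 gives a 26 × 26 = 676 feature grid, a 2 × 2 pixel shuffle compresses it to 13 × 13 = 169 tokens, and the 50-image cap yields at most 8,450 visual tokens per sample. Below is a minimal sketch of that arithmetic only; it is not the repository's implementation, and the function name, the 2 × 2 shuffle factor and ordering, and the 1152-dimensional SigLIP-SO-400M features are assumptions made for illustration.

```python
# Illustrative sketch only -- not the Leopard repository's implementation.
# Assumptions: 2x2 pixel shuffle, row-major patch order, SigLIP-SO-400M
# feature dim of 1152; only the token counts are taken from the README.
import torch


def pixel_shuffle_tokens(feats: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Merge every `scale` x `scale` block of patch features into one token.

    feats: (batch, H*W, C) patch features, with H = W = 364 // 14 = 26 here.
    returns: (batch, (H // scale) * (W // scale), C * scale * scale)
    """
    b, n, c = feats.shape
    h = w = int(n ** 0.5)                               # 26 x 26 patch grid
    x = feats.view(b, h, w, c)
    x = x.view(b, h // scale, scale, w // scale, scale, c)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()        # group 2x2 neighbours
    return x.view(b, (h // scale) * (w // scale), c * scale * scale)


feats = torch.randn(1, 26 * 26, 1152)   # 676 visual features per image
tokens = pixel_shuffle_tokens(feats)
print(tokens.shape[1])                  # 169 tokens per image after shuffling
print(50 * tokens.shape[1])             # 8450 tokens at the M = 50 image cap
```

Under these assumptions the shuffle trades sequence length for channel width (each token becomes 4 × 1152 = 4,608-dimensional), and the two-layer MLP connector would then project these tokens into the LLaMA-3.1 embedding space.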