Add pipeline tag and library name

#2 · opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +2 -3
README.md CHANGED
@@ -1,8 +1,9 @@
 ---
 license: apache-2.0
+pipeline_tag: image-text-to-text
+library_name: transformers
 ---
 
-
 # Leopard-LLaVA
 
 [Paper](https://arxiv.org/abs/2410.01744) | [Github](https://github.com/tencent-ailab/Leopard) | [Models-LLaVA](https://huggingface.co/wyu1/Leopard-LLaVA) | [Models-Idefics2](https://huggingface.co/wyu1/Leopard-Idefics2)
@@ -17,8 +18,6 @@ and resolutions of the input images. Experiments across a wide range of benchmar
 
 For Leopard-LLaVA, we use SigLIP-SO-400M with 364 × 364 image resolutions as the visual encoder, as it supports a larger resolution than the commonly used 224 × 224 resolution CLIP visual encoder. Each image is encoded into a sequence of 26 × 26 = 676 visual features with a patch size of 14. With the visual feature pixel shuffling strategy, each image is further processed into a sequence of 169 visual features. We limit the maximum number of images (M) in each sample to 50, which produces up to 8,450 visual features in total. Following LLaVA, we adopt a two-layer MLP as the visual-language connector. We use LLaMA-3.1 as the language model.
 
-
-
 ## Citation
 ```
 @article{jia2024leopard,
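
The two added front-matter fields let the Hub file this checkpoint under the `image-text-to-text` task and associate it with the `transformers` library. A minimal sketch of what that association implies for loading, assuming the checkpoint works with the stock `image-text-to-text` pipeline; it may instead require the custom code from the Leopard GitHub repository, and the image URL and prompt below are placeholders:

```python
from transformers import pipeline

# Assumption: the checkpoint is loadable through the standard
# image-text-to-text pipeline named by the new `pipeline_tag`.
# If it ships custom modeling code, the loading path described in the
# Leopard GitHub repository applies instead of this sketch.
pipe = pipeline(task="image-text-to-text", model="wyu1/Leopard-LLaVA")

out = pipe(
    images="https://example.com/table_page.png",  # placeholder image URL
    text="Summarize the key figures in this table.",
)
print(out)
```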
 
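The README paragraph in the second hunk pins down the token arithmetic: 26 × 26 = 676 SigLIP features per image, reduced to 169 by pixel shuffling, i.e. merging each 2 × 2 neighbourhood of the feature grid into one token with 4× the channel width (13 × 13 = 169). A minimal sketch of that reshaping, where the function name, tensor layout, and the 1152-dimensional SigLIP-SO-400M feature width are illustrative assumptions rather than Leopard's actual implementation:

```python
import torch

def pixel_shuffle_tokens(feats: torch.Tensor, grid: int = 26, factor: int = 2) -> torch.Tensor:
    """Merge factor x factor neighbourhoods of a grid of visual features
    into single tokens by concatenating their channels."""
    b, n, d = feats.shape
    assert n == grid * grid, "expected a full grid of visual features"
    x = feats.view(b, grid, grid, d)
    x = x.view(b, grid // factor, factor, grid // factor, factor, d)
    x = x.permute(0, 1, 3, 2, 4, 5)  # bring each 2x2 block of features together
    return x.reshape(b, (grid // factor) ** 2, d * factor * factor)

# 676 features of (assumed) width 1152 per image -> 169 features of width 4608
feats = torch.randn(1, 676, 1152)
print(pixel_shuffle_tokens(feats).shape)  # torch.Size([1, 169, 4608])
```

With the 50-image cap mentioned in the README, this yields at most 50 × 169 = 8,450 visual tokens per sample before the two-layer MLP connector.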