Long-CLIP: Unlocking the Long-Text Capability of CLIP
Paper: arXiv:2403.15378
Long-CLIP extends the CLIP vision–language framework to support significantly longer text inputs, enabling richer contextual understanding while preserving strong image–text alignment.
Original paper: Long-CLIP: Unlocking the Long-Text Capability of CLIP, Zhang et al., 2024
This model uses the Long-CLIP B/16 variant, which is based on a ViT-Base backbone with 16×16 image patches and enhanced long-text encoding capacity. It is well suited for vision–language applications such as image retrieval, zero-shot classification, and multimodal reasoning where long textual prompts or descriptions are important. A usage sketch follows below.
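The sketch below shows how Long-CLIP is typically used for long-prompt image–text matching. It assumes the reference implementation from the Long-CLIP repository (which mirrors OpenAI CLIP's `load` / `tokenize` API); the checkpoint path, image file, and captions are illustrative, not part of this model package.

```python
import torch
from PIL import Image
# Assumes the reference Long-CLIP implementation, which exposes
# longclip.load / longclip.tokenize analogous to OpenAI CLIP's API.
from model import longclip

device = "cuda" if torch.cuda.is_available() else "cpu"

# Illustrative checkpoint path for the LongCLIP-B/16 weights.
model, preprocess = longclip.load("checkpoints/longclip-B.pt", device=device)

# Long, descriptive captions -- Long-CLIP accepts prompts well beyond
# the 77-token limit of the original CLIP text encoder.
captions = [
    "A golden retriever lying on a wooden porch in the late afternoon sun, "
    "next to a red ball and an empty water bowl, with autumn leaves scattered around.",
    "A city street at night in heavy rain, with blurred neon signs "
    "reflected on the wet asphalt and a lone cyclist passing by.",
]
text = longclip.tokenize(captions).to(device)

# Illustrative input image.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    # Cosine-similarity based image-text matching scores.
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Caption probabilities:", probs)
```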
Model Configuration:
| Model | Device | Model Link |
|---|---|---|
| LongCLIP-B16 Image Encoder | N1-655 | Model_Link |
| LongCLIP-B16 Text Encoder | N1-655 | Model_Link |
| LongCLIP-B16 Image Encoder | CV72 | Model_Link |
| LongCLIP-B16 Text Encoder | CV72 | Model_Link |
| LongCLIP-B16 Image Encoder | CV75 | Model_Link |
| LongCLIP-B16 Text Encoder | CV75 | Model_Link |