---
license: apache-2.0
language:
- code
library_name: peft
tags:
- code-search
- text-embeddings
- decoder-only
- supervised-contrastive-learning
- codegemma
- llm2vec
---

## 📖 Are Decoder-Only Large Language Models the Silver Bullet for Code Search?

This model is an official artifact from our research paper **"[Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)"**. In this work, we conduct a large-scale, systematic evaluation of decoder-only large language models for code search and present a set of effective fine-tuning and optimization strategies.

For complete details on all our experiments, to reproduce the full training/evaluation pipeline, or to use other models from the paper, please visit our official GitHub repository:

➡️ **[GitHub: Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)**

---

# Model Card: DCS-CodeGemma-7b-it-SupCon-CSN

## 📜 Model Description

This is a PEFT adapter for the **`google/codegemma-7b-it`** model, fine-tuned for **code search** as part of the research described above. The model was trained with the **supervised contrastive learning** method proposed in the [llm2vec](https://github.com/McGill-NLP/llm2vec) framework and is designed to generate high-quality vector embeddings for natural-language queries and code snippets.

## 🔬 Model Performance & Reproducibility

The table below summarizes this model's configuration and how to reproduce the evaluation reported in our paper.

| Attribute              | Details |
| :--------------------- | :------ |
| **Base Model**         | `google/codegemma-7b-it` |
| **Fine-tuning Method** | Supervised Contrastive Learning via `llm2vec` |
| **Evaluation Scripts** | [CSN_Test_Finetuning_Decoder_Model.py](https://github.com/Georgepitt/DecoderLLMs-CodeSearch/blob/main/Fine-tuning/CSN_Test_Finetuning_Decoder_Model.py), [CoSQA_Plus_Test_Finetuning_Decoder_Model.py](https://github.com/ChenyxEugene/DecoderLLMs-CodeSearch/blob/main/Fine-tuning/CoSQA_Plus_Test_Finetuning_Decoder_Model.py) |
| **Prerequisite Model** | This adapter must be loaded on top of an MNTP pre-trained model (e.g., `SYSUSELab/DCS-CodeGemma-7B-It-MNTP`, as in the usage example below). |
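Code-search evaluation of this kind typically ranks each query's candidate code snippets by embedding similarity and reports retrieval metrics such as Mean Reciprocal Rank (MRR). The snippet below is a minimal, generic sketch of that computation; it is not taken from the repository's evaluation scripts, and the function and variable names are illustrative only. Please refer to the scripts linked above for the exact evaluation protocol.

```python
import torch

def mean_reciprocal_rank(similarity: torch.Tensor) -> float:
    """Generic MRR for a (num_queries, num_candidates) similarity matrix in
    which candidate i is the gold match for query i (illustrative only)."""
    # Sort candidates for each query from most to least similar.
    ranking = similarity.argsort(dim=1, descending=True)
    # 1-based rank of the gold candidate (the diagonal entry) for each query.
    gold = torch.arange(similarity.size(0)).unsqueeze(1)
    ranks = (ranking == gold).nonzero(as_tuple=True)[1] + 1
    return (1.0 / ranks.float()).mean().item()

# Toy example with random vectors standing in for query/code embeddings.
queries = torch.nn.functional.normalize(torch.randn(4, 8), dim=1)
codes = torch.nn.functional.normalize(torch.randn(4, 8), dim=1)
print("MRR:", mean_reciprocal_rank(queries @ codes.T))
```

This sketch only clarifies what the retrieval metric measures on a toy batch; use the linked scripts to reproduce the numbers reported in the paper.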
---

## 🚀 How to Use (with `llm2vec`)

For best results, we strongly recommend using the official `llm2vec` wrapper to load and use this model.

**1. Install Dependencies**

```bash
pip install llm2vec transformers torch peft accelerate
```

**2. Example Usage**

> **Important**: The `llm2vec` supervised contrastive (SupCon) models are fine-tuned on top of **MNTP (Masked Next Token Prediction)** models. Therefore, you must first merge the MNTP weights into the base model before loading the SupCon adapter.

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel
from llm2vec import LLM2Vec

# --- 1. Define model IDs ---
base_model_id = "google/codegemma-7b-it"
mntp_model_id = "SYSUSELab/DCS-CodeGemma-7B-It-MNTP"
supcon_model_id = "SYSUSELab/DCS-CodeGemma-7B-It-SupCon-E5"

# --- 2. Load the base model and merge the MNTP adapter ---
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda" if torch.cuda.is_available() else "cpu",
)
model = PeftModel.from_pretrained(model, mntp_model_id)
model = model.merge_and_unload()

# --- 3. Load the SupCon adapter (this model) on top of the MNTP-merged model ---
model = PeftModel.from_pretrained(model, supcon_model_id)

# --- 4. Use the LLM2Vec wrapper for encoding ---
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)

queries = ["how to read a file in Python?"]
code_snippets = ["with open('file.txt', 'r') as f:\n    content = f.read()"]

query_embeddings = l2v.encode(queries)
code_embeddings = l2v.encode(code_snippets)

print("Query Embedding Shape:", query_embeddings.shape)

# This usage example is adapted from the official llm2vec repository. Credits to the original authors.
```

---

## 📄 Citation

If you use our model or work in your research, please cite our paper. As our method is built upon `llm2vec`, please also cite their foundational work.

**Our Paper:**

* **Paper Link:** [Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)
* **GitHub:** [https://github.com/Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)
* **BibTeX:**

```bibtex
@article{chen2024decoder,
  title={Are Decoder-Only Large Language Models the Silver Bullet for Code Search?},
  author={Chen, Yuxuan and Liu, Mingwei and Ou, Guangsheng and Li, Anji and Dai, Dekun and Wang, Yanlin and Zheng, Zibin},
  journal={arXiv preprint arXiv:2410.22240},
  year={2024}
}
```

**llm2vec (Foundational Work):**

* **Paper Link:** [LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders](https://arxiv.org/abs/2404.05961)
* **GitHub:** [https://github.com/McGill-NLP/llm2vec](https://github.com/McGill-NLP/llm2vec)
* **BibTeX:**

```bibtex
@article{behnamghader2024llm2vec,
  title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
  author={BehnamGhader, Parishad and Adlakha, Vaibhav and Mosbach, Marius and Bahdanau, Dzmitry and Chapados, Nicolas and Reddy, Siva},
  journal={arXiv preprint arXiv:2404.05961},
  year={2024}
}
```