# Are Decoder-Only Large Language Models the Silver Bullet for Code Search?
This model is an official artifact from our research paper: "Are Decoder-Only Large Language Models the Silver Bullet for Code Search?".
In this work, we conduct a large-scale systematic evaluation of decoder-only Large Language Models for the task of code search and present a set of effective fine-tuning and optimization strategies.
For complete details on all our experiments, to reproduce the full training/evaluation pipeline, or to use other models from the paper, please visit our official GitHub repository:
GitHub: [Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)
Model Card: DCS-CodeGemma-7b-it-SupCon-CSN
## Model Description
This is a PEFT adapter for the google/codegemma-7b-it model, fine-tuned for the task of Code Search as part of the research mentioned above.
The model was fine-tuned with the Supervised Contrastive Learning (SupCon) objective proposed in the llm2vec framework and is designed to produce high-quality vector embeddings for natural-language queries and code snippets.
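For intuition, supervised contrastive training pulls each query toward its matching code snippet in embedding space while pushing it away from the other snippets in the batch. The sketch below illustrates a generic in-batch contrastive (InfoNCE-style) loss; the function name, temperature value, and batch construction are illustrative assumptions, not the exact objective or code used to train this model.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, code_emb, temperature=0.05):
    """Illustrative InfoNCE-style loss: the i-th query's positive is the i-th code
    snippet; all other snippets in the batch serve as in-batch negatives.
    (Hypothetical sketch, not the training code from the paper or llm2vec.)"""
    q = F.normalize(query_emb, dim=-1)            # [B, D] unit-norm query embeddings
    c = F.normalize(code_emb, dim=-1)             # [B, D] unit-norm code embeddings
    logits = q @ c.T / temperature                # [B, B] scaled cosine similarities
    labels = torch.arange(q.size(0), device=q.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```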
## Model Performance & Reproducibility
The table below lists this model's key attributes, where its results appear in our paper, and how to reproduce the evaluation; a minimal sketch of the retrieval metric follows the table.
| Attribute | Details |
|---|---|
| Base Model | google/codegemma-7b-it |
| Fine-tuning Method | Supervised Contrastive Learning via llm2vec |
| Evaluation Scripts | `CSN_Test_Finetuning_Decoder_Model.py`, `CoSQA_Plus_Test_Finetuning_Decoder_Model.py` |
| Prerequisite Model | This model must be loaded on top of an MNTP pre-trained adapter (e.g. `SYSUSELab/DCS-CodeGemma-7B-It-MNTP`); see the usage example below. |
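For reference, code-search evaluation on benchmarks such as CodeSearchNet typically ranks candidate code snippets by embedding similarity and reports Mean Reciprocal Rank (MRR). The snippet below is a minimal, generic sketch of that metric under the assumption that the i-th query's ground-truth snippet is the i-th code embedding; it is not the contents of the evaluation scripts listed above, and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def mean_reciprocal_rank(query_emb, code_emb):
    """Generic MRR over a candidate pool (illustrative sketch, not the paper's script)."""
    q = F.normalize(query_emb, dim=-1)                      # [N, D]
    c = F.normalize(code_emb, dim=-1)                       # [N, D]
    sims = q @ c.T                                          # [N, N] similarity matrix
    ranks = (sims >= sims.diag().unsqueeze(1)).sum(dim=1)   # 1-based rank of the true snippet
    return (1.0 / ranks.float()).mean().item()
```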
## How to Use (with llm2vec)
For best results, we strongly recommend using the official llm2vec wrapper to load and use this model.
### 1. Install Dependencies
```bash
pip install llm2vec transformers torch peft accelerate
```
### 2. Example Usage
**Important:** The llm2vec supervised contrastive (SupCon) models are fine-tuned on top of MNTP (Masked Next Token Prediction) models. You therefore need to merge the MNTP weights into the base model first, and then load the SupCon adapter on top.
```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel
from llm2vec import LLM2Vec

# --- 1. Define Model IDs ---
base_model_id = "google/codegemma-7b-it"
mntp_model_id = "SYSUSELab/DCS-CodeGemma-7B-It-MNTP"
supcon_model_id = "SYSUSELab/DCS-CodeGemma-7B-It-SupCon-E5"

# --- 2. Load Base Model and Merge the MNTP Adapter ---
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda" if torch.cuda.is_available() else "cpu",
)
model = PeftModel.from_pretrained(model, mntp_model_id)
model = model.merge_and_unload()  # bake the MNTP weights into the base model

# --- 3. Load the Supervised (this model) Adapter on Top of the MNTP-Merged Model ---
model = PeftModel.from_pretrained(model, supcon_model_id)

# --- 4. Use the LLM2Vec Wrapper for Encoding ---
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)

queries = ["how to read a file in Python?"]
code_snippets = ["with open('file.txt', 'r') as f:\n    content = f.read()"]

query_embeddings = l2v.encode(queries)
code_embeddings = l2v.encode(code_snippets)

print("Query Embedding Shape:", query_embeddings.shape)

# This usage example is adapted from the official llm2vec repository. Credits to the original authors.
```
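With the embeddings in hand, code search reduces to nearest-neighbor retrieval over the code corpus. The following continuation is a small illustrative sketch (an assumption about downstream usage, not part of the original llm2vec example) that ranks the example code snippets against the first query by cosine similarity.

```python
import torch.nn.functional as F

# Rank code snippets by cosine similarity to the first query
# (illustrative follow-up, not part of the original llm2vec example).
q = F.normalize(query_embeddings, dim=-1)
c = F.normalize(code_embeddings, dim=-1)
scores = (q @ c.T).float()                      # [num_queries, num_snippets]
ranking = scores[0].argsort(descending=True)
best = ranking[0].item()
print("Top match:", code_snippets[best], "| score:", scores[0, best].item())
```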
## Citation
If you use our model or work in your research, please cite our paper. As our method is built upon llm2vec, please also cite their foundational work.
**Our Paper:**
- Paper: [Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)
- GitHub: https://github.com/Georgepitt/DecoderLLMs-CodeSearch
- BibTeX:

```bibtex
@article{chen2024decoder,
  title={Are Decoder-Only Large Language Models the Silver Bullet for Code Search?},
  author={Chen, Yuxuan and Liu, Mingwei and Ou, Guangsheng and Li, Anji and Dai, Dekun and Wang, Yanlin and Zheng, Zibin},
  journal={arXiv preprint arXiv:2410.22240},
  year={2024}
}
```
**llm2vec (Foundational Work):**
- Paper: [LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders](https://arxiv.org/abs/2404.05961)
- GitHub: https://github.com/McGill-NLP/llm2vec
- BibTeX:

```bibtex
@article{behnamghader2024llm2vec,
  title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
  author={BehnamGhader, Parishad and Adlakha, Vaibhav and Mosbach, Marius and Bahdanau, Dzmitry and Chapados, Nicolas and Reddy, Siva},
  journal={arXiv preprint arXiv:2404.05961},
  year={2024}
}
```