# Are Decoder-Only Large Language Models the Silver Bullet for Code Search?
This model is an official artifact from our research paper: "Are Decoder-Only Large Language Models the Silver Bullet for Code Search?".
In this work, we conduct a large-scale systematic evaluation of decoder-only Large Language Models for the task of code search and present a set of effective fine-tuning and optimization strategies.
For complete details on all our experiments, to reproduce the full training/evaluation pipeline, or to use other models from the paper, please visit our official GitHub repository:
GitHub: [Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)
Model Card: DCS-CodeGemma-7b-it-SupCon-CSN
## Model Description
This is a PEFT adapter for the google/codegemma-7b-it model, fine-tuned for the task of Code Search as part of the research mentioned above.
The model was fine-tuned with the Supervised Contrastive Learning (SupCon) objective proposed in the llm2vec framework and is designed to produce high-quality vector embeddings for natural-language queries and code snippets.
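For intuition, supervised contrastive training pulls each query toward its matching code snippet in embedding space while pushing it away from the other snippets in the batch. The sketch below illustrates a generic in-batch contrastive (InfoNCE-style) loss; the function name, temperature value, and batch construction are illustrative assumptions, not the exact objective or code used to train this model.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, code_emb, temperature=0.05):
    """Illustrative InfoNCE-style loss: the i-th query's positive is the i-th code
    snippet; all other snippets in the batch serve as in-batch negatives.
    (Hypothetical sketch, not the training code from the paper or llm2vec.)"""
    q = F.normalize(query_emb, dim=-1)            # [B, D] unit-norm query embeddings
    c = F.normalize(code_emb, dim=-1)             # [B, D] unit-norm code embeddings
    logits = q @ c.T / temperature                # [B, B] scaled cosine similarities
    labels = torch.arange(q.size(0), device=q.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```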
## Model Performance & Reproducibility
The table below lists this model's key attributes, where its results appear in our paper, and how to reproduce the evaluation; a minimal sketch of the retrieval metric follows the table.
| Attribute | Details |
|---|---|
| Base Model | google/codegemma-7b-it |
| Fine-tuning Method | Supervised Contrastive Learning via llm2vec |
| Evaluation Scripts | `CSN_Test_Finetuning_Decoder_Model.py`, `CoSQA_Plus_Test_Finetuning_Decoder_Model.py` |
| Prerequisite Model | This model must be loaded on top of an MNTP pre-trained adapter (e.g. `SYSUSELab/DCS-CodeGemma-7B-It-MNTP`); see the usage example below. |
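For reference, code-search evaluation on benchmarks such as CodeSearchNet typically ranks candidate code snippets by embedding similarity and reports Mean Reciprocal Rank (MRR). The snippet below is a minimal, generic sketch of that metric under the assumption that the i-th query's ground-truth snippet is the i-th code embedding; it is not the contents of the evaluation scripts listed above, and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def mean_reciprocal_rank(query_emb, code_emb):
    """Generic MRR over a candidate pool (illustrative sketch, not the paper's script)."""
    q = F.normalize(query_emb, dim=-1)                      # [N, D]
    c = F.normalize(code_emb, dim=-1)                       # [N, D]
    sims = q @ c.T                                          # [N, N] similarity matrix
    ranks = (sims >= sims.diag().unsqueeze(1)).sum(dim=1)   # 1-based rank of the true snippet
    return (1.0 / ranks.float()).mean().item()
```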
## How to Use (with llm2vec)
For best results, we strongly recommend using the official llm2vec wrapper to load and use this model.
### 1. Install Dependencies
```bash
pip install llm2vec transformers torch peft accelerate
```
### 2. Example Usage
**Important:** The llm2vec supervised contrastive (SupCon) models are fine-tuned on top of MNTP (Masked Next Token Prediction) models. You therefore need to merge the MNTP weights into the base model first, and then load the SupCon adapter on top.
```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel
from llm2vec import LLM2Vec

# --- 1. Define Model IDs ---
base_model_id = "google/codegemma-7b-it"
mntp_model_id = "SYSUSELab/DCS-CodeGemma-7B-It-MNTP"
supcon_model_id = "SYSUSELab/DCS-CodeGemma-7B-It-SupCon-E5"

# --- 2. Load Base Model and Merge the MNTP Adapter ---
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda" if torch.cuda.is_available() else "cpu",
)
model = PeftModel.from_pretrained(model, mntp_model_id)
model = model.merge_and_unload()  # bake the MNTP weights into the base model

# --- 3. Load the Supervised (this model) Adapter on Top of the MNTP-Merged Model ---
model = PeftModel.from_pretrained(model, supcon_model_id)

# --- 4. Use the LLM2Vec Wrapper for Encoding ---
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)

queries = ["how to read a file in Python?"]
code_snippets = ["with open('file.txt', 'r') as f:\n    content = f.read()"]

query_embeddings = l2v.encode(queries)
code_embeddings = l2v.encode(code_snippets)

print("Query Embedding Shape:", query_embeddings.shape)

# This usage example is adapted from the official llm2vec repository. Credits to the original authors.
```
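With the embeddings in hand, code search reduces to nearest-neighbor retrieval over the code corpus. The following continuation is a small illustrative sketch (an assumption about downstream usage, not part of the original llm2vec example) that ranks the example code snippets against the first query by cosine similarity.

```python
import torch.nn.functional as F

# Rank code snippets by cosine similarity to the first query
# (illustrative follow-up, not part of the original llm2vec example).
q = F.normalize(query_embeddings, dim=-1)
c = F.normalize(code_embeddings, dim=-1)
scores = (q @ c.T).float()                      # [num_queries, num_snippets]
ranking = scores[0].argsort(descending=True)
best = ranking[0].item()
print("Top match:", code_snippets[best], "| score:", scores[0, best].item())
```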
## Citation
If you use our model or work in your research, please cite our paper. As our method is built upon llm2vec, please also cite their foundational work.
**Our Paper:**
- Paper: [Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)
- GitHub: https://github.com/Georgepitt/DecoderLLMs-CodeSearch
- BibTeX:

```bibtex
@article{chen2024decoder,
  title={Are Decoder-Only Large Language Models the Silver Bullet for Code Search?},
  author={Chen, Yuxuan and Liu, Mingwei and Ou, Guangsheng and Li, Anji and Dai, Dekun and Wang, Yanlin and Zheng, Zibin},
  journal={arXiv preprint arXiv:2410.22240},
  year={2024}
}
```
**llm2vec (Foundational Work):**
- Paper: [LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders](https://arxiv.org/abs/2404.05961)
- GitHub: https://github.com/McGill-NLP/llm2vec
- BibTeX:

```bibtex
@article{behnamghader2024llm2vec,
  title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
  author={BehnamGhader, Parishad and Adlakha, Vaibhav and Mosbach, Marius and Bahdanau, Dzmitry and Chapados, Nicolas and Reddy, Siva},
  journal={arXiv preprint arXiv:2404.05961},
  year={2024}
}
```