
πŸ“– Are Decoder-Only Large Language Models the Silver Bullet for Code Search?

This model is an official artifact from our research paper: "Are Decoder-Only Large Language Models the Silver Bullet for Code Search?".

In this work, we conduct a large-scale systematic evaluation of decoder-only Large Language Models for the task of code search and present a set of effective fine-tuning and optimization strategies.

For complete details on all our experiments, to reproduce the full training/evaluation pipeline, or to use other models from the paper, please visit our official GitHub repository:

➑️ GitHub: Georgepitt/DecoderLLMs-CodeSearch


Model Card: DCS-CodeGemma-7b-it-SupCon-CSN

πŸ“œ Model Description

This is a PEFT adapter for the google/codegemma-7b-it model, fine-tuned for the task of Code Search as part of the research mentioned above.

The model was trained with the Supervised Contrastive Learning method proposed in the llm2vec framework and is designed to produce high-quality vector embeddings for natural-language queries and code snippets.
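For intuition, supervised contrastive training pulls each query toward its matching code snippet and pushes it away from the other snippets in the batch. The snippet below is a minimal sketch of such an InfoNCE-style objective, for illustration only; the exact loss, batching, and temperature (the value here is a placeholder) follow llm2vec and our paper.

import torch
import torch.nn.functional as F

def supervised_contrastive_loss(query_emb, code_emb, temperature=0.05):
    # query_emb, code_emb: (batch, dim); row i of each side forms a positive pair,
    # and every other row in the batch serves as an in-batch negative.
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(code_emb, dim=-1)
    logits = q @ c.T / temperature                      # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, labels)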

πŸ”¬ Model Performance & Reproducibility

The table below summarizes this model and how to reproduce its evaluation; the corresponding results are reported in our paper.

Base Model: google/codegemma-7b-it
Fine-tuning Method: Supervised Contrastive Learning via llm2vec
Evaluation Scripts: CSN_Test_Finetuning_Decoder_Model.py, CoSQA_Plus_Test_Finetuning_Decoder_Model.py
Prerequisite Model: This model must be loaded on top of an MNTP pre-trained model.
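The evaluation scripts above measure retrieval quality on the CodeSearchNet (CSN) and CoSQA+ test sets. As a generic illustration of this kind of retrieval metric, not a reproduction of the official scripts, the sketch below computes Mean Reciprocal Rank (MRR) from a query-by-candidate similarity matrix, assuming candidate i is the ground-truth match for query i.

import torch

def mean_reciprocal_rank(scores):
    # scores: (num_queries, num_candidates) similarity matrix where candidate i
    # is the ground-truth result for query i.
    ranking = scores.argsort(dim=-1, descending=True)                      # candidates, best first
    gold = torch.arange(scores.size(0), device=scores.device).unsqueeze(-1)
    ranks = (ranking == gold).nonzero(as_tuple=True)[1] + 1                # 1-based rank of the gold candidate
    return (1.0 / ranks.float()).mean().item()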

πŸš€ How to Use (with llm2vec)

For best results, we strongly recommend using the official llm2vec wrapper to load and use this model.

1. Install Dependencies

pip install llm2vec transformers torch peft accelerate

2. Example Usage

Important: The llm2vec supervised contrastive (SupCon) adapters are fine-tuned on top of MNTP (Masked Next Token Prediction) models. You must therefore load the MNTP adapter and merge it into the base model before applying this SupCon adapter.

import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel
from llm2vec import LLM2Vec

# --- 1. Define Model IDs ---
base_model_id = "google/codegemma-7b-it" 
mntp_model_id = "SYSUSELab/DCS-CodeGemma-7B-It-MNTP" 
supcon_model_id = "SYSUSELab/DCS-CodeGemma-7B-It-SupCon-E5" 

# --- 2. Load Base Model and MNTP Adapter ---
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda" if torch.cuda.is_available() else "cpu",
)
model = PeftModel.from_pretrained(model, mntp_model_id)
model = model.merge_and_unload()

# --- 3. Load the Supervised (this model) Adapter on top of the MNTP-merged model ---
model = PeftModel.from_pretrained(model, supcon_model_id)

# --- 4. Use the LLM2Vec Wrapper for Encoding ---
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)

queries = ["how to read a file in Python?"]
code_snippets = ["with open('file.txt', 'r') as f:\n    content = f.read()"]
query_embeddings = l2v.encode(queries)
code_embeddings = l2v.encode(code_snippets)

print("Query Embedding Shape:", query_embeddings.shape)
# This usage example is adapted from the official llm2vec repository. Credits to the original authors.
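To turn these embeddings into an actual search, rank the candidate snippets by cosine similarity with the query. The continuation below is a minimal illustration on the toy inputs above (assuming, as in the llm2vec examples, that encode returns torch tensors); it is not the paper's evaluation pipeline.

# --- 5. Rank Code Snippets by Cosine Similarity ---
import torch.nn.functional as F

q = F.normalize(query_embeddings.float(), dim=-1)
c = F.normalize(code_embeddings.float(), dim=-1)
scores = q @ c.T  # (num_queries, num_snippets) cosine similarity matrix

for i, query in enumerate(queries):
    j = int(scores[i].argmax())
    print(f"Query: {query}")
    print(f"Best match (cosine similarity {float(scores[i, j]):.4f}):\n{code_snippets[j]}")

With only one snippet in this toy corpus the ranking is trivial, but the same scores matrix is what a retrieval metric such as the MRR sketch above would consume.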

πŸ“„ Citation

If you use our model or work in your research, please cite our paper. As our method is built upon llm2vec, please also cite their foundational work.

Our Paper:

llm2vec (Foundational Work):
