pitt111 committed · Commit e35af71 · verified · 1 Parent(s): a15af67

Upload 3 files
Files changed (3):
  1. README.md +131 -3
  2. adapter_config.json +37 -0
  3. adapter_model.safetensors +3 -0
README.md CHANGED
---
license: apache-2.0
language:
- code
library_name: peft
tags:
- code-search
- text-embeddings
- decoder-only
- supervised-contrastive-learning
- codegemma
- llm2vec
---

## 📖 Are Decoder-Only Large Language Models the Silver Bullet for Code Search?

This model is an official artifact from our research paper: **"[Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)"**.

In this work, we conduct a large-scale systematic evaluation of decoder-only Large Language Models for the task of code search and present a set of effective fine-tuning and optimization strategies.

For complete details on all our experiments, to reproduce the full training/evaluation pipeline, or to use other models from the paper, please visit our official GitHub repository:

➡️ **[GitHub: Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)**

---

# Model Card: DCS-CodeGemma-7b-it-SupCon-CSN

## 📜 Model Description

This is a PEFT adapter for the **`google/codegemma-7b-it`** model, fine-tuned for **code search** as part of the research described above.

The adapter was trained with the **supervised contrastive learning (SupCon)** method from the [llm2vec](https://github.com/McGill-NLP/llm2vec) framework and is designed to generate high-quality vector embeddings for natural-language queries and code snippets.
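
For intuition, supervised contrastive training pairs each query with its matching code snippet and treats the other snippets in the batch as negatives. The sketch below is only an illustration of that in-batch, InfoNCE-style objective; it is not the actual training code from the paper or from llm2vec, and the temperature value and helper name are ours.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, code_emb, temperature=0.05):
    """Illustrative SupCon/InfoNCE-style loss: the i-th query's positive is the
    i-th code snippet; every other snippet in the batch acts as a negative."""
    q = F.normalize(query_emb, dim=-1)                  # (batch, dim)
    c = F.normalize(code_emb, dim=-1)                   # (batch, dim)
    logits = q @ c.T / temperature                      # scaled cosine similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Toy check with random embeddings
print(in_batch_contrastive_loss(torch.randn(4, 16), torch.randn(4, 16)).item())
```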

## 🔬 Model Performance & Reproducibility

The table below provides details about this model, its corresponding results in our paper, and how to reproduce the evaluation.

| Attribute | Details |
| :--- | :--- |
| **Base Model** | `google/codegemma-7b-it` |
| **Fine-tuning Method** | Supervised contrastive learning via `llm2vec` |
| **Corresponds to Paper** | Section IV, Table VI |
| **Evaluation Scripts** | [CSN_Test_Finetuning_Decoder_Model.py](https://github.com/Georgepitt/DecoderLLMs-CodeSearch/blob/main/Fine-tuning/CSN_Test_Finetuning_Decoder_Model.py),<br>[CoSQA_Plus_Test_Finetuning_Decoder_Model copy.py](https://github.com/Georgepitt/DecoderLLMs-CodeSearch/blob/main/Fine-tuning/CoSQA_Plus_Test_Finetuning_Decoder_Model%20copy.py) |
| **Prerequisite Model** | This adapter must be loaded on top of an MNTP pre-trained model, e.g. `SYSUSELab/DCS-CodeGemma-7b-It-MNTP` |

---

## 🚀 How to Use (with `llm2vec`)

For best results, we strongly recommend using the official `llm2vec` wrapper to load and use this model.

**1. Install Dependencies**

```bash
pip install llm2vec transformers torch peft accelerate
```

**2. Example Usage**

> **Important**: The `llm2vec` supervised contrastive (SupCon) adapters are fine-tuned on top of **MNTP (Masked Next Token Prediction)** models. Loading therefore happens in two steps: first merge the MNTP weights into the base model, then load this SupCon adapter on top.

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel
from llm2vec import LLM2Vec

# --- 1. Define model IDs ---
base_model_id = "google/codegemma-7b-it"
mntp_model_id = "SYSUSELab/DCS-CodeGemma-7b-It-MNTP"
supcon_model_id = "SYSUSELab/DCS-CodeGemma-7b-It-SupCon-CSN"

# --- 2. Load the base model and merge the MNTP adapter ---
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda" if torch.cuda.is_available() else "cpu",
)
model = PeftModel.from_pretrained(model, mntp_model_id)
model = model.merge_and_unload()

# --- 3. Load the supervised adapter (this model) on top of the MNTP-merged model ---
model = PeftModel.from_pretrained(model, supcon_model_id)

# --- 4. Use the LLM2Vec wrapper for encoding ---
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)

queries = ["how to read a file in Python?"]
code_snippets = ["with open('file.txt', 'r') as f:\n    content = f.read()"]
query_embeddings = l2v.encode(queries)
code_embeddings = l2v.encode(code_snippets)

print("Query Embedding Shape:", query_embeddings.shape)
# This usage example is adapted from the official llm2vec repository. Credits to the original authors.
```
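
The embeddings returned above can be compared directly for retrieval. The following is a minimal, illustrative sketch (not part of the original card) that ranks the candidate snippets for each query by cosine similarity; it assumes `l2v.encode` returned PyTorch tensors, as in the example above.

```python
import torch
import torch.nn.functional as F

# L2-normalize so that dot products equal cosine similarities.
q = F.normalize(query_embeddings, p=2, dim=-1)   # (num_queries, hidden_dim)
c = F.normalize(code_embeddings, p=2, dim=-1)    # (num_snippets, hidden_dim)

scores = q @ c.T                                  # one row per query, one column per snippet
ranking = torch.argsort(scores, dim=-1, descending=True)

print("Best match for query 0:", code_snippets[ranking[0, 0].item()])
```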

---

## 📄 Citation

If you use our model or work in your research, please cite our paper. As our method is built upon `llm2vec`, please also cite their foundational work.

**Our Paper:**
* **Paper Link:** [Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)
* **GitHub:** [https://github.com/Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)
* **BibTeX:**
```bibtex
@article{chen2024decoder,
  title={Are Decoder-Only Large Language Models the Silver Bullet for Code Search?},
  author={Chen, Yuxuan and Liu, Mingwei and Ou, Guangsheng and Li, Anji and Dai, Dekun and Wang, Yanlin and Zheng, Zibin},
  journal={arXiv preprint arXiv:2410.22240},
  year={2024}
}
```

**llm2vec (Foundational Work):**
* **Paper Link:** [LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders](https://arxiv.org/abs/2404.05961)
* **GitHub:** [https://github.com/McGill-NLP/llm2vec](https://github.com/McGill-NLP/llm2vec)
* **BibTeX:**
```bibtex
@article{behnamghader2024llm2vec,
  title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
  author={BehnamGhader, Parishad and Adlakha, Vaibhav and Mosbach, Marius and Bahdanau, Dzmitry and Chapados, Nicolas and Reddy, Siva},
  journal={arXiv preprint arXiv:2404.05961},
  year={2024}
}
```
adapter_config.json ADDED
{
  "alpha_pattern": {},
  "auto_mapping": {
    "base_model_class": "GemmaBiModel",
    "parent_library": "llm2vec.models.bidirectional_gemma"
  },
  "base_model_name_or_path": "google/codegemma-7b-it",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 32,
  "lora_dropout": 0.05,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 16,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "k_proj",
    "gate_proj",
    "q_proj",
    "down_proj",
    "up_proj",
    "v_proj",
    "o_proj"
  ],
  "task_type": null,
  "use_dora": false,
  "use_rslora": false
}
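
As a small optional sketch (not part of the upload): the LoRA hyperparameters above (r=16, lora_alpha=32, dropout 0.05, attention and MLP projection target modules) can be inspected programmatically with `peft`, assuming the adapter is published under the `SYSUSELab/DCS-CodeGemma-7b-It-SupCon-CSN` id referenced in the card.

```python
from peft import PeftConfig

# Downloads adapter_config.json from the Hub and parses it into a LoraConfig.
cfg = PeftConfig.from_pretrained("SYSUSELab/DCS-CodeGemma-7b-It-SupCon-CSN")
print(cfg.peft_type, cfg.base_model_name_or_path)
print("rank:", cfg.r, "alpha:", cfg.lora_alpha, "dropout:", cfg.lora_dropout)
print("target modules:", sorted(cfg.target_modules))
```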
adapter_model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:1ad528a80f6862ab5c068989d2194e2dade5f427d25962def38d345f39a21946
size 100058184
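
The file above is a Git LFS pointer; the actual ~100 MB adapter weights are stored via LFS and are resolved automatically when the repo is downloaded through the Hugging Face Hub. A minimal sketch of fetching and inspecting the weights directly, with the repo id assumed from the card:

```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Downloads the resolved safetensors file (not the LFS pointer) from the Hub.
path = hf_hub_download(
    repo_id="SYSUSELab/DCS-CodeGemma-7b-It-SupCon-CSN",
    filename="adapter_model.safetensors",
)
state_dict = load_file(path)
print(len(state_dict), "LoRA tensors")
```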