pitt111 committed on
Commit b1f9288 · verified · 1 Parent(s): fe7af35

Upload 3 files

Files changed (3)
  1. README.md +130 -3
  2. adapter_config.json +37 -0
  3. adapter_model.safetensors +3 -0
README.md CHANGED
@@ -1,3 +1,130 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ language:
+ - code
+ library_name: peft
+ tags:
+ - code-search
+ - text-embeddings
+ - decoder-only
+ - supervised-contrastive-learning
+ - codegemma
+ - llm2vec
+ ---
+
+ ## 📖 Are Decoder-Only Large Language Models the Silver Bullet for Code Search?
+
+ This model is an official artifact from our research paper: **"[Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)"**.
+
+ In this work, we conduct a large-scale, systematic evaluation of decoder-only Large Language Models for the task of code search and present a set of effective fine-tuning and optimization strategies.
+
+ For complete details on all our experiments, to reproduce the full training/evaluation pipeline, or to use other models from the paper, please visit our official GitHub repository:
+
+ ➡️ **[GitHub: Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)**
+
+ ---
+
+ # Model Card: DCS-CodeGemma-7b-it-SupCon-CSN
+
+ ## 📜 Model Description
+
+ This is a PEFT adapter for the **`google/codegemma-7b-it`** model, fine-tuned for the task of **Code Search** as part of the research mentioned above.
+
+ The model was trained with the **Supervised Contrastive Learning** method proposed in the [llm2vec](https://github.com/McGill-NLP/llm2vec) framework and is designed to generate high-quality vector embeddings for code snippets.
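+
+ As a rough illustration of this objective (a minimal sketch, not the actual training code from the paper; the function name and the temperature value are placeholders), supervised contrastive fine-tuning pulls each query toward its paired code snippet while treating the other snippets in the batch as negatives:
+
+ ```python
+ # Minimal sketch of an in-batch supervised contrastive (InfoNCE-style) loss
+ # over paired (query, code) embeddings. Illustrative only; the real training
+ # setup is defined by the llm2vec framework and our GitHub repository.
+ import torch
+ import torch.nn.functional as F
+
+ def supcon_loss(query_emb: torch.Tensor, code_emb: torch.Tensor,
+                 temperature: float = 0.05) -> torch.Tensor:
+     # query_emb, code_emb: (batch, dim) embeddings of matched query/code pairs.
+     q = F.normalize(query_emb, dim=-1)
+     c = F.normalize(code_emb, dim=-1)
+     logits = q @ c.T / temperature                      # pairwise cosine similarities
+     labels = torch.arange(q.size(0), device=q.device)   # positives lie on the diagonal
+     return F.cross_entropy(logits, labels)
+ ```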
+
+ ## 🔬 Model Performance & Reproducibility
+
+ The table below provides details about this model, its corresponding results in our paper, and how to reproduce the evaluation.
+
+ | Attribute | Details |
+ | :------------------------- | :------------------------------------------------------------------------------------------------------------------------------ |
+ | **Base Model** | `google/codegemma-7b-it` |
+ | **Fine-tuning Method** | Supervised Contrastive Learning via `llm2vec` |
+ | **Evaluation Script** | [CSN_Test_Finetuning_Decoder_Model.py](https://github.com/Georgepitt/DecoderLLMs-CodeSearch/blob/main/Fine-tuning/CSN_Test_Finetuning_Decoder_Model.py),<br>[CoSQA_Plus_Test_Finetuning_Decoder_Model.py](https://github.com/ChenyxEugene/DecoderLLMs-CodeSearch/blob/main/Fine-tuning/CoSQA_Plus_Test_Finetuning_Decoder_Model.py) |
+ | **Prerequisite Model** | This model must be loaded on top of an MNTP pre-trained model. |
+
+ ---
+
+ ## 🚀 How to Use (with `llm2vec`)
+
+ For best results, we strongly recommend using the official `llm2vec` wrapper to load and use this model.
+
+ **1. Install Dependencies**
+ ```bash
+ pip install llm2vec transformers torch peft accelerate
+ ```
+
+ **2. Example Usage**
+
+ > **Important**: The `llm2vec` supervised contrastive (SupCon) models are fine-tuned on top of **MNTP (Masked Next Token Prediction)** models. Therefore, loading requires first merging the MNTP weights before loading the SupCon adapter.
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModel, AutoConfig
+ from peft import PeftModel
+ from llm2vec import LLM2Vec
+
+ # --- 1. Define Model IDs ---
+ base_model_id = "google/codegemma-7b-it"
+ mntp_model_id = "SYSUSELab/DCS-CodeGemma-7B-It-MNTP"
+ supcon_model_id = "SYSUSELab/DCS-CodeGemma-7B-It-SupCon-CSN-ruby-discard-0.2"
+
+ # --- 2. Load Base Model and MNTP Adapter ---
+ tokenizer = AutoTokenizer.from_pretrained(base_model_id)
+ config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
+ model = AutoModel.from_pretrained(
+     base_model_id,
+     trust_remote_code=True,
+     config=config,
+     torch_dtype=torch.bfloat16,
+     device_map="cuda" if torch.cuda.is_available() else "cpu",
+ )
+ model = PeftModel.from_pretrained(model, mntp_model_id)
+ model = model.merge_and_unload()
+
+ # --- 3. Load the Supervised (this model) Adapter on top of the MNTP-merged model ---
+ model = PeftModel.from_pretrained(model, supcon_model_id)
+
+ # --- 4. Use the LLM2Vec Wrapper for Encoding ---
+ l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)
+
+ queries = ["how to read a file in Python?"]
+ code_snippets = ["with open('file.txt', 'r') as f:\n    content = f.read()"]
+ query_embeddings = l2v.encode(queries)
+ code_embeddings = l2v.encode(code_snippets)
+
+ print("Query Embedding Shape:", query_embeddings.shape)
+ # This usage example is adapted from the official llm2vec repository. Credits to the original authors.
+ ```
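+
+ To turn these embeddings into a search result, a natural follow-up (not part of the original snippet, and assuming `encode` returns `torch` tensors as in the llm2vec examples) is to rank candidate snippets by cosine similarity:
+
+ ```python
+ # Hypothetical retrieval step: score every candidate snippet against each query
+ # by cosine similarity and pick the best match.
+ import torch.nn.functional as F
+
+ q = F.normalize(query_embeddings, dim=-1)
+ c = F.normalize(code_embeddings, dim=-1)
+ scores = q @ c.T                          # shape: (num_queries, num_snippets)
+ best = scores.argmax(dim=-1)              # best-matching snippet index per query
+ print("Best match for query 0:", code_snippets[best[0].item()])
+ ```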
+
+ ---
+
+ ## 📄 Citation
+
+ If you use our model or work in your research, please cite our paper. As our method is built upon `llm2vec`, please also cite their foundational work.
+
+ **Our Paper:**
+ * **Paper Link:** [Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)
+ * **GitHub:** [https://github.com/Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)
+ * **BibTeX:**
+ ```bibtex
+ @article{chen2024decoder,
+   title={Are Decoder-Only Large Language Models the Silver Bullet for Code Search?},
+   author={Chen, Yuxuan and Liu, Mingwei and Ou, Guangsheng and Li, Anji and Dai, Dekun and Wang, Yanlin and Zheng, Zibin},
+   journal={arXiv preprint arXiv:2410.22240},
+   year={2024}
+ }
+ ```
+
+ **llm2vec (Foundational Work):**
+ * **Paper Link:** [LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders](https://arxiv.org/abs/2404.05961)
+ * **GitHub:** [https://github.com/McGill-NLP/llm2vec](https://github.com/McGill-NLP/llm2vec)
+ * **BibTeX:**
+ ```bibtex
+ @article{behnamghader2024llm2vec,
+   title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
+   author={BehnamGhader, Parishad and Adlakha, Vaibhav and Mosbach, Marius and Bahdanau, Dzmitry and Chapados, Nicolas and Reddy, Siva},
+   journal={arXiv preprint arXiv:2404.05961},
+   year={2024}
+ }
+ ```
adapter_config.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "alpha_pattern": {},
+   "auto_mapping": {
+     "base_model_class": "GemmaBiModel",
+     "parent_library": "llm2vec.models.bidirectional_gemma"
+   },
+   "base_model_name_or_path": "google/codegemma-7b-it",
+   "bias": "none",
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 32,
+   "lora_dropout": 0.05,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "r": 16,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "up_proj",
+     "o_proj",
+     "gate_proj",
+     "k_proj",
+     "q_proj",
+     "v_proj",
+     "down_proj"
+   ],
+   "task_type": null,
+   "use_dora": false,
+   "use_rslora": false
+ }
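
For reference, the JSON above is a standard `peft` LoRA adapter configuration (rank 16, alpha 32, dropout 0.05, targeting all attention and MLP projection modules). A minimal sketch of the equivalent `LoraConfig` (illustrative only; the values are copied from the file above, the variable name is a placeholder):

```python
# Illustrative reconstruction of this adapter's LoRA hyperparameters with peft.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                    # LoRA rank
    lora_alpha=32,           # scaling factor
    lora_dropout=0.05,
    bias="none",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```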
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:82af8cb96691a320113be3d367a6f5d7206bd88455ec73ff1ebbb36530e4fe40
+ size 100058184