himanshu-skid19 committed
Commit 4bb2a04 · verified · 1 Parent(s): 6ab9d77
.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+images/macro_arch.png filter=lfs diff=lfs merge=lfs -text
+images/module.png filter=lfs diff=lfs merge=lfs -text
+images/performance2.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,204 @@
---
library_name: transformers
license: other
license_name: nvidia-open-model-license
license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
pipeline_tag: text-generation
---


# Hymba-1.5B-Base

<p align="center">
💾 <a href="https://github.com/NVlabs/hymba">GitHub</a>&nbsp;&nbsp; | &nbsp;&nbsp; 📄 <a href="https://arxiv.org/abs/2411.13676">Paper</a> | &nbsp;&nbsp; 📜 <a href="https://developer.nvidia.com/blog/hymba-hybrid-head-architecture-boosts-small-language-model-performance/">Blog</a>
</p>


## Model Overview

Hymba-1.5B-Base is a base text-to-text model that can be adopted for a variety of natural language generation tasks.

The model has a hybrid architecture with Mamba and attention heads running in parallel. Meta tokens, a set of learnable tokens prepended to every prompt, help improve the efficacy of the model. The model shares its KV cache between every two layers and between heads within a single layer, and 90% of its attention layers use sliding window attention.

This model is ready for commercial use.


**Model Developer:** NVIDIA

**Model Dates:** Hymba-1.5B-Base was trained between September 1, 2024 and November 10, 2024.

**License:**
This model is released under the [NVIDIA Open Model License Agreement](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf).


## Model Architecture

> ⚡️ We've released a minimal implementation of Hymba on GitHub to help developers understand and implement its design principles in their own models. Check it out! [barebones-hymba](https://github.com/NVlabs/hymba/tree/main/barebones_hymba).

Hymba-1.5B-Base has a model embedding size of 1600, 25 attention heads, an MLP intermediate dimension of 5504, 32 layers in total, and 16 SSM states. Three of its layers use full attention; the rest use sliding window attention. Unlike a standard Transformer, each attention layer in Hymba combines standard attention heads and Mamba heads running in parallel. Additionally, it uses Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE).

Features of this architecture (a toy sketch of the hybrid-head idea follows this list):

- Fuse attention heads and SSM heads within the same layer, offering parallel and complementary processing of the same inputs.

<div align="center">
<img src="https://huggingface.co/nvidia/Hymba-1.5B-Base/resolve/main/images/module.png" alt="Hymba Module" width="600">
</div>

- Introduce meta tokens that are prepended to the input sequences and interact with all subsequent tokens, thus storing important information and alleviating the burden of "forced-to-attend" in attention.

- Integrate cross-layer KV sharing and global-local attention to further boost memory and computation efficiency.

<div align="center">
<img src="https://huggingface.co/nvidia/Hymba-1.5B-Base/resolve/main/images/macro_arch.png" alt="Hymba Model" width="600">
</div>
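
Below is a minimal, runnable toy sketch of the hybrid-head idea only: attention heads and an SSM-style branch process the same input in parallel, and their normalized outputs are averaged. It is **not** the repository's implementation (see `modeling_hymba.py` for that), and it substitutes an ordinary GRU for the Mamba branch purely for illustration; the dimensions mirror the numbers above.

```py
import torch
import torch.nn as nn

class ToyHybridHeadBlock(nn.Module):
    """Toy sketch: attention heads and an SSM-style branch run in parallel on the
    same input; their normalized outputs are averaged and added back residually."""
    def __init__(self, d_model=1600, n_heads=25):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ssm_stub = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for the Mamba/SSM branch
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_ssm = nn.LayerNorm(d_model)

    def forward(self, x):
        T = x.size(1)
        # Causal mask so attention only looks at previous positions, as in a decoder.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        a, _ = self.attn(x, x, x, attn_mask=causal, need_weights=False)
        s, _ = self.ssm_stub(x)
        # Parallel, complementary branches fused by averaging their normalized outputs.
        return x + 0.5 * (self.norm_attn(a) + self.norm_ssm(s))

x = torch.randn(1, 16, 1600)
print(ToyHybridHeadBlock()(x).shape)  # torch.Size([1, 16, 1600])
```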



## Performance Highlights
- Hymba-1.5B-Base outperforms all sub-2B public models.

<div align="center">
<img src="https://huggingface.co/nvidia/Hymba-1.5B-Base/resolve/main/images/performance1.png" alt="Compare with SoTA Small LMs" width="800">
</div>

<div align="center">
<img src="https://huggingface.co/nvidia/Hymba-1.5B-Base/resolve/main/images/performance2.png" alt="Compare with SoTA Small LMs" width="800">
</div>


## Model Usage


### Step 1: Environment Setup

Since Hymba-1.5B-Base employs [FlexAttention](https://pytorch.org/blog/flexattention/), which relies on PyTorch 2.5 and other related dependencies, we provide two ways to set up the environment:

- **[Local install]** Install the related packages using our provided `setup.sh` (supports CUDA 12.1/12.4):

```
wget --header="Authorization: Bearer YOUR_HF_TOKEN" https://huggingface.co/nvidia/Hymba-1.5B-Base/resolve/main/setup.sh
bash setup.sh
```

- **[Docker]** A Docker image is provided with all of Hymba's dependencies installed. You can download our Docker image and start a container using the following commands:

```
docker pull ghcr.io/tilmto/hymba:v1
docker run --gpus all -v /home/$USER:/home/$USER -it ghcr.io/tilmto/hymba:v1 bash
```


### Step 2: Chat with Hymba-1.5B-Base
After setting up the environment, you can use the following script to chat with our model.

```py
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the tokenizer and model
repo_name = "nvidia/Hymba-1.5B-Base"

tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_name, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)

# Chat with Hymba
prompt = input()
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')

# Greedy decoding: sampling parameters such as temperature are ignored when do_sample=False
outputs = model.generate(**inputs, max_length=64, do_sample=False, use_cache=True)
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)

print(f"Model response: {response}")
```
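
Continuing from the snippet above, sampled (rather than greedy) generation works through the standard `generate` arguments; the parameter values below are illustrative, not tuned recommendations:

```py
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,   # sampling enabled, so temperature/top_p take effect
    temperature=0.7,
    top_p=0.9,
    use_cache=True,
)
print(tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True))
```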

## Finetuning Hymba


[LMFlow](https://github.com/OptimalScale/LMFlow) is a complete pipeline for fine-tuning large language models.
The following steps provide an example of how to fine-tune the `Hymba-1.5B-Base` model using LMFlow.

1. Using Docker

```
docker pull ghcr.io/tilmto/hymba:v1
docker run --gpus all -v /home/$USER:/home/$USER -it ghcr.io/tilmto/hymba:v1 bash
```
2. Install LMFlow

```
git clone https://github.com/OptimalScale/LMFlow.git
cd LMFlow
conda create -n lmflow python=3.9 -y
conda activate lmflow
conda install mpi4py
pip install -e .
```

3. Fine-tune the model using the following command.

```
cd LMFlow
bash ./scripts/run_finetune_hymba.sh
```

With LMFlow, you can also fine-tune the model on your custom dataset. The only thing you need to do is transform your dataset into the [LMFlow data format](https://optimalscale.github.io/LMFlow/examples/DATASETS.html).
In addition to full fine-tuning, you can also fine-tune Hymba efficiently with [DoRA](https://arxiv.org/html/2402.09353v4), [LoRA](https://github.com/OptimalScale/LMFlow?tab=readme-ov-file#lora), [LISA](https://github.com/OptimalScale/LMFlow?tab=readme-ov-file#lisa), [Flash Attention](https://github.com/OptimalScale/LMFlow/blob/main/readme/flash_attn2.md), and other acceleration techniques.
For more details, please refer to the [LMFlow for Hymba](https://github.com/OptimalScale/LMFlow/tree/main/experimental/Hymba) documentation.
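
For reference, a minimal text-only dataset in the LMFlow data format looks roughly like the JSON below; the contents are illustrative, and the linked DATASETS page is the authoritative schema:

```json
{
  "type": "text_only",
  "instances": [
    {"text": "First training document goes here."},
    {"text": "Each instance is one plain-text example."}
  ]
}
```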


## Evaluation
We use [`LM Evaluation Harness`](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the model. The evaluation commands are as follows:

```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
git fetch --all --tags
git checkout tags/v0.4.4 # the squad_completion task is not compatible with the latest version
pip install -e .

lm_eval --model hf --model_args pretrained=nvidia/Hymba-1.5B-Base,dtype=bfloat16,trust_remote_code=True \
    --tasks mmlu \
    --num_fewshot 5 \
    --batch_size 1 \
    --output_path ./hymba_HF_base_lm-results \
    --log_samples

lm_eval --model hf --model_args pretrained=nvidia/Hymba-1.5B-Base,dtype=bfloat16,trust_remote_code=True \
    --tasks arc_easy,arc_challenge,piqa,winogrande,hellaswag \
    --num_fewshot 0 \
    --batch_size 1 \
    --output_path ./hymba_HF_base_lm-results \
    --log_samples

lm_eval --model hf --model_args pretrained=nvidia/Hymba-1.5B-Base,dtype=bfloat16,trust_remote_code=True \
    --tasks squad_completion \
    --num_fewshot 1 \
    --batch_size 1 \
    --output_path ./hymba_HF_base_lm-results \
    --log_samples
```
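
The same evaluations can also be driven from Python via the harness's `simple_evaluate` entry point (available in the v0.4.x releases); this sketch mirrors the first command above and its argument values are illustrative:

```py
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=nvidia/Hymba-1.5B-Base,dtype=bfloat16,trust_remote_code=True",
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=1,
)
print(results["results"])
```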


## Limitations

The model was trained on data that contains toxic language, unsafe content, and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses, especially when prompted with toxic prompts. The model may generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, and it may produce socially unacceptable or undesirable text even if the prompt itself does not include anything explicitly offensive.

Testing suggests that this model is susceptible to jailbreak attacks. If using this model in a RAG or agentic setting, we recommend strong output validation controls to ensure that security and safety risks from user-controlled model outputs are consistent with the intended use cases.

## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).


## Citation
```
@misc{dong2024hymbahybridheadarchitecturesmall,
      title={Hymba: A Hybrid-head Architecture for Small Language Models},
      author={Xin Dong and Yonggan Fu and Shizhe Diao and Wonmin Byeon and Zijia Chen and Ameya Sunil Mahabaleshwarkar and Shih-Yang Liu and Matthijs Van Keirsbilck and Min-Hung Chen and Yoshi Suhara and Yingyan Lin and Jan Kautz and Pavlo Molchanov},
      year={2024},
      eprint={2411.13676},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.13676},
}
```
added_tokens.json ADDED
@@ -0,0 +1,3 @@
{
  "[PAD]": 32000
}
config.json ADDED
@@ -0,0 +1,191 @@
{
  "architectures": ["HymbaForCausalLM"],
  "attention_dropout": 0.0,
  "attn_hidden_size": -1,
  "attn_implementation": "flex",
  "attn_implementation_new": "flex",
  "auto_map": {
    "AutoConfig": "configuration_hymba.HymbaConfig",
    "AutoModelForCausalLM": "modeling_hymba.HymbaForCausalLM"
  },
  "bos_token_id": 1,
  "calc_logits_for_entire_prompt": false,
  "conv_dim": {
    "0": 3200, "1": 3200, "2": 3200, "3": 3200, "4": 3200, "5": 3200, "6": 3200, "7": 3200,
    "8": 3200, "9": 3200, "10": 3200, "11": 3200, "12": 3200, "13": 3200, "14": 3200, "15": 3200,
    "16": 3200, "17": 3200, "18": 3200, "19": 3200, "20": 3200, "21": 3200, "22": 3200, "23": 3200,
    "24": 3200, "25": 3200, "26": 3200, "27": 3200, "28": 3200, "29": 3200, "30": 3200, "31": 3200
  },
  "eos_token_id": 2,
  "global_attn_idx": [0, 15, 31],
  "hidden_act": "silu",
  "hidden_size": 1600,
  "initializer_range": 0.02,
  "intermediate_size": 5504,
  "kq_head_dim": -1,
  "kq_norm": "none",
  "kv_reuse_every_i_layer": -1,
  "kv_reuse_group": [
    [1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12], [13, 14],
    [16, 17, 18], [19, 20], [21, 22], [23, 24], [25, 26], [27, 28], [29, 30]
  ],
  "kv_weight_reuse": false,
  "layer_type": [
    "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h",
    "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h"
  ],
  "mamba_conv_bias": true,
  "mamba_d_conv": 4,
  "mamba_d_state": 16,
  "mamba_dt_rank": 100,
  "mamba_expand": 2,
  "mamba_inner_layernorms": true,
  "mamba_proj_bias": false,
  "max_position_embeddings": 8192,
  "memory_tokens_interspersed_every": 0,
  "mlp_hidden_act": "silu",
  "model_type": "hymba",
  "num_attention_heads": 25,
  "num_experts": 1,
  "num_experts_per_tok": 1,
  "num_hidden_layers": 32,
  "num_key_value_heads": 5,
  "num_mamba": 1,
  "num_memory_tokens": 128,
  "orig_max_position_embeddings": 2048,
  "output_router_logits": false,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "rope": true,
  "rope_theta": 10000.0,
  "rope_type": "ntk",
  "router_aux_loss_coef": 0.001,
  "seq_length": 8192,
  "sliding_window": 1024,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.44.0",
  "use_cache": false,
  "use_mamba_kernels": true,
  "v_head_dim": 128,
  "vocab_size": 32001
}
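
A small sketch for sanity-checking the model card's architecture claims against this `config.json` (it assumes `huggingface_hub` is installed; the numbers in the comments are what the fields above imply):

```py
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download("nvidia/Hymba-1.5B-Base", "config.json")
with open(path) as f:
    cfg = json.load(f)

n_layers = cfg["num_hidden_layers"]            # 32
n_full = len(cfg["global_attn_idx"])           # 3 layers with full (global) attention
print(f"sliding-window attention layers: {n_layers - n_full}/{n_layers}")  # 29/32, i.e. ~90%
print("cross-layer KV-sharing groups:", cfg["kv_reuse_group"])
print("meta/memory tokens prepended:", cfg["num_memory_tokens"])           # 128
print("q/k head dim:", cfg["hidden_size"] // cfg["num_attention_heads"],
      "| v head dim:", cfg["v_head_dim"])
```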
configuration_hymba.py ADDED
@@ -0,0 +1,116 @@
import math
from transformers.configuration_utils import PretrainedConfig


class HymbaConfig(PretrainedConfig):

    model_type = "hymba"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
        self,
        vocab_size=65536,
        tie_word_embeddings=False,
        hidden_size=4096,
        intermediate_size=14336,
        num_hidden_layers=32,
        num_attention_heads=32,
        num_key_value_heads=8,
        hidden_act="silu",
        initializer_range=0.02,
        rms_norm_eps=1e-6,
        use_cache=True,
        calc_logits_for_entire_prompt=False,
        output_router_logits=False,
        router_aux_loss_coef=0.001,
        pad_token_id=0,
        bos_token_id=1,
        eos_token_id=2,
        sliding_window=None,
        max_position_embeddings=262144,
        orig_max_position_embeddings=None,
        attention_dropout=0.0,
        num_experts_per_tok=2,
        num_experts=16,
        use_mamba_kernels=True,
        mamba_d_state=16,
        mamba_d_conv=4,
        mamba_expand=2,
        mamba_dt_rank="auto",
        mamba_conv_bias=True,
        mamba_proj_bias=False,
        mamba_inner_layernorms=True,
        kv_reuse_every_i_layer=-1,
        kv_reuse_group=None,
        kv_weight_reuse=False,
        global_attn_idx=None,
        num_mamba=1,
        attn_implementation_new='sdpa',
        rope_type=None,
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.tie_word_embeddings = tie_word_embeddings
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.sliding_window = sliding_window
        self.max_position_embeddings = max_position_embeddings
        self.orig_max_position_embeddings = orig_max_position_embeddings
        self.attention_dropout = attention_dropout

        if num_key_value_heads is None:
            num_key_value_heads = num_attention_heads

        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.rms_norm_eps = rms_norm_eps

        self.use_cache = use_cache
        self.calc_logits_for_entire_prompt = calc_logits_for_entire_prompt
        self.output_router_logits = output_router_logits
        self.router_aux_loss_coef = router_aux_loss_coef

        self.num_experts_per_tok = num_experts_per_tok
        self.num_experts = num_experts

        self.use_mamba_kernels = use_mamba_kernels
        self.mamba_d_state = mamba_d_state
        self.mamba_d_conv = mamba_d_conv
        self.mamba_expand = mamba_expand
        self.mamba_dt_rank = math.ceil(self.hidden_size / 16) if mamba_dt_rank == "auto" else mamba_dt_rank
        self.mamba_conv_bias = mamba_conv_bias
        self.mamba_proj_bias = mamba_proj_bias
        self.mamba_inner_layernorms = mamba_inner_layernorms

        self.attn_hidden_size = kwargs.pop("attn_hidden_size", -1)
        self.kq_head_dim = kwargs.pop("kq_head_dim", -1)
        self.v_head_dim = kwargs.pop("v_head_dim", -1)
        self.kq_norm = kwargs.pop("kq_norm", None)
        self.rope = kwargs.pop("rope", False)
        self.rope_theta = kwargs.pop("rope_theta", 10000.0)
        self.num_memory_tokens = kwargs.pop("num_memory_tokens", 0)
        self.memory_tokens_interspersed_every = kwargs.pop("memory_tokens_interspersed_every", 0)

        self.kv_reuse_every_i_layer = kv_reuse_every_i_layer
        self.kv_reuse_group = kv_reuse_group
        self.kv_weight_reuse = kv_weight_reuse

        self.global_attn_idx = global_attn_idx

        self.num_mamba = num_mamba

        self.attn_implementation_new = attn_implementation_new

        self.rope_type = rope_type

        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )
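
A minimal usage sketch for this class (the field values mirror `config.json` above; in practice `AutoConfig.from_pretrained("nvidia/Hymba-1.5B-Base", trust_remote_code=True)` resolves to `HymbaConfig` through the `auto_map` entry, so importing the module directly is only needed for local experimentation):

```py
from configuration_hymba import HymbaConfig  # assumes this file is importable locally

cfg = HymbaConfig(
    vocab_size=32001,
    hidden_size=1600,
    intermediate_size=5504,
    num_hidden_layers=32,
    num_attention_heads=25,
    num_key_value_heads=5,
    sliding_window=1024,
    global_attn_idx=[0, 15, 31],
    num_memory_tokens=128,
)
print(cfg.model_type, cfg.mamba_dt_rank)  # "hymba", ceil(1600 / 16) = 100
```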
generation_config.json ADDED
@@ -0,0 +1,8 @@
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 0,
  "transformers_version": "4.44.0",
  "use_cache": false
}
images/macro_arch.png ADDED

Git LFS Details

  • SHA256: c33b925fa41f3ef0cfb8a434db14808fe1e39a666d2bccd3e06707d321656679
  • Pointer size: 131 Bytes
  • Size of remote file: 143 kB
images/module.png ADDED

Git LFS Details

  • SHA256: bd46e35494bbc156f4b80674883fbce8dd928043b8dc3edf1fab403f8eb9cc78
  • Pointer size: 131 Bytes
  • Size of remote file: 117 kB
images/performance1.png ADDED
images/performance2.png ADDED

Git LFS Details

  • SHA256: ab897ce232461a16b94508cbe2a3b4b23f7ca61eaf6dde0009e6b74c25b5e240
  • Pointer size: 131 Bytes
  • Size of remote file: 199 kB
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fdff92c3753ca4b58702b3bf835c36f5210a919e9bfa2383f35dc6ab79e792eb
size 3045665048
modeling_hymba.py ADDED
The diff for this file is too large to render. See raw diff
 
setup.sh ADDED
@@ -0,0 +1,46 @@
#!/bin/bash

# Prompt user to specify CUDA version
read -p "Enter CUDA version (12.1 or 12.4): " cuda_version

# Verify CUDA version input
if [[ "$cuda_version" != "12.1" && "$cuda_version" != "12.4" ]]; then
    echo "Invalid CUDA version specified. Please choose either 12.1 or 12.4."
    exit 1
fi

export CUDA_HOME=/usr/local/cuda-$cuda_version

# Install PyTorch with the specified CUDA version
conda install pytorch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 pytorch-cuda=$cuda_version -c pytorch -c nvidia

# Install other packages
pip install --upgrade transformers
pip install tiktoken
pip install sentencepiece
pip install protobuf
pip install ninja einops triton packaging

# Clone and install Mamba
git clone https://github.com/state-spaces/mamba.git
cd mamba
pip install -e .
cd ..

# Clone and install causal-conv1d with specified CUDA version
git clone https://github.com/Dao-AILab/causal-conv1d.git
cd causal-conv1d

TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0;8.6;8.9;9.0" python setup.py install
cd ..

# Clone and install attention-gym
git clone https://github.com/pytorch-labs/attention-gym.git
cd attention-gym
pip install .
cd ..

# Install Flash Attention
pip install flash_attn

echo "Installation completed with CUDA $cuda_version."
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
{
  "bos_token": {"content": "<s>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false},
  "eos_token": {"content": "</s>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false},
  "pad_token": {"content": "[PAD]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false},
  "unk_token": {"content": "<unk>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false}
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347
size 499723
tokenizer_config.json ADDED
@@ -0,0 +1,52 @@
{
  "add_bos_token": true,
  "add_eos_token": false,
  "add_prefix_space": true,
  "added_tokens_decoder": {
    "0": {"content": "<unk>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "1": {"content": "<s>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "2": {"content": "</s>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "32000": {"content": "[PAD]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}
  },
  "bos_token": "<s>",
  "chat_template": "{{'<extra_id_0>System'}}{% for message in messages %}{% if message['role'] == 'system' %}{{'\n' + message['content'].strip()}}{% if tools or contexts %}{{'\n'}}{% endif %}{% endif %}{% endfor %}{% if tools %}{% for tool in tools %}{{ '\n<tool> ' + tool|tojson + ' </tool>' }}{% endfor %}{% endif %}{% if contexts %}{% if tools %}{{'\n'}}{% endif %}{% for context in contexts %}{{ '\n<context> ' + context.strip() + ' </context>' }}{% endfor %}{% endif %}{{'\n\n'}}{% for message in messages %}{% if message['role'] == 'user' %}{{ '<extra_id_1>User\n' + message['content'].strip() + '\n' }}{% elif message['role'] == 'assistant' %}{{ '<extra_id_1>Assistant\n' + message['content'].strip() + '\n' }}{% elif message['role'] == 'tool' %}{{ '<extra_id_1>Tool\n' + message['content'].strip() + '\n' }}{% endif %}{% endfor %}{%- if add_generation_prompt %}{{'<extra_id_1>Assistant\n'}}{%- endif %}",
  "clean_up_tokenization_spaces": false,
  "eos_token": "</s>",
  "legacy": true,
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "[PAD]",
  "padding_side": "left",
  "sp_model_kwargs": {},
  "spaces_between_special_tokens": false,
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": "<unk>",
  "use_default_system_prompt": false
}
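
The `chat_template` above renders conversations into the `<extra_id_0>System` / `<extra_id_1>User` / `<extra_id_1>Assistant` format. A minimal sketch of applying it with `transformers` (the template ships with this base model's tokenizer but is mainly intended for the instruct variant, so the rendered prompt should be treated as illustrative):

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nvidia/Hymba-1.5B-Base", trust_remote_code=True)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the Hymba architecture in one sentence."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # starts with '<extra_id_0>System' and ends with '<extra_id_1>Assistant\n'
```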