Update README.md

README.md (changed)

---
language:
- en
- zh
datasets:
- survivi/Llama-3-SynE-Dataset
library_name: transformers
pipeline_tag: text-generation
---

<p align="center">
<img src="https://github.com/RUC-GSAI/Llama-3-SynE/blob/main/assets/llama-3-syne-logo.png" width="400"/>
</p>

<!-- <p align="center">
📄 <a href="https://arxiv.org/abs/2407.18743"> Report </a>  |   🤗 <a href="https://huggingface.co/survivi/Llama-3-SynE">Model on Hugging Face</a>  |   📊 <a href="https://huggingface.co/datasets/survivi/Llama-3-SynE-Dataset">CPT Dataset</a>
</p>

<p align="center">
🔍 <a href="https://github.com/RUC-GSAI/Llama-3-SynE/blob/main/README.md">English</a>  |  <a href="https://github.com/RUC-GSAI/Llama-3-SynE/blob/main/README_zh.md">简体中文</a>
</p> -->

<p align="center">
📄 <a href="https://arxiv.org/abs/2407.18743"> Report </a>  |   💻 <a href="https://github.com/RUC-GSAI/Llama-3-SynE">GitHub Repo</a>
</p>

<p align="center">
🔍 <a href="https://huggingface.co/survivi/Llama-3-SynE/blob/main/README.md">English</a>  |  <a href="https://huggingface.co/survivi/Llama-3-SynE/blob/main/README_zh.md">简体中文</a>
</p>

> Here is the Llama-3-SynE model. The continual pre-training dataset is also available [here](https://huggingface.co/datasets/survivi/Llama-3-SynE-Dataset).

<!-- <p align="center">
📄 <a href="https://arxiv.org/abs/2407.18743"> Report </a>  |   💻 <a href="https://github.com/RUC-GSAI/Llama-3-SynE">GitHub Repo</a>
</p>

<p align="center">
🔍 <a href="https://huggingface.co/datasets/survivi/Llama-3-SynE-Dataset/blob/main/README.md">English</a>  |  <a href="https://huggingface.co/datasets/survivi/Llama-3-SynE-Dataset/blob/main/README_zh.md">简体中文</a>
</p>

> Here is the continual pre-training dataset. The Llama-3-SynE model is available [here](https://huggingface.co/survivi/Llama-3-SynE). -->

---

## News

- ✨✨ `2024/08/12`: We released the [continual pre-training dataset](https://huggingface.co/datasets/survivi/Llama-3-SynE-Dataset).
- ✨✨ `2024/08/10`: We released the [Llama-3-SynE model](https://huggingface.co/survivi/Llama-3-SynE).
- ✨ `2024/07/26`: We released the [technical report](https://arxiv.org/abs/2407.18743), feel free to check it out!

## Model Introduction

**Llama-3-SynE** (<ins>Syn</ins>thetic data <ins>E</ins>nhanced Llama-3) is a significantly enhanced version of [Llama-3 (8B)](https://github.com/meta-llama/llama3), obtained through continual pre-training (CPT) to improve its **Chinese language ability and scientific reasoning capability**. By employing a meticulously designed data mixture and curriculum strategy, Llama-3-SynE acquires new abilities while maintaining the original model's performance. This enhancement process combines existing datasets with high-quality synthetic datasets designed specifically for the targeted tasks.

Key features of Llama-3-SynE include:

- **Enhanced Chinese Language Capabilities**: Achieved through a topic-based data mixture and a perplexity-based data curriculum.
- **Improved Scientific Reasoning**: Utilizes synthetic datasets to enhance multi-disciplinary scientific knowledge.
- **Efficient CPT**: Consumes only around 100 billion tokens, making it a cost-effective solution.

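The perplexity-based data curriculum mentioned above can be illustrated with a toy sketch. This is not the recipe from the report: the scoring model here is a hypothetical Laplace-smoothed unigram LM standing in for a real reference model, and the corpus is made up. The idea is simply to score each candidate document by perplexity and schedule low-perplexity (easier) data earlier in CPT.

```python
import math
from collections import Counter

def unigram_perplexity(doc: str, counts: Counter, total: int, vocab: int) -> float:
    """Perplexity under a Laplace-smoothed unigram LM (a toy stand-in
    for the reference model that would score CPT documents)."""
    tokens = doc.split()
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
    return math.exp(-log_prob / max(len(tokens), 1))

# Toy corpus: two "easy" documents with common words, one jargon-heavy one.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "quantum chromodynamics lagrangian renormalization",
]
counts = Counter(t for doc in corpus for t in doc.split())
total = sum(counts.values())
vocab = len(counts)

# Perplexity-based curriculum: schedule low-perplexity (easy) data first.
curriculum = sorted(corpus, key=lambda d: unigram_perplexity(d, counts, total, vocab))
print(curriculum[-1])  # the jargon-heavy document is scheduled last
```

In a real CPT pipeline the reference model would be a pretrained LM and the sorted stream would typically be bucketed into stages rather than strictly ordered.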
## Model List

| Model        | Type | Seq Length | Download                                                      |
| :----------- | :--- | :--------- | :------------------------------------------------------------ |
| Llama-3-SynE | Base | 8K         | [🤗 Huggingface](https://huggingface.co/survivi/Llama-3-SynE) |

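Since this is a base (non-chat) checkpoint with `pipeline_tag: text-generation`, it can be loaded with the standard `transformers` causal-LM API. A hedged sketch, not the repository's official quick-start: the generation settings below are illustrative defaults.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate(prompt: str, model_id: str = "survivi/Llama-3-SynE") -> str:
    # Base model: plain text completion, no chat template is applied.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

if __name__ == "__main__":
    print(generate("继续预训练的主要优点是"))
```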
## Benchmarks

For HumanEval and ARC, we report the zero-shot evaluation performance.

### Major Benchmarks

| **Models**              | **MMLU**         | **C-Eval**       | **CMMLU**        | **MATH**         | **GSM8K**        | **ASDiv**        | **MAWPS**        | **SAT-Math**     | **HumanEval**    | **MBPP**         |
| :---------------------- | :--------------- | :--------------- | :--------------- | :--------------- | :--------------- | :--------------- | :--------------- | :--------------- | :--------------- | :--------------- |
| Llama-3-8B              | **66.60**        | 49.43            | 51.03            | 16.20            | 54.40            | 72.10            | 89.30            | 38.64            | <ins>36.59</ins> | **47.00**        |
| DCLM-7B                 | 64.01            | 41.24            | 40.89            | 14.10            | 39.20            | 67.10            | 83.40            | <ins>41.36</ins> | 21.95            | 32.60            |
| Mistral-7B-v0.3         | 63.54            | 42.74            | 43.72            | 12.30            | 40.50            | 67.50            | 87.50            | 40.45            | 25.61            | 36.00            |
| Llama-3-Chinese-8B      | 64.10            | <ins>50.14</ins> | <ins>51.20</ins> | 3.60             | 0.80             | 1.90             | 0.60             | 36.82            | 9.76             | 14.80            |
| MAmmoTH2-8B             | 64.89            | 46.56            | 45.90            | **34.10**        | **61.70**        | **82.80**        | <ins>91.50</ins> | <ins>41.36</ins> | 17.68            | 38.80            |
| Galactica-6.7B          | 37.13            | 26.72            | 25.53            | 5.30             | 9.60             | 40.90            | 51.70            | 23.18            | 7.31             | 2.00             |
| **Llama-3-SynE (ours)** | <ins>65.19</ins> | **58.24**        | **57.34**        | <ins>28.20</ins> | <ins>60.80</ins> | <ins>81.00</ins> | **94.10**        | **43.64**        | **42.07**        | <ins>45.60</ins> |

> On **Chinese evaluation benchmarks** (such as C-Eval and CMMLU), Llama-3-SynE significantly outperforms the base model Llama-3 (8B), indicating that our method is very effective in improving Chinese language capabilities.

### Scientific Benchmarks

"PHY", "CHE", and "BIO" denote the physics, chemistry, and biology sub-tasks of the corresponding benchmarks.

| **Models**              | **SciEval PHY**  | **SciEval CHE**  | **SciEval BIO**  | **SciEval Avg.** | **SciQ**         | **GaoKao MathQA** | **GaoKao CHE**   | **GaoKao BIO**   | **ARC Easy**     | **ARC Challenge** | **ARC Avg.**     | **AQUA-RAT**     |
| :---------------------- | :--------------- | :--------------- | :--------------- | :--------------- | :--------------- | :---------------- | :--------------- | :--------------- | :--------------- | :---------------- | :--------------- | :--------------- |
| Llama-3-8B              | 46.95            | 63.45            | 74.53            | 65.47            | 90.90            | 27.92             | 32.85            | 43.81            | 91.37            | 77.73             | 84.51            | <ins>27.95</ins> |
| DCLM-7B                 | **56.71**        | 64.39            | 72.03            | 66.25            | **92.50**        | 29.06             | 31.40            | 37.14            | 89.52            | 76.37             | 82.94            | 20.08            |
| Mistral-7B-v0.3         | 48.17            | 59.41            | 68.89            | 61.51            | 89.40            | 30.48             | 30.92            | 41.43            | 87.33            | 74.74             | 81.04            | 23.23            |
| Llama-3-Chinese-8B      | 48.17            | 67.34            | 73.90            | <ins>67.34</ins> | 89.20            | 27.64             | 30.43            | 38.57            | 88.22            | 70.48             | 79.35            | 27.56            |
| MAmmoTH2-8B             | 49.39            | **69.36**        | <ins>76.83</ins> | **69.60**        | 90.20            | **32.19**         | <ins>36.23</ins> | <ins>49.05</ins> | **92.85**        | **84.30**         | **88.57**        | 27.17            |
| Galactica-6.7B          | 34.76            | 43.39            | 54.07            | 46.27            | 71.50            | 23.65             | 27.05            | 24.76            | 65.91            | 46.76             | 56.33            | 20.87            |
| **Llama-3-SynE (ours)** | <ins>53.66</ins> | <ins>67.81</ins> | **77.45**        | **69.60**        | <ins>91.20</ins> | <ins>31.05</ins>  | **51.21**        | **69.52**        | <ins>91.58</ins> | <ins>80.97</ins>  | <ins>86.28</ins> | **28.74**        |

> On **scientific evaluation benchmarks** (such as SciEval, GaoKao, and ARC), Llama-3-SynE significantly outperforms the base model, showing particularly remarkable improvement on Chinese scientific benchmarks (for example, a 25.71% improvement on the GaoKao biology subtest).

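The GaoKao biology figure quoted above is the absolute gap between the two rows of the scientific benchmark table, i.e. percentage points rather than a relative increase. A one-line sanity check against the table values:

```python
# Scores from the scientific benchmark table, GaoKao BIO column.
base_llama3 = 43.81   # Llama-3-8B
llama3_syne = 69.52   # Llama-3-SynE (ours)

gain = round(llama3_syne - base_llama3, 2)
print(gain)  # 25.71 (absolute percentage points)
```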
## License

This project is built upon Meta's Llama-3 model. Use of the Llama-3-SynE model weights must follow the Llama-3 [license agreement](https://github.com/meta-llama/llama3/blob/main/LICENSE). The code in this open-source repository is released under the [Apache 2.0](LICENSE) license.

## Citation