lbourdois committed
Commit f619fc5 · verified · 1 Parent(s): 46a64f6

Improve language tag


Hi! As the model is multilingual, this PR adds languages other than English to the language tag to improve referencing. Note that the README announces 29 languages, but only 13 are explicitly listed, so I was only able to add those 13.

Files changed (1)
  1. README.md +104 -90
README.md CHANGED
@@ -1,91 +1,105 @@
---
base_model: Qwen/Qwen2.5-7B-Instruct
library_name: peft
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
---

# Model Card for MTIPA-7B-PositionTask

If you are unable to use [MTIPA-7B-LoRA (this model)](https://huggingface.co/LLMMINE/MTIPA-7B-PositionTask/tree/main) directly (recommended), **[a merged-LoRA version of MTIPA-7B](https://huggingface.co/LLMMINE/MTIPA-7B-POSITION-MERGE)** is available and can be loaded as a standalone model.

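For the merged version, a plain `transformers` load is enough. A minimal sketch (it assumes the merged repo ships its own tokenizer files; otherwise the Qwen2.5 tokenizer used further below works the same way):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The merged checkpoint already folds the LoRA weights into the base model,
# so no peft dependency is needed here.
model = AutoModelForCausalLM.from_pretrained(
    "LLMMINE/MTIPA-7B-POSITION-MERGE",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("LLMMINE/MTIPA-7B-POSITION-MERGE")
```
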
Note that MTIPA, TIPA, and this model's training data are all Chinese, so support for other languages may be limited. If you need to train a model for a specific language or for general-purpose use, please refer to our paper and GitHub.

This model is trained on the MTIPA dataset. Given a Chinese sentence, it predicts the positions of misspelled characters and outputs each original (incorrect) character together with its correction.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the Qwen2.5-7B-Instruct base model and attach the MTIPA LoRA adapter.
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = PeftModel.from_pretrained(base_model, "LLMMINE/MTIPA-7B-PositionTask")


def chat(text):
    # Keep the system prompt in Chinese: it is the prompt the adapter was
    # trained with. Translation: "Correct the misspelled characters in the
    # input, answering as [{position: character position, incorrect: wrong
    # character, correct: corrected character}, ...], counting positions
    # from 1; if everything is correct, answer []."
    system = "纠正输入这段话中的错别字,以[{position: 字符位置, incorrect: 错误字符, correct: 纠正后的字符}, ...]形式给出,字符位置从1开始计数,如果全部正确,给出[]\n"
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": text},
    ]
    text_input = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    model_inputs = tokenizer([text_input], return_tensors="pt").to(model.device)

    # A near-zero temperature keeps the structured output deterministic.
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512,
        temperature=0.01,
    )
    # Drop the prompt tokens and keep only the newly generated ones.
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]


def main():
    print("Command-line chat started. Enter your text, or type 'exit' to quit.")
    while True:
        user_input = input("You: ")
        if user_input.lower() in ['exit', 'quit']:
            print("Exiting.")
            break
        if not user_input.strip():
            print("Please enter some text.")
            continue
        print("Reply:", chat(user_input))


if __name__ == '__main__':
    main()
```

Input:
```
花雨在镇上落了一整夜,这静寂的风暴覆盖了屋顶,堵住了房门,令露宿的动物窒息而死。如此多的花朵自天而降,天亮时大界小巷都覆上了一层绵密的花毯,人们得用铲子耙子清理出通道才能出殡。
```
The passage is taken from One Hundred Years of Solitude (Cien Años de Soledad), with `街` changed to `界` to introduce a typo.

Output:
```
[{"position": 56, "incorrect": "界", "correct": "街"}]
```

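A response in this form can be applied back to the input string. A minimal sketch (the `apply_corrections` helper is illustrative, not part of the repo, and assumes the model emits valid JSON as in the example above):

```python
import json

def apply_corrections(text, response):
    """Apply the model's corrections back to the input string."""
    chars = list(text)
    for fix in json.loads(response):
        pos = fix["position"] - 1  # the model counts positions from 1
        if 0 <= pos < len(chars) and chars[pos] == fix["incorrect"]:
            chars[pos] = fix["correct"]
    return "".join(chars)

# e.g. apply_corrections(text, chat(text)) with chat() from the snippet above
```
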
[**GitHub**](https://github.com/FloatFrank/TIPA) | [**Paper**](https://arxiv.org/abs/2411.17679)

### Framework versions

- PEFT 0.12.0