CJK Phi3.5-mini-instruct
Collection
Phi3.5 model and tokenizer with improved Chinese, Japanese, and Korean language capabilities.
Phi-3.5-mini-instruct Tokenizer supporting Simplified Chinese, Traditional Chinese, Japanese, and Korean.
from transformers import LlamaTokenizer

# Load the original Phi-3.5 tokenizer and the CJK-extended one.
base_tokenizer = LlamaTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")
new_tokenizer = LlamaTokenizer.from_pretrained("nntsuzu/CJK-Phi-3.5-tokenizer")

# One test sentence per script: English, Simplified Chinese,
# Traditional Chinese, Japanese, and Korean.
text = (
    "Hello. This sentence is written in English."
    "你好。这句话是用简体中文写的。"
    "你好。這句話是用繁體中文寫的。"
    "こんにちは。これは日本語で書かれた文章です。"
    "안녕하세요. 이 문장은 한국어로 쓰여졌습니다."
)

print("----------- Tokenizer test -----------")
print(f"Phi_3.5_mini_instruct: {base_tokenizer.tokenize(text)}\n")
print(f"new tokenizer: {new_tokenizer.tokenize(text)}\n")
Because this tokenizer also covers many Traditional Chinese characters and Hangul, it can likely handle Korean mixed-script text (Hangul mixed with Hanja) as well.
from transformers import LlamaTokenizer

base_tokenizer = LlamaTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")
new_tokenizer = LlamaTokenizer.from_pretrained("nntsuzu/CJK-Phi-3.5-tokenizer")

# Korean mixed-script test sentence (Hangul with Hanja); it reads:
# "Since this tokenizer supports many Traditional Chinese characters and
# Hangul, it may also be usable for text mixing Hangul and Hanja."
text = "이 토크나이저는 많은 繁体字나 한글을 支援하기 때문에 한글과 漢字가 混用한 文章에도 使用할 수 있는 可能性이 있습니다."

print("----------- Tokenizer test -----------")
print(f"Phi_3.5_mini_instruct: {base_tokenizer.tokenize(text)}\n")
print(f"new tokenizer: {new_tokenizer.tokenize(text)}\n")
Base model: microsoft/Phi-3.5-mini-instruct