
A Phi-3.5-mini-instruct tokenizer that supports Simplified Chinese, Traditional Chinese, Japanese, and Korean.

Example

from transformers import LlamaTokenizer

# Load the original Phi-3.5 tokenizer and the extended CJK tokenizer for comparison
base_tokenizer = LlamaTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")
new_tokenizer = LlamaTokenizer.from_pretrained("nntsuzu/CJK-Phi-3.5-tokenizer")

text = (
    "Hello. This sentence is written in English."
    "你好。这句话是用简体中文写的。"
    "你好。這句話是用繁體中文寫的。"
    "こんにちは。これは日本語で書かれた文章です。"
    "안녕하세요. 이 문장은 한국어로 쓰여졌습니다."
)

print("----------- Tokenzier test -----------")
print(f"Phi_3.5_mini_instruct: {base_tokenizer.tokenize(text)}\n")
print(f"new tokenizer: {new_tokenizer.tokenize(text)}\n")

This tokenizer covers many Traditional Chinese characters and Hangul, so it can likely handle Korean mixed-script text (Hangul combined with Hanja) as well.

Example

from transformers import LlamaTokenizer

base_tokenizer = LlamaTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")
new_tokenizer = LlamaTokenizer.from_pretrained("nntsuzu/CJK-Phi-3.5-tokenizer")

text = "์ด ํ† ํฌ๋‚˜์ด์ €๋Š” ๋งŽ์€ ็นไฝ“ๅญ—๋‚˜ ํ•œ๊ธ€์„ ๆ”ฏๆดํ•˜๊ธฐ ๋•Œ๋ฌธ์— ํ•œ๊ธ€๊ณผ ๆผขๅญ—๊ฐ€ ๆทท็”จํ•œ ๆ–‡็ซ ์—๋„ ไฝฟ็”จํ•  ์ˆ˜ ์žˆ๋Š” ๅฏ่ƒฝๆ€ง์ด ์žˆ์Šต๋‹ˆ๋‹ค."

print("----------- Tokenzier test -----------")
print(f"Phi_3.5_mini_instruct: {base_tokenizer.tokenize(text)}\n")
print(f"new tokenizer: {new_tokenizer.tokenize(text)}\n")