Commit e3ec7a1
Parent(s): 40948ff
Update README.md
README.md CHANGED
@@ -34,6 +34,8 @@ You can fine-tune this model on downstream tasks.
 
 The input text should be segmented into words by [Juman++](https://github.com/ku-nlp/jumanpp) in advance. Juman++ 2.0.0-rc3 was used for pretraining. Each word is tokenized into tokens by [sentencepiece](https://github.com/google/sentencepiece).
 
+`BertJapaneseTokenizer` now supports automatic `JumanppTokenizer` and `SentencepieceTokenizer`. You can use [this model](https://huggingface.co/nlp-waseda/roberta-large-japanese-seq512-with-auto-jumanpp) without any data preprocessing.
+
 ## Vocabulary
 
 The vocabulary consists of 32000 tokens including words ([JumanDIC](https://github.com/ku-nlp/JumanDIC)) and subwords induced by the unigram language model of [sentencepiece](https://github.com/google/sentencepiece).
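For reference, the manual preprocessing path the README's context lines describe could look roughly like the sketch below. This is a minimal sketch, not the repo's own code: it assumes the `pyknp` binding for Juman++ is installed, that `jumanpp` 2.0.0-rc3 is on the PATH, and that this repo's model ID is `nlp-waseda/roberta-large-japanese-seq512` (inferred from the linked variant's name, not stated in the diff).

```python
# Sketch of the manual path: segment raw text into words with Juman++ first,
# then let the model's sentencepiece tokenizer split each word into subwords.
# Assumptions: pyknp installed (pip install pyknp), jumanpp on PATH, and the
# model ID below inferred from the linked repo name.
from pyknp import Juman
from transformers import AutoTokenizer

juman = Juman()  # recent pyknp versions default to the jumanpp command

text = "早稲田大学で自然言語処理を研究する。"
# Word segmentation with Juman++, as the README requires.
words = [mrph.midasi for mrph in juman.analysis(text).mrph_list()]
segmented = " ".join(words)

tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-large-japanese-seq512")
# sentencepiece subword tokenization of the pre-segmented text.
inputs = tokenizer(segmented, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
```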

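The line added by this commit points to a variant that needs no manual segmentation. A hedged sketch of that path, assuming a `transformers` release recent enough that `BertJapaneseTokenizer` supports the Juman++ word tokenizer and sentencepiece subword tokenizer:

```python
# Sketch of the no-preprocessing path added in this commit: the tokenizer
# runs Juman++ word segmentation and sentencepiece subword tokenization
# internally, so raw text is passed directly.
# Assumption: a transformers version with jumanpp support in BertJapaneseTokenizer.
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "nlp-waseda/roberta-large-japanese-seq512-with-auto-jumanpp"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Raw, unsegmented input: no Juman++ call on the caller's side.
inputs = tokenizer("早稲田大学で自然言語処理を研究する。", return_tensors="pt")
outputs = model(**inputs)
```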