Commit e3ec7a1
Parent(s): 40948ff
Update README.md
README.md CHANGED
@@ -34,6 +34,8 @@ You can fine-tune this model on downstream tasks.
 
 The input text should be segmented into words by [Juman++](https://github.com/ku-nlp/jumanpp) in advance. Juman++ 2.0.0-rc3 was used for pretraining. Each word is tokenized into tokens by [sentencepiece](https://github.com/google/sentencepiece).
 
+`BertJapaneseTokenizer` now supports automatic `JumanppTokenizer` and `SentencepieceTokenizer`. You can use [this model](https://huggingface.co/nlp-waseda/roberta-large-japanese-seq512-with-auto-jumanpp) without any data preprocessing.
+
 ## Vocabulary
 
 The vocabulary consists of 32000 tokens including words ([JumanDIC](https://github.com/ku-nlp/JumanDIC)) and subwords induced by the unigram language model of [sentencepiece](https://github.com/google/sentencepiece).
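For reference, the manual preprocessing path the README's context lines describe could look roughly like the sketch below. This is a minimal sketch, not the repo's own code: it assumes the `pyknp` binding for Juman++ is installed, that `jumanpp` 2.0.0-rc3 is on the PATH, and that this repo's model ID is `nlp-waseda/roberta-large-japanese-seq512` (inferred from the linked variant's name, not stated in the diff).

```python
# Sketch of the manual path: segment raw text into words with Juman++ first,
# then let the model's sentencepiece tokenizer split each word into subwords.
# Assumptions: pyknp installed (pip install pyknp), jumanpp on PATH, and the
# model ID below inferred from the linked repo name.
from pyknp import Juman
from transformers import AutoTokenizer

juman = Juman()  # recent pyknp versions default to the jumanpp command

text = "早稲田大学で自然言語処理を研究する。"
# Word segmentation with Juman++, as the README requires.
words = [mrph.midasi for mrph in juman.analysis(text).mrph_list()]
segmented = " ".join(words)

tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-large-japanese-seq512")
# sentencepiece subword tokenization of the pre-segmented text.
inputs = tokenizer(segmented, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
```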

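The line added by this commit points to a variant that needs no manual segmentation. A hedged sketch of that path, assuming a `transformers` release recent enough that `BertJapaneseTokenizer` supports the Juman++ word tokenizer and sentencepiece subword tokenizer:

```python
# Sketch of the no-preprocessing path added in this commit: the tokenizer
# runs Juman++ word segmentation and sentencepiece subword tokenization
# internally, so raw text is passed directly.
# Assumption: a transformers version with jumanpp support in BertJapaneseTokenizer.
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "nlp-waseda/roberta-large-japanese-seq512-with-auto-jumanpp"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Raw, unsegmented input: no Juman++ call on the caller's side.
inputs = tokenizer("早稲田大学で自然言語処理を研究する。", return_tensors="pt")
outputs = model(**inputs)
```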