Fix tokenizer reloading #42
opened by kylesayrs
Purpose
- Fixes a bug where the processor cannot be saved to disk and then loaded again
- Not all tokenizer kwargs are passed to the parent class, `PreTrainedTokenizerBase`. This means that some tokenizer kwargs are not in `self.init_kwargs`, which results in them not being saved by `PreTrainedTokenizerBase.save_pretrained`
Related Issues
Changes
- Pass `encode_special_tokens` and `image_size` kwargs into `super().__init__`
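The mechanism behind the fix can be sketched with a minimal stand-in: only kwargs that reach the parent class's `__init__` end up in `init_kwargs`, which is what `save_pretrained` serializes into `tokenizer_config.json`. The classes below (`BaseTokenizer`, `GLMTokenizer`, `saved_config`) are illustrative stand-ins, not the actual `transformers` implementation.

```python
class BaseTokenizer:
    """Stand-in for PreTrainedTokenizerBase: records kwargs for saving."""

    def __init__(self, **kwargs):
        # Only kwargs forwarded here are captured for serialization
        self.init_kwargs = dict(kwargs)

    def saved_config(self):
        # Stand-in for what save_pretrained writes to tokenizer_config.json
        return dict(self.init_kwargs)


class GLMTokenizer(BaseTokenizer):
    def __init__(self, image_size=None, encode_special_tokens=False, **kwargs):
        self.image_size = image_size
        self.encode_special_tokens = encode_special_tokens
        # The fix: forward these kwargs to the parent so they land in
        # init_kwargs and survive a save/load round trip
        super().__init__(
            image_size=image_size,
            encode_special_tokens=encode_special_tokens,
            **kwargs,
        )


tok = GLMTokenizer(image_size=1120, encode_special_tokens=True)
config = tok.saved_config()
print(config["image_size"])  # 1120
```

Before the fix, the subclass stored these values only as instance attributes, so they never entered `init_kwargs` and were silently dropped on save.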
Testing
- Confirmed that the newly written `tokenizer_config.json` contains the `image_size` and `encode_special_tokens` fields, which were previously missing
```python
from transformers import AutoTokenizer

# Load the original tokenizer and confirm the kwarg is present
processor = AutoTokenizer.from_pretrained("THUDM/glm-4v-9b")
assert processor.image_size is not None

# Save to disk, reload, and confirm the kwarg survives the round trip
processor.save_pretrained("test")
processor = AutoTokenizer.from_pretrained("test")
assert processor.image_size is not None
```
kylesayrs changed pull request status to open
ZHANGYUXUAN-zR changed pull request status to merged