The tokenizer vocab contains mostly English words and Latin script rather than Arabic

#6
by issam9 - opened

Hi,
It seems that the tokenizer was not trained on text that is mainly in Arabic script. When applied to Arabic text, the output comes out over-segmented, and the performance of the model on my task is much worse than other Arabic models. When I checked the vocab.txt file, it seems to contain mostly English tokens.
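For reference, here is a minimal sketch of how I checked this (the model ID below is a placeholder, and it assumes the tokenizer loads with `transformers.AutoTokenizer`):

```python
from transformers import AutoTokenizer

# Placeholder model ID -- replace with the actual repo name of this model.
tokenizer = AutoTokenizer.from_pretrained("tunis-ai/TunBERT")

# A short Arabic sentence; an Arabic-aware tokenizer keeps words mostly whole,
# while an English-centric vocab splits them into many subword pieces.
text = "أين يقع هذا المطعم؟"
tokens = tokenizer.tokenize(text)
print(len(text.split()), "words ->", len(tokens), "tokens")
print(tokens)

# Rough check of how much of the vocab contains Arabic script at all.
vocab = tokenizer.get_vocab()
arabic = sum(
    any("\u0600" <= ch <= "\u06FF" for ch in tok)
    for tok in vocab
)
print(f"{arabic} of {len(vocab)} vocab entries contain Arabic characters")
```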

Tunisia.AI org

@issam9 Their script leverages a pretrained checkpoint; see https://github.com/instadeepai/tunbert/blob/main/models/bert-nvidia/configs/sentiment_analysis_config.yaml#L29 for more details.
Let me know if this answers your question.
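If you want to confirm that the vocab is inherited from an English base checkpoint, a quick sketch is to compare it against that checkpoint's tokenizer. Both model IDs here are assumptions: a placeholder for this repo, and bert-base-uncased as a guess at the English base.

```python
from transformers import AutoTokenizer

# Assumed IDs: placeholder for this model, bert-base-uncased as the guessed base.
tun = AutoTokenizer.from_pretrained("tunis-ai/TunBERT")
base = AutoTokenizer.from_pretrained("bert-base-uncased")

tun_vocab = set(tun.get_vocab())
base_vocab = set(base.get_vocab())

# A very high overlap suggests the vocab was carried over from the English checkpoint.
overlap = len(tun_vocab & base_vocab) / len(tun_vocab)
print(f"{overlap:.1%} of the vocab overlaps with bert-base-uncased")
```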
