The tokenizer vocab contains mostly English words and Latin script rather than Arabic

#6
by issam9 - opened

Hi,
It seems that the tokenizer was not trained on text that is mainly in Arabic script. When applied to Arabic text, the output comes out over-segmented, and the performance of the model on my task is much worse than other Arabic models. When I checked the vocab.txt file, it seems to contain mostly English tokens.
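For reference, here is a minimal sketch of how I checked this (the model ID below is a placeholder, and it assumes the tokenizer loads with `transformers.AutoTokenizer`):

```python
from transformers import AutoTokenizer

# Placeholder model ID -- replace with the actual repo name of this model.
tokenizer = AutoTokenizer.from_pretrained("tunis-ai/TunBERT")

# A short Arabic sentence; an Arabic-aware tokenizer keeps words mostly whole,
# while an English-centric vocab splits them into many subword pieces.
text = "أين يقع هذا المطعم؟"
tokens = tokenizer.tokenize(text)
print(len(text.split()), "words ->", len(tokens), "tokens")
print(tokens)

# Rough check of how much of the vocab contains Arabic script at all.
vocab = tokenizer.get_vocab()
arabic = sum(
    any("\u0600" <= ch <= "\u06FF" for ch in tok)
    for tok in vocab
)
print(f"{arabic} of {len(vocab)} vocab entries contain Arabic characters")
```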

Tunisia.AI org

@issam9 Their script leverages a pretrained checkpoint; see https://github.com/instadeepai/tunbert/blob/main/models/bert-nvidia/configs/sentiment_analysis_config.yaml#L29 for more details.
Let me know if this answers your question.
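If you want to confirm that the vocab is inherited from an English base checkpoint, a quick sketch is to compare it against that checkpoint's tokenizer. Both model IDs here are assumptions: a placeholder for this repo, and bert-base-uncased as a guess at the English base.

```python
from transformers import AutoTokenizer

# Assumed IDs: placeholder for this model, bert-base-uncased as the guessed base.
tun = AutoTokenizer.from_pretrained("tunis-ai/TunBERT")
base = AutoTokenizer.from_pretrained("bert-base-uncased")

tun_vocab = set(tun.get_vocab())
base_vocab = set(base.get_vocab())

# A very high overlap suggests the vocab was carried over from the English checkpoint.
overlap = len(tun_vocab & base_vocab) / len(tun_vocab)
print(f"{overlap:.1%} of the vocab overlaps with bert-base-uncased")
```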
