KenLM (arpa) models for Dutch based on Wikipedia
This repository contains 5-gram KenLM models for Dutch, trained on the Dutch portion of Wikipedia after sentence segmentation (one sentence per line). Models are provided for tokens, part-of-speech tags, dependency labels, and lemmas, as produced by spaCy nl_core_news_sm:
- wiki_nl_token.arpa[.bin]: token
- wiki_nl_pos.arpa[.bin]: part-of-speech tag
- wiki_nl_dep.arpa[.bin]: dependency label
- wiki_nl_lemma.arpa[.bin]: lemma
Both regular .arpa files and the more efficient KenLM binary files (.arpa.bin) are provided. You probably want to use the binary versions, which load faster.
Usage from within Python
Make sure to install dependencies:
pip install huggingface_hub
pip install https://github.com/kpu/kenlm/archive/master.zip
# If you want to use spaCy preprocessing
pip install spacy
python -m spacy download nl_core_news_sm
We can then use the huggingface_hub library to download and cache the model file that we want, and use it directly with KenLM.
import kenlm
from huggingface_hub import hf_hub_download
model_file = hf_hub_download(repo_id="BramVanroy/kenlm_wikipedia_nl", filename="wiki_nl_token.arpa.bin")
model = kenlm.Model(model_file)
text = "Ik eet graag koekjes !" # pre-tokenized
model.perplexity(text)
# 1790.5033832700467
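To relate the perplexity above to KenLM's log-probabilities: as far as we know, the Python wrapper's `score` returns a log10 probability for the whole sentence, and perplexity is derived from it by normalizing over the number of whitespace-separated tokens plus one for the end-of-sentence marker. The sketch below reproduces that arithmetic in plain Python. The helper name `perplexity_from_log10` is hypothetical and the input score is an illustrative value, not an actual model output.

```python
def perplexity_from_log10(log10_prob: float, sentence: str) -> float:
    """Convert a sentence-level log10 probability into perplexity.

    Assumes the score includes the end-of-sentence token, so the
    normalizer is the number of whitespace-separated tokens plus one.
    """
    n_tokens = len(sentence.split()) + 1  # +1 for the end-of-sentence marker
    return 10.0 ** (-log10_prob / n_tokens)


# Illustrative score only: 5 tokens + 1 = 6, so 10 ** (12 / 6) = 100
print(perplexity_from_log10(-12.0, "Ik eet graag koekjes !"))
# 100.0
```

With a loaded model, `perplexity_from_log10(model.score(text), text)` should closely match `model.perplexity(text)`, assuming the default `bos=True, eos=True` scoring.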
It is recommended to use spaCy as a preprocessor, so that your input uses the same tag sets and tokenization as the data the LMs were trained on.
import kenlm
import spacy
from huggingface_hub import hf_hub_download
model_file = hf_hub_download(repo_id="BramVanroy/kenlm_wikipedia_nl", filename="wiki_nl_pos.arpa.bin") # pos file
model = kenlm.Model(model_file)
nlp = spacy.load("nl_core_news_sm")
text = "Ik eet graag koekjes!"
pos_sequence = " ".join([token.pos_ for token in nlp(text)])
# 'PRON VERB ADV NOUN PUNCT'
model.perplexity(pos_sequence)
# 6.190638021041525
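The same pattern should carry over to the other models by swapping the token attribute. The sketch below is a minimal helper under stated assumptions: only the `pos_` mapping is confirmed by the example above, while `text`, `dep_`, and `lemma_` are our guess at the matching spaCy token attributes. The names `ATTR_BY_MODEL` and `to_sequence` are hypothetical, and the stand-in tokens carry made-up annotations so the sketch runs without spaCy.

```python
from types import SimpleNamespace

# Assumed mapping from model file to spaCy token attribute.
# Only the POS entry is confirmed by the example above.
ATTR_BY_MODEL = {
    "wiki_nl_token.arpa.bin": "text",
    "wiki_nl_pos.arpa.bin": "pos_",
    "wiki_nl_dep.arpa.bin": "dep_",
    "wiki_nl_lemma.arpa.bin": "lemma_",
}


def to_sequence(doc, model_filename: str) -> str:
    """Join the attribute matching the chosen model into one space-separated string."""
    attr = ATTR_BY_MODEL[model_filename]
    return " ".join(getattr(token, attr) for token in doc)


# Stand-in tokens with made-up annotations; in practice pass nlp(text) instead
fake_doc = [
    SimpleNamespace(text="Ik", pos_="PRON", dep_="nsubj", lemma_="ik"),
    SimpleNamespace(text="eet", pos_="VERB", dep_="ROOT", lemma_="eten"),
]

print(to_sequence(fake_doc, "wiki_nl_pos.arpa.bin"))
# PRON VERB
print(to_sequence(fake_doc, "wiki_nl_lemma.arpa.bin"))
# ik eten
```

The resulting string can be passed to `model.perplexity` of the model loaded from the same file name.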
Reproduction
Example:
bin/lmplz -o 5 -S 75% -T ../data/tmp/ < ../data/wikipedia/nl/wiki_nl_processed_lemma_dedup.txt > ../data/wikipedia/nl/models/wiki_nl_lemma.arpa
bin/build_binary ../data/wikipedia/nl/models/wiki_nl_lemma.arpa ../data/wikipedia/nl/models/wiki_nl_lemma.arpa.bin
For the class-based LMs (POS and DEP), the --discount_fallback option was used and the parsed data was not deduplicated. For the token and lemma models, the data was deduplicated at the sentence level.
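Sentence-level deduplication, as mentioned for the token and lemma models, could look roughly like the sketch below. This is not the script that was actually used for these models. The function name `dedup_sentences` is hypothetical, and both keeping the first occurrence and skipping empty lines are our assumptions.

```python
from typing import Iterable, Iterator


def dedup_sentences(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct sentence once, in order of first appearance."""
    seen = set()
    for line in lines:
        sentence = line.rstrip("\n")
        # Skip empty lines and sentences that were already emitted
        if not sentence or sentence in seen:
            continue
        seen.add(sentence)
        yield sentence


sample = [
    "Ik eet graag koekjes !\n",
    "Dit is een zin .\n",
    "Ik eet graag koekjes !\n",
]
print(list(dedup_sentences(sample)))
# ['Ik eet graag koekjes !', 'Dit is een zin .']
```

Because the function accepts any iterable of lines, an open file handle can be passed directly. Note that the set of seen sentences is held in memory, which may matter for a corpus the size of Wikipedia.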