Retraining Stanza to optimize depparse on a diachronic Swedish corpus
This repository contains a fork of the official Stanza repository, with scripts that help prepare data for and train models on different combinations of treebanks relevant to historical Swedish.
Guide
The dev/test split for all models is 10/90 of our human-validated gold sentences (https://github.com/alanev52/Diachronic_Treebanks_DigPhil/tree/main/parsed_data/validated)
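The 10/90 split itself is straightforward; a minimal sketch with a hypothetical helper (the repo's prepare-train-val-test.py handles this in practice):

```python
import random

def split_dev_test(sentences, dev_frac=0.10, seed=0):
    """Shuffle gold sentences and cut off dev_frac for dev, rest for test.
    Hypothetical helper, not the repo's actual implementation."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * dev_frac)
    return shuffled[:cut], shuffled[cut:]

# Toy example: 100 sentences -> 10 dev, 90 test
dev, test = split_dev_test([f"sent{i}" for i in range(100)])
```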
Example workflow: training a model with pretrained vectors from the kubhist2 1880 incremental embeddings and training data from the Swedish and Bokmål treebanks, as well as our own machine-parsed diachronic corpus:
python prepare-train-val-test.py sv diachron bm
source scripts/config_alvis.sh
python -m stanza.utils.datasets.prepare_depparse_treebank UD_Swedish-diachronic --wordvec_pretrain_file /cephyr/users/cleland/Alvis/stanza_resources/sv/pretrain/diachronic.pt
python -m stanza.utils.training.run_depparse UD_Swedish-diachronic --wordvec_pretrain_file /cephyr/users/cleland/Alvis/stanza_resources/sv/pretrain/diachronic.pt --batch_size 32 --dropout 0.33
All of the above can be done with a single command by using:
./make_new_model.sh {vectors} {language codes}
which for the example becomes:
./make_new_model.sh diachronic.pt sv diachron bm
Pretrained vectors
We use the incremental vectors up to 1880 from Hengchen & Tahmasebi (2021).
I first converted the kubhist2 vectors from gensim fastText .ft format to a plain text file using gensim's Python package, then used Stanza's converter to produce the .pt file:
from stanza.models.common.pretrain import Pretrain

# Reads the plain-text vectors and saves them to foo.pt on first load()
pt = Pretrain("foo.pt", "new_vectors.txt")
pt.load()
The result is available compressed as diachronic.pt.xz.
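The intermediate plain text file follows the standard word2vec text format (a header line with vocabulary size and dimension, then one word and its vector per line), which is what gensim's save_word2vec_format emits. A minimal sketch with toy vectors standing in for the real kubhist2 embeddings:

```python
# Write word vectors in the word2vec text format that
# stanza.models.common.pretrain.Pretrain can consume.
# The vectors here are toy values, not real embeddings.
vectors = {
    "hus": [0.1, 0.2, 0.3],
    "gård": [0.4, 0.5, 0.6],
}

dim = len(next(iter(vectors.values())))
with open("new_vectors.txt", "w", encoding="utf-8") as f:
    f.write(f"{len(vectors)} {dim}\n")  # header: vocab size, dimension
    for word, vec in vectors.items():
        f.write(word + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")
```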
Results
DigPhil, UD-sv, bm, dk: UAS 65.45 | LAS 55.84 | CLAS 50.19 | MLAS 46.49 | BLEX 50.19
References
Hengchen, Simon & Tahmasebi, Nina (2021).
A collection of Swedish diachronic word embedding models trained on historical newspaper data.
Journal of Open Humanities Data, 7(2), 1–7.
https://doi.org/10.5334/johd.22