Retraining Stanza to optimize depparse on a diachronic Swedish corpus

This repository contains code forked from the official Stanza GitHub repository, plus scripts that prepare data for and train models on different combinations of treebanks relevant to historical Swedish.

Guide

All models share the same dev and test sets: a 10/90 split of our human-validated gold sentences (https://github.com/alanev52/Diachronic_Treebanks_DigPhil/tree/main/parsed_data/validated).
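A minimal sketch of such a 10/90 split over CoNLL-U sentence blocks. The function names, the shuffling, and the seed are assumptions made for illustration; the repository's own preparation script may select sentences differently.

```python
import random
import re


def read_conllu_sentences(text):
    """Split CoNLL-U text into sentence blocks (blocks are separated by blank lines)."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [block.strip() for block in blocks if block.strip()]


def split_dev_test(sentences, dev_fraction=0.1, seed=0):
    """Return (dev, test): dev gets dev_fraction of the sentences, test gets the rest.

    Shuffling with a fixed seed is an assumption, used here so the split is reproducible.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_dev = round(len(shuffled) * dev_fraction)
    return shuffled[:n_dev], shuffled[n_dev:]
```

With 20 gold sentences this gives 2 dev sentences and 18 test sentences, and no sentence appears in both sets.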

Example workflow: training a model with the pretrained kubhist2 incremental vectors (up to 1880) and training data from the Swedish and Bokmål treebanks, as well as our own machine-parsed diachronic corpus:

python prepare-train-val-test.py sv diachron bm

source scripts/config_alvis.sh

python -m stanza.utils.datasets.prepare_depparse_treebank UD_Swedish-diachronic --wordvec_pretrain_file /cephyr/users/cleland/Alvis/stanza_resources/sv/pretrain/diachronic.pt

python -m stanza.utils.training.run_depparse UD_Swedish-diachronic --wordvec_pretrain_file /cephyr/users/cleland/Alvis/stanza_resources/sv/pretrain/diachronic.pt --batch_size 32 --dropout 0.33

All of the above can be done with a single command by using:

./make_new_model.sh {vectors} {language codes}

which for the example becomes:

./make_new_model.sh diachronic.pt sv diachron bm
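The contents of make_new_model.sh are not reproduced here, so the following is only a sketch of how the two arguments could map onto the four manual steps above. It assumes that {vectors} is a file name inside the pretrain directory used in the example, and `build_commands` is a hypothetical helper, not part of the repository.

```python
import shlex

# Directory and treebank name taken from the example commands above.
PRETRAIN_DIR = "/cephyr/users/cleland/Alvis/stanza_resources/sv/pretrain"
TREEBANK = "UD_Swedish-diachronic"


def build_commands(vectors, language_codes):
    """Return the four shell command lines of the manual workflow as strings."""
    # Assumption: the vectors argument is a file name inside PRETRAIN_DIR.
    pretrain = f"{PRETRAIN_DIR}/{vectors}"
    return [
        shlex.join(["python", "prepare-train-val-test.py", *language_codes]),
        "source scripts/config_alvis.sh",
        shlex.join([
            "python", "-m", "stanza.utils.datasets.prepare_depparse_treebank",
            TREEBANK, "--wordvec_pretrain_file", pretrain,
        ]),
        shlex.join([
            "python", "-m", "stanza.utils.training.run_depparse",
            TREEBANK, "--wordvec_pretrain_file", pretrain,
            "--batch_size", "32", "--dropout", "0.33",
        ]),
    ]
```

Calling `build_commands("diachronic.pt", ["sv", "diachron", "bm"])` yields the same command lines as the step-by-step example.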

Pretrained vectors

We use the incremental vectors up to 1880 from Hengchen & Tahmasebi (2021).

I first converted the kubhist2 vectors from gensim's fastText .ft format to a plain text file using gensim's Python package, then used Stanza's converter to produce a .pt file:

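from stanza.models.common.pretrain import Pretrain
# First argument: the .pt file to create; second argument: the plain-text vectors
pt = Pretrain("foo.pt", "new_vectors.txt")
# Reads the text vectors and, if foo.pt does not exist yet, saves them to it
pt.load()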
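The export to a plain text file is not shown above. As a sketch of what that file could look like, the helper below writes the common word2vec text layout: a header line with the vocabulary size and dimension, then one word per line followed by its values. With gensim, `KeyedVectors.save_word2vec_format(path, binary=False)` writes this layout. The helper name and the toy vectors in the usage note are hypothetical, and whether the original export used exactly this call is an assumption.

```python
def write_word2vec_text(vectors, path):
    """Write a {word: sequence of floats} mapping in word2vec text format."""
    words = list(vectors)
    dim = len(vectors[words[0]]) if words else 0
    with open(path, "w", encoding="utf-8") as f:
        # Header line: number of words and vector dimension
        f.write(f"{len(words)} {dim}\n")
        for word in words:
            values = " ".join(f"{v:.6f}" for v in vectors[word])
            f.write(f"{word} {values}\n")
```

For example, `write_word2vec_text({"kung": [0.1, 0.2, 0.3]}, "new_vectors.txt")` writes a two-line file that starts with the header `1 3`.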

The result is available in compressed form as diachronic.pt.xz.
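The .xz archive has to be decompressed before Stanza can read it. A minimal sketch using only the Python standard library follows; `decompress_xz` is a hypothetical helper name, and the paths are whatever you choose locally.

```python
import lzma
import shutil


def decompress_xz(src, dst):
    """Decompress an .xz file at src into dst, streaming to keep memory use low."""
    with lzma.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

For example, `decompress_xz("diachronic.pt.xz", "diachronic.pt")` produces the file passed to `--wordvec_pretrain_file`.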

Results

Training data: DigPhil, UD-sv, bm, dk

UAS    LAS    CLAS   MLAS   BLEX
65.45  55.84  50.19  46.49  50.19
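For readers unfamiliar with the first two metrics: UAS is the percentage of tokens whose predicted head is correct, and LAS additionally requires the correct dependency label. The sketch below illustrates this on a single tokenised sentence. It is not the evaluation script used for the numbers above, it compares full labels (official scripts may normalise labels differently), and it does not cover CLAS, MLAS, or BLEX.

```python
def attachment_scores(gold, pred):
    """Compute (UAS, LAS) in percent.

    gold and pred are equal-length lists of (head_index, deprel) pairs,
    one pair per token, in the same token order.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    # UAS: only the head has to match
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: both head and label have to match
    las_hits = sum(g == p for g, p in zip(gold, pred))
    total = len(gold)
    return 100.0 * uas_hits / total, 100.0 * las_hits / total
```

With four tokens where one has a wrong label and another a wrong head, the function returns a UAS of 75.0 and a LAS of 50.0.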

References

Hengchen, Simon & Tahmasebi, Nina. (2021).
A collection of Swedish diachronic word embedding models trained on historical newspaper data.
Journal of Open Humanities Data, 7(2), 1–7.
https://doi.org/10.5334/johd.22
