---
language:
- da
- 'no'
license: cc-by-4.0
datasets:
- MiMe-MeMo/Corpus-v1.1
- MiMe-MeMo/Sentiment-v1
- MiMe-MeMo/WSD-Skaebne
metrics:
- f1
tags:
- historical-texts
- digital-humanities
- sentiment-analysis
- word-sense-disambiguation
- danish
- norwegian
model-index:
- name: MeMo-BERT-01
  results:
  - task:
      type: text-classification
      name: Sentiment Analysis
    dataset:
      name: MiMe-MeMo/Sentiment-v1
      type: text
    metrics:
    - name: f1
      type: f1
      value: 0.56
  - task:
      type: text-classification
      name: Word Sense Disambiguation
    dataset:
      name: MiMe-MeMo/WSD-Skaebne
      type: text
    metrics:
    - name: f1
      type: f1
      value: 0.43
---

# MeMo-BERT-01

**MeMo-BERT-01** is a pre-trained language model for **historical Danish and Norwegian literary texts** (1870–1900).
It was introduced in [Al-Laith et al. (2024)](https://aclanthology.org/2024.lrec-main.431/) as one of the first pre-trained language models (PLMs) dedicated to historical Danish and Norwegian.

## Model Description

- **Architecture:** BERT-base (12 layers, hidden size 768, 12 attention heads, vocab size 30k)
- **Pre-training strategy:** Trained **from scratch** on the MeMo corpus (no prior pre-training)
- **Training objective:** Masked Language Modeling (MLM, 15% masking)
- **Training data:** MeMo Corpus v1.1 (839 novels, ~53M words, 1870–1900)
- **Hardware:** 2 × A100 GPUs
- **Training time:** ~44 hours

This model represents the **baseline historical-domain model**, trained entirely on 19th-century Scandinavian novels.

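Because MeMo-BERT-01 was trained with an MLM objective, it can be used directly for fill-mask inference with Hugging Face Transformers. The sketch below assumes the checkpoint is hosted on the Hub as `MiMe-MeMo/MeMo-BERT-01`; adjust the identifier if your copy lives under a different namespace.

```python
# Minimal fill-mask sketch. The Hub id "MiMe-MeMo/MeMo-BERT-01" is an
# assumption; point it at wherever the checkpoint is actually hosted.
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "MiMe-MeMo/MeMo-BERT-01"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Use the tokenizer's own mask token in case it differs from "[MASK]".
sentence = f"Hun troede på sin {tokenizer.mask_token}."
for prediction in fill_mask(sentence):
    print(prediction["token_str"], round(prediction["score"], 3))
```
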
## Intended Use

- **Primary tasks:**
  - Sentiment Analysis (positive, neutral, negative)
  - Word Sense Disambiguation (historical vs. modern senses of *skæbne*, "fate")

- **Intended users:**
  - Researchers in Digital Humanities, Computational Linguistics, and Scandinavian Studies.
  - Historians of literature studying 19th-century Scandinavian novels.

- **Not intended for:**
  - Contemporary Danish/Norwegian NLP tasks.
  - High-stakes applications (e.g., legal, medical, political decision-making).

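For the downstream tasks above, MeMo-BERT-01 is meant to be fine-tuned as an ordinary sequence classifier. The following is a minimal, non-authoritative sketch for the sentiment task: the Hub ids, the `text`/`label` column names, the train/test split names, and the macro-averaged F1 are all assumptions, so check the dataset card and the paper before relying on them.

```python
# Sketch: fine-tuning MeMo-BERT-01 for 3-class sentiment classification.
# Column names ("text", "label"), split names, and macro F1 averaging are
# assumptions; adapt them to the actual layout of MiMe-MeMo/Sentiment-v1.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

model_id = "MiMe-MeMo/MeMo-BERT-01"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

dataset = load_dataset("MiMe-MeMo/Sentiment-v1")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="macro")}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="memo-bert-01-sentiment",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```
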
## Training Data

- **Corpus:** [MeMo Corpus v1.1](https://huggingface.co/datasets/MiMe-MeMo/Corpus-v1.1) (Bjerring-Hansen et al. 2022)
- **Time period:** 1870–1900
- **Size:** 839 novels, 690 MB, 3.2M sentences, 52.7M words
- **Preprocessing:** OCR-corrected, normalized to modern Danish spelling, tokenized, lemmatized, annotated

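The corpus is published as a Hugging Face dataset and can be inspected with the `datasets` library. The split and record structure below are assumptions; the dataset card for MiMe-MeMo/Corpus-v1.1 documents the actual schema.

```python
# Quick look at the pre-training corpus. Split and column names are not
# hard-coded because the exact schema is documented on the dataset card.
from datasets import load_dataset

corpus = load_dataset("MiMe-MeMo/Corpus-v1.1")
print(corpus)              # available splits and their sizes
first_split = next(iter(corpus.values()))
print(first_split[0])      # first record (fields depend on the dataset card)
```
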
## Evaluation

### Benchmarks

| Task | Dataset | Test F1 | Notes |
|------|---------|---------|-------|
| Sentiment Analysis | MiMe-MeMo/Sentiment-v1 | **0.56** | 3-class (pos/neg/neu) |
| Word Sense Disambiguation | MiMe-MeMo/WSD-Skaebne | **0.43** | 4-class (pre-modern, modern, figure of speech, ambiguous) |

### Comparison

MeMo-BERT-01 performs **worse than MeMo-BERT-03** (continued pre-training), highlighting the limitations of training from scratch on historical data without leveraging contemporary PLMs.

## Limitations

- Trained from scratch on only ~53M words (relatively small for BERT pre-training).
- Underperforms compared to continued pre-training (MeMo-BERT-03).
- Domain-specific to late 19th-century novels.
- OCR and normalization errors may remain in the training corpus.

## Ethical Considerations

- All texts are in the **public domain** (authors deceased).
- Datasets are released under **CC BY 4.0**.
- No sensitive personal data is involved.

## Citation

If you use this model, please cite:

```bibtex
@inproceedings{al-laith-etal-2024-development,
    title = "Development and Evaluation of Pre-trained Language Models for Historical {D}anish and {N}orwegian Literary Texts",
    author = "Al-Laith, Ali and Conroy, Alexander and Bjerring-Hansen, Jens and Hershcovich, Daniel",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    pages = "4811--4819",
    url = "https://aclanthology.org/2024.lrec-main.431/"
}
```