---
license: apache-2.0
language:
- grc
datasets:
- Ericu950/Papyri_1
base_model:
- meta-llama/Meta-Llama-3.1-8B-Instruct
library_name: transformers
tags:
- papyrology
- textual criticism
- philology
- Ancient Greek
- mergekit
- merge

---
# Papy_2_Llama-3.1-8B-Instruct_text

This is a fine-tuned version of Llama-3.1-8B-Instruct, specialized in reconstructing spans of 1–20 missing characters in ancient Greek documentary papyri. On spans of 1–10 missing characters, it achieved a Character Error Rate (CER) of 14.9%, a top-1 accuracy of 73.5%, and a top-20 accuracy of 85.9% on a test set of 7,811 papyrus editions. It replaces Papy_1_Llama-3.1-8B-Instruct_text.
See https://arxiv.org/abs/2409.13870.
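
To make the reported metrics concrete, here is a minimal sketch of how CER and top-k accuracy are commonly computed. It assumes CER is the Levenshtein edit distance divided by the length of the true text, and that a gap counts as a top-k hit when the true text appears among the first k suggestions. The function names are hypothetical, and the paper's exact evaluation procedure may differ in its details.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(predictions, references):
    """Total edit distance divided by total reference length (assumed definition)."""
    total_edits = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


def top_k_accuracy(suggestion_lists, references, k):
    """Share of gaps whose true text is among the first k suggestions."""
    hits = sum(ref in suggestions[:k]
               for suggestions, ref in zip(suggestion_lists, references))
    return hits / len(references)


# One substituted character out of eight gives a CER of 0.125.
print(character_error_rate(["σιου τασ"], ["σιον τασ"]))
```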

## Usage

To run the model on a GPU with large memory capacity, follow these steps:


### 1. Download and load the model 

```python
import json
from transformers import pipeline, AutoTokenizer, LlamaForCausalLM
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
import torch
import warnings
warnings.filterwarnings("ignore", message=".*copying from a non-meta parameter in the checkpoint*")
model_id = "Ericu950/Papy_2_Llama-3.1-8B-Instruct_text"

with init_empty_weights():
    model = LlamaForCausalLM.from_pretrained(model_id)

model = load_checkpoint_and_dispatch(
    model,
    model_id,
    device_map="auto",
    offload_folder="offload",
    offload_state_dict=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

generation_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
)
```

### 2. Run inference on a papyrus fragment of your choice
```python
papyrus_edition = """
ετουσ τεταρτου αυτοκρατοροσ καισαροσ ουεσπασιανου σεβαστου ------------------ 
ομολογει παυσιριων απολλωνιου του παuσιριωνοσ μητροσ ---------------τωι γεγονοτι αυτωι 
εκ τησ γενομενησ και μετηλλαχυιασ αυτου γυναικοσ ------------------------- 
απο τησ αυτησ πολεωσ εν αγυιαι συγχωρειν ειναι ---------------------------------- 
--------------------σ αυτωι εξ ησ συνεστιν ------------------------------------ 
----τησ αυτησ γενεασ την υπαρχουσαν αυτωι οικιαν ------------ 
------------------ ---------καὶ αιθριον και αυλη απερ ο υιοσ διοκοροσ -------------------------- 
--------εγραψεν του δ αυτου διοσκορου ειναι ------------------------------------ 
---------- και προ κατενγεγυηται τα δικαια -------------------------------------- 
νησ κατα τουσ τησ χωρασ νομουσ· εαν δε μη --------------------------------------- 
υπ αυτου τηι του διοσκορου σημαινομενηι -----------------------------------ενοικισμωι του 
ημισουσ μερουσ τησ προκειμενησ οικιασ --------------------------------- διοσκοροσ την τουτων αποχην 
---------------------------------------------μηδ υπεναντιον τουτοισ επιτελειν μηδε 
------------------------------------------------ ανασκευηι κατ αυτησ τιθεσθαι ομολογιαν μηδε 
----------------------------------- επιτελεσαι η χωρισ του κυρια ειναι τα διομολογημενα 
παραβαινειν, εκτεινειν δε τον παραβησομενον τωι υιωι διοσκορωι η τοισ παρ αυτου καθ εκαστην 
εφοδον το τε βλαβοσ και επιτιμον αργυριου δραχμασ 0 και εισ το δημο[7 missing letters] ισασ και μηθεν 
ησσον· δ -----ιων ομολογιαν συνεχωρησεν·
"""
system_prompt = "Fill in the missing letters in this papyrus fragment!"
input_messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": papyrus_edition},
]
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
outputs = generation_pipeline(
    input_messages,
    max_new_tokens=10,
    num_beams=30, # Set this as high as your memory will allow!
    num_return_sequences=10,
    early_stopping=True,
)
beam_contents = []
for output in outputs:
    generated_text = output.get('generated_text', [])
    for item in generated_text:
        if item.get('role') == 'assistant':
            beam_contents.append(item.get('content'))
real_response = "σιον τασ"
print(f"The masked sequence: {real_response}")
for i, content in enumerate(beam_contents, start=1):
    print(f"Suggestion {i}: {content}")
```
### Expected Output:
```
The masked sequence: σιον τασ
Suggestion 1: σιον τασ
Suggestion 2: σιν τασ ι
Suggestion 3: σ τασ ισα
Suggestion 4: σιου τασ
Suggestion 5: συ τασ ισ
Suggestion 6: ιον τασ ι
Suggestion 7: ν τασ ισα
Suggestion 8: σ ισασ κα
Suggestion 9: σασ τασ ι
Suggestion 10: σιωι τασ
```
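
To inspect a suggestion in context, it can be substituted back into the edition. The following is a minimal sketch, assuming the gap is always written in the `[N missing letters]` form shown in the example above; `fill_gap` and `letter_count` are hypothetical helper names and not part of the model or of `transformers`.

```python
import re

# Placeholder format taken from the example above, e.g. "[7 missing letters]".
GAP_PATTERN = re.compile(r"\[(\d+) missing letters\]")


def fill_gap(edition: str, suggestion: str) -> str:
    """Replace the first gap placeholder in `edition` with `suggestion`."""
    # A function is used as the replacement so the suggestion is inserted literally.
    return GAP_PATTERN.sub(lambda match: suggestion, edition, count=1)


def letter_count(text: str) -> int:
    """Count the characters in `text` that are not whitespace."""
    return sum(not ch.isspace() for ch in text)


line = "και εισ το δημο[7 missing letters] ισασ και μηθεν"
print(fill_gap(line, "σιον τασ"))  # και εισ το δημοσιον τασ ισασ και μηθεν
print(letter_count("σιον τασ"))    # 7
```

In this example the masked sequence has seven letters once the space is left out, which matches the number in the placeholder. Whether spaces are excluded from the count in general is an inference from this single example.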
## Usage on free tier in Google Colab

If you don’t have access to a larger GPU but want to try the model out, you can run it in a quantized format in Google Colab. **The quality of the responses will deteriorate significantly!** Follow these steps:

### Step 1: Connect to free GPU
1. Click the dropdown arrow next to Connect near the top right of the notebook.
2. Select Change runtime type.
3. In the modal window, select T4 GPU as your hardware accelerator.
4. Click Save.
5. Click the Connect button to connect to your runtime. After some time, the button will present a green checkmark, along with RAM and disk usage graphs. This indicates that a server has successfully been created with your required hardware.


### Step 2: Install Dependencies

```python
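!pip install -U bitsandbytes
# Exit the kernel process; Colab then restarts the runtime so that
# the newly installed package is picked up.
import os
os._exit(0)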
```

### Step 3: Download and quantize the model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
import torch
quant_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "Ericu950/Papy_2_Llama-3.1-8B-Instruct_text",
    device_map="auto",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained("Ericu950/Papy_2_Llama-3.1-8B-Instruct_text")
generation_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
)

```
### Step 4: Run inference on a papyrus fragment of your choice
```python
papyrus_edition = """
ετουσ τεταρτου αυτοκρατοροσ καισαροσ ουεσπασιανου σεβαστου ------------------ 
ομολογει παυσιριων απολλωνιου του παuσιριωνοσ μητροσ ---------------τωι γεγονοτι αυτωι 
εκ τησ γενομενησ και μετηλλαχυιασ αυτου γυναικοσ ------------------------- 
απο τησ αυτησ πολεωσ εν αγυιαι συγχωρειν ειναι ---------------------------------- 
--------------------σ αυτωι εξ ησ συνεστιν ------------------------------------ 
----τησ αυτησ γενεασ την υπαρχουσαν αυτωι οικιαν ------------ 
------------------ ---------καὶ αιθριον και αυλη απερ ο υιοσ διοκοροσ -------------------------- 
--------εγραψεν του δ αυτου διοσκορου ειναι ------------------------------------ 
---------- και προ κατενγεγυηται τα δικαια -------------------------------------- 
νησ κατα τουσ τησ χωρασ νομουσ· εαν δε μη --------------------------------------- 
υπ αυτου τηι του διοσκορου σημαινομενηι -----------------------------------ενοικισμωι του 
ημισουσ μερουσ τησ προκειμενησ οικιασ --------------------------------- διοσκοροσ την τουτων αποχην 
---------------------------------------------μηδ υπεναντιον τουτοισ επιτελειν μηδε 
------------------------------------------------ ανασκευηι κατ αυτησ τιθεσθαι ομολογιαν μηδε 
----------------------------------- επιτελεσαι η χωρισ του κυρια ειναι τα διομολογημενα 
παραβαινειν, εκτεινειν δε τον παραβησομενον τωι υιωι διοσκορωι η τοισ παρ αυτου καθ εκαστην 
εφοδον το τε βλαβοσ και επιτιμον αργυριου δραχμασ 0 και εισ το δημο[7 missing letters] ισασ και μηθεν 
ησσον· δ -----ιων ομολογιαν συνεχωρησεν·
"""
system_prompt = "Fill in the missing letters in this papyrus fragment!"
input_messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": papyrus_edition},
]
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
outputs = generation_pipeline(
    input_messages,
    max_new_tokens=10,
    num_beams=30, # Set this as high as your memory will allow!
    num_return_sequences=10,
    early_stopping=True,
)
beam_contents = []
for output in outputs:
    generated_text = output.get('generated_text', [])
    for item in generated_text:
        if item.get('role') == 'assistant':
            beam_contents.append(item.get('content'))
real_response = "σιον τασ"
print(f"The masked characters: {real_response}")
for i, content in enumerate(beam_contents, start=1):
    print(f"Suggestion {i}: {content}")
```
### Expected Output:
```
The masked characters: σιον τασ
Suggestion 1: σιον τα 00·
Suggestion 2: σιον αυτωι·
Suggestion 3: σιον 00 00
Suggestion 4: σιον και 0·
Suggestion 5: σιον τα 00··
Suggestion 6: σιον τασ 0
Suggestion 7: σιον τα 000·
Suggestion 8: σιον τα 0ο
Suggestion 9: σιον τασασ·
Suggestion 10: σιον τα 00
```
Observe that performance declines under 4-bit quantization! If we change
```python  
   load_in_4bit=True,
   bnb_4bit_compute_dtype=torch.bfloat16
```
in the second cell to
```python  
   load_in_8bit=True,
```

we get
```
The masked characters: σιον τασ
Suggestion 1: σιον τασ
Suggestion 2: σιν τασ ι
Suggestion 3: σ τασ ισα
Suggestion 4: σιου τασ
Suggestion 5: σ ισασ κα
Suggestion 6: συ τασ ισ
Suggestion 7: σασ τασ ι
Suggestion 8: ν τασ ισα
Suggestion 9: ιον τασ ι
Suggestion 10: σισ τασ ι
```
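
The choice between the two quantization settings is partly a question of memory. The sketch below is a rough back-of-the-envelope estimate of the space taken by the weights alone. It assumes roughly 8 billion parameters, as the model name suggests, and ignores activations, the beam-search cache, and quantization metadata, so real usage will be higher. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 8e9  # assumption: roughly 8 billion parameters

for label, bits in [("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the bfloat16 weights come to roughly 15 GiB, which leaves little or no headroom on a T4 with about 16 GB of memory, while the 8-bit and 4-bit variants need about half and a quarter of that respectively.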
## Information about configuration for merging

The fine-tuned model was remerged with Llama-3.1-8B-Instruct using the [TIES](https://arxiv.org/abs/2306.01708) merge method. This did not affect CER or top-1 accuracy, but it had a positive effect on top-20 accuracy. The following YAML configuration was used:

```yaml
models:
  - model: original # Llama 3.1
  - model: DDbDP_reconstructer_5 # A model fine-tuned on 95% of the DDbDP for 11 epochs
    parameters:
      density: 1.1
      weight: 0.5
merge_method: ties
base_model: original # Llama 3.1
parameters:
  normalize: true
dtype: bfloat16


```
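
For readers unfamiliar with TIES, the following toy sketch illustrates its three steps on plain Python lists: trimming each task vector to its largest-magnitude entries, electing a sign per parameter, and averaging only the entries that agree with the elected sign. It is an illustration of the idea in the TIES paper and not mergekit's implementation. The way `weights` and normalization are handled here is an assumption, and all function names are hypothetical.

```python
def trim(task_vector, density):
    """Keep the `density` fraction of entries with the largest magnitude; zero the rest."""
    k = int(round(density * len(task_vector)))
    if k <= 0:
        return [0.0] * len(task_vector)
    order = sorted(range(len(task_vector)),
                   key=lambda i: abs(task_vector[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(task_vector)]


def ties_merge(base, finetuned_models, weights, density):
    """Toy TIES merge of flat parameter lists (assumed weighting and normalization)."""
    # Step 1: task vectors are the differences from the base model, then trimmed.
    task_vectors = [[f - b for f, b in zip(model, base)] for model in finetuned_models]
    trimmed = [trim(tv, density) for tv in task_vectors]

    merged = []
    for j, b in enumerate(base):
        # Weighted contributions of every model that kept this parameter.
        contributions = [(w * tv[j], w) for tv, w in zip(trimmed, weights) if tv[j] != 0.0]
        # Step 2: elect the sign with the larger total weighted magnitude.
        total = sum(c for c, _ in contributions)
        sign = 1.0 if total >= 0 else -1.0
        # Step 3: average only the contributions that agree with the elected sign.
        agreeing = [(c, w) for c, w in contributions if c * sign > 0]
        weight_sum = sum(w for _, w in agreeing)
        delta = sum(c for c, _ in agreeing) / weight_sum if weight_sum else 0.0
        merged.append(b + delta)
    return merged


base = [0.0, 0.0, 0.0, 0.0]
model_a = [4.0, -2.0, 0.1, 3.0]
model_b = [-3.0, -1.0, 2.0, 0.2]
print(ties_merge(base, [model_a, model_b], weights=[1.0, 1.0], density=0.5))
# [4.0, 0.0, 2.0, 3.0]
```

In the first position the two models disagree in sign. The positive update wins the election, so the negative one is dropped and not averaged in.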