---
base_model:
- LiquidAI/LFM2-350M-Extract
license: apache-2.0
language:
- en
tags:
- text-generation
- instruction-tuning
- structured-output
- toon
- lfm2
- unsloth
- lora
- transformers
datasets:
- yasserrmd/TOON-Unstructured-Structured
model-index:
- name: yasserrmd/LFM2-350M-Extract-TOON
  results:
  - task:
      name: TOON conversion (schema-driven extraction)
      type: text-generation
    dataset:
      name: yasserrmd/TOON-Unstructured-Structured
      type: text
    metrics:
    - name: Final Training Loss
      type: loss
      value: 0.2178
    - name: Lowest Loss
      type: loss
      value: 0.2043
    - name: Total Steps
      type: steps
      value: 430
---

# yasserrmd/LFM2-350M-Extract-TOON
<img src="banner.png" />

`yasserrmd/LFM2-350M-Extract-TOON` is a **fine-tuned variant of LiquidAI’s LFM2-350M-Extract**, built using the **Unsloth AI** framework and the dataset [`yasserrmd/TOON-Unstructured-Structured`](https://huggingface.co/datasets/yasserrmd/TOON-Unstructured-Structured).

This model specializes in **schema-driven conversion of natural-language text into valid TOON (Token-Oriented Object Notation)** format, a compact, token-efficient alternative to JSON designed for large language models.

---

## Model Overview

| Property | Description |
|-----------|-------------|
| **Base Model** | LiquidAI/LFM2-350M-Extract |
| **Architecture** | LFM2-350M (decoder-only hybrid of short-convolution and grouped-query attention blocks) |
| **Fine-tuning Method** | LoRA (via Unsloth AI) |
| **Objective** | Structured extraction in TOON format |
| **Dataset** | yasserrmd/TOON-Unstructured-Structured |
| **Languages** | English |
| **Frameworks** | Transformers, Unsloth, PyTorch |
| **License** | LFM License v1.0 |
| **Final Loss** | 0.2178 (Step 430) |

---

## What is TOON?

**TOON (Token-Oriented Object Notation)** is a serialization format optimized for LLMs.  
It represents structured data with minimal tokens using a **header + rows** pattern:

```
users[2]{id,name,role}:
  1,Alice,admin
  2,Bob,user
```

Compared to JSON, TOON reduces token count by up to 60% and is easier for LLMs to generate deterministically.
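
The header + rows pattern can be illustrated with a small encoder. This is a minimal sketch, not the official TOON implementation: the helper name `to_toon_table` is hypothetical, it assumes flat rows with identical keys, and it does no quoting or escaping. Character counts are used only as a rough stand-in for token counts.

```python
import json


def to_toon_table(label, rows):
    """Encode a list of flat dicts (same keys) as a TOON tabular array.

    Hypothetical helper for illustration; values containing commas or
    newlines are not quoted or escaped here.
    """
    if not rows:
        return label + "[0]:"
    fields = list(rows[0].keys())
    # Header: label, declared row count, then the shared field names.
    header = label + "[" + str(len(rows)) + "]{" + ",".join(fields) + "}:"
    lines = [header]
    for row in rows:
        # Each row carries values only; the keys live in the header.
        lines.append("  " + ",".join(str(row[f]) for f in fields))
    return "\n".join(lines)


users = [
    {"id": 1, "name": "Alice", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "user"},
]

toon_text = to_toon_table("users", users)
json_text = json.dumps({"users": users})

print(toon_text)
print(len(toon_text), "characters as TOON vs", len(json_text), "as JSON")
```

The saving comes from stating the keys once in the header instead of repeating them in every object, so it grows with the number of rows.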

---

## Training Summary

The model was trained for 430 steps with the following key trends:

- **Initial loss:** 1.3793  
- **Final loss:** 0.2178  
- **Lowest recorded loss:** 0.2043  
- **Steady convergence** after step 250 with consistent decline below 0.3.  
- **Training method:** Unsloth LoRA (rank 16, alpha 32, learning rate 2e-4, batch size 64).  
- **Hardware:** 1x NVIDIA T4 (15 GB VRAM).  
- **Duration:** 30 minutes.

Training was stable and the loss converged smoothly to below 0.25, indicating that the base model adapted well to the TOON structure.

---

## Usage Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import TextStreamer

model_id = "yasserrmd/LFM2-350M-Extract-TOON"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

schema = """
"$schema": "http://json-schema.org/draft-07/schema#"
type: object
properties:
id:
type: string
pattern: "^(\\d+\\.\\d+) disturbing$"
description: Dot-separated integers representing the unique ID of each element in the hierarchy
title:
type: string
description: Descriptive title of the section or element
level:
type: integer
minimum: 0
maximum: 9
description: "Hierarchical level (0 - ROOT, 1 - SECTION, 2 - SUBSECTION, 3+ - DETAIL_N)"
level_type:
type: string
enum[4]: ROOT,SECTION,SUBSECTION,DETAIL_N
description: Type of the hierarchical element
component:
type: array
items:
type: object
properties:
idc:
type: integer
description: Component ID
component_type:
type: string
enum[4]: PARAGRAPH,TABLE,CALCULATION,CHECKBOX
description: Type of component
metadata:
type: string
description: "Additional metadata (e.g., title, note, or overview)"
properties:
type: object
properties:
variables:
type: array
items:
type: object
properties:
idx:
type: string
description: Unique row-column identifier (X.Y format)
name:
type: string
description: Attribute name
value:
type: string
description: Attribute value
unit:
type[2]: string,"null"
description: Optional unit for the value
metrics:
type: boolean
description: Boolean flag indicating if the attribute is a metric
formula:
type: boolean
description: Boolean flag indicating if the attribute is a formula
content:
type: array
items:
type[2]: string,"null"
description: Text content
children:
type: array
items:
"$ref": #
required[6]: id,title,level,level_type,component,children
"""
text = """
SUBSECTION component[1]: - idc: 1 component_type: PARAGRAPH metadata: "<note>Note: Specific to debtor risk.</note>" properties: variables[0]: content[1]: The risk of debtors failing to make payments on time. - id: "2.2" title: Liquidity Risk level: 2 level_type: SUBSECTION component[1]: - idc: 1 component_type: PARAGRAPH metadata: "<note>Note: Specific to liquidity risk.</note>" properties: variables[0]: content[1]: Liquidity risk is related to the difficulty in selling assets quickly without a significant loss.

The document begins with an inclusive overview, elucidating the purpose of the report and its objective to assess risks and propose mitigations for financial operations, such as compliance, fraud detection, and performance metrics. The overall framework is meticulously divided into several sections and subsections reflecting detailed and structured analysis.

This report is intended to provide a comprehensive understanding of risk exposure within financial operations. We will now delve into the first section of the report, which covers a vast array of compliance regulations critical for maintaining financial accountability.

Firstly, let’s examine the **Compliance Section**. The section’s primary aim is to highlight the key compliance regulations applicable to financial operations. Notably, this includes the **Anti-Money Laundering (AML) Regulation (RC.1)** and the **Data Privacy Act (RC.2)**. Highlighting the significance of these regulations, the Subsection on Anti-Money Laundering identifies several gaps within the current system. These gaps need to be addressed to ensure robust compliance. The analysis suggests the presence of several risk points where the current practices might fall short of regulatory standards.

Next, we have a **Detailed Risk Analysis** for the Anti-Money Laundering Regulation. This component outlines the specific risks and potential impacts on financial operations. In the document, a table detailing the risk assessment is provided outlining two primary risks, **Fraudulent Transactions (RA.1)**, and **Non-Compliance with AML (RA.2)**, each with a brief description of the risk and its possible consequences. Addressing these risks requires a systematic approach, ensuring all preventive measures are in place to mitigate financial risks effectively.

Moreover, a **Checklist** is included to assess the current status concerning the Anti-Money Laundering Regulation. The Checklist requires the selection of the best option that describes the current status as either **Option 1 (true)** or **Option 2 (false)**. This selection is pivotal in making informed decisions about regulatory compliance and operational adjustments.

In parallel, the **Data Privacy Act** (RC.2) Subsection identifies several issues in handling personal data. These issues need to be corrected to fully comply with the Data Privacy Act. The **Fraud Detection Section** and its **Subsections on Misrepresentation and Theft of Data** follow a similar structure, detailing the critical risks associated with these vulnerabilities and emphasizing the necessity for mitigation strategies.

In the **Fraud Detection Section**, we have a table outlining two major cases of fraud: **Misrepresentation (FC.1)** and **Theft of Data (FC.2)**. These cases are significant due to their impact on financial integrity and operational continuity. The analysis of these cases includes detailed descriptions of the nature and extent of the fraud, highlighting the importance of robust fraud detection mechanisms.

Each regulatory and fraud-related section is equipped with thorough analysis and checks, ensuring that every risk is identified and addressed. While the sections provide detailed tables and checklists, they also reflect the broader context of financial operations and the mitigation strategies required to ensure compliance and prevent fraud.

By providing these detailed sections and sub-sections, the report aims to equip stakeholders with the necessary information to assess and improve the risk management framework. This ensures that all financial operations are conducted in a compliant, transparent, and secure manner, thereby safeguarding the interests of all stakeholders involved.

"""

# Adjacent string literals are concatenated, so each piece must end with a space.
system_instruction = (
    "You are an intelligent model specialized in converting natural language text "
    "into valid TOON (Token-Oriented Object Notation) format. "
    "Always follow the given schema strictly, emit the correct header "
    "in the form <label>[1]{fields}: followed by exactly one values row. "
    "Do not include explanations or additional commentary."
)

user_prompt = (
    f'Generate TOON format using the schema {schema} '
    f'for the below text "{text}".'
)


messages = [
    {"role": "system", "content": system_instruction},
    {"role": "user", "content": user_prompt}
]


inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
    tokenize = True,
    return_dict = True,
).to(model.device) # Follow the device chosen by device_map="auto" (GPU or CPU)


_ = model.generate(
    **inputs,
    max_new_tokens = 2046, # Increase for longer outputs!
    # Recommended Liquid settings!
    temperature = 0.3, min_p = 0.15, repetition_penalty = 1.05,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)
```

**Expected Output:**

```
id: "0.0"
title: Financial Risk Assessment Report
level: 0
description: Overview of financial risks and mitigation strategies.
level_type: ROOT
component[1]:
  - idc: 1
    component_type: PARAGRAPH
    metadata: <note>Note: Specific to debtor risk.</note>"
    properties:
      variables[0]:
      content[1]: The risk of debtors failing to make payments on time.
children[1]:
  - id: "1.0"
    title: Compliance Section
    level: 1
    level_type: SECTION
    component[1]:
      - idc: 1
        component_type: PARAGRAPH
        metadata: <note>Note: Specific to liquidity risk.</note>"
        properties:
          variables[0]:
          content[1]: The risk of liquidity risk is related to the difficulty in selling assets quickly without a significant loss.
    children[1]:
      - id: "1.1"
        title: Detailed Risk Analysis
        level: 2
        level_type: SUBSECTION
        component[1]:
          - idc: 1
            component_type: TABLE
            metadata: <note>Table of Risks</note>"
            properties:
              variables[2]{idx,name,value,unit,metrics}:
                "0.0",Risk Assessment,false,null,false
                "0.1",Risks,Fraudulent Transactions,null,false
              content[1]: Fraudulent Transactions (RA.1), Non-Compliance with AML,null,false
          - idc: 2
            component_type: CHECKBOX
            metadata: <note>Checklist for compliance</note>
            properties:
              variables[0]:
              content[1]: Option 1 (true),Option 2 (false)<|im_end|>
```
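
Generated output can be sanity-checked before it is passed downstream. The sketch below is an assumption-laden illustration rather than a full TOON parser: the function name `check_tabular_arrays` is hypothetical, it only inspects tabular arrays of the form `name[n]{fields}:`, and it treats the more deeply indented lines directly under such a header as that array's rows. It also strips the trailing `<|im_end|>` token that the streamer prints.

```python
import csv
import re

# Matches tabular headers such as: variables[2]{idx,name,value}:
HEADER = re.compile(r"^(\s*)([\w$-]+)\[(\d+)\]\{([^}]*)\}:\s*$")


def check_tabular_arrays(toon_text, end_token="<|im_end|>"):
    """Return a list of problems found in tabular arrays (empty list if none)."""
    lines = toon_text.replace(end_token, "").splitlines()
    problems = []
    i = 0
    while i < len(lines):
        match = HEADER.match(lines[i])
        i += 1
        if not match:
            continue
        indent = len(match.group(1))
        name = match.group(2)
        declared = int(match.group(3))
        fields = match.group(4).split(",")
        # Collect the rows: non-empty lines indented deeper than the header.
        rows = []
        while i < len(lines) and lines[i].strip():
            current_indent = len(lines[i]) - len(lines[i].lstrip())
            if current_indent <= indent:
                break
            # csv handles quoted values such as "0.0" or values with commas.
            rows.append(next(csv.reader([lines[i].strip()])))
            i += 1
        if len(rows) != declared:
            problems.append(f"{name}: declared {declared} rows, found {len(rows)}")
        for number, row in enumerate(rows, start=1):
            if len(row) != len(fields):
                problems.append(
                    f"{name} row {number}: expected {len(fields)} values, found {len(row)}"
                )
    return problems


sample = 'variables[2]{idx,name,value}:\n  "0.0",Risk,false\n  "0.1",Risks,Fraud<|im_end|>'
print(check_tabular_arrays(sample))
```

An empty list means every tabular array had the declared number of rows and the right number of values per row; it says nothing about whether the extracted values are correct.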

---

## 📈 Evaluation (Fine-tune Metrics)

| Metric              | Value                     |
| ------------------- | ------------------------- |
| Final Training Loss | **0.2178**                |
| Lowest Loss         | **0.2043**                |
| Total Steps         | **430**                   |
| Stability           | Excellent (no divergence) |

---

## Intended Use

* **Structured data extraction** from unstructured text.
* **Compact schema-based representations** for LLM pipelines.
* **Dataset generation** for downstream tasks (e.g., CSV, SQL, knowledge graph).
* Works best with short or medium-length text requiring structured outputs.
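
For the dataset-generation use case, a tabular TOON block maps directly onto CSV. The following is a minimal sketch under stated assumptions: the helper name `toon_table_to_csv` is hypothetical, and the input is assumed to be a single well-formed tabular array (one `name[n]{fields}:` header followed by its rows) with no nesting.

```python
import csv
import io


def toon_table_to_csv(toon_block):
    """Convert one tabular TOON array into CSV text (header row + data rows)."""
    lines = [ln for ln in toon_block.strip().splitlines() if ln.strip()]
    header = lines[0].strip()
    # Field names sit between the braces of the header line.
    fields = header[header.index("{") + 1 : header.index("}")].split(",")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(fields)
    for line in lines[1:]:
        # Parse each row with csv so quoted values survive the round trip.
        writer.writerow(next(csv.reader([line.strip()])))
    return out.getvalue()


example = "users[2]{id,name,role}:\n  1,Alice,admin\n  2,Bob,user"
csv_text = toon_table_to_csv(example)
print(csv_text)
```

The same field list could equally drive a SQL `INSERT` or a DataFrame constructor; CSV is used here only because it needs nothing beyond the standard library.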

---

## Limitations

* Schema must be explicit; generic prompts reduce accuracy.
* English-only alignment (no multilingual fine-tuning yet).

---

## Future Work

* Fine-tune on multi-row (`[n]`) TOON conversions.
* Expand coverage to other domains (e.g., medical, legal, environmental).
* Evaluate zero-shot generalization on unseen schemas.
* Explore quantized (GGUF) release for CPU/edge inference.

---

## Citation

```bibtex
@misc{yasserrmd2025lfm2toon,
  title        = {LFM2-350M-Extract-TOON: Schema-driven TOON Output Model},
  author       = {Mohamed Yasser},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/yasserrmd/LFM2-350M-Extract-TOON}}
}
```

---

## Acknowledgements

* **Base model:** LiquidAI team for LFM2-350M-Extract
* **Fine-tuning framework:** Unsloth AI
* **Dataset:** yasserrmd/TOON-Unstructured-Structured
* **Concept:** Token-Oriented Object Notation (TOON)

---

## Version History

| Version | Date       | Changes                                  |
| ------- | ---------- | ---------------------------------------- |
| v1.0    | 2025-11-11 | Initial release (Unsloth LoRA fine-tune) |
| v1.1    | TBD        | Planned quantized GGUF release           |

---

**Model performance summary:**
The training loss fell from **1.38 → 0.22** over 430 steps, roughly a 6× reduction.
The model is tuned to produce schema-following TOON output under the specified system instruction, which makes it a candidate for structured extraction in lightweight and edge deployments.

---