---
license: apache-2.0
datasets:
- abisee/cnn_dailymail
pipeline_tag: text-generation
library_name: transformers
language:
- en
metrics:
- perplexity
tags:
- Pytorch
- seq2seq
- encoder
- decoder
- t5-style
- tiny
model-index:
  - name: Haipai-7M
    results:
      - task:
          type: text-generation
        dataset:
          name: abisee/cnn_dailymail
          type: abisee/cnn_dailymail
        metrics:
          - name: Perplexity
            type: perplexity
            value: 16
---

# Introduction

We introduce Haipai-7M, a very small model trained on the CNN/DailyMail dataset. We are open-sourcing both the training code and the model checkpoints.

Haipai-7M is a small-footprint seq2seq Transformer that completes sentences.

## Highlights
- ~7.3M trainable parameters (4 encoder + 4 decoder layers, 288 hidden size, 6 heads).
- Factorized embeddings with linear projections back to the model dimension.
- Trained on 80k oracle/reference pairs with heavy denoising corruption (span masking, drops, sentence shuffles); a rough sketch of the corruption follows this list.
- Training data drawn entirely from news articles.
- 32k shared subword vocabulary (trained with `tokenizers`).
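
The exact corruption pipeline lives in the training code; as a rough illustration only, a hypothetical version of the three corruptions named above could look like the sketch below (the mask symbol, span length, and probabilities are assumptions, not the repo's actual settings):

```python
import random

MASK_TOKEN = "<mask>"  # assumed sentinel; the real tokenizer may use a different symbol

def corrupt(sentences, span_len=3, drop_p=0.1, mask_p=0.15, shuffle_p=0.5):
    """Apply sentence shuffling, span masking, and token drops to a list of sentences."""
    # 1. Sentence shuffle: occasionally permute sentence order.
    if random.random() < shuffle_p:
        sentences = random.sample(sentences, len(sentences))

    corrupted = []
    for sent in sentences:
        tokens = sent.split()
        out, i = [], 0
        while i < len(tokens):
            r = random.random()
            if r < mask_p:
                # 2. Span masking: replace a short span with a single mask token.
                out.append(MASK_TOKEN)
                i += span_len
            elif r < mask_p + drop_p:
                # 3. Drop: skip this token entirely.
                i += 1
            else:
                out.append(tokens[i])
                i += 1
        corrupted.append(" ".join(out))
    return corrupted
```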

## Files
- `stage1_final.pt`: checkpoint with model weights and config.
- `stage1_best.pt`: checkpoint of the best model.
- `stage1_tokenizer.json`: 32k BPE tokenizer shared across stages.
- `stage1_infer.py`: CLI for greedy reconstructions.

## Architecture
```text
Shared Subword Embedding (32k) →
  Encoder (4 layers) →
  Decoder (4 layers) →
  Factorized output projection → vocab logits
```
- Encoder/decoder blocks are standard Transformer layers (multi-head attention + FFN + dropout).
- Embeddings start in a smaller space (`embed_dim`) and are linearly projected to the model dimension.
- Output logits reuse the embedding matrix (weight tying).
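
A minimal PyTorch sketch of the factorized embedding and tied output projection described above; the 288 model dimension and 32k vocabulary come from the highlights, while `embed_dim=128` is an illustrative assumption:

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Token embedding in a small space, linearly projected up to the model dimension."""
    def __init__(self, vocab_size=32_000, embed_dim=128, d_model=288):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)      # (vocab, embed_dim)
        self.up = nn.Linear(embed_dim, d_model, bias=False)   # project up to d_model

    def forward(self, token_ids):
        return self.up(self.embed(token_ids))                 # (batch, seq, d_model)

    def logits(self, hidden):
        # Weight tying: project hidden states back down, then reuse the embedding
        # matrix as the output projection to the 32k vocabulary.
        down = hidden @ self.up.weight                        # (batch, seq, embed_dim)
        return down @ self.embed.weight.t()                   # (batch, seq, vocab)
```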

## Quick Inference

Stage 1 reconstructs text; feed it any summary to see what the autoencoder learned:

```bash
python -m src.stage1_infer \
  --run-dir models \
  --tokenizer-path models/stage1_tokenizer.json \
  --config-checkpoint models/stage1_final.pt \
  --input-text "US officials pledged immediate aid after the storm devastated the coastline."
```

You can also batch inputs from a file (`--input-file path/to/texts.txt`) or point to specific checkpoints with `--checkpoints`.
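
If you prefer to work in Python rather than through the CLI, the tokenizer and checkpoint can be loaded directly. The snippet below is only a sketch: `Tokenizer.from_file` and `torch.load` are standard APIs, but the checkpoint layout is an assumption, so check `stage1_infer.py` for the authoritative loading code.

```python
import torch
from tokenizers import Tokenizer

# Load the shared 32k BPE tokenizer and the final checkpoint.
tokenizer = Tokenizer.from_file("models/stage1_tokenizer.json")
checkpoint = torch.load("models/stage1_final.pt", map_location="cpu")

# Inspect what the checkpoint actually contains before building the model;
# the exact keys (e.g. weights vs. config) depend on how training saved it.
print(checkpoint.keys())

# Tokenize an input the same way the CLI does.
ids = tokenizer.encode("US officials pledged immediate aid after the storm.").ids
print(ids[:10])
```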

## Expected Output
- Greedy decode tends to copy the input with light denoising (that’s the objective).
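
For reference, greedy decoding here is just an argmax loop over the decoder. The sketch below is schematic and assumes a hypothetical `model(src_ids, tgt_ids)` forward that returns per-position vocabulary logits, with BOS/EOS ids taken from the tokenizer; the real loop is in `stage1_infer.py`.

```python
import torch

@torch.no_grad()
def greedy_decode(model, src_ids, bos_id, eos_id, max_len=128):
    """Repeatedly feed the growing prefix back in and append the argmax token."""
    tgt = [bos_id]
    for _ in range(max_len):
        logits = model(torch.tensor([src_ids]), torch.tensor([tgt]))  # (1, len(tgt), vocab)
        next_id = int(logits[0, -1].argmax())
        if next_id == eos_id:
            break
        tgt.append(next_id)
    return tgt[1:]  # drop BOS
```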

## Requirements
```bash
pip install torch tokenizers
```

The repo already contains the training/inference scripts; no extra setup is needed beyond installing dependencies.

## Citation
If you use the model, please reference it as "Haipai-7M".