---
license: apache-2.0
datasets:
- abisee/cnn_dailymail
pipeline_tag: text-generation
library_name: transformers
language:
- en
metrics:
- perplexity
tags:
- pytorch
- seq2seq
- encoder
- decoder
- t5-style
- tiny
model-index:
- name: Haipai-7M
results:
- task:
type: text-generation
dataset:
name: abisee/cnn_dailymail
type: abisee/cnn_dailymail
metrics:
- name: Perplexity
        type: perplexity
value: 16
---
# Introduction
We introduce Haipai-7M ("haipai"), a very small model trained on the CNN/DailyMail dataset. We are open-sourcing both the training code and the model checkpoints.
It is a small-footprint seq2seq Transformer that completes sentences.
## Highlights
- ~7.3M trainable parameters (4 encoder + 4 decoder layers, hidden size 288, 6 attention heads); see the configuration sketch below.
- Factorized embeddings with linear projections back to the model dimension.
- Trained on 80k oracle/reference pairs from news articles, with heavy denoising corruption (span masking, drops, sentence shuffles).
- 32k shared subword vocabulary (trained with `tokenizers`).
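For reference, the sizes above translate roughly into the following configuration (the key names and the embedding size are illustrative; the real config is bundled with the checkpoints):
```python
# Illustrative configuration matching the numbers above.
# Key names are hypothetical and embed_dim is a placeholder; the actual values
# are stored inside the checkpoint (see `stage1_final.pt`).
config = {
    "vocab_size": 32_000,        # shared subword vocabulary
    "d_model": 288,              # model / hidden dimension
    "num_heads": 6,              # attention heads per layer
    "num_encoder_layers": 4,
    "num_decoder_layers": 4,
    "embed_dim": 128,            # factorized embedding size (placeholder value)
}
```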
## Files
- `stage1_final.pt`: checkpoint with model weights and config.
- `stage1_best.pt`: checkpoint of the best model.
- `stage1_tokenizer.json`: 32k BPE tokenizer shared across stages.
- `stage1_infer.py`: CLI for greedy reconstructions.
## Architecture
```text
Shared Subword Embedding (32k) →
Encoder (4 layers) →
Decoder (4 layers) →
Factorized output projection → vocab logits
```
- Encoder/decoder blocks are standard Transformer layers (multi-head attention + FFN + dropout).
- Embeddings start in a smaller space (`embed_dim`) and are linearly projected to the model dimension.
- Output logits reuse the embedding matrix (weight tying).
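A minimal PyTorch sketch of the factorized embedding plus tied output head described above (class and attribute names are illustrative, not the ones used in the released code):
```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Embed tokens in a small space, project up to d_model, and reuse the
    same weights for the output logits (weight tying)."""

    def __init__(self, vocab_size: int, embed_dim: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.up_proj = nn.Linear(embed_dim, d_model, bias=False)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # (batch, seq) -> (batch, seq, d_model); input to the encoder/decoder stack.
        return self.up_proj(self.embed(token_ids))

    def logits(self, hidden: torch.Tensor) -> torch.Tensor:
        # Tied output head: map d_model -> embed_dim by reusing up_proj's weight,
        # then map to vocab logits with the embedding matrix.
        small = hidden @ self.up_proj.weight          # (..., embed_dim)
        return small @ self.embed.weight.t()          # (..., vocab_size)

# Dimensions roughly matching the card (embed_dim is an assumption).
emb = FactorizedEmbedding(vocab_size=32_000, embed_dim=128, d_model=288)
tokens = torch.randint(0, 32_000, (2, 16))
hidden_in = emb(tokens)          # (2, 16, 288)
logits = emb.logits(hidden_in)   # (2, 16, 32000)
```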
## Quick Inference
Stage 1 reconstructs text; feed it any summary to see what the autoencoder learned:
```bash
python -m src.stage1_infer \
--run-dir models \
--tokenizer-path models/stage1_tokenizer.json \
--config-checkpoint models/stage1_final.pt \
--input-text "US officials pledged immediate aid after the storm devastated the coastline."
```
You can also batch inputs from a file (`--input-file path/to/texts.txt`) or point to specific checkpoints with `--checkpoints`.
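If you prefer to bypass the CLI, the tokenizer and checkpoint can be opened directly with `tokenizers` and `torch` (a rough sketch; the keys stored inside the checkpoint dict are not documented here, so inspect them first):
```python
import torch
from tokenizers import Tokenizer

# Load the shared 32k BPE tokenizer and the final checkpoint.
tokenizer = Tokenizer.from_file("models/stage1_tokenizer.json")
checkpoint = torch.load("models/stage1_final.pt", map_location="cpu")
print(checkpoint.keys())   # inspect which keys hold the weights and config

# Tokenize an input the same way the CLI would before encoding.
enc = tokenizer.encode("US officials pledged immediate aid after the storm.")
print(enc.ids[:10])        # subword ids fed to the encoder
```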
## Expected Output
- Greedy decode tends to copy the input with light denoising (that’s the objective).
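Greedy decoding here simply means taking the argmax token at every step until EOS; a generic sketch of that loop (the `model.encode`/`model.decode` interface is hypothetical, `stage1_infer.py` contains the real implementation):
```python
import torch

@torch.no_grad()
def greedy_decode(model, src_ids: torch.Tensor, bos_id: int, eos_id: int, max_len: int = 128):
    """Argmax decoding: no sampling, no beam search."""
    memory = model.encode(src_ids)                  # hypothetical encoder call
    out = torch.full((1, 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        step_logits = model.decode(out, memory)     # hypothetical decoder call
        next_id = step_logits[:, -1].argmax(dim=-1, keepdim=True)
        out = torch.cat([out, next_id], dim=1)
        if next_id.item() == eos_id:
            break
    return out[0].tolist()
```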
## Requirements
```bash
pip install torch tokenizers
```
The repo already contains the training/inference scripts; no extra setup is needed beyond installing dependencies.
## Citation
If you use the model, please reference it as "Haipai-7M".