---
license: apache-2.0
datasets:
  - abisee/cnn_dailymail
pipeline_tag: text-generation
library_name: transformers
language:
  - en
metrics:
  - perplexity
tags:
  - pytorch
  - seq2seq
  - encoder
  - decoder
  - t5-style
  - tiny
model-index:
  - name: Haipai-7M
    results:
      - task:
          type: text-generation
        dataset:
          name: abisee/cnn_dailymail
          type: abisee/cnn_dailymail
        metrics:
          - name: Perplexity
            type: perplexity
            value: 16
---

# Haipai-7M

## Introduction

We introduce Haipai-7M, a very small model trained on the CNN/DailyMail dataset. We are open-sourcing both the training code and the model checkpoints.

Haipai is a small-footprint seq2seq Transformer that completes sentences.

## Highlights

- ~7.3M trainable parameters (4 encoder + 4 decoder layers, 288 hidden size, 6 heads).
- Factorized embeddings with linear projections back to the model dimension.
- Trained on 80k oracle/reference pairs with heavy denoising corruption (span masking, token drops, sentence shuffles); see the sketch after this list.
- All training data comes from news articles.
- 32k shared subword vocabulary (trained with the `tokenizers` library).
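
The exact corruption schedule lives in the training code; the snippet below is only a minimal sketch of the three operations named above, with the mask token, probabilities, and span lengths chosen purely for illustration.

```python
import random

MASK_TOKEN = "<mask>"  # placeholder; the real mask token is defined by the tokenizer


def corrupt(text: str, span_prob: float = 0.15, drop_prob: float = 0.1) -> str:
    """Illustrative denoising noise: sentence shuffling, span masking, token drops."""
    # Sentence shuffle: split on '.', shuffle, and rejoin.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    random.shuffle(sentences)
    tokens = " . ".join(sentences).split()

    corrupted = []
    i = 0
    while i < len(tokens):
        r = random.random()
        if r < span_prob:
            # Span masking: replace a short span of tokens with a single mask token.
            span_len = random.randint(1, 3)
            corrupted.append(MASK_TOKEN)
            i += span_len
        elif r < span_prob + drop_prob:
            # Token drop: skip this token entirely.
            i += 1
        else:
            corrupted.append(tokens[i])
            i += 1
    return " ".join(corrupted)


print(corrupt("US officials pledged immediate aid. The storm devastated the coastline."))
```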

## Files

- `stage1_final.pt`: checkpoint with model weights and config.
- `stage1_best.pt`: checkpoint of the best model.
- `stage1_tokenizer.json`: 32k BPE tokenizer shared across stages.
- `stage1_infer.py`: CLI for greedy reconstructions.

## Architecture

```
Shared subword embedding (32k) →
  Encoder (4 layers) →
  Decoder (4 layers) →
  Factorized output projection → vocab logits
```

- Encoder/decoder blocks are standard Transformer layers (multi-head attention + FFN + dropout).
- Embeddings start in a smaller space (`embed_dim`) and are linearly projected to the model dimension.
- Output logits reuse the embedding matrix (weight tying); see the sketch below.
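
As a concrete illustration of the last two points, here is a minimal PyTorch sketch of a factorized embedding with weight tying. The 32k vocabulary and 288 model dimension come from the highlights above; the smaller embedding dimension (128) and the class/method names are assumptions for illustration, not the released code.

```python
import torch
import torch.nn as nn


class FactorizedEmbedding(nn.Module):
    """Sketch: embed in a small space, project up to the model dimension,
    and reuse the embedding matrix for the output logits (weight tying)."""

    def __init__(self, vocab_size: int = 32_000, embed_dim: int = 128, model_dim: int = 288):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)              # 32k x embed_dim
        self.up_proj = nn.Linear(embed_dim, model_dim, bias=False)    # embed_dim -> 288
        self.down_proj = nn.Linear(model_dim, embed_dim, bias=False)  # 288 -> embed_dim

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Token ids -> small embedding -> model-dimension vectors.
        return self.up_proj(self.embed(token_ids))

    def logits(self, hidden: torch.Tensor) -> torch.Tensor:
        # Project decoder states back down and score against the tied embedding weights.
        return self.down_proj(hidden) @ self.embed.weight.t()


layer = FactorizedEmbedding()
ids = torch.randint(0, 32_000, (2, 16))
print(layer.logits(layer(ids)).shape)  # torch.Size([2, 16, 32000])
```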

## Quick Inference

Stage 1 reconstructs text; feed it any summary to see what the autoencoder learned:

```bash
python -m src.stage1_infer \
  --run-dir models \
  --tokenizer-path models/stage1_tokenizer.json \
  --config-checkpoint models/stage1_final.pt \
  --input-text "US officials pledged immediate aid after the storm devastated the coastline."
```

You can also batch inputs from a file (`--input-file path/to/texts.txt`) or point to specific checkpoints with `--checkpoints`.
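
If you prefer to poke at the released artifacts directly rather than going through the CLI, the sketch below loads the tokenizer and checkpoint with standard `tokenizers` / `torch` calls and simply inspects them. The file paths follow the Files section above; nothing about the checkpoint's internal layout is assumed beyond it being loadable with `torch.load`.

```python
import torch
from tokenizers import Tokenizer

# Load the shared 32k BPE tokenizer released with the model.
tokenizer = Tokenizer.from_file("models/stage1_tokenizer.json")
print(tokenizer.get_vocab_size())  # expected to be ~32k

# Load the checkpoint on CPU and print its top-level keys to see the layout.
checkpoint = torch.load("models/stage1_final.pt", map_location="cpu")
print(list(checkpoint.keys()))

# Tokenize an example input the same way the CLI would.
ids = tokenizer.encode("US officials pledged immediate aid after the storm.").ids
print(ids[:10])
```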

## Expected Output

- Greedy decode tends to copy the input with light denoising (that's the objective); a generic sketch of the decoding loop follows below.
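
For reference, greedy decoding just appends the argmax token at each step. The sketch below assumes a hypothetical encoder-decoder callable that returns `(batch, tgt_len, vocab)` logits; it is not the released model class, and `stage1_infer.py` already handles all of this for you.

```python
import torch


def greedy_decode(model, src_ids, bos_id, eos_id, max_len=128):
    """Generic greedy decoding loop: at each step, feed the tokens generated so far
    and append the argmax of the final position's logits. `model` is a hypothetical
    encoder-decoder callable taking (src_ids, tgt_ids) and returning logits."""
    out = torch.tensor([[bos_id]])
    with torch.no_grad():
        for _ in range(max_len):
            logits = model(src_ids, out)                       # (1, tgt_len, vocab)
            next_id = logits[:, -1, :].argmax(-1, keepdim=True)  # pick the best next token
            out = torch.cat([out, next_id], dim=1)
            if next_id.item() == eos_id:
                break
    return out.squeeze(0).tolist()
```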

## Requirements

```bash
pip install torch tokenizers
```

The repo already contains the training/inference scripts; no extra setup is needed beyond installing dependencies.

## Citation

If you use the model, please reference it as "Haipai-7M".