---
license: apache-2.0
datasets:
  - abisee/cnn_dailymail
pipeline_tag: text-generation
library_name: transformers
language:
  - en
metrics:
  - perplexity
tags:
  - pytorch
  - seq2seq
  - encoder
  - decoder
  - t5-style
  - tiny
model-index:
  - name: Haipai-7M
    results:
      - task:
          type: text-generation
        dataset:
          name: abisee/cnn_dailymail
          type: abisee/cnn_dailymail
        metrics:
          - name: Perplexity
            type: perplexity
            value: 16
---

# Haipai-7M

## Introduction

We introduce Haipai-7M, a very small model trained on the CNN/DailyMail dataset. We are open-sourcing both the training code and the model checkpoints.

Haipai is a small-footprint seq2seq Transformer that completes sentences.

## Highlights

- ~7.3M trainable parameters (4 encoder + 4 decoder layers, 288 hidden size, 6 heads).
- Factorized embeddings with linear projections back to the model dimension.
- Trained on 80k oracle/reference pairs with heavy denoising corruption (span masking, token drops, sentence shuffles); see the sketch after this list.
- All training data comes from news articles.
- 32k shared subword vocabulary (trained with the `tokenizers` library).
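
The exact corruption schedule lives in the training code; the snippet below is only a minimal sketch of the three operations named above, with the mask token, probabilities, and span lengths chosen purely for illustration.

```python
import random

MASK_TOKEN = "<mask>"  # placeholder; the real mask token is defined by the tokenizer


def corrupt(text: str, span_prob: float = 0.15, drop_prob: float = 0.1) -> str:
    """Illustrative denoising noise: sentence shuffling, span masking, token drops."""
    # Sentence shuffle: split on '.', shuffle, and rejoin.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    random.shuffle(sentences)
    tokens = " . ".join(sentences).split()

    corrupted = []
    i = 0
    while i < len(tokens):
        r = random.random()
        if r < span_prob:
            # Span masking: replace a short span of tokens with a single mask token.
            span_len = random.randint(1, 3)
            corrupted.append(MASK_TOKEN)
            i += span_len
        elif r < span_prob + drop_prob:
            # Token drop: skip this token entirely.
            i += 1
        else:
            corrupted.append(tokens[i])
            i += 1
    return " ".join(corrupted)


print(corrupt("US officials pledged immediate aid. The storm devastated the coastline."))
```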

## Files

- `stage1_final.pt`: checkpoint with model weights and config.
- `stage1_best.pt`: checkpoint of the best model.
- `stage1_tokenizer.json`: 32k BPE tokenizer shared across stages.
- `stage1_infer.py`: CLI for greedy reconstructions.

## Architecture

```
Shared subword embedding (32k) →
  Encoder (4 layers) →
  Decoder (4 layers) →
  Factorized output projection → vocab logits
```

- Encoder/decoder blocks are standard Transformer layers (multi-head attention + FFN + dropout).
- Embeddings start in a smaller space (`embed_dim`) and are linearly projected to the model dimension.
- Output logits reuse the embedding matrix (weight tying); see the sketch below.
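
As a concrete illustration of the last two points, here is a minimal PyTorch sketch of a factorized embedding with weight tying. The 32k vocabulary and 288 model dimension come from the highlights above; the smaller embedding dimension (128) and the class/method names are assumptions for illustration, not the released code.

```python
import torch
import torch.nn as nn


class FactorizedEmbedding(nn.Module):
    """Sketch: embed in a small space, project up to the model dimension,
    and reuse the embedding matrix for the output logits (weight tying)."""

    def __init__(self, vocab_size: int = 32_000, embed_dim: int = 128, model_dim: int = 288):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)              # 32k x embed_dim
        self.up_proj = nn.Linear(embed_dim, model_dim, bias=False)    # embed_dim -> 288
        self.down_proj = nn.Linear(model_dim, embed_dim, bias=False)  # 288 -> embed_dim

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Token ids -> small embedding -> model-dimension vectors.
        return self.up_proj(self.embed(token_ids))

    def logits(self, hidden: torch.Tensor) -> torch.Tensor:
        # Project decoder states back down and score against the tied embedding weights.
        return self.down_proj(hidden) @ self.embed.weight.t()


layer = FactorizedEmbedding()
ids = torch.randint(0, 32_000, (2, 16))
print(layer.logits(layer(ids)).shape)  # torch.Size([2, 16, 32000])
```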

## Quick Inference

Stage 1 reconstructs text; feed it any summary to see what the autoencoder learned:

```bash
python -m src.stage1_infer \
  --run-dir models \
  --tokenizer-path models/stage1_tokenizer.json \
  --config-checkpoint models/stage1_final.pt \
  --input-text "US officials pledged immediate aid after the storm devastated the coastline."
```

You can also batch inputs from a file (`--input-file path/to/texts.txt`) or point to specific checkpoints with `--checkpoints`.
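
If you prefer to poke at the released artifacts directly rather than going through the CLI, the sketch below loads the tokenizer and checkpoint with standard `tokenizers` / `torch` calls and simply inspects them. The file paths follow the Files section above; nothing about the checkpoint's internal layout is assumed beyond it being loadable with `torch.load`.

```python
import torch
from tokenizers import Tokenizer

# Load the shared 32k BPE tokenizer released with the model.
tokenizer = Tokenizer.from_file("models/stage1_tokenizer.json")
print(tokenizer.get_vocab_size())  # expected to be ~32k

# Load the checkpoint on CPU and print its top-level keys to see the layout.
checkpoint = torch.load("models/stage1_final.pt", map_location="cpu")
print(list(checkpoint.keys()))

# Tokenize an example input the same way the CLI would.
ids = tokenizer.encode("US officials pledged immediate aid after the storm.").ids
print(ids[:10])
```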

## Expected Output

- Greedy decode tends to copy the input with light denoising (that's the objective); a generic sketch of the decoding loop follows below.
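
For reference, greedy decoding just appends the argmax token at each step. The sketch below assumes a hypothetical encoder-decoder callable that returns `(batch, tgt_len, vocab)` logits; it is not the released model class, and `stage1_infer.py` already handles all of this for you.

```python
import torch


def greedy_decode(model, src_ids, bos_id, eos_id, max_len=128):
    """Generic greedy decoding loop: at each step, feed the tokens generated so far
    and append the argmax of the final position's logits. `model` is a hypothetical
    encoder-decoder callable taking (src_ids, tgt_ids) and returning logits."""
    out = torch.tensor([[bos_id]])
    with torch.no_grad():
        for _ in range(max_len):
            logits = model(src_ids, out)                       # (1, tgt_len, vocab)
            next_id = logits[:, -1, :].argmax(-1, keepdim=True)  # pick the best next token
            out = torch.cat([out, next_id], dim=1)
            if next_id.item() == eos_id:
                break
    return out.squeeze(0).tolist()
```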

## Requirements

```bash
pip install torch tokenizers
```

The repo already contains the training/inference scripts; no extra setup is needed beyond installing dependencies.

## Citation

If you use the model, please reference it as "Haipai-7M".