---
license: mit
datasets:
- lennart-finke/SimpleStories
language:
- en
tags:
- small-language-model
- story-generation
- text-generation
- efficient-nlp
- distilled-models
---

# SimpleStories Model Family

The SimpleStories models are a tiny model family created for interpretability research, trained on the [SimpleStories dataset](https://huggingface.co/datasets/lennart-finke/SimpleStories).

## Usage

```python
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

MODEL_SIZE = "11M"
model_path = f"SimpleStories/SimpleStories-{MODEL_SIZE}"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(model_path)

# Run on GPU if available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

prompt = "The curious cat looked at the"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
input_ids = inputs.input_ids.to(device)

# End-of-sequence token id for the SimpleStories tokenizer.
eos_token_id = 1

with torch.no_grad():
    output_ids = model.generate(
        input_ids=input_ids,
        max_new_tokens=400,
        temperature=0.7,
        do_sample=True,
        eos_token_id=eos_token_id,
    )

output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(f"\nGenerated text:\n{output_text}")
```
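
For quick experiments, the same checkpoint can also be driven through the `transformers` text-generation pipeline. The snippet below is a minimal sketch, assuming the checkpoint resolves through the standard auto classes; the repository id and `eos_token_id=1` are taken from the example above.

```python
from transformers import pipeline

# Sketch: load the 11M checkpoint through the text-generation pipeline.
generator = pipeline("text-generation", model="SimpleStories/SimpleStories-11M")

story = generator(
    "The curious cat looked at the",
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    eos_token_id=1,  # end-of-sequence id used in the example above
)
print(story[0]["generated_text"])
```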

## Model Variants

| Model Name | n_params | n_layers | d_model | n_heads | n_ctx | d_vocab |
|------------|----------|----------|---------|---------|-------|---------|
| SimpleStories-35M | 35 million | 12 | 512 | 8 | 512 | 4096 |
| SimpleStories-30M | 30 million | 10 | 512 | 8 | 512 | 4096 |
| SimpleStories-11M | 11 million | 6 | 384 | 6 | 512 | 4096 |
| SimpleStories-5M | 5 million | 6 | 256 | 4 | 512 | 4096 |
| SimpleStories-1.25M | 1.25 million | 4 | 128 | 4 | 512 | 4096 |
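
To confirm these hyperparameters for a given checkpoint, you can read them off the loaded config. The sketch below assumes the per-size repositories follow the `SimpleStories/SimpleStories-<size>` naming used in the Usage example, and that the table's `n_layers`, `d_model`, `n_heads`, and `d_vocab` map onto the standard Llama config fields.

```python
from transformers import AutoConfig

# Sketch: print the architecture hyperparameters for each model size.
for size in ["1.25M", "5M", "11M", "30M", "35M"]:
    config = AutoConfig.from_pretrained(f"SimpleStories/SimpleStories-{size}")
    print(
        f"{size}: n_layers={config.num_hidden_layers}, "
        f"d_model={config.hidden_size}, n_heads={config.num_attention_heads}, "
        f"d_vocab={config.vocab_size}"
    )
```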

## Performance Comparison

Model-evaluated generation quality metrics:

<p align="center">
  <img width="80%" src="figures/simplestories_comparison.png">
</p>

## Tokenizer

We use a custom WordPiece tokenizer with a small vocabulary of 4096 tokens. We conducted morphological and coverage-gain analyses on the dataset to build this small tokenizer without compromising generation quality.
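
As a quick sanity check, the tokenizer can be loaded from any of the checkpoints and inspected directly; this is a small sketch assuming the 11M repository from the Usage example.

```python
from transformers import AutoTokenizer

# Sketch: inspect the custom WordPiece tokenizer shipped with a checkpoint.
tokenizer = AutoTokenizer.from_pretrained("SimpleStories/SimpleStories-11M")

print(tokenizer.vocab_size)  # expected to be 4096
print(tokenizer.tokenize("The curious cat looked at the moon."))
```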

## Dataset

The SimpleStories dataset is a collection of short stories generated by state-of-the-art language models. It features:

- Story annotation with high-level concepts: theme, topic, style, etc.
- Higher semantic and syntactic diversity through seeded story generation
- Generation by 2024 models
- Several pre-computed NLP metrics to aid filtering
- An ASCII-only guarantee for the English dataset

Read the dataset paper on [arXiv](https://arxiv.org/abs/2504.09184).
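
The dataset can be pulled directly with the `datasets` library. The sketch below is illustrative only: the split name and the column names (`story`, `theme`) are assumptions based on the annotation description above, not the dataset's documented schema.

```python
from datasets import load_dataset

# Sketch: load SimpleStories and peek at one example.
# NOTE: the split and column names used below are assumptions.
dataset = load_dataset("lennart-finke/SimpleStories", split="train")

example = dataset[0]
print(example.get("story", example))
print(example.get("theme"))
```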

## Training

The training and evaluation scripts can be accessed at https://github.com/danbraunai/simple_stories_train
|