fyn1668-nemotron-base-tokenizer

A fork of geodesic-research/nemotron-base-tokenizer that adds two special tokens and registers them to be loss-masked at training time by the geodesic-megatron training pipeline.

What's added

Token	ID
`<stage=training>`	`131072`
`</stage=training>`	`131073`

These appear in the fyn1668 quarantine campaign corpora (train-stage-only / TSO arm) as markers wrapping assistant turns. The model should learn the content between them but not learn to emit the markers themselves.

How it works

A top-level field is added to tokenizer_config.json:

"loss_mask_token_ids": [131072, 131073]
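
As a rough illustration, the field can be read with nothing more than the standard library. The helper name `read_loss_mask_token_ids` and the fallback to an empty list are assumptions made for this sketch; they are not taken from the geodesic-megatron code.

```python
import json
import tempfile
from pathlib import Path


def read_loss_mask_token_ids(tokenizer_dir):
    """Return loss_mask_token_ids from tokenizer_config.json, or [] if absent.

    Hypothetical helper for illustration only; the real pipeline reader
    may validate or default differently.
    """
    config_path = Path(tokenizer_dir) / "tokenizer_config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    ids = config.get("loss_mask_token_ids", [])
    if not all(isinstance(i, int) for i in ids):
        raise ValueError("loss_mask_token_ids must be a list of integers")
    return list(ids)


# Demo against a throwaway config that carries only the new field.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "tokenizer_config.json").write_text(
        json.dumps({"loss_mask_token_ids": [131072, 131073]}),
        encoding="utf-8",
    )
    demo_ids = read_loss_mask_token_ids(tmp)
```

Returning an empty list when the field is missing means a tokenizer without the field simply masks nothing extra, which is one plausible way to keep unmodified tokenizers working.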

At training time, the geodesic-megatron pipeline reads this field via pipeline_training_run.py:_read_loss_mask_token_ids and propagates it to cfg.tokenizer.loss_mask_token_ids. The training step (src/megatron/bridge/training/gpt_step.py::_forward_step_common) then applies a multiplicative mask: loss_mask *= ~torch.isin(labels, loss_mask_token_ids). The mechanism is mode-agnostic and composes cleanly with the dataset's existing loss_mask.
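
The effect of the multiplicative mask can be sketched in plain Python on a toy sequence. This is not the pipeline's tensor code: the function name `apply_token_loss_mask` and the ordinary token IDs (11, 12, 13, 99) are made up for illustration, and lists stand in for tensors.

```python
def apply_token_loss_mask(labels, loss_mask, masked_ids):
    """Multiply each loss weight by 0 where the label is a masked token.

    Plain-Python stand-in for: loss_mask *= ~isin(labels, masked_ids).
    """
    masked = set(masked_ids)
    return [
        weight * (0.0 if label in masked else 1.0)
        for label, weight in zip(labels, loss_mask)
    ]


START, END = 131072, 131073  # <stage=training>, </stage=training>

# Toy labels: a wrapped assistant turn followed by one token that the
# dataset's own loss_mask has already excluded (weight 0.0).
labels = [START, 11, 12, 13, END, 99]
loss_mask = [1.0, 1.0, 1.0, 1.0, 1.0, 0.0]

new_mask = apply_token_loss_mask(labels, loss_mask, [START, END])
# new_mask == [0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
```

Because the update is a product rather than an assignment, positions the dataset already masked stay at zero and only the marker positions are newly zeroed. The content between the markers keeps its full weight.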

Inference frameworks (vLLM, sfm-evals, transformers' generate) ignore the field because they do not compute a loss, so the same tokenizer artifact works unchanged for both training and inference.

Compatibility notes

Embedding resize required: adding the two special tokens grows the vocab by 2. The training pipeline performs model.resize_token_embeddings(new_vocab_size) automatically when the tokenizer's vocab exceeds the model's embedding rows; the new embedding rows are randomly initialized and learned during training.
Same encoder otherwise: every other token in the vocab is byte-identical to the source tokenizer, so existing tokenized corpora that don't contain the new marker strings remain unaffected.
Source commit pinning: this fork was built from the source tokenizer's main revision as of 2026-05-13.
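
The first two notes can be turned into a quick sanity check. The sketch below uses toy vocabularies with small IDs, and both helper names (`check_fork_vocab`, `rows_to_add`) are hypothetical; it only illustrates the kind of check one might run, not what the build script does.

```python
def check_fork_vocab(source_vocab, fork_vocab, added):
    """Return a list of problems; an empty list means the fork is a pure extension."""
    problems = []
    # Every source token must keep its original ID in the fork.
    for token, idx in source_vocab.items():
        if fork_vocab.get(token) != idx:
            problems.append(f"changed or missing: {token!r}")
    # The only new entries must be exactly the expected additions.
    extra = {t: i for t, i in fork_vocab.items() if t not in source_vocab}
    if extra != added:
        problems.append(f"unexpected additions: {extra!r}")
    return problems


def rows_to_add(tokenizer_vocab_size, embedding_rows):
    """Number of embedding rows a resize would need to append (never negative)."""
    return max(0, tokenizer_vocab_size - embedding_rows)


# Toy vocabularies; the real IDs are 131072 and 131073.
source = {"a": 0, "b": 1}
added = {"<stage=training>": 2, "</stage=training>": 3}
fork = {**source, **added}

problems = check_fork_vocab(source, fork, added)  # [] for a pure extension
```

Under the assumption that the model's embedding matrix has exactly 131072 rows, `rows_to_add(131074, 131072)` gives 2, matching the "grows the vocab by 2" note above.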

Provenance

Source tokenizer: geodesic-research/nemotron-base-tokenizer
Built by: scripts/data/build_fyn1668_tokenizers.py
Date: 2026-05-13
Campaign: im_fyn1668_v3 (quarantine masking)
