fyn1668-nemotron-base-tokenizer
A fork of geodesic-research/nemotron-base-tokenizer with two new special tokens registered
to be loss-masked at training time by the geodesic-megatron
training pipeline.
What's added
| Token | ID |
|---|---|
| `<stage=training>` | 131072 |
| `</stage=training>` | 131073 |
These appear in the fyn1668 quarantine campaign corpora (train-stage-only / TSO arm) as
markers wrapping assistant turns. The model should learn the content between them but not learn
to emit the markers themselves.
How it works
A top-level field is added to tokenizer_config.json:
"loss_mask_token_ids": [131072, 131073]
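As a rough illustration of how a consumer might read this field, the sketch below uses only the Python standard library. The helper name `read_loss_mask_token_ids` and the temporary stand-in file are hypothetical; this is not the pipeline's own `_read_loss_mask_token_ids`, whose implementation is not shown here.

```python
import json
import tempfile
from pathlib import Path


def read_loss_mask_token_ids(config_path):
    """Return the loss_mask_token_ids list from a tokenizer_config.json.

    Hypothetical helper: returns an empty list when the field is absent,
    so tokenizers without the field behave as "mask nothing".
    """
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    ids = config.get("loss_mask_token_ids", [])
    if not all(isinstance(i, int) for i in ids):
        raise ValueError("loss_mask_token_ids must be a list of integers")
    return list(ids)


# Demo against a stand-in config file (assumption: the real file carries
# many other keys, which this helper simply ignores).
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tokenizer_config.json"
    path.write_text(
        json.dumps({"loss_mask_token_ids": [131072, 131073]}),
        encoding="utf-8",
    )
    ids = read_loss_mask_token_ids(path)
```

Returning an empty list for a missing field is a design choice of this sketch, not documented behavior of the pipeline.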
At training time, the geodesic-megatron pipeline reads this field via
pipeline_training_run.py:_read_loss_mask_token_ids and propagates it to
cfg.tokenizer.loss_mask_token_ids. The training step
(src/megatron/bridge/training/gpt_step.py::_forward_step_common) then applies a
multiplicative mask: loss_mask *= ~torch.isin(labels, loss_mask_token_ids). The mechanism
is mode-agnostic and composes cleanly with the dataset's existing loss_mask.
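The multiplicative mask described above can be sketched in isolation with PyTorch. This is a minimal illustration, not the code in `gpt_step.py`: the function name `apply_token_loss_mask`, the tensor shapes, and the toy token IDs other than 131072 and 131073 are assumptions.

```python
import torch

LOSS_MASK_TOKEN_IDS = [131072, 131073]


def apply_token_loss_mask(loss_mask, labels, loss_mask_token_ids):
    """Zero the loss mask wherever the label is one of the masked token IDs.

    Positions already zeroed by the dataset's own mask stay zero, because
    the two masks are combined by multiplication.
    """
    ids = torch.as_tensor(
        loss_mask_token_ids, dtype=labels.dtype, device=labels.device
    )
    keep = ~torch.isin(labels, ids)  # True where the label is NOT a marker
    return loss_mask * keep.to(loss_mask.dtype)


# Toy batch: marker, two content tokens, marker, one padding position.
labels = torch.tensor([[131072, 11, 12, 131073, 0]])
# Dataset mask (assumed here to already exclude the padding position).
dataset_mask = torch.tensor([[1.0, 1.0, 1.0, 1.0, 0.0]])

masked = apply_token_loss_mask(dataset_mask, labels, LOSS_MASK_TOKEN_IDS)
```

In this toy batch only the two content tokens keep a non-zero mask: the markers are removed by the token-ID mask and the padding position by the dataset mask.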
Inference frameworks (vLLM, sfm-evals, transformers' generate) ignore the field
because they don't compute loss, so the same tokenizer artifact works unchanged for
both training and inference.
Compatibility notes
- Embedding resize required: adding the two special tokens grows the vocab by 2. The
training pipeline performs model.resize_token_embeddings(new_vocab_size) automatically
when the tokenizer's vocab exceeds the model's embedding rows; the new embedding rows
are randomly initialized and learned during training.
- Same encoder otherwise: every other token in the vocab is byte-identical to the source
tokenizer, so existing tokenized corpora that don't contain the new marker strings
remain unaffected.
- Source commit pinning: this fork was built from the source tokenizer's main revision
as of 2026-05-13.
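To make the embedding-resize note concrete, here is a small plain-PyTorch sketch of what growing an embedding table amounts to: existing rows are copied and the extra rows keep a fresh random initialization. The helper `grow_embedding` and the toy sizes are hypothetical; the sketch does not reproduce `resize_token_embeddings` or the pipeline's code, and the initialization scheme for new rows here is simply `nn.Embedding`'s default.

```python
import torch
from torch import nn


def grow_embedding(old: nn.Embedding, new_vocab_size: int) -> nn.Embedding:
    """Return an embedding with new_vocab_size rows, preserving old rows.

    Rows beyond the old vocabulary keep nn.Embedding's default random
    initialization and would be learned during training.
    """
    if new_vocab_size <= old.num_embeddings:
        return old  # nothing to grow
    new = nn.Embedding(new_vocab_size, old.embedding_dim)
    new = new.to(device=old.weight.device, dtype=old.weight.dtype)
    with torch.no_grad():
        new.weight[: old.num_embeddings].copy_(old.weight)
    return new


# Toy sizes stand in for the real vocabulary (which grows by 2).
torch.manual_seed(0)
old_embedding = nn.Embedding(8, 4)
new_embedding = grow_embedding(old_embedding, 8 + 2)
```

A real model would typically also need its output projection resized when it is not tied to the input embedding.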
Provenance
- Source tokenizer: geodesic-research/nemotron-base-tokenizer
- Built by: scripts/data/build_fyn1668_tokenizers.py
- Date: 2026-05-13
- Campaign: im_fyn1668_v3 (quarantine masking)