Icelandic Tokenizer README

Overview

This BPE (Byte Pair Encoding) tokenizer is designed for the Icelandic GPT model, available at Sigurdur/ice-gpt. It was trained on the annotated version of the Icelandic Gigaword Corpus (IGC-2022) and segments Icelandic text into subword tokens.
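For readers unfamiliar with BPE, the sketch below illustrates the general idea of learning merges: repeatedly find the most frequent adjacent symbol pair and fuse it into a new symbol. It is a toy, character-level illustration only, not the procedure used to train this tokenizer (GPT-2 style tokenizers work on bytes and apply pre-tokenization rules). The function names and the tiny corpus are hypothetical.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single fused symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_merges(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated toy corpus."""
    # Each word starts as a tuple of single characters.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single symbol
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# Hypothetical toy corpus: "heimur" is frequent, "halló" is rare.
merges, words = learn_merges("heimur heimur heimur halló", num_merges=5)
print(merges)
print(words)
```

In this toy run the frequent word collapses into a single symbol while the rare word stays split into characters, which is the behaviour that lets a BPE vocabulary cover common Icelandic words compactly and still represent rare ones.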

Usage

Integrate this tokenizer into your NLP pipeline for preprocessing Icelandic text. The following example demonstrates basic usage:

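from transformers import GPT2Tokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = GPT2Tokenizer.from_pretrained("Sigurdur/ice-tokenizer")
# Reuse the end-of-sequence token as the padding token
tokenizer.pad_token_id = tokenizer.eos_token_id

# Encode a sentence and print its token IDs
input_ids = tokenizer("Halló heimur!")["input_ids"]
print(input_ids)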
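The example assigns the end-of-sequence ID as the padding ID because GPT-2 style tokenizers do not define a padding token by default. The library-independent sketch below shows what padding a batch amounts to; the helper `pad_batch`, the token IDs, and the padding ID are made up for illustration and are not values from this tokenizer.

```python
def pad_batch(batch, pad_id):
    """Right-pad lists of token IDs to a common length and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch
    ]
    return input_ids, attention_mask


# Hypothetical token IDs; 0 stands in for the shared eos/pad ID.
input_ids, attention_mask = pad_batch([[5, 6, 7], [8]], pad_id=0)
print(input_ids)
print(attention_mask)
```

Because the padding ID and the end-of-sequence ID are the same value here, the attention mask is what distinguishes padding from a genuine end-of-sequence token.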

Citation

If you use this tokenizer in your work, please cite the original source of the training data:

@misc{20.500.12537/254,
  title = {Icelandic Gigaword Corpus ({IGC}-2022) - annotated version},
  author = {Barkarson, Starkaður and Steingrímsson, Steinþór and Andrésdóttir, Þórdís Dröfn and Hafsteinsdóttir, Hildur and Ingimundarson, Finnur Ágúst and Magnússon, Árni Davíð},
  url = {http://hdl.handle.net/20.500.12537/254},
  note = {{CLARIN}-{IS}},
  year = {2022}
}

Feedback

We welcome user feedback to enhance the tokenizer's functionality. Feel free to reach out with your insights and suggestions.

Happy tokenizing!

Sigurdur Haukur Birgisson

(readme created with chatgpt)
