Sifal KLIOUI
Recent Activity
Thanks for sharing, probably worth having a script to check:
import warnings

from transformers import AutoTokenizer

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")


def check_tokenizer_gotchas(model_id):
    print(f"\n{'='*60}")
    print(f"Analyzing Tokenizer for: {model_id}")
    print(f"{'='*60}\n")

    try:
        # Load tokenizer (trust_remote_code=True is often needed for newer/custom models)
        tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    except Exception as e:
        print(f"Error loading tokenizer: {e}")
        return

    # Standard test input
    test_text = "Beautiful is better than ugly"

    # Standard test messages for chat templates
    messages = [
        {"role": "user", "content": "What is better than ugly?"},
        {"role": "assistant", "content": "Beautiful."},
    ]

    # --- GOTCHA 1 & 2: BOS Token Existence and Usage ---
    print("--- 1 & 2. BOS Token Analysis ---")
    if tokenizer.bos_token is None:
        print("⚠️ Gotcha #1: This tokenizer has NO BOS token defined.")
    else:
        print(f"✅ BOS token exists: '{tokenizer.bos_token}' (ID: {tokenizer.bos_token_id})")
        # Check usage in standard encoding
        encoded = tokenizer(test_text)["input_ids"]
        if tokenizer.bos_token_id in encoded:
            print("✅ BOS token IS automatically added during standard tokenization.")
        else:
            print("⚠️ Gotcha #2: BOS exists but is NOT added automatically.")

    # --- GOTCHA 3: EOS Token in Standard Tokenization ---
    print("--- 3. Standard EOS Token Analysis ---")
    encoded = tokenizer(test_text)["input_ids"]
    # Compare against None explicitly: an EOS ID of 0 is falsy but valid.
    if tokenizer.eos_token_id is not None and encoded[-1] == tokenizer.eos_token_id:
        print("ℹ️ EOS token WAS added automatically (uncommon behavior).")
    else:
        print("⚠️ Gotcha #3: Tokenization did NOT add the EOS token automatically.")

    # --- GOTCHA 4: EOS in Chat Templates ---
    print("--- 4. Chat Template EOS Analysis ---")
    if tokenizer.chat_template:
        # Generate IDs without adding the generation prompt yet
        chat_encoded = tokenizer.apply_chat_template(messages, add_generation_prompt=False)
        if tokenizer.eos_token_id is None:
            print("❌ No EOS token defined in tokenizer.")
        elif len(chat_encoded) > 0:
            last_id = chat_encoded[-1]
            # Check if the very last token is EOS
            if last_id == tokenizer.eos_token_id:
                print(f"✅ Chat template correctly appends EOS ({tokenizer.eos_token}) at the very end.")
            # Check if EOS is second to last (common issue)
            elif len(chat_encoded) > 1 and chat_encoded[-2] == tokenizer.eos_token_id:
                # Decode the actual last token to show the user
                trailing_token = tokenizer.decode([last_id])
                # repr() escapes newlines for visibility in print output
                trailing_repr = repr(trailing_token)
                print("⚠️ Gotcha #4: EOS is present but NOT at the end.")
                print(f"   The actual last token is ID {last_id} ({trailing_repr}).")
                print("   (This is likely a trailing newline from the Jinja template.)")
            else:
                print("⚠️ Gotcha #4: Chat template does NOT append the EOS token.")
    else:
        print("ℹ️ No chat template defined for this tokenizer.")

    # --- GOTCHA 5: PAD == EOS ---
    print("--- 5. Pad Token Collision Check ---")
    if tokenizer.pad_token_id is not None and tokenizer.eos_token_id is not None:
        if tokenizer.pad_token_id == tokenizer.eos_token_id:
            print(f"⚠️ Gotcha #5: PAD token ID equals EOS token ID ({tokenizer.pad_token_id}).")
            print("   Warning: Masking logic `input_ids == pad_token_id` will unintentionally mask EOS tokens.")
        else:
            print(f"✅ PAD ({tokenizer.pad_token_id}) and EOS ({tokenizer.eos_token_id}) are distinct.")
    else:
        print("ℹ️ PAD or EOS token not defined for this tokenizer.")

    # --- GOTCHA 6 & 7: Composition and Double Special Tokens ---
    print("--- 6 & 7. Chat Template Composition ---")
    if tokenizer.chat_template:
        # Step 1: Apply template directly to IDs (correct way)
        direct_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=False)
        # Step 2: Apply template to string, THEN tokenize (incorrect way often used)
        str_template = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
        composed_ids = tokenizer(str_template)["input_ids"]
        if direct_ids != composed_ids:
            print("⚠️ Gotcha #7: Tokenizing the output of `apply_chat_template` ADDS extra special tokens.")
            print(f"   Direct ID length: {len(direct_ids)} vs re-tokenized length: {len(composed_ids)}")
        else:
            print("✅ Tokenization of chat template string matches direct ID generation.")
    else:
        print("ℹ️ No chat template defined for this tokenizer.")


# Run for all models mentioned in the text
models = [
    "Qwen/Qwen2.5-0.5B",
    "microsoft/Phi-3-mini-128k-instruct",
    "CohereLabs/aya-expanse-8b",
    "meta-llama/Llama-3.2-1B-Instruct",
    "databricks/dbrx-instruct",
    "Qwen/Qwen2.5-0.5B-Instruct",
]

for model in models:
    check_tokenizer_gotchas(model)
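And a minimal sketch of why Gotchas #5 and #7 matter in practice: the common `labels[input_ids == pad_token_id] = -100` masking silently erases EOS labels when PAD == EOS, and re-tokenizing a rendered chat template duplicates special tokens unless you disable them. The model choice and the attention-mask workaround here are illustrative, not the only fix:

import torch
from transformers import AutoTokenizer

# Illustrative model; swap in whichever tokenizer you are debugging.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

text = "Beautiful is better than ugly"

# --- Gotcha #5: naive label masking by pad ID ---
enc = tokenizer(text, return_tensors="pt")
labels = enc["input_ids"].clone()
if tokenizer.pad_token_id is not None:
    # If pad_token_id == eos_token_id, this also sets the EOS label to -100,
    # so the model never learns to emit EOS and generations run on forever.
    labels[enc["input_ids"] == tokenizer.pad_token_id] = -100

# One common workaround: mask via the attention mask, which marks only real padding.
labels = enc["input_ids"].clone()
labels[enc["attention_mask"] == 0] = -100

# --- Gotcha #7: double special tokens when re-tokenizing a rendered template ---
messages = [{"role": "user", "content": "What is better than ugly?"}]
rendered = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# The rendered string already contains the special tokens, so do not re-add them.
ids = tokenizer(rendered, add_special_tokens=False)["input_ids"]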
Gotchas in Tokenizer Behavior Every Developer Should Know
Wrong translation in some cases?
Very interesting example regarding CamemBERT; these were actually what I was referring to when I said "with a few exceptions". I didn't know it was much more common, and your point now on how this biases the results makes much more sense, thanks for the clarifications!
Thanks for the extensive reply!
Valid point about how decoders opened the door to encoders in some applications.
Thanks for sharing the article, I'll try to check it out!
Interesting that you think it is a strong assumption, because from memory, the download curves of models I check on the Hub flatten pretty fast after release (with a few exceptions)
Regarding my previous question, the paper you just shared seems to be the one that actually answers it (p20): proportionally, encoders seem to have been downloaded less over time compared to the early days, although the curve has been pretty stable in the last 3-4 years:
Downloads were lower than decoders' at one point, ahah! Probably (a) big release(s)
Really interesting! Thanks for sharing! I wasn't surprised by the NLP domination, but I was by that of the encoders; curious how much this is changing given that most releases are decoders.
Side note: you seem to have missed the translation of one part (ctrl+f: présent)
Model statistics of the 50 most downloaded entities on Hugging Face
Entropic Instruction Following: Does Semantic Coherence Help LLMs Follow Instructions?
Tricks from OpenAI gpt-oss YOU 🫵 can use with transformers
When the loss spikes, and you see that the training files contain `import trackio as wandb`
Introducing Trackio: A Lightweight Experiment Tracking Library from Hugging Face
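For anyone who missed the joke's premise: Trackio deliberately mirrors the wandb API surface, which is what makes that aliased import work unchanged. A minimal sketch (project name and metrics are made up):

import trackio as wandb  # drop-in alias; the calls below keep their wandb shape

wandb.init(project="my-experiment")
for step in range(3):
    wandb.log({"loss": 1.0 / (step + 1)})
wandb.finish()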
I actually made a follow-up to this, you might find it interesting: https://www.linkedin.com/pulse/do-you-need-matryoshka-model-sifal-klioui-k2jyf

