---
library_name: transformers
license: apache-2.0
datasets:
- omarkamali/wikipedia-monthly
- alibayram/hukuk_soru_cevap
- umutertugrul/turkish-hospital-medical-articles
- umutertugrul/turkish-medical-articles
- alibayram/tr-books
- selimfirat/bilkent-turkish-writings-dataset
- umutertugrul/turkish-academic-theses-dataset
- alibayram/onedio_haberler
- habanoz/news-tr-1.8M
- alibayram/hepsiburada_yorumlar
- alibayram/kitapyurdu_yorumlar
- alibayram/beyazperde_yorumlar
language:
- tr
---

# Description:

This is a Byte Pair Encoding (BPE) tokenizer trained specifically for Turkish text. It was trained on a curated subset (~30 MB drawn from each dataset) of multiple Turkish datasets covering news, academic texts, legal Q&A, medical articles, books, and user reviews. The goal is to provide a high-quality subword tokenizer suitable for training or fine-tuning Turkish language models.

# Vocab_size: 32768

**Training datasets (~30 MB from each):**

- omarkamali/wikipedia-monthly
- alibayram/hukuk_soru_cevap
- umutertugrul/turkish-hospital-medical-articles
- umutertugrul/turkish-medical-articles
- alibayram/tr-books
- selimfirat/bilkent-turkish-writings-dataset
- umutertugrul/turkish-academic-theses-dataset
- alibayram/onedio_haberler
- habanoz/news-tr-1.8M
- alibayram/hepsiburada_yorumlar
- alibayram/kitapyurdu_yorumlar
- alibayram/beyazperde_yorumlar

Total: ~360 MB

# Usage:

```python
from transformers import AutoTokenizer

# Load the fast (Rust-backed) tokenizer from the Hugging Face Hub
fast_tokenizer = AutoTokenizer.from_pretrained(
    "AhmetSemih/merged_dataset-32k-bpe-tokenizer", use_fast=True
)

# Encode a Turkish sentence into token IDs
fast_tokenizer.encode("Bugün hava çok güzel.")
```

# Intended_use:

- Training and fine-tuning Turkish language models
- Tokenization of Turkish text for NLP tasks (classification, summarization, question answering)
- Research and educational purposes
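
# Example:

A minimal sketch of inspecting the subword segmentation and round-tripping text with the tokenizer; the exact token splits and IDs depend on the trained 32k BPE vocabulary, so the printed output here is illustrative rather than definitive.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "AhmetSemih/merged_dataset-32k-bpe-tokenizer", use_fast=True
)

text = "Bugün hava çok güzel."

# Encode to token IDs, then map the IDs back to their subword strings
ids = tokenizer.encode(text)
tokens = tokenizer.convert_ids_to_tokens(ids)
print(tokens)

# Decoding the IDs should reconstruct the original sentence
print(tokenizer.decode(ids, skip_special_tokens=True))

# Batch tokenization, as used when preparing Turkish text for
# training or fine-tuning a language model
batch = tokenizer(["Bugün hava çok güzel.", "Yarın toplantı var."])
print(batch["input_ids"])
```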