---
license: apache-2.0
base_model:
- google-bert/bert-base-chinese
metrics:
- accuracy
language:
- en
- zh
pipeline_tag: fill-mask
---

# Herbert: Pretrained BERT Model for Herbal Medicine

**Herbert** is a pretrained model for herbal medicine research, built on the `bert-base-chinese` model. It has been fine-tuned on domain-specific data from 675 ancient books and 32 Traditional Chinese Medicine (TCM) textbooks, and is designed to support a variety of TCM-related NLP tasks.

---

## Introduction

This model is optimized for TCM-related tasks, including but not limited to:

- Herbal formula encoding
- Domain-specific word embeddings
- Classification, labeling, and sequence-prediction tasks in TCM research

Herbert combines modern pretraining techniques with domain knowledge, allowing it to excel at TCM-related text-processing tasks.

---

## Model Config

```json
{
  "hidden_size": 1024,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "torch_dtype": "float32",
  "vocab_size": 21128
}
```

A short sketch for checking these values against the published checkpoint is given in the appendix at the end of this card.

### Requirements

- `transformers` 4.45.1 (the `transformers_version` the checkpoint was saved with)

### Quickstart

#### Use Hugging Face

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Hugging Face model repository name (replace it if you host the model elsewhere)
model_name = "Chengfengke/herbert"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Input text: "TCM theory is a treasure of our traditional culture."
text = "中医理论是我国传统文化的瑰宝。"

# Tokenize and prepare input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)

# Get the model's outputs
with torch.no_grad():
    outputs = model(**inputs)

# Sentence embedding via average pooling over the token dimension
# (padding positions are included in this simple average)
sentence_embedding = outputs.last_hidden_state.mean(dim=1)

print("Embedding shape:", sentence_embedding.shape)
print("Embedding vector:", sentence_embedding)
```

#### Local Model

```python
from transformers import BertTokenizer, BertForMaskedLM

# Load the model and tokenizer; the repository name can be replaced
# with the path to a local checkpoint directory
model_name = "Chengfengke/herbert"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

inputs = tokenizer("This is an example text for herbal medicine.", return_tensors="pt")
outputs = model(**inputs)  # outputs.logits holds per-token vocabulary scores
```

A minimal `fill-mask` pipeline sketch is also given in the appendix at the end of this card.

## Citation

If you find our work helpful, please cite us.

```bibtex
@misc{herbert-embedding,
  title  = {Herbert: A Pretrain_Bert_Model for TCM_herb and downstream Tasks as Text Embedding Generation},
  author = {Yehan Yang and Xinhan Zheng},
  month  = {December},
  year   = {2024}
}

@article{herbert-technical-report,
  title       = {Herbert: A Pretrain_Bert_Model for TCM_herb and downstream Tasks as Text Embedding Generation},
  author      = {Yehan Yang and Xinhan Zheng},
  institution = {Beijing Angopro Technology Co., Ltd.},
  year        = {2024},
  note        = {Presented at the 2024 Machine Learning Applications Conference (MLAC)}
}
```
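---

## Appendix: Verifying the Model Config

As a quick sanity check on the configuration listed above, the snippet below is a minimal sketch, assuming the `Chengfengke/herbert` repository exposes a standard `config.json`; `AutoConfig.from_pretrained` fetches only the configuration, not the weights.

```python
from transformers import AutoConfig

# Download only config.json and compare it with the values on this card
config = AutoConfig.from_pretrained("Chengfengke/herbert")

print(config.hidden_size)          # expected: 1024
print(config.num_hidden_layers)    # expected: 24
print(config.num_attention_heads)  # expected: 16
print(config.vocab_size)           # expected: 21128
```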
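## Appendix: Fill-Mask Pipeline Sketch

Since the card's `pipeline_tag` is `fill-mask`, the high-level `pipeline` API should also work with the masked-LM checkpoint. This is a minimal sketch rather than an official example from the authors: the Chinese prompt (roughly "TCM theory is a treasure of our traditional cul[MASK]") and the `top_k` value are illustrative.

```python
from transformers import pipeline

# Build a fill-mask pipeline; [MASK] marks the token to predict
fill = pipeline("fill-mask", model="Chengfengke/herbert")

# Print the three most likely completions with their scores
for pred in fill("中医理论是我国传统文化的瑰[MASK]。", top_k=3):
    print(pred["token_str"], round(pred["score"], 4))
```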