---
license: mit
language:
- en
base_model:
- NousResearch/Hermes-3-Llama-3.1-8B
---

# Inference with Your Model

This guide explains how to run inference with your custom model using the Hugging Face `transformers` library.

## Prerequisites

Make sure you have the following dependencies installed:

- Python 3.7+
- PyTorch
- Hugging Face `transformers` library

You can install the required packages using pip:

```bash
git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout  # optionally pin a specific commit here
pip install .
pip install -q accelerate==0.34.2 bitsandbytes==0.44.1 peft==0.13.1
```

```py
import torch
from transformers import BitsAndBytesConfig

# 4-bit quantization config: NF4 quantization with double quantization and bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4'
)
```

```py
# Load model & tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Ahanaas/Hermes-3-Llama-3.1-8B_finetune_prashu"

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    quantization_config=bnb_config,
    device_map=0,
)

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="right", use_fast=False)
tokenizer.pad_token = tokenizer.eos_token
```

```py
# Run a text-generation pipeline with the loaded model
from transformers import pipeline

system_prompt = ''''''  # fill in your system prompt
prompt = ''''''         # fill in your user prompt

pipe = pipeline(
    task="text-generation",
    model=base_model,
    tokenizer=tokenizer,
    max_new_tokens=128,   # Increase this to allow for longer outputs
    temperature=0.4,      # Lower values give more focused outputs, higher values more varied
    top_k=50,             # Limits sampling to the top 50 tokens
    do_sample=True,       # Enables sampling
    return_full_text=True
)

# ChatML-style prompt; the trailing assistant tag cues the model to generate its reply
result = pipe(
    f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
    f"<|im_start|>user\n{prompt}<|im_end|>\n"
    f"<|im_start|>assistant\n"
)

generated_text = result[0]['generated_text']
print(generated_text)
```
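
If the tokenizer for this checkpoint ships a chat template (the Hermes 3 base tokenizer does; this assumes the fine-tune inherits it), you can let `tokenizer.apply_chat_template` build the ChatML prompt from a message list instead of writing the `<|im_start|>` / `<|im_end|>` tokens by hand. A minimal sketch, reusing `base_model`, `tokenizer`, and `pipe` from above:

```py
# Build the prompt from a structured message list via the tokenizer's chat template.
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt},
]

# add_generation_prompt=True appends the opening assistant tag so the model
# continues with its reply rather than starting a new user turn.
chat_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# return_full_text=False strips the prompt so only the newly generated reply is returned.
result = pipe(chat_prompt, return_full_text=False)
print(result[0]['generated_text'])
```

With `return_full_text=False`, the printed output contains only the assistant's reply, so there is no need to manually split the prompt off the generated text.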