---
license: mit
language:
- en
base_model:
- NousResearch/Hermes-3-Llama-3.1-8B
---

# Inference with Your Model

This guide explains how to run inference with your custom model using the Hugging Face `transformers` library.

## Prerequisites

Make sure you have the following dependencies installed:

- Python 3.7+
- PyTorch
- Hugging Face `transformers` library

You can install the required packages using pip:

```bash
git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout  # optionally pin a specific commit here
pip install .
pip install -q accelerate==0.34.2 bitsandbytes==0.44.1 peft==0.13.1
```

```py
import torch
from transformers import BitsAndBytesConfig

# 4-bit quantization config: NF4 quantization with double quantization and bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4'
)
```

```py
# Load model & tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Ahanaas/Hermes-3-Llama-3.1-8B_finetune_prashu"

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    quantization_config=bnb_config,
    device_map=0,
)

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="right", use_fast=False)
tokenizer.pad_token = tokenizer.eos_token
```

```py
# Run a text-generation pipeline with the loaded model
from transformers import pipeline

system_prompt = ''''''  # fill in your system prompt
prompt = ''''''         # fill in your user prompt

pipe = pipeline(
    task="text-generation",
    model=base_model,
    tokenizer=tokenizer,
    max_new_tokens=128,   # Increase this to allow for longer outputs
    temperature=0.4,      # Lower values give more focused outputs, higher values more varied
    top_k=50,             # Limits sampling to the top 50 tokens
    do_sample=True,       # Enables sampling
    return_full_text=True
)

# ChatML-style prompt; the trailing assistant tag cues the model to generate its reply
result = pipe(
    f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
    f"<|im_start|>user\n{prompt}<|im_end|>\n"
    f"<|im_start|>assistant\n"
)

generated_text = result[0]['generated_text']
print(generated_text)
```
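
If the tokenizer for this checkpoint ships a chat template (the Hermes 3 base tokenizer does; this assumes the fine-tune inherits it), you can let `tokenizer.apply_chat_template` build the ChatML prompt from a message list instead of writing the `<|im_start|>` / `<|im_end|>` tokens by hand. A minimal sketch, reusing `base_model`, `tokenizer`, and `pipe` from above:

```py
# Build the prompt from a structured message list via the tokenizer's chat template.
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt},
]

# add_generation_prompt=True appends the opening assistant tag so the model
# continues with its reply rather than starting a new user turn.
chat_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# return_full_text=False strips the prompt so only the newly generated reply is returned.
result = pipe(chat_prompt, return_full_text=False)
print(result[0]['generated_text'])
```

With `return_full_text=False`, the printed output contains only the assistant's reply, so there is no need to manually split the prompt off the generated text.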