ÆRA-4B (GGUF 4-bit)

Overview

ÆRA is a specialized 4 billion parameter language model developed by AND EMILI as an enterprise-focused foundation for building intelligent agents and automation pipelines. Unlike general-purpose conversational models, ÆRA is intentionally designed with a narrow, practical focus on context-based reasoning and structured outputs.

Note: This repository contains the 4-bit GGUF quantized weights intended for local runtimes such as LM Studio, Ollama, and llama.cpp. It is not the full-precision 16-bit model. For the full-precision version (best for use via Transformers/vLLM), see and-emili/aera-4b on Hugging Face.

Key Capabilities

🇮🇹 Native Italian Language Support

ÆRA excels at understanding and generating Italian text, making it ideal for Italian-speaking enterprises and applications.

📄 Context-Only Responses

ÆRA is trained to rely exclusively on provided context rather than internal knowledge. When asked questions without relevant context, it will respond honestly:

"Currently I don't have access to information about the actors who played Dr. Who. Feel free to share content and I will analyze it and tell you what I can infer from it."

This behavior ensures reliability and reduces hallucination in enterprise applications.
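
A minimal sketch of this pattern, assuming a local OpenAI-compatible endpoint as set up in the Getting Started section below (the base_url and model name are placeholders that depend on your runtime): the relevant passage is passed in as context, and the model answers only from it.

from openai import OpenAI

# Assumed local endpoint and model name; adjust to your runtime (see Getting Started below)
client = OpenAI(base_url="http://localhost:1234/v1", api_key="na")

# Retrieved passage supplied as context (e.g. from a RAG pipeline)
context = "L'azienda Rossi S.p.A. ha chiuso il 2023 con un fatturato di 12 milioni di euro."

response = client.chat.completions.create(
    model="aera-4b",
    messages=[
        {"role": "system", "content": "Rispondi usando esclusivamente il contesto fornito."},
        {"role": "user", "content": f"Contesto:\n{context}\n\nDomanda: qual è stato il fatturato 2023?"},
    ],
    temperature=0.3,
)
print(response.choices[0].message.content)

Without the context block, the model is expected to decline as in the quote above rather than guess.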

🔧 Structured Output Generation

  • JSON Generation: Reliably produces well-formed JSON outputs
  • Entity Extraction: Identifies and extracts entities from provided text
  • Classification: Categorizes content based on given criteria
  • Sentiment Analysis: Analyzes emotional tone in context

๐Ÿ› ๏ธ Function Calling

Native support for tool use and function calling, enabling seamless integration into agentic workflows and automation pipelines.

Design Philosophy

ÆRA is not intended to be a general-knowledge assistant like ChatGPT. Instead, it serves as a lightweight, efficient starting point for enterprises exploring:

  • Retrieval Augmented Generation (RAG) implementations
  • Document analysis and information extraction
  • Automated workflows with structured outputs
  • Multi-agent systems requiring reliable, predictable behavior

Use Cases

This model is ideal for companies looking to:

  • Test the viability of RAG systems for their specific needs
  • Build proof-of-concepts for document processing pipelines
  • Implement lightweight automation without cloud dependencies
  • Evaluate whether LLM-based solutions fit their requirements

If initial tests with ÆRA prove successful, organizations can then invest in developing more specialized, powerful models tailored to their specific domain needs.

Technical Details

  • Parameters: 4 billion
  • Training: Post-trained on synthetic data focused on structured reasoning and Italian language tasks
  • Deployment: Optimized for local deployment on standard hardware
  • Privacy: Runs entirely on-premises with no external API calls

Precision & Memory

  • This release is a 4-bit GGUF quantized build optimized for local inference with llama.cpp-compatible runtimes (LM Studio, Ollama, llama.cpp).
  • 4-bit quantization dramatically reduces memory usage and enables CPU-only or modest-GPU setups, with a small quality trade-off versus the full-precision model and-emili/aera-4b.
  • Effective memory usage also depends on context length and batch size due to KV cache; adjust these based on your hardware.
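
As a rough illustration of the KV-cache contribution, the sketch below estimates cache size per token from the attention configuration. The layer and head counts are assumptions taken from the published Qwen3-4B configuration (the base architecture listed for this model); actual usage depends on the runtime and on whether the KV cache itself is quantized.

# Back-of-the-envelope KV-cache estimate (assumed Qwen3-4B-style config; check the GGUF metadata for exact values)
n_layers = 36        # transformer layers (assumption)
n_kv_heads = 8       # grouped-query attention KV heads (assumption)
head_dim = 128       # dimension per head (assumption)
bytes_per_value = 2  # f16 cache entries; some runtimes can quantize the KV cache further

kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # keys + values
for ctx in (2048, 8192, 32768):
    print(f"context {ctx:>6}: ~{kv_bytes_per_token * ctx / 1024**3:.2f} GiB of KV cache")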

Getting Started (GGUF 4-bit)

Below are examples for common local runtimes. If you need to use Python/Transformers or vLLM, please use the full-precision model and-emili/aera-4b instead.

LM Studio

  • Open LM Studio → Models → Download.
  • Search for and-emili/aera-4b-GGUF and select a 4-bit variant (e.g., Q4_K_M).
  • After download, start a local chat and prompt in Italian or English (the model is context-following; plain prompts work well).

Ollama

  1. Download a 4-bit GGUF file from the Hugging Face repo, then create a Modelfile that points at it (adjust the filename to match the file you downloaded):
FROM ./aera-4b-q4_k_m.gguf
PARAMETER num_ctx 8192
  2. Create and run the model:
ollama create aera-4b -f Modelfile
ollama run aera-4b

llama.cpp (CLI)

Assuming you downloaded a 4-bit GGUF file locally (recent llama.cpp builds name the CLI binary llama-cli; older builds use ./main):

./llama-cli -m ./aera-4b-q4_k_m.gguf -p "Chi sei?" -n 200 --temp 0.3 --ctx-size 8192
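
llama.cpp also ships an OpenAI-compatible HTTP server. A minimal invocation (path, context size, and port are placeholders to adjust for your setup):

./llama-server -m ./aera-4b-q4_k_m.gguf --ctx-size 8192 --port 8080

Once it is running, the OpenAI-compatible example in the next section can point at http://localhost:8080/v1 instead.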

OpenAI-Compatible API (local servers)

You can expose an OpenAI-compatible endpoint using LM Studio's local server (default http://localhost:1234/v1) or Ollama (default http://localhost:11434/v1) and call it with the official OpenAI SDK. Example using LM Studio's default port:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio typically http://localhost:1234/v1
    api_key="na",  # not used by Ollama/LM Studio
)

messages = [
    {"role": "user", "content": "Chi sei?"}
]

completion = client.chat.completions.create(
    model="aera-4b",  # the local model name you created
    messages=messages,
    temperature=0.3,
    max_tokens=200,
)

print(completion.choices[0].message.content)

Note: Advanced features like function-calling or strict JSON schema parsing depend on the chosen runtime. For maximum compatibility with structured outputs and Pydantic parsing, prefer the full-precision model served via frameworks like vLLM.

Advanced Examples (LM Studio, GGUF)

Structured JSON extraction (Business Email Analysis)

from openai import OpenAI

client = OpenAI(
    base_url="http://0.0.0.0:1234/v1",
    api_key="lm-studio"  # Not used by LM Studio but required by OpenAI client
)

# Sample business email for analysis
business_text = """
Subject: Urgent: Project Delivery Delay and Budget Concerns

Dear Team,

I'm writing to inform you about some critical issues with the Q4 marketing campaign project. 
Unfortunately, we're facing a 3-week delay due to vendor complications with Acme Corp and 
budget overruns of approximately $15,000. The client, Johnson & Associates, is extremely 
frustrated and threatening to terminate the contract.

Our lead designer, Sarah Martinez, has been working overtime to resolve the creative issues, 
but we still need approval from the legal department for the new compliance requirements. 
The project manager, Mike Chen, estimates we need an additional 2 weeks and $8,000 to 
complete the deliverables.

I'm scheduling an emergency meeting for tomorrow at 2 PM to discuss next steps. Please 
prioritize this matter as it could impact our relationship with our biggest client.

Best regards,
Jennifer Thompson
Project Director
"""

response = client.chat.completions.create(
    model="and emili/aera/aera-4b-q4_k_m.gguf",
    messages=[
        {
            "role": "system",
            "content": "You are a business analyst AI that extracts key information from business communications. Analyze the provided text and extract entities, classify the urgency level, and determine the sentiment."
        },
        {
            "role": "user",
            "content": f"Please analyze this business communication and extract key information:\n\n{business_text}"
        }
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "business_analysis",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "entities": {
                        "type": "object",
                        "properties": {
                            "people": {
                                "type": "array",
                                "items": {"type": "string"}
                            },
                            "companies": {
                                "type": "array",
                                "items": {"type": "string"}
                            },
                            "monetary_amounts": {
                                "type": "array",
                                "items": {"type": "string"}
                            },
                            "dates_timeframes": {
                                "type": "array",
                                "items": {"type": "string"}
                            }
                        },
                        "required": ["people", "companies", "monetary_amounts", "dates_timeframes"]
                    },
                    "classification": {
                        "type": "object",
                        "properties": {
                            "urgency_level": {
                                "type": "string",
                                "enum": ["low", "medium", "high", "critical"]
                            },
                            "document_type": {
                                "type": "string",
                                "enum": ["email", "report", "memo", "proposal", "other"]
                            },
                            "primary_topic": {
                                "type": "string"
                            }
                        },
                        "required": ["urgency_level", "document_type", "primary_topic"]
                    },
                    "sentiment_analysis": {
                        "type": "object",
                        "properties": {
                            "overall_sentiment": {
                                "type": "string",
                                "enum": ["positive", "neutral", "negative"]
                            },
                            "confidence_score": {
                                "type": "number",
                                "minimum": 0,
                                "maximum": 1
                            },
                            "key_concerns": {
                                "type": "array",
                                "items": {"type": "string"}
                            }
                        },
                        "required": ["overall_sentiment", "confidence_score", "key_concerns"]
                    }
                },
                "required": ["entities", "classification", "sentiment_analysis"]
            }
        }
    },
    temperature=0.3,
    max_tokens=800,
    stream=False
)

print(response.choices[0].message.content)
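
Because the request asks for a strict JSON schema, the reply should parse directly into the structure defined above. A small follow-up sketch (field names come from the schema; the actual values depend on the model's output):

import json

analysis = json.loads(response.choices[0].message.content)
print(analysis["classification"]["urgency_level"])         # e.g. "high" or "critical"
print(analysis["entities"]["people"])                       # e.g. ["Jennifer Thompson", "Sarah Martinez", "Mike Chen"]
print(analysis["sentiment_analysis"]["overall_sentiment"])  # e.g. "negative"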

Tool use demo (Wikipedia, Italian)

"""
Demo di Utilizzo Strumenti LM Studio: Chatbot per Query Wikipedia
Dimostra come un modello LM Studio puรฒ interrogare Wikipedia
"""

# Standard library imports
import itertools
import json
import shutil
import sys
import threading
import time
import urllib.parse
import urllib.request

# Third-party imports
from openai import OpenAI

# Initialize the LM Studio client
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
MODEL = "and emili/aera/aera-4b-q4_k_m.gguf"


def fetch_wikipedia_content(search_query: str) -> dict:
    """Recupera contenuto wikipedia per una data search_query"""
    try:
        # Wikipedia richiede un User-Agent descrittivo
        USER_AGENT = (
            "Aera-Wikipedia-Demo/1.0 (+https://lmstudio.ai; contact: [email protected])"
        )
        REQUEST_HEADERS = {"User-Agent": USER_AGENT, "Accept": "application/json"}

        # Search for the most relevant article
        search_url = "https://it.wikipedia.org/w/api.php"
        search_params = {
            "action": "query",
            "format": "json",
            "list": "search",
            "srsearch": search_query,
            "srlimit": 1,
        }

        url = f"{search_url}?{urllib.parse.urlencode(search_params)}"
        search_request = urllib.request.Request(url, headers=REQUEST_HEADERS)
        with urllib.request.urlopen(search_request) as response:
            search_data = json.loads(response.read().decode())

        if not search_data["query"]["search"]:
            return {
                "status": "error",
                "message": f"Nessun articolo Wikipedia trovato per '{search_query}'",
            }

        # Get the normalized title from the search results
        normalized_title = search_data["query"]["search"][0]["title"]

        # Now fetch the actual content using the normalized title
        content_params = {
            "action": "query",
            "format": "json",
            "titles": normalized_title,
            "prop": "extracts",
            "exintro": "true",
            "explaintext": "true",
            "redirects": 1,
        }

        url = f"{search_url}?{urllib.parse.urlencode(content_params)}"
        content_request = urllib.request.Request(url, headers=REQUEST_HEADERS)
        with urllib.request.urlopen(content_request) as response:
            data = json.loads(response.read().decode())

        pages = data["query"]["pages"]
        page_id = list(pages.keys())[0]

        if page_id == "-1":
            return {
                "status": "error",
                "message": f"Nessun articolo Wikipedia trovato per '{search_query}'",
            }

        content = pages[page_id]["extract"].strip()
        return {
            "status": "success",
            "content": content,
            "title": pages[page_id]["title"],
        }

    except Exception as e:
        return {"status": "error", "message": str(e)}


# Define the Wikipedia tool for LM Studio
WIKI_TOOL = {
    "type": "function",
    "function": {
        "name": "fetch_wikipedia_content",
        "description": (
            "Cerca su Wikipedia e recupera l'introduzione dell'articolo piรน rilevante. "
            "Usa sempre questo se l'utente sta chiedendo qualcosa che probabilmente รจ su wikipedia. "
            "Se l'utente ha un errore di battitura nella sua query di ricerca, correggilo prima di cercare."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "search_query": {
                    "type": "string",
                    "description": "Query di ricerca per trovare l'articolo Wikipedia",
                },
            },
            "required": ["search_query"],
        },
    },
}


# Class that shows a spinner while the model is processing
class Spinner:
    def __init__(self, message="Elaborazione..."):
        self.spinner = itertools.cycle(["-", "/", "|", "\\"])
        self.busy = False
        self.delay = 0.1
        self.message = message
        self.thread = None

    def write(self, text):
        sys.stdout.write(text)
        sys.stdout.flush()

    def _spin(self):
        while self.busy:
            self.write(f"\r{self.message} {next(self.spinner)}")
            time.sleep(self.delay)
        self.write("\r\033[K")  # Pulisci la riga

    def __enter__(self):
        self.busy = True
        self.thread = threading.Thread(target=self._spin)
        self.thread.start()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.busy = False
        time.sleep(self.delay)
        if self.thread:
            self.thread.join()
        self.write("\r")  # Sposta il cursore all'inizio della riga


def chat_loop():
    """
    Loop di chat principale che elabora l'input dell'utente e gestisce le chiamate agli strumenti.
    """
    messages = [
        {
            "role": "system",
            "content": (
                "Sei un assistente che puรฒ recuperare articoli di Wikipedia. "
                "Quando ti viene chiesto di un argomento, puoi recuperare articoli di Wikipedia "
                "e citare informazioni da essi. Rispondi sempre in italiano."
            ),
        }
    ]

    print(
        "Assistente: "
        "Ciao! Posso accedere a Wikipedia per aiutarti a rispondere alle tue domande su storia, "
        "scienza, persone, luoghi o concetti - oppure possiamo semplicemente chattare di "
        "qualsiasi altra cosa!"
    )
    print("(Scrivi 'esci' per uscire)")

    while True:
        user_input = input("\nTu: ").strip()
        if user_input.lower() in ["esci", "quit", "exit"]:
            break

        messages.append({"role": "user", "content": user_input})
        try:
            with Spinner("Sto pensando..."):
                response = client.chat.completions.create(
                    model=MODEL,
                    messages=messages,
                    tools=[WIKI_TOOL],
                )

            if response.choices[0].message.tool_calls:
                # Handle all tool calls
                tool_calls = response.choices[0].message.tool_calls

                # Append every tool call to the message history
                messages.append(
                    {
                        "role": "assistant",
                        "tool_calls": [
                            {
                                "id": tool_call.id,
                                "type": tool_call.type,
                                "function": tool_call.function,
                            }
                            for tool_call in tool_calls
                        ],
                    }
                )

                # Process each tool call and append its result
                for tool_call in tool_calls:
                    args = json.loads(tool_call.function.arguments)
                    result = fetch_wikipedia_content(args["search_query"])

                    # Pretty-print the fetched Wikipedia content
                    terminal_width = shutil.get_terminal_size().columns
                    print("\n" + "=" * terminal_width)
                    if result["status"] == "success":
                        print(f"\nArticolo Wikipedia: {result['title']}")
                        print("-" * terminal_width)
                        print(result["content"])
                    else:
                        print(
                            f"\nErrore nel recupero del contenuto Wikipedia: {result['message']}"
                        )
                    print("=" * terminal_width + "\n")

                    messages.append(
                        {
                            "role": "tool",
                            "content": json.dumps(result),
                            "tool_call_id": tool_call.id,
                        }
                    )

                # Stream the follow-up response after the tool calls
                print("\nAssistente:", end=" ", flush=True)
                stream_response = client.chat.completions.create(
                    model=MODEL, messages=messages, stream=True
                )
                collected_content = ""
                for chunk in stream_response:
                    if chunk.choices[0].delta.content:
                        content = chunk.choices[0].delta.content
                        print(content, end="", flush=True)
                        collected_content += content
                print()  # Newline after streaming completes
                messages.append(
                    {
                        "role": "assistant",
                        "content": collected_content,
                    }
                )
            else:
                # Handle a normal (non-tool) response
                print("\nAssistente:", response.choices[0].message.content)
                messages.append(
                    {
                        "role": "assistant",
                        "content": response.choices[0].message.content,
                    }
                )

        except Exception as e:
            print(
                f"\nErrore nella comunicazione con il server LM Studio!\n\n"
                f"Assicurati che:\n"
                f"1. Il server LM Studio sia in esecuzione su 0.0.0.0:1234 (hostname:porta)\n"
                f"2. Il modello '{MODEL}' sia scaricato\n"
                f"3. Il modello '{MODEL}' sia caricato, o che il caricamento just-in-time del modello sia abilitato\n\n"
                f"Dettagli errore: {str(e)}\n"
                "Vedi https://lmstudio.ai/docs/basics/server per maggiori informazioni"
            )
            sys.exit(1)


if __name__ == "__main__":
    chat_loop()

Advanced Use Cases

For more complex examples including:

  • Customer support automation
  • Meeting notes summarization
  • Contract information extraction

Check the examples in our GitHub repository.

Limitations

  • Does not provide information beyond what's in the given context
  • Not suitable for open-ended creative tasks or general knowledge queries
  • Optimized for Italian; performance may vary in other languages
  • Designed for specific enterprise use cases, not general conversation

About AND EMILI

AND EMILI specializes in developing practical AI solutions for enterprise automation and intelligence augmentation.


License: Apache 2.0

Model size: 4B parameters
Architecture: Qwen3
Base model: and-emili/aera-4b