Brello Thinking
Model Description
Brello Thinking is an advanced language model created by Epic Systems as part of the Brello AI family. Built on the Tencent Hunyuan base model, it specializes in deep reasoning, mathematical problem-solving, coding, and creative thinking, with enhanced chain-of-thought capabilities.
Key Features
- Advanced Reasoning: Enhanced chain-of-thought with both fast and deep thinking modes
- Mathematical Excellence: Strong at math and symbolic computation
- Programming Prowess: Strong coding abilities across Python, JS, C++, SQL, and more
- Long Context Understanding: Handles up to 256K tokens, covering long documents and codebases
- Creative Problem Solving: Generates new solutions and approaches
- Multi-language Support: Fluent in English and Chinese, with robust cross-lingual transfer
1. Executive Summary
Brello Thinking v1.1.0 (2025-08-07) is a 1.8B-parameter causal language model engineered for complex reasoning, mathematics, and creative tasks. It combines ultra-long context, dual “fast”/“deep” thinking modes, and a plugin SDK for live tool integration. It is designed for safe, sustainable, and fair production deployments.
Highlights in this Release
- Mixed-precision quantization (BF16 & INT8)
- Plugin SDK (JSON-RPC, HMAC auth, dynamic tool routing)
- Monitoring (Prometheus, Grafana, carbon tracking)
- Sustainability Dashboard (gCO₂eq/token metrics, CodeCarbon SDK)
2. Model Architecture
| Component | Specification |
|---|---|
| Base Model | Tencent Hunyuan / EpicBrelloV1ForCausalLM |
| Parameters | 1.8B (BF16/INT8 quantization; LoRA adapters optional) |
| Context Window | 256,000 tokens (rotary cache, sliding window, eviction logic) |
| Attention | Grouped-Query + Multi-Head FlashAttention (16 heads, 4 KV heads) |
| Feed-Forward | Two-stage (SiLU → Linear → SiLU) with RMSNorm, hidden size 6144 |
| Depth | 32 transformer blocks + 4 “Safety Adapter” blocks |
| Adapters | LoRA for math, code, creative, and domain fine-tuning (10–18M params each) |
| Inference Modes | Autoregressive sampling (top-k, top-p), beam, contrastive decoding |
| Sharding | ZeRO-3 / tensor-parallel / model-parallel combinations |
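The effect of grouped-query attention on a 256K-token window is easiest to see as arithmetic. The sketch below estimates the key/value cache size for one sequence from the figures in this card (32 blocks, 16 heads, 4 KV heads, and the d_model of 2048 from the hyperparameter table). It assumes a head dimension of d_model / num_heads, BF16 storage, and a full-length cache with no sliding-window eviction, and it ignores the 4 Safety Adapter blocks. The function name is illustrative, not part of any Brello API.

```python
def kv_cache_bytes(seq_len, num_layers=32, kv_heads=4, d_model=2048,
                   num_heads=16, bytes_per_value=2):
    """Estimate the KV-cache size in bytes for a single sequence.

    Assumption: head_dim = d_model / num_heads, and each layer stores one key
    and one value vector per KV head per token (BF16 = 2 bytes per value).
    """
    head_dim = d_model // num_heads
    values_per_token = 2 * num_layers * kv_heads * head_dim  # keys + values
    return seq_len * values_per_token * bytes_per_value


if __name__ == "__main__":
    gqa = kv_cache_bytes(256_000)               # 4 KV heads, as in this card
    mha = kv_cache_bytes(256_000, kv_heads=16)  # hypothetical: one KV head per query head
    print(f"GQA: {gqa / 2**30:.1f} GiB, full multi-head: {mha / 2**30:.1f} GiB")
```

Under these assumptions, sharing 4 KV heads across 16 query heads needs one quarter of the cache that a 16-KV-head layout would, which is part of what makes very long contexts practical.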
3. Training & Tuning
3.1 Pretraining Corpus
- Web General: 400B tokens (CommonCrawl, CC-100, curated news)
- Science/Technical: 50B tokens (arXiv, PubMed, patents)
- Code: 20B tokens (public GitHub, CodeSearchNet, MBPP)
- Multilingual: 30B tokens (Chinese, Spanish, German, Arabic)
- Augmentations: 15% span corruption, zh–en back-translation, dynamic masking
3.2 Optimization
- Optimizer: AdamW (β₁=0.9, β₂=0.95, weight_decay=0.01)
- LR Schedule: Linear warmup (10K steps), cosine decay (500K steps)
- Batch: 2M tokens/step, grad accumulation ×8
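The learning-rate schedule above can be sketched as a single function. The card does not state the peak learning rate, the floor, or whether the 500K decay steps include the warmup, so `peak_lr`, `min_lr`, and the choice to start the cosine decay after warmup are assumptions made for illustration.

```python
import math


def learning_rate(step, peak_lr=3e-4, warmup_steps=10_000,
                  decay_steps=500_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    peak_lr and min_lr are hypothetical values; only the step counts
    (10K warmup, 500K decay) come from the card.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, capped at 1.0 afterwards.
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 5_000, 10_000, 260_000, 510_000):
        print(s, learning_rate(s))
```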
3.3 Instruction/RLHF Tuning
- Instruction Pairs: 1.2M human-annotated QA and reasoning pairs
- Reward Model: Dual human-preference ranking (5K raters, Elo)
- Algorithm: PPO with KL penalty (target KL=0.1), reward clipping
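As a rough illustration of how a KL penalty and reward clipping typically combine in PPO-based RLHF, the sketch below shapes a reward-model score and nudges the penalty coefficient toward the target KL of 0.1. The card does not say whether the coefficient is fixed or adaptive, so the proportional controller, `beta`, `clip`, and `step_size` are all assumptions, not the training recipe actually used.

```python
def shaped_reward(reward, logprob_policy, logprob_ref, beta=0.05, clip=5.0):
    """Clip the reward-model score, then subtract a KL penalty.

    The KL term is the usual single-sample estimate
    log pi(a|s) - log pi_ref(a|s) for the sampled token.
    beta and clip are hypothetical values.
    """
    clipped = max(-clip, min(clip, reward))
    kl_estimate = logprob_policy - logprob_ref
    return clipped - beta * kl_estimate


def update_beta(beta, observed_kl, target_kl=0.1, step_size=0.1):
    """Proportional controller: raise beta when KL exceeds the target, lower it otherwise."""
    error = max(-0.2, min(0.2, observed_kl / target_kl - 1.0))
    return beta * (1.0 + step_size * error)


if __name__ == "__main__":
    print(shaped_reward(10.0, logprob_policy=-1.0, logprob_ref=-1.5))
    print(update_beta(0.05, observed_kl=0.2))
```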
4. Specialized Modules
| Adapter Name | Data Source | Params (M) | Use Case |
|---|---|---|---|
| math-adapter | GSM8K, MATH, AIME datasets | 12 | Math proof, step-by-step logic |
| code-adapter | MBPP, MultiPL-E, GitHub repos | 18 | Coding, debugging, codegen |
| creative-adapter | Gutenberg, story corpora | 10 | Narrative, dialogue, ideation |
5. Plugin & Tooling SDK
- Interface: JSON-RPC (Unix socket or REST), HMAC-SHA256 auth
- Plugins:
  - DB connectors: PostgreSQL, MySQL, Snowflake
  - HTTP client: retry/backoff
  - Vector DB: FAISS, Pinecone
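To make the JSON-RPC plus HMAC-SHA256 interface concrete, here is a minimal sketch of signing and verifying a request body with Python's standard library. The card does not specify the wire format, so the JSON-RPC 2.0 envelope, the canonical serialization, and the idea that the signature travels alongside the body (for example in a header) are assumptions. `sign_request` and `verify_request` are illustrative names, not SDK functions.

```python
import hashlib
import hmac
import json


def sign_request(method, params, secret, request_id=1):
    """Build a JSON-RPC 2.0 request body and its hex HMAC-SHA256 signature."""
    body = json.dumps(
        {"jsonrpc": "2.0", "method": method, "params": params, "id": request_id},
        separators=(",", ":"),
        sort_keys=True,  # stable serialization so both sides hash the same bytes
    ).encode("utf-8")
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return body, signature


def verify_request(body, signature, secret):
    """Recompute the signature and compare in constant time."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


if __name__ == "__main__":
    key = b"example-shared-secret"  # hypothetical key; load real keys from a secret store
    body, sig = sign_request("weather_fetch", {"location": "Mumbai"}, key)
    print(body.decode("utf-8"), sig, verify_request(body, sig, key))
```

Signing the exact bytes that are sent, rather than a re-serialized copy, avoids mismatches caused by key ordering or whitespace.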
Tool Call Example
- Model emits:
  {"tool_call": {"name": "weather_fetch", "args": {"location": "Mumbai"}}}
- Host executes plugin, returns:
  {"tool_result": {"forecast": "Sunny, 32°C"}}
- Model resumes reasoning with tool result in context.
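The three steps above can be sketched as a small host-side dispatcher. This is an illustration only: the `PLUGINS` registry, the stand-in `weather_fetch` function, and the `tool_error` key are hypothetical, and a real host would go through the plugin SDK rather than calling Python functions directly.

```python
import json


def weather_fetch(location):
    """Stand-in plugin; a real one would query a weather service for `location`."""
    return {"forecast": "Sunny, 32°C"}


# Hypothetical registry mapping tool names to callables.
PLUGINS = {"weather_fetch": weather_fetch}


def handle_model_output(raw):
    """Run the tool call in `raw` (a JSON object string) and wrap the result.

    Returns a JSON string to append to the model's context, or None when the
    message contains no tool call.
    """
    message = json.loads(raw)
    call = message.get("tool_call")
    if call is None:
        return None
    plugin = PLUGINS.get(call["name"])
    if plugin is None:
        return json.dumps({"tool_error": f"unknown tool: {call['name']}"})
    result = plugin(**call.get("args", {}))
    return json.dumps({"tool_result": result}, ensure_ascii=False)


if __name__ == "__main__":
    emitted = '{"tool_call": {"name": "weather_fetch", "args": {"location": "Mumbai"}}}'
    print(handle_model_output(emitted))
```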
6. Inference, Monitoring & Scaling
6.1 Endpoint Performance
| Mode | Batch Size | Sequence Length (tokens) | Throughput (tok/s) | Latency (p50) |
|---|---|---|---|---|
| Fast-Think | 8 | 4,096 | 250,000 | 15 ms |
| Deep-Think | 1 | 256,000 | 18,000 | 120 ms |
| INT8 Quant | 16 | 2,048 | 320,000 | 12 ms |
6.2 Observability
- Prometheus Metrics:
  - brello_inference_latency_seconds
  - brello_generated_tokens_total
  - brello_cache_evictions_total
- Grafana:
  - Token latency histograms, CO₂ per generation
7. Sustainability & Carbon Tracking
- Data Center PUE: 1.2
- Carbon Emission: ~0.0008 gCO₂eq/token (tracked with CodeCarbon)
- Offset: Epic Systems funds VER 2.0 credits
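For capacity planning, the per-token figure above converts directly into a workload estimate. The sketch below simply multiplies it out. Whether the quoted figure already accounts for the data center PUE is not stated in this card, so the result is best read as an order-of-magnitude estimate. The function name is illustrative.

```python
GRAMS_CO2EQ_PER_TOKEN = 0.0008  # approximate figure quoted in this section


def emissions_grams(num_tokens, grams_per_token=GRAMS_CO2EQ_PER_TOKEN):
    """Estimated emissions in grams of CO2-equivalent for a number of generated tokens."""
    return num_tokens * grams_per_token


if __name__ == "__main__":
    # One million generated tokens comes to roughly 800 g CO2eq at this rate.
    print(f"{emissions_grams(1_000_000):.0f} g CO2eq")
```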
8. Robustness, Safety & Fairness
- Adapters: Real-time adversarial input filtering, personal data redaction, toxicity classifier (fine-tuned BERT-tox)
- Bias Audits:
  - Toxicity variation <1.8% (12 demographic axes)
  - Gender parity ±2%
  - Dialect coverage 98% (EN & ZH)
9. Interpretability
- Chain-of-Thought logs: Token-level reasoning trace
- Integrated Gradients: Span attribution
- Attention Rollouts: Layer-wise visualization (custom plugin)
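As background on the Integrated Gradients entry, the sketch below shows the general technique on a toy function in plain Python. It is not Brello's attribution tooling: a real setup would take gradients of a model output with respect to token embeddings through an autograd framework, whereas here the gradient function is written by hand and the names are illustrative.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate IG_i = (x_i - b_i) * average of dF/dx_i along the straight
    path from the baseline b to the input x, using a midpoint Riemann sum."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grads = grad_fn(point)
        for i, g in enumerate(grads):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


def f(x):
    """Toy scalar function standing in for a model output."""
    return sum(v * v for v in x)


def grad_f(x):
    """Analytic gradient of f."""
    return [2.0 * v for v in x]


if __name__ == "__main__":
    x, baseline = [1.0, -2.0, 3.0], [0.0, 0.0, 0.0]
    attributions = integrated_gradients(grad_f, x, baseline)
    # Completeness check: attributions should sum to f(x) - f(baseline).
    print(attributions, sum(attributions), f(x) - f(baseline))
```

The completeness property (attributions summing to the change in output between baseline and input) is a useful sanity check when applying the method to spans of tokens.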
10. Hyperparameters
| Parameter | Value |
|---|---|
| num_layers | 32 |
| d_model | 2048 |
| d_hidden | 6144 |
| num_heads | 16 |
| kv_heads | 4 |
| rotary_pct | 0.25 |
| lr_warmup_steps | 10,000 |
| weight_decay | 0.01 |
| batch_size | 2M tokens |
| dropout_rate | 0.1 |
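These hyperparameters can be cross-checked against the stated 1.8B parameter count with a back-of-envelope estimate. Several inputs are not given in this card and are assumed here: a vocabulary of 128,000 tokens, a gated feed-forward block with three weight matrices, tied input/output embeddings, and no bias terms. The Safety Adapter blocks and LoRA adapters are excluded.

```python
def estimate_params(num_layers=32, d_model=2048, d_hidden=6144,
                    num_heads=16, kv_heads=4, vocab_size=128_000,
                    ffn_matrices=3, tie_embeddings=True):
    """Rough parameter count for a decoder-only transformer with grouped-query attention.

    vocab_size, ffn_matrices, and tie_embeddings are assumptions, not card values.
    """
    head_dim = d_model // num_heads
    kv_dim = kv_heads * head_dim
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim  # Q and O, plus K and V
    ffn = ffn_matrices * d_model * d_hidden
    norms = 2 * d_model                                   # two RMSNorm gain vectors per block
    embed = vocab_size * d_model * (1 if tie_embeddings else 2)
    return num_layers * (attn + ffn + norms) + embed + d_model  # + final norm


if __name__ == "__main__":
    print(f"{estimate_params() / 1e9:.2f}B parameters")
```

With these assumed values the estimate comes to about 1.81B, in the neighborhood of the stated 1.8B. Changing the vocabulary size or the feed-forward layout moves the total noticeably, so this is a plausibility check rather than a derivation.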
11. Evaluation & Error Analysis
- Benchmarks: GSM8K, MBPP, BBH, LongBench, MATH
- Analysis: Math/logic confusion matrix, hallucination drift cluster analysis
12. Roadmap
| Version | Highlights | ETA |
|---|---|---|
| v1.1.0 | Plugins, carbon tracking, INT8 quantization | Released |
| v1.2.0 | Vision-language, adapter expansion | Nov 2025 |
| v1.3.0 | Audio, multilingual tuning | Feb 2026 |
| v2.0 | Federated RAG, continuous learning | Q4 2026 |
13. Licensing & Compliance
- License: Proprietary, Epic Systems
- Privacy: GDPR, CCPA compliant
- Certifications: ISO 27001, SOC 2 Type II, HIPAA (BAA on request)
- Restrictions: No redistribution or large-scale rehosting
14. Usage Example
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
from brello_sdk import BrelloPluginManager
from codecarbon import EmissionsTracker
from prometheus_client import CollectorRegistry, Counter, Histogram, push_to_gateway
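def setup_model(
    model_id: str = "BrelloES/brello-thinking",
    use_bf16: bool = True,
    load_int8: bool = True,
):
    """Load the tokenizer and base model, then attach the math and code LoRA adapters."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        torch_dtype=torch.bfloat16 if use_bf16 else torch.float32,
        load_in_8bit=load_int8,  # INT8 weights; requires the bitsandbytes package
    )
    # Wrap the base model once, then load the second adapter into the same
    # wrapper instead of nesting one PeftModel inside another.
    model = PeftModel.from_pretrained(model, "adapters/math-adapter", adapter_name="math")
    model.load_adapter("adapters/code-adapter", adapter_name="code")
    # "math" stays active; switch with model.set_adapter("code") when needed.
    return tokenizer, model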
def setup_plugins():
    pm = BrelloPluginManager()
    pm.register(
        name="weather_fetch",
        path="/opt/brello/plugins/weather_plugin.so",
        auth_key=os.getenv("WEATHER_PLUGIN_KEY", "CHANGE_ME"),
    )
    pm.register(
        name="db_query",
        path="/opt/brello/plugins/db_query_plugin.so",
        auth_key=os.getenv("DB_PLUGIN_KEY", "CHANGE_ME"),
    )
    return pm
def setup_metrics():
    registry = CollectorRegistry()
    Histogram(
        "brello_inference_latency_seconds",
        "Inference latency (seconds) per request",
        registry=registry,
        buckets=(0.01, 0.05, 0.1, 0.2, 0.5, 1.0),
    )
    Counter(
        "brello_generated_tokens_total",
        "Total number of tokens generated by Brello",
        registry=registry,
    )
    return registry
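def generate_response(tokenizer, model, plugin_mgr, registry, messages, mode: str = "deep"):
    """Generate one reply and return (text, emissions in kg CO2eq)."""
    input_ids = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",  # generate() needs a tensor, not a plain list of ids
        enable_thinking=(mode == "deep"),
    ).to(model.device)
    tracker = EmissionsTracker(project_name="brello_inference", output_dir="carbon_logs")
    tracker.start()
    outputs = model.generate(
        input_ids,
        max_new_tokens=512,
        do_sample=True,  # top_p and temperature only take effect when sampling
        top_p=0.9,
        temperature=0.6,
        # Not a standard generate() argument; assumes the model's custom code accepts it.
        plugin_manager=plugin_mgr,
        return_dict_in_generate=True,
        output_scores=True,
    )
    emissions_kg = tracker.stop()
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = outputs.sequences[0][input_ids.shape[-1]:]
    text = tokenizer.decode(new_tokens, skip_special_tokens=True)
    return text, emissions_kg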
def main():
    tokenizer, model = setup_model()
    plugin_mgr = setup_plugins()
    registry = setup_metrics()
    messages = [
        {"role": "system", "content": "You are Brello Thinking in Deep-Think mode."},
        {"role": "user", "content": "Explain why prime factorization is unique."},
    ]
    response, co2 = generate_response(tokenizer, model, plugin_mgr, registry, messages, mode="deep")
    print("=== Deep-Think Output ===\n", response)
    print(f"CO₂ Emitted: {co2:.6f} kg")
    messages[0]["content"] = "You are Brello Thinking in Fast-Think mode."
    response_fast, co2_fast = generate_response(tokenizer, model, plugin_mgr, registry, messages, mode="fast")
    print("\n=== Fast-Think Output ===\n", response_fast)
    print(f"CO₂ Emitted: {co2_fast:.6f} kg")


if __name__ == "__main__":
    main()
Credits
- Creator: Epic Systems
- Engineer: Rehan Temkar
- Model: Brello Thinking v1.1.0
Brello Thinking - Advanced AI Reasoning by Epic Systems