AdvRahul/Nova-7B-1M-Q4_K_M-GGUF
A safety-enhanced, quantized version of Qwen2.5-7B-Instruct-1M with a massive 1-million-token context window.
Nova-7B-1M is a specialized version of the powerful Qwen/Qwen2.5-7B-Instruct-1M model. It has been fine-tuned to improve safety alignment and is provided in the efficient GGUF format, making its incredible long-context capabilities accessible on a wider range of hardware, including CPUs.
Model Details
- Model Creator: AdvRahul
- Base Model: Qwen/Qwen2.5-7B-Instruct-1M
- Fine-tuning Focus: Enhanced Safety & Harmlessness via red-teaming.
- Format: GGUF
- Quantization: Q4_K_M. This 4-bit quantization method offers a good balance between model size, output quality, and inference speed.
- Context Length: 1,000,000 tokens
Model Description
Unlocking Long Context with Safety
Nova-7B-1M was created with two primary goals:
- Enhance Safety: The base model underwent extensive red-team testing with advanced protocols. This process was designed to significantly reduce the likelihood of generating harmful, biased, or unsafe content, making it a more reliable choice for user-facing applications.
- Democratize Access: By quantizing the model to the GGUF format, its powerful capabilities, especially its massive 1M token context window, can be run efficiently on consumer-grade hardware and CPUs, which would be impossible with the full-precision model.
This model is ideal for tasks involving the analysis, summarization, or querying of extremely long documents, such as entire codebases, legal contracts, or comprehensive research papers, all with an added layer of safety.
How to Use
This model is in the GGUF format and is intended for use with frameworks like llama.cpp and its Python bindings. The standard transformers library will not work.
Using llama-cpp-python
This is the recommended way to use the model in a Python application.
First, install the library:
```bash
# For basic CPU usage
pip install llama-cpp-python

# Or with hardware acceleration (e.g., OpenBLAS)
# CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install --force-reinstall --no-cache-dir llama-cpp-python

# Llama.from_pretrained (used below) also requires the Hugging Face Hub client
pip install huggingface_hub
```
Then load the model and run a chat completion:

```python
from llama_cpp import Llama

# Initialize the Llama model.
# You must set n_ctx to your desired context size.
# WARNING: a full 1M-token context window requires a very large amount of RAM (>64 GB).
# Adjust n_ctx based on your hardware and needs.
llm = Llama.from_pretrained(
    repo_id="AdvRahul/Nova-7B-1M-Q4_K_M-GGUF",
    filename="nova-7b-1m-Q4_K_M.gguf",  # or the actual .gguf filename in the repo
    n_ctx=1_000_000,    # <-- crucial for long context
    n_gpu_layers=-1,    # offload all layers to the GPU if you have enough VRAM
    verbose=False,
)

# Chat messages in the usual role/content format. The GGUF file ships with
# Qwen's chat template, which create_chat_completion applies automatically.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    # Imagine you've loaded a very long document into this prompt.
    {"role": "user", "content": "Summarize the key points of the provided text."},
]

# Run inference
output = llm.create_chat_completion(
    messages=messages,
    max_tokens=512,
)

print(output["choices"][0]["message"]["content"])
```
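The same pattern extends to the long-document use case this model is built for. Below is a minimal sketch of that workflow; the file name `long_document.txt` is hypothetical, and `Llama.tokenize` is used only to sanity-check the prompt length against the configured context window.

```python
# Hypothetical long-document summarization workflow.
with open("long_document.txt", "r", encoding="utf-8") as f:
    document = f.read()

# tokenize() expects bytes and returns token ids; use it to confirm the
# document actually fits in the n_ctx configured above.
n_tokens = len(llm.tokenize(document.encode("utf-8")))
print(f"Document length: {n_tokens} tokens")
assert n_tokens < 1_000_000, "Document exceeds the configured context window"

output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"{document}\n\nSummarize the key points of the text above."},
    ],
    max_tokens=512,
)
print(output["choices"][0]["message"]["content"])
```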
Ethical Considerations and Limitations
While this model has been explicitly fine-tuned for safety, no model is perfect.
- Safety is Not Guaranteed: The safety alignment is an improvement but does not eliminate all risks. The model may still produce undesirable or biased content.
- Long Context Hallucinations: In very long contexts, models can lose focus or "hallucinate" facts. Always verify critical information in the generated output.
- Hardware Demands: While GGUF makes this model more accessible, using the full 1M-token context window is extremely RAM-intensive. Be aware of the hardware requirements before processing such long sequences (see the rough estimate below).
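For a rough sense of scale, here is a back-of-envelope estimate of the fp16 KV cache at the full context length, using the base Qwen2.5-7B architecture figures (28 layers, 4 KV heads, head dimension 128). Actual usage depends on the runtime and any KV-cache quantization.

```python
# Back-of-envelope KV-cache estimate at the full 1M-token context (fp16 K/V).
layers, kv_heads, head_dim, bytes_per_val = 28, 4, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K and V per token
total_gb = per_token * 1_000_000 / 1024**3
print(f"~{per_token} bytes/token -> ~{total_gb:.0f} GB of KV cache at 1M tokens")
# i.e. roughly 53 GB of KV cache, on top of the ~4.7 GB of Q4_K_M weights.
```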
Developers should always implement their own safety guardrails and content moderation systems as part of a responsible AI deployment strategy.
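As a minimal illustration of where such a guardrail can hook in (the `BLOCKED_TERMS` set and refusal message are placeholders, not a recommended moderation policy; production systems should use a dedicated moderation model or service):

```python
# Illustrative post-generation filter only; replace with a real moderation system.
BLOCKED_TERMS = {"example-banned-phrase"}

def moderated_chat(llm, messages, max_tokens=512):
    out = llm.create_chat_completion(messages=messages, max_tokens=max_tokens)
    text = out["choices"][0]["message"]["content"]
    if any(term in text.lower() for term in BLOCKED_TERMS):
        return "[response withheld by content filter]"
    return text
```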