# Qwen3-Yoyo-V4-42B-A3B-Thinking-TOTAL-RECALL-qx64bx-mlx
This is an experimental quant with variable bit depth and group size:
| Element                         | Bits | Group Size |
|---------------------------------|------|------------|
| stores and most attention paths | 4    | 64         |
| select attention paths          | 6    | 32         |
| head, embeddings, brainstorming | 6    | 32         |
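To make the table concrete, here is a minimal sketch of how a mixed-precision recipe like this could be expressed with mlx-lm's `quant_predicate` hook. The layer-name patterns below are illustrative guesses, not the actual Deckard (qx) selection, which is described in the next section and is not published here.

```python
# Hypothetical sketch of a qx64bx-style mixed-precision recipe.
# Assumes mlx-lm's quant_predicate hook; the layer patterns are
# illustrative, not the author's actual formula.
from mlx_lm import convert

def qx64bx_predicate(path, module, config):
    # Head and embeddings: 6 bits, group size 32
    if "lm_head" in path or "embed_tokens" in path:
        return {"bits": 6, "group_size": 32}
    # "Select attention paths": 6 bits, group size 32 (pattern is a guess)
    if "self_attn.v_proj" in path:
        return {"bits": 6, "group_size": 32}
    # Stores and most attention paths: 4 bits, group size 64
    return {"bits": 4, "group_size": 64}

convert(
    "DavidAU/Qwen3-Yoyo-V4-42B-A3B-Thinking-TOTAL-RECALL",
    mlx_path="Qwen3-Yoyo-V4-42B-A3B-Thinking-TOTAL-RECALL-qx64bx-mlx",
    quantize=True,
    quant_predicate=qx64bx_predicate,
)
```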
It is shaped for RP (roleplay). It will probably perform differently from any Qwen you have met, 30B or higher.
Because this quant emphasizes the brainstorming layers, it rates much lower on metrics than the other quants.
For a fully functional, high-quality quant, check out Qwen3-Yoyo-V4-42B-A3B-Thinking-TOTAL-RECALL-qx86x-hi-mlx.
This model is the foundation for the next Total-Recall with Star Trek TNG training, and a TNG version will be available soon.
## What is qx64bx
This is the Deckard (qx) formula, a quantization approach inspired by optical design.
It is modeled after my Nikon Noct Z 58mm F/0.95: I followed the same principles that form an image on a sensor, but applied them to the transformer architecture.
All the qx models have mixed-precision layers and use one variation of this formula.
This works the same way on most architectures, but on Qwen it really shines, because the model reaches for metaphors to create an image of the context. This lowers the perplexity (already low in the base model) and speeds up inference.
No process other than selective quantization is involved.
The b in bx stands for Brainstorming eXtended, while the x stands for eXtended embeddings, which are also at 6 bits.
All my older models in the qx series keep the embeddings at the same precision as the data stores. That was not an oversight; it was simply "taming the model" on older versions. The newer releases have different approaches to inference, like the Qwen3-Next-80B, which benefit from higher-order embeddings that line up with the head.
The YOYO models effectively elevate the experience from the Qwen3-30B series to a level that even Qwen did not reach, and thus need a bit more finesse in the embeddings than your regular MoE. The same thing happens in the Almost-Human brainstormed series by DavidAU, which have metrics high enough to compare with models many times their size, just from adding this refinement to the formula.
I noticed this also in the GLM Air, and plan to do an experimental series once the 4.6 Air comes out.
For now, the best small test bed is the YOYO-V4, a model competent enough to be trained quickly and run efficiently.
## Archiving models
As I make room for new models, I am archiving old ones; if I find one that could benefit from a re-do, I will create a high-quality quant of it and make it available for a week.
To make this easy, any new model that gathers five Likes in its first week becomes a collection item; otherwise it goes into an archival queue for another week, and then the tensors get dropped.
You will still have access to the model config file and can make your own quant with the same configuration using the MLX tools.
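As a sketch of what that looks like in practice, assuming mlx-lm's usual convention of storing the recipe under the "quantization" key of config.json (with per-layer overrides for mixed-precision models):

```python
# Sketch: inspect the per-layer quantization recipe that ships in an
# archived config.json, so it can be reused for your own conversion.
# Assumes mlx-lm's config layout; adjust the path to the archived file.
import json

with open("config.json") as f:
    config = json.load(f)

# Print the recipe, including any per-layer bit/group-size overrides
print(json.dumps(config.get("quantization", {}), indent=2))
```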
This also means I will not keep models that duplicate functionality or underperform, so unless old models see constant downloads, they too will head to the archival bin.
-G
This model Qwen3-Yoyo-V4-42B-A3B-Thinking-TOTAL-RECALL-qx64bx-mlx was converted to MLX format from DavidAU/Qwen3-Yoyo-V4-42B-A3B-Thinking-TOTAL-RECALL using mlx-lm version 0.28.3.
## Use with mlx
```shell
pip install mlx-lm
```
```python
from mlx_lm import load, generate

# Load the quantized model and tokenizer from the Hugging Face Hub
model, tokenizer = load("nightmedia/Qwen3-Yoyo-V4-42B-A3B-Thinking-TOTAL-RECALL-qx64bx-mlx")

prompt = "hello"

# Wrap the prompt in the chat template if the tokenizer defines one
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
## Model tree

Base model: YOYO-AI/Qwen3-30B-A3B-YOYO-V4