Woah.

#1
by cactopus - opened

A great model even after only brief testing! It's great at reasoning, and most importantly, it does what it says in the title: the model expresses plot and writing in interesting, human-like ways without compromising coherence. Gone are the repetitive reply styles and the signature AI phrasing from some past models I've used, and each generation feels genuinely interesting to read. What a banger. Keep up the great work!

Some settings I tweaked over my past few hours of testing (mainly for improved reasoning in some homebrewed tests):
[Temperature 1.14, TFS 0.998, Min P kept at 0.05 from the stock recommended settings]
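
If anyone wants to try these numbers outside Ooba, here is a rough sketch of the same samplers applied through llama-cpp-python. This is an illustration only, not my actual setup: the GGUF filename is a placeholder, and tail-free sampling was dropped from newer llama.cpp builds, so tfs_z may be unavailable or ignored depending on your version.

```python
from llama_cpp import Llama

# Placeholder GGUF path; any quant of the model works for illustration.
llm = Llama(
    model_path="MS3.2-The-Omega-Directive-24B-Unslop-v2.0.Q8_0.gguf",
    n_ctx=31232,      # same context size as in the tests above
    n_gpu_layers=-1,  # offload everything that fits
)

out = llm.create_completion(
    prompt="Once upon a time",
    max_tokens=512,
    temperature=1.14,  # raised from the stock recommendation
    min_p=0.05,        # kept at the stock value
    tfs_z=0.998,       # TFS; only honored on builds that still ship this sampler
)
print(out["choices"][0]["text"])
```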

Tested EXL2 4.5bpw on an A4500 with a 31232 context size, using Ooba and SillyTavern.
EXL3 is a bit buggy on this machine, with a multi-minute wait before the first generation even with plenty of VRAM headroom. That didn't happen on my 4090 at home, but I don't have access to it at the moment.

Running Q8_0 here, Temp 1.1, Min P 0.05, and it's incoherent as fuck at 6k+ context of the 128k. It changes random character details and mixes up perspectives between characters.

Haven't tried reasoning yet though; I thought that was a feature of Magistral-24B, not Mistral-Small-24B.

Ready.Art org

Please post your entire sampler settings

Did I understand correctly that these are the sampler settings you recommend?

[screenshot of sampler settings: IMG_20250722_004842_172.jpg]

Ready.Art org

Are you having the same problem as the poster above you?

I'm hoping the person I was replying to will respond back.

@FrenzyBiscuit
I chat in Russian, and anything above 0.84 starts to noticeably reduce the quality of the answers. In general, no language model I've used is particularly strong in Russian. I saw your recommended Mistral-V7-Tekken-T8-XML preset and configured everything as in that screenshot, but now I mostly use these settings:

[screenshot: current sampler settings]

Looks like I confused you with the author of the model, lol.

Ready.Art org

@PhoenixtheII looking forward to your response

@FrenzyBiscuit, I think you may have missed my reply above because the two reports look similar.

Ready.Art org

They are not similar. He is complaining about the replies being incoherent while you are complaining about a drop in quality.

Unfortunately, I don't know Russian, so I can't test this on my end.

This is not specifically a problem with this model but with all models; any model is designed first and foremost for English, and only then for other languages.

@PhoenixtheII looking forward to your response

[two screenshots attached]

Ready.Art org

@PhoenixtheII and which quant are you using?

Edit: The Q8 GGUF, or 8.0 BPW EXL2/EXL3?

MS3.2-The-Omega-Directive-24B-Unslop-v2.0.Q8_0.gguf

https://huggingface.co/mradermacher/MS3.2-The-Omega-Directive-24B-Unslop-v2.0-GGUF

Ready.Art org

Okay, my main node is quanting. I'll test this in a bit once that's finished.

In the meantime, have you tried EXL2 (with tensor parallelism disabled) to narrow down whether it's a problem with the model or with the GGUF quants?

I can only fit 2.5bpw with EXL; can you really compare that against a Q8?

Ready.Art org

No. I will test the EXL2 8.0 bpw in a bit (probably ~1 hour).

My hunch is the problem isn't the model but the GGUF quants. Do you notice the same issue with imatrix Q6 GGUF quants?

That I can test.

Ready.Art org

Recommended sampler settings, 8.0 BPW EXL2, 100k context (~20k used)...

It's fine?

How many replies do you get through before you run into issues?

[screenshot: test generation output]

It just happens at any time, and I need to reroll often. This is the imatrix Q6_K. The current story is at depth 9 (each model response is about 600-1500 tokens), and it has generated 4 responses since my last message.

One time it repeated a character detail of mine that it had itself written in the previous message, and got it wrong, as in the detail never appeared anywhere in the context history.
Another time it started writing a character named 'Layla' as 'Layl—Layl—'.
Another time my character was suddenly wearing Layla's underwear.
A description of blue facial markings, introduced at depth 3 and repeated a few times in the history, was suddenly referred to as emerald in color (emerald is Layla's eye color, by coincidence).

Ready.Art org

Another time it started writing a character named 'Layla' as 'Layl—Layl—'. <----

Are you quanting your context or is it FP16?

I'm at 35k context right now (Q8) and it's fine.
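
For what it's worth, here's a minimal sketch of what I mean by leaving the cache unquantized, assuming llama-cpp-python; the filename is a placeholder for the imatrix Q6_K.

```python
from llama_cpp import Llama

# Placeholder filename for the imatrix Q6_K quant.
llm = Llama(
    model_path="MS3.2-The-Omega-Directive-24B-Unslop-v2.0.i1-Q6_K.gguf",
    n_ctx=65536,
    n_gpu_layers=-1,
    # type_k / type_v are deliberately left unset, so the KV cache stays FP16.
    # A quantized cache (e.g. q4_0) is a common cause of long-context weirdness,
    # so rule it out before blaming the model or the GGUF quant.
)
```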

Ready.Art org

@mradermacher can you confirm your quants are stable/good?

Anyway, I'm going to retire from playing with LLMs. I recently gave Kimi K2, DeepSeek R1, and Gemini 2.5 Pro a try, but they all really suck at writing stories coherently (especially logical coherence and environmental/spatial awareness). Maybe in two years there will be enough improvement on this front.

Not that I'd expect that of a 24B, but compared to the ones mentioned above, this one stood out for being a 24B...

[screenshot: sampler settings page]

I only ever change settings on this page; these are the Q8_0 and imatrix Q6_K settings I have used ^

Ready.Art org

Okay! I'm more interested in why your chats get incoherent. I can't replicate it with EXL2.

I typically don't RP.

I like long, story-style replies, so I have a big system prompt telling it how to narrate (it's from https://huggingface.co/sophosympatheia's recent models from the last few months, slightly tweaked in wording, but not by much). We collaborate on writing a story: I provide some details for a chapter and what my character says, then let it write 600-4096 tokens. I might just press continue to let it run a second addition on its own before I steer the story a bit again.

I like to see what LLMs come up with in situations, their creativity so to speak.

Ready.Art org

I still wonder if it's a quant issue with GGUF or something to do with your setup.

Oh well.

FrenzyBiscuit changed discussion status to closed

@mradermacher can you confirm your quants are stable/good?

I don't know what "stable" would even mean, and I have no idea how to assess quants as "good" :) But what was described here is not uncommon behaviour for LLMs, and the smaller the quant, the more likely these issues appear. If only one person has issues, that sounds more like an issue with that person's system or files.

Ready.Art org

Yeah, I figured. Sorry for the bother.

FrenzyBiscuit changed discussion status to open
Ready.Art org

I owe you an apology @PhoenixtheII

I hope you are using text completion, because if you are using chat completion there is no chat_template in the tokenizer_config.json.

@gecfdo @mradermacher this is relevant to both of you. I know next to nothing about GGUFs, but the chat_template is a requirement for them, correct?
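
For anyone hitting this, here is a rough sketch of how to check for (and patch in) a missing template with transformers. The repo IDs are illustrative placeholders, and the donor can be any checkpoint that uses the same V7-Tekken prompt format.

```python
from transformers import AutoTokenizer

# Illustrative repo IDs only; substitute the actual model repo and a donor
# that uses the same V7-Tekken prompt format.
tok = AutoTokenizer.from_pretrained("ReadyArt/MS3.2-The-Omega-Directive-24B-Unslop-v2.0")

if tok.chat_template is None:
    donor = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-3.2-24B-Instruct-2506")
    tok.chat_template = donor.chat_template
    tok.save_pretrained("./patched-tokenizer")

# With a template present, chat-completion frontends can format turns properly:
msgs = [{"role": "user", "content": "Hello"}]
print(tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True))
```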

Ready.Art org

Yeah, imagine my surprise finding out there is no chat_template when trying to create an INT8 for vLLM.

@gecfdo I think you can just rip the chat template from Broken-Tutu Unslop v2, but I could be wrong. Anyway, it's your model, so I'll let you decide how to fix it.

Ready.Art org

Patched the lost chat_template back in; unsure whether it fixes the problem itself.

Ready.Art org

Also, I was unable to find the problem with the model losing coherency at X k tokens.

the chat_template is a requirement for [GGUFs] correct?

Nope. It might be a requirement to get good output from a model, but the GGUFs will work fine without one.

OK, I am pretty late, but I tried this model over the last few days. Not sure why; I usually avoid anything under 70B because I am always disappointed.

While it clearly lacks awareness relative to the bigger 70Bs I usually use, it's pretty close. It can follow instructions very well, it can handle multiple people and actions at the same time, and its small size allows me to go to higher context lengths (for some reason I get crashes above 64k context, but that's not the model's fault). The prose is competent, and other than a stronger tendency to repeat itself verbatim, I found no weaknesses compared to larger models.

So, yeah, woah, I'm impressed. I think this is probably the only smaller model that I have ever used to actually create something.

Ready.Art org

It really is some of sleep's and GECFDO's greatest work. I miss sleep so much. :(

OK, I am pretty late, but I tried this model over the last few days. Not sure why; I usually avoid anything under 70B because I am always disappointed.

Same; I've found for a while that anything under 32B tends to be a bit weak, at least for roleplay and writing. Somewhere at 32B and above I found I could do a roleplay within a roleplay or the like, which actually feels like doing an F-list RP with someone.

While it clearly lacks awareness relative to the bigger 70Bs I usually use, it's pretty close. It can follow instructions very well, it can handle multiple people and actions at the same time, and its small size allows me to go to higher context lengths (for some reason I get crashes above 64k context, but that's not the model's fault). The prose is competent, and other than a stronger tendency to repeat itself verbatim, I found no weaknesses compared to larger models.

Mhmm. I found all the Omega Directive ones to be decent smaller models for when you don't need things to be quite as complex (and want more context or speed), while the 70B's writing feels about 10% more genuine than the 24B's, with more variety in responses I think.

These days I have multiple profiles set up per model, for 8k, 16k, 32k context, etc., so I can squeeze more onto the GPU until I need the extra context length and then reload (rough sketch at the end of this post); that usually speeds things up about 25%. Going Q2/Q3 does a lot of that too, though it likely limits the responses somewhat.

So, yeah, woah, I'm impressed. I think this is probably the only smaller model that I have ever used to actually create something.

I have TOD as my go-to backup model(s) when I don't have specific ones I'm trying out. Well, that and a couple of others; the Trifecta 8B is good when I'm really low on RAM or I want it really fast with 'okay' responses.
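
Rough sketch of what I mean by context profiles, assuming llama-cpp-python; the layer counts and the filename are made-up placeholders, so tune them to your VRAM.

```python
from llama_cpp import Llama

# Hypothetical "context profiles": smaller context -> more layers on GPU -> faster.
PROFILES = {
    "fast-8k":  {"n_ctx": 8192,  "n_gpu_layers": -1},
    "mid-16k":  {"n_ctx": 16384, "n_gpu_layers": 40},
    "long-32k": {"n_ctx": 32768, "n_gpu_layers": 32},
}

def load_profile(name: str) -> Llama:
    cfg = PROFILES[name]
    return Llama(
        model_path="MS3.2-The-Omega-Directive-24B-Unslop-v2.0.i1-Q6_K.gguf",  # placeholder
        **cfg,
    )

llm = load_profile("fast-8k")  # reload with "long-32k" once the story needs it
```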
