Triangle104/GLM4-9B-Neon-v2-Q6_K-GGUF

Triangle104 committed on Apr 28

Commit b9364ae · verified · 1 Parent(s): 4465f54

Update README.md

Files changed (1)

README.md +57 -0

@@ -17,6 +17,63 @@ tags:
 This model was converted to GGUF format from [`allura-org/GLM4-9B-Neon-v2`](https://huggingface.co/allura-org/GLM4-9B-Neon-v2) using llama.cpp via the ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.
 Refer to the [original model card](https://huggingface.co/allura-org/GLM4-9B-Neon-v2) for more details on the model.
+---
+RP finetune of GLM-4-9B-0414. Feels nice, lots of personality, if a bit
+quirky sometimes. Nice prose, not too Claude-ish or Gemini-ish. Doesn't
+seem to like overly long system prompts or charcards, though. Seems to like
+JSON-formatted system prompts.
+Model was trained by Auri.
+Training notes
+-
+Model was trained on a dataset consisting of 77M tokens of synthetic
+RP and short story gen data for one epoch. Training took around 11 hours
+on a 2x RTX 3090 workstation, generously provided by OwenArli.
+Went with some sane defaults for the training config: QLoRA plus CCE for a
+nice chunk of memory usage optimization. 16k fit on 48GB nicely, with
+some room to spare. I seem to have a problem with Eval/Loss being
+broken, not sure why; otherwise it trained smoothly.
+Huge thanks to ArliAI for providing compute and collaborating on this run!
+Format
+-
+Model responds to GLM4 instruct formatting, exactly like its base
+model. Backends struggle to add the BOS token automatically, so you'll need
+to do it yourself. The Jinja template should work for chat completions.
+[gMASK]<sop><|system|>
+{system_prompt}<|user|>
+{prompt}<|assistant|>
+Recommended Samplers
+-
+Nothing special, just classics.
+Temperature - 1
+Min-P - 0.1
+Repetition Penalty - 1.03
+Example master import for SillyTavern (using Shingane-v1 system prompt by Steelskull)
+Running on KoboldCPP and other backends
+-
+To run GGUFs correctly, you need the most recent version of KoboldCPP, and to pass `--overridekv glm4.rope.dimension_count=int:64` to the CLI command or put `glm4.rope.dimension_count=int:64` into the overridekv box in the GUI (under the Tokens tab, at the very bottom).
+Thanks to DaringDuck and tofumagnate for info on how to apply this fix.
+To run this model on vLLM, you'll need to build it from source from the git repo; full GLM4 support hasn't reached a release yet.
+ExLLaMAv2- and v3-based backends, such as TabbyAPI, should support the model out of the box.
+Latest versions of llama.cpp server should also allow running GGUFs out of the box.
+---
 ## Use with llama.cpp
 Install llama.cpp through brew (works on Mac and Linux)