Testing experimental quants
I'm going to be testing Meta-Llama-3-8B-Instruct-f16-q4_K_S.gguf against Meta-Llama-3-8B-Instruct-q4_K_S.gguf. I'll share any findings in this thread.
excellent, I appreciate it!!
Here is a repo with some results: ddh0/UnquantizedEmbeddingTesting
There are a couple files in the repo that are not detailed in the README, but there is some information there that may be interesting. Let me know if there are any specific models or tests that you'd like done.
TLDR: there is a measurable difference between models with unquantized vs quantized embedding/output tensors, but exactly how important the difference is should be investigated more
cc @ZeroWw
In my own tests with Mistral v0.3 and WizardLM-2, f16.q5 and f16.q6 gave the best results.
You can find the quantizations in my profile.
https://huggingface.co/ZeroWw/Samantha-Qwen-2-7B-GGUF <<<<<<<<<<<<<
https://huggingface.co/ZeroWw/Mistral-7B-Instruct-v0.3-GGUF
https://huggingface.co/ZeroWw/microsoft_WizardLM-2-7B-GGUF
https://huggingface.co/ZeroWw/Meta-Llama-3-8B-Instruct-GGUF
https://huggingface.co/ZeroWw/Mistroll-7B-v2.2-GGUF
@HiroseKoichi okay, running f16-q6 vs q6 and f16-q8 vs q8 soon
Test results for f16-q6_K vs q6_K and f16-q8_0 vs q8_0 are available in the repo (still need to update the README)
My feedback on q8_0 vs q8_1, based on a 21-question, 4200-token survey. Client: LM Studio, temp=0, topP=0.95, system prompt: Perform the task to the best of your ability.
The first shot for each was basically the same. After regenerating more than 3 times, there were some differences: 1. q8_1 followed the instructions better; q8_0 stopped responding after a summarization task in the middle. 2. The quality of the answered tasks was similar.
I suspect the q8_0 file is broken. I also downloaded and tried bartowski/tabula-8b-GGUF q8_0 and q8_0_L. I don't know what's wrong, but neither works with LM Studio v0.2.25 using the Llama3 or ChatML presets.
Also, which models would you like me to compare?
Ah okay. I'll set that up
I made some more quantizations: (the q4, q5 and q8 are f16/q4 f16/q5 and f16/q8)
You can find them all in the models section at https://huggingface.co/ZeroWw
P.S. I didn't do q4 because q4_k quantizations are, IMHO, bad in most cases. You are free to try f16/q4, but f16/q5 is probably better.
https://huggingface.co/ZeroWw/Samantha-Qwen-2-7B-GGUF <<<<<<<<<<<<<
https://huggingface.co/ZeroWw/Mistral-7B-Instruct-v0.3-GGUF
https://huggingface.co/ZeroWw/microsoft_WizardLM-2-7B-GGUF
https://huggingface.co/ZeroWw/Meta-Llama-3-8B-Instruct-GGUF
https://huggingface.co/ZeroWw/Mistroll-7B-v2.2-GGUF
https://huggingface.co/ZeroWw/Phi-3-mini-128k-instruct-GGUF
https://huggingface.co/ZeroWw/Phi-3-medium-128k-instruct-GGUF
https://huggingface.co/ZeroWw/Qwen1.5-7B-Chat-GGUF
https://huggingface.co/ZeroWw/NeuralDaredevil-8B-abliterated-GGUF
https://huggingface.co/ZeroWw/MixTAO-7Bx2-MoE-v8.1-GGUF
https://huggingface.co/ZeroWw/aya-23-8B-GGUF
I want each model individually run on the 40 prompts so that they each have their own text file
@HiroseKoichi sorry for the delay, this is done now. Each model has its results in a separate file in the repo: ddh0/UnquantizedEmbeddingTesting
All 20 different quantizations are included, from q2_K to q8_0 to f16-q2_K to f16-q8_0. I'm very interested to see what differences you find
too many because you used random seeds.
In a comparison like this, the seed should be fixed, and you should also include some questions that involve reasoning and some that involve creative writing.
That's because the output tensor affects the "way" the model expresses itself, while the embed tensor affects its understanding more.
Also, add one test of the pure f16 (convert the hf model to f16) like:
python llama.cpp/convert-hf-to-gguf.py --outtype f16 model_name --outfile ${model_name}.f16.gguf
that's because f16 above will be the "baseline".
here you can find a bunch of models with the f16 and f16.q5, f16.q6 and f16.q8: https://huggingface.co/RobertSinclair
CC @ddh0 , @bartowski @helloAI333
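For anyone who wants to put a number on "how far from the baseline" each quant drifts, here is a minimal sketch. It assumes you already have the generated text for the same prompt from the f16 baseline and from each quant as plain strings; the function names and the use of a character-level similarity ratio are my own illustration, not something anyone in this thread has run.

```python
import difflib


def similarity_to_baseline(baseline_text: str, candidate_text: str) -> float:
    """Return a 0..1 similarity ratio between two generated outputs."""
    matcher = difflib.SequenceMatcher(None, baseline_text, candidate_text, autojunk=False)
    return matcher.ratio()


def rank_against_baseline(baseline_text: str, candidates: dict) -> list:
    """Return (name, score) pairs sorted from most to least similar to the baseline."""
    scores = {
        name: similarity_to_baseline(baseline_text, text)
        for name, text in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical outputs, only to show the call shape.
ranking = rank_against_baseline(
    "The capital of France is Paris.",
    {
        "f16-q6_K": "The capital of France is Paris.",
        "q6_K": "The capital of France is Paris, a large city.",
    },
)
print(ranking)
```

A text-similarity ratio only says how different the wording is, not which answer is better, so it is probably most useful for spotting which prompts deserve a manual read.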
too many because you used random seeds.
Don't think seeds are relevant in this case as I'm not doing any sampling
@ddh0
In general, no... but asking the same questions gives different results depending on the seed, and it's harder to determine how much a model is degraded if the seeds are random.
he's got temperature = 0.0 which means that seed doesn't play a role
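To illustrate the point about temperature 0, here is a toy sketch, not any runtime's actual sampler. It assumes temperature 0 is treated as greedy argmax; the logits and the `pick_token` helper are made up for the example.

```python
import math
import random


def softmax(logits, temperature):
    """Convert logits to probabilities at the given (non-zero) temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits, temperature, rng):
    """Greedy argmax when temperature is 0, otherwise sample from the softmax."""
    if temperature == 0.0:
        # No random number is drawn here, so the seed cannot matter.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [1.0, 2.5, 2.4, 0.3]  # made-up scores for four hypothetical tokens

# Try 100 different seeds at each temperature and collect the distinct picks.
greedy_picks = {pick_token(logits, 0.0, random.Random(seed)) for seed in range(100)}
sampled_picks = {pick_token(logits, 1.0, random.Random(seed)) for seed in range(100)}

print("temperature 0:", greedy_picks)
print("temperature 1:", sampled_picks)
```

With greedy picking, every seed gives the same token. With temperature 1, the picks vary across seeds. Real backends can still show small run-to-run differences for other reasons (for example, floating-point ordering on GPUs), so this only covers the sampling step.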
Results for pure bf16 test are up: Results_Meta-Llama-3-8B-Instruct-bf16.gguf.txt
I created a pull request to fix the formatting of the files. The current ones have the escape sequences written in plain text instead of rendered.
Thank you, but this is intentional and I don't think it's a problem
Can you also drop an additional text file that has the file sizes of all the models? Thanks again for running all of this.
Will do now
Here is a text file with the sizes of each model in bytes (as output by ls -alo on my machine): sizes.txt
weird.. in your "sizes" I read:
7835472160 Jun 16 18:30 Meta-Llama-3-8B-Instruct-f16-q6_K.gguf
while my quantization is:
7.84 GB
can you check if the file is the same?
https://huggingface.co/ZeroWw/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct.q6_k.gguf
I ask because I am not sure what makes my quantization better; it could be anything.
I would suggest running tests comparing the F16 in my repository to the q5, q6 and q8 in the same directory.
Those are definitely the right files.
To obtain them I run a Colab notebook whose main part is this:
import os
import subprocess
repo_model_name = 'gradientai/Llama-3-8B-Instruct-Gradient-1048k' #@param ["mistralai/Mistral-7B-Instruct-v0.3", "lucyknada/microsoft_WizardLM-2-7B", "meta-llama/Meta-Llama-3-8B-Instruct", "BarraHome/Mistroll-7B-v2.2","Qwen/Qwen1.5-7B-Chat","microsoft/Phi-3-mini-128k-instruct","microsoft/Phi-3-medium-128k-instruct","google/gemma-7b",'zhengr/MixTAO-7Bx2-MoE-v8.1','CohereForAI/aya-23-8B','01-ai/Yi-1.5-9B-32K','deepseek-ai/DeepSeek-Coder-V2-Lite-Base','01-ai/Yi-1.5-6B-Chat','ZeusLabs/L3-Aethora-15B-V2','Nitral-AI/Hathor_Stable-v0.2-L3-8B'] {allow-input: true}
model_name = os.path.basename(repo_model_name)
# Download Model
print(f'Downloading {repo_model_name}')
subprocess.run(['huggingface-cli', 'download', repo_model_name, '--local-dir', model_name], stdout=subprocess.DEVNULL)
# Convert Model
print('Converting model to f16.')
subprocess.run(['python', 'llama.cpp/convert-hf-to-gguf.py', '--outtype', 'f16', model_name, '--outfile', f'{model_name}.f16.gguf'], stdout=subprocess.DEVNULL)
# Remove the original model directory
os.system(f'rm -rf {model_name}')
# Quantize Model
quantization_types = ['q5_k', 'q6_k', 'q8_0']
for q_type in quantization_types:
    print(f'Quantizing {q_type}')
    subprocess.run(['./build/bin/llama-quantize', '--allow-requantize', '--output-tensor-type', 'f16', '--token-embedding-type', 'f16', f'{model_name}.f16.gguf', f'{model_name}.{q_type}.gguf', q_type, str(os.cpu_count())], stdout=subprocess.DEVNULL)
7835472160 bytes is equal to 7.835 GB, which rounds up to 7.84GB
7835472160/1024/1024/1024 ≈ 7.30 GiB
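The two figures most likely differ only in the unit: dividing by 1000 three times gives decimal gigabytes (GB), dividing by 1024 three times gives binary gibibytes (GiB). A small sketch of the arithmetic (the `describe_size` helper is just for illustration):

```python
def describe_size(num_bytes: int) -> str:
    """Format a byte count in both decimal GB and binary GiB."""
    gb = num_bytes / 1000**3   # decimal gigabytes
    gib = num_bytes / 1024**3  # binary gibibytes
    return f"{num_bytes} bytes = {gb:.2f} GB = {gib:.2f} GiB"


print(describe_size(7835472160))
# prints: 7835472160 bytes = 7.84 GB = 7.30 GiB
```

So the same file can show up as 7.84 or roughly 7.3 depending on which convention a tool or website uses, without the byte count changing.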
No, I do not confirm that. If you want to confirm that on your own, go ahead
Edit: I don't think that the exact file size in bytes is going to help you figure anything out, for what it's worth
This is how the sizes should be:
-rw-r--r-- 1 root root 16068890912 Jun 28 05:55 Meta-Llama-3-8B-Instruct.f16.gguf
-rw-r--r-- 1 root root 7042224416 Jun 28 06:07 Meta-Llama-3-8B-Instruct.q5_k.gguf
-rw-r--r-- 1 root root 7835472160 Jun 28 06:15 Meta-Llama-3-8B-Instruct.q6_k.gguf
-rw-r--r-- 1 root root 9525776672 Jun 28 06:17 Meta-Llama-3-8B-Instruct.q8_0.gguf
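For comparing a listing like this against another set of byte counts, here is a rough sketch. It assumes the nine-column `ls -l` layout shown above and file names without spaces; `parse_ls_sizes` is a made-up helper name. The reference value is the byte count quoted earlier in this thread for Meta-Llama-3-8B-Instruct-f16-q6_K.gguf.

```python
def parse_ls_sizes(listing: str) -> dict:
    """Map file name -> size in bytes from `ls -l` style lines."""
    sizes = {}
    for line in listing.strip().splitlines():
        fields = line.split()
        # Columns: mode, links, owner, group, size, month, day, time, name
        sizes[fields[8]] = int(fields[4])
    return sizes


listing = """
-rw-r--r-- 1 root root 16068890912 Jun 28 05:55 Meta-Llama-3-8B-Instruct.f16.gguf
-rw-r--r-- 1 root root 7042224416 Jun 28 06:07 Meta-Llama-3-8B-Instruct.q5_k.gguf
-rw-r--r-- 1 root root 7835472160 Jun 28 06:15 Meta-Llama-3-8B-Instruct.q6_k.gguf
-rw-r--r-- 1 root root 9525776672 Jun 28 06:17 Meta-Llama-3-8B-Instruct.q8_0.gguf
"""

sizes = parse_ls_sizes(listing)
reported_f16_q6_k = 7835472160  # byte count quoted earlier in the thread

print(sizes["Meta-Llama-3-8B-Instruct.q6_k.gguf"] == reported_f16_q6_k)
# prints: True
```

For the q6_K pair the byte counts quoted in this thread are the same number. Equal size is consistent with the same quantization settings, but on its own it does not prove the contents are identical.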
What is your point, exactly? I don't think my file needs to be the exact same size in bytes as yours. What are you getting at?
No need to be snippy, but if the size is not the same it means the quantization process was different than the one I proposed. That's all.
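If the goal is to check whether two GGUF files are actually the same, a content hash is a stronger test than size. A minimal sketch, assuming both files are available locally (the paths in the usage note are placeholders):

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large models do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file_contents(path_a, path_b):
    """True only if both files have identical bytes (up to SHA-256 collisions)."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```

Usage would look like `same_file_contents("mine.q6_k.gguf", "yours.q6_k.gguf")`. A mismatch only says the bytes differ, not where. Differences in metadata or tool version could change the hash without the quantization recipe itself being different.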