bartowski/Samantha-Qwen-2-7B-GGUF · Testing experimental quants

Owner Jun 17, 2024

@ZeroWw try to compare these ones if you can

ddh0

Jun 17, 2024

I'm going to be testing Meta-Llama-3-8B-Instruct-f16-q4_K_S.gguf against Meta-Llama-3-8B-Instruct-q4_K_S.gguf. I'll share any findings in this thread.

bartowski

Owner Jun 18, 2024

excellent, I appreciate it!!

ddh0

Jun 20, 2024

Here is a repo with some results: ddh0/UnquantizedEmbeddingTesting

There are a couple files in the repo that are not detailed in the README, but there is some information there that may be interesting. Let me know if there are any specific models or tests that you'd like done.

TLDR: there is a measurable difference between models with unquantized vs quantized embedding/output tensors, but exactly how important the difference is should be investigated more

cc @ZeroWw

deleted

Jun 20, 2024

This comment has been hidden (marked as Resolved)

ZeroWw

Jun 20, 2024

edited Jun 21, 2024

In my own tests with Mistral v0.3 and WizardLM-2, f16.q5 and f16.q6 gave the best results.
You can find the quantizations in my profile.
https://huggingface.co/ZeroWw/Samantha-Qwen-2-7B-GGUF <<<<<<<<<<<<<
https://huggingface.co/ZeroWw/Mistral-7B-Instruct-v0.3-GGUF
https://huggingface.co/ZeroWw/microsoft_WizardLM-2-7B-GGUF
https://huggingface.co/ZeroWw/Meta-Llama-3-8B-Instruct-GGUF
https://huggingface.co/ZeroWw/Mistroll-7B-v2.2-GGUF

ddh0

Jun 21, 2024

@HiroseKoichi okay, running f16-q6 vs q6 and f16-q8 vs q8 soon

ddh0

Jun 22, 2024

Test results for f16-q6_K vs q6_K and f16-q8_0 vs q8_0 are available in the repo (still need to update the README)

helloAI333

Jun 22, 2024

My feedback on q8_0 vs. q8_1, based on a 4200-token, 21-question survey. Client: LM Studio, temp=0, topP=0.95, system prompt: "Perform the task to the best of your ability."

The first shot for each was basically the same. After regenerating more than 3 times, there were some differences: 1. q8_1 followed the instructions better, while q8_0 stopped responding after a summarization task in the middle. 2. The quality of the answered tasks was similar.

I suspect the q8_0 file is broken. I also downloaded and tried bartowski/tabula-8b-GGUF q8_0 and q8_0_L. I don't know what's wrong, but neither works with LM Studio v0.2.25 using the Llama3 or ChatML presets.

deleted

Jun 22, 2024

This comment has been hidden (marked as Resolved)

ddh0

Jun 22, 2024

Actually, there's a mismatch on the README vs. file; the README says that it was f16.Q4_K_S vs. Q4_K_S, but the file says it was f16 vs. f16.Q4_K_S. @ddh0 could you clarify which of the two it was?

@HiroseKoichi It's f16-q4_K_S vs. regular q4_K_S

ddh0

Jun 22, 2024

Also, which models would you like me to compare?

deleted

Jun 22, 2024

This comment has been hidden (marked as Resolved)

ddh0

Jun 22, 2024

Ah okay. I'll set that up

ZeroWw

Jun 23, 2024

edited Jun 24, 2024

I made some more quantizations (the q4, q5 and q8 are f16/q4, f16/q5 and f16/q8).
You can find them all in the models section at https://huggingface.co/ZeroWw
P.S. I didn't do q4 because q4_k quantizations are, imho, bad in most cases. You are free to try f16/q4, but f16/q5 is probably better.

ddh0

Jun 27, 2024

I want each model individually run on the 40 prompts so that they each have their own text file

@HiroseKoichi sorry for the delay, this is done now. Each model has its results in a separate file in the repo: ddh0/UnquantizedEmbeddingTesting

All 20 different quantizations are included, from q2_K to q8_0 to f16-q2_K to f16-q8_0. I'm very interested to see what differences you find

CC @bartowski @ZeroWw @helloAI333

ZeroWw

Jun 27, 2024

edited Jun 27, 2024

All 20 different quantizations are included, from q2_K to q8_0 to f16-q2_K to f16-q8_0. I'm very interested to see what differences you find

too many because you used random seeds.
In a comparison like this the seed should be fixed, and you should also include some questions that involve reasoning and some that involve creative writing.
That's because the output tensor affects the "way" the model expresses itself, while the embedding tensor affects its understanding more.
Also, add one test of the pure f16 (convert the HF model to f16), like:
python llama.cpp/convert-hf-to-gguf.py --outtype f16 model_name --outfile ${model_name}.f16.gguf

That f16 file will be the "baseline".

here you can find a bunch of models with the f16 and f16.q5, f16.q6 and f16.q8: https://huggingface.co/RobertSinclair

CC @ddh0 , @bartowski @helloAI333
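
One possible way to put a number on how far a quantized model's answer drifts from the f16 baseline is a simple word-level similarity ratio. The sketch below is only an illustration: the `similarity` helper and the two example strings are made up, and a lexical ratio like this says nothing about which answer is actually better.

```python
import difflib

def similarity(baseline_text, candidate_text):
    """Return a ratio in [0, 1]; 1.0 means the word sequences are identical."""
    a = baseline_text.split()
    b = candidate_text.split()
    # autojunk=False keeps long outputs from having frequent words discarded.
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()

# Hypothetical outputs for the same prompt (not taken from the test results).
baseline = "The capital of France is Paris."
candidate = "The capital of France is Paris, of course."

print(similarity(baseline, baseline))   # identical texts give 1.0
print(similarity(baseline, candidate))  # somewhere below 1.0
```

Running this per prompt against the f16 output would give one score per quantization, which might make it easier to spot where the f16-embedding variants stay closer to the baseline.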

ddh0

Jun 27, 2024

too many because you used random seeds.

Don't think seeds are relevant in this case as I'm not doing any sampling

ZeroWw

Jun 27, 2024

edited Jun 27, 2024

too many because you used random seeds.

Don't think seeds are relevant in this case as I'm not doing any sampling

@ddh0
In general, no, but asking the same questions gives different results depending on the seed, and it's harder to determine how much a model is degraded if the seeds are random.

bartowski

Owner Jun 28, 2024

he's got temperature = 0.0 which means that seed doesn't play a role
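
To illustrate why the seed drops out at temperature 0, here is a toy sampler. It is a sketch only, not llama.cpp's actual sampling code: `pick_token` and the logit values are invented for the example. With temperature 0 it takes the argmax and never touches the random generator; with a positive temperature the result depends on the seed.

```python
import math
import random

def pick_token(logits, temperature, rng):
    """Return the index of the next token for a list of logits."""
    if temperature == 0.0:
        # Greedy decoding: always take the highest logit; rng is never consulted.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax with temperature, then draw one index from that distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [1.2, 3.4, 0.5, 3.1]  # made-up values for illustration

# Try 100 different seeds with and without sampling.
greedy = {pick_token(logits, 0.0, random.Random(seed)) for seed in range(100)}
sampled = {pick_token(logits, 1.0, random.Random(seed)) for seed in range(100)}

print(greedy)   # one index only, whatever the seed
print(sampled)  # usually more than one index
```

Real inference stacks can still show small run-to-run differences for other reasons (hardware, threading, batching), so this only covers the role of the seed itself.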

deleted

Jun 28, 2024

This comment has been hidden (marked as Resolved)

ddh0

Jun 28, 2024

Results for pure bf16 test are up: Results_Meta-Llama-3-8B-Instruct-bf16.gguf.txt

I created a pull request to fix the formatting of the files. The current ones have the escape sequences written in plain text instead of rendered.

Thank you, but this is intentional and I don't think it's a problem

Can you also drop an additional text file that has the file sizes of all the models? Thanks again for running all of this.

Will do now

ddh0

Jun 28, 2024

Here is a text file with the sizes of each model in bytes (as outputted from ls -al on my machine): sizes.txt

ZeroWw

Jun 28, 2024

edited Jun 28, 2024

Here is a text file with the sizes of each model in bytes (as outputted from ls -al on my machine): sizes.txt

weird.. in your "sizes" I read:
7835472160 Jun 16 18:30 Meta-Llama-3-8B-Instruct-f16-q6_K.gguf

while my quantization is listed as:
7.84 GB

can you check if the file is the same?
https://huggingface.co/ZeroWw/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct.q6_k.gguf

I ask because I am not sure what makes my quantization better.. it might be anything.

I would suggest running tests that compare the F16 in my repository to the q5, q6 and q8 in the same directory.
Those are certainly the right files.

To obtain them I run a Colab notebook whose main part is this:

import os
import subprocess

repo_model_name = 'gradientai/Llama-3-8B-Instruct-Gradient-1048k' #@param ["mistralai/Mistral-7B-Instruct-v0.3", "lucyknada/microsoft_WizardLM-2-7B", "meta-llama/Meta-Llama-3-8B-Instruct", "BarraHome/Mistroll-7B-v2.2","Qwen/Qwen1.5-7B-Chat","microsoft/Phi-3-mini-128k-instruct","microsoft/Phi-3-medium-128k-instruct","google/gemma-7b",'zhengr/MixTAO-7Bx2-MoE-v8.1','CohereForAI/aya-23-8B','01-ai/Yi-1.5-9B-32K','deepseek-ai/DeepSeek-Coder-V2-Lite-Base','01-ai/Yi-1.5-6B-Chat','ZeusLabs/L3-Aethora-15B-V2','Nitral-AI/Hathor_Stable-v0.2-L3-8B'] {allow-input: true}
model_name = os.path.basename(repo_model_name)

# Download Model
print(f'Downloading {repo_model_name}')
subprocess.run(['huggingface-cli', 'download', repo_model_name, '--local-dir', model_name], stdout=subprocess.DEVNULL)

# Convert Model
print('Converting model to f16.')
subprocess.run(['python', 'llama.cpp/convert-hf-to-gguf.py', '--outtype', 'f16', model_name, '--outfile', f'{model_name}.f16.gguf'], stdout=subprocess.DEVNULL)

# Remove the original model directory (list form avoids shell quoting issues)
subprocess.run(['rm', '-rf', model_name])

# Quantize Model
quantization_types = ['q5_k', 'q6_k', 'q8_0']
for q_type in quantization_types:
    print(f'Quantizing {q_type}')
    subprocess.run(['./build/bin/llama-quantize', '--allow-requantize', '--output-tensor-type', 'f16', '--token-embedding-type', 'f16', f'{model_name}.f16.gguf', f'{model_name}.{q_type}.gguf', q_type, str(os.cpu_count())], stdout=subprocess.DEVNULL)

ddh0

Jun 28, 2024

7835472160 bytes is equal to 7.835 GB, which rounds up to 7.84GB

ZeroWw

Jun 28, 2024

7835472160 bytes is equal to 7.835 GB, which rounds up to 7.84GB

7835472160/1024/1024/1024 = 7.29 GB

ddh0

Jun 28, 2024

No, that's 7.29 Gibibytes (GiB), not gigabytes (GB). See here
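
The two figures come from dividing the same byte count by different units. A quick check of the arithmetic, using the size quoted above:

```python
size_bytes = 7835472160  # size quoted for Meta-Llama-3-8B-Instruct-f16-q6_K.gguf

gb = size_bytes / 10**9   # gigabytes (decimal, powers of 1000)
gib = size_bytes / 2**30  # gibibytes (binary, powers of 1024)

print(f"{gb:.2f} GB")    # 7.84 GB
print(f"{gib:.2f} GiB")  # 7.30 GiB (7.29 if truncated instead of rounded)
```

So 7.84 GB and roughly 7.3 GiB describe the same file.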

ZeroWw

Jun 28, 2024

No, that's 7.29 Gibibytes (GiB), not gigabytes (GB). See here

so you confirm your file has the same size in bytes?

ddh0

Jun 28, 2024

edited Jun 28, 2024

No, I do not confirm that. If you want to confirm that on your own, go ahead

Edit: I don't think that the exact file size in bytes is going to help you figure anything out, for what it's worth

ZeroWw

Jun 28, 2024

This is how the sizes should be:

-rw-r--r-- 1 root root 16068890912 Jun 28 05:55 Meta-Llama-3-8B-Instruct.f16.gguf
-rw-r--r-- 1 root root  7042224416 Jun 28 06:07 Meta-Llama-3-8B-Instruct.q5_k.gguf
-rw-r--r-- 1 root root  7835472160 Jun 28 06:15 Meta-Llama-3-8B-Instruct.q6_k.gguf
-rw-r--r-- 1 root root  9525776672 Jun 28 06:17 Meta-Llama-3-8B-Instruct.q8_0.gguf

ddh0

Jun 28, 2024

What is your point, exactly? I don't think my file needs to be the exact same size in bytes as yours. What are you getting at?

ZeroWw

Jun 28, 2024

What is your point, exactly? I don't think my file needs to be the exact same size in bytes as yours. What are you getting at?

No need to be snippy, but if the size is not the same it means the quantization process was different than the one I proposed. That's all.
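
If the goal is to confirm that two GGUF files are really the same, comparing sizes is only a first filter; a hash of the contents is a stronger check. A minimal sketch, where `sha256_of` and `same_file_contents` are hypothetical helper names and the paths are whatever files you have locally:

```python
import hashlib
import os

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so multi-GB models never load fully into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_file_contents(path_a, path_b):
    """Return True only if both files have the same size and the same SHA-256."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False  # different sizes can never be byte-identical
    return sha256_of(path_a) == sha256_of(path_b)
```

For example, `same_file_contents("Meta-Llama-3-8B-Instruct-f16-q6_K.gguf", "Meta-Llama-3-8B-Instruct.q6_k.gguf")` would compare the two files discussed here. Note the limits in both directions: equal sizes do not prove the same quantization recipe, and differing hashes could come from metadata alone rather than from different tensor data.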