Testing experimental quants

#2
by bartowski - opened

@ZeroWw try to compare these ones if you can

I'm going to be testing Meta-Llama-3-8B-Instruct-f16-q4_K_S.gguf against Meta-Llama-3-8B-Instruct-q4_K_S.gguf. I'll share any findings in this thread.

excellent, I appreciate it!!

Here is a repo with some results: ddh0/UnquantizedEmbeddingTesting

There are a couple files in the repo that are not detailed in the README, but there is some information there that may be interesting. Let me know if there are any specific models or tests that you'd like done.

TLDR: there is a measurable difference between models with unquantized vs. quantized embedding/output tensors, but how much that difference matters needs further investigation.

cc @ZeroWw

@HiroseKoichi okay, running f16-q6 vs q6 and f16-q8 vs q8 soon

Test results for f16-q6_K vs q6_K and f16-q8_0 vs q8_0 are available in the repo (still need to update the README)

My feedback on q8_0 vs. q8_1, based on a 21-question, 4200-token survey. Client = LM Studio, temp=0, topP=0.95, system prompt: Perform the task to the best of your ability.

The first shot from each was basically the same. After regenerating more than 3 times, there were some differences: 1. q8_1 followed the instructions better; q8_0 stopped responding after a summarization task in the middle. 2. The quality of the answered tasks was similar.

I suspect the q8_0 file is broken. I also downloaded and tried bartowski/tabula-8b-GGUF q8_0 and q8_0_L. I don't know what's wrong with them; neither works with LM Studio v0.2.25, with either the Llama3 or the ChatML preset.

Actually, there's a mismatch between the README and the file: the README says it was f16.Q4_K_S vs. Q4_K_S, but the file says it was f16 vs. f16.Q4_K_S. @ddh0 could you clarify which of the two it was?

@HiroseKoichi It's f16-q4_K_S vs. regular q4_K_S

Also, which models would you like me to compare?

Ah okay. I'll set that up

I want each model individually run on the 40 prompts so that they each have their own text file

@HiroseKoichi sorry for the delay, this is done now. Each model has its results in a separate file in the repo: ddh0/UnquantizedEmbeddingTesting

All 20 different quantizations are included, from q2_K to q8_0 to f16-q2_K to f16-q8_0. I'm very interested to see what differences you find

CC @bartowski @ZeroWw @helloAI333

Too many, because you used random seeds.
In a comparison like this the seed should be fixed, and you should also include some questions that require reasoning and some that involve creative writing.
That's because the output tensor affects the "way" the model expresses itself, while the embedding tensor affects its understanding more.
Also, add one test of the pure f16 model (convert the HF model to f16), for example:
python llama.cpp/convert-hf-to-gguf.py --outtype f16 "${model_name}" --outfile "${model_name}.f16.gguf"

That f16 model will serve as the "baseline".

Here you can find a bunch of models with the f16, f16.q5, f16.q6 and f16.q8 variants: https://huggingface.co/RobertSinclair

CC @ddh0 , @bartowski @helloAI333

too many because you used random seeds.

Don't think seeds are relevant in this case as I'm not doing any sampling

@ddh0
In general, no... but asking the same questions gives different results depending on the seed, and it's harder to determine how much a model has degraded if the seeds are random.

he's got temperature = 0.0 which means that seed doesn't play a role
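To illustrate the point about temperature, here is a minimal sketch of a toy sampler (not the code path of llama.cpp or any real backend). The function name `sample_token` and the logit values are made up for illustration. At temperature 0 the sampler takes the highest logit and never consults the random number generator, so the seed cannot change the result.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token index from raw logits (toy sampler, illustration only)."""
    if temperature == 0.0:
        # Greedy decoding: always take the highest logit; rng is never used.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature sampling: softmax over scaled logits, then a random draw.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [1.2, 3.4, 0.5, 3.1]  # made-up values, not from any model

# Try 100 different seeds at each temperature and collect the distinct picks.
greedy_picks = {sample_token(logits, 0.0, random.Random(seed)) for seed in range(100)}
sampled_picks = {sample_token(logits, 1.0, random.Random(seed)) for seed in range(100)}

print("temperature 0.0:", greedy_picks)   # a single index regardless of seed
print("temperature 1.0:", sampled_picks)  # more than one index across seeds
```

This only covers the sampling step. It says nothing about other possible sources of run-to-run variation in a real inference stack.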

Results for pure bf16 test are up: Results_Meta-Llama-3-8B-Instruct-bf16.gguf.txt
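One possible way to put a number on how far each quant's output drifts from the baseline is a plain text-similarity ratio. The sketch below uses Python's `difflib`. The strings are made-up placeholders, not actual results from the repo, and the layout of the real result files is not assumed here; you would need to load and split them yourself.

```python
import difflib

def similarity(baseline_text, candidate_text):
    """Ratio in [0, 1]; 1.0 means the two outputs are identical."""
    matcher = difflib.SequenceMatcher(None, baseline_text, candidate_text, autojunk=False)
    return matcher.ratio()

# Hypothetical outputs for one prompt (placeholders, NOT real test results).
baseline = "The capital of France is Paris."
outputs = {
    "f16-q4_K_S": "The capital of France is Paris.",
    "q4_K_S": "The capital of France is Paris, a city in Europe.",
}

scores = {name: similarity(baseline, text) for name, text in outputs.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

A character-level ratio is a crude measure: two answers can differ in wording and still be equally correct, so it is best treated as a first filter for prompts worth reading by hand.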

I created a pull request to fix the formatting of the files. The current ones have the escape sequences written in plain text instead of rendered.

Thank you, but this is intentional and I don't think it's a problem

Can you also drop an additional text file that has the file sizes of all the models? Thanks again for running all of this.

Will do now

Here is a text file with the sizes of each model in bytes (as outputted from ls -al on my machine): sizes.txt

weird.. in your "sizes" I read:
7835472160 Jun 16 18:30 Meta-Llama-3-8B-Instruct-f16-q6_K.gguf

while my quantization is listed as:
7.84 GB

can you check if the file is the same?
https://huggingface.co/ZeroWw/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct.q6_k.gguf

I ask because I am not sure what makes my quantization better... it might be anything.

I would suggest running tests that compare the f16 in my repository to the q5, q6 and q8 files in the same directory.
Those are definitely the right files.

To obtain them I run a Colab notebook whose main part is this:

import os
import subprocess

repo_model_name = 'gradientai/Llama-3-8B-Instruct-Gradient-1048k' #@param ["mistralai/Mistral-7B-Instruct-v0.3", "lucyknada/microsoft_WizardLM-2-7B", "meta-llama/Meta-Llama-3-8B-Instruct", "BarraHome/Mistroll-7B-v2.2","Qwen/Qwen1.5-7B-Chat","microsoft/Phi-3-mini-128k-instruct","microsoft/Phi-3-medium-128k-instruct","google/gemma-7b",'zhengr/MixTAO-7Bx2-MoE-v8.1','CohereForAI/aya-23-8B','01-ai/Yi-1.5-9B-32K','deepseek-ai/DeepSeek-Coder-V2-Lite-Base','01-ai/Yi-1.5-6B-Chat','ZeusLabs/L3-Aethora-15B-V2','Nitral-AI/Hathor_Stable-v0.2-L3-8B'] {allow-input: true}
model_name = os.path.basename(repo_model_name)

# Download Model
print(f'Downloading {repo_model_name}')
subprocess.run(['huggingface-cli', 'download', repo_model_name, '--local-dir', model_name], stdout=subprocess.DEVNULL)

# Convert Model
print('Converting model to f16.')
subprocess.run(['python', 'llama.cpp/convert-hf-to-gguf.py', '--outtype', 'f16', model_name, '--outfile', f'{model_name}.f16.gguf'], stdout=subprocess.DEVNULL)

# Remove the original model directory
subprocess.run(['rm', '-rf', model_name], check=True)

# Quantize Model
quantization_types = ['q5_k', 'q6_k', 'q8_0']
for q_type in quantization_types:
    print(f'Quantizing {q_type}')
    subprocess.run(['./build/bin/llama-quantize', '--allow-requantize', '--output-tensor-type', 'f16', '--token-embedding-type', 'f16', f'{model_name}.f16.gguf', f'{model_name}.{q_type}.gguf', q_type, str(os.cpu_count())], stdout=subprocess.DEVNULL)

7835472160 bytes is equal to 7.835 GB, which rounds up to 7.84GB

7835472160/1024/1024/1024 = 7.29 GB

No, that's 7.29 Gibibytes (GiB), not gigabytes (GB). See here
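The two figures come from dividing the same byte count by different units. A quick sketch of the arithmetic, using the byte count quoted in this thread:

```python
size_bytes = 7_835_472_160  # byte count quoted in this thread

size_gb = size_bytes / 1000**3   # gigabytes: decimal (SI) unit, 10^9 bytes
size_gib = size_bytes / 1024**3  # gibibytes: binary (IEC) unit, 2^30 bytes

print(f"{size_gb:.3f} GB")
print(f"{size_gib:.3f} GiB")
```

The GB figure is roughly 7.835, which rounds to 7.84. The GiB figure is roughly 7.297, which is the number quoted above as 7.29.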

so you confirm your file has the same size in bytes?

No, I do not confirm that. If you want to confirm that on your own, go ahead

Edit: I don't think that the exact file size in bytes is going to help you figure anything out, for what it's worth

This is how the sizes should be:

-rw-r--r-- 1 root root 16068890912 Jun 28 05:55 Meta-Llama-3-8B-Instruct.f16.gguf
-rw-r--r-- 1 root root  7042224416 Jun 28 06:07 Meta-Llama-3-8B-Instruct.q5_k.gguf
-rw-r--r-- 1 root root  7835472160 Jun 28 06:15 Meta-Llama-3-8B-Instruct.q6_k.gguf
-rw-r--r-- 1 root root  9525776672 Jun 28 06:17 Meta-Llama-3-8B-Instruct.q8_0.gguf

What is your point, exactly? I don't think my file needs to be the exact same size in bytes as yours. What are you getting at?

No need to be snippy, but if the size is not the same it means the quantization process was different than the one I proposed. That's all.
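If anyone wants to check whether two local GGUF files really are the same, the size alone is a weak signal: equal sizes do not prove equal contents. A minimal sketch that compares both size and SHA-256 is below. The helper names `file_fingerprint` and `same_file` are hypothetical, and the paths you pass in are whatever your local copies are called.

```python
import hashlib
import os

def file_fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, sha256_hex) for a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read 1 MiB at a time so multi-gigabyte files never sit fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()

def same_file(path_a, path_b):
    """True only if both files have identical size and identical contents."""
    return file_fingerprint(path_a) == file_fingerprint(path_b)
```

Usage would look like `same_file("mine.q6_k.gguf", "yours.q6_k.gguf")`, with both file names being placeholders. A mismatch shows that the files differ, but not which step of the process produced the difference.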
