magiccodingman committed on
Commit 7aaed15
1 Parent(s): df89ab8

Found better variant

This view is limited to 50 files because the commit contains too many changes. See the raw diff for the complete set.
Files changed (50)
  1. Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-Q5_K.gguf → Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q5_K.gguf +2 -2
  2. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_f16/llamabench.txt +11 -0
  3. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_f16/perplexity_code.txt +156 -0
  4. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_f16/perplexity_general.txt +156 -0
  5. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_f16/perplexity_math.txt +156 -0
  6. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_f16/ppl_corpus_code.txt +0 -0
  7. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_f16/ppl_corpus_general.txt +0 -0
  8. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_f16/ppl_corpus_math.txt +0 -0
  9. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_q6_k/llamabench.txt +11 -0
  10. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_q6_k/perplexity_code.txt +157 -0
  11. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_q6_k/perplexity_general.txt +157 -0
  12. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_q6_k/perplexity_math.txt +157 -0
  13. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_q6_k/ppl_corpus_code.txt +0 -0
  14. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_q6_k/ppl_corpus_general.txt +0 -0
  15. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_q6_k/ppl_corpus_math.txt +0 -0
  16. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-embd_f16/llamabench.txt +11 -0
  17. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-embd_f16/perplexity_code.txt +157 -0
  18. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-embd_f16/perplexity_general.txt +157 -0
  19. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-embd_f16/perplexity_math.txt +157 -0
  20. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-embd_f16/ppl_corpus_code.txt +0 -0
  21. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-embd_f16/ppl_corpus_general.txt +0 -0
  22. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-embd_f16/ppl_corpus_math.txt +0 -0
  23. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_f16/llamabench.txt +11 -0
  24. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_f16/perplexity_code.txt +157 -0
  25. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_f16/perplexity_general.txt +157 -0
  26. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_f16/perplexity_math.txt +157 -0
  27. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_f16/ppl_corpus_code.txt +0 -0
  28. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_f16/ppl_corpus_general.txt +0 -0
  29. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_f16/ppl_corpus_math.txt +0 -0
  30. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q5_K/llamabench.txt +11 -0
  31. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q5_K/perplexity_code.txt +156 -0
  32. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q5_K/perplexity_general.txt +156 -0
  33. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q5_K/perplexity_math.txt +156 -0
  34. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q5_K/ppl_corpus_code.txt +0 -0
  35. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q5_K/ppl_corpus_general.txt +0 -0
  36. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q5_K/ppl_corpus_math.txt +0 -0
  37. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q6_K/llamabench.txt +11 -0
  38. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q6_K/perplexity_code.txt +157 -0
  39. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q6_K/perplexity_general.txt +157 -0
  40. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q6_K/perplexity_math.txt +157 -0
  41. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q6_K/ppl_corpus_code.txt +0 -0
  42. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q6_K/ppl_corpus_general.txt +0 -0
  43. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q6_K/ppl_corpus_math.txt +0 -0
  44. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q6_k-embd_f16/llamabench.txt +11 -0
  45. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q6_k-embd_f16/perplexity_code.txt +157 -0
  46. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q6_k-embd_f16/perplexity_general.txt +157 -0
  47. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q6_k-embd_f16/perplexity_math.txt +157 -0
  48. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q6_k-embd_f16/ppl_corpus_code.txt +0 -0
  49. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q6_k-embd_f16/ppl_corpus_general.txt +0 -0
  50. Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q6_k-embd_f16/ppl_corpus_math.txt +0 -0
Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-Q5_K.gguf → Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q5_K.gguf RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:0e027d2983f9bfea5f5c6f7a547a0a0b85714e3d333916dd83f326f3490a21dd
- size 14459256032
+ oid sha256:ca12b7ef4a694002e3a1fbf69bd5e21419cb21c544bb3371a1dd199c5d91db08
+ size 13138050272
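
The new LFS pointer above records just two fields for the renamed GGUF: its SHA-256 and its size in bytes. A minimal Python sketch for checking a locally downloaded copy against those fields; the local path is a hypothetical example, and the expected values are copied from the pointer:

import hashlib
from pathlib import Path

# Hypothetical local path to the downloaded GGUF; adjust to wherever the file was saved.
path = Path("Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q5_K.gguf")

# Expected values copied from the new LFS pointer in the diff above.
expected_sha256 = "ca12b7ef4a694002e3a1fbf69bd5e21419cb21c544bb3371a1dd199c5d91db08"
expected_size = 13138050272

h = hashlib.sha256()
with path.open("rb") as f:
    # Hash in 1 MiB chunks so the ~13 GB file is never held in memory at once.
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)

assert path.stat().st_size == expected_size, "size does not match the LFS pointer"
assert h.hexdigest() == expected_sha256, "sha256 does not match the LFS pointer"
print("Downloaded GGUF matches the LFS pointer.")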
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_f16/llamabench.txt ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | llama 34B MXFP4 MoE | 19.41 GiB | 14.43 B | CUDA | 35 | pp8 | 53.08 ± 2.38 |
+ | llama 34B MXFP4 MoE | 19.41 GiB | 14.43 B | CUDA | 35 | tg128 | 7.44 ± 0.04 |
+
+ build: 92bb442ad (7040)
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_f16/perplexity_code.txt ADDED
@@ -0,0 +1,156 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21059 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 36 key-value pairs and 435 tensors from /mnt/world8/AI/Models/Apriel-1.5-15b-Thinker-unsloth/GGUF/MXFP4/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_f16.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = llama
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Apriel 1.5 15b Thinker Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Thinker-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Apriel-1.5
16
+ llama_model_loader: - kv 5: general.size_label str = 15B
17
+ llama_model_loader: - kv 6: general.license str = mit
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Apriel 1.5 15b Thinker
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ServiceNow AI
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ServiceNow-AI/...
22
+ llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: llama.block_count u32 = 48
24
+ llama_model_loader: - kv 13: llama.context_length u32 = 262400
25
+ llama_model_loader: - kv 14: llama.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336
27
+ llama_model_loader: - kv 16: llama.attention.head_count u32 = 32
28
+ llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: llama.rope.freq_base f32 = 1000000000.000000
30
+ llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
31
+ llama_model_loader: - kv 20: llama.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: llama.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: llama.vocab_size u32 = 131072
34
+ llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 128
35
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
36
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = pixtral
37
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
39
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
40
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
42
+ llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 0
43
+ llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 11
44
+ llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
45
+ llama_model_loader: - kv 34: general.quantization_version u32 = 2
46
+ llama_model_loader: - kv 35: general.file_type u32 = 38
47
+ llama_model_loader: - type f32: 97 tensors
48
+ llama_model_loader: - type f16: 98 tensors
49
+ llama_model_loader: - type q8_0: 240 tensors
50
+ print_info: file format = GGUF V3 (latest)
51
+ print_info: file type = MXFP4 MoE
52
+ print_info: file size = 19.41 GiB (11.55 BPW)
53
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
54
+ load: printing all EOG tokens:
55
+ load: - 2 ('</s>')
56
+ load: special tokens cache size = 1000
57
+ load: token to piece cache size = 0.8498 MB
58
+ print_info: arch = llama
59
+ print_info: vocab_only = 0
60
+ print_info: n_ctx_train = 262400
61
+ print_info: n_embd = 5120
62
+ print_info: n_embd_inp = 5120
63
+ print_info: n_layer = 48
64
+ print_info: n_head = 32
65
+ print_info: n_head_kv = 8
66
+ print_info: n_rot = 128
67
+ print_info: n_swa = 0
68
+ print_info: is_swa_any = 0
69
+ print_info: n_embd_head_k = 128
70
+ print_info: n_embd_head_v = 128
71
+ print_info: n_gqa = 4
72
+ print_info: n_embd_k_gqa = 1024
73
+ print_info: n_embd_v_gqa = 1024
74
+ print_info: f_norm_eps = 0.0e+00
75
+ print_info: f_norm_rms_eps = 1.0e-05
76
+ print_info: f_clamp_kqv = 0.0e+00
77
+ print_info: f_max_alibi_bias = 0.0e+00
78
+ print_info: f_logit_scale = 0.0e+00
79
+ print_info: f_attn_scale = 0.0e+00
80
+ print_info: n_ff = 14336
81
+ print_info: n_expert = 0
82
+ print_info: n_expert_used = 0
83
+ print_info: n_expert_groups = 0
84
+ print_info: n_group_used = 0
85
+ print_info: causal attn = 1
86
+ print_info: pooling type = 0
87
+ print_info: rope type = 0
88
+ print_info: rope scaling = linear
89
+ print_info: freq_base_train = 1000000000.0
90
+ print_info: freq_scale_train = 1
91
+ print_info: n_ctx_orig_yarn = 262400
92
+ print_info: rope_finetuned = unknown
93
+ print_info: model type = 34B
94
+ print_info: model params = 14.43 B
95
+ print_info: general.name = Apriel 1.5 15b Thinker Unsloth
96
+ print_info: vocab type = BPE
97
+ print_info: n_vocab = 131072
98
+ print_info: n_merges = 269443
99
+ print_info: BOS token = 1 '<s>'
100
+ print_info: EOS token = 2 '</s>'
101
+ print_info: UNK token = 0 '<unk>'
102
+ print_info: PAD token = 11 '<pad>'
103
+ print_info: LF token = 1010 'Ċ'
104
+ print_info: EOG token = 2 '</s>'
105
+ print_info: max token length = 150
106
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
107
+ load_tensors: offloading 20 repeating layers to GPU
108
+ load_tensors: offloaded 20/49 layers to GPU
109
+ load_tensors: CPU_Mapped model buffer size = 12658.61 MiB
110
+ load_tensors: CUDA0 model buffer size = 3606.64 MiB
111
+ load_tensors: CUDA1 model buffer size = 3606.64 MiB
112
+ ..........................................................................................
113
+ llama_context: constructing llama_context
114
+ llama_context: n_seq_max = 1
115
+ llama_context: n_ctx = 2048
116
+ llama_context: n_ctx_seq = 2048
117
+ llama_context: n_batch = 2048
118
+ llama_context: n_ubatch = 512
119
+ llama_context: causal_attn = 1
120
+ llama_context: flash_attn = auto
121
+ llama_context: kv_unified = false
122
+ llama_context: freq_base = 1000000000.0
123
+ llama_context: freq_scale = 1
124
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262400) -- the full capacity of the model will not be utilized
125
+ llama_context: CPU output buffer size = 0.50 MiB
126
+ llama_kv_cache: CPU KV buffer size = 224.00 MiB
127
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
128
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
129
+ llama_kv_cache: size = 384.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
130
+ llama_context: Flash Attention was auto, set to enabled
131
+ llama_context: CUDA0 compute buffer size = 1546.00 MiB
132
+ llama_context: CUDA1 compute buffer size = 116.01 MiB
133
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
134
+ llama_context: graph nodes = 1495
135
+ llama_context: graph splits = 313 (with bs=512), 4 (with bs=1)
136
+ common_init_from_params: added </s> logit bias = -inf
137
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
138
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
139
+
140
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
141
+ perplexity: tokenizing the input ..
142
+ perplexity: tokenization took 51.83 ms
143
+ perplexity: calculating perplexity over 46 chunks, n_ctx=2048, batch_size=2048, n_seq=1
144
+ perplexity: 4.56 seconds per pass - ETA 3.48 minutes
145
+ [1]3.4136,[2]2.9351,[3]2.0614,[4]1.8567,[5]1.6793,[6]1.8499,[7]1.9942,[8]2.0354,[9]1.9670,[10]1.8990,[11]1.8014,[12]1.8057,[13]1.7887,[14]1.7400,[15]1.6918,[16]1.7341,[17]1.7171,[18]1.6875,[19]1.6818,[20]1.6703,[21]1.6881,[22]1.7079,[23]1.6819,[24]1.6644,[25]1.6698,[26]1.6725,[27]1.6806,[28]1.6577,[29]1.6486,[30]1.6452,[31]1.6641,[32]1.6683,[33]1.6612,[34]1.6481,[35]1.6355,[36]1.6556,[37]1.6720,[38]1.7039,[39]1.7330,[40]1.7459,[41]1.7366,[42]1.7406,[43]1.7319,[44]1.7301,[45]1.7448,[46]1.7492,
146
+ Final estimate: PPL = 1.7492 +/- 0.01482
147
+
148
+ llama_perf_context_print: load time = 2703.89 ms
149
+ llama_perf_context_print: prompt eval time = 200634.63 ms / 94208 tokens ( 2.13 ms per token, 469.55 tokens per second)
150
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
151
+ llama_perf_context_print: total time = 201778.36 ms / 94209 tokens
152
+ llama_perf_context_print: graphs reused = 0
153
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
154
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 15770 + ( 5232 = 3606 + 80 + 1546) + 3103 |
155
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19668 + ( 3802 = 3606 + 80 + 116) + 653 |
156
+ llama_memory_breakdown_print: | - Host | 12896 = 12658 + 224 + 14 |
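
The bracketed values in the log above are running perplexity estimates over the chunks evaluated so far, and the final estimate is the exponential of the mean negative log-likelihood across all evaluated tokens. A minimal sketch of that relationship, using toy log-probabilities rather than values from this run:

import math

def perplexity(token_logprobs):
    # Perplexity = exp(mean negative log-likelihood) over the evaluated tokens.
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Toy usage with three natural-log token probabilities.
print(perplexity([-0.2, -0.9, -0.5]))  # ≈ 1.705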
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_f16/perplexity_general.txt ADDED
@@ -0,0 +1,156 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21061 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 36 key-value pairs and 435 tensors from /mnt/world8/AI/Models/Apriel-1.5-15b-Thinker-unsloth/GGUF/MXFP4/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_f16.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = llama
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Apriel 1.5 15b Thinker Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Thinker-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Apriel-1.5
16
+ llama_model_loader: - kv 5: general.size_label str = 15B
17
+ llama_model_loader: - kv 6: general.license str = mit
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Apriel 1.5 15b Thinker
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ServiceNow AI
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ServiceNow-AI/...
22
+ llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: llama.block_count u32 = 48
24
+ llama_model_loader: - kv 13: llama.context_length u32 = 262400
25
+ llama_model_loader: - kv 14: llama.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336
27
+ llama_model_loader: - kv 16: llama.attention.head_count u32 = 32
28
+ llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: llama.rope.freq_base f32 = 1000000000.000000
30
+ llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
31
+ llama_model_loader: - kv 20: llama.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: llama.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: llama.vocab_size u32 = 131072
34
+ llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 128
35
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
36
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = pixtral
37
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
39
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
40
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
42
+ llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 0
43
+ llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 11
44
+ llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
45
+ llama_model_loader: - kv 34: general.quantization_version u32 = 2
46
+ llama_model_loader: - kv 35: general.file_type u32 = 38
47
+ llama_model_loader: - type f32: 97 tensors
48
+ llama_model_loader: - type f16: 98 tensors
49
+ llama_model_loader: - type q8_0: 240 tensors
50
+ print_info: file format = GGUF V3 (latest)
51
+ print_info: file type = MXFP4 MoE
52
+ print_info: file size = 19.41 GiB (11.55 BPW)
53
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
54
+ load: printing all EOG tokens:
55
+ load: - 2 ('</s>')
56
+ load: special tokens cache size = 1000
57
+ load: token to piece cache size = 0.8498 MB
58
+ print_info: arch = llama
59
+ print_info: vocab_only = 0
60
+ print_info: n_ctx_train = 262400
61
+ print_info: n_embd = 5120
62
+ print_info: n_embd_inp = 5120
63
+ print_info: n_layer = 48
64
+ print_info: n_head = 32
65
+ print_info: n_head_kv = 8
66
+ print_info: n_rot = 128
67
+ print_info: n_swa = 0
68
+ print_info: is_swa_any = 0
69
+ print_info: n_embd_head_k = 128
70
+ print_info: n_embd_head_v = 128
71
+ print_info: n_gqa = 4
72
+ print_info: n_embd_k_gqa = 1024
73
+ print_info: n_embd_v_gqa = 1024
74
+ print_info: f_norm_eps = 0.0e+00
75
+ print_info: f_norm_rms_eps = 1.0e-05
76
+ print_info: f_clamp_kqv = 0.0e+00
77
+ print_info: f_max_alibi_bias = 0.0e+00
78
+ print_info: f_logit_scale = 0.0e+00
79
+ print_info: f_attn_scale = 0.0e+00
80
+ print_info: n_ff = 14336
81
+ print_info: n_expert = 0
82
+ print_info: n_expert_used = 0
83
+ print_info: n_expert_groups = 0
84
+ print_info: n_group_used = 0
85
+ print_info: causal attn = 1
86
+ print_info: pooling type = 0
87
+ print_info: rope type = 0
88
+ print_info: rope scaling = linear
89
+ print_info: freq_base_train = 1000000000.0
90
+ print_info: freq_scale_train = 1
91
+ print_info: n_ctx_orig_yarn = 262400
92
+ print_info: rope_finetuned = unknown
93
+ print_info: model type = 34B
94
+ print_info: model params = 14.43 B
95
+ print_info: general.name = Apriel 1.5 15b Thinker Unsloth
96
+ print_info: vocab type = BPE
97
+ print_info: n_vocab = 131072
98
+ print_info: n_merges = 269443
99
+ print_info: BOS token = 1 '<s>'
100
+ print_info: EOS token = 2 '</s>'
101
+ print_info: UNK token = 0 '<unk>'
102
+ print_info: PAD token = 11 '<pad>'
103
+ print_info: LF token = 1010 'Ċ'
104
+ print_info: EOG token = 2 '</s>'
105
+ print_info: max token length = 150
106
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
107
+ load_tensors: offloading 20 repeating layers to GPU
108
+ load_tensors: offloaded 20/49 layers to GPU
109
+ load_tensors: CPU_Mapped model buffer size = 12658.61 MiB
110
+ load_tensors: CUDA0 model buffer size = 3606.64 MiB
111
+ load_tensors: CUDA1 model buffer size = 3606.64 MiB
112
+ ..........................................................................................
113
+ llama_context: constructing llama_context
114
+ llama_context: n_seq_max = 1
115
+ llama_context: n_ctx = 2048
116
+ llama_context: n_ctx_seq = 2048
117
+ llama_context: n_batch = 2048
118
+ llama_context: n_ubatch = 512
119
+ llama_context: causal_attn = 1
120
+ llama_context: flash_attn = auto
121
+ llama_context: kv_unified = false
122
+ llama_context: freq_base = 1000000000.0
123
+ llama_context: freq_scale = 1
124
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262400) -- the full capacity of the model will not be utilized
125
+ llama_context: CPU output buffer size = 0.50 MiB
126
+ llama_kv_cache: CPU KV buffer size = 224.00 MiB
127
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
128
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
129
+ llama_kv_cache: size = 384.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
130
+ llama_context: Flash Attention was auto, set to enabled
131
+ llama_context: CUDA0 compute buffer size = 1546.00 MiB
132
+ llama_context: CUDA1 compute buffer size = 116.01 MiB
133
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
134
+ llama_context: graph nodes = 1495
135
+ llama_context: graph splits = 313 (with bs=512), 4 (with bs=1)
136
+ common_init_from_params: added </s> logit bias = -inf
137
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
138
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
139
+
140
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
141
+ perplexity: tokenizing the input ..
142
+ perplexity: tokenization took 15.979 ms
143
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
144
+ perplexity: 4.59 seconds per pass - ETA 1.13 minutes
145
+ [1]9.0436,[2]12.4154,[3]13.9554,[4]13.2144,[5]12.7211,[6]10.5789,[7]9.3769,[8]9.3216,[9]9.9598,[10]10.1406,[11]10.1676,[12]10.5637,[13]10.7144,[14]10.8155,[15]11.0375,
146
+ Final estimate: PPL = 11.0375 +/- 0.29443
147
+
148
+ llama_perf_context_print: load time = 2747.10 ms
149
+ llama_perf_context_print: prompt eval time = 65721.69 ms / 30720 tokens ( 2.14 ms per token, 467.43 tokens per second)
150
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
151
+ llama_perf_context_print: total time = 66095.98 ms / 30721 tokens
152
+ llama_perf_context_print: graphs reused = 0
153
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
154
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 15649 + ( 5232 = 3606 + 80 + 1546) + 3224 |
155
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19668 + ( 3802 = 3606 + 80 + 116) + 653 |
156
+ llama_memory_breakdown_print: | - Host | 12896 = 12658 + 224 + 14 |
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_f16/perplexity_math.txt ADDED
@@ -0,0 +1,156 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21178 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 36 key-value pairs and 435 tensors from /mnt/world8/AI/Models/Apriel-1.5-15b-Thinker-unsloth/GGUF/MXFP4/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_f16.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = llama
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Apriel 1.5 15b Thinker Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Thinker-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Apriel-1.5
16
+ llama_model_loader: - kv 5: general.size_label str = 15B
17
+ llama_model_loader: - kv 6: general.license str = mit
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Apriel 1.5 15b Thinker
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ServiceNow AI
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ServiceNow-AI/...
22
+ llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: llama.block_count u32 = 48
24
+ llama_model_loader: - kv 13: llama.context_length u32 = 262400
25
+ llama_model_loader: - kv 14: llama.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336
27
+ llama_model_loader: - kv 16: llama.attention.head_count u32 = 32
28
+ llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: llama.rope.freq_base f32 = 1000000000.000000
30
+ llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
31
+ llama_model_loader: - kv 20: llama.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: llama.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: llama.vocab_size u32 = 131072
34
+ llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 128
35
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
36
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = pixtral
37
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
39
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
40
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
42
+ llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 0
43
+ llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 11
44
+ llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
45
+ llama_model_loader: - kv 34: general.quantization_version u32 = 2
46
+ llama_model_loader: - kv 35: general.file_type u32 = 38
47
+ llama_model_loader: - type f32: 97 tensors
48
+ llama_model_loader: - type f16: 98 tensors
49
+ llama_model_loader: - type q8_0: 240 tensors
50
+ print_info: file format = GGUF V3 (latest)
51
+ print_info: file type = MXFP4 MoE
52
+ print_info: file size = 19.41 GiB (11.55 BPW)
53
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
54
+ load: printing all EOG tokens:
55
+ load: - 2 ('</s>')
56
+ load: special tokens cache size = 1000
57
+ load: token to piece cache size = 0.8498 MB
58
+ print_info: arch = llama
59
+ print_info: vocab_only = 0
60
+ print_info: n_ctx_train = 262400
61
+ print_info: n_embd = 5120
62
+ print_info: n_embd_inp = 5120
63
+ print_info: n_layer = 48
64
+ print_info: n_head = 32
65
+ print_info: n_head_kv = 8
66
+ print_info: n_rot = 128
67
+ print_info: n_swa = 0
68
+ print_info: is_swa_any = 0
69
+ print_info: n_embd_head_k = 128
70
+ print_info: n_embd_head_v = 128
71
+ print_info: n_gqa = 4
72
+ print_info: n_embd_k_gqa = 1024
73
+ print_info: n_embd_v_gqa = 1024
74
+ print_info: f_norm_eps = 0.0e+00
75
+ print_info: f_norm_rms_eps = 1.0e-05
76
+ print_info: f_clamp_kqv = 0.0e+00
77
+ print_info: f_max_alibi_bias = 0.0e+00
78
+ print_info: f_logit_scale = 0.0e+00
79
+ print_info: f_attn_scale = 0.0e+00
80
+ print_info: n_ff = 14336
81
+ print_info: n_expert = 0
82
+ print_info: n_expert_used = 0
83
+ print_info: n_expert_groups = 0
84
+ print_info: n_group_used = 0
85
+ print_info: causal attn = 1
86
+ print_info: pooling type = 0
87
+ print_info: rope type = 0
88
+ print_info: rope scaling = linear
89
+ print_info: freq_base_train = 1000000000.0
90
+ print_info: freq_scale_train = 1
91
+ print_info: n_ctx_orig_yarn = 262400
92
+ print_info: rope_finetuned = unknown
93
+ print_info: model type = 34B
94
+ print_info: model params = 14.43 B
95
+ print_info: general.name = Apriel 1.5 15b Thinker Unsloth
96
+ print_info: vocab type = BPE
97
+ print_info: n_vocab = 131072
98
+ print_info: n_merges = 269443
99
+ print_info: BOS token = 1 '<s>'
100
+ print_info: EOS token = 2 '</s>'
101
+ print_info: UNK token = 0 '<unk>'
102
+ print_info: PAD token = 11 '<pad>'
103
+ print_info: LF token = 1010 'Ċ'
104
+ print_info: EOG token = 2 '</s>'
105
+ print_info: max token length = 150
106
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
107
+ load_tensors: offloading 20 repeating layers to GPU
108
+ load_tensors: offloaded 20/49 layers to GPU
109
+ load_tensors: CPU_Mapped model buffer size = 12658.61 MiB
110
+ load_tensors: CUDA0 model buffer size = 3606.64 MiB
111
+ load_tensors: CUDA1 model buffer size = 3606.64 MiB
112
+ ..........................................................................................
113
+ llama_context: constructing llama_context
114
+ llama_context: n_seq_max = 1
115
+ llama_context: n_ctx = 2048
116
+ llama_context: n_ctx_seq = 2048
117
+ llama_context: n_batch = 2048
118
+ llama_context: n_ubatch = 512
119
+ llama_context: causal_attn = 1
120
+ llama_context: flash_attn = auto
121
+ llama_context: kv_unified = false
122
+ llama_context: freq_base = 1000000000.0
123
+ llama_context: freq_scale = 1
124
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262400) -- the full capacity of the model will not be utilized
125
+ llama_context: CPU output buffer size = 0.50 MiB
126
+ llama_kv_cache: CPU KV buffer size = 224.00 MiB
127
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
128
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
129
+ llama_kv_cache: size = 384.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
130
+ llama_context: Flash Attention was auto, set to enabled
131
+ llama_context: CUDA0 compute buffer size = 1546.00 MiB
132
+ llama_context: CUDA1 compute buffer size = 116.01 MiB
133
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
134
+ llama_context: graph nodes = 1495
135
+ llama_context: graph splits = 313 (with bs=512), 4 (with bs=1)
136
+ common_init_from_params: added </s> logit bias = -inf
137
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
138
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
139
+
140
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
141
+ perplexity: tokenizing the input ..
142
+ perplexity: tokenization took 15.144 ms
143
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
144
+ perplexity: 4.55 seconds per pass - ETA 1.20 minutes
145
+ [1]7.5825,[2]8.3888,[3]8.8359,[4]9.4977,[5]9.5945,[6]9.6271,[7]9.6521,[8]9.5520,[9]9.5596,[10]9.5182,[11]9.4792,[12]9.5394,[13]9.6370,[14]9.7515,[15]9.7375,[16]9.6180,
146
+ Final estimate: PPL = 9.6180 +/- 0.24558
147
+
148
+ llama_perf_context_print: load time = 2832.00 ms
149
+ llama_perf_context_print: prompt eval time = 70405.82 ms / 32768 tokens ( 2.15 ms per token, 465.42 tokens per second)
150
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
151
+ llama_perf_context_print: total time = 70805.72 ms / 32769 tokens
152
+ llama_perf_context_print: graphs reused = 0
153
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
154
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 15778 + ( 5232 = 3606 + 80 + 1546) + 3095 |
155
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19668 + ( 3802 = 3606 + 80 + 116) + 653 |
156
+ llama_memory_breakdown_print: | - Host | 12896 = 12658 + 224 + 14 |
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_f16/ppl_corpus_code.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_f16/ppl_corpus_general.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_f16/ppl_corpus_math.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_q6_k/llamabench.txt ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | llama 34B MXFP4 MoE | 14.80 GiB | 14.43 B | CUDA | 35 | pp8 | 63.92 ± 1.62 |
+ | llama 34B MXFP4 MoE | 14.80 GiB | 14.43 B | CUDA | 35 | tg128 | 9.06 ± 0.01 |
+
+ build: 92bb442ad (7040)
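
For a quick side-by-side of the two llama-bench tables in this commit (the 19.41 GiB output_f16-router_gate_emb_f16 build above versus this 14.80 GiB output_f16-router_gate_emb_q6_k build), a small sketch computing the relative throughput, with the tokens-per-second numbers copied from the tables:

# Throughput (t/s) copied from the two llamabench.txt tables in this commit.
f16_variant = {"pp8": 53.08, "tg128": 7.44}   # output_f16-router_gate_emb_f16, 19.41 GiB
q6k_variant = {"pp8": 63.92, "tg128": 9.06}   # output_f16-router_gate_emb_q6_k, 14.80 GiB

for test, baseline in f16_variant.items():
    ratio = q6k_variant[test] / baseline
    print(f"{test}: {ratio:.2f}x the f16-variant throughput")
# pp8: 1.20x, tg128: 1.22x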
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_q6_k/perplexity_code.txt ADDED
@@ -0,0 +1,157 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21191 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 36 key-value pairs and 435 tensors from /mnt/world8/AI/Models/Apriel-1.5-15b-Thinker-unsloth/GGUF/MXFP4/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_q6_k.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = llama
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Apriel 1.5 15b Thinker Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Thinker-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Apriel-1.5
16
+ llama_model_loader: - kv 5: general.size_label str = 15B
17
+ llama_model_loader: - kv 6: general.license str = mit
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Apriel 1.5 15b Thinker
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ServiceNow AI
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ServiceNow-AI/...
22
+ llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: llama.block_count u32 = 48
24
+ llama_model_loader: - kv 13: llama.context_length u32 = 262400
25
+ llama_model_loader: - kv 14: llama.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336
27
+ llama_model_loader: - kv 16: llama.attention.head_count u32 = 32
28
+ llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: llama.rope.freq_base f32 = 1000000000.000000
30
+ llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
31
+ llama_model_loader: - kv 20: llama.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: llama.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: llama.vocab_size u32 = 131072
34
+ llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 128
35
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
36
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = pixtral
37
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
39
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
40
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
42
+ llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 0
43
+ llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 11
44
+ llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
45
+ llama_model_loader: - kv 34: general.quantization_version u32 = 2
46
+ llama_model_loader: - kv 35: general.file_type u32 = 38
47
+ llama_model_loader: - type f32: 97 tensors
48
+ llama_model_loader: - type f16: 49 tensors
49
+ llama_model_loader: - type q8_0: 240 tensors
50
+ llama_model_loader: - type q6_K: 49 tensors
51
+ print_info: file format = GGUF V3 (latest)
52
+ print_info: file type = MXFP4 MoE
53
+ print_info: file size = 14.80 GiB (8.81 BPW)
54
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
55
+ load: printing all EOG tokens:
56
+ load: - 2 ('</s>')
57
+ load: special tokens cache size = 1000
58
+ load: token to piece cache size = 0.8498 MB
59
+ print_info: arch = llama
60
+ print_info: vocab_only = 0
61
+ print_info: n_ctx_train = 262400
62
+ print_info: n_embd = 5120
63
+ print_info: n_embd_inp = 5120
64
+ print_info: n_layer = 48
65
+ print_info: n_head = 32
66
+ print_info: n_head_kv = 8
67
+ print_info: n_rot = 128
68
+ print_info: n_swa = 0
69
+ print_info: is_swa_any = 0
70
+ print_info: n_embd_head_k = 128
71
+ print_info: n_embd_head_v = 128
72
+ print_info: n_gqa = 4
73
+ print_info: n_embd_k_gqa = 1024
74
+ print_info: n_embd_v_gqa = 1024
75
+ print_info: f_norm_eps = 0.0e+00
76
+ print_info: f_norm_rms_eps = 1.0e-05
77
+ print_info: f_clamp_kqv = 0.0e+00
78
+ print_info: f_max_alibi_bias = 0.0e+00
79
+ print_info: f_logit_scale = 0.0e+00
80
+ print_info: f_attn_scale = 0.0e+00
81
+ print_info: n_ff = 14336
82
+ print_info: n_expert = 0
83
+ print_info: n_expert_used = 0
84
+ print_info: n_expert_groups = 0
85
+ print_info: n_group_used = 0
86
+ print_info: causal attn = 1
87
+ print_info: pooling type = 0
88
+ print_info: rope type = 0
89
+ print_info: rope scaling = linear
90
+ print_info: freq_base_train = 1000000000.0
91
+ print_info: freq_scale_train = 1
92
+ print_info: n_ctx_orig_yarn = 262400
93
+ print_info: rope_finetuned = unknown
94
+ print_info: model type = 34B
95
+ print_info: model params = 14.43 B
96
+ print_info: general.name = Apriel 1.5 15b Thinker Unsloth
97
+ print_info: vocab type = BPE
98
+ print_info: n_vocab = 131072
99
+ print_info: n_merges = 269443
100
+ print_info: BOS token = 1 '<s>'
101
+ print_info: EOS token = 2 '</s>'
102
+ print_info: UNK token = 0 '<unk>'
103
+ print_info: PAD token = 11 '<pad>'
104
+ print_info: LF token = 1010 'Ċ'
105
+ print_info: EOG token = 2 '</s>'
106
+ print_info: max token length = 150
107
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
108
+ load_tensors: offloading 20 repeating layers to GPU
109
+ load_tensors: offloaded 20/49 layers to GPU
110
+ load_tensors: CPU_Mapped model buffer size = 9591.43 MiB
111
+ load_tensors: CUDA0 model buffer size = 2780.86 MiB
112
+ load_tensors: CUDA1 model buffer size = 2780.86 MiB
113
+ ...........................................................................................
114
+ llama_context: constructing llama_context
115
+ llama_context: n_seq_max = 1
116
+ llama_context: n_ctx = 2048
117
+ llama_context: n_ctx_seq = 2048
118
+ llama_context: n_batch = 2048
119
+ llama_context: n_ubatch = 512
120
+ llama_context: causal_attn = 1
121
+ llama_context: flash_attn = auto
122
+ llama_context: kv_unified = false
123
+ llama_context: freq_base = 1000000000.0
124
+ llama_context: freq_scale = 1
125
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262400) -- the full capacity of the model will not be utilized
126
+ llama_context: CPU output buffer size = 0.50 MiB
127
+ llama_kv_cache: CPU KV buffer size = 224.00 MiB
128
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
129
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
130
+ llama_kv_cache: size = 384.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
131
+ llama_context: Flash Attention was auto, set to enabled
132
+ llama_context: CUDA0 compute buffer size = 1546.00 MiB
133
+ llama_context: CUDA1 compute buffer size = 116.01 MiB
134
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
135
+ llama_context: graph nodes = 1495
136
+ llama_context: graph splits = 313 (with bs=512), 4 (with bs=1)
137
+ common_init_from_params: added </s> logit bias = -inf
138
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
139
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
140
+
141
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
142
+ perplexity: tokenizing the input ..
143
+ perplexity: tokenization took 54.672 ms
144
+ perplexity: calculating perplexity over 46 chunks, n_ctx=2048, batch_size=2048, n_seq=1
145
+ perplexity: 3.98 seconds per pass - ETA 3.05 minutes
146
+ [1]3.4033,[2]2.9322,[3]2.0600,[4]1.8552,[5]1.6779,[6]1.8485,[7]1.9924,[8]2.0338,[9]1.9660,[10]1.8988,[11]1.8007,[12]1.8050,[13]1.7879,[14]1.7393,[15]1.6910,[16]1.7337,[17]1.7171,[18]1.6875,[19]1.6817,[20]1.6702,[21]1.6878,[22]1.7077,[23]1.6819,[24]1.6641,[25]1.6697,[26]1.6723,[27]1.6804,[28]1.6576,[29]1.6485,[30]1.6451,[31]1.6640,[32]1.6684,[33]1.6614,[34]1.6485,[35]1.6358,[36]1.6563,[37]1.6729,[38]1.7049,[39]1.7341,[40]1.7468,[41]1.7375,[42]1.7416,[43]1.7328,[44]1.7309,[45]1.7457,[46]1.7502,
147
+ Final estimate: PPL = 1.7502 +/- 0.01485
148
+
149
+ llama_perf_context_print: load time = 2077.64 ms
150
+ llama_perf_context_print: prompt eval time = 169319.37 ms / 94208 tokens ( 1.80 ms per token, 556.39 tokens per second)
151
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
152
+ llama_perf_context_print: total time = 170476.84 ms / 94209 tokens
153
+ llama_perf_context_print: graphs reused = 0
154
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
155
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 16605 + (4406 = 2780 + 80 + 1546) + 3095 |
156
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20494 + (2976 = 2780 + 80 + 116) + 653 |
157
+ llama_memory_breakdown_print: | - Host | 9829 = 9591 + 224 + 14 |
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_q6_k/perplexity_general.txt ADDED
@@ -0,0 +1,157 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21191 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 36 key-value pairs and 435 tensors from /mnt/world8/AI/Models/Apriel-1.5-15b-Thinker-unsloth/GGUF/MXFP4/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_q6_k.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = llama
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Apriel 1.5 15b Thinker Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Thinker-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Apriel-1.5
16
+ llama_model_loader: - kv 5: general.size_label str = 15B
17
+ llama_model_loader: - kv 6: general.license str = mit
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Apriel 1.5 15b Thinker
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ServiceNow AI
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ServiceNow-AI/...
22
+ llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: llama.block_count u32 = 48
24
+ llama_model_loader: - kv 13: llama.context_length u32 = 262400
25
+ llama_model_loader: - kv 14: llama.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336
27
+ llama_model_loader: - kv 16: llama.attention.head_count u32 = 32
28
+ llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: llama.rope.freq_base f32 = 1000000000.000000
30
+ llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
31
+ llama_model_loader: - kv 20: llama.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: llama.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: llama.vocab_size u32 = 131072
34
+ llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 128
35
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
36
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = pixtral
37
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
39
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
40
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
42
+ llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 0
43
+ llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 11
44
+ llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
45
+ llama_model_loader: - kv 34: general.quantization_version u32 = 2
46
+ llama_model_loader: - kv 35: general.file_type u32 = 38
47
+ llama_model_loader: - type f32: 97 tensors
48
+ llama_model_loader: - type f16: 49 tensors
49
+ llama_model_loader: - type q8_0: 240 tensors
50
+ llama_model_loader: - type q6_K: 49 tensors
51
+ print_info: file format = GGUF V3 (latest)
52
+ print_info: file type = MXFP4 MoE
53
+ print_info: file size = 14.80 GiB (8.81 BPW)
54
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
55
+ load: printing all EOG tokens:
56
+ load: - 2 ('</s>')
57
+ load: special tokens cache size = 1000
58
+ load: token to piece cache size = 0.8498 MB
59
+ print_info: arch = llama
60
+ print_info: vocab_only = 0
61
+ print_info: n_ctx_train = 262400
62
+ print_info: n_embd = 5120
63
+ print_info: n_embd_inp = 5120
64
+ print_info: n_layer = 48
65
+ print_info: n_head = 32
66
+ print_info: n_head_kv = 8
67
+ print_info: n_rot = 128
68
+ print_info: n_swa = 0
69
+ print_info: is_swa_any = 0
70
+ print_info: n_embd_head_k = 128
71
+ print_info: n_embd_head_v = 128
72
+ print_info: n_gqa = 4
73
+ print_info: n_embd_k_gqa = 1024
74
+ print_info: n_embd_v_gqa = 1024
75
+ print_info: f_norm_eps = 0.0e+00
76
+ print_info: f_norm_rms_eps = 1.0e-05
77
+ print_info: f_clamp_kqv = 0.0e+00
78
+ print_info: f_max_alibi_bias = 0.0e+00
79
+ print_info: f_logit_scale = 0.0e+00
80
+ print_info: f_attn_scale = 0.0e+00
81
+ print_info: n_ff = 14336
82
+ print_info: n_expert = 0
83
+ print_info: n_expert_used = 0
84
+ print_info: n_expert_groups = 0
85
+ print_info: n_group_used = 0
86
+ print_info: causal attn = 1
87
+ print_info: pooling type = 0
88
+ print_info: rope type = 0
89
+ print_info: rope scaling = linear
90
+ print_info: freq_base_train = 1000000000.0
91
+ print_info: freq_scale_train = 1
92
+ print_info: n_ctx_orig_yarn = 262400
93
+ print_info: rope_finetuned = unknown
94
+ print_info: model type = 34B
95
+ print_info: model params = 14.43 B
96
+ print_info: general.name = Apriel 1.5 15b Thinker Unsloth
97
+ print_info: vocab type = BPE
98
+ print_info: n_vocab = 131072
99
+ print_info: n_merges = 269443
100
+ print_info: BOS token = 1 '<s>'
101
+ print_info: EOS token = 2 '</s>'
102
+ print_info: UNK token = 0 '<unk>'
103
+ print_info: PAD token = 11 '<pad>'
104
+ print_info: LF token = 1010 'Ċ'
105
+ print_info: EOG token = 2 '</s>'
106
+ print_info: max token length = 150
107
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
108
+ load_tensors: offloading 20 repeating layers to GPU
109
+ load_tensors: offloaded 20/49 layers to GPU
110
+ load_tensors: CPU_Mapped model buffer size = 9591.43 MiB
111
+ load_tensors: CUDA0 model buffer size = 2780.86 MiB
112
+ load_tensors: CUDA1 model buffer size = 2780.86 MiB
113
+ ...........................................................................................
114
+ llama_context: constructing llama_context
115
+ llama_context: n_seq_max = 1
116
+ llama_context: n_ctx = 2048
117
+ llama_context: n_ctx_seq = 2048
118
+ llama_context: n_batch = 2048
119
+ llama_context: n_ubatch = 512
120
+ llama_context: causal_attn = 1
121
+ llama_context: flash_attn = auto
122
+ llama_context: kv_unified = false
123
+ llama_context: freq_base = 1000000000.0
124
+ llama_context: freq_scale = 1
125
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262400) -- the full capacity of the model will not be utilized
126
+ llama_context: CPU output buffer size = 0.50 MiB
127
+ llama_kv_cache: CPU KV buffer size = 224.00 MiB
128
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
129
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
130
+ llama_kv_cache: size = 384.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
131
+ llama_context: Flash Attention was auto, set to enabled
132
+ llama_context: CUDA0 compute buffer size = 1546.00 MiB
133
+ llama_context: CUDA1 compute buffer size = 116.01 MiB
134
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
135
+ llama_context: graph nodes = 1495
136
+ llama_context: graph splits = 313 (with bs=512), 4 (with bs=1)
137
+ common_init_from_params: added </s> logit bias = -inf
138
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
139
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
140
+
141
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
142
+ perplexity: tokenizing the input ..
143
+ perplexity: tokenization took 15.567 ms
144
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
145
+ perplexity: 4.04 seconds per pass - ETA 1.00 minutes
146
+ [1]9.0192,[2]12.4326,[3]13.9872,[4]13.2406,[5]12.7098,[6]10.5774,[7]9.3714,[8]9.3172,[9]9.9622,[10]10.1364,[11]10.1667,[12]10.5604,[13]10.7088,[14]10.8127,[15]11.0324,
147
+ Final estimate: PPL = 11.0324 +/- 0.29406
148
+
149
+ llama_perf_context_print: load time = 2139.46 ms
150
+ llama_perf_context_print: prompt eval time = 55439.95 ms / 30720 tokens ( 1.80 ms per token, 554.11 tokens per second)
151
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
152
+ llama_perf_context_print: total time = 55824.65 ms / 30721 tokens
153
+ llama_perf_context_print: graphs reused = 0
154
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
155
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 16607 + (4406 = 2780 + 80 + 1546) + 3093 |
156
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20494 + (2976 = 2780 + 80 + 116) + 653 |
157
+ llama_memory_breakdown_print: | - Host | 9829 = 9591 + 224 + 14 |
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_q6_k/perplexity_math.txt ADDED
@@ -0,0 +1,157 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21166 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 36 key-value pairs and 435 tensors from /mnt/world8/AI/Models/Apriel-1.5-15b-Thinker-unsloth/GGUF/MXFP4/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_q6_k.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = llama
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Apriel 1.5 15b Thinker Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Thinker-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Apriel-1.5
16
+ llama_model_loader: - kv 5: general.size_label str = 15B
17
+ llama_model_loader: - kv 6: general.license str = mit
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Apriel 1.5 15b Thinker
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ServiceNow AI
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ServiceNow-AI/...
22
+ llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: llama.block_count u32 = 48
24
+ llama_model_loader: - kv 13: llama.context_length u32 = 262400
25
+ llama_model_loader: - kv 14: llama.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336
27
+ llama_model_loader: - kv 16: llama.attention.head_count u32 = 32
28
+ llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: llama.rope.freq_base f32 = 1000000000.000000
30
+ llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
31
+ llama_model_loader: - kv 20: llama.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: llama.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: llama.vocab_size u32 = 131072
34
+ llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 128
35
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
36
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = pixtral
37
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
39
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
40
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
42
+ llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 0
43
+ llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 11
44
+ llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
45
+ llama_model_loader: - kv 34: general.quantization_version u32 = 2
46
+ llama_model_loader: - kv 35: general.file_type u32 = 38
47
+ llama_model_loader: - type f32: 97 tensors
48
+ llama_model_loader: - type f16: 49 tensors
49
+ llama_model_loader: - type q8_0: 240 tensors
50
+ llama_model_loader: - type q6_K: 49 tensors
51
+ print_info: file format = GGUF V3 (latest)
52
+ print_info: file type = MXFP4 MoE
53
+ print_info: file size = 14.80 GiB (8.81 BPW)
54
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
55
+ load: printing all EOG tokens:
56
+ load: - 2 ('</s>')
57
+ load: special tokens cache size = 1000
58
+ load: token to piece cache size = 0.8498 MB
59
+ print_info: arch = llama
60
+ print_info: vocab_only = 0
61
+ print_info: n_ctx_train = 262400
62
+ print_info: n_embd = 5120
63
+ print_info: n_embd_inp = 5120
64
+ print_info: n_layer = 48
65
+ print_info: n_head = 32
66
+ print_info: n_head_kv = 8
67
+ print_info: n_rot = 128
68
+ print_info: n_swa = 0
69
+ print_info: is_swa_any = 0
70
+ print_info: n_embd_head_k = 128
71
+ print_info: n_embd_head_v = 128
72
+ print_info: n_gqa = 4
73
+ print_info: n_embd_k_gqa = 1024
74
+ print_info: n_embd_v_gqa = 1024
75
+ print_info: f_norm_eps = 0.0e+00
76
+ print_info: f_norm_rms_eps = 1.0e-05
77
+ print_info: f_clamp_kqv = 0.0e+00
78
+ print_info: f_max_alibi_bias = 0.0e+00
79
+ print_info: f_logit_scale = 0.0e+00
80
+ print_info: f_attn_scale = 0.0e+00
81
+ print_info: n_ff = 14336
82
+ print_info: n_expert = 0
83
+ print_info: n_expert_used = 0
84
+ print_info: n_expert_groups = 0
85
+ print_info: n_group_used = 0
86
+ print_info: causal attn = 1
87
+ print_info: pooling type = 0
88
+ print_info: rope type = 0
89
+ print_info: rope scaling = linear
90
+ print_info: freq_base_train = 1000000000.0
91
+ print_info: freq_scale_train = 1
92
+ print_info: n_ctx_orig_yarn = 262400
93
+ print_info: rope_finetuned = unknown
94
+ print_info: model type = 34B
95
+ print_info: model params = 14.43 B
96
+ print_info: general.name = Apriel 1.5 15b Thinker Unsloth
97
+ print_info: vocab type = BPE
98
+ print_info: n_vocab = 131072
99
+ print_info: n_merges = 269443
100
+ print_info: BOS token = 1 '<s>'
101
+ print_info: EOS token = 2 '</s>'
102
+ print_info: UNK token = 0 '<unk>'
103
+ print_info: PAD token = 11 '<pad>'
104
+ print_info: LF token = 1010 'Ċ'
105
+ print_info: EOG token = 2 '</s>'
106
+ print_info: max token length = 150
107
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
108
+ load_tensors: offloading 20 repeating layers to GPU
109
+ load_tensors: offloaded 20/49 layers to GPU
110
+ load_tensors: CPU_Mapped model buffer size = 9591.43 MiB
111
+ load_tensors: CUDA0 model buffer size = 2780.86 MiB
112
+ load_tensors: CUDA1 model buffer size = 2780.86 MiB
113
+ ...........................................................................................
114
+ llama_context: constructing llama_context
115
+ llama_context: n_seq_max = 1
116
+ llama_context: n_ctx = 2048
117
+ llama_context: n_ctx_seq = 2048
118
+ llama_context: n_batch = 2048
119
+ llama_context: n_ubatch = 512
120
+ llama_context: causal_attn = 1
121
+ llama_context: flash_attn = auto
122
+ llama_context: kv_unified = false
123
+ llama_context: freq_base = 1000000000.0
124
+ llama_context: freq_scale = 1
125
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262400) -- the full capacity of the model will not be utilized
126
+ llama_context: CPU output buffer size = 0.50 MiB
127
+ llama_kv_cache: CPU KV buffer size = 224.00 MiB
128
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
129
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
130
+ llama_kv_cache: size = 384.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
131
+ llama_context: Flash Attention was auto, set to enabled
132
+ llama_context: CUDA0 compute buffer size = 1546.00 MiB
133
+ llama_context: CUDA1 compute buffer size = 116.01 MiB
134
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
135
+ llama_context: graph nodes = 1495
136
+ llama_context: graph splits = 313 (with bs=512), 4 (with bs=1)
137
+ common_init_from_params: added </s> logit bias = -inf
138
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
139
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
140
+
141
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
142
+ perplexity: tokenizing the input ..
143
+ perplexity: tokenization took 15.49 ms
144
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
145
+ perplexity: 3.90 seconds per pass - ETA 1.03 minutes
146
+ [1]7.5819,[2]8.3894,[3]8.8283,[4]9.4902,[5]9.5894,[6]9.6210,[7]9.6532,[8]9.5578,[9]9.5600,[10]9.5171,[11]9.4811,[12]9.5417,[13]9.6396,[14]9.7530,[15]9.7356,[16]9.6194,
147
+ Final estimate: PPL = 9.6194 +/- 0.24552
148
+
149
+ llama_perf_context_print: load time = 2057.99 ms
150
+ llama_perf_context_print: prompt eval time = 59177.95 ms / 32768 tokens ( 1.81 ms per token, 553.72 tokens per second)
151
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
152
+ llama_perf_context_print: total time = 59578.99 ms / 32769 tokens
153
+ llama_perf_context_print: graphs reused = 0
154
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
155
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 16599 + (4406 = 2780 + 80 + 1546) + 3101 |
156
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20494 + (2976 = 2780 + 80 + 116) + 653 |
157
+ llama_memory_breakdown_print: | - Host | 9829 = 9591 + 224 + 14 |
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_q6_k/ppl_corpus_code.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_q6_k/ppl_corpus_general.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_f16-router_gate_emb_q6_k/ppl_corpus_math.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-embd_f16/llamabench.txt ADDED
@@ -0,0 +1,11 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ | model | size | params | backend | ngl | test | t/s |
7
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
8
+ | llama 34B MXFP4 MoE | 14.04 GiB | 14.43 B | CUDA | 35 | pp8 | 76.15 ± 1.19 |
9
+ | llama 34B MXFP4 MoE | 14.04 GiB | 14.43 B | CUDA | 35 | tg128 | 11.32 ± 0.06 |
10
+
11
+ build: 92bb442ad (7040)
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-embd_f16/perplexity_code.txt ADDED
@@ -0,0 +1,157 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21179 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 36 key-value pairs and 435 tensors from /mnt/world8/AI/Models/Apriel-1.5-15b-Thinker-unsloth/GGUF/MXFP4/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-embd_f16.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = llama
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Apriel 1.5 15b Thinker Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Thinker-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Apriel-1.5
16
+ llama_model_loader: - kv 5: general.size_label str = 15B
17
+ llama_model_loader: - kv 6: general.license str = mit
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Apriel 1.5 15b Thinker
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ServiceNow AI
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ServiceNow-AI/...
22
+ llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: llama.block_count u32 = 48
24
+ llama_model_loader: - kv 13: llama.context_length u32 = 262400
25
+ llama_model_loader: - kv 14: llama.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336
27
+ llama_model_loader: - kv 16: llama.attention.head_count u32 = 32
28
+ llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: llama.rope.freq_base f32 = 1000000000.000000
30
+ llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
31
+ llama_model_loader: - kv 20: llama.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: llama.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: llama.vocab_size u32 = 131072
34
+ llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 128
35
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
36
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = pixtral
37
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
39
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
40
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
42
+ llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 0
43
+ llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 11
44
+ llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
45
+ llama_model_loader: - kv 34: general.quantization_version u32 = 2
46
+ llama_model_loader: - kv 35: general.file_type u32 = 38
47
+ llama_model_loader: - type f32: 97 tensors
48
+ llama_model_loader: - type f16: 1 tensors
49
+ llama_model_loader: - type q8_0: 288 tensors
50
+ llama_model_loader: - type mxfp4: 49 tensors
51
+ print_info: file format = GGUF V3 (latest)
52
+ print_info: file type = MXFP4 MoE
53
+ print_info: file size = 14.04 GiB (8.36 BPW)
54
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
55
+ load: printing all EOG tokens:
56
+ load: - 2 ('</s>')
57
+ load: special tokens cache size = 1000
58
+ load: token to piece cache size = 0.8498 MB
59
+ print_info: arch = llama
60
+ print_info: vocab_only = 0
61
+ print_info: n_ctx_train = 262400
62
+ print_info: n_embd = 5120
63
+ print_info: n_embd_inp = 5120
64
+ print_info: n_layer = 48
65
+ print_info: n_head = 32
66
+ print_info: n_head_kv = 8
67
+ print_info: n_rot = 128
68
+ print_info: n_swa = 0
69
+ print_info: is_swa_any = 0
70
+ print_info: n_embd_head_k = 128
71
+ print_info: n_embd_head_v = 128
72
+ print_info: n_gqa = 4
73
+ print_info: n_embd_k_gqa = 1024
74
+ print_info: n_embd_v_gqa = 1024
75
+ print_info: f_norm_eps = 0.0e+00
76
+ print_info: f_norm_rms_eps = 1.0e-05
77
+ print_info: f_clamp_kqv = 0.0e+00
78
+ print_info: f_max_alibi_bias = 0.0e+00
79
+ print_info: f_logit_scale = 0.0e+00
80
+ print_info: f_attn_scale = 0.0e+00
81
+ print_info: n_ff = 14336
82
+ print_info: n_expert = 0
83
+ print_info: n_expert_used = 0
84
+ print_info: n_expert_groups = 0
85
+ print_info: n_group_used = 0
86
+ print_info: causal attn = 1
87
+ print_info: pooling type = 0
88
+ print_info: rope type = 0
89
+ print_info: rope scaling = linear
90
+ print_info: freq_base_train = 1000000000.0
91
+ print_info: freq_scale_train = 1
92
+ print_info: n_ctx_orig_yarn = 262400
93
+ print_info: rope_finetuned = unknown
94
+ print_info: model type = 34B
95
+ print_info: model params = 14.43 B
96
+ print_info: general.name = Apriel 1.5 15b Thinker Unsloth
97
+ print_info: vocab type = BPE
98
+ print_info: n_vocab = 131072
99
+ print_info: n_merges = 269443
100
+ print_info: BOS token = 1 '<s>'
101
+ print_info: EOS token = 2 '</s>'
102
+ print_info: UNK token = 0 '<unk>'
103
+ print_info: PAD token = 11 '<pad>'
104
+ print_info: LF token = 1010 'Ċ'
105
+ print_info: EOG token = 2 '</s>'
106
+ print_info: max token length = 150
107
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
108
+ load_tensors: offloading 20 repeating layers to GPU
109
+ load_tensors: offloaded 20/49 layers to GPU
110
+ load_tensors: CPU_Mapped model buffer size = 9058.61 MiB
111
+ load_tensors: CUDA0 model buffer size = 2656.64 MiB
112
+ load_tensors: CUDA1 model buffer size = 2656.64 MiB
113
+ ...........................................................................................
114
+ llama_context: constructing llama_context
115
+ llama_context: n_seq_max = 1
116
+ llama_context: n_ctx = 2048
117
+ llama_context: n_ctx_seq = 2048
118
+ llama_context: n_batch = 2048
119
+ llama_context: n_ubatch = 512
120
+ llama_context: causal_attn = 1
121
+ llama_context: flash_attn = auto
122
+ llama_context: kv_unified = false
123
+ llama_context: freq_base = 1000000000.0
124
+ llama_context: freq_scale = 1
125
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262400) -- the full capacity of the model will not be utilized
126
+ llama_context: CPU output buffer size = 0.50 MiB
127
+ llama_kv_cache: CPU KV buffer size = 224.00 MiB
128
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
129
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
130
+ llama_kv_cache: size = 384.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
131
+ llama_context: Flash Attention was auto, set to enabled
132
+ llama_context: CUDA0 compute buffer size = 606.00 MiB
133
+ llama_context: CUDA1 compute buffer size = 116.01 MiB
134
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
135
+ llama_context: graph nodes = 1495
136
+ llama_context: graph splits = 313 (with bs=512), 4 (with bs=1)
137
+ common_init_from_params: added </s> logit bias = -inf
138
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
139
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
140
+
141
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
142
+ perplexity: tokenizing the input ..
143
+ perplexity: tokenization took 53.763 ms
144
+ perplexity: calculating perplexity over 46 chunks, n_ctx=2048, batch_size=2048, n_seq=1
145
+ perplexity: 3.60 seconds per pass - ETA 2.75 minutes
146
+ [1]3.4116,[2]2.9350,[3]2.0638,[4]1.8657,[5]1.6915,[6]1.8667,[7]2.0159,[8]2.0530,[9]1.9833,[10]1.9133,[11]1.8139,[12]1.8175,[13]1.8029,[14]1.7514,[15]1.7030,[16]1.7473,[17]1.7302,[18]1.6995,[19]1.6939,[20]1.6830,[21]1.6998,[22]1.7196,[23]1.6927,[24]1.6748,[25]1.6801,[26]1.6836,[27]1.6919,[28]1.6686,[29]1.6590,[30]1.6555,[31]1.6752,[32]1.6793,[33]1.6716,[34]1.6588,[35]1.6459,[36]1.6668,[37]1.6835,[38]1.7153,[39]1.7452,[40]1.7575,[41]1.7474,[42]1.7518,[43]1.7428,[44]1.7404,[45]1.7547,[46]1.7594,
147
+ Final estimate: PPL = 1.7594 +/- 0.01472
148
+
149
+ llama_perf_context_print: load time = 2082.04 ms
150
+ llama_perf_context_print: prompt eval time = 156783.58 ms / 94208 tokens ( 1.66 ms per token, 600.88 tokens per second)
151
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
152
+ llama_perf_context_print: total time = 158039.74 ms / 94209 tokens
153
+ llama_perf_context_print: graphs reused = 0
154
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
155
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 17741 + (3342 = 2656 + 80 + 606) + 3022 |
156
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20630 + (2852 = 2656 + 80 + 116) + 641 |
157
+ llama_memory_breakdown_print: | - Host | 9296 = 9058 + 224 + 14 |
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-embd_f16/perplexity_general.txt ADDED
@@ -0,0 +1,157 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21179 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 36 key-value pairs and 435 tensors from /mnt/world8/AI/Models/Apriel-1.5-15b-Thinker-unsloth/GGUF/MXFP4/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-embd_f16.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = llama
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Apriel 1.5 15b Thinker Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Thinker-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Apriel-1.5
16
+ llama_model_loader: - kv 5: general.size_label str = 15B
17
+ llama_model_loader: - kv 6: general.license str = mit
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Apriel 1.5 15b Thinker
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ServiceNow AI
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ServiceNow-AI/...
22
+ llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: llama.block_count u32 = 48
24
+ llama_model_loader: - kv 13: llama.context_length u32 = 262400
25
+ llama_model_loader: - kv 14: llama.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336
27
+ llama_model_loader: - kv 16: llama.attention.head_count u32 = 32
28
+ llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: llama.rope.freq_base f32 = 1000000000.000000
30
+ llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
31
+ llama_model_loader: - kv 20: llama.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: llama.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: llama.vocab_size u32 = 131072
34
+ llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 128
35
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
36
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = pixtral
37
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
39
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
40
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
42
+ llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 0
43
+ llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 11
44
+ llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
45
+ llama_model_loader: - kv 34: general.quantization_version u32 = 2
46
+ llama_model_loader: - kv 35: general.file_type u32 = 38
47
+ llama_model_loader: - type f32: 97 tensors
48
+ llama_model_loader: - type f16: 1 tensors
49
+ llama_model_loader: - type q8_0: 288 tensors
50
+ llama_model_loader: - type mxfp4: 49 tensors
51
+ print_info: file format = GGUF V3 (latest)
52
+ print_info: file type = MXFP4 MoE
53
+ print_info: file size = 14.04 GiB (8.36 BPW)
54
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
55
+ load: printing all EOG tokens:
56
+ load: - 2 ('</s>')
57
+ load: special tokens cache size = 1000
58
+ load: token to piece cache size = 0.8498 MB
59
+ print_info: arch = llama
60
+ print_info: vocab_only = 0
61
+ print_info: n_ctx_train = 262400
62
+ print_info: n_embd = 5120
63
+ print_info: n_embd_inp = 5120
64
+ print_info: n_layer = 48
65
+ print_info: n_head = 32
66
+ print_info: n_head_kv = 8
67
+ print_info: n_rot = 128
68
+ print_info: n_swa = 0
69
+ print_info: is_swa_any = 0
70
+ print_info: n_embd_head_k = 128
71
+ print_info: n_embd_head_v = 128
72
+ print_info: n_gqa = 4
73
+ print_info: n_embd_k_gqa = 1024
74
+ print_info: n_embd_v_gqa = 1024
75
+ print_info: f_norm_eps = 0.0e+00
76
+ print_info: f_norm_rms_eps = 1.0e-05
77
+ print_info: f_clamp_kqv = 0.0e+00
78
+ print_info: f_max_alibi_bias = 0.0e+00
79
+ print_info: f_logit_scale = 0.0e+00
80
+ print_info: f_attn_scale = 0.0e+00
81
+ print_info: n_ff = 14336
82
+ print_info: n_expert = 0
83
+ print_info: n_expert_used = 0
84
+ print_info: n_expert_groups = 0
85
+ print_info: n_group_used = 0
86
+ print_info: causal attn = 1
87
+ print_info: pooling type = 0
88
+ print_info: rope type = 0
89
+ print_info: rope scaling = linear
90
+ print_info: freq_base_train = 1000000000.0
91
+ print_info: freq_scale_train = 1
92
+ print_info: n_ctx_orig_yarn = 262400
93
+ print_info: rope_finetuned = unknown
94
+ print_info: model type = 34B
95
+ print_info: model params = 14.43 B
96
+ print_info: general.name = Apriel 1.5 15b Thinker Unsloth
97
+ print_info: vocab type = BPE
98
+ print_info: n_vocab = 131072
99
+ print_info: n_merges = 269443
100
+ print_info: BOS token = 1 '<s>'
101
+ print_info: EOS token = 2 '</s>'
102
+ print_info: UNK token = 0 '<unk>'
103
+ print_info: PAD token = 11 '<pad>'
104
+ print_info: LF token = 1010 'Ċ'
105
+ print_info: EOG token = 2 '</s>'
106
+ print_info: max token length = 150
107
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
108
+ load_tensors: offloading 20 repeating layers to GPU
109
+ load_tensors: offloaded 20/49 layers to GPU
110
+ load_tensors: CPU_Mapped model buffer size = 9058.61 MiB
111
+ load_tensors: CUDA0 model buffer size = 2656.64 MiB
112
+ load_tensors: CUDA1 model buffer size = 2656.64 MiB
113
+ ...........................................................................................
114
+ llama_context: constructing llama_context
115
+ llama_context: n_seq_max = 1
116
+ llama_context: n_ctx = 2048
117
+ llama_context: n_ctx_seq = 2048
118
+ llama_context: n_batch = 2048
119
+ llama_context: n_ubatch = 512
120
+ llama_context: causal_attn = 1
121
+ llama_context: flash_attn = auto
122
+ llama_context: kv_unified = false
123
+ llama_context: freq_base = 1000000000.0
124
+ llama_context: freq_scale = 1
125
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262400) -- the full capacity of the model will not be utilized
126
+ llama_context: CPU output buffer size = 0.50 MiB
127
+ llama_kv_cache: CPU KV buffer size = 224.00 MiB
128
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
129
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
130
+ llama_kv_cache: size = 384.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
131
+ llama_context: Flash Attention was auto, set to enabled
132
+ llama_context: CUDA0 compute buffer size = 606.00 MiB
133
+ llama_context: CUDA1 compute buffer size = 116.01 MiB
134
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
135
+ llama_context: graph nodes = 1495
136
+ llama_context: graph splits = 313 (with bs=512), 4 (with bs=1)
137
+ common_init_from_params: added </s> logit bias = -inf
138
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
139
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
140
+
141
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
142
+ perplexity: tokenizing the input ..
143
+ perplexity: tokenization took 16.297 ms
144
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
145
+ perplexity: 3.58 seconds per pass - ETA 0.88 minutes
146
+ [1]9.4016,[2]12.9714,[3]14.3473,[4]13.6273,[5]13.0148,[6]10.7683,[7]9.5511,[8]9.4940,[9]10.1423,[10]10.3362,[11]10.3422,[12]10.7583,[13]10.8963,[14]11.0036,[15]11.2350,
147
+ Final estimate: PPL = 11.2350 +/- 0.29593
148
+
149
+ llama_perf_context_print: load time = 2109.58 ms
150
+ llama_perf_context_print: prompt eval time = 50801.51 ms / 30720 tokens ( 1.65 ms per token, 604.71 tokens per second)
151
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
152
+ llama_perf_context_print: total time = 51163.11 ms / 30721 tokens
153
+ llama_perf_context_print: graphs reused = 0
154
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
155
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 17739 + (3342 = 2656 + 80 + 606) + 3024 |
156
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20630 + (2852 = 2656 + 80 + 116) + 641 |
157
+ llama_memory_breakdown_print: | - Host | 9296 = 9058 + 224 + 14 |
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-embd_f16/perplexity_math.txt ADDED
@@ -0,0 +1,157 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21181 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 36 key-value pairs and 435 tensors from /mnt/world8/AI/Models/Apriel-1.5-15b-Thinker-unsloth/GGUF/MXFP4/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-embd_f16.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = llama
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Apriel 1.5 15b Thinker Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Thinker-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Apriel-1.5
16
+ llama_model_loader: - kv 5: general.size_label str = 15B
17
+ llama_model_loader: - kv 6: general.license str = mit
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Apriel 1.5 15b Thinker
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ServiceNow AI
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ServiceNow-AI/...
22
+ llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: llama.block_count u32 = 48
24
+ llama_model_loader: - kv 13: llama.context_length u32 = 262400
25
+ llama_model_loader: - kv 14: llama.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336
27
+ llama_model_loader: - kv 16: llama.attention.head_count u32 = 32
28
+ llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: llama.rope.freq_base f32 = 1000000000.000000
30
+ llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
31
+ llama_model_loader: - kv 20: llama.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: llama.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: llama.vocab_size u32 = 131072
34
+ llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 128
35
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
36
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = pixtral
37
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
39
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
40
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
42
+ llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 0
43
+ llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 11
44
+ llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
45
+ llama_model_loader: - kv 34: general.quantization_version u32 = 2
46
+ llama_model_loader: - kv 35: general.file_type u32 = 38
47
+ llama_model_loader: - type f32: 97 tensors
48
+ llama_model_loader: - type f16: 1 tensors
49
+ llama_model_loader: - type q8_0: 288 tensors
50
+ llama_model_loader: - type mxfp4: 49 tensors
51
+ print_info: file format = GGUF V3 (latest)
52
+ print_info: file type = MXFP4 MoE
53
+ print_info: file size = 14.04 GiB (8.36 BPW)
54
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
55
+ load: printing all EOG tokens:
56
+ load: - 2 ('</s>')
57
+ load: special tokens cache size = 1000
58
+ load: token to piece cache size = 0.8498 MB
59
+ print_info: arch = llama
60
+ print_info: vocab_only = 0
61
+ print_info: n_ctx_train = 262400
62
+ print_info: n_embd = 5120
63
+ print_info: n_embd_inp = 5120
64
+ print_info: n_layer = 48
65
+ print_info: n_head = 32
66
+ print_info: n_head_kv = 8
67
+ print_info: n_rot = 128
68
+ print_info: n_swa = 0
69
+ print_info: is_swa_any = 0
70
+ print_info: n_embd_head_k = 128
71
+ print_info: n_embd_head_v = 128
72
+ print_info: n_gqa = 4
73
+ print_info: n_embd_k_gqa = 1024
74
+ print_info: n_embd_v_gqa = 1024
75
+ print_info: f_norm_eps = 0.0e+00
76
+ print_info: f_norm_rms_eps = 1.0e-05
77
+ print_info: f_clamp_kqv = 0.0e+00
78
+ print_info: f_max_alibi_bias = 0.0e+00
79
+ print_info: f_logit_scale = 0.0e+00
80
+ print_info: f_attn_scale = 0.0e+00
81
+ print_info: n_ff = 14336
82
+ print_info: n_expert = 0
83
+ print_info: n_expert_used = 0
84
+ print_info: n_expert_groups = 0
85
+ print_info: n_group_used = 0
86
+ print_info: causal attn = 1
87
+ print_info: pooling type = 0
88
+ print_info: rope type = 0
89
+ print_info: rope scaling = linear
90
+ print_info: freq_base_train = 1000000000.0
91
+ print_info: freq_scale_train = 1
92
+ print_info: n_ctx_orig_yarn = 262400
93
+ print_info: rope_finetuned = unknown
94
+ print_info: model type = 34B
95
+ print_info: model params = 14.43 B
96
+ print_info: general.name = Apriel 1.5 15b Thinker Unsloth
97
+ print_info: vocab type = BPE
98
+ print_info: n_vocab = 131072
99
+ print_info: n_merges = 269443
100
+ print_info: BOS token = 1 '<s>'
101
+ print_info: EOS token = 2 '</s>'
102
+ print_info: UNK token = 0 '<unk>'
103
+ print_info: PAD token = 11 '<pad>'
104
+ print_info: LF token = 1010 'Ċ'
105
+ print_info: EOG token = 2 '</s>'
106
+ print_info: max token length = 150
107
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
108
+ load_tensors: offloading 20 repeating layers to GPU
109
+ load_tensors: offloaded 20/49 layers to GPU
110
+ load_tensors: CPU_Mapped model buffer size = 9058.61 MiB
111
+ load_tensors: CUDA0 model buffer size = 2656.64 MiB
112
+ load_tensors: CUDA1 model buffer size = 2656.64 MiB
113
+ ...........................................................................................
114
+ llama_context: constructing llama_context
115
+ llama_context: n_seq_max = 1
116
+ llama_context: n_ctx = 2048
117
+ llama_context: n_ctx_seq = 2048
118
+ llama_context: n_batch = 2048
119
+ llama_context: n_ubatch = 512
120
+ llama_context: causal_attn = 1
121
+ llama_context: flash_attn = auto
122
+ llama_context: kv_unified = false
123
+ llama_context: freq_base = 1000000000.0
124
+ llama_context: freq_scale = 1
125
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262400) -- the full capacity of the model will not be utilized
126
+ llama_context: CPU output buffer size = 0.50 MiB
127
+ llama_kv_cache: CPU KV buffer size = 224.00 MiB
128
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
129
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
130
+ llama_kv_cache: size = 384.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
131
+ llama_context: Flash Attention was auto, set to enabled
132
+ llama_context: CUDA0 compute buffer size = 606.00 MiB
133
+ llama_context: CUDA1 compute buffer size = 116.01 MiB
134
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
135
+ llama_context: graph nodes = 1495
136
+ llama_context: graph splits = 313 (with bs=512), 4 (with bs=1)
137
+ common_init_from_params: added </s> logit bias = -inf
138
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
139
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
140
+
141
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
142
+ perplexity: tokenizing the input ..
143
+ perplexity: tokenization took 15.526 ms
144
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
145
+ perplexity: 3.58 seconds per pass - ETA 0.95 minutes
146
+ [1]7.6451,[2]8.4207,[3]8.8075,[4]9.4736,[5]9.5530,[6]9.6070,[7]9.6038,[8]9.4783,[9]9.4648,[10]9.4059,[11]9.3776,[12]9.4277,[13]9.5209,[14]9.6283,[15]9.5887,[16]9.4772,
147
+ Final estimate: PPL = 9.4772 +/- 0.23786
148
+
149
+ llama_perf_context_print: load time = 2095.76 ms
150
+ llama_perf_context_print: prompt eval time = 54047.04 ms / 32768 tokens ( 1.65 ms per token, 606.29 tokens per second)
151
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
152
+ llama_perf_context_print: total time = 54428.45 ms / 32769 tokens
153
+ llama_perf_context_print: graphs reused = 0
154
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
155
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 17740 + (3342 = 2656 + 80 + 606) + 3023 |
156
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20630 + (2852 = 2656 + 80 + 116) + 641 |
157
+ llama_memory_breakdown_print: | - Host | 9296 = 9058 + 224 + 14 |
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-embd_f16/ppl_corpus_code.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-embd_f16/ppl_corpus_general.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-embd_f16/ppl_corpus_math.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_f16/llamabench.txt ADDED
@@ -0,0 +1,11 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ | model | size | params | backend | ngl | test | t/s |
7
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
8
+ | llama 34B MXFP4 MoE | 17.11 GiB | 14.43 B | CUDA | 35 | pp8 | 65.80 ± 1.22 |
9
+ | llama 34B MXFP4 MoE | 17.11 GiB | 14.43 B | CUDA | 35 | tg128 | 9.29 ± 0.03 |
10
+
11
+ build: 92bb442ad (7040)
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_f16/perplexity_code.txt ADDED
@@ -0,0 +1,157 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21187 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 36 key-value pairs and 435 tensors from /mnt/world8/AI/Models/Apriel-1.5-15b-Thinker-unsloth/GGUF/MXFP4/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_f16.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = llama
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Apriel 1.5 15b Thinker Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Thinker-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Apriel-1.5
16
+ llama_model_loader: - kv 5: general.size_label str = 15B
17
+ llama_model_loader: - kv 6: general.license str = mit
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Apriel 1.5 15b Thinker
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ServiceNow AI
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ServiceNow-AI/...
22
+ llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: llama.block_count u32 = 48
24
+ llama_model_loader: - kv 13: llama.context_length u32 = 262400
25
+ llama_model_loader: - kv 14: llama.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336
27
+ llama_model_loader: - kv 16: llama.attention.head_count u32 = 32
28
+ llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: llama.rope.freq_base f32 = 1000000000.000000
30
+ llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
31
+ llama_model_loader: - kv 20: llama.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: llama.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: llama.vocab_size u32 = 131072
34
+ llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 128
35
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
36
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = pixtral
37
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
39
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
40
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
42
+ llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 0
43
+ llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 11
44
+ llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
45
+ llama_model_loader: - kv 34: general.quantization_version u32 = 2
46
+ llama_model_loader: - kv 35: general.file_type u32 = 38
47
+ llama_model_loader: - type f32: 97 tensors
48
+ llama_model_loader: - type f16: 49 tensors
49
+ llama_model_loader: - type q8_0: 240 tensors
50
+ llama_model_loader: - type mxfp4: 49 tensors
51
+ print_info: file format = GGUF V3 (latest)
52
+ print_info: file type = MXFP4 MoE
53
+ print_info: file size = 17.11 GiB (10.19 BPW)
54
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
55
+ load: printing all EOG tokens:
56
+ load: - 2 ('</s>')
57
+ load: special tokens cache size = 1000
58
+ load: token to piece cache size = 0.8498 MB
59
+ print_info: arch = llama
60
+ print_info: vocab_only = 0
61
+ print_info: n_ctx_train = 262400
62
+ print_info: n_embd = 5120
63
+ print_info: n_embd_inp = 5120
64
+ print_info: n_layer = 48
65
+ print_info: n_head = 32
66
+ print_info: n_head_kv = 8
67
+ print_info: n_rot = 128
68
+ print_info: n_swa = 0
69
+ print_info: is_swa_any = 0
70
+ print_info: n_embd_head_k = 128
71
+ print_info: n_embd_head_v = 128
72
+ print_info: n_gqa = 4
73
+ print_info: n_embd_k_gqa = 1024
74
+ print_info: n_embd_v_gqa = 1024
75
+ print_info: f_norm_eps = 0.0e+00
76
+ print_info: f_norm_rms_eps = 1.0e-05
77
+ print_info: f_clamp_kqv = 0.0e+00
78
+ print_info: f_max_alibi_bias = 0.0e+00
79
+ print_info: f_logit_scale = 0.0e+00
80
+ print_info: f_attn_scale = 0.0e+00
81
+ print_info: n_ff = 14336
82
+ print_info: n_expert = 0
83
+ print_info: n_expert_used = 0
84
+ print_info: n_expert_groups = 0
85
+ print_info: n_group_used = 0
86
+ print_info: causal attn = 1
87
+ print_info: pooling type = 0
88
+ print_info: rope type = 0
89
+ print_info: rope scaling = linear
90
+ print_info: freq_base_train = 1000000000.0
91
+ print_info: freq_scale_train = 1
92
+ print_info: n_ctx_orig_yarn = 262400
93
+ print_info: rope_finetuned = unknown
94
+ print_info: model type = 34B
95
+ print_info: model params = 14.43 B
96
+ print_info: general.name = Apriel 1.5 15b Thinker Unsloth
97
+ print_info: vocab type = BPE
98
+ print_info: n_vocab = 131072
99
+ print_info: n_merges = 269443
100
+ print_info: BOS token = 1 '<s>'
101
+ print_info: EOS token = 2 '</s>'
102
+ print_info: UNK token = 0 '<unk>'
103
+ print_info: PAD token = 11 '<pad>'
104
+ print_info: LF token = 1010 'Ċ'
105
+ print_info: EOG token = 2 '</s>'
106
+ print_info: max token length = 150
107
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
108
+ load_tensors: offloading 20 repeating layers to GPU
109
+ load_tensors: offloaded 20/49 layers to GPU
110
+ load_tensors: CPU_Mapped model buffer size = 10896.11 MiB
111
+ load_tensors: CUDA0 model buffer size = 3312.89 MiB
112
+ load_tensors: CUDA1 model buffer size = 3312.89 MiB
113
+ .............................................................................................
114
+ llama_context: constructing llama_context
115
+ llama_context: n_seq_max = 1
116
+ llama_context: n_ctx = 2048
117
+ llama_context: n_ctx_seq = 2048
118
+ llama_context: n_batch = 2048
119
+ llama_context: n_ubatch = 512
120
+ llama_context: causal_attn = 1
121
+ llama_context: flash_attn = auto
122
+ llama_context: kv_unified = false
123
+ llama_context: freq_base = 1000000000.0
124
+ llama_context: freq_scale = 1
125
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262400) -- the full capacity of the model will not be utilized
126
+ llama_context: CPU output buffer size = 0.50 MiB
127
+ llama_kv_cache: CPU KV buffer size = 224.00 MiB
128
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
129
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
130
+ llama_kv_cache: size = 384.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
131
+ llama_context: Flash Attention was auto, set to enabled
132
+ llama_context: CUDA0 compute buffer size = 606.00 MiB
133
+ llama_context: CUDA1 compute buffer size = 116.01 MiB
134
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
135
+ llama_context: graph nodes = 1495
136
+ llama_context: graph splits = 313 (with bs=512), 4 (with bs=1)
137
+ common_init_from_params: added </s> logit bias = -inf
138
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
139
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
140
+
141
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
142
+ perplexity: tokenizing the input ..
143
+ perplexity: tokenization took 54.38 ms
144
+ perplexity: calculating perplexity over 46 chunks, n_ctx=2048, batch_size=2048, n_seq=1
145
+ perplexity: 4.17 seconds per pass - ETA 3.18 minutes
146
+ [1]3.4039,[2]2.9329,[3]2.0633,[4]1.8650,[5]1.6910,[6]1.8663,[7]2.0163,[8]2.0537,[9]1.9837,[10]1.9144,[11]1.8156,[12]1.8191,[13]1.8045,[14]1.7529,[15]1.7046,[16]1.7489,[17]1.7313,[18]1.7007,[19]1.6949,[20]1.6838,[21]1.7006,[22]1.7203,[23]1.6934,[24]1.6760,[25]1.6814,[26]1.6849,[27]1.6933,[28]1.6700,[29]1.6604,[30]1.6569,[31]1.6766,[32]1.6804,[33]1.6726,[34]1.6598,[35]1.6469,[36]1.6677,[37]1.6845,[38]1.7163,[39]1.7460,[40]1.7583,[41]1.7483,[42]1.7527,[43]1.7437,[44]1.7414,[45]1.7558,[46]1.7605,
147
+ Final estimate: PPL = 1.7605 +/- 0.01476
148
+
149
+ llama_perf_context_print: load time = 2279.72 ms
150
+ llama_perf_context_print: prompt eval time = 182375.73 ms / 94208 tokens ( 1.94 ms per token, 516.56 tokens per second)
151
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
152
+ llama_perf_context_print: total time = 183530.11 ms / 94209 tokens
153
+ llama_perf_context_print: graphs reused = 0
154
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
155
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 17075 + ( 3998 = 3312 + 80 + 606) + 3033 |
156
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19962 + ( 3508 = 3312 + 80 + 116) + 653 |
157
+ llama_memory_breakdown_print: | - Host | 11134 = 10896 + 224 + 14 |
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_f16/perplexity_general.txt ADDED
@@ -0,0 +1,157 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21187 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 36 key-value pairs and 435 tensors from /mnt/world8/AI/Models/Apriel-1.5-15b-Thinker-unsloth/GGUF/MXFP4/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_f16.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = llama
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Apriel 1.5 15b Thinker Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Thinker-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Apriel-1.5
16
+ llama_model_loader: - kv 5: general.size_label str = 15B
17
+ llama_model_loader: - kv 6: general.license str = mit
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Apriel 1.5 15b Thinker
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ServiceNow AI
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ServiceNow-AI/...
22
+ llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: llama.block_count u32 = 48
24
+ llama_model_loader: - kv 13: llama.context_length u32 = 262400
25
+ llama_model_loader: - kv 14: llama.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336
27
+ llama_model_loader: - kv 16: llama.attention.head_count u32 = 32
28
+ llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: llama.rope.freq_base f32 = 1000000000.000000
30
+ llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
31
+ llama_model_loader: - kv 20: llama.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: llama.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: llama.vocab_size u32 = 131072
34
+ llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 128
35
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
36
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = pixtral
37
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
39
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
40
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
42
+ llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 0
43
+ llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 11
44
+ llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
45
+ llama_model_loader: - kv 34: general.quantization_version u32 = 2
46
+ llama_model_loader: - kv 35: general.file_type u32 = 38
47
+ llama_model_loader: - type f32: 97 tensors
48
+ llama_model_loader: - type f16: 49 tensors
49
+ llama_model_loader: - type q8_0: 240 tensors
50
+ llama_model_loader: - type mxfp4: 49 tensors
51
+ print_info: file format = GGUF V3 (latest)
52
+ print_info: file type = MXFP4 MoE
53
+ print_info: file size = 17.11 GiB (10.19 BPW)
54
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
55
+ load: printing all EOG tokens:
56
+ load: - 2 ('</s>')
57
+ load: special tokens cache size = 1000
58
+ load: token to piece cache size = 0.8498 MB
59
+ print_info: arch = llama
60
+ print_info: vocab_only = 0
61
+ print_info: n_ctx_train = 262400
62
+ print_info: n_embd = 5120
63
+ print_info: n_embd_inp = 5120
64
+ print_info: n_layer = 48
65
+ print_info: n_head = 32
66
+ print_info: n_head_kv = 8
67
+ print_info: n_rot = 128
68
+ print_info: n_swa = 0
69
+ print_info: is_swa_any = 0
70
+ print_info: n_embd_head_k = 128
71
+ print_info: n_embd_head_v = 128
72
+ print_info: n_gqa = 4
73
+ print_info: n_embd_k_gqa = 1024
74
+ print_info: n_embd_v_gqa = 1024
75
+ print_info: f_norm_eps = 0.0e+00
76
+ print_info: f_norm_rms_eps = 1.0e-05
77
+ print_info: f_clamp_kqv = 0.0e+00
78
+ print_info: f_max_alibi_bias = 0.0e+00
79
+ print_info: f_logit_scale = 0.0e+00
80
+ print_info: f_attn_scale = 0.0e+00
81
+ print_info: n_ff = 14336
82
+ print_info: n_expert = 0
83
+ print_info: n_expert_used = 0
84
+ print_info: n_expert_groups = 0
85
+ print_info: n_group_used = 0
86
+ print_info: causal attn = 1
87
+ print_info: pooling type = 0
88
+ print_info: rope type = 0
89
+ print_info: rope scaling = linear
90
+ print_info: freq_base_train = 1000000000.0
91
+ print_info: freq_scale_train = 1
92
+ print_info: n_ctx_orig_yarn = 262400
93
+ print_info: rope_finetuned = unknown
94
+ print_info: model type = 34B
95
+ print_info: model params = 14.43 B
96
+ print_info: general.name = Apriel 1.5 15b Thinker Unsloth
97
+ print_info: vocab type = BPE
98
+ print_info: n_vocab = 131072
99
+ print_info: n_merges = 269443
100
+ print_info: BOS token = 1 '<s>'
101
+ print_info: EOS token = 2 '</s>'
102
+ print_info: UNK token = 0 '<unk>'
103
+ print_info: PAD token = 11 '<pad>'
104
+ print_info: LF token = 1010 'Ċ'
105
+ print_info: EOG token = 2 '</s>'
106
+ print_info: max token length = 150
107
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
108
+ load_tensors: offloading 20 repeating layers to GPU
109
+ load_tensors: offloaded 20/49 layers to GPU
110
+ load_tensors: CPU_Mapped model buffer size = 10896.11 MiB
111
+ load_tensors: CUDA0 model buffer size = 3312.89 MiB
112
+ load_tensors: CUDA1 model buffer size = 3312.89 MiB
113
+ .............................................................................................
114
+ llama_context: constructing llama_context
115
+ llama_context: n_seq_max = 1
116
+ llama_context: n_ctx = 2048
117
+ llama_context: n_ctx_seq = 2048
118
+ llama_context: n_batch = 2048
119
+ llama_context: n_ubatch = 512
120
+ llama_context: causal_attn = 1
121
+ llama_context: flash_attn = auto
122
+ llama_context: kv_unified = false
123
+ llama_context: freq_base = 1000000000.0
124
+ llama_context: freq_scale = 1
125
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262400) -- the full capacity of the model will not be utilized
126
+ llama_context: CPU output buffer size = 0.50 MiB
127
+ llama_kv_cache: CPU KV buffer size = 224.00 MiB
128
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
129
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
130
+ llama_kv_cache: size = 384.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
131
+ llama_context: Flash Attention was auto, set to enabled
132
+ llama_context: CUDA0 compute buffer size = 606.00 MiB
133
+ llama_context: CUDA1 compute buffer size = 116.01 MiB
134
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
135
+ llama_context: graph nodes = 1495
136
+ llama_context: graph splits = 313 (with bs=512), 4 (with bs=1)
137
+ common_init_from_params: added </s> logit bias = -inf
138
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
139
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
140
+
141
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
142
+ perplexity: tokenizing the input ..
143
+ perplexity: tokenization took 14.766 ms
144
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
145
+ perplexity: 4.17 seconds per pass - ETA 1.03 minutes
146
+ [1]9.4248,[2]12.9852,[3]14.3508,[4]13.6174,[5]13.0019,[6]10.7668,[7]9.5483,[8]9.4925,[9]10.1421,[10]10.3381,[11]10.3435,[12]10.7524,[13]10.8944,[14]11.0048,[15]11.2353,
147
+ Final estimate: PPL = 11.2353 +/- 0.29596
148
+
149
+ llama_perf_context_print: load time = 2314.14 ms
150
+ llama_perf_context_print: prompt eval time = 59697.66 ms / 30720 tokens ( 1.94 ms per token, 514.59 tokens per second)
151
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
152
+ llama_perf_context_print: total time = 60068.86 ms / 30721 tokens
153
+ llama_perf_context_print: graphs reused = 0
154
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
155
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 17079 + ( 3998 = 3312 + 80 + 606) + 3028 |
156
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19962 + ( 3508 = 3312 + 80 + 116) + 653 |
157
+ llama_memory_breakdown_print: | - Host | 11134 = 10896 + 224 + 14 |
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_f16/perplexity_math.txt ADDED
@@ -0,0 +1,157 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21183 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 36 key-value pairs and 435 tensors from /mnt/world8/AI/Models/Apriel-1.5-15b-Thinker-unsloth/GGUF/MXFP4/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_f16.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = llama
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Apriel 1.5 15b Thinker Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Thinker-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Apriel-1.5
16
+ llama_model_loader: - kv 5: general.size_label str = 15B
17
+ llama_model_loader: - kv 6: general.license str = mit
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Apriel 1.5 15b Thinker
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ServiceNow AI
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ServiceNow-AI/...
22
+ llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: llama.block_count u32 = 48
24
+ llama_model_loader: - kv 13: llama.context_length u32 = 262400
25
+ llama_model_loader: - kv 14: llama.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336
27
+ llama_model_loader: - kv 16: llama.attention.head_count u32 = 32
28
+ llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: llama.rope.freq_base f32 = 1000000000.000000
30
+ llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
31
+ llama_model_loader: - kv 20: llama.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: llama.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: llama.vocab_size u32 = 131072
34
+ llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 128
35
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
36
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = pixtral
37
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
39
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
40
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
42
+ llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 0
43
+ llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 11
44
+ llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
45
+ llama_model_loader: - kv 34: general.quantization_version u32 = 2
46
+ llama_model_loader: - kv 35: general.file_type u32 = 38
47
+ llama_model_loader: - type f32: 97 tensors
48
+ llama_model_loader: - type f16: 49 tensors
49
+ llama_model_loader: - type q8_0: 240 tensors
50
+ llama_model_loader: - type mxfp4: 49 tensors
51
+ print_info: file format = GGUF V3 (latest)
52
+ print_info: file type = MXFP4 MoE
53
+ print_info: file size = 17.11 GiB (10.19 BPW)
54
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
55
+ load: printing all EOG tokens:
56
+ load: - 2 ('</s>')
57
+ load: special tokens cache size = 1000
58
+ load: token to piece cache size = 0.8498 MB
59
+ print_info: arch = llama
60
+ print_info: vocab_only = 0
61
+ print_info: n_ctx_train = 262400
62
+ print_info: n_embd = 5120
63
+ print_info: n_embd_inp = 5120
64
+ print_info: n_layer = 48
65
+ print_info: n_head = 32
66
+ print_info: n_head_kv = 8
67
+ print_info: n_rot = 128
68
+ print_info: n_swa = 0
69
+ print_info: is_swa_any = 0
70
+ print_info: n_embd_head_k = 128
71
+ print_info: n_embd_head_v = 128
72
+ print_info: n_gqa = 4
73
+ print_info: n_embd_k_gqa = 1024
74
+ print_info: n_embd_v_gqa = 1024
75
+ print_info: f_norm_eps = 0.0e+00
76
+ print_info: f_norm_rms_eps = 1.0e-05
77
+ print_info: f_clamp_kqv = 0.0e+00
78
+ print_info: f_max_alibi_bias = 0.0e+00
79
+ print_info: f_logit_scale = 0.0e+00
80
+ print_info: f_attn_scale = 0.0e+00
81
+ print_info: n_ff = 14336
82
+ print_info: n_expert = 0
83
+ print_info: n_expert_used = 0
84
+ print_info: n_expert_groups = 0
85
+ print_info: n_group_used = 0
86
+ print_info: causal attn = 1
87
+ print_info: pooling type = 0
88
+ print_info: rope type = 0
89
+ print_info: rope scaling = linear
90
+ print_info: freq_base_train = 1000000000.0
91
+ print_info: freq_scale_train = 1
92
+ print_info: n_ctx_orig_yarn = 262400
93
+ print_info: rope_finetuned = unknown
94
+ print_info: model type = 34B
95
+ print_info: model params = 14.43 B
96
+ print_info: general.name = Apriel 1.5 15b Thinker Unsloth
97
+ print_info: vocab type = BPE
98
+ print_info: n_vocab = 131072
99
+ print_info: n_merges = 269443
100
+ print_info: BOS token = 1 '<s>'
101
+ print_info: EOS token = 2 '</s>'
102
+ print_info: UNK token = 0 '<unk>'
103
+ print_info: PAD token = 11 '<pad>'
104
+ print_info: LF token = 1010 'Ċ'
105
+ print_info: EOG token = 2 '</s>'
106
+ print_info: max token length = 150
107
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
108
+ load_tensors: offloading 20 repeating layers to GPU
109
+ load_tensors: offloaded 20/49 layers to GPU
110
+ load_tensors: CPU_Mapped model buffer size = 10896.11 MiB
111
+ load_tensors: CUDA0 model buffer size = 3312.89 MiB
112
+ load_tensors: CUDA1 model buffer size = 3312.89 MiB
113
+ .............................................................................................
114
+ llama_context: constructing llama_context
115
+ llama_context: n_seq_max = 1
116
+ llama_context: n_ctx = 2048
117
+ llama_context: n_ctx_seq = 2048
118
+ llama_context: n_batch = 2048
119
+ llama_context: n_ubatch = 512
120
+ llama_context: causal_attn = 1
121
+ llama_context: flash_attn = auto
122
+ llama_context: kv_unified = false
123
+ llama_context: freq_base = 1000000000.0
124
+ llama_context: freq_scale = 1
125
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262400) -- the full capacity of the model will not be utilized
126
+ llama_context: CPU output buffer size = 0.50 MiB
127
+ llama_kv_cache: CPU KV buffer size = 224.00 MiB
128
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
129
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
130
+ llama_kv_cache: size = 384.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
131
+ llama_context: Flash Attention was auto, set to enabled
132
+ llama_context: CUDA0 compute buffer size = 606.00 MiB
133
+ llama_context: CUDA1 compute buffer size = 116.01 MiB
134
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
135
+ llama_context: graph nodes = 1495
136
+ llama_context: graph splits = 313 (with bs=512), 4 (with bs=1)
137
+ common_init_from_params: added </s> logit bias = -inf
138
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
139
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
140
+
141
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
142
+ perplexity: tokenizing the input ..
143
+ perplexity: tokenization took 14.942 ms
144
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
145
+ perplexity: 4.15 seconds per pass - ETA 1.10 minutes
146
+ [1]7.6467,[2]8.4238,[3]8.8148,[4]9.4762,[5]9.5518,[6]9.6077,[7]9.6029,[8]9.4813,[9]9.4651,[10]9.4071,[11]9.3760,[12]9.4236,[13]9.5178,[14]9.6245,[15]9.5861,[16]9.4731,
147
+ Final estimate: PPL = 9.4731 +/- 0.23780
148
+
149
+ llama_perf_context_print: load time = 2275.15 ms
150
+ llama_perf_context_print: prompt eval time = 63412.78 ms / 32768 tokens ( 1.94 ms per token, 516.74 tokens per second)
151
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
152
+ llama_perf_context_print: total time = 63810.33 ms / 32769 tokens
153
+ llama_perf_context_print: graphs reused = 0
154
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
155
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 17073 + ( 3998 = 3312 + 80 + 606) + 3035 |
156
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19962 + ( 3508 = 3312 + 80 + 116) + 653 |
157
+ llama_memory_breakdown_print: | - Host | 11134 = 10896 + 224 + 14 |
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_f16/ppl_corpus_code.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_f16/ppl_corpus_general.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_f16/ppl_corpus_math.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q5_K/llamabench.txt ADDED
@@ -0,0 +1,11 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ | model | size | params | backend | ngl | test | t/s |
7
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
8
+ | llama 34B MXFP4 MoE | 12.23 GiB | 14.43 B | CUDA | 35 | pp8 | 76.24 ± 4.32 |
9
+ | llama 34B MXFP4 MoE | 12.23 GiB | 14.43 B | CUDA | 35 | tg128 | 11.94 ± 0.06 |
10
+
11
+ build: 92bb442ad (7040)
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q5_K/perplexity_code.txt ADDED
@@ -0,0 +1,156 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21179 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 36 key-value pairs and 435 tensors from /mnt/world8/AI/Models/Apriel-1.5-15b-Thinker-unsloth/GGUF/MXFP4/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q5_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = llama
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Apriel 1.5 15b Thinker Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Thinker-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Apriel-1.5
16
+ llama_model_loader: - kv 5: general.size_label str = 15B
17
+ llama_model_loader: - kv 6: general.license str = mit
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Apriel 1.5 15b Thinker
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ServiceNow AI
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ServiceNow-AI/...
22
+ llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: llama.block_count u32 = 48
24
+ llama_model_loader: - kv 13: llama.context_length u32 = 262400
25
+ llama_model_loader: - kv 14: llama.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336
27
+ llama_model_loader: - kv 16: llama.attention.head_count u32 = 32
28
+ llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: llama.rope.freq_base f32 = 1000000000.000000
30
+ llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
31
+ llama_model_loader: - kv 20: llama.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: llama.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: llama.vocab_size u32 = 131072
34
+ llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 128
35
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
36
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = pixtral
37
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
39
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
40
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
42
+ llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 0
43
+ llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 11
44
+ llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
45
+ llama_model_loader: - kv 34: general.quantization_version u32 = 2
46
+ llama_model_loader: - kv 35: general.file_type u32 = 38
47
+ llama_model_loader: - type f32: 97 tensors
48
+ llama_model_loader: - type q8_0: 240 tensors
49
+ llama_model_loader: - type q5_K: 98 tensors
50
+ print_info: file format = GGUF V3 (latest)
51
+ print_info: file type = MXFP4 MoE
52
+ print_info: file size = 12.23 GiB (7.28 BPW)
53
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
54
+ load: printing all EOG tokens:
55
+ load: - 2 ('</s>')
56
+ load: special tokens cache size = 1000
57
+ load: token to piece cache size = 0.8498 MB
58
+ print_info: arch = llama
59
+ print_info: vocab_only = 0
60
+ print_info: n_ctx_train = 262400
61
+ print_info: n_embd = 5120
62
+ print_info: n_embd_inp = 5120
63
+ print_info: n_layer = 48
64
+ print_info: n_head = 32
65
+ print_info: n_head_kv = 8
66
+ print_info: n_rot = 128
67
+ print_info: n_swa = 0
68
+ print_info: is_swa_any = 0
69
+ print_info: n_embd_head_k = 128
70
+ print_info: n_embd_head_v = 128
71
+ print_info: n_gqa = 4
72
+ print_info: n_embd_k_gqa = 1024
73
+ print_info: n_embd_v_gqa = 1024
74
+ print_info: f_norm_eps = 0.0e+00
75
+ print_info: f_norm_rms_eps = 1.0e-05
76
+ print_info: f_clamp_kqv = 0.0e+00
77
+ print_info: f_max_alibi_bias = 0.0e+00
78
+ print_info: f_logit_scale = 0.0e+00
79
+ print_info: f_attn_scale = 0.0e+00
80
+ print_info: n_ff = 14336
81
+ print_info: n_expert = 0
82
+ print_info: n_expert_used = 0
83
+ print_info: n_expert_groups = 0
84
+ print_info: n_group_used = 0
85
+ print_info: causal attn = 1
86
+ print_info: pooling type = 0
87
+ print_info: rope type = 0
88
+ print_info: rope scaling = linear
89
+ print_info: freq_base_train = 1000000000.0
90
+ print_info: freq_scale_train = 1
91
+ print_info: n_ctx_orig_yarn = 262400
92
+ print_info: rope_finetuned = unknown
93
+ print_info: model type = 34B
94
+ print_info: model params = 14.43 B
95
+ print_info: general.name = Apriel 1.5 15b Thinker Unsloth
96
+ print_info: vocab type = BPE
97
+ print_info: n_vocab = 131072
98
+ print_info: n_merges = 269443
99
+ print_info: BOS token = 1 '<s>'
100
+ print_info: EOS token = 2 '</s>'
101
+ print_info: UNK token = 0 '<unk>'
102
+ print_info: PAD token = 11 '<pad>'
103
+ print_info: LF token = 1010 'Ċ'
104
+ print_info: EOG token = 2 '</s>'
105
+ print_info: max token length = 150
106
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
107
+ load_tensors: offloading 20 repeating layers to GPU
108
+ load_tensors: offloaded 20/49 layers to GPU
109
+ load_tensors: CPU_Mapped model buffer size = 7671.11 MiB
110
+ load_tensors: CUDA0 model buffer size = 2425.39 MiB
111
+ load_tensors: CUDA1 model buffer size = 2425.39 MiB
112
+ ...............................................................................................
113
+ llama_context: constructing llama_context
114
+ llama_context: n_seq_max = 1
115
+ llama_context: n_ctx = 2048
116
+ llama_context: n_ctx_seq = 2048
117
+ llama_context: n_batch = 2048
118
+ llama_context: n_ubatch = 512
119
+ llama_context: causal_attn = 1
120
+ llama_context: flash_attn = auto
121
+ llama_context: kv_unified = false
122
+ llama_context: freq_base = 1000000000.0
123
+ llama_context: freq_scale = 1
124
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262400) -- the full capacity of the model will not be utilized
125
+ llama_context: CPU output buffer size = 0.50 MiB
126
+ llama_kv_cache: CPU KV buffer size = 224.00 MiB
127
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
128
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
129
+ llama_kv_cache: size = 384.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
130
+ llama_context: Flash Attention was auto, set to enabled
131
+ llama_context: CUDA0 compute buffer size = 706.00 MiB
132
+ llama_context: CUDA1 compute buffer size = 116.01 MiB
133
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
134
+ llama_context: graph nodes = 1495
135
+ llama_context: graph splits = 313 (with bs=512), 4 (with bs=1)
136
+ common_init_from_params: added </s> logit bias = -inf
137
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
138
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
139
+
140
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
141
+ perplexity: tokenizing the input ..
142
+ perplexity: tokenization took 52.882 ms
143
+ perplexity: calculating perplexity over 46 chunks, n_ctx=2048, batch_size=2048, n_seq=1
144
+ perplexity: 3.40 seconds per pass - ETA 2.60 minutes
145
+ [1]3.3826,[2]2.9232,[3]2.0553,[4]1.8575,[5]1.6782,[6]1.8518,[7]1.9970,[8]2.0379,[9]1.9682,[10]1.8999,[11]1.8016,[12]1.8073,[13]1.7902,[14]1.7416,[15]1.6933,[16]1.7351,[17]1.7185,[18]1.6893,[19]1.6836,[20]1.6720,[21]1.6904,[22]1.7103,[23]1.6843,[24]1.6666,[25]1.6719,[26]1.6745,[27]1.6826,[28]1.6595,[29]1.6507,[30]1.6476,[31]1.6671,[32]1.6713,[33]1.6642,[34]1.6509,[35]1.6379,[36]1.6585,[37]1.6752,[38]1.7066,[39]1.7355,[40]1.7485,[41]1.7390,[42]1.7431,[43]1.7346,[44]1.7327,[45]1.7470,[46]1.7511,
146
+ Final estimate: PPL = 1.7511 +/- 0.01482
147
+
148
+ llama_perf_context_print: load time = 1816.73 ms
149
+ llama_perf_context_print: prompt eval time = 147455.21 ms / 94208 tokens ( 1.57 ms per token, 638.89 tokens per second)
150
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
151
+ llama_perf_context_print: total time = 148675.07 ms / 94209 tokens
152
+ llama_perf_context_print: graphs reused = 0
153
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
154
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 17871 + (3211 = 2425 + 80 + 706) + 3024 |
155
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20862 + (2621 = 2425 + 80 + 116) + 640 |
156
+ llama_memory_breakdown_print: | - Host | 7909 = 7671 + 224 + 14 |
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q5_K/perplexity_general.txt ADDED
@@ -0,0 +1,156 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21181 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 36 key-value pairs and 435 tensors from /mnt/world8/AI/Models/Apriel-1.5-15b-Thinker-unsloth/GGUF/MXFP4/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q5_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = llama
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Apriel 1.5 15b Thinker Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Thinker-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Apriel-1.5
16
+ llama_model_loader: - kv 5: general.size_label str = 15B
17
+ llama_model_loader: - kv 6: general.license str = mit
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Apriel 1.5 15b Thinker
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ServiceNow AI
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ServiceNow-AI/...
22
+ llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: llama.block_count u32 = 48
24
+ llama_model_loader: - kv 13: llama.context_length u32 = 262400
25
+ llama_model_loader: - kv 14: llama.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336
27
+ llama_model_loader: - kv 16: llama.attention.head_count u32 = 32
28
+ llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: llama.rope.freq_base f32 = 1000000000.000000
30
+ llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
31
+ llama_model_loader: - kv 20: llama.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: llama.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: llama.vocab_size u32 = 131072
34
+ llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 128
35
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
36
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = pixtral
37
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
39
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
40
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
42
+ llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 0
43
+ llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 11
44
+ llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
45
+ llama_model_loader: - kv 34: general.quantization_version u32 = 2
46
+ llama_model_loader: - kv 35: general.file_type u32 = 38
47
+ llama_model_loader: - type f32: 97 tensors
48
+ llama_model_loader: - type q8_0: 240 tensors
49
+ llama_model_loader: - type q5_K: 98 tensors
50
+ print_info: file format = GGUF V3 (latest)
51
+ print_info: file type = MXFP4 MoE
52
+ print_info: file size = 12.23 GiB (7.28 BPW)
53
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
54
+ load: printing all EOG tokens:
55
+ load: - 2 ('</s>')
56
+ load: special tokens cache size = 1000
57
+ load: token to piece cache size = 0.8498 MB
58
+ print_info: arch = llama
59
+ print_info: vocab_only = 0
60
+ print_info: n_ctx_train = 262400
61
+ print_info: n_embd = 5120
62
+ print_info: n_embd_inp = 5120
63
+ print_info: n_layer = 48
64
+ print_info: n_head = 32
65
+ print_info: n_head_kv = 8
66
+ print_info: n_rot = 128
67
+ print_info: n_swa = 0
68
+ print_info: is_swa_any = 0
69
+ print_info: n_embd_head_k = 128
70
+ print_info: n_embd_head_v = 128
71
+ print_info: n_gqa = 4
72
+ print_info: n_embd_k_gqa = 1024
73
+ print_info: n_embd_v_gqa = 1024
74
+ print_info: f_norm_eps = 0.0e+00
75
+ print_info: f_norm_rms_eps = 1.0e-05
76
+ print_info: f_clamp_kqv = 0.0e+00
77
+ print_info: f_max_alibi_bias = 0.0e+00
78
+ print_info: f_logit_scale = 0.0e+00
79
+ print_info: f_attn_scale = 0.0e+00
80
+ print_info: n_ff = 14336
81
+ print_info: n_expert = 0
82
+ print_info: n_expert_used = 0
83
+ print_info: n_expert_groups = 0
84
+ print_info: n_group_used = 0
85
+ print_info: causal attn = 1
86
+ print_info: pooling type = 0
87
+ print_info: rope type = 0
88
+ print_info: rope scaling = linear
89
+ print_info: freq_base_train = 1000000000.0
90
+ print_info: freq_scale_train = 1
91
+ print_info: n_ctx_orig_yarn = 262400
92
+ print_info: rope_finetuned = unknown
93
+ print_info: model type = 34B
94
+ print_info: model params = 14.43 B
95
+ print_info: general.name = Apriel 1.5 15b Thinker Unsloth
96
+ print_info: vocab type = BPE
97
+ print_info: n_vocab = 131072
98
+ print_info: n_merges = 269443
99
+ print_info: BOS token = 1 '<s>'
100
+ print_info: EOS token = 2 '</s>'
101
+ print_info: UNK token = 0 '<unk>'
102
+ print_info: PAD token = 11 '<pad>'
103
+ print_info: LF token = 1010 'Ċ'
104
+ print_info: EOG token = 2 '</s>'
105
+ print_info: max token length = 150
106
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
107
+ load_tensors: offloading 20 repeating layers to GPU
108
+ load_tensors: offloaded 20/49 layers to GPU
109
+ load_tensors: CPU_Mapped model buffer size = 7671.11 MiB
110
+ load_tensors: CUDA0 model buffer size = 2425.39 MiB
111
+ load_tensors: CUDA1 model buffer size = 2425.39 MiB
112
+ ...............................................................................................
113
+ llama_context: constructing llama_context
114
+ llama_context: n_seq_max = 1
115
+ llama_context: n_ctx = 2048
116
+ llama_context: n_ctx_seq = 2048
117
+ llama_context: n_batch = 2048
118
+ llama_context: n_ubatch = 512
119
+ llama_context: causal_attn = 1
120
+ llama_context: flash_attn = auto
121
+ llama_context: kv_unified = false
122
+ llama_context: freq_base = 1000000000.0
123
+ llama_context: freq_scale = 1
124
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262400) -- the full capacity of the model will not be utilized
125
+ llama_context: CPU output buffer size = 0.50 MiB
126
+ llama_kv_cache: CPU KV buffer size = 224.00 MiB
127
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
128
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
129
+ llama_kv_cache: size = 384.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
130
+ llama_context: Flash Attention was auto, set to enabled
131
+ llama_context: CUDA0 compute buffer size = 706.00 MiB
132
+ llama_context: CUDA1 compute buffer size = 116.01 MiB
133
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
134
+ llama_context: graph nodes = 1495
135
+ llama_context: graph splits = 313 (with bs=512), 4 (with bs=1)
136
+ common_init_from_params: added </s> logit bias = -inf
137
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
138
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
139
+
140
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
141
+ perplexity: tokenizing the input ..
142
+ perplexity: tokenization took 15.511 ms
143
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
144
+ perplexity: 3.50 seconds per pass - ETA 0.87 minutes
145
+ [1]8.9571,[2]12.3586,[3]13.8879,[4]13.1642,[5]12.6663,[6]10.5320,[7]9.3461,[8]9.2913,[9]9.9181,[10]10.0903,[11]10.1243,[12]10.5003,[13]10.6617,[14]10.7604,[15]10.9787,
146
+ Final estimate: PPL = 10.9787 +/- 0.29237
147
+
148
+ llama_perf_context_print: load time = 1812.87 ms
149
+ llama_perf_context_print: prompt eval time = 48331.85 ms / 30720 tokens ( 1.57 ms per token, 635.61 tokens per second)
150
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
151
+ llama_perf_context_print: total time = 48693.40 ms / 30721 tokens
152
+ llama_perf_context_print: graphs reused = 0
153
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
154
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 17871 + (3211 = 2425 + 80 + 706) + 3024 |
155
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20862 + (2621 = 2425 + 80 + 116) + 640 |
156
+ llama_memory_breakdown_print: | - Host | 7909 = 7671 + 224 + 14 |
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q5_K/perplexity_math.txt ADDED
@@ -0,0 +1,156 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21177 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 36 key-value pairs and 435 tensors from /mnt/world8/AI/Models/Apriel-1.5-15b-Thinker-unsloth/GGUF/MXFP4/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q5_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = llama
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Apriel 1.5 15b Thinker Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Thinker-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Apriel-1.5
16
+ llama_model_loader: - kv 5: general.size_label str = 15B
17
+ llama_model_loader: - kv 6: general.license str = mit
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Apriel 1.5 15b Thinker
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ServiceNow AI
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ServiceNow-AI/...
22
+ llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: llama.block_count u32 = 48
24
+ llama_model_loader: - kv 13: llama.context_length u32 = 262400
25
+ llama_model_loader: - kv 14: llama.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336
27
+ llama_model_loader: - kv 16: llama.attention.head_count u32 = 32
28
+ llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: llama.rope.freq_base f32 = 1000000000.000000
30
+ llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
31
+ llama_model_loader: - kv 20: llama.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: llama.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: llama.vocab_size u32 = 131072
34
+ llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 128
35
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
36
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = pixtral
37
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
39
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
40
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
42
+ llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 0
43
+ llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 11
44
+ llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
45
+ llama_model_loader: - kv 34: general.quantization_version u32 = 2
46
+ llama_model_loader: - kv 35: general.file_type u32 = 38
47
+ llama_model_loader: - type f32: 97 tensors
48
+ llama_model_loader: - type q8_0: 240 tensors
49
+ llama_model_loader: - type q5_K: 98 tensors
50
+ print_info: file format = GGUF V3 (latest)
51
+ print_info: file type = MXFP4 MoE
52
+ print_info: file size = 12.23 GiB (7.28 BPW)
53
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
54
+ load: printing all EOG tokens:
55
+ load: - 2 ('</s>')
56
+ load: special tokens cache size = 1000
57
+ load: token to piece cache size = 0.8498 MB
58
+ print_info: arch = llama
59
+ print_info: vocab_only = 0
60
+ print_info: n_ctx_train = 262400
61
+ print_info: n_embd = 5120
62
+ print_info: n_embd_inp = 5120
63
+ print_info: n_layer = 48
64
+ print_info: n_head = 32
65
+ print_info: n_head_kv = 8
66
+ print_info: n_rot = 128
67
+ print_info: n_swa = 0
68
+ print_info: is_swa_any = 0
69
+ print_info: n_embd_head_k = 128
70
+ print_info: n_embd_head_v = 128
71
+ print_info: n_gqa = 4
72
+ print_info: n_embd_k_gqa = 1024
73
+ print_info: n_embd_v_gqa = 1024
74
+ print_info: f_norm_eps = 0.0e+00
75
+ print_info: f_norm_rms_eps = 1.0e-05
76
+ print_info: f_clamp_kqv = 0.0e+00
77
+ print_info: f_max_alibi_bias = 0.0e+00
78
+ print_info: f_logit_scale = 0.0e+00
79
+ print_info: f_attn_scale = 0.0e+00
80
+ print_info: n_ff = 14336
81
+ print_info: n_expert = 0
82
+ print_info: n_expert_used = 0
83
+ print_info: n_expert_groups = 0
84
+ print_info: n_group_used = 0
85
+ print_info: causal attn = 1
86
+ print_info: pooling type = 0
87
+ print_info: rope type = 0
88
+ print_info: rope scaling = linear
89
+ print_info: freq_base_train = 1000000000.0
90
+ print_info: freq_scale_train = 1
91
+ print_info: n_ctx_orig_yarn = 262400
92
+ print_info: rope_finetuned = unknown
93
+ print_info: model type = 34B
94
+ print_info: model params = 14.43 B
95
+ print_info: general.name = Apriel 1.5 15b Thinker Unsloth
96
+ print_info: vocab type = BPE
97
+ print_info: n_vocab = 131072
98
+ print_info: n_merges = 269443
99
+ print_info: BOS token = 1 '<s>'
100
+ print_info: EOS token = 2 '</s>'
101
+ print_info: UNK token = 0 '<unk>'
102
+ print_info: PAD token = 11 '<pad>'
103
+ print_info: LF token = 1010 'Ċ'
104
+ print_info: EOG token = 2 '</s>'
105
+ print_info: max token length = 150
106
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
107
+ load_tensors: offloading 20 repeating layers to GPU
108
+ load_tensors: offloaded 20/49 layers to GPU
109
+ load_tensors: CPU_Mapped model buffer size = 7671.11 MiB
110
+ load_tensors: CUDA0 model buffer size = 2425.39 MiB
111
+ load_tensors: CUDA1 model buffer size = 2425.39 MiB
112
+ ...............................................................................................
113
+ llama_context: constructing llama_context
114
+ llama_context: n_seq_max = 1
115
+ llama_context: n_ctx = 2048
116
+ llama_context: n_ctx_seq = 2048
117
+ llama_context: n_batch = 2048
118
+ llama_context: n_ubatch = 512
119
+ llama_context: causal_attn = 1
120
+ llama_context: flash_attn = auto
121
+ llama_context: kv_unified = false
122
+ llama_context: freq_base = 1000000000.0
123
+ llama_context: freq_scale = 1
124
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262400) -- the full capacity of the model will not be utilized
125
+ llama_context: CPU output buffer size = 0.50 MiB
126
+ llama_kv_cache: CPU KV buffer size = 224.00 MiB
127
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
128
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
129
+ llama_kv_cache: size = 384.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
130
+ llama_context: Flash Attention was auto, set to enabled
131
+ llama_context: CUDA0 compute buffer size = 706.00 MiB
132
+ llama_context: CUDA1 compute buffer size = 116.01 MiB
133
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
134
+ llama_context: graph nodes = 1495
135
+ llama_context: graph splits = 313 (with bs=512), 4 (with bs=1)
136
+ common_init_from_params: added </s> logit bias = -inf
137
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
138
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
139
+
140
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
141
+ perplexity: tokenizing the input ..
142
+ perplexity: tokenization took 15.1 ms
143
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
144
+ perplexity: 3.40 seconds per pass - ETA 0.90 minutes
145
+ [1]7.4937,[2]8.3442,[3]8.7696,[4]9.4429,[5]9.5354,[6]9.5740,[7]9.5993,[8]9.4999,[9]9.5014,[10]9.4612,[11]9.4301,[12]9.4841,[13]9.5780,[14]9.6940,[15]9.6803,[16]9.5631,
146
+ Final estimate: PPL = 9.5631 +/- 0.24350
147
+
148
+ llama_perf_context_print: load time = 1807.29 ms
149
+ llama_perf_context_print: prompt eval time = 51496.97 ms / 32768 tokens ( 1.57 ms per token, 636.31 tokens per second)
150
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
151
+ llama_perf_context_print: total time = 51883.75 ms / 32769 tokens
152
+ llama_perf_context_print: graphs reused = 0
153
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
154
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 17869 + (3211 = 2425 + 80 + 706) + 3026 |
155
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20862 + (2621 = 2425 + 80 + 116) + 640 |
156
+ llama_memory_breakdown_print: | - Host | 7909 = 7671 + 224 + 14 |
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q5_K/ppl_corpus_code.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q5_K/ppl_corpus_general.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q5_K/ppl_corpus_math.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q6_K/llamabench.txt ADDED
@@ -0,0 +1,11 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ | model | size | params | backend | ngl | test | t/s |
7
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
8
+ | llama 34B MXFP4 MoE | 12.75 GiB | 14.43 B | CUDA | 35 | pp8 | 75.02 ± 1.98 |
9
+ | llama 34B MXFP4 MoE | 12.75 GiB | 14.43 B | CUDA | 35 | tg128 | 11.58 ± 0.01 |
10
+
11
+ build: 92bb442ad (7040)
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q6_K/perplexity_code.txt ADDED
@@ -0,0 +1,157 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21183 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 36 key-value pairs and 435 tensors from /mnt/world8/AI/Models/Apriel-1.5-15b-Thinker-unsloth/GGUF/MXFP4/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q6_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = llama
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Apriel 1.5 15b Thinker Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Thinker-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Apriel-1.5
16
+ llama_model_loader: - kv 5: general.size_label str = 15B
17
+ llama_model_loader: - kv 6: general.license str = mit
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Apriel 1.5 15b Thinker
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ServiceNow AI
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ServiceNow-AI/...
22
+ llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: llama.block_count u32 = 48
24
+ llama_model_loader: - kv 13: llama.context_length u32 = 262400
25
+ llama_model_loader: - kv 14: llama.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336
27
+ llama_model_loader: - kv 16: llama.attention.head_count u32 = 32
28
+ llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: llama.rope.freq_base f32 = 1000000000.000000
30
+ llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
31
+ llama_model_loader: - kv 20: llama.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: llama.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: llama.vocab_size u32 = 131072
34
+ llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 128
35
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
36
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = pixtral
37
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
39
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
40
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
42
+ llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 0
43
+ llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 11
44
+ llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
45
+ llama_model_loader: - kv 34: general.quantization_version u32 = 2
46
+ llama_model_loader: - kv 35: general.file_type u32 = 38
47
+ llama_model_loader: - type f32: 97 tensors
48
+ llama_model_loader: - type q8_0: 240 tensors
49
+ llama_model_loader: - type q5_K: 49 tensors
50
+ llama_model_loader: - type q6_K: 49 tensors
51
+ print_info: file format = GGUF V3 (latest)
52
+ print_info: file type = MXFP4 MoE
53
+ print_info: file size = 12.75 GiB (7.59 BPW)
54
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
55
+ load: printing all EOG tokens:
56
+ load: - 2 ('</s>')
57
+ load: special tokens cache size = 1000
58
+ load: token to piece cache size = 0.8498 MB
59
+ print_info: arch = llama
60
+ print_info: vocab_only = 0
61
+ print_info: n_ctx_train = 262400
62
+ print_info: n_embd = 5120
63
+ print_info: n_embd_inp = 5120
64
+ print_info: n_layer = 48
65
+ print_info: n_head = 32
66
+ print_info: n_head_kv = 8
67
+ print_info: n_rot = 128
68
+ print_info: n_swa = 0
69
+ print_info: is_swa_any = 0
70
+ print_info: n_embd_head_k = 128
71
+ print_info: n_embd_head_v = 128
72
+ print_info: n_gqa = 4
73
+ print_info: n_embd_k_gqa = 1024
74
+ print_info: n_embd_v_gqa = 1024
75
+ print_info: f_norm_eps = 0.0e+00
76
+ print_info: f_norm_rms_eps = 1.0e-05
77
+ print_info: f_clamp_kqv = 0.0e+00
78
+ print_info: f_max_alibi_bias = 0.0e+00
79
+ print_info: f_logit_scale = 0.0e+00
80
+ print_info: f_attn_scale = 0.0e+00
81
+ print_info: n_ff = 14336
82
+ print_info: n_expert = 0
83
+ print_info: n_expert_used = 0
84
+ print_info: n_expert_groups = 0
85
+ print_info: n_group_used = 0
86
+ print_info: causal attn = 1
87
+ print_info: pooling type = 0
88
+ print_info: rope type = 0
89
+ print_info: rope scaling = linear
90
+ print_info: freq_base_train = 1000000000.0
91
+ print_info: freq_scale_train = 1
92
+ print_info: n_ctx_orig_yarn = 262400
93
+ print_info: rope_finetuned = unknown
94
+ print_info: model type = 34B
95
+ print_info: model params = 14.43 B
96
+ print_info: general.name = Apriel 1.5 15b Thinker Unsloth
97
+ print_info: vocab type = BPE
98
+ print_info: n_vocab = 131072
99
+ print_info: n_merges = 269443
100
+ print_info: BOS token = 1 '<s>'
101
+ print_info: EOS token = 2 '</s>'
102
+ print_info: UNK token = 0 '<unk>'
103
+ print_info: PAD token = 11 '<pad>'
104
+ print_info: LF token = 1010 'Ċ'
105
+ print_info: EOG token = 2 '</s>'
106
+ print_info: max token length = 150
107
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
108
+ load_tensors: offloading 20 repeating layers to GPU
109
+ load_tensors: offloaded 20/49 layers to GPU
110
+ load_tensors: CPU_Mapped model buffer size = 8016.43 MiB
111
+ load_tensors: CUDA0 model buffer size = 2518.36 MiB
112
+ load_tensors: CUDA1 model buffer size = 2518.36 MiB
113
+ ...............................................................................................
114
+ llama_context: constructing llama_context
115
+ llama_context: n_seq_max = 1
116
+ llama_context: n_ctx = 2048
117
+ llama_context: n_ctx_seq = 2048
118
+ llama_context: n_batch = 2048
119
+ llama_context: n_ubatch = 512
120
+ llama_context: causal_attn = 1
121
+ llama_context: flash_attn = auto
122
+ llama_context: kv_unified = false
123
+ llama_context: freq_base = 1000000000.0
124
+ llama_context: freq_scale = 1
125
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262400) -- the full capacity of the model will not be utilized
126
+ llama_context: CPU output buffer size = 0.50 MiB
127
+ llama_kv_cache: CPU KV buffer size = 224.00 MiB
128
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
129
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
130
+ llama_kv_cache: size = 384.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
131
+ llama_context: Flash Attention was auto, set to enabled
132
+ llama_context: CUDA0 compute buffer size = 706.00 MiB
133
+ llama_context: CUDA1 compute buffer size = 116.01 MiB
134
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
135
+ llama_context: graph nodes = 1495
136
+ llama_context: graph splits = 313 (with bs=512), 4 (with bs=1)
137
+ common_init_from_params: added </s> logit bias = -inf
138
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
139
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
140
+
141
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
142
+ perplexity: tokenizing the input ..
143
+ perplexity: tokenization took 52.958 ms
144
+ perplexity: calculating perplexity over 46 chunks, n_ctx=2048, batch_size=2048, n_seq=1
145
+ perplexity: 3.47 seconds per pass - ETA 2.65 minutes
146
+ [1]3.4077,[2]2.9239,[3]2.0565,[4]1.8562,[5]1.6781,[6]1.8494,[7]1.9933,[8]2.0356,[9]1.9675,[10]1.9009,[11]1.8029,[12]1.8075,[13]1.7906,[14]1.7417,[15]1.6937,[16]1.7364,[17]1.7195,[18]1.6901,[19]1.6844,[20]1.6729,[21]1.6904,[22]1.7103,[23]1.6844,[24]1.6669,[25]1.6724,[26]1.6752,[27]1.6835,[28]1.6606,[29]1.6515,[30]1.6481,[31]1.6672,[32]1.6711,[33]1.6642,[34]1.6510,[35]1.6381,[36]1.6587,[37]1.6757,[38]1.7076,[39]1.7367,[40]1.7493,[41]1.7398,[42]1.7438,[43]1.7351,[44]1.7332,[45]1.7477,[46]1.7522,
147
+ Final estimate: PPL = 1.7522 +/- 0.01486
148
+
149
+ llama_perf_context_print: load time = 1989.08 ms
150
+ llama_perf_context_print: prompt eval time = 153124.90 ms / 94208 tokens ( 1.63 ms per token, 615.24 tokens per second)
151
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
152
+ llama_perf_context_print: total time = 154234.35 ms / 94209 tokens
153
+ llama_perf_context_print: graphs reused = 0
154
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
155
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 17778 + (3304 = 2518 + 80 + 706) + 3024 |
156
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20768 + (2714 = 2518 + 80 + 116) + 641 |
157
+ llama_memory_breakdown_print: | - Host | 8254 = 8016 + 224 + 14 |
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q6_K/perplexity_general.txt ADDED
@@ -0,0 +1,157 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21185 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 36 key-value pairs and 435 tensors from /mnt/world8/AI/Models/Apriel-1.5-15b-Thinker-unsloth/GGUF/MXFP4/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q6_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = llama
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Apriel 1.5 15b Thinker Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Thinker-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Apriel-1.5
16
+ llama_model_loader: - kv 5: general.size_label str = 15B
17
+ llama_model_loader: - kv 6: general.license str = mit
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Apriel 1.5 15b Thinker
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ServiceNow AI
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ServiceNow-AI/...
22
+ llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: llama.block_count u32 = 48
24
+ llama_model_loader: - kv 13: llama.context_length u32 = 262400
25
+ llama_model_loader: - kv 14: llama.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336
27
+ llama_model_loader: - kv 16: llama.attention.head_count u32 = 32
28
+ llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: llama.rope.freq_base f32 = 1000000000.000000
30
+ llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
31
+ llama_model_loader: - kv 20: llama.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: llama.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: llama.vocab_size u32 = 131072
34
+ llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 128
35
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
36
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = pixtral
37
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
39
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
40
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
42
+ llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 0
43
+ llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 11
44
+ llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
45
+ llama_model_loader: - kv 34: general.quantization_version u32 = 2
46
+ llama_model_loader: - kv 35: general.file_type u32 = 38
47
+ llama_model_loader: - type f32: 97 tensors
48
+ llama_model_loader: - type q8_0: 240 tensors
49
+ llama_model_loader: - type q5_K: 49 tensors
50
+ llama_model_loader: - type q6_K: 49 tensors
51
+ print_info: file format = GGUF V3 (latest)
52
+ print_info: file type = MXFP4 MoE
53
+ print_info: file size = 12.75 GiB (7.59 BPW)
54
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
55
+ load: printing all EOG tokens:
56
+ load: - 2 ('</s>')
57
+ load: special tokens cache size = 1000
58
+ load: token to piece cache size = 0.8498 MB
59
+ print_info: arch = llama
60
+ print_info: vocab_only = 0
61
+ print_info: n_ctx_train = 262400
62
+ print_info: n_embd = 5120
63
+ print_info: n_embd_inp = 5120
64
+ print_info: n_layer = 48
65
+ print_info: n_head = 32
66
+ print_info: n_head_kv = 8
67
+ print_info: n_rot = 128
68
+ print_info: n_swa = 0
69
+ print_info: is_swa_any = 0
70
+ print_info: n_embd_head_k = 128
71
+ print_info: n_embd_head_v = 128
72
+ print_info: n_gqa = 4
73
+ print_info: n_embd_k_gqa = 1024
74
+ print_info: n_embd_v_gqa = 1024
75
+ print_info: f_norm_eps = 0.0e+00
76
+ print_info: f_norm_rms_eps = 1.0e-05
77
+ print_info: f_clamp_kqv = 0.0e+00
78
+ print_info: f_max_alibi_bias = 0.0e+00
79
+ print_info: f_logit_scale = 0.0e+00
80
+ print_info: f_attn_scale = 0.0e+00
81
+ print_info: n_ff = 14336
82
+ print_info: n_expert = 0
83
+ print_info: n_expert_used = 0
84
+ print_info: n_expert_groups = 0
85
+ print_info: n_group_used = 0
86
+ print_info: causal attn = 1
87
+ print_info: pooling type = 0
88
+ print_info: rope type = 0
89
+ print_info: rope scaling = linear
90
+ print_info: freq_base_train = 1000000000.0
91
+ print_info: freq_scale_train = 1
92
+ print_info: n_ctx_orig_yarn = 262400
93
+ print_info: rope_finetuned = unknown
94
+ print_info: model type = 34B
95
+ print_info: model params = 14.43 B
96
+ print_info: general.name = Apriel 1.5 15b Thinker Unsloth
97
+ print_info: vocab type = BPE
98
+ print_info: n_vocab = 131072
99
+ print_info: n_merges = 269443
100
+ print_info: BOS token = 1 '<s>'
101
+ print_info: EOS token = 2 '</s>'
102
+ print_info: UNK token = 0 '<unk>'
103
+ print_info: PAD token = 11 '<pad>'
104
+ print_info: LF token = 1010 'Ċ'
105
+ print_info: EOG token = 2 '</s>'
106
+ print_info: max token length = 150
107
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
108
+ load_tensors: offloading 20 repeating layers to GPU
109
+ load_tensors: offloaded 20/49 layers to GPU
110
+ load_tensors: CPU_Mapped model buffer size = 8016.43 MiB
111
+ load_tensors: CUDA0 model buffer size = 2518.36 MiB
112
+ load_tensors: CUDA1 model buffer size = 2518.36 MiB
113
+ ...............................................................................................
114
+ llama_context: constructing llama_context
115
+ llama_context: n_seq_max = 1
116
+ llama_context: n_ctx = 2048
117
+ llama_context: n_ctx_seq = 2048
118
+ llama_context: n_batch = 2048
119
+ llama_context: n_ubatch = 512
120
+ llama_context: causal_attn = 1
121
+ llama_context: flash_attn = auto
122
+ llama_context: kv_unified = false
123
+ llama_context: freq_base = 1000000000.0
124
+ llama_context: freq_scale = 1
125
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262400) -- the full capacity of the model will not be utilized
126
+ llama_context: CPU output buffer size = 0.50 MiB
127
+ llama_kv_cache: CPU KV buffer size = 224.00 MiB
128
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
129
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
130
+ llama_kv_cache: size = 384.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
131
+ llama_context: Flash Attention was auto, set to enabled
132
+ llama_context: CUDA0 compute buffer size = 706.00 MiB
133
+ llama_context: CUDA1 compute buffer size = 116.01 MiB
134
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
135
+ llama_context: graph nodes = 1495
136
+ llama_context: graph splits = 313 (with bs=512), 4 (with bs=1)
137
+ common_init_from_params: added </s> logit bias = -inf
138
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
139
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
140
+
141
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
142
+ perplexity: tokenizing the input ..
143
+ perplexity: tokenization took 16.224 ms
144
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
145
+ perplexity: 3.51 seconds per pass - ETA 0.87 minutes
146
+ [1]9.0552,[2]12.4121,[3]13.9576,[4]13.1972,[5]12.6756,[6]10.5525,[7]9.3543,[8]9.2984,[9]9.9224,[10]10.1079,[11]10.1444,[12]10.5294,[13]10.6804,[14]10.7773,[15]10.9919,
147
+ Final estimate: PPL = 10.9919 +/- 0.29265
148
+
149
+ llama_perf_context_print: load time = 1920.15 ms
150
+ llama_perf_context_print: prompt eval time = 50208.46 ms / 30720 tokens ( 1.63 ms per token, 611.85 tokens per second)
151
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
152
+ llama_perf_context_print: total time = 50570.07 ms / 30721 tokens
153
+ llama_perf_context_print: graphs reused = 0
154
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
155
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 17781 + (3304 = 2518 + 80 + 706) + 3020 |
156
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20768 + (2714 = 2518 + 80 + 116) + 641 |
157
+ llama_memory_breakdown_print: | - Host | 8254 = 8016 + 224 + 14 |
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q6_K/perplexity_math.txt ADDED
@@ -0,0 +1,157 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21180 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 36 key-value pairs and 435 tensors from /mnt/world8/AI/Models/Apriel-1.5-15b-Thinker-unsloth/GGUF/MXFP4/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q6_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = llama
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Apriel 1.5 15b Thinker Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Thinker-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Apriel-1.5
16
+ llama_model_loader: - kv 5: general.size_label str = 15B
17
+ llama_model_loader: - kv 6: general.license str = mit
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Apriel 1.5 15b Thinker
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ServiceNow AI
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ServiceNow-AI/...
22
+ llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: llama.block_count u32 = 48
24
+ llama_model_loader: - kv 13: llama.context_length u32 = 262400
25
+ llama_model_loader: - kv 14: llama.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336
27
+ llama_model_loader: - kv 16: llama.attention.head_count u32 = 32
28
+ llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: llama.rope.freq_base f32 = 1000000000.000000
30
+ llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
31
+ llama_model_loader: - kv 20: llama.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: llama.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: llama.vocab_size u32 = 131072
34
+ llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 128
35
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
36
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = pixtral
37
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
39
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
40
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
42
+ llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 0
43
+ llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 11
44
+ llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
45
+ llama_model_loader: - kv 34: general.quantization_version u32 = 2
46
+ llama_model_loader: - kv 35: general.file_type u32 = 38
47
+ llama_model_loader: - type f32: 97 tensors
48
+ llama_model_loader: - type q8_0: 240 tensors
49
+ llama_model_loader: - type q5_K: 49 tensors
50
+ llama_model_loader: - type q6_K: 49 tensors
51
+ print_info: file format = GGUF V3 (latest)
52
+ print_info: file type = MXFP4 MoE
53
+ print_info: file size = 12.75 GiB (7.59 BPW)
54
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
55
+ load: printing all EOG tokens:
56
+ load: - 2 ('</s>')
57
+ load: special tokens cache size = 1000
58
+ load: token to piece cache size = 0.8498 MB
59
+ print_info: arch = llama
60
+ print_info: vocab_only = 0
61
+ print_info: n_ctx_train = 262400
62
+ print_info: n_embd = 5120
63
+ print_info: n_embd_inp = 5120
64
+ print_info: n_layer = 48
65
+ print_info: n_head = 32
66
+ print_info: n_head_kv = 8
67
+ print_info: n_rot = 128
68
+ print_info: n_swa = 0
69
+ print_info: is_swa_any = 0
70
+ print_info: n_embd_head_k = 128
71
+ print_info: n_embd_head_v = 128
72
+ print_info: n_gqa = 4
73
+ print_info: n_embd_k_gqa = 1024
74
+ print_info: n_embd_v_gqa = 1024
75
+ print_info: f_norm_eps = 0.0e+00
76
+ print_info: f_norm_rms_eps = 1.0e-05
77
+ print_info: f_clamp_kqv = 0.0e+00
78
+ print_info: f_max_alibi_bias = 0.0e+00
79
+ print_info: f_logit_scale = 0.0e+00
80
+ print_info: f_attn_scale = 0.0e+00
81
+ print_info: n_ff = 14336
82
+ print_info: n_expert = 0
83
+ print_info: n_expert_used = 0
84
+ print_info: n_expert_groups = 0
85
+ print_info: n_group_used = 0
86
+ print_info: causal attn = 1
87
+ print_info: pooling type = 0
88
+ print_info: rope type = 0
89
+ print_info: rope scaling = linear
90
+ print_info: freq_base_train = 1000000000.0
91
+ print_info: freq_scale_train = 1
92
+ print_info: n_ctx_orig_yarn = 262400
93
+ print_info: rope_finetuned = unknown
94
+ print_info: model type = 34B
95
+ print_info: model params = 14.43 B
96
+ print_info: general.name = Apriel 1.5 15b Thinker Unsloth
97
+ print_info: vocab type = BPE
98
+ print_info: n_vocab = 131072
99
+ print_info: n_merges = 269443
100
+ print_info: BOS token = 1 '<s>'
101
+ print_info: EOS token = 2 '</s>'
102
+ print_info: UNK token = 0 '<unk>'
103
+ print_info: PAD token = 11 '<pad>'
104
+ print_info: LF token = 1010 'Ċ'
105
+ print_info: EOG token = 2 '</s>'
106
+ print_info: max token length = 150
107
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
108
+ load_tensors: offloading 20 repeating layers to GPU
109
+ load_tensors: offloaded 20/49 layers to GPU
110
+ load_tensors: CPU_Mapped model buffer size = 8016.43 MiB
111
+ load_tensors: CUDA0 model buffer size = 2518.36 MiB
112
+ load_tensors: CUDA1 model buffer size = 2518.36 MiB
113
+ ...............................................................................................
114
+ llama_context: constructing llama_context
115
+ llama_context: n_seq_max = 1
116
+ llama_context: n_ctx = 2048
117
+ llama_context: n_ctx_seq = 2048
118
+ llama_context: n_batch = 2048
119
+ llama_context: n_ubatch = 512
120
+ llama_context: causal_attn = 1
121
+ llama_context: flash_attn = auto
122
+ llama_context: kv_unified = false
123
+ llama_context: freq_base = 1000000000.0
124
+ llama_context: freq_scale = 1
125
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262400) -- the full capacity of the model will not be utilized
126
+ llama_context: CPU output buffer size = 0.50 MiB
127
+ llama_kv_cache: CPU KV buffer size = 224.00 MiB
128
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
129
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
130
+ llama_kv_cache: size = 384.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
131
+ llama_context: Flash Attention was auto, set to enabled
132
+ llama_context: CUDA0 compute buffer size = 706.00 MiB
133
+ llama_context: CUDA1 compute buffer size = 116.01 MiB
134
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
135
+ llama_context: graph nodes = 1495
136
+ llama_context: graph splits = 313 (with bs=512), 4 (with bs=1)
137
+ common_init_from_params: added </s> logit bias = -inf
138
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
139
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
140
+
141
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
142
+ perplexity: tokenizing the input ..
143
+ perplexity: tokenization took 14.782 ms
144
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
145
+ perplexity: 3.45 seconds per pass - ETA 0.92 minutes
146
+ [1]7.5468,[2]8.3539,[3]8.7745,[4]9.4491,[5]9.5383,[6]9.5676,[7]9.5871,[8]9.4879,[9]9.4906,[10]9.4496,[11]9.4171,[12]9.4693,[13]9.5604,[14]9.6767,[15]9.6612,[16]9.5459,
147
+ Final estimate: PPL = 9.5459 +/- 0.24280
148
+
149
+ llama_perf_context_print: load time = 1909.25 ms
150
+ llama_perf_context_print: prompt eval time = 53016.11 ms / 32768 tokens ( 1.62 ms per token, 618.08 tokens per second)
151
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
152
+ llama_perf_context_print: total time = 53397.89 ms / 32769 tokens
153
+ llama_perf_context_print: graphs reused = 0
154
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
155
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 17780 + (3304 = 2518 + 80 + 706) + 3022 |
156
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20768 + (2714 = 2518 + 80 + 116) + 641 |
157
+ llama_memory_breakdown_print: | - Host | 8254 = 8016 + 224 + 14 |
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q6_K/ppl_corpus_code.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q6_K/ppl_corpus_general.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q5_K-router_gate_emb_q6_K/ppl_corpus_math.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q6_k-embd_f16/llamabench.txt ADDED
@@ -0,0 +1,11 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ | model | size | params | backend | ngl | test | t/s |
7
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
8
+ | llama 34B MXFP4 MoE | 14.49 GiB | 14.43 B | CUDA | 35 | pp8 | 66.67 ± 6.77 |
9
+ | llama 34B MXFP4 MoE | 14.49 GiB | 14.43 B | CUDA | 35 | tg128 | 10.64 ± 0.03 |
10
+
11
+ build: 92bb442ad (7040)
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q6_k-embd_f16/perplexity_code.txt ADDED
@@ -0,0 +1,157 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21181 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 36 key-value pairs and 435 tensors from /mnt/world8/AI/Models/Apriel-1.5-15b-Thinker-unsloth/GGUF/MXFP4/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q6_k-embd_f16.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = llama
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Apriel 1.5 15b Thinker Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Thinker-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Apriel-1.5
16
+ llama_model_loader: - kv 5: general.size_label str = 15B
17
+ llama_model_loader: - kv 6: general.license str = mit
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Apriel 1.5 15b Thinker
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ServiceNow AI
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ServiceNow-AI/...
22
+ llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: llama.block_count u32 = 48
24
+ llama_model_loader: - kv 13: llama.context_length u32 = 262400
25
+ llama_model_loader: - kv 14: llama.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336
27
+ llama_model_loader: - kv 16: llama.attention.head_count u32 = 32
28
+ llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: llama.rope.freq_base f32 = 1000000000.000000
30
+ llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
31
+ llama_model_loader: - kv 20: llama.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: llama.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: llama.vocab_size u32 = 131072
34
+ llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 128
35
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
36
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = pixtral
37
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
39
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
40
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
42
+ llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 0
43
+ llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 11
44
+ llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
45
+ llama_model_loader: - kv 34: general.quantization_version u32 = 2
46
+ llama_model_loader: - kv 35: general.file_type u32 = 38
47
+ llama_model_loader: - type f32: 97 tensors
48
+ llama_model_loader: - type f16: 1 tensors
49
+ llama_model_loader: - type q8_0: 288 tensors
50
+ llama_model_loader: - type q6_K: 49 tensors
51
+ print_info: file format = GGUF V3 (latest)
52
+ print_info: file type = MXFP4 MoE
53
+ print_info: file size = 14.49 GiB (8.62 BPW)
54
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
55
+ load: printing all EOG tokens:
56
+ load: - 2 ('</s>')
57
+ load: special tokens cache size = 1000
58
+ load: token to piece cache size = 0.8498 MB
59
+ print_info: arch = llama
60
+ print_info: vocab_only = 0
61
+ print_info: n_ctx_train = 262400
62
+ print_info: n_embd = 5120
63
+ print_info: n_embd_inp = 5120
64
+ print_info: n_layer = 48
65
+ print_info: n_head = 32
66
+ print_info: n_head_kv = 8
67
+ print_info: n_rot = 128
68
+ print_info: n_swa = 0
69
+ print_info: is_swa_any = 0
70
+ print_info: n_embd_head_k = 128
71
+ print_info: n_embd_head_v = 128
72
+ print_info: n_gqa = 4
73
+ print_info: n_embd_k_gqa = 1024
74
+ print_info: n_embd_v_gqa = 1024
75
+ print_info: f_norm_eps = 0.0e+00
76
+ print_info: f_norm_rms_eps = 1.0e-05
77
+ print_info: f_clamp_kqv = 0.0e+00
78
+ print_info: f_max_alibi_bias = 0.0e+00
79
+ print_info: f_logit_scale = 0.0e+00
80
+ print_info: f_attn_scale = 0.0e+00
81
+ print_info: n_ff = 14336
82
+ print_info: n_expert = 0
83
+ print_info: n_expert_used = 0
84
+ print_info: n_expert_groups = 0
85
+ print_info: n_group_used = 0
86
+ print_info: causal attn = 1
87
+ print_info: pooling type = 0
88
+ print_info: rope type = 0
89
+ print_info: rope scaling = linear
90
+ print_info: freq_base_train = 1000000000.0
91
+ print_info: freq_scale_train = 1
92
+ print_info: n_ctx_orig_yarn = 262400
93
+ print_info: rope_finetuned = unknown
94
+ print_info: model type = 34B
95
+ print_info: model params = 14.43 B
96
+ print_info: general.name = Apriel 1.5 15b Thinker Unsloth
97
+ print_info: vocab type = BPE
98
+ print_info: n_vocab = 131072
99
+ print_info: n_merges = 269443
100
+ print_info: BOS token = 1 '<s>'
101
+ print_info: EOS token = 2 '</s>'
102
+ print_info: UNK token = 0 '<unk>'
103
+ print_info: PAD token = 11 '<pad>'
104
+ print_info: LF token = 1010 'Ċ'
105
+ print_info: EOG token = 2 '</s>'
106
+ print_info: max token length = 150
107
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
108
+ load_tensors: offloading 20 repeating layers to GPU
109
+ load_tensors: offloaded 20/49 layers to GPU
110
+ load_tensors: CPU_Mapped model buffer size = 9405.49 MiB
111
+ load_tensors: CUDA0 model buffer size = 2714.45 MiB
112
+ load_tensors: CUDA1 model buffer size = 2714.45 MiB
113
+ ..........................................................................................
114
+ llama_context: constructing llama_context
115
+ llama_context: n_seq_max = 1
116
+ llama_context: n_ctx = 2048
117
+ llama_context: n_ctx_seq = 2048
118
+ llama_context: n_batch = 2048
119
+ llama_context: n_ubatch = 512
120
+ llama_context: causal_attn = 1
121
+ llama_context: flash_attn = auto
122
+ llama_context: kv_unified = false
123
+ llama_context: freq_base = 1000000000.0
124
+ llama_context: freq_scale = 1
125
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262400) -- the full capacity of the model will not be utilized
126
+ llama_context: CPU output buffer size = 0.50 MiB
127
+ llama_kv_cache: CPU KV buffer size = 224.00 MiB
128
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
129
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
130
+ llama_kv_cache: size = 384.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
131
+ llama_context: Flash Attention was auto, set to enabled
132
+ llama_context: CUDA0 compute buffer size = 791.00 MiB
133
+ llama_context: CUDA1 compute buffer size = 116.01 MiB
134
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
135
+ llama_context: graph nodes = 1495
136
+ llama_context: graph splits = 313 (with bs=512), 4 (with bs=1)
137
+ common_init_from_params: added </s> logit bias = -inf
138
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
139
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
140
+
141
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
142
+ perplexity: tokenizing the input ..
143
+ perplexity: tokenization took 57.065 ms
144
+ perplexity: calculating perplexity over 46 chunks, n_ctx=2048, batch_size=2048, n_seq=1
145
+ perplexity: 3.65 seconds per pass - ETA 2.80 minutes
146
+ [1]3.3975,[2]2.9283,[3]2.0589,[4]1.8564,[5]1.6795,[6]1.8529,[7]1.9968,[8]2.0386,[9]1.9702,[10]1.9013,[11]1.8042,[12]1.8082,[13]1.7910,[14]1.7420,[15]1.6935,[16]1.7355,[17]1.7182,[18]1.6884,[19]1.6827,[20]1.6711,[21]1.6888,[22]1.7083,[23]1.6824,[24]1.6648,[25]1.6702,[26]1.6725,[27]1.6808,[28]1.6576,[29]1.6486,[30]1.6452,[31]1.6640,[32]1.6683,[33]1.6613,[34]1.6482,[35]1.6355,[36]1.6557,[37]1.6721,[38]1.7038,[39]1.7329,[40]1.7457,[41]1.7363,[42]1.7403,[43]1.7318,[44]1.7299,[45]1.7446,[46]1.7489,
147
+ Final estimate: PPL = 1.7489 +/- 0.01480
148
+
149
+ llama_perf_context_print: load time = 2165.46 ms
150
+ llama_perf_context_print: prompt eval time = 160428.83 ms / 94208 tokens ( 1.70 ms per token, 587.23 tokens per second)
151
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
152
+ llama_perf_context_print: total time = 161611.20 ms / 94209 tokens
153
+ llama_perf_context_print: graphs reused = 0
154
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
155
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 17501 + (3585 = 2714 + 80 + 791) + 3019 |
156
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20572 + (2910 = 2714 + 80 + 116) + 641 |
157
+ llama_memory_breakdown_print: | - Host | 9643 = 9405 + 224 + 14 |
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q6_k-embd_f16/perplexity_general.txt ADDED
@@ -0,0 +1,157 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21183 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 36 key-value pairs and 435 tensors from /mnt/world8/AI/Models/Apriel-1.5-15b-Thinker-unsloth/GGUF/MXFP4/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q6_k-embd_f16.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = llama
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Apriel 1.5 15b Thinker Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Thinker-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Apriel-1.5
16
+ llama_model_loader: - kv 5: general.size_label str = 15B
17
+ llama_model_loader: - kv 6: general.license str = mit
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Apriel 1.5 15b Thinker
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ServiceNow AI
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ServiceNow-AI/...
22
+ llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: llama.block_count u32 = 48
24
+ llama_model_loader: - kv 13: llama.context_length u32 = 262400
25
+ llama_model_loader: - kv 14: llama.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336
27
+ llama_model_loader: - kv 16: llama.attention.head_count u32 = 32
28
+ llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: llama.rope.freq_base f32 = 1000000000.000000
30
+ llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
31
+ llama_model_loader: - kv 20: llama.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: llama.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: llama.vocab_size u32 = 131072
34
+ llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 128
35
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
36
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = pixtral
37
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
39
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
40
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
42
+ llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 0
43
+ llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 11
44
+ llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
45
+ llama_model_loader: - kv 34: general.quantization_version u32 = 2
46
+ llama_model_loader: - kv 35: general.file_type u32 = 38
47
+ llama_model_loader: - type f32: 97 tensors
48
+ llama_model_loader: - type f16: 1 tensors
49
+ llama_model_loader: - type q8_0: 288 tensors
50
+ llama_model_loader: - type q6_K: 49 tensors
51
+ print_info: file format = GGUF V3 (latest)
52
+ print_info: file type = MXFP4 MoE
53
+ print_info: file size = 14.49 GiB (8.62 BPW)
54
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
55
+ load: printing all EOG tokens:
56
+ load: - 2 ('</s>')
57
+ load: special tokens cache size = 1000
58
+ load: token to piece cache size = 0.8498 MB
59
+ print_info: arch = llama
60
+ print_info: vocab_only = 0
61
+ print_info: n_ctx_train = 262400
62
+ print_info: n_embd = 5120
63
+ print_info: n_embd_inp = 5120
64
+ print_info: n_layer = 48
65
+ print_info: n_head = 32
66
+ print_info: n_head_kv = 8
67
+ print_info: n_rot = 128
68
+ print_info: n_swa = 0
69
+ print_info: is_swa_any = 0
70
+ print_info: n_embd_head_k = 128
71
+ print_info: n_embd_head_v = 128
72
+ print_info: n_gqa = 4
73
+ print_info: n_embd_k_gqa = 1024
74
+ print_info: n_embd_v_gqa = 1024
75
+ print_info: f_norm_eps = 0.0e+00
76
+ print_info: f_norm_rms_eps = 1.0e-05
77
+ print_info: f_clamp_kqv = 0.0e+00
78
+ print_info: f_max_alibi_bias = 0.0e+00
79
+ print_info: f_logit_scale = 0.0e+00
80
+ print_info: f_attn_scale = 0.0e+00
81
+ print_info: n_ff = 14336
82
+ print_info: n_expert = 0
83
+ print_info: n_expert_used = 0
84
+ print_info: n_expert_groups = 0
85
+ print_info: n_group_used = 0
86
+ print_info: causal attn = 1
87
+ print_info: pooling type = 0
88
+ print_info: rope type = 0
89
+ print_info: rope scaling = linear
90
+ print_info: freq_base_train = 1000000000.0
91
+ print_info: freq_scale_train = 1
92
+ print_info: n_ctx_orig_yarn = 262400
93
+ print_info: rope_finetuned = unknown
94
+ print_info: model type = 34B
95
+ print_info: model params = 14.43 B
96
+ print_info: general.name = Apriel 1.5 15b Thinker Unsloth
97
+ print_info: vocab type = BPE
98
+ print_info: n_vocab = 131072
99
+ print_info: n_merges = 269443
100
+ print_info: BOS token = 1 '<s>'
101
+ print_info: EOS token = 2 '</s>'
102
+ print_info: UNK token = 0 '<unk>'
103
+ print_info: PAD token = 11 '<pad>'
104
+ print_info: LF token = 1010 'Ċ'
105
+ print_info: EOG token = 2 '</s>'
106
+ print_info: max token length = 150
107
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
108
+ load_tensors: offloading 20 repeating layers to GPU
109
+ load_tensors: offloaded 20/49 layers to GPU
110
+ load_tensors: CPU_Mapped model buffer size = 9405.49 MiB
111
+ load_tensors: CUDA0 model buffer size = 2714.45 MiB
112
+ load_tensors: CUDA1 model buffer size = 2714.45 MiB
113
+ ..........................................................................................
114
+ llama_context: constructing llama_context
115
+ llama_context: n_seq_max = 1
116
+ llama_context: n_ctx = 2048
117
+ llama_context: n_ctx_seq = 2048
118
+ llama_context: n_batch = 2048
119
+ llama_context: n_ubatch = 512
120
+ llama_context: causal_attn = 1
121
+ llama_context: flash_attn = auto
122
+ llama_context: kv_unified = false
123
+ llama_context: freq_base = 1000000000.0
124
+ llama_context: freq_scale = 1
125
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262400) -- the full capacity of the model will not be utilized
126
+ llama_context: CPU output buffer size = 0.50 MiB
127
+ llama_kv_cache: CPU KV buffer size = 224.00 MiB
128
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
129
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
130
+ llama_kv_cache: size = 384.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
131
+ llama_context: Flash Attention was auto, set to enabled
132
+ llama_context: CUDA0 compute buffer size = 791.00 MiB
133
+ llama_context: CUDA1 compute buffer size = 116.01 MiB
134
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
135
+ llama_context: graph nodes = 1495
136
+ llama_context: graph splits = 313 (with bs=512), 4 (with bs=1)
137
+ common_init_from_params: added </s> logit bias = -inf
138
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
139
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
140
+
141
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
142
+ perplexity: tokenizing the input ..
143
+ perplexity: tokenization took 25.108 ms
144
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
145
+ perplexity: 3.65 seconds per pass - ETA 0.90 minutes
146
+ [1]9.0308,[2]12.4289,[3]13.9506,[4]13.2066,[5]12.7049,[6]10.5678,[7]9.3747,[8]9.3238,[9]9.9632,[10]10.1451,[11]10.1711,[12]10.5666,[13]10.7180,[14]10.8205,[15]11.0427,
147
+ Final estimate: PPL = 11.0427 +/- 0.29455
148
+
149
+ llama_perf_context_print: load time = 2141.38 ms
150
+ llama_perf_context_print: prompt eval time = 52215.53 ms / 30720 tokens ( 1.70 ms per token, 588.33 tokens per second)
151
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
152
+ llama_perf_context_print: total time = 52584.46 ms / 30721 tokens
153
+ llama_perf_context_print: graphs reused = 0
154
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
155
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 17497 + (3585 = 2714 + 80 + 791) + 3023 |
156
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20572 + (2910 = 2714 + 80 + 116) + 641 |
157
+ llama_memory_breakdown_print: | - Host | 9643 = 9405 + 224 + 14 |
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q6_k-embd_f16/perplexity_math.txt ADDED
@@ -0,0 +1,157 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21185 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 36 key-value pairs and 435 tensors from /mnt/world8/AI/Models/Apriel-1.5-15b-Thinker-unsloth/GGUF/MXFP4/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q6_k-embd_f16.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = llama
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Apriel 1.5 15b Thinker Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Thinker-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Apriel-1.5
16
+ llama_model_loader: - kv 5: general.size_label str = 15B
17
+ llama_model_loader: - kv 6: general.license str = mit
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Apriel 1.5 15b Thinker
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ServiceNow AI
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ServiceNow-AI/...
22
+ llama_model_loader: - kv 11: general.tags arr[str,2] = ["unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: llama.block_count u32 = 48
24
+ llama_model_loader: - kv 13: llama.context_length u32 = 262400
25
+ llama_model_loader: - kv 14: llama.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336
27
+ llama_model_loader: - kv 16: llama.attention.head_count u32 = 32
28
+ llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: llama.rope.freq_base f32 = 1000000000.000000
30
+ llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
31
+ llama_model_loader: - kv 20: llama.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: llama.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: llama.vocab_size u32 = 131072
34
+ llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 128
35
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
36
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = pixtral
37
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
39
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
40
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
42
+ llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 0
43
+ llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 11
44
+ llama_model_loader: - kv 33: tokenizer.chat_template str = {# Unsloth template fixes #}\n{%- set ...
45
+ llama_model_loader: - kv 34: general.quantization_version u32 = 2
46
+ llama_model_loader: - kv 35: general.file_type u32 = 38
47
+ llama_model_loader: - type f32: 97 tensors
48
+ llama_model_loader: - type f16: 1 tensors
49
+ llama_model_loader: - type q8_0: 288 tensors
50
+ llama_model_loader: - type q6_K: 49 tensors
51
+ print_info: file format = GGUF V3 (latest)
52
+ print_info: file type = MXFP4 MoE
53
+ print_info: file size = 14.49 GiB (8.62 BPW)
54
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
55
+ load: printing all EOG tokens:
56
+ load: - 2 ('</s>')
57
+ load: special tokens cache size = 1000
58
+ load: token to piece cache size = 0.8498 MB
59
+ print_info: arch = llama
60
+ print_info: vocab_only = 0
61
+ print_info: n_ctx_train = 262400
62
+ print_info: n_embd = 5120
63
+ print_info: n_embd_inp = 5120
64
+ print_info: n_layer = 48
65
+ print_info: n_head = 32
66
+ print_info: n_head_kv = 8
67
+ print_info: n_rot = 128
68
+ print_info: n_swa = 0
69
+ print_info: is_swa_any = 0
70
+ print_info: n_embd_head_k = 128
71
+ print_info: n_embd_head_v = 128
72
+ print_info: n_gqa = 4
73
+ print_info: n_embd_k_gqa = 1024
74
+ print_info: n_embd_v_gqa = 1024
75
+ print_info: f_norm_eps = 0.0e+00
76
+ print_info: f_norm_rms_eps = 1.0e-05
77
+ print_info: f_clamp_kqv = 0.0e+00
78
+ print_info: f_max_alibi_bias = 0.0e+00
79
+ print_info: f_logit_scale = 0.0e+00
80
+ print_info: f_attn_scale = 0.0e+00
81
+ print_info: n_ff = 14336
82
+ print_info: n_expert = 0
83
+ print_info: n_expert_used = 0
84
+ print_info: n_expert_groups = 0
85
+ print_info: n_group_used = 0
86
+ print_info: causal attn = 1
87
+ print_info: pooling type = 0
88
+ print_info: rope type = 0
89
+ print_info: rope scaling = linear
90
+ print_info: freq_base_train = 1000000000.0
91
+ print_info: freq_scale_train = 1
92
+ print_info: n_ctx_orig_yarn = 262400
93
+ print_info: rope_finetuned = unknown
94
+ print_info: model type = 34B
95
+ print_info: model params = 14.43 B
96
+ print_info: general.name = Apriel 1.5 15b Thinker Unsloth
97
+ print_info: vocab type = BPE
98
+ print_info: n_vocab = 131072
99
+ print_info: n_merges = 269443
100
+ print_info: BOS token = 1 '<s>'
101
+ print_info: EOS token = 2 '</s>'
102
+ print_info: UNK token = 0 '<unk>'
103
+ print_info: PAD token = 11 '<pad>'
104
+ print_info: LF token = 1010 'Ċ'
105
+ print_info: EOG token = 2 '</s>'
106
+ print_info: max token length = 150
107
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
108
+ load_tensors: offloading 20 repeating layers to GPU
109
+ load_tensors: offloaded 20/49 layers to GPU
110
+ load_tensors: CPU_Mapped model buffer size = 9405.49 MiB
111
+ load_tensors: CUDA0 model buffer size = 2714.45 MiB
112
+ load_tensors: CUDA1 model buffer size = 2714.45 MiB
113
+ ..........................................................................................
114
+ llama_context: constructing llama_context
115
+ llama_context: n_seq_max = 1
116
+ llama_context: n_ctx = 2048
117
+ llama_context: n_ctx_seq = 2048
118
+ llama_context: n_batch = 2048
119
+ llama_context: n_ubatch = 512
120
+ llama_context: causal_attn = 1
121
+ llama_context: flash_attn = auto
122
+ llama_context: kv_unified = false
123
+ llama_context: freq_base = 1000000000.0
124
+ llama_context: freq_scale = 1
125
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262400) -- the full capacity of the model will not be utilized
126
+ llama_context: CPU output buffer size = 0.50 MiB
127
+ llama_kv_cache: CPU KV buffer size = 224.00 MiB
128
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
129
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
130
+ llama_kv_cache: size = 384.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
131
+ llama_context: Flash Attention was auto, set to enabled
132
+ llama_context: CUDA0 compute buffer size = 791.00 MiB
133
+ llama_context: CUDA1 compute buffer size = 116.01 MiB
134
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
135
+ llama_context: graph nodes = 1495
136
+ llama_context: graph splits = 313 (with bs=512), 4 (with bs=1)
137
+ common_init_from_params: added </s> logit bias = -inf
138
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
139
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
140
+
141
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
142
+ perplexity: tokenizing the input ..
143
+ perplexity: tokenization took 15.649 ms
144
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
145
+ perplexity: 3.67 seconds per pass - ETA 0.97 minutes
146
+ [1]7.6033,[2]8.3880,[3]8.8223,[4]9.4775,[5]9.5700,[6]9.6030,[7]9.6320,[8]9.5362,[9]9.5426,[10]9.5018,[11]9.4679,[12]9.5258,[13]9.6252,[14]9.7402,[15]9.7274,[16]9.6082,
147
+ Final estimate: PPL = 9.6082 +/- 0.24520
148
+
149
+ llama_perf_context_print: load time = 2147.74 ms
150
+ llama_perf_context_print: prompt eval time = 55818.89 ms / 32768 tokens ( 1.70 ms per token, 587.04 tokens per second)
151
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
152
+ llama_perf_context_print: total time = 56204.81 ms / 32769 tokens
153
+ llama_perf_context_print: graphs reused = 0
154
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
155
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 17501 + (3585 = 2714 + 80 + 791) + 3019 |
156
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20572 + (2910 = 2714 + 80 + 116) + 641 |
157
+ llama_memory_breakdown_print: | - Host | 9643 = 9405 + 224 + 14 |
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q6_k-embd_f16/ppl_corpus_code.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q6_k-embd_f16/ppl_corpus_general.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Apriel-1.5-15b-Thinker-Unsloth-MXFP4_MOE-output_q6_k-embd_f16/ppl_corpus_math.txt ADDED
The diff for this file is too large to render. See raw diff