MoE models
There is PR-6840 from Justine Tunney in llama.cpp, but it has remained unmerged since April 23, so I'll compare performance to the master branch for Mixtral-8x7B. Because quantizing Mixtral-8x7B is quite a lengthy process, the following table shows data only for Q4_K_S (a commonly used 4-bit k-quant), Q5_0 (a 5-bit legacy quant), and IQ3_XXS (a 3-bit i-quant).
| model | size | backend | threads | test | t/s (llama.cpp) | t/s (iqk_mul_mat) | Speedup |
|:------|-----:|:--------|--------:|:-----|----------------:|------------------:|--------:|
| 8x7B Q4_K_S | 48.75 GiB | AVX2 | 16 | pp512 | 54.92 ± 0.23 | 102.94 ± 0.37 | 1.874 |
| | | NEON | 8 | pp512 | 23.54 ± 1.56 | 38.32 ± 0.54 | 1.628 |
| | | AVX2 | 4 | tg128 | 7.80 ± 0.07 | 7.83 ± 0.09 | 1.004 |
| | | NEON | 8 | tg128 | 14.95 ± 0.25 | 15.28 ± 0.24 | 1.022 |
| 8x7B IQ3_XXS | 33.07 GiB | AVX2 | 16 | pp512 | 17.58 ± 0.04 | 68.45 ± 0.22 | 3.894 |
| | | NEON | 8 | pp512 | 7.75 ± 0.04 | 34.67 ± 0.40 | 4.474 |
| | | AVX2 | 4 | tg128 | 4.60 ± 0.01 | 5.45 ± 0.09 | 1.185 |
| | | AVX2 | 8 | tg128 | 8.04 ± 0.65 | 9.83 ± 0.06 | 1.223 |
| | | AVX2 | 16 | tg128 | 10.42 ± 0.01 | 10.57 ± 0.01 | 1.014 |
| | | NEON | 8 | tg128 | 6.19 ± 1.16 | 7.27 ± 0.14 | 1.174 |
| 8x7B Q5_0 | 59.11 GiB | AVX2 | 16 | pp512 | 29.06 ± 0.43 | 62.67 ± 0.32 | 2.157 |
| | | NEON | 8 | pp512 | 15.17 ± 0.51 | 27.36 ± 1.03 | 1.804 |
| | | AVX2 | 4 | tg128 | 5.44 ± 0.10 | 6.81 ± 0.06 | 1.252 |
| | | NEON | 8 | tg128 | 12.03 ± 0.77 | 12.41 ± 1.27 | 1.032 |
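
The Speedup column appears to be the ratio of the iqk_mul_mat throughput to the llama.cpp throughput, both in tokens per second. A minimal sketch of that arithmetic, using two rows from the table; the helper name `speedup` is hypothetical and not part of either code base:

```python
def speedup(baseline_tps: float, new_tps: float) -> float:
    """Ratio of new throughput to baseline throughput (tokens per second)."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return new_tps / baseline_tps


# (llama.cpp t/s, iqk_mul_mat t/s) mean values copied from the table;
# the ± uncertainties are ignored in this sketch.
rows = {
    "8x7B Q4_K_S, AVX2, 16 threads, pp512": (54.92, 102.94),
    "8x7B IQ3_XXS, NEON, 8 threads, pp512": (7.75, 34.67),
}

for label, (base, new) in rows.items():
    print(f"{label}: {speedup(base, new):.3f}")
```

A value above 1 means iqk_mul_mat is faster for that configuration. Values near 1, as in several tg128 rows, should be read against the ± spread reported for each measurement.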