July 2024: MoE performance comparison

MoE models

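There is PR-6840 from Justine Tunney in llama.cpp, but it has not been merged since April 23, so I'll compare performance to the master branch for Mixtral-8x7B. As Mixtral-8x7B quantization is quite a lengthy process, the following table shows data only for Q4_K_S (a commonly used 4-bit k-quant), Q5_0 (a 5-bit legacy quant), and IQ3_XXS (a 3-bit i-quant). Here pp512 is prompt processing of 512 tokens, tg128 is generation of 128 tokens, t/s is tokens per second, and Speedup is the ratio of the iqk_mul_mat figure to the llama.cpp figure.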

model	size	backend	threads	test	t/s (llama.cpp)	t/s (iqk_mul_mat)	Speedup
8x7B Q4_K_S	48.75 GiB	AVX2	16	pp512	54.92 ± 0.23	102.94 ± 0.37	1.874
		NEON	8	pp512	23.54 ± 1.56	38.32 ± 0.54	1.628
		AVX2	4	tg128	7.80 ± 0.07	7.83 ± 0.09	1.004
		NEON	8	tg128	14.95 ± 0.25	15.28 ± 0.24	1.022
8x7B IQ3_XXS	33.07 GiB	AVX2	16	pp512	17.58 ± 0.04	68.45 ± 0.22	3.894
		NEON	8	pp512	7.75 ± 0.04	34.67 ± 0.40	4.474
		AVX2	4	tg128	4.60 ± 0.01	5.45 ± 0.09	1.185
		AVX2	8	tg128	8.04 ± 0.65	9.83 ± 0.06	1.223
		AVX2	16	tg128	10.42 ± 0.01	10.57 ± 0.01	1.014
		NEON	8	tg128	6.19 ± 1.16	7.27 ± 0.14	1.174
8x7B Q5_0	59.11 GiB	AVX2	16	pp512	29.06 ± 0.43	62.67 ± 0.32	2.157
		NEON	8	pp512	15.17 ± 0.51	27.36 ± 1.03	1.804
		AVX2	4	tg128	5.44 ± 0.10	6.81 ± 0.06	1.252
		NEON	8	tg128	12.03 ± 0.77	12.41 ± 1.27	1.032
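
The Speedup column can be reproduced from the two t/s columns. The following is a minimal sketch in Python, not part of the benchmark tooling: the function names `speedup` and `speedup_with_error` are hypothetical, and the error estimate assumes the two ± values are independent standard deviations, which the table itself does not state.

```python
import math

def speedup(baseline_tps, new_tps):
    """Ratio of the new throughput to the baseline throughput."""
    return new_tps / baseline_tps

def speedup_with_error(baseline_tps, baseline_err, new_tps, new_err):
    """Speedup plus a rough uncertainty.

    Assumption (not from the source): the two +/- values are independent
    standard deviations, so relative errors add in quadrature.
    """
    ratio = new_tps / baseline_tps
    rel_err = math.sqrt((baseline_err / baseline_tps) ** 2
                        + (new_err / new_tps) ** 2)
    return ratio, ratio * rel_err

# A few rows copied from the table:
# (label, llama.cpp t/s, +/-, iqk_mul_mat t/s, +/-)
rows = [
    ("Q4_K_S  AVX2 pp512", 54.92, 0.23, 102.94, 0.37),
    ("Q4_K_S  AVX2 tg128", 7.80, 0.07, 7.83, 0.09),
    ("IQ3_XXS NEON pp512", 7.75, 0.04, 34.67, 0.40),
]

for label, base, base_err, new, new_err in rows:
    ratio, err = speedup_with_error(base, base_err, new, new_err)
    print(f"{label}: {ratio:.3f} +/- {err:.3f}")
```

Under that independence assumption, the Q4_K_S AVX2 tg128 speedup of 1.004 carries an uncertainty of about 0.015, so it is indistinguishable from no change, while the pp512 speedups are far outside the measurement noise.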