July 2024: Prompt processing performance comparison

Performance comparison to llama.cpp

The results in the following tables are obtained with these parameters:

Model is LLaMA-v3-8B for AVX2 and LLaMA-v2-7B for ARM_NEON
The AVX2 CPU is a 16-core Ryzen-7950X
The ARM_NEON CPU is M2-Max
tinyBLAS is enabled in llama.cpp
llama.cpp results are for build: 081fe431 (3441), which was the current llama.cpp master branch when I pulled on July 23 2024.
Both projects are built without CUDA support, without BLAS, and with the Accelerate framework disabled

Prompt processing

Here I set the number of threads equal to the number of (performance) cores of the CPU: 16 threads for the Ryzen-7950X and 8 threads for the M2-Max. The following table summarizes the results. To keep the table from getting too long, I list only quantized models containing predominantly one quantization type (i.e., I exclude the QX_K - Medium/Large variants, which are typically a mix of QX_K and Q(X+1)_K, as well as IQ2_S and IQ3_XS).

The command line to generate the benchmark data is

./bin/llama-bench -m $model -p 512 -n 0 -t $num_threads -ngl 0
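To repeat this run over several models, the command line above can be assembled programmatically. The following is a minimal sketch, not the script that was used to produce the results: the helper name `bench_command` and the model file names are hypothetical, and only the flags shown in the command above are used.

```python
import shlex

def bench_command(model_path, num_threads, prompt_tokens=512):
    """Build the llama-bench argument list for a prompt-processing run.

    Hypothetical helper; it only mirrors the flags of the command above.
    """
    return [
        "./bin/llama-bench",
        "-m", model_path,
        "-p", str(prompt_tokens),  # prompt length in tokens
        "-n", "0",                 # generate no tokens: prompt processing only
        "-t", str(num_threads),    # one thread per (performance) core
        "-ngl", "0",               # offload no layers to a GPU
    ]

# Hypothetical model file names; substitute your own files.
models = ["llama-v3-8b-q4_0.gguf", "llama-v3-8b-iq3_s.gguf"]

for model in models:
    # Print the command; to actually run it, pass the list to
    # subprocess.run(bench_command(model, 16), check=True).
    print(shlex.join(bench_command(model, 16)))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when a model path contains spaces.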

Quantization	size	backend	threads	t/s (llama.cpp)	t/s (iqk_mul_mat)	Speedup
8B F16	14.96 GiB	AVX2	16	112.37 ± 0.40	131.27 ± 0.38	1.168
7B F16	12.55 GiB	NEON	8	90.28 ± 1.25	95.34 ± 0.15	1.056
8B Q8_0	7.95 GiB	AVX2	16	118.07 ± 0.53	134.00 ± 0.47	1.135
7B Q8_0	6.67 GiB	NEON	8	77.25 ± 1.81	94.14 ± 1.15	1.219
8B Q4_0	4.35 GiB	AVX2	16	104.46 ± 0.33	130.20 ± 0.29	1.246
7B Q4_0	3.57 GiB	NEON	8	65.46 ± 0.79	76.22 ± 0.71	1.164
8B Q4_1	4.77 GiB	AVX2	16	57.83 ± 0.24	160.69 ± 0.49	2.779
7B Q4_1	3.95 GiB	NEON	8	37.40 ± 0.50	65.83 ± 0.98	1.760
8B Q5_0	5.22 GiB	AVX2	16	53.50 ± 0.35	122.62 ± 0.48	2.292
7B Q5_0	4.34 GiB	NEON	8	29.31 ± 0.51	67.51 ± 1.17	2.303
8B Q5_1	5.64 GiB	AVX2	16	50.85 ± 0.36	147.15 ± 0.47	2.894
7B Q5_1	4.72 GiB	NEON	8	26.02 ± 0.37	58.49 ± 0.85	2.248
8B Q2_K_S	2.78 GiB	AVX2	16	110.11 ± 0.28	192.47 ± 1.35	1.748
7B Q2_K_S	2.16 GiB	NEON	8	35.44 ± 0.06	77.93 ± 1.64	2.199
8B Q3_K_S	3.41 GiB	AVX2	16	77.42 ± 0.36	181.64 ± 0.44	2.346
7B Q3_K_S	2.75 GiB	NEON	8	26.79 ± 0.03	59.38 ± 1.08	2.216
8B Q4_K_S	4.36 GiB	AVX2	16	98.92 ± 0.34	185.35 ± 0.39	1.874
7B Q4_K_S	3.59 GiB	NEON	8	46.55 ± 0.67	76.31 ± 0.38	1.639
8B Q5_K_S	5.21 GiB	AVX2	16	69.44 ± 0.31	179.62 ± 0.69	2.587
7B Q5_K_S	4.33 GiB	NEON	8	30.18 ± 0.23	65.34 ± 0.79	2.165
8B Q6_K	6.14 GiB	AVX2	16	74.89 ± 0.26	181.86 ± 0.55	2.428
7B Q6_K	5.15 GiB	NEON	8	28.12 ± 1.24	60.75 ± 1.15	2.160
8B IQ2_XXS	2.23 GiB	AVX2	16	42.57 ± 0.16	126.63 ± 0.55	2.975
7B IQ2_XXS	1.73 GiB	NEON	8	20.87 ± 0.20	64.29 ± 1.12	3.080
8B IQ2_XS	2.42 GiB	AVX2	16	46.45 ± 0.27	125.46 ± 0.43	2.701
7B IQ2_XS	1.89 GiB	NEON	8	22.77 ± 0.21	51.15 ± 0.24	2.246
8B IQ2_M	2.74 GiB	AVX2	16	40.76 ± 0.18	113.07 ± 0.48	2.774
7B IQ2_M	2.20 GiB	NEON	8	14.95 ± 0.26	44.87 ± 0.50	3.001
8B IQ3_XXS	3.04 GiB	AVX2	16	31.95 ± 0.20	109.86 ± 0.45	3.438
7B IQ3_XXS	2.41 GiB	NEON	8	14.40 ± 0.10	53.58 ± 0.85	3.721
8B IQ3_S	3.42 GiB	AVX2	16	28.04 ± 0.08	96.28 ± 0.45	3.434
7B IQ3_S	2.75 GiB	NEON	8	12.08 ± 0.30	49.72 ± 0.06	4.116
8B IQ4_XS	4.13 GiB	AVX2	16	68.98 ± 0.31	180.34 ± 0.55	2.614
7B IQ4_XS	3.37 GiB	NEON	8	40.67 ± 1.97	75.11 ± 1.97	1.847
8B IQ4_NL	4.35 GiB	AVX2	16	59.94 ± 0.21	129.06 ± 0.43	2.153
7B IQ4_NL	3.56 GiB	NEON	8	34.36 ± 0.81	76.02 ± 1.36	2.212
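
The Speedup column is the ratio of the two mean throughputs, t/s (iqk_mul_mat) divided by t/s (llama.cpp). The sketch below recomputes it for two rows of the table. The uncertainty estimate is an assumption added here for illustration (first-order error propagation, treating the two ± values as independent standard deviations); it is not part of the original results, and the function names are hypothetical.

```python
def speedup(baseline_tps, new_tps):
    """Ratio of the mean throughputs in tokens per second."""
    return new_tps / baseline_tps

def speedup_uncertainty(baseline, baseline_err, new, new_err):
    """First-order error propagation for a ratio.

    Assumption: the two errors are independent standard deviations.
    """
    ratio = new / baseline
    relative = ((baseline_err / baseline) ** 2 + (new_err / new) ** 2) ** 0.5
    return ratio * relative

# (name, backend, llama.cpp t/s, ±, iqk_mul_mat t/s, ±), copied from the table
rows = [
    ("8B F16", "AVX2", 112.37, 0.40, 131.27, 0.38),
    ("7B IQ3_S", "NEON", 12.08, 0.30, 49.72, 0.06),
]

for name, backend, base, base_err, new, new_err in rows:
    s = speedup(base, new)
    ds = speedup_uncertainty(base, base_err, new, new_err)
    print(f"{name} {backend}: {s:.3f} ± {ds:.3f}")
```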

We see that llama.cpp achieves respectable performance for fp16, Q8_0, and Q4_0, where this implementation is at most about 25% faster (speedups between 1.06 and 1.25). This is thanks to Justine Tunney's tinyBLAS, which llama.cpp uses for these quantization types. For all other quants we observe performance gains in the range of roughly 1.6X to 4.1X, which is no small feat considering that the ggml matrix multiplication functions have been rewritten several times since llama.cpp was first published. Performance gains are larger for i-quants due to their higher quant unpacking cost (see the discussion in "To tile or not to tile").