LLM Token Generation Speed Simulator & Benchmark

Compare token generation speeds for different devices and models. Find the best hardware setup for your local LLM inference needs.


About This LLM Performance Simulator

This interactive tool simulates token generation speeds for a range of large language models (LLMs) and hardware configurations. Here's what you can learn:

  • Token Generation Speed: Understand how different devices and models affect LLM inference speed.
  • Hardware Comparison: Compare GPUs, CPUs, and Apple Silicon chips for LLM performance.
  • Model Optimization: See the impact of quantization on token generation speed.
  • Real-world Expectations: Get insights into expected performance for your specific hardware and model combination.

Use this simulator to make informed decisions about hardware requirements for running LLMs locally on your machine.
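
A useful mental model behind these comparisons: single-stream token generation is usually limited by memory bandwidth, because producing each token streams roughly the entire set of model weights through memory. The sketch below shows that back-of-the-envelope estimate; the function name and bytes-per-parameter figures are illustrative assumptions, not part of this tool.

```typescript
// Rough ceiling for decode speed when generation is memory-bandwidth bound:
// each generated token reads (approximately) every model weight once.
function estimateTokensPerSecond(
  bandwidthGBs: number,   // device memory bandwidth in GB/s
  paramsBillions: number, // model size in billions of parameters
  bytesPerParam: number,  // ~2.0 for F16, ~1.0 for Q8_0, ~0.56 for Q4_0
): number {
  const bytesPerToken = paramsBillions * 1e9 * bytesPerParam;
  return (bandwidthGBs * 1e9) / bytesPerToken;
}

// M1 Max (400 GB/s) with Llama 2 7B at Q4_0:
// 400e9 / (7e9 * 0.56) ≈ 102 tokens/s as an upper bound; the measured
// ~61 tokens/s in the tables below reflects real-world overheads.
console.log(estimateTokensPerSecond(400, 7, 0.56).toFixed(1));
```

This also explains why quantized models generate faster: fewer bytes per parameter means fewer bytes moved per token.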

How to Use This LLM Benchmark Tool

  1. Select your device from the dropdown menu.
  2. Choose an LLM model and quantization level.
  3. Adjust the token generation speed if desired.
  4. Set the total number of tokens to generate.
  5. Click "Start" to run the simulation.
  6. Analyze the results to understand potential performance on your hardware.

Compare different configurations to find the optimal setup for your local LLM deployment.
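
For intuition about what "Start" does, a pacing loop like the following is enough to reproduce the simulator's behavior. This is a hypothetical sketch (names and numbers are illustrative), not the tool's actual source:

```typescript
// Emit `totalTokens` placeholder tokens at `tokensPerSecond`, then compare
// the measured elapsed time against the ideal expected time.
function runSimulation(totalTokens: number, tokensPerSecond: number): void {
  const expectedSeconds = totalTokens / tokensPerSecond;
  const start = performance.now();
  let generated = 0;

  const timer = setInterval(() => {
    generated += 1; // the real tool would also append ~4 characters of text here
    if (generated >= totalTokens) {
      clearInterval(timer);
      const elapsedSeconds = (performance.now() - start) / 1000;
      console.log(`Expected: ${expectedSeconds.toFixed(3)} s`);
      console.log(`Elapsed:  ${elapsedSeconds.toFixed(3)} s`); // usually slightly longer
    }
  }, 1000 / tokensPerSecond);
}

runSimulation(512, 128); // 512 tokens at 128 tokens/s -> expected 4.000 s
```

Browsers clamp timer resolution (typically to around 4 ms), so at high token rates a real implementation must emit several tokens per tick; that clamping is one reason elapsed time drifts above expected time.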

AI Model Benchmarks

Apple Processing Benchmarks

Prompt processing (prefill) speed in tokens per second; a dash means no result was recorded for that combination.

| Device | Llama 2 7B F16 | Llama 2 7B Q4_0 | Llama 2 7B Q8_0 | Llama 3 70B F16 | Llama 3 70B Q4_K_M | Llama 3 8B F16 | Llama 3 8B Q4_K_M |
|---|---|---|---|---|---|---|---|
| M1 (7-core GPU, 68 GB/s) | - | 107.81 | 108.21 | - | - | - | 87.26 |
| M1 (8-core GPU, 68 GB/s) | - | 117.96 | 117.25 | - | - | - | - |
| M1 Pro (14-core GPU, 200 GB/s) | - | 232.55 | 235.16 | - | - | - | - |
| M1 Pro (16-core GPU, 200 GB/s) | 302.14 | 266.25 | 270.37 | - | - | - | - |
| M1 Max (24-core GPU, 400 GB/s) | 453.03 | 400.26 | 405.87 | - | - | - | - |
| M1 Max (32-core GPU, 400 GB/s) | 599.53 | 530.06 | 537.37 | - | 33.01 | 418.77 | 355.45 |
| M1 Ultra (48-core GPU, 800 GB/s) | 875.81 | 772.24 | 783.45 | - | - | - | - |
| M2 (10-core GPU, 100 GB/s) | 201.34 | 179.57 | 181.40 | - | - | - | - |
| M2 Pro (16-core GPU, 200 GB/s) | 312.65 | 294.24 | 288.46 | - | - | - | - |
| M2 Pro (19-core GPU, 200 GB/s) | 384.38 | 341.19 | 344.50 | - | - | - | - |
| M2 Max (30-core GPU, 400 GB/s) | 600.46 | 537.60 | 540.15 | - | - | - | - |
| M2 Max (38-core GPU, 400 GB/s) | 755.67 | 671.31 | 677.91 | - | - | - | - |
| M2 Ultra (60-core GPU, 800 GB/s) | 1128.59 | 1013.81 | 1003.16 | - | - | - | - |
| M2 Ultra (76-core GPU, 800 GB/s) | 1401.85 | 1238.48 | 1248.59 | 145.82 | 117.76 | 1202.74 | 1023.89 |
| M3 (10-core GPU, 100 GB/s) | - | 186.75 | 187.52 | - | - | - | - |
| M3 Pro (14-core GPU, 150 GB/s) | - | 269.49 | 272.11 | - | - | - | - |
| M3 Pro (18-core GPU, 150 GB/s) | 357.45 | 341.67 | 344.66 | - | - | - | - |
| M3 Max (40-core GPU, 400 GB/s) | 779.17 | 759.70 | 757.64 | - | 62.88 | 751.49 | 678.04 |

Apple Generation Benchmarks

Token generation (decode) speed in tokens per second; a dash means no result was recorded for that combination.

| Device | Llama 2 7B F16 | Llama 2 7B Q4_0 | Llama 2 7B Q8_0 | Llama 3 70B F16 | Llama 3 70B Q4_K_M | Llama 3 8B F16 | Llama 3 8B Q4_K_M |
|---|---|---|---|---|---|---|---|
| M1 (7-core GPU, 68 GB/s) | - | 14.19 | 7.92 | - | - | - | 9.72 |
| M1 (8-core GPU, 68 GB/s) | - | 14.15 | 7.91 | - | - | - | - |
| M1 Pro (14-core GPU, 200 GB/s) | - | 35.52 | 21.95 | - | - | - | - |
| M1 Pro (16-core GPU, 200 GB/s) | 12.75 | 36.41 | 22.34 | - | - | - | - |
| M1 Max (24-core GPU, 400 GB/s) | 22.55 | 54.61 | 37.81 | - | - | - | - |
| M1 Max (32-core GPU, 400 GB/s) | 23.03 | 61.19 | 40.20 | - | 4.09 | 18.43 | 34.49 |
| M1 Ultra (48-core GPU, 800 GB/s) | 33.92 | 74.93 | 55.69 | - | - | - | - |
| M2 (10-core GPU, 100 GB/s) | 6.72 | 21.91 | 12.21 | - | - | - | - |
| M2 Pro (16-core GPU, 200 GB/s) | 12.47 | 37.87 | 22.70 | - | - | - | - |
| M2 Pro (19-core GPU, 200 GB/s) | 13.06 | 38.86 | 23.01 | - | - | - | - |
| M2 Max (30-core GPU, 400 GB/s) | 24.16 | 60.99 | 39.97 | - | - | - | - |
| M2 Max (38-core GPU, 400 GB/s) | 24.65 | 65.95 | 41.83 | - | - | - | - |
| M2 Ultra (60-core GPU, 800 GB/s) | 39.86 | 88.64 | 62.14 | - | - | - | - |
| M2 Ultra (76-core GPU, 800 GB/s) | 41.02 | 94.27 | 66.64 | 4.71 | 12.13 | 36.25 | 76.28 |
| M3 (10-core GPU, 100 GB/s) | - | 21.34 | 12.27 | - | - | - | - |
| M3 Pro (14-core GPU, 150 GB/s) | - | 30.65 | 17.44 | - | - | - | - |
| M3 Pro (18-core GPU, 150 GB/s) | 9.89 | 30.74 | 17.53 | - | - | - | - |
| M3 Max (40-core GPU, 400 GB/s) | 25.09 | 66.31 | 42.75 | - | 7.53 | 22.39 | 50.74 |

NVIDIA Processing Benchmarks

Prompt processing (prefill) speed in tokens per second; a dash means no result was recorded for that combination.

| Device | Llama 3 70B F16 | Llama 3 70B Q4_K_M | Llama 3 8B F16 | Llama 3 8B Q4_K_M |
|---|---|---|---|---|
| 3070 (8 GB) | - | - | - | 2283.62 |
| 3080 (10 GB) | - | - | - | 3557.02 |
| 3080 Ti (12 GB) | - | - | - | 3556.67 |
| 4070 Ti (12 GB) | - | - | - | 3653.07 |
| 4080 (16 GB) | - | - | 6758.90 | 5064.99 |
| RTX 4000 Ada (20 GB) | - | - | 2951.87 | 2310.53 |
| 3090 (24 GB) | - | - | 4239.64 | 3865.39 |
| 4090 (24 GB) | - | - | 9056.26 | 6898.71 |
| RTX 5000 Ada (32 GB) | - | - | 5835.41 | 4467.46 |
| 2x 3090 (24 GB) | - | 393.89 | 4690.50 | 4004.14 |
| 2x 4090 (24 GB) | - | 905.38 | 11094.51 | 8545.00 |
| RTX A6000 (48 GB) | - | 466.82 | 4315.18 | 3621.81 |
| RTX 6000 Ada (48 GB) | - | 547.03 | 6205.44 | 5560.94 |
| A40 (48 GB) | - | 239.92 | 4043.05 | 3240.95 |
| L40S (48 GB) | - | 649.08 | 2491.65 | 5908.52 |
| 4x RTX 4000 Ada (20 GB) | - | 306.44 | 4366.64 | 3369.24 |
| A100 PCIe (80 GB) | - | 726.65 | 7504.24 | 5800.48 |
| A100 SXM (80 GB) | - | - | - | - |

NVIDIA Generation Benchmarks

Token generation (decode) speed in tokens per second; a dash means no result was recorded for that combination.

| Device | Llama 3 70B F16 | Llama 3 70B Q4_K_M | Llama 3 8B F16 | Llama 3 8B Q4_K_M |
|---|---|---|---|---|
| 3070 (8 GB) | - | - | - | 70.94 |
| 3080 (10 GB) | - | - | - | 106.40 |
| 3080 Ti (12 GB) | - | - | - | 106.71 |
| 4070 Ti (12 GB) | - | - | - | 82.21 |
| 4080 (16 GB) | - | - | 40.29 | 106.22 |
| RTX 4000 Ada (20 GB) | - | - | 20.85 | 58.59 |
| 3090 (24 GB) | - | - | 46.51 | 111.74 |
| 4090 (24 GB) | - | - | 54.34 | 127.74 |
| RTX 5000 Ada (32 GB) | - | - | 32.67 | 89.87 |
| 2x 3090 (24 GB) | - | 16.29 | 47.15 | 108.07 |
| 2x 4090 (24 GB) | - | 19.06 | 53.27 | 122.56 |
| RTX A6000 (48 GB) | - | 14.58 | 40.25 | 102.22 |
| RTX 6000 Ada (48 GB) | - | 18.36 | 51.97 | 130.99 |
| A40 (48 GB) | - | 12.08 | 33.95 | 88.95 |
| L40S (48 GB) | - | 15.31 | 43.42 | 113.60 |
| 4x RTX 4000 Ada (20 GB) | - | 7.33 | 20.58 | 56.14 |
| A100 PCIe (80 GB) | - | 22.11 | 54.56 | 138.31 |
| A100 SXM (80 GB) | - | 24.33 | 53.18 | 133.38 |
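
To turn any generation figure into a wall-clock estimate, divide the desired output length by the speed. A quick worked example using two rows from the tables above:

```typescript
// Time to generate a response of a given length at a benchmarked speed.
function secondsForResponse(outputTokens: number, tokensPerSecond: number): number {
  return outputTokens / tokensPerSecond;
}

// A 512-token answer from Llama 3 8B Q4_K_M:
secondsForResponse(512, 127.74); // 4090 (24 GB):         ≈ 4.0 s
secondsForResponse(512, 34.49);  // M1 Max (32-core GPU): ≈ 14.8 s
```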

Understanding the Simulation

This is a simulation of token generation, designed to demonstrate the concept of tokens per second in language models. Here's what's happening:

  • Token Generation: We're simulating the generation of tokens, where each token is approximately 4 characters of text.
  • Expected Time: Calculated as (Total Tokens) / (Tokens per Second). This is the ideal time if tokens were emitted at exactly the configured rate, with no overhead (see the worked example after this list).
  • Elapsed Time: Measured using the browser's performance API. This includes the time taken to generate tokens and update the display.
  • Differences in Times: The elapsed time is typically longer than the expected time due to factors like:
    • JavaScript execution speed
    • DOM update performance
    • Your device's current load and capabilities
  • Not a Real Language Model: This simulation uses pre-defined text and doesn't perform actual language model computations. Real language models would have different performance characteristics.
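
A worked version of the time arithmetic above, with illustrative numbers:

```typescript
// Expected time is plain division: 2,048 tokens at 64 tokens/s.
const expectedSeconds = 2048 / 64; // 32.000 s

// At roughly 4 characters per token, that run prints about:
const approxCharacters = 2048 * 4; // ~8,192 characters of text

// Elapsed time wraps the same run in performance.now() timestamps,
// so it also counts JavaScript execution and DOM updates:
const start = performance.now();
// ... generate and render the 2,048 tokens ...
const elapsedSeconds = (performance.now() - start) / 1000;
```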

This simulator is meant for educational purposes to help visualize token generation speeds. It's not indicative of real-world language model performance, which involves complex computations and can vary greatly based on model size and hardware.