Compare token generation speeds for different devices and models. Find the best hardware setup for your local LLM inference needs.
This interactive tool simulates token generation speeds for various large language models (LLMs) across hardware configurations. Use it to make informed decisions about the hardware required to run LLMs locally, and compare configurations to find the optimal setup for your deployment.
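The practical meaning of a tokens-per-second figure is how long a response takes to stream. As a minimal sketch (the function name and example numbers are illustrative, not taken from the tables):

```python
def estimated_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady tokens_per_second rate."""
    return num_tokens / tokens_per_second

# A ~500-token answer at 40 tokens/s takes about 12.5 seconds.
print(f"{estimated_seconds(500, 40):.1f} s")  # -> 12.5 s
```

Real streams are not perfectly steady, so treat this as a first-order estimate.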
Apple Silicon, prompt processing speed (tokens/second):

Device | Llama2_7B_F16 | Llama2_7B_Q4_0 | Llama2_7B_Q8_0 | Llama3_70B_F16 | Llama3_70B_Q4_K_M | Llama3_8B_F16 | Llama3_8B_Q4_K_M |
---|---|---|---|---|---|---|---|
M1_GPUCores_7_BW_68GBs | - | 107.81 | 108.21 | - | - | - | 87.26 |
M1_GPUCores_8_BW_68GBs | - | 117.96 | 117.25 | - | - | - | - |
M1_Pro_GPUCores_14_BW_200GBs | - | 232.55 | 235.16 | - | - | - | - |
M1_Pro_GPUCores_16_BW_200GBs | 302.14 | 266.25 | 270.37 | - | - | - | - |
M1_Max_GPUCores_24_BW_400GBs | 453.03 | 400.26 | 405.87 | - | - | - | - |
M1_Max_GPUCores_32_BW_400GBs | 599.53 | 530.06 | 537.37 | - | 33.01 | 418.77 | 355.45 |
M1_Ultra_GPUCores_48_BW_800GBs | 875.81 | 772.24 | 783.45 | - | - | - | - |
M2_GPUCores_10_BW_100GBs | 201.34 | 179.57 | 181.4 | - | - | - | - |
M2_Pro_GPUCores_16_BW_200GBs | 312.65 | 294.24 | 288.46 | - | - | - | - |
M2_Pro_GPUCores_19_BW_200GBs | 384.38 | 341.19 | 344.5 | - | - | - | - |
M2_Max_GPUCores_30_BW_400GBs | 600.46 | 537.6 | 540.15 | - | - | - | - |
M2_Max_GPUCores_38_BW_400GBs | 755.67 | 671.31 | 677.91 | - | - | - | - |
M2_Ultra_GPUCores_60_BW_800GBs | 1128.59 | 1013.81 | 1003.16 | - | - | - | - |
M2_Ultra_GPUCores_76_BW_800GBs | 1401.85 | 1238.48 | 1248.59 | 145.82 | 117.76 | 1202.74 | 1023.89 |
M3_GPUCores_10_BW_100GBs | - | 186.75 | 187.52 | - | - | - | - |
M3_Pro_GPUCores_14_BW_150GBs | - | 269.49 | 272.11 | - | - | - | - |
M3_Pro_GPUCores_18_BW_150GBs | 357.45 | 341.67 | 344.66 | - | - | - | - |
M3_Max_GPUCores_40_BW_400GBs | 779.17 | 759.7 | 757.64 | - | 62.88 | 751.49 | 678.04 |
Apple Silicon, token generation speed (tokens/second):

Device | Llama2_7B_F16 | Llama2_7B_Q4_0 | Llama2_7B_Q8_0 | Llama3_70B_F16 | Llama3_70B_Q4_K_M | Llama3_8B_F16 | Llama3_8B_Q4_K_M |
---|---|---|---|---|---|---|---|
M1_GPUCores_7_BW_68GBs | - | 14.19 | 7.92 | - | - | - | 9.72 |
M1_GPUCores_8_BW_68GBs | - | 14.15 | 7.91 | - | - | - | - |
M1_Pro_GPUCores_14_BW_200GBs | - | 35.52 | 21.95 | - | - | - | - |
M1_Pro_GPUCores_16_BW_200GBs | 12.75 | 36.41 | 22.34 | - | - | - | - |
M1_Max_GPUCores_24_BW_400GBs | 22.55 | 54.61 | 37.81 | - | - | - | - |
M1_Max_GPUCores_32_BW_400GBs | 23.03 | 61.19 | 40.2 | - | 4.09 | 18.43 | 34.49 |
M1_Ultra_GPUCores_48_BW_800GBs | 33.92 | 74.93 | 55.69 | - | - | - | - |
M2_GPUCores_10_BW_100GBs | 6.72 | 21.91 | 12.21 | - | - | - | - |
M2_Pro_GPUCores_16_BW_200GBs | 12.47 | 37.87 | 22.7 | - | - | - | - |
M2_Pro_GPUCores_19_BW_200GBs | 13.06 | 38.86 | 23.01 | - | - | - | - |
M2_Max_GPUCores_30_BW_400GBs | 24.16 | 60.99 | 39.97 | - | - | - | - |
M2_Max_GPUCores_38_BW_400GBs | 24.65 | 65.95 | 41.83 | - | - | - | - |
M2_Ultra_GPUCores_60_BW_800GBs | 39.86 | 88.64 | 62.14 | - | - | - | - |
M2_Ultra_GPUCores_76_BW_800GBs | 41.02 | 94.27 | 66.64 | 4.71 | 12.13 | 36.25 | 76.28 |
M3_GPUCores_10_BW_100GBs | - | 21.34 | 12.27 | - | - | - | - |
M3_Pro_GPUCores_14_BW_150GBs | - | 30.65 | 17.44 | - | - | - | - |
M3_Pro_GPUCores_18_BW_150GBs | 9.89 | 30.74 | 17.53 | - | - | - | - |
M3_Max_GPUCores_40_BW_400GBs | 25.09 | 66.31 | 42.75 | - | 7.53 | 22.39 | 50.74 |
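A pattern worth noticing in the tables above: token generation is largely memory-bandwidth bound, because each generated token must read the full set of model weights. A rough hedged sketch of the resulting ceiling (the ~14 GB weight size for Llama2 7B in F16 is an approximation, roughly 7 billion parameters at 2 bytes each):

```python
def bandwidth_bound_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on generation speed: every token reads all weights once."""
    return bandwidth_gb_s / model_size_gb

# M2 Ultra (800 GB/s) running Llama2 7B in F16 (~14 GB of weights):
ceiling = bandwidth_bound_tps(800, 14)   # roughly 57 tokens/s theoretical ceiling
measured = 41.02                          # from the table above
print(f"ceiling ~{ceiling:.0f} t/s, measured {measured} t/s "
      f"({measured / ceiling:.0%} of the bound)")
```

Measured speeds land below the bound because compute, kernel launch overhead, and cache behavior also matter, but the bandwidth ratio explains why Q4 quantization (smaller weights) generates faster than F16 on the same chip.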
NVIDIA GPUs, prompt processing speed (tokens/second):

Device | Llama3_70B_F16 | Llama3_70B_Q4_K_M | Llama3_8B_F16 | Llama3_8B_Q4_K_M |
---|---|---|---|---|
3070_8GB | - | - | - | 2283.62 |
3080_10GB | - | - | - | 3557.02 |
3080_Ti_12GB | - | - | - | 3556.67 |
4070_Ti_12GB | - | - | - | 3653.07 |
4080_16GB | - | - | 6758.9 | 5064.99 |
RTX_4000_Ada_20GB | - | - | 2951.87 | 2310.53 |
3090_24GB | - | - | 4239.64 | 3865.39 |
4090_24GB | - | - | 9056.26 | 6898.71 |
RTX_5000_Ada_32GB | - | - | 5835.41 | 4467.46 |
3090_24GB_x2 | - | 393.89 | 4690.5 | 4004.14 |
4090_24GB_x2 | - | 905.38 | 11094.51 | 8545.0 |
RTX_A6000_48GB | - | 466.82 | 4315.18 | 3621.81 |
RTX_6000_Ada_48GB | - | 547.03 | 6205.44 | 5560.94 |
A40_48GB | - | 239.92 | 4043.05 | 3240.95 |
L40S_48GB | - | 649.08 | 2491.65 | 5908.52 |
RTX_4000_Ada_20GB_x4 | - | 306.44 | 4366.64 | 3369.24 |
A100_PCIe_80GB | - | 726.65 | 7504.24 | 5800.48 |
A100_SXM_80GB | - | - | - | - |
NVIDIA GPUs, token generation speed (tokens/second):

Device | Llama3_70B_F16 | Llama3_70B_Q4_K_M | Llama3_8B_F16 | Llama3_8B_Q4_K_M |
---|---|---|---|---|
3070_8GB | - | - | - | 70.94 |
3080_10GB | - | - | - | 106.4 |
3080_Ti_12GB | - | - | - | 106.71 |
4070_Ti_12GB | - | - | - | 82.21 |
4080_16GB | - | - | 40.29 | 106.22 |
RTX_4000_Ada_20GB | - | - | 20.85 | 58.59 |
3090_24GB | - | - | 46.51 | 111.74 |
4090_24GB | - | - | 54.34 | 127.74 |
RTX_5000_Ada_32GB | - | - | 32.67 | 89.87 |
3090_24GB_x2 | - | 16.29 | 47.15 | 108.07 |
4090_24GB_x2 | - | 19.06 | 53.27 | 122.56 |
RTX_A6000_48GB | - | 14.58 | 40.25 | 102.22 |
RTX_6000_Ada_48GB | - | 18.36 | 51.97 | 130.99 |
A40_48GB | - | 12.08 | 33.95 | 88.95 |
L40S_48GB | - | 15.31 | 43.42 | 113.6 |
RTX_4000_Ada_20GB_x4 | - | 7.33 | 20.58 | 56.14 |
A100_PCIe_80GB | - | 22.11 | 54.56 | 138.31 |
A100_SXM_80GB | - | 24.33 | 53.18 | 133.38 |
This page simulates token generation to demonstrate the concept of tokens per second in language models.
The simulator is intended for educational purposes, to help you visualize token generation speeds. It is not indicative of real-world performance, which involves complex computations and varies greatly with model size, quantization, and hardware.
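The simulation idea can be sketched in a few lines: emit "tokens" at a fixed rate and compare elapsed against expected time. This toy version splits on whitespace rather than using a real tokenizer, and the names and rates are illustrative:

```python
import time

def simulate_generation(text: str, tokens_per_second: float) -> float:
    """Print whitespace-split 'tokens' at a fixed rate; return elapsed seconds.

    A toy stand-in for an LLM stream: real tokenizers emit sub-word units
    and real per-token latency fluctuates.
    """
    delay = 1.0 / tokens_per_second
    start = time.perf_counter()
    for token in text.split():
        print(token, end=" ", flush=True)
        time.sleep(delay)
    print()
    return time.perf_counter() - start

words = "Comparing local LLM inference speeds across hardware"
expected = len(words.split()) / 50   # 7 tokens at 50 t/s -> 0.14 s expected
elapsed = simulate_generation(words, 50)
print(f"Expected: {expected:.3f} s, Elapsed: {elapsed:.3f} s")
```

Elapsed time typically runs slightly over expected because printing and timer overhead add to each token's sleep, which is the same effect the simulator's Elapsed/Expected readout illustrates.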