LLM Token Generation Speed Simulator & Benchmark

Compare token generation speeds for different devices and models. Find the best hardware setup for your local LLM inference needs.


About This LLM Performance Simulator

This interactive tool simulates token generation speeds for a range of large language models (LLMs) and hardware configurations. Here's what you can learn:

  • Token Generation Speed: Understand how different devices and models affect LLM inference speed.
  • Hardware Comparison: Compare GPUs, CPUs, and Apple Silicon chips for LLM performance.
  • Model Optimization: See the impact of quantization on token generation speed.
  • Real-world Expectations: Get insights into expected performance for your specific hardware and model combination.

Use this simulator to make informed decisions about hardware requirements for running LLMs locally on your machine.
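
A useful mental model behind these comparisons: single-stream token generation is usually limited by memory bandwidth, because producing each token streams roughly the entire set of model weights through memory. The sketch below shows that back-of-the-envelope estimate; the function name and bytes-per-parameter figures are illustrative assumptions, not part of this tool.

```typescript
// Rough ceiling for decode speed when generation is memory-bandwidth bound:
// each generated token reads (approximately) every model weight once.
function estimateTokensPerSecond(
  bandwidthGBs: number,   // device memory bandwidth in GB/s
  paramsBillions: number, // model size in billions of parameters
  bytesPerParam: number,  // ~2.0 for F16, ~1.0 for Q8_0, ~0.56 for Q4_0
): number {
  const bytesPerToken = paramsBillions * 1e9 * bytesPerParam;
  return (bandwidthGBs * 1e9) / bytesPerToken;
}

// M1 Max (400 GB/s) with Llama 2 7B at Q4_0:
// 400e9 / (7e9 * 0.56) ≈ 102 tokens/s as an upper bound; the measured
// ~61 tokens/s in the tables below reflects real-world overheads.
console.log(estimateTokensPerSecond(400, 7, 0.56).toFixed(1));
```

This also explains why quantized models generate faster: fewer bytes per parameter means fewer bytes moved per token.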

How to Use This LLM Benchmark Tool

  1. Select your device from the dropdown menu.
  2. Choose an LLM model and quantization level.
  3. Adjust the token generation speed if desired.
  4. Set the total number of tokens to generate.
  5. Click "Start" to run the simulation.
  6. Analyze the results to understand potential performance on your hardware.

Compare different configurations to find the optimal setup for your local LLM deployment.
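
For intuition about what "Start" does, a pacing loop like the following is enough to reproduce the simulator's behavior. This is a hypothetical sketch (names and numbers are illustrative), not the tool's actual source:

```typescript
// Emit `totalTokens` placeholder tokens at `tokensPerSecond`, then compare
// the measured elapsed time against the ideal expected time.
function runSimulation(totalTokens: number, tokensPerSecond: number): void {
  const expectedSeconds = totalTokens / tokensPerSecond;
  const start = performance.now();
  let generated = 0;

  const timer = setInterval(() => {
    generated += 1; // the real tool would also append ~4 characters of text here
    if (generated >= totalTokens) {
      clearInterval(timer);
      const elapsedSeconds = (performance.now() - start) / 1000;
      console.log(`Expected: ${expectedSeconds.toFixed(3)} s`);
      console.log(`Elapsed:  ${elapsedSeconds.toFixed(3)} s`); // usually slightly longer
    }
  }, 1000 / tokensPerSecond);
}

runSimulation(512, 128); // 512 tokens at 128 tokens/s -> expected 4.000 s
```

Browsers clamp timer resolution (typically to around 4 ms), so at high token rates a real implementation must emit several tokens per tick; that clamping is one reason elapsed time drifts above expected time.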

AI Model Benchmarks

Apple Processing Benchmarks

Prompt processing (prefill) speed in tokens per second; a dash means no result was recorded for that combination.

| Device | Llama 2 7B F16 | Llama 2 7B Q4_0 | Llama 2 7B Q8_0 | Llama 3 70B F16 | Llama 3 70B Q4_K_M | Llama 3 8B F16 | Llama 3 8B Q4_K_M |
|---|---|---|---|---|---|---|---|
| M1 (7-core GPU, 68 GB/s) | - | 107.81 | 108.21 | - | - | - | 87.26 |
| M1 (8-core GPU, 68 GB/s) | - | 117.96 | 117.25 | - | - | - | - |
| M1 Pro (14-core GPU, 200 GB/s) | - | 232.55 | 235.16 | - | - | - | - |
| M1 Pro (16-core GPU, 200 GB/s) | 302.14 | 266.25 | 270.37 | - | - | - | - |
| M1 Max (24-core GPU, 400 GB/s) | 453.03 | 400.26 | 405.87 | - | - | - | - |
| M1 Max (32-core GPU, 400 GB/s) | 599.53 | 530.06 | 537.37 | - | 33.01 | 418.77 | 355.45 |
| M1 Ultra (48-core GPU, 800 GB/s) | 875.81 | 772.24 | 783.45 | - | - | - | - |
| M2 (10-core GPU, 100 GB/s) | 201.34 | 179.57 | 181.40 | - | - | - | - |
| M2 Pro (16-core GPU, 200 GB/s) | 312.65 | 294.24 | 288.46 | - | - | - | - |
| M2 Pro (19-core GPU, 200 GB/s) | 384.38 | 341.19 | 344.50 | - | - | - | - |
| M2 Max (30-core GPU, 400 GB/s) | 600.46 | 537.60 | 540.15 | - | - | - | - |
| M2 Max (38-core GPU, 400 GB/s) | 755.67 | 671.31 | 677.91 | - | - | - | - |
| M2 Ultra (60-core GPU, 800 GB/s) | 1128.59 | 1013.81 | 1003.16 | - | - | - | - |
| M2 Ultra (76-core GPU, 800 GB/s) | 1401.85 | 1238.48 | 1248.59 | 145.82 | 117.76 | 1202.74 | 1023.89 |
| M3 (10-core GPU, 100 GB/s) | - | 186.75 | 187.52 | - | - | - | - |
| M3 Pro (14-core GPU, 150 GB/s) | - | 269.49 | 272.11 | - | - | - | - |
| M3 Pro (18-core GPU, 150 GB/s) | 357.45 | 341.67 | 344.66 | - | - | - | - |
| M3 Max (40-core GPU, 400 GB/s) | 779.17 | 759.70 | 757.64 | - | 62.88 | 751.49 | 678.04 |

Apple Generation Benchmarks

Token generation (decode) speed in tokens per second; a dash means no result was recorded for that combination.

| Device | Llama 2 7B F16 | Llama 2 7B Q4_0 | Llama 2 7B Q8_0 | Llama 3 70B F16 | Llama 3 70B Q4_K_M | Llama 3 8B F16 | Llama 3 8B Q4_K_M |
|---|---|---|---|---|---|---|---|
| M1 (7-core GPU, 68 GB/s) | - | 14.19 | 7.92 | - | - | - | 9.72 |
| M1 (8-core GPU, 68 GB/s) | - | 14.15 | 7.91 | - | - | - | - |
| M1 Pro (14-core GPU, 200 GB/s) | - | 35.52 | 21.95 | - | - | - | - |
| M1 Pro (16-core GPU, 200 GB/s) | 12.75 | 36.41 | 22.34 | - | - | - | - |
| M1 Max (24-core GPU, 400 GB/s) | 22.55 | 54.61 | 37.81 | - | - | - | - |
| M1 Max (32-core GPU, 400 GB/s) | 23.03 | 61.19 | 40.20 | - | 4.09 | 18.43 | 34.49 |
| M1 Ultra (48-core GPU, 800 GB/s) | 33.92 | 74.93 | 55.69 | - | - | - | - |
| M2 (10-core GPU, 100 GB/s) | 6.72 | 21.91 | 12.21 | - | - | - | - |
| M2 Pro (16-core GPU, 200 GB/s) | 12.47 | 37.87 | 22.70 | - | - | - | - |
| M2 Pro (19-core GPU, 200 GB/s) | 13.06 | 38.86 | 23.01 | - | - | - | - |
| M2 Max (30-core GPU, 400 GB/s) | 24.16 | 60.99 | 39.97 | - | - | - | - |
| M2 Max (38-core GPU, 400 GB/s) | 24.65 | 65.95 | 41.83 | - | - | - | - |
| M2 Ultra (60-core GPU, 800 GB/s) | 39.86 | 88.64 | 62.14 | - | - | - | - |
| M2 Ultra (76-core GPU, 800 GB/s) | 41.02 | 94.27 | 66.64 | 4.71 | 12.13 | 36.25 | 76.28 |
| M3 (10-core GPU, 100 GB/s) | - | 21.34 | 12.27 | - | - | - | - |
| M3 Pro (14-core GPU, 150 GB/s) | - | 30.65 | 17.44 | - | - | - | - |
| M3 Pro (18-core GPU, 150 GB/s) | 9.89 | 30.74 | 17.53 | - | - | - | - |
| M3 Max (40-core GPU, 400 GB/s) | 25.09 | 66.31 | 42.75 | - | 7.53 | 22.39 | 50.74 |

NVIDIA Processing Benchmarks

Prompt processing (prefill) speed in tokens per second; a dash means no result was recorded for that combination.

| Device | Llama 3 70B F16 | Llama 3 70B Q4_K_M | Llama 3 8B F16 | Llama 3 8B Q4_K_M |
|---|---|---|---|---|
| 3070 (8 GB) | - | - | - | 2283.62 |
| 3080 (10 GB) | - | - | - | 3557.02 |
| 3080 Ti (12 GB) | - | - | - | 3556.67 |
| 4070 Ti (12 GB) | - | - | - | 3653.07 |
| 4080 (16 GB) | - | - | 6758.90 | 5064.99 |
| RTX 4000 Ada (20 GB) | - | - | 2951.87 | 2310.53 |
| 3090 (24 GB) | - | - | 4239.64 | 3865.39 |
| 4090 (24 GB) | - | - | 9056.26 | 6898.71 |
| RTX 5000 Ada (32 GB) | - | - | 5835.41 | 4467.46 |
| 2x 3090 (24 GB) | - | 393.89 | 4690.50 | 4004.14 |
| 2x 4090 (24 GB) | - | 905.38 | 11094.51 | 8545.00 |
| RTX A6000 (48 GB) | - | 466.82 | 4315.18 | 3621.81 |
| RTX 6000 Ada (48 GB) | - | 547.03 | 6205.44 | 5560.94 |
| A40 (48 GB) | - | 239.92 | 4043.05 | 3240.95 |
| L40S (48 GB) | - | 649.08 | 2491.65 | 5908.52 |
| 4x RTX 4000 Ada (20 GB) | - | 306.44 | 4366.64 | 3369.24 |
| A100 PCIe (80 GB) | - | 726.65 | 7504.24 | 5800.48 |
| A100 SXM (80 GB) | - | - | - | - |

NVIDIA Generation Benchmarks

Token generation (decode) speed in tokens per second; a dash means no result was recorded for that combination.

| Device | Llama 3 70B F16 | Llama 3 70B Q4_K_M | Llama 3 8B F16 | Llama 3 8B Q4_K_M |
|---|---|---|---|---|
| 3070 (8 GB) | - | - | - | 70.94 |
| 3080 (10 GB) | - | - | - | 106.40 |
| 3080 Ti (12 GB) | - | - | - | 106.71 |
| 4070 Ti (12 GB) | - | - | - | 82.21 |
| 4080 (16 GB) | - | - | 40.29 | 106.22 |
| RTX 4000 Ada (20 GB) | - | - | 20.85 | 58.59 |
| 3090 (24 GB) | - | - | 46.51 | 111.74 |
| 4090 (24 GB) | - | - | 54.34 | 127.74 |
| RTX 5000 Ada (32 GB) | - | - | 32.67 | 89.87 |
| 2x 3090 (24 GB) | - | 16.29 | 47.15 | 108.07 |
| 2x 4090 (24 GB) | - | 19.06 | 53.27 | 122.56 |
| RTX A6000 (48 GB) | - | 14.58 | 40.25 | 102.22 |
| RTX 6000 Ada (48 GB) | - | 18.36 | 51.97 | 130.99 |
| A40 (48 GB) | - | 12.08 | 33.95 | 88.95 |
| L40S (48 GB) | - | 15.31 | 43.42 | 113.60 |
| 4x RTX 4000 Ada (20 GB) | - | 7.33 | 20.58 | 56.14 |
| A100 PCIe (80 GB) | - | 22.11 | 54.56 | 138.31 |
| A100 SXM (80 GB) | - | 24.33 | 53.18 | 133.38 |
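
To turn any generation figure into a wall-clock estimate, divide the desired output length by the speed. A quick worked example using two rows from the tables above:

```typescript
// Time to generate a response of a given length at a benchmarked speed.
function secondsForResponse(outputTokens: number, tokensPerSecond: number): number {
  return outputTokens / tokensPerSecond;
}

// A 512-token answer from Llama 3 8B Q4_K_M:
secondsForResponse(512, 127.74); // 4090 (24 GB):         ≈ 4.0 s
secondsForResponse(512, 34.49);  // M1 Max (32-core GPU): ≈ 14.8 s
```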

Understanding the Simulation

This is a simulation of token generation, designed to demonstrate the concept of tokens per second in language models. Here's what's happening:

  • Token Generation: We're simulating the generation of tokens, where each token is approximately 4 characters of text.
  • Expected Time: Calculated as (Total Tokens) / (Tokens per Second). This is the ideal time if tokens were emitted at exactly the configured rate, with no overhead (see the worked example after this list).
  • Elapsed Time: Measured using the browser's performance API. This includes the time taken to generate tokens and update the display.
  • Differences in Times: The elapsed time is typically longer than the expected time due to factors like:
    • JavaScript execution speed
    • DOM update performance
    • Your device's current load and capabilities
  • Not a Real Language Model: This simulation uses pre-defined text and doesn't perform actual language model computations. Real language models would have different performance characteristics.
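
A worked version of the time arithmetic above, with illustrative numbers:

```typescript
// Expected time is plain division: 2,048 tokens at 64 tokens/s.
const expectedSeconds = 2048 / 64; // 32.000 s

// At roughly 4 characters per token, that run prints about:
const approxCharacters = 2048 * 4; // ~8,192 characters of text

// Elapsed time wraps the same run in performance.now() timestamps,
// so it also counts JavaScript execution and DOM updates:
const start = performance.now();
// ... generate and render the 2,048 tokens ...
const elapsedSeconds = (performance.now() - start) / 1000;
```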

This simulator is meant for educational purposes to help visualize token generation speeds. It's not indicative of real-world language model performance, which involves complex computations and can vary greatly based on model size and hardware.