How Long to Train a 70B LLM on 15T Tokens with 1024 H100s



Let’s break down the FLOPs, throughput, and other variables to get a solid estimate for this massive training run.

If you’re into large language models, you know that scale is a huge factor. Bigger models trained on more data generally perform better. This inevitably leads to a practical question: how much time and compute does it actually take? Let’s get specific and run the numbers for a common but hefty scenario: training a 70B parameter model on 15 trillion tokens with a 1024 H100 GPU cluster.

The Compute Stack: 1024 H100s

The workhorse for this job is the NVIDIA H100 GPU. Its architecture is purpose-built for the tensor math that makes transformer models tick. When you hook up 1024 of them with high-speed interconnects like NVLink and InfiniBand, you have a serious number-crunching machine. But throwing more GPUs at the problem doesn’t give you linear speed-ups. The final performance depends heavily on the interplay between the hardware, the software stack, and network bandwidth.

Crunching the Numbers: From Tokens per Second to Total Training Time

Instead of relying on purely theoretical FLOPs, let's use some real-world data. NVIDIA publishes benchmarks for training similar models on exactly this kind of setup, using optimized software such as MaxText. That gives us a solid baseline for throughput, measured in tokens per second.

A key variable here is numerical precision. Modern training often uses BFloat16 (BF16) or, for even more speed, 8-bit floating point (FP8); FP8 trades some numerical precision for higher throughput.

Based on NVIDIA’s benchmarks for a 70B model on 1024 H100s, here’s the kind of throughput we can expect:

  • Using FP8 precision: ~1,487,000 to 1,657,000 tokens/sec.
  • Using BF16 precision: ~1,124,000 to 1,184,000 tokens/sec.

Now, let’s do the math for our 15 trillion (15 × 10^12) token dataset.

For FP8 training:

  • Total seconds = Total Tokens / Tokens per Second
  • Total seconds = 15,000,000,000,000 / 1,572,000 (using the midpoint of the range) ≈ 9,541,985 seconds
  • Estimated Training Time (FP8) ≈ 110.4 days

For BF16 training:

  • Total seconds = Total Tokens / Tokens per Second
  • Total seconds = 15,000,000,000,000 / 1,154,000 (using the midpoint of the range) ≈ 12,998,267 seconds
  • Estimated Training Time (BF16) ≈ 150.4 days
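
If you'd rather let a script do this arithmetic, here is a minimal Python sketch of the same calculation. The throughput figures are the midpoints of the benchmark ranges quoted above, used as assumptions rather than guarantees for any particular cluster:

```python
# Back-of-the-envelope training-time estimate from benchmarked throughput.
# Throughput values are midpoints of the published ranges quoted above (assumptions).

TOTAL_TOKENS = 15e12            # 15 trillion tokens
SECONDS_PER_DAY = 86_400

throughput_tokens_per_sec = {
    "FP8": 1_572_000,           # midpoint of ~1.487M - 1.657M tokens/sec
    "BF16": 1_154_000,          # midpoint of ~1.124M - 1.184M tokens/sec
}

for precision, tps in throughput_tokens_per_sec.items():
    seconds = TOTAL_TOKENS / tps
    days = seconds / SECONDS_PER_DAY
    print(f"{precision}: {seconds:,.0f} seconds ≈ {days:.1f} days")

# Expected output (approximately):
# FP8: 9,541,985 seconds ≈ 110.4 days
# BF16: 12,998,267 seconds ≈ 150.4 days
```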

To cross-reference this, we can use NVIDIA’s “time to train on 1T tokens” metric for a Llama 3.1 70B model. They report ~7.78 days for FP8 and ~10.29 days for BF16. Let’s multiply that by 15.

  • FP8: 7.78 days per trillion tokens × 15 ≈ 116.7 days
  • BF16: 10.29 days per trillion tokens × 15 ≈ 154.35 days

The numbers line up nicely, giving us confidence in our estimate.


An Alternate View: The FLOPs-Based Estimate

Another way to approach this problem is from the bottom up: calculate the total number of floating-point operations (FLOPs) required and divide that by the effective compute rate of our GPU cluster. This gives us a great sanity check for our previous estimate.

Let’s use a standard formula from the field:

  1. Total FLOPs Needed: For a transformer, the generally accepted rule of thumb is 6 × Parameters × Tokens FLOPs. In other words, processing one token takes roughly 6 FLOPs per parameter, which accounts for both the forward and backward passes during training.
6 × (70 × 10^9) × (15 × 10^12) = 6.3 × 10^24 FLOPs

That’s 6,300 zettaFLOPs – a truly astronomical number.

  2. Effective Cluster Speed: This is where it gets interesting. A single H100 has a headline BF16 Tensor Core rating of about 2,000 TFLOP/s, but that figure assumes structured sparsity; the dense peak, which is what matters for training, is roughly 1,000 TFLOP/s per GPU. On top of that, we have to account for overall efficiency. Model FLOPs Utilization (MFU) is the metric for this; it represents the fraction of peak compute the GPUs actually deliver versus time spent waiting for data or communicating. A 50% MFU is a solid target for a large, well-optimized system.

So, the effective daily compute of our cluster is:

  • Effective FLOPs/day = (1,000 × 10^12 FLOPs/sec per GPU) × 1024 GPUs × 0.50 MFU × 86,400 sec/day
  • Effective FLOPs/day ≈ 4.42 × 10^22 FLOPs/day
  3. Time to Train: Now we just divide:

Days = (Total FLOPs) / (Effective FLOPs/day) = (6.3 × 10^24) / (4.42 × 10^22) ≈ 142.5
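
Here is the same bottom-up math as a small Python sketch; the peak per-GPU throughput and the MFU are the assumptions discussed above, not measured values:

```python
# FLOPs-based estimate: total training compute divided by effective cluster speed.
# Peak per-GPU TFLOP/s and MFU are assumptions taken from the discussion above.

PARAMS = 70e9                   # 70B parameters
TOKENS = 15e12                  # 15T tokens
FLOPS_PER_PARAM_TOKEN = 6       # ~6 FLOPs per parameter per token (forward + backward)

NUM_GPUS = 1024
PEAK_FLOPS_PER_GPU = 1_000e12   # ~1,000 TFLOP/s dense BF16 per H100 (assumed)
MFU = 0.50                      # Model FLOPs Utilization (assumed)
SECONDS_PER_DAY = 86_400

total_flops = FLOPS_PER_PARAM_TOKEN * PARAMS * TOKENS                 # ≈ 6.3e24
effective_flops_per_day = PEAK_FLOPS_PER_GPU * NUM_GPUS * MFU * SECONDS_PER_DAY  # ≈ 4.42e22

days = total_flops / effective_flops_per_day
print(f"Total FLOPs: {total_flops:.2e}")
print(f"Effective FLOPs/day: {effective_flops_per_day:.2e}")
print(f"Estimated training time: {days:.1f} days")  # ≈ 142-143 days
```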

The result of ~143 days is remarkably close to our throughput-based estimate for BF16 training. This consistency between two different estimation methods gives us high confidence that, barring major setbacks, training a 70B model on 15T tokens is roughly a 4.5- to 5-month project on a 1024-GPU H100 cluster.

Caveats and Other Variables

It’s important to remember these are clean, on-paper estimates. The actual time can shift based on several factors:

  • Software Stack and Optimization: The efficiency of your training framework (e.g., PyTorch, JAX) and your parallelism strategy (Data, Tensor, Pipeline) matter immensely. Optimized libraries like cuDNN are non-negotiable for performance.
  • Hardware Stability: Running a 1024-GPU cluster for months isn’t trivial. Hardware failures, node restarts, and scheduled maintenance will inevitably add to the total time.
  • Model Architecture: The specific architectural details of your 70B model – number of layers, attention heads, context length – can alter the FLOPs required per token, thus affecting training speed.
  • Dataset Characteristics: The average sequence length of the documents in your 15T token dataset can also influence the overall throughput.
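
To get a feel for how much these factors move the needle, here is an illustrative sensitivity sketch that reuses the FLOPs-based formula and varies only the MFU; the values are assumptions for illustration, not measurements:

```python
# Sensitivity of the FLOPs-based estimate to Model FLOPs Utilization (MFU).
# Illustrative only: peak per-GPU throughput and the MFU values are assumptions.

TOTAL_FLOPS = 6 * 70e9 * 15e12      # ≈ 6.3e24 FLOPs for the full run
CLUSTER_PEAK = 1_000e12 * 1024      # ~1,000 TFLOP/s dense per H100 × 1024 GPUs
SECONDS_PER_DAY = 86_400

for mfu in (0.35, 0.40, 0.45, 0.50, 0.55):
    days = TOTAL_FLOPS / (CLUSTER_PEAK * mfu * SECONDS_PER_DAY)
    print(f"MFU {mfu:.0%}: ~{days:.0f} days")

# A drop from 50% to 40% MFU stretches the run from ~142 to ~178 days,
# before accounting for any downtime from hardware failures or restarts.
```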

The Bottom Line: What This Means for AI Development

So, we land at an estimate of roughly four to five months. This timeline shows just how resource-intensive it is to build a large-scale model from scratch. The cost, both in time and hardware, explains why only a handful of major tech companies and well-funded research labs can undertake these projects.

To sum up, the math gives us a solid ballpark figure, but the actual time-to-train for a 70B model is a function of the entire tech stack. As hardware gets faster and software toolchains become more efficient, we’ll see these numbers continue to drop, making large-scale model training more accessible over time.


Thank you!
