
How Long to Train a 70B LLM on 15T Tokens with 1024 H100s


Let’s break down the FLOPs, throughput, and other variables to get a solid estimate for this massive training run.

If you’re into large language models, you know that scale is a huge factor. Bigger models trained on more data generally perform better. This inevitably leads to a practical question: how much time and compute does it actually take? Let’s get specific and run the numbers for a common but hefty scenario: training a 70B parameter model on 15 trillion tokens with a 1024 H100 GPU cluster.

The Compute Stack: 1024 H100s

The workhorse for this job is the NVIDIA H100 GPU. Its architecture is purpose-built for the tensor math that makes transformer models tick. When you hook up 1024 of them with high-speed interconnects like NVLink and InfiniBand, you have a serious number-crunching machine. But throwing more GPUs at the problem doesn’t give you linear speed-ups. The final performance depends heavily on the interplay between the hardware, the software stack, and network bandwidth.

Crunching the Numbers: From Tokens per Second to Total Training Time

Instead of relying on purely theoretical FLOPs, let's use some real-world data. NVIDIA publishes benchmarks for training similar models on exactly this kind of setup, using optimized software such as MaxText, which gives us a solid baseline for throughput, measured in tokens per second.

A key variable here is numerical precision. Modern training often uses BFloat16 (BF16) or, for even more speed, 8-bit Floating-Point (FP8). The trade-off is precision for speed, with FP8 offering higher throughput.

Based on NVIDIA’s benchmarks for a 70B model on 1024 H100s, here’s the kind of throughput we can expect:

  • Using FP8 precision: ~1,487,000 to 1,657,000 tokens/sec.
  • Using BF16 precision: ~1,124,000 to 1,184,000 tokens/sec.

Now, let's do the math for our 15 trillion (15 × 10^12) token dataset; a short script reproducing the arithmetic follows the two calculations below.

For FP8 training:

  • Total seconds = Total Tokens / Tokens per Second
  • Total seconds = 15,000,000,000,000 / 1,572,000 (using the average of the range) ≈ 9,541,984 seconds
  • Estimated Training Time (FP8) ≈ 110.4 days

For BF16 training:

  • Total seconds = Total Tokens / Tokens per Second
  • Total seconds = 15,000,000,000,000 / 1,154,000 (using the average of the range) ≈ 12,998,266 seconds
  • Estimated Training Time (BF16) ≈ 150.4 days
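Here it is as a minimal Python sketch. The tokens-per-second values are simply the midpoints of the benchmark ranges quoted above, not measurements from any particular cluster, so swap in whatever your own setup actually sustains.

```python
# Throughput-based estimate: total tokens / sustained tokens-per-second.
# Throughput values are midpoints of the published benchmark ranges, not measurements.

SECONDS_PER_DAY = 86_400
TOTAL_TOKENS = 15e12  # 15 trillion training tokens

def training_days(total_tokens: float, tokens_per_sec: float) -> float:
    """Wall-clock days at a constant sustained throughput."""
    return total_tokens / tokens_per_sec / SECONDS_PER_DAY

for precision, tps in {"FP8": 1_572_000, "BF16": 1_154_000}.items():
    print(f"{precision}: {training_days(TOTAL_TOKENS, tps):.1f} days")

# FP8: 110.4 days
# BF16: 150.4 days
```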

To cross-reference this, we can use NVIDIA’s “time to train on 1T tokens” metric for a Llama 3.1 70B model. They report ~7.78 days for FP8 and ~10.29 days for BF16. Let’s multiply that by 15.

  • FP8: 7.78 days per trillion tokens × 15 ≈ 116.7 days
  • BF16: 10.29 days per trillion tokens × 15 ≈ 154.35 days

The numbers line up nicely, giving us confidence in our estimate.
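The same cross-check in code, assuming the per-trillion-token figures scale linearly with dataset size:

```python
# Cross-check: scale the reported "days per 1T tokens" figures linearly to 15T tokens.
DAYS_PER_TRILLION_TOKENS = {"FP8": 7.78, "BF16": 10.29}

for precision, days_per_t in DAYS_PER_TRILLION_TOKENS.items():
    print(f"{precision}: {days_per_t * 15:.2f} days for 15T tokens")

# FP8: 116.70 days for 15T tokens
# BF16: 154.35 days for 15T tokens
```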


An Alternate View: The FLOPs-Based Estimate

Another way to approach this problem is from the bottom up: calculate the total number of floating-point operations (FLOPs) required and divide that by the effective compute rate of our GPU cluster. This gives us a great sanity check for our previous estimate.

Let’s use a standard formula from the field:

  1. Total FLOPs Needed: For a transformer, the generally accepted rule of thumb is 6 × Parameters × Tokens, i.e., roughly 6 FLOPs per parameter to process one token, which accounts for both the forward and backward passes during training. For our run:
6 × (70 × 10^9) × (15 × 10^12) = 6.3 × 10^24 FLOPs

That’s 6,300 zettaFLOPs – a truly astronomical number.

  2. Effective Cluster Speed: This is where it gets interesting. A single H100 has a peak dense BF16 tensor performance of roughly 1,000 TFLOP/s (989 TFLOP/s on the spec sheet; the often-quoted ~2,000 TFLOP/s figure assumes structured sparsity, which training doesn't use). You never achieve that peak in a real workload. Model FLOPS Utilization (MFU) is the metric that captures this; it represents the fraction of peak compute the GPUs actually deliver versus time spent waiting for data or communicating. A 50% MFU is a solid target for a large, well-optimized system, which works out to roughly 500 TFLOP/s of sustained compute per GPU.

So, the effective daily compute of our cluster is:

  • Effective FLOPs/day = (1,000 × 10^12 FLOPs/sec per GPU) × 1024 GPUs × 0.50 MFU × 86,400 sec/day
  • Effective FLOPs/day ≈ 4.42 × 10^22 FLOPs/day
  3. Time to Train: Now we just divide:

Days = (Total FLOPs) / (Effective FLOPs/day) = (6.3 × 10^24) / (4.42 × 10^22) ≈ 142.5
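Pulling the three steps together, here's a short script with the peak rate, GPU count, and MFU exposed as parameters, so the assumptions above (a rounded ~1,000 dense TFLOP/s per H100 and 50% MFU) stay explicit and easy to change.

```python
# Bottom-up estimate: total FLOPs (6 * params * tokens) over effective cluster FLOPs/day.

SECONDS_PER_DAY = 86_400

def flops_based_days(
    params: float = 70e9,              # model parameters
    tokens: float = 15e12,             # training tokens
    peak_flops_per_gpu: float = 1e15,  # ~1,000 TFLOP/s dense BF16 per H100 (rounded)
    num_gpus: int = 1024,
    mfu: float = 0.50,                 # Model FLOPS Utilization (assumed)
) -> float:
    total_flops = 6 * params * tokens  # ≈ 6.3e24 FLOPs for 70B params, 15T tokens
    effective_flops_per_day = peak_flops_per_gpu * num_gpus * mfu * SECONDS_PER_DAY
    return total_flops / effective_flops_per_day

print(f"{flops_based_days():.1f} days")  # ≈ 142 days, matching the estimate above
```

If your cluster sustains a different efficiency, just pass it in: flops_based_days(mfu=0.40), for example, pushes the estimate to about 178 days.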

The result of ~143 days is remarkably close to our throughput-based estimate for BF16 training. This consistency between two different estimation methods gives us high confidence that, barring major setbacks, training a 70B model on 15T tokens is roughly a 4.5- to 5-month project on a 1024-H100 cluster.

Caveats and Other Variables

It’s important to remember these are clean, on-paper estimates. The actual time can shift based on several factors:

  • Software Stack and Optimization: The efficiency of your training framework (e.g., PyTorch, JAX) and your parallelism strategy (Data, Tensor, Pipeline) matter immensely. Optimized libraries like cuDNN are non-negotiable for performance.
  • Hardware Stability: Running a 1024-GPU cluster for months isn’t trivial. Hardware failures, node restarts, and scheduled maintenance will inevitably add to the total time.
  • Model Architecture: The specific architectural details of your 70B model – number of layers, attention heads, context length – can alter the FLOPs required per token, thus affecting training speed.
  • Dataset Characteristics: The average sequence length of the documents in your 15T token dataset can also influence the overall throughput.
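None of these factors appear in the clean math above, but the biggest ones can be folded in as simple multipliers. Here's a rough, purely illustrative sketch; the uptime and restart-overhead figures are placeholder assumptions, not measurements.

```python
# Stretch the on-paper estimate by real-world overheads (illustrative numbers only).

def adjusted_days(
    ideal_days: float,
    cluster_uptime: float = 0.95,    # fraction of wall-clock time spent training (assumed)
    restart_overhead: float = 0.03,  # extra time lost to failures and checkpoint reloads (assumed)
) -> float:
    return ideal_days / cluster_uptime * (1 + restart_overhead)

print(f"{adjusted_days(150.4):.0f} days")  # the 150-day BF16 estimate stretches to ~163 days
```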

The Bottom Line: What This Means for AI Development

So, we land at an estimate of roughly four to five months. This timeline shows just how resource-intensive it is to build a large-scale model from scratch. The cost, both in time and hardware, explains why only a handful of major tech companies and well-funded research labs can undertake these projects.

To sum up, the math gives us a solid ballpark figure, but the actual time-to-train for a 70B model is a function of the entire tech stack. As hardware gets faster and software toolchains become more efficient, we’ll see these numbers continue to drop, making large-scale model training more accessible over time.


Thank you!
