
How Long to Train a 70B LLM on 15T Tokens with 1024 H100s


Let’s break down the FLOPs, throughput, and other variables to get a solid estimate for this massive training run.

If you’re into large language models, you know that scale is a huge factor. Bigger models trained on more data generally perform better. This inevitably leads to a practical question: how much time and compute does it actually take? Let’s get specific and run the numbers for a common but hefty scenario: training a 70B parameter model on 15 trillion tokens with a 1024 H100 GPU cluster.

The Compute Stack: 1024 H100s

The workhorse for this job is the NVIDIA H100 GPU. Its architecture is purpose-built for the tensor math that makes transformer models tick. When you hook up 1024 of them with high-speed interconnects like NVLink and InfiniBand, you have a serious number-crunching machine. But throwing more GPUs at the problem doesn’t give you linear speed-ups. The final performance depends heavily on the interplay between the hardware, the software stack, and network bandwidth.

Crunching the Numbers: From Tokens per Second to Total Training Time

Instead of relying on purely theoretical FLOPs, let's use some real-world data. NVIDIA publishes benchmarks for training similar models on exactly this kind of setup, using optimized software such as MaxText, which gives us a solid baseline for throughput, measured in tokens per second.

A key variable here is numerical precision. Modern training often uses BFloat16 (BF16) or, for even more speed, 8-bit Floating-Point (FP8). The trade-off is precision for speed, with FP8 offering higher throughput.

Based on NVIDIA’s benchmarks for a 70B model on 1024 H100s, here’s the kind of throughput we can expect:

  • Using FP8 precision: ~1,487,000 to 1,657,000 tokens/sec.
  • Using BF16 precision: ~1,124,000 to 1,184,000 tokens/sec.

Now, let's do the math for our 15 trillion (15 × 10^12) token dataset; a short script reproducing the arithmetic follows the two calculations below.

For FP8 training:

  • Total seconds = Total Tokens / Tokens per Second
  • Total seconds = 15,000,000,000,000 / 1,572,000 (using the average of the range) ≈ 9,541,984 seconds
  • Estimated Training Time (FP8) ≈ 110.4 days

For BF16 training:

  • Total seconds = Total Tokens / Tokens per Second
  • Total seconds = 15,000,000,000,000 / 1,154,000 (using the average of the range) ≈ 12,998,266 seconds
  • Estimated Training Time (BF16) ≈ 150.4 days
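Here it is as a minimal Python sketch. The tokens-per-second values are simply the midpoints of the benchmark ranges quoted above, not measurements from any particular cluster, so swap in whatever your own setup actually sustains.

```python
# Throughput-based estimate: total tokens / sustained tokens-per-second.
# Throughput values are midpoints of the published benchmark ranges, not measurements.

SECONDS_PER_DAY = 86_400
TOTAL_TOKENS = 15e12  # 15 trillion training tokens

def training_days(total_tokens: float, tokens_per_sec: float) -> float:
    """Wall-clock days at a constant sustained throughput."""
    return total_tokens / tokens_per_sec / SECONDS_PER_DAY

for precision, tps in {"FP8": 1_572_000, "BF16": 1_154_000}.items():
    print(f"{precision}: {training_days(TOTAL_TOKENS, tps):.1f} days")

# FP8: 110.4 days
# BF16: 150.4 days
```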

To cross-reference this, we can use NVIDIA’s “time to train on 1T tokens” metric for a Llama 3.1 70B model. They report ~7.78 days for FP8 and ~10.29 days for BF16. Let’s multiply that by 15.

  • FP8: 7.78 days per trillion tokens × 15 ≈ 116.7 days
  • BF16: 10.29 days per trillion tokens × 15 ≈ 154.35 days

The numbers line up nicely, giving us confidence in our estimate.
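The same cross-check in code, assuming the per-trillion-token figures scale linearly with dataset size:

```python
# Cross-check: scale the reported "days per 1T tokens" figures linearly to 15T tokens.
DAYS_PER_TRILLION_TOKENS = {"FP8": 7.78, "BF16": 10.29}

for precision, days_per_t in DAYS_PER_TRILLION_TOKENS.items():
    print(f"{precision}: {days_per_t * 15:.2f} days for 15T tokens")

# FP8: 116.70 days for 15T tokens
# BF16: 154.35 days for 15T tokens
```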


An Alternate View: The FLOPs-Based Estimate

Another way to approach this problem is from the bottom up: calculate the total number of floating-point operations (FLOPs) required and divide that by the effective compute rate of our GPU cluster. This gives us a great sanity check for our previous estimate.

Let’s use a standard formula from the field:

  1. Total FLOPs Needed: For a transformer, the generally accepted rule of thumb is 6 × Parameters × Tokens, i.e., roughly 6 FLOPs per parameter to process one token, which accounts for both the forward and backward passes during training. For our run:
6 × (70 × 10^9) × (15 × 10^12) = 6.3 × 10^24 FLOPs

That’s 6,300 zettaFLOPs – a truly astronomical number.

  2. Effective Cluster Speed: This is where it gets interesting. A single H100 has a peak dense BF16 tensor performance of roughly 1,000 TFLOP/s (989 TFLOP/s on the spec sheet; the often-quoted ~2,000 TFLOP/s figure assumes structured sparsity, which training doesn't use). You never achieve that peak in a real workload. Model FLOPS Utilization (MFU) is the metric that captures this; it represents the fraction of peak compute the GPUs actually deliver versus time spent waiting for data or communicating. A 50% MFU is a solid target for a large, well-optimized system, which works out to roughly 500 TFLOP/s of sustained compute per GPU.

So, the effective daily compute of our cluster is:

  • Effective FLOPs/day = (1,000 × 10^12 FLOPs/sec per GPU) × 1024 GPUs × 0.50 MFU × 86,400 sec/day
  • Effective FLOPs/day ≈ 4.42 × 10^22 FLOPs/day
  3. Time to Train: Now we just divide:

Days = (Total FLOPs) / (Effective FLOPs/day) = (6.3 × 10^24) / (4.42 × 10^22) ≈ 142.5
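Pulling the three steps together, here's a short script with the peak rate, GPU count, and MFU exposed as parameters, so the assumptions above (a rounded ~1,000 dense TFLOP/s per H100 and 50% MFU) stay explicit and easy to change.

```python
# Bottom-up estimate: total FLOPs (6 * params * tokens) over effective cluster FLOPs/day.

SECONDS_PER_DAY = 86_400

def flops_based_days(
    params: float = 70e9,              # model parameters
    tokens: float = 15e12,             # training tokens
    peak_flops_per_gpu: float = 1e15,  # ~1,000 TFLOP/s dense BF16 per H100 (rounded)
    num_gpus: int = 1024,
    mfu: float = 0.50,                 # Model FLOPS Utilization (assumed)
) -> float:
    total_flops = 6 * params * tokens  # ≈ 6.3e24 FLOPs for 70B params, 15T tokens
    effective_flops_per_day = peak_flops_per_gpu * num_gpus * mfu * SECONDS_PER_DAY
    return total_flops / effective_flops_per_day

print(f"{flops_based_days():.1f} days")  # ≈ 142 days, matching the estimate above
```

If your cluster sustains a different efficiency, just pass it in: flops_based_days(mfu=0.40), for example, pushes the estimate to about 178 days.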

The result of ~143 days is remarkably close to our throughput-based estimate for BF16 training. This consistency between two different estimation methods gives us high confidence that, barring major setbacks, training a 70B model on 15T tokens is roughly a 4.5- to 5-month project on a 1024-H100 cluster.

Caveats and Other Variables

It’s important to remember these are clean, on-paper estimates. The actual time can shift based on several factors:

  • Software Stack and Optimization: The efficiency of your training framework (e.g., PyTorch, JAX) and your parallelism strategy (Data, Tensor, Pipeline) matter immensely. Optimized libraries like cuDNN are non-negotiable for performance.
  • Hardware Stability: Running a 1024-GPU cluster for months isn’t trivial. Hardware failures, node restarts, and scheduled maintenance will inevitably add to the total time.
  • Model Architecture: The specific architectural details of your 70B model – number of layers, attention heads, context length – can alter the FLOPs required per token, thus affecting training speed.
  • Dataset Characteristics: The average sequence length of the documents in your 15T token dataset can also influence the overall throughput.
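None of these factors appear in the clean math above, but the biggest ones can be folded in as simple multipliers. Here's a rough, purely illustrative sketch; the uptime and restart-overhead figures are placeholder assumptions, not measurements.

```python
# Stretch the on-paper estimate by real-world overheads (illustrative numbers only).

def adjusted_days(
    ideal_days: float,
    cluster_uptime: float = 0.95,    # fraction of wall-clock time spent training (assumed)
    restart_overhead: float = 0.03,  # extra time lost to failures and checkpoint reloads (assumed)
) -> float:
    return ideal_days / cluster_uptime * (1 + restart_overhead)

print(f"{adjusted_days(150.4):.0f} days")  # the 150-day BF16 estimate stretches to ~163 days
```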

The Bottom Line: What This Means for AI Development

So, we land at an estimate of roughly four to five months. This timeline shows just how resource-intensive it is to build a large-scale model from scratch. The cost, both in time and hardware, explains why only a handful of major tech companies and well-funded research labs can undertake these projects.

To sum up, the math gives us a solid ballpark figure, but the actual time-to-train for a 70B model is a function of the entire tech stack. As hardware gets faster and software toolchains become more efficient, we’ll see these numbers continue to drop, making large-scale model training more accessible over time.


Thank you!
