Let’s break down the FLOPs, throughput, and other variables to get a solid estimate for this massive training run.
If you’re into large language models, you know that scale is a huge factor. Bigger models trained on more data generally perform better. This inevitably leads to a practical question: how much time and compute does it actually take? Let’s get specific and run the numbers for a common but hefty scenario: training a 70B parameter model on 15 trillion tokens with a 1024 H100 GPU cluster.
The Compute Stack: 1024 H100s
The workhorse for this job is the NVIDIA H100 GPU. Its architecture is purpose-built for the tensor math that makes transformer models tick. When you hook up 1024 of them with high-speed interconnects like NVLink and InfiniBand, you have a serious number-crunching machine. But throwing more GPUs at the problem doesn’t give you linear speed-ups. The final performance depends heavily on the interplay between the hardware, the software stack, and network bandwidth.
Crunching the Numbers: From Tokens per Second to Total Training Time
Instead of relying on purely theoretical FLOPs, let’s use some real-world data. NVIDIA provides benchmarks for training similar models on this exact kind of setup. They use optimized software like MaxText, which gives us a solid baseline for throughput, measured in tokens per second.
A key variable here is numerical precision. Modern training often uses BFloat16 (BF16) or, for even more speed, 8-bit floating point (FP8). FP8 trades some numerical precision for higher throughput.
Based on NVIDIA’s benchmarks for a 70B model on 1024 H100s, here’s the kind of throughput we can expect:
- Using FP8 precision: ~1,487,000 to 1,657,000 tokens/sec.
- Using BF16 precision: ~1,124,000 to 1,184,000 tokens/sec.
Now, let’s do the math for our 15 trillion (15 × 10^12) token dataset.
For FP8 training:
- Total seconds = Total Tokens / Tokens per Second
- Total seconds = 15 × 10^12 tokens / 1,572,000 tokens per second (the average of the range) ≈ 9.54 × 10^6 seconds
- Estimated Training Time (FP8) ≈ 110.4 days
For BF16 training:
- Total seconds = Total Tokens / Tokens per Second
- Total seconds = 15 × 10^12 tokens / 1,154,000 tokens per second (the average of the range) ≈ 1.30 × 10^7 seconds
- Estimated Training Time (BF16) ≈ 150.4 days
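To make this arithmetic easy to rerun (or to swap in your own cluster’s throughput), here’s a minimal Python sketch of the conversion. The token count and throughput ranges are the benchmark figures quoted above; the rest is plain unit conversion.

```python
SECONDS_PER_DAY = 86_400
TOTAL_TOKENS = 15e12  # 15 trillion training tokens

# Benchmarked throughput ranges for a 70B model on 1024 H100s (tokens/sec)
THROUGHPUT_RANGES = {
    "FP8": (1_487_000, 1_657_000),
    "BF16": (1_124_000, 1_184_000),
}

for precision, (low, high) in THROUGHPUT_RANGES.items():
    avg_tokens_per_sec = (low + high) / 2
    training_days = TOTAL_TOKENS / avg_tokens_per_sec / SECONDS_PER_DAY
    print(f"{precision}: {training_days:.1f} days")

# FP8: 110.4 days
# BF16: 150.4 days
```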
To cross-reference this, we can use NVIDIA’s “time to train on 1T tokens” metric for a Llama 3.1 70B model. They report ~7.78 days for FP8 and ~10.29 days for BF16. Let’s multiply that by 15.
- FP8: 7.78 days × 15 ≈ 116.7 days
- BF16: 10.29 days × 15 ≈ 154.4 days
The results land within a few percent of our throughput-based estimates, giving us confidence in the numbers.
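For completeness, here’s that cross-check as a short sketch, scaling the per-1T-token figures to 15T and comparing them against the throughput-based estimates:

```python
# Scale NVIDIA's "time to train on 1T tokens" figures up to 15T tokens,
# then compare against the throughput-based estimates computed earlier.
DAYS_PER_1T_TOKENS = {"FP8": 7.78, "BF16": 10.29}
THROUGHPUT_ESTIMATE_DAYS = {"FP8": 110.4, "BF16": 150.4}

for precision, days_per_1t in DAYS_PER_1T_TOKENS.items():
    scaled_days = days_per_1t * 15
    delta = (scaled_days - THROUGHPUT_ESTIMATE_DAYS[precision]) / THROUGHPUT_ESTIMATE_DAYS[precision]
    print(f"{precision}: {scaled_days:.1f} days ({delta:+.1%} vs. throughput estimate)")

# Roughly: FP8 ≈ 116.7 days (about +6%), BF16 ≈ 154.4 days (about +3%)
```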
An Alternate View: The FLOPs-Based Estimate
Another way to approach this problem is from the bottom up: calculate the total number of floating-point operations (FLOPs) required and divide that by the effective compute rate of our GPU cluster. This gives us a great sanity check for our previous estimate.
Let’s use a standard formula from the field:
- Total FLOPs Needed: For a transformer, the generally accepted rule of thumb is Total FLOPs ≈ 6 × (number of parameters) × (number of training tokens). The factor of 6 comes from needing roughly 6 FLOPs per parameter to process one token, accounting for both the forward and backward passes during training. For our run, that’s 6 × 70 × 10^9 × 15 × 10^12 = 6.3 × 10^24 FLOPs.
That’s 6,300 zettaFLOPs – a truly astronomical number.
- Effective Cluster Speed: This is where it gets interesting. The headline FP16/BF16 figure for an H100 is about 2,000 TFLOP/s, but that number assumes structured sparsity; the dense peak is roughly 1,000 TFLOP/s per GPU, and even that is never sustained in a real workload. Model FLOPS Utilization (MFU) is the metric that captures the gap; it represents the fraction of peak compute the GPUs actually deliver once data loading, communication, and other overheads are accounted for. A 50% MFU is a solid target for a large, well-optimized system, which works out to roughly 500 TFLOP/s of useful compute per GPU.
So, the effective daily compute of our cluster is:
- Effective FLOPs/day = 1 × 10^15 FLOPs/sec/GPU (dense peak) × 1024 GPUs × 0.5 MFU × 86,400 sec/day
- Effective FLOPs/day ≈ 4.4 × 10^22 FLOPs/day
- Time to Train: Now we just divide:
Days = (Total FLOPs) / (Effective FLOPs/day) = 6.3 × 10^24 / 4.4 × 10^22 ≈ 143 days
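The whole bottom-up estimate fits in a few lines of Python. The per-GPU peak and the MFU are the assumptions stated above, so swap in your own values if your cluster behaves differently:

```python
SECONDS_PER_DAY = 86_400

params = 70e9    # 70B parameters
tokens = 15e12   # 15T training tokens
total_flops = 6 * params * tokens   # ~6 FLOPs per parameter per token -> 6.3e24

n_gpus = 1024
dense_peak_per_gpu = 1e15   # ~1,000 TFLOP/s dense BF16 per H100
mfu = 0.5                   # assumed Model FLOPS Utilization

effective_flops_per_day = dense_peak_per_gpu * n_gpus * mfu * SECONDS_PER_DAY
print(f"{total_flops / effective_flops_per_day:.0f} days")
# ~142 days (the ~143 above comes from rounding the daily compute to 4.4e22)
```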
The result of ~143 days lands within about 5% of our throughput-based estimate for BF16 training (~150 days). This consistency between two different estimation methods gives us high confidence that, barring major setbacks, training a 70B model on 15T tokens is roughly a 4.5 to 5-month project on a 1024 H100 cluster.
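As one further sanity check (not from the benchmark tables, just the same 6-FLOPs-per-parameter-per-token approximation run in reverse), we can ask what MFU the measured BF16 throughput implies, assuming a dense BF16 peak of roughly 989 TFLOP/s per H100:

```python
params = 70e9
n_gpus = 1024
dense_bf16_peak = 989e12          # H100 SXM dense BF16 peak, FLOP/s per GPU

bf16_tokens_per_sec = 1_154_000   # midpoint of the BF16 benchmark range
model_flops_per_sec = 6 * params * bf16_tokens_per_sec
implied_mfu = model_flops_per_sec / (n_gpus * dense_bf16_peak)
print(f"Implied MFU: {implied_mfu:.0%}")   # ~48%
```

Coming out at roughly 48% suggests the 50% MFU used in the bottom-up estimate is consistent with the measured throughput rather than an independent guess.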
Caveats and Other Variables
It’s important to remember these are clean, on-paper estimates. The actual time can shift based on several factors:
- Software Stack and Optimization: The efficiency of your training framework (e.g., PyTorch, JAX) and your parallelism strategy (Data, Tensor, Pipeline) matter immensely. Optimized libraries like cuDNN are non-negotiable for performance.
- Hardware Stability: Running a 1024-GPU cluster for months isn’t trivial. Hardware failures, node restarts, and scheduled maintenance will inevitably add to the total time.
- Model Architecture: The specific architectural details of your 70B model – number of layers, attention heads, context length – can alter the FLOPs required per token, thus affecting training speed.
- Dataset Characteristics: The average sequence length of the documents in your 15T token dataset can also influence the overall throughput.
The Bottom Line: What This Means for AI Development
So, we land at an estimate of roughly four to five months. This timeline shows just how resource-intensive it is to build a large-scale model from scratch. The cost, both in time and hardware, explains why only a handful of major tech companies and well-funded research labs can undertake these projects.
To sum up, the math gives us a solid ballpark figure, but the actual time-to-train for a 70B model is a function of the entire tech stack. As hardware gets faster and software toolchains become more efficient, we’ll see these numbers continue to drop, making large-scale model training more accessible over time.
Thank you!