Learning (to reproduce Pythia 2.8b) pretraining

Posted on Feb 25, 2026

Some researchers found, through tracing the data provenance of open-source models, that pythia-2.8b-deduped may not have been trained on the same data as the other deduplicated Pythias. All the Pythias were trained on the Pile, which exists in roughly four formats: deduped or standard, and for each of those, preshuffled or not. If the model was trained on the wrong dataset, that dataset is most likely one of the other Pile formats.

I ran a series of experiments to test this theory, and the results suggest that pythia-2.8b-deduped was most likely trained on the standard Pile rather than the deduped Pile. Through these experiments I also found that gradient accumulation steps impact reproduction quality the most: getting these closer to the original training config resulted in increasingly close reproductions. Other factors such as per-device batch size, GPU type, dependency versions, and data shuffling method made a smaller difference.

Goal

To validate whether pythia-2.8b-deduped was trained on the wrong dataset, the plan was to first reproduce Pythia training using the original training environment and configs, then determine whether any Pile other than the deduped preshuffled Pile (on which it was supposed to be trained) could be used to reproduce the published checkpoints of that model.

As control and test, I’d also take models from the same family that we think were trained correctly, for example pythia-70m-deduped, train them in the same environment on all candidate datasets, and ideally see that training on the deduped Pile gets the closest to the published checkpoints.

Beyond validating the hypothesis, I also wanted to gain an intuition for how pretraining reproduction is affected by different factors. I wanted to understand how much different GPU types, PyTorch/CUDA versions, attention implementations, per-device batch size/GPU count/gradient accumulation steps at the same global batch size, data shuffling, and data distributions (i.e. standard vs deduped) changed the model’s training trajectory.

Setup

Starting point

To eliminate model initialization as a source of divergence, I wanted to resume training from the uploaded checkpoint 0 rather than reinitialize the model, so the first step is to pick the right checkpoint 0.

There are five generations of the 2.8b Pythia on HuggingFace. The standard (non-deduped) versions of the latest three all had some form of corruption, while the first two had no noticeable issues. In addition, the first two and the last three seem to come from two distinct training setups, where gen 1/2 are similar to each other and 3/4/5 are similar to each other.

| Gen | Repos | Notes |
| --- | --- | --- |
| 1 | pythia-2.7b / pythia-2.7b-deduped | No noticeable issues. |
| 2 | pythia-2.8b-v0 / pythia-2.8b-deduped-v0 | All models are identical to gen 1. |
| 3 | neox-ckpt-pythia-2.8b / neox-ckpt-pythia-2.8b-deduped | Both variants are corrupt: every checkpoint is a clone of gen 1/2’s final checkpoint. |
| 4 | neox-ckpt-pythia-2.8b-v1 / neox-ckpt-pythia-2.8b-deduped-v1 | Deduped is unproblematic. Standard is corrupt: every checkpoint is the same, and extremely similar but not identical to step 143k of the deduped model (>0.99 param cosine sim, vs ~0.45 param cosine sim between pythia-2.8b-v0 standard and deduped). |
| 5 | pythia-2.8b / pythia-2.8b-deduped | Deduped is unproblematic. Standard is corrupt in a slightly more complex way: model.safetensors and pytorch_model.bin have gen 4’s issue, while each step’s sharded safetensors checkpoint is a clone of the deduped model at that same step. |

Since we’re investigating the deduped model and the deduped variants of gen 4/5 are healthy, I resumed training from checkpoint 0 of neox-ckpt-pythia-2.8b-deduped-v1.

Metrics

To measure how similar a trained model checkpoint is to the HuggingFace checkpoint, the most obvious check is whether the weights are bit-wise identical, but since all of my reproductions ultimately had some deviation, I needed metrics that gave a more directional signal. I used these two metrics throughout the experiments:

  1. Parameter Cosine Similarity (θ Cos): flatten all parameter tensors from our reproduction and from the published checkpoint into single vectors, then compute cosine similarity. θ Cos = +1.0 means the parameter vectors point in exactly the same direction; +0.0 means orthogonal.

  2. L2 Distance: calculated over the same flattened parameter vectors, but with parameter magnitude differences retained.
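Concretely, both metrics can be computed like this (a minimal NumPy sketch; the function names are mine, and real checkpoints would be loaded from e.g. a PyTorch state dict rather than nested lists):

```python
import numpy as np

def flatten_params(state_dict):
    """Concatenate every parameter tensor into one flat float64 vector."""
    return np.concatenate([np.asarray(v, dtype=np.float64).ravel()
                           for v in state_dict.values()])

def theta_cos(a, b):
    """Parameter cosine similarity between two checkpoints (dicts of tensors)."""
    va, vb = flatten_params(a), flatten_params(b)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

def l2_distance(a, b):
    """L2 distance over the same flattened vectors; unlike theta_cos,
    this retains differences in parameter magnitude."""
    return float(np.linalg.norm(flatten_params(a) - flatten_params(b)))
```

Note that θ Cos is scale-invariant (a checkpoint and a uniformly rescaled copy score 1.0), which is exactly why the magnitude-sensitive L2 distance is useful alongside it.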

Hardware

I had access to 3 different hardware configs: 8xA40-48GB, 16xA40-48GB, and 8xA100-80GB. While none of these are the same as the original Pythia training setup (2.8B used 64x A100-40GB, 70m used 32x A100-40GB), they allowed me to test the effect of different gradient accumulation steps, per device batch size, and attention implementations.

Environment

The GPT-NeoX v1.0 tag for Pythia led to two Dockerfiles in the repo, one based on PyTorch 1.13 + CUDA 11.7, the other on PyTorch 1.10 + CUDA 11.1. Stella from Eleuther verified that the latter was used for the original Pythia training. All experiments other than those comparing dependency versions (further down) are run using this config.

The ‘correct’ environment for repro:

  • PyTorch 1.10.0+cu111
  • CUDA 11.1
  • Apex with commit a651e2c
  • DeepSpeed 0.3.15 (EleutherAI’s DeeperSpeed fork)
  • flash-attn 0.2.2

Experiments

Gradient accumulation steps (gas)

Of all the knobs I had, gas seemed to make the biggest and most consistent difference in reproduction quality. The original training used 64 GPUs with gas=1 and a global batch size of 1024 samples/step. Since I had fewer GPUs, I couldn’t run any experiments with gas=1, but by varying the number of GPUs and the per-device batch size (PDBS) while keeping the global batch size steady, I was able to run gas from 32 down to 8. Lowering gas steadily got us closer to the uploaded checkpoints, while the reproduction gap between training on the standard and deduped Pile remained roughly constant.

| GPUs (A40s) | PDBS | gas | Standard θ Cos | Deduped θ Cos | Standard L2 | Deduped L2 |
| --- | --- | --- | --- | --- | --- | --- |
| 16 | 8 | 8 | 0.999994 | 0.999992 | 2.36 | 2.63 |
| 8 | 8 | 16 | 0.999992 | 0.999990 | 2.72 | 3.00 |
| 8 | 4 | 32 | 0.999984 | 0.999982 | 3.75 | 3.93 |
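As a sanity check on these configurations: the global batch size is the product of GPU count, PDBS, and gas, and every run keeps it at 1024 samples/step. (The original run’s PDBS of 16 is my inference from 1024 / 64 / 1, not a documented value.)

```python
# global_batch = n_gpus * per_device_batch_size * grad_accum_steps
configs = [
    ("original 64xA100", 64, 16, 1),   # PDBS=16 inferred from 1024/64/1
    ("16xA40",           16, 8,  8),
    ("8xA40, gas=16",     8, 8, 16),
    ("8xA40, gas=32",     8, 4, 32),
]
for name, gpus, pdbs, gas in configs:
    assert gpus * pdbs * gas == 1024, name  # constant global batch
```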

Why does gas matter so much? I didn’t know, but this is what Claude said:

In the original training, with 64 GPUs and gas=1, each microbatch’s gradients are computed independently and then averaged across all devices in a single all-reduce before the optimizer step. With fewer GPUs and higher gas, gradients are instead accumulated locally over multiple microbatches before the all-reduce. Since floating point addition is non-associative, the order and grouping of these reductions changes the result — accumulating 16 microbatches locally before averaging across 8 GPUs produces different rounding patterns than averaging 64 microbatches across 64 GPUs in one shot. These small numerical differences compound over training steps.
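That explanation is easy to demonstrate. The sketch below (my own toy, not actual NeoX or DeepSpeed code) averages the same 64 “microbatch gradients” in float32 under two reduction orders, and the results come out close but not bit-identical:

```python
import numpy as np

rng = np.random.default_rng(0)
# 64 per-microbatch gradients for 1000 scalar parameters, in float32
g = rng.standard_normal((64, 1000)).astype(np.float32)

def sequential_sum(rows):
    """Add rows one at a time in float32, fixing one reduction order."""
    acc = np.zeros(rows.shape[1], dtype=np.float32)
    for r in rows:
        acc += r
    return acc

# "64 GPUs, gas=1": one flat reduction over all 64 microbatch gradients
flat = sequential_sum(g) / np.float32(64)

# "8 GPUs, gas=8": accumulate 8 microbatches locally, then reduce the 8 partials
partials = np.stack([sequential_sum(g[i * 8:(i + 1) * 8]) for i in range(8)])
grouped = sequential_sum(partials) / np.float32(64)

# Mathematically the same mean, but the rounding differs, and these tiny
# differences compound over optimizer steps.
print(np.abs(flat - grouped).max())
```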

GPU type

Since the original training was done on A100s (albeit the 40GB version, not the 80GB version I had), I compared 8xA100 to 8xA40 and didn’t find much difference. I had limited time with the A100s, so I only ran the deduped Pile, not both.

| GPU | θ Cos | L2 |
| --- | --- | --- |
| 8xA100 | 0.999990 | 3.00 |
| 8xA40 | 0.999990 | 3.00 |

PDBS vs gas

Since the A100s had 80GB of VRAM, I had the option to double PDBS and halve gas compared to the A40s, while keeping the global batch size and number of GPUs constant. Here are the results:

| GPU | PDBS | gas | Std θ Cos | Ded θ Cos | Std L2 | Ded L2 |
| --- | --- | --- | --- | --- | --- | --- |
| 8xA100 | 16 | 8 | 0.999992 | 0.999990 | 2.69 | 3.01 |
| 8xA40 | 8 | 16 | 0.999992 | 0.999990 | 2.72 | 3.00 |

Global vs flash attention

Pythia v1.0 was trained with flash attention 0.2.2. For the 2.8b Pythia which has a head_dim of 80, the A40 did not have enough shared memory so all the A40 experiments were run with global attention. To check whether this made any difference I compared flash attention to global attention on the A100 cluster.

| attention | θ Cos | L2 |
| --- | --- | --- |
| flash | 0.999990 | 3.01 |
| global | 0.999990 | 3.01 |

Dependency versions

As mentioned earlier I found two setups with different PyTorch and CUDA versions. Even though we verified which one was used for the original Pythia training, I wanted to see how much difference this made.

| PyTorch | CUDA | Std θ Cos | Ded θ Cos | Std L2 | Ded L2 |
| --- | --- | --- | --- | --- | --- |
| 1.10+cu111 | 11.1 | 0.999992 | 0.999990 | 2.72 | 3.00 |
| 1.13+cu117 | 11.7 | 0.999991 | 0.999990 | 2.73 | 2.98 |

Pythia 70M control

I also resumed training of the 70m deduped Pythia from checkpoint 0 in the same environment on both Piles. Since we have no reason to suspect that model was trained on the wrong data, I wanted to see whether we could reproduce a consistent gap with our chosen metrics. I trained the 70m up to 1k steps and saw that deduped was closer at every step, with the gap growing over training:

| Step | Deduped θ Cos | Standard θ Cos | Deduped L2 | Standard L2 |
| --- | --- | --- | --- | --- |
| 2 | 1.000000 | 1.000000 | 0.00 | 0.00 |
| 8 | 1.000000 | 1.000000 | 0.06 | 0.06 |
| 32 | 0.999999 | 0.999998 | 0.41 | 0.44 |
| 128 | 0.999869 | 0.999778 | 3.89 | 5.06 |
| 512 | 0.989679 | 0.987252 | 38.11 | 42.33 |
| 1000 | 0.932638 | 0.922152 | 113.34 | 121.67 |

Data shuffling

GPT-NeoX has two levels of shuffling, document-level and sample-level. Eleuther also has both a regular Pile and a “preshuffled” Pile, the latter tokenized into a single continuous stream with no document boundaries (doc_count=1), so document-level shuffle is a no-op on that Pile.
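To make the two levels concrete, here is a toy sketch (invented for illustration, not GPT-NeoX’s actual data pipeline; `two_level_shuffle` and `seq_len` are my own names):

```python
import random

def two_level_shuffle(docs, seq_len=4, seed=1234):
    """Toy version of two-level shuffling: shuffle documents, concatenate
    them into one token stream, cut fixed-length samples, shuffle those."""
    rng = random.Random(seed)
    docs = list(docs)
    rng.shuffle(docs)                       # document-level shuffle
    stream = [t for d in docs for t in d]   # concatenate into one stream
    samples = [stream[i:i + seq_len] for i in range(0, len(stream), seq_len)]
    rng.shuffle(samples)                    # sample-level shuffle
    return samples

# With a preshuffled Pile there is effectively a single document
# (doc_count=1), so the document-level shuffle above is a no-op.
```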

I tested three shuffle configurations at 70m up to step 128: the standard “non-preshuffled” Pile, the “preshuffled” Pile with sample-shuffling, and the “preshuffled” Pile without sample-shuffling:

| Variant | θ Cos | L2 |
| --- | --- | --- |
| Preshuffled deduped, sample-shuffling | 0.999869 | 3.89 |
| Preshuffled deduped, no sample-shuffling | 0.999866 | 3.93 |
| Non-preshuffled deduped, document-shuffling + sample-shuffling | 0.999771 | 5.14 |

The preshuffled Piles with and without sample-shuffling had almost the same θ Cos, so I manually checked the first few tokens to confirm the two runs actually saw different data. Using the “preshuffled” Pile with default shuffling got us slightly closer to HuggingFace checkpoint 128, but the margin is insignificant.

Versus gen 1/2

I noticed that the first two generations of the 2.8b Pythias seem to be a distinct training run from the latter three (their weights are fairly different), so I compared checkpoints of our suspected problematic 2.8b deduped against both the standard and deduped versions of the earlier generations. While this is an extremely noisy signal, it also showed that the 2.8b deduped was closer to the first two generations’ standard model.

| Step | 2.8b-v0 vs 2.8b-deduped (final) | 2.8b-deduped-v0 vs 2.8b-deduped (final) | 2.8b-v0 vs 2.8b-deduped-v0 |
| --- | --- | --- | --- |
| 1000 | 0.994286 | 0.992782 | 0.992740 |
| 16000 | 0.499541 | 0.476072 | 0.475921 |
| 64000 | 0.329775 | 0.306820 | 0.367098 |
| 143000 | 0.456450 | 0.417573 | 0.427488 |

Conclusions

While I was not able to reproduce Pythia training bit-exactly with the resources I had, in every set of experiments the model trained on the standard Pile was consistently closer to the uploaded checkpoints of pythia-2.8b-deduped than the model trained on the deduped Pile. By contrast, the 70m Pythia showed the opposite behaviour (i.e. the pythia-70m-deduped checkpoints on HuggingFace are closer to those we trained on the deduped Pile). Additionally, pythia-2.8b-deduped is closer at all compared steps to the standard v0 checkpoints than to the deduped v0 checkpoints.

I’d say this is convincing but not conclusive evidence that the 2.8b-deduped pythia was trained on the standard rather than deduped Pile.

P.S. the issues with corrupt files are now being fixed by Eleuther.

P.P.S. thanks to Lucia and Stella from Eleuther for helping with this work.