For a pretraining experiment using the built-in Trainer, a per-device batch size of 32 with 16 gradient accumulation steps yields a different training loss curve from 64 x 8, even though the effective batch size (512) is the same. The two curves converge to approximately the same value, but shouldn't they be exactly identical? What would cause the difference?

I've also seen similar discrepancies between 1 GPU and multiple GPUs (under DataParallel), which may or may not be related.
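As a sanity check, I tried a toy NumPy sketch of one thing I suspected: even with identical data, accumulating micro-batch means in a different order/grouping need not be bitwise identical in float32. This uses made-up per-sample "gradients", not the actual Trainer internals, so it's only a rough analogy:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical per-sample gradient values for one effective batch of 512
grads = rng.standard_normal(512).astype(np.float32)

def accumulate(grads, micro_batch, steps):
    """Mimic gradient accumulation: average each micro-batch in
    float32, then average the partial means across steps."""
    total = np.float32(0.0)
    for i in range(steps):
        chunk = grads[i * micro_batch:(i + 1) * micro_batch]
        total += chunk.mean(dtype=np.float32)
    return total / np.float32(steps)

g_32x16 = accumulate(grads, 32, 16)   # 32 x 16 accumulation
g_64x8 = accumulate(grads, 64, 8)     # 64 x 8 accumulation

# the two groupings agree only up to float32 rounding, not bitwise
print(g_32x16, g_64x8, abs(float(g_32x16) - float(g_64x8)))
```

That alone seems too small to explain a visibly different loss curve, though, which is why I'm wondering whether something else (dropout RNG consumption per forward pass, loss normalization across micro-batches, etc.) differs between the two settings.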