Gradient accumulation causing different training curves #14638

For a pretraining experiment using the built-in `Trainer`, a batch size of 32 with 16 gradient accumulation steps yields a different training loss curve from a batch size of 64 with 8 accumulation steps, even though both give an effective batch size of 512. The two runs converge to approximately the same value, but shouldn't the curves be exactly the same? What would cause the difference?
![image](https://user-images.githubusercontent.com/11954789/144804002-fbd31936-c58a-45c4-a991-5eabcde42f16.png)

I've also seen similar differences when using one GPU versus multiple GPUs (under DataParallel), which may or may not be related.
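To make the expectation concrete, here is a toy sketch of why the two settings should agree in principle. It is not the `Trainer`'s actual implementation. It assumes a mean-reduced loss, equal-sized micro-batches, identical data order, and no stochastic layers such as dropout. The function name `accumulated_gradient` and the one-parameter linear model are hypothetical, chosen only for illustration.

```python
import random

def accumulated_gradient(xs, ys, w, micro_batch_size, accumulation_steps):
    """Accumulate the gradient of mean squared error for y ~ w * x over micro-batches."""
    assert len(xs) == micro_batch_size * accumulation_steps
    grad = 0.0
    for step in range(accumulation_steps):
        start = step * micro_batch_size
        xb = xs[start:start + micro_batch_size]
        yb = ys[start:start + micro_batch_size]
        # Gradient of the mean-reduced loss on this micro-batch: d/dw mean((w*x - y)^2)
        micro_grad = sum(2.0 * (w * x - y) * x for x, y in zip(xb, yb)) / micro_batch_size
        # Dividing by accumulation_steps makes the running sum the mean over the full batch
        grad += micro_grad / accumulation_steps
    return grad

# Synthetic data: 512 samples, matching the effective batch size of both settings
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(512)]
ys = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]
w = 0.5

grad_32x16 = accumulated_gradient(xs, ys, w, micro_batch_size=32, accumulation_steps=16)
grad_64x8 = accumulated_gradient(xs, ys, w, micro_batch_size=64, accumulation_steps=8)
difference = abs(grad_32x16 - grad_64x8)
print(grad_32x16, grad_64x8, difference)
```

Under these assumptions the two accumulated gradients differ only by floating-point rounding, since the summation order changes. Anything beyond that in a real run would have to come from something this sketch leaves out.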
