[seq2seq testing] multigpu test run via subprocess #7281
Conversation
To finish up the test: I don't yet know this functionality, but it'd be something like the following (adapting the end from …), except I get test results in the metrics and not validation. I'm sure you can quickly sort it out since you're familiar with what it's supposed to do. I hope it actually does the right thing - since it runs with tiny models, it's impossible to tell whether it works quality-wise.
The only dealbreaker here is hanging. Also, I'd love to move the test to a separate file.
We have the timeout already, but it still hangs when the sub-process fails, and it does dump the error. I will poke at it some more - I want it to fail cleanly instead of hanging.
Just the multigpu test? Or split all those unrelated example tests into their own files?
Just multigpu.
Will do. I think I understand why you want it separate - it's a troublemaker that affects other tests.
Made some progress on this. I think PL 1.0.0 will obviate the need to comment out the output_dir-checking logic. Will push my changes soon, but I can take this from here.
Let me know if you want to merge!
Yes, please.
This PR is trying to fix the hanging/misbehaving/self-replicating pytest runs for tests using PL with `gpu>1` (the `ddp` backend).

OK, I couldn't figure out how to make `dp` or `ddp_spawn` work - all kinds of obscure errors inside PL (it doesn't look like these are closely maintained, as it's recommended not to use either) - so `ddp` it is. I tried to get `dp` to work first, since it doesn't require forking a new process and special handling inside pytest.

Here is a working solution for `ddp`. Bottom line: to get it working with `ddp` you have to fork a new process and run the distributed script from it - otherwise pytest either hangs or runs itself multiple times, breaking other scripts - a big mess. I borrowed the idea from PL itself, https://github.com/PyTorchLightning/pytorch-lightning/blob/master/tests/models/test_gpu.py#L111 - what better place to find the correct way to test something than from the horse's mouth.
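For illustration, here is a minimal sketch of that approach - a pytest test that forks a new process to run the distributed script. The script name, arguments, and flags below are assumptions made for the sketch, not the actual code in this PR:

```python
# Minimal sketch: run a (hypothetical) PL training script in a child process,
# so pytest itself is never re-executed or hung by the ddp launcher.
import os
import subprocess
import sys

import pytest
import torch


@pytest.mark.skipif(torch.cuda.device_count() < 2, reason="needs at least 2 GPUs")
def test_finetune_multi_gpu(tmp_path):
    # Create the output dir in the parent process, so the child processes
    # don't race to create it (see the --overwrite_output_dir note below).
    output_dir = tmp_path / "output"
    output_dir.mkdir()

    cmd = [
        sys.executable,
        "finetune.py",          # hypothetical entry point, for illustration only
        "--gpus", "2",
        "--output_dir", str(output_dir),
    ]
    result = subprocess.run(
        cmd, capture_output=True, text=True, timeout=600, env=os.environ.copy()
    )
    assert result.returncode == 0, f"script failed:\n{result.stdout}\n{result.stderr}"
```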
So what I had to do:

- moved the test to `test_seq2seq_examples_multi_gpu.py`, as requested
- started with `subprocess.Popen`, but then replaced it with the modern `asyncio` - apparently using stdout and stderr pipes with `wait` can still cause a deadlock - let's see if that works for our needs (a sketch of the pattern follows this list)
- fixed `--overwrite_output_dir` to support multi-gpu processes: otherwise one of the children creates the output dir and the other fails to do so; instead, we now create the dir in the parent process
- changed `lightning_base.py` to not use attributes in 2 accessors (I didn't check - it's possible I need to adjust other scripts if they use `self.total_steps`). I'm not 100% sure what is different under `ddp`, but somehow things behave differently and we have no access to the module's attributes unless they are part of the model - see `nn.Module.__getattr__` - it might have to do with modules getting pickled. If you have insights, I'm all ears (a sketch of the accessor change also follows below).

@sshleifer
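As a rough illustration of the `asyncio` pattern mentioned above (not the exact code in this PR), the idea is to drain stdout and stderr concurrently while waiting for the child, so neither pipe buffer can fill up and deadlock the child process:

```python
# Rough sketch of the asyncio-based runner; names and timeout are illustrative.
import asyncio


async def _drain(pipe, sink):
    # Read one pipe line by line so its OS buffer never fills up and blocks the child.
    while True:
        line = await pipe.readline()
        if not line:
            break
        sink.append(line.decode())


async def _run(cmd, timeout):
    proc = await asyncio.create_subprocess_exec(
        *cmd, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE
    )
    stdout, stderr = [], []
    await asyncio.wait_for(
        asyncio.gather(_drain(proc.stdout, stdout), _drain(proc.stderr, stderr), proc.wait()),
        timeout=timeout,
    )
    return proc.returncode, "".join(stdout), "".join(stderr)


def execute_subprocess(cmd, timeout=600):
    returncode, out, err = asyncio.run(_run(cmd, timeout))
    assert returncode == 0, f"command failed:\n{out}\n{err}"
```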
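And for the `lightning_base.py` accessor change, here is a simplified illustration of the gist (not the exact diff): compute the value on demand instead of stashing it as a plain attribute on the module in a setup hook, since that attribute may not be reachable on the `ddp` replicas:

```python
# Simplified illustration only; the attribute names are assumptions, not the real module.
import pytorch_lightning as pl


class ExampleModule(pl.LightningModule):
    # Before: a setup hook stored `self.total_steps = ...`, which was not
    # reliably accessible under ddp (see nn.Module.__getattr__ / pickling).

    @property
    def total_steps(self):
        # After: derive the value on demand from values that are always available.
        num_devices = max(1, self.hparams.gpus)
        effective_batch_size = (
            self.hparams.train_batch_size * self.hparams.accumulate_grad_batches * num_devices
        )
        return int(self.dataset_size / effective_batch_size) * self.hparams.max_epochs
```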
Fixes #5887