
[seq2seq testing] multigpu test run via subprocess #7281


Merged
merged 31 commits into from
Oct 21, 2020

Conversation

stas00
Contributor

@stas00 stas00 commented Sep 21, 2020

This PR tries to fix the hanging/misbehaving/self-replicating pytest runs for tests that use PL with gpus>1 (ddp backend).

OK, I couldn't figure out how to make dp or ddp_spawn work - all kinds of obscure errors inside PL (it doesn't look like these are closely maintained, as it's recommended not to use either) - so ddp it is. I tried to get dp working first, since it doesn't require forking a new process or special handling inside pytest.

Here is a working solution for ddp. Bottom line - to get it working with ddp you have to fork a new process and run the distributed script from it; otherwise pytest either hangs or runs itself multiple times, breaking other scripts - a big mess.
I borrowed the idea from PL itself: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/tests/models/test_gpu.py#L111 - what better place to find the correct way to test something than from the horse's mouth.
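Roughly, the launch looks like the sketch below. The script name, flags, and timeout are illustrative placeholders rather than the exact command the PR builds, and asyncio's subprocess API is used since (as noted in the list below) plain pipes with wait() can deadlock:

    import asyncio
    import sys


    async def _run_distributed(cmd, timeout=360):
        # run the training script in a fresh process, so ddp's own process
        # management does not collide with the pytest process itself
        proc = await asyncio.create_subprocess_exec(
            *cmd, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE
        )
        # communicate() drains stdout/stderr concurrently, which avoids the
        # full-pipe deadlock that wait() with PIPEs can run into
        out, err = await asyncio.wait_for(proc.communicate(), timeout=timeout)
        return proc.returncode, out.decode(), err.decode()


    # placeholder command for illustration only
    cmd = [sys.executable, "finetune.py", "--gpus=2", "--output_dir=/tmp/seq2seq_test"]
    rc, out, err = asyncio.get_event_loop().run_until_complete(_run_distributed(cmd))
    assert rc == 0, err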

So what I had to do:

  • split the multi-gpu test into test_seq2seq_examples_multi_gpu.py, as requested
  • added subprocess.Popen, but then replaced it with the modern asyncio API - apparently using stdout and stderr pipes with wait() can still cause a deadlock - let's see if that works for our needs
  • had to mess with the args to correctly convert them into command-line args - so many of them! Perhaps there is already a helper util that does that and I just re-invented the wheel (a rough sketch of such a helper follows the description)
  • had to provide a new flag --overwrite_output_dir to support multi-gpu processes, as otherwise one of the children creates the output dir and the other fails to do so; instead we now create the dir in the parent process
  • there are some issues with accessing module attributes in the distributed env (see details here: Seq2Seq: same MultiGPU test failing twice! #5887 (comment)) - I had to tweak lightning_base.py to not use attributes in 2 accessors (I didn't check - it's possible I need to adjust other scripts if they use self.total_steps). I'm not 100% sure what is different under ddp, but somehow things behave differently and we have no access to the module's attributes unless they are part of the model - see nn.Module.__getattr__ (a short illustration follows this list) - it might have to do with modules getting pickled. If you have insights, I'm all ears.
  • the test validation had to be adjusted to handle 2 gpus
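To illustrate the nn.Module.__getattr__ point above, here is a generic sketch of PyTorch's attribute lookup (this is not what lightning_base.py does, just the mechanism):

    import torch
    import torch.nn as nn


    class ToyModule(nn.Module):
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(2, 2)                     # registered in _modules
            self.register_buffer("steps", torch.tensor(0))   # registered in _buffers
            self.total_steps = 100                           # plain Python attribute, only in __dict__


    m = ToyModule()
    print(m.layer, m.steps, m.total_steps)  # all three are found in the same process

    # nn.Module.__getattr__ is consulted only when a name is missing from the
    # instance __dict__, and it searches just _parameters, _buffers and _modules.
    # So if a copy of the module in another process never ran the code that set
    # `total_steps`, accessing it raises AttributeError, while `layer` and
    # `steps` are still found.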

@sshleifer

Fixes #5887
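For the bullet about converting args into command-line args, a hypothetical helper could look roughly like this (the name and exact flag format are assumptions, not code from this PR):

    def args_to_cli(args: dict) -> list:
        """Turn {"max_epochs": 1, "fp16": True} into ["--max_epochs=1", "--fp16"]."""
        cli = []
        for name, value in args.items():
            if isinstance(value, bool):
                if value:  # boolean flags are passed without a value
                    cli.append(f"--{name}")
            else:
                cli.append(f"--{name}={value}")
        return cli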

@stas00
Contributor Author

stas00 commented Sep 21, 2020

To finish up the test (I don't yet know this functionality), it'd be something like this (adapting the end of _test_distiller_cli):

        [...]
        # launch the training command in a child process and capture its output
        p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, env=env)
        print("\nWarning: there will be no output while the subprocess takes some time to complete")
        out, err = p.communicate(timeout=360)
        out = out.decode("utf-8").strip()
        err = err.decode("utf-8").strip()
        print(f"err: {err}")
        print(f"out: {out}")
        assert out, "produced no output"
        if p.returncode != 0:  # also catches a child killed by a signal (negative returncode)
            pytest.fail(err)

        # model = distill_main(argparse.Namespace(**args_d))
        # if not check_contents:
        #     return model
        contents = os.listdir(output_dir)
        contents = {os.path.basename(p) for p in contents}
        ckpt_files = [p for p in contents if p.endswith("ckpt")]
        assert len(ckpt_files) > 0

        self.assertIn("test_generations.txt", contents)
        self.assertIn("test_results.txt", contents)

        # XXX: get the following from the module, (we don't have access to `model` here)
        metrics_save_path = os.path.join(output_dir, "metrics.json")
        val_metric = "rouge2"
        
        metrics = load_json(metrics_save_path)
        # {'test': [{'test_avg_loss': 10.63731575012207, 'test_avg_rouge1': 0.0, 'test_avg_rouge2': 0.0, 'test_avg_rougeL': 0.0, 'test_avg_gen_time': 0.1822289228439331, 'test_avg_gen_len': 142.0, 'step_count': 1}]}
        print(metrics)
        last_step_stats = metrics["val"][-1]
        self.assertGreaterEqual(last_step_stats["val_avg_gen_time"], 0.01)
        self.assertGreaterEqual(1.0, last_step_stats["val_avg_gen_time"])
        self.assertIsInstance(last_step_stats[f"val_avg_{val_metric}"], float)
        desired_n_evals = int(args_d["max_epochs"] * (1 / args_d["val_check_interval"]) + 1)
        self.assertEqual(len(metrics["val"]), desired_n_evals)
        self.assertEqual(len(metrics["test"]), 1)

but I get test results in the metrics and not validation...

I'm sure you can quickly sort it out since you're familiar with what it's supposed to do. I hope it actually does the right thing - since it runs with tiny models, it's impossible to tell quality-wise whether it works or not.

@sshleifer
Contributor

The only dealbreaker here is hanging.
Will timeout_decorator work in this context?

Also I'd love to move the test to a separate file.

@stas00
Contributor Author

stas00 commented Sep 21, 2020

The only dealbreaker here is hanging.
Will timeout_decorator work in this context?

We have the timeout already, but it still hangs when the sub-process fails - though it does dump the error. I will poke at it some more. I'd also like it to tee the outputs of the subprocess, instead of the silent-until-done treatment.

Also I'd love to move the test to a separate file.

Just the multigpu test? Or split all those unrelated example tests into their own test_*specific_feature* files?

@sshleifer
Contributor

Just multigpu.

@stas00
Contributor Author

stas00 commented Sep 21, 2020

Just multigpu.

Will do. I think I understand why you want it apart - a troublemaker that affects other tests.

@sshleifer
Contributor

Made some progress on this. I think PL 1.0.0 will obviate the need to comment out the output_dir-checking logic. Will push my changes soon, but I can take this from here.
You made huge progress on this, thank you @stas00!

@stas00 stas00 changed the title [wip] [seq2seq] multigpu test needs to be run via a subprocess [seq2seq testing] multigpu test needs to be run via a subprocess Oct 17, 2020
Contributor

@sshleifer sshleifer left a comment


Let me know if you want to merge!

@stas00
Contributor Author

stas00 commented Oct 17, 2020

yes, please.

@sshleifer sshleifer changed the title [seq2seq testing] multigpu test needs to be run via a subprocess [seq2seq testing] multigpu test run via subprocess Oct 21, 2020
@sshleifer sshleifer merged commit 8b38173 into huggingface:master Oct 21, 2020
@stas00 stas00 deleted the multigpu branch October 21, 2020 21:27