
[seq2seq testing] multigpu test run via subprocess #7281


Merged
merged 31 commits into from
Oct 21, 2020

Conversation

stas00
Contributor

@stas00 stas00 commented Sep 21, 2020

This PR tries to fix the hanging/misbehaving/self-replicating pytest runs for tests that use PL with gpus>1 (ddp backend).

OK, I couldn't figure out how to make dp or ddp_spawn work - all kinds of obscure errors inside PL (it doesn't look like these are closely maintained, as it's recommended not to use either) - so ddp it is. I tried to get dp working first, since it doesn't require forking a new process or special handling inside pytest.

Here is a working solution for ddp. Bottom line - to get it working with ddp you have to fork a new process and run the distributed script from it; otherwise pytest either hangs or runs itself multiple times, breaking other scripts - a big mess.
I borrowed the idea from PL itself: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/tests/models/test_gpu.py#L111 - what better place to find the correct way to test something than from the horse's mouth.
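Roughly, the launch looks like the sketch below. The script name, flags, and timeout are illustrative placeholders rather than the exact command the PR builds, and asyncio's subprocess API is used since (as noted in the list below) plain pipes with wait() can deadlock:

    import asyncio
    import sys


    async def _run_distributed(cmd, timeout=360):
        # run the training script in a fresh process, so ddp's own process
        # management does not collide with the pytest process itself
        proc = await asyncio.create_subprocess_exec(
            *cmd, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE
        )
        # communicate() drains stdout/stderr concurrently, which avoids the
        # full-pipe deadlock that wait() with PIPEs can run into
        out, err = await asyncio.wait_for(proc.communicate(), timeout=timeout)
        return proc.returncode, out.decode(), err.decode()


    # placeholder command for illustration only
    cmd = [sys.executable, "finetune.py", "--gpus=2", "--output_dir=/tmp/seq2seq_test"]
    rc, out, err = asyncio.get_event_loop().run_until_complete(_run_distributed(cmd))
    assert rc == 0, err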

So what I had to do:

  • split the multi-gpu test into test_seq2seq_examples_multi_gpu.py, as requested
  • added subprocess.Popen, but then replaced it with the modern asyncio API - apparently using stdout and stderr pipes with wait() can still cause a deadlock - let's see if that works for our needs
  • had to mess with the args to correctly convert them into command-line args - so many of them! Perhaps there is already a helper util that does that and I just re-invented the wheel (a rough sketch of such a helper follows the description)
  • had to provide a new flag --overwrite_output_dir to support multi-gpu processes, as otherwise one of the children creates the output dir and the other fails to do so; instead we now create the dir in the parent process
  • there are some issues with accessing module attributes in the distributed env (see details here: Seq2Seq: same MultiGPU test failing twice! #5887 (comment)) - I had to tweak lightning_base.py to not use attributes in 2 accessors (I didn't check - it's possible I need to adjust other scripts if they use self.total_steps). I'm not 100% sure what is different under ddp, but somehow things behave differently and we have no access to the module's attributes unless they are part of the model - see nn.Module.__getattr__ (a short illustration follows this list) - it might have to do with modules getting pickled. If you have insights, I'm all ears.
  • the test validation had to be adjusted to handle 2 gpus
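To illustrate the nn.Module.__getattr__ point above, here is a generic sketch of PyTorch's attribute lookup (this is not what lightning_base.py does, just the mechanism):

    import torch
    import torch.nn as nn


    class ToyModule(nn.Module):
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(2, 2)                     # registered in _modules
            self.register_buffer("steps", torch.tensor(0))   # registered in _buffers
            self.total_steps = 100                           # plain Python attribute, only in __dict__


    m = ToyModule()
    print(m.layer, m.steps, m.total_steps)  # all three are found in the same process

    # nn.Module.__getattr__ is consulted only when a name is missing from the
    # instance __dict__, and it searches just _parameters, _buffers and _modules.
    # So if a copy of the module in another process never ran the code that set
    # `total_steps`, accessing it raises AttributeError, while `layer` and
    # `steps` are still found.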

@sshleifer

Fixes #5887
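For the bullet about converting args into command-line args, a hypothetical helper could look roughly like this (the name and exact flag format are assumptions, not code from this PR):

    def args_to_cli(args: dict) -> list:
        """Turn {"max_epochs": 1, "fp16": True} into ["--max_epochs=1", "--fp16"]."""
        cli = []
        for name, value in args.items():
            if isinstance(value, bool):
                if value:  # boolean flags are passed without a value
                    cli.append(f"--{name}")
            else:
                cli.append(f"--{name}={value}")
        return cli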

@stas00
Contributor Author

stas00 commented Sep 21, 2020

To finish up the test (I don't yet know this functionality), it'd be something like this (adapting the end of _test_distiller_cli):

        [...]
        # launch the training command in a child process and capture its output
        p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, env=env)
        print("\nWarning: there will be no output while the subprocess takes some time to complete")
        out, err = p.communicate(timeout=360)
        out = out.decode("utf-8").strip()
        err = err.decode("utf-8").strip()
        print(f"err: {err}")
        print(f"out: {out}")
        assert out, "produced no output"
        if p.returncode != 0:  # also catches a child killed by a signal (negative returncode)
            pytest.fail(err)

        # model = distill_main(argparse.Namespace(**args_d))
        # if not check_contents:
        #     return model
        contents = os.listdir(output_dir)
        contents = {os.path.basename(p) for p in contents}
        ckpt_files = [p for p in contents if p.endswith("ckpt")]
        assert len(ckpt_files) > 0

        self.assertIn("test_generations.txt", contents)
        self.assertIn("test_results.txt", contents)

        # XXX: get the following from the module, (we don't have access to `model` here)
        metrics_save_path = os.path.join(output_dir, "metrics.json")
        val_metric = "rouge2"
        
        metrics = load_json(metrics_save_path)
        # {'test': [{'test_avg_loss': 10.63731575012207, 'test_avg_rouge1': 0.0, 'test_avg_rouge2': 0.0, 'test_avg_rougeL': 0.0, 'test_avg_gen_time': 0.1822289228439331, 'test_avg_gen_len': 142.0, 'step_count': 1}]}
        print(metrics)
        last_step_stats = metrics["val"][-1]
        self.assertGreaterEqual(last_step_stats["val_avg_gen_time"], 0.01)
        self.assertGreaterEqual(1.0, last_step_stats["val_avg_gen_time"])
        self.assertIsInstance(last_step_stats[f"val_avg_{val_metric}"], float)
        desired_n_evals = int(args_d["max_epochs"] * (1 / args_d["val_check_interval"]) + 1)
        self.assertEqual(len(metrics["val"]), desired_n_evals)
        self.assertEqual(len(metrics["test"]), 1)

but I get test results in the metrics and not validation...

I'm sure you can quickly sort it out since you're familiar with what it's supposed to do. I hope it actually does the right thing - since it runs with tiny models, it's impossible to tell quality-wise whether it works or not.

@sshleifer
Contributor

The only dealbreaker here is hanging.
Will timeout_decorator work in this context?

Also I'd love to move the test to a separate file.

@stas00
Contributor Author

stas00 commented Sep 21, 2020

The only dealbreaker here is hanging.
Will timeout_decorator work in this context?

We have the timeout already, but it still hangs when the sub-process fails - though it does dump the error. I will poke at it some more. I'd also like it to tee the outputs of the subprocess, instead of the silent-until-done treatment.

Also I'd love to move the test to a separate file.

Just the multigpu test? Or split all those unrelated example tests into their own test_*specific_feature* files?

@sshleifer
Contributor

Just multigpu.

@stas00
Contributor Author

stas00 commented Sep 21, 2020

Just multigpu.

Will do. I think I understand why you want it apart - a troublemaker that affects other tests.

@sshleifer
Contributor

Made some progress on this. I think PL 1.0.0 will obviate the need to comment out the output_dir-checking logic. Will push my changes soon, but I can take this from here.
You made huge progress on this, thank you @stas00!

@stas00 stas00 changed the title [wip] [seq2seq] multigpu test needs to be run via a subprocess [seq2seq testing] multigpu test needs to be run via a subprocess Oct 17, 2020
Contributor

@sshleifer sshleifer left a comment


Let me know if you want to merge!

@stas00
Contributor Author

stas00 commented Oct 17, 2020

yes, please.

@sshleifer sshleifer changed the title [seq2seq testing] multigpu test needs to be run via a subprocess [seq2seq testing] multigpu test run via subprocess Oct 21, 2020
@sshleifer sshleifer merged commit 8b38173 into huggingface:master Oct 21, 2020
@stas00 stas00 deleted the multigpu branch October 21, 2020 21:27