Video reference scripts #1180

fmassa · 2019-07-29T14:38:48Z

This PR adds training and evaluation scripts for video models.

It also adds a few extra helper functions, which should ideally be integrated in PyTorch / Torchvision instead of being part of the reference scripts. For now they are added here to avoid having to worry about backwards-compatibility.

Some parts of the main training script need cleanup, especially the part that handles caching of the dataset.
I'm sending this PR now for early feedback.

Note that the first commit only copies the training scripts from image classification as is, and does not need to be reviewed.

cc @bjuncek

references/video_classification/train.py

torchvision/datasets/kinetics.py

torchvision/io/video.py

references/video_classification/sampler.py

references/video_classification/train.py

This gives slightly better results than expected, with 57.336 top-1 clip accuracy. Note, however, that some clips are counted twice in this evaluation.
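The comment above does not say why some clips are counted twice. One common cause, assumed here purely for illustration, is a distributed sampler that pads the index list so every rank receives the same number of samples. The helper `padded_shards` below is hypothetical and not part of this PR; it only sketches how such padding produces duplicates.

```python
import math


def padded_shards(num_samples, world_size):
    """Split indices across ranks, repeating leading ones so all shards have equal size.

    Hypothetical helper: it imitates the padding strategy of a typical
    distributed sampler, which is an assumption about this evaluation.
    """
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    indices = list(range(num_samples))
    # Padding: repeat the first indices until the list divides evenly.
    indices += indices[: total - num_samples]
    # Each rank takes a strided slice of the padded list.
    return [indices[rank:total:world_size] for rank in range(world_size)]


shards = padded_shards(num_samples=10, world_size=4)
seen = [i for shard in shards for i in shard]
# Indices that were evaluated more than once across all ranks.
duplicates = sorted({i for i in seen if seen.count(i) > 1})
print(shards)
print(duplicates)
```

With 10 clips on 4 ranks, 12 evaluations happen and two clips are seen twice, so any metric averaged over all evaluations is slightly biased unless duplicates are dropped first.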

codecov-io · 2019-07-31T12:41:34Z

Codecov Report

Merging #1180 into master will increase coverage by <.01%.
The diff coverage is 86.36%.

@@            Coverage Diff             @@
##           master    #1180      +/-   ##
==========================================
+ Coverage   65.78%   65.78%   +<.01%     
==========================================
  Files          79       79              
  Lines        5834     5849      +15     
  Branches      887      890       +3     
==========================================
+ Hits         3838     3848      +10     
- Misses       1726     1730       +4     
- Partials      270      271       +1

Impacted Files	Coverage Δ
torchvision/io/video.py	`72% <100%> (+0.57%)`	⬆️
torchvision/datasets/video_utils.py	`85.18% <100%> (+1.57%)`	⬆️
torchvision/datasets/kinetics.py	`34.78% <25%> (-5.22%)`	⬇️
torchvision/transforms/transforms.py	`80.35% <0%> (-0.59%)`	⬇️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2287c8f...2c9624f.

fmassa · 2019-07-31T12:49:06Z

torchvision/datasets/kinetics.py

@@ -23,4 +24,7 @@ def __getitem__(self, idx):
        video, audio, info, video_idx = self.video_clips.get_clip(idx)
        label = self.samples[video_idx][1]

+        if self.transform is not None:


@bjuncek @soumith I'm looking for your opinions on what we should call the video transforms.

We might eventually have both video_transform and audio_transform, given that the dataset returns both video and audio.

Would you prefer to stick with transform to mean video_transform, or choose a different name?

For this PR, I think we should keep this as transform, and maybe have a wrapper for audio transforms, or wait until batched transforms for both audio and video are ready.
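As a rough sketch of the option proposed here, a single `transform` argument would be applied to the video only, while the audio passes through unchanged. `ToyVideoDataset` and `halve_frames` are hypothetical stand-ins using plain lists; they are not the Kinetics dataset or any torchvision API.

```python
class ToyVideoDataset:
    """Hypothetical stand-in for a clip dataset; not the torchvision class."""

    def __init__(self, clips, labels, transform=None):
        # clips: list of (video, audio) pairs; labels: one label per clip.
        self.clips = clips
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.clips)

    def __getitem__(self, idx):
        video, audio = self.clips[idx]
        label = self.labels[idx]
        # `transform` means "video transform": audio is returned untouched.
        if self.transform is not None:
            video = self.transform(video)
        return video, audio, label


def halve_frames(video):
    # Toy video transform: keep every second frame.
    return video[::2]


dataset = ToyVideoDataset(
    clips=[([0, 1, 2, 3], [0.1, 0.2]), ([4, 5, 6, 7], [0.3, 0.4])],
    labels=[0, 1],
    transform=halve_frames,
)
video, audio, label = dataset[1]
print(video, audio, label)
```

If an audio transform were needed later, it could be added as a separate argument or applied by a wrapper around the dataset, without changing what `transform` means.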

bjuncek

Looks good to me. We were able to match the performance of the baseline models and beat the prototype implementation.

fmassa · 2019-07-31T13:16:42Z

For reference, and to complement @bjuncek's note, training r2plus1d_18 on Kinetics400 gives the following results for clip-len=16:

Clip Acc@1 57.351 Clip Acc@5 78.523

which matches the expected results from https://github.com/facebookresearch/VMZ/blob/master/tutorials/models.md
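For readers unfamiliar with the metric names, Acc@1 and Acc@5 are top-k accuracies computed per clip: a clip counts as correct if its label is among the k highest-scoring classes. The function below is a minimal pure-Python sketch of that definition, not the code used by the reference scripts; `topk_accuracy` and the toy scores are made up for illustration.

```python
def topk_accuracy(scores, targets, ks=(1, 5)):
    """Return {k: percentage of samples whose target is among the k highest scores}."""
    hits = {k: 0 for k in ks}
    for row, target in zip(scores, targets):
        # Class indices sorted by descending score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        for k in ks:
            if target in ranked[:k]:
                hits[k] += 1
    return {k: 100.0 * hits[k] / len(targets) for k in ks}


# Four toy clips scored over three classes.
scores = [
    [0.1, 0.7, 0.2],  # top prediction: class 1
    [0.5, 0.3, 0.2],  # top prediction: class 0
    [0.2, 0.3, 0.5],  # top prediction: class 2
    [0.6, 0.3, 0.1],  # top prediction: class 0
]
targets = [1, 1, 2, 2]
acc = topk_accuracy(scores, targets, ks=(1, 2))
print(acc)
```

Here two of four clips are right at k=1 and three of four at k=2, so the sketch reports 50.0 and 75.0 respectively.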

fmassa added 3 commits July 26, 2019 05:58

Copy classification scripts for video classification

3fd24e3

Initial version of video classification

6c89d04

add version

824438c