diff --git a/README.md b/README.md
index 058f288f..40b7720e 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,8 @@
 # Deep High-Resolution Representation Learning for Human Pose Estimation (CVPR 2019)
 ## News
+- [2021/04/12] Welcome to check out our recent work on bottom-up pose estimation (CVPR 2021) [HRNet-DEKR](https://github.com/HRNet/DEKR)!
+- [2020/07/05] [A very nice blog](https://towardsdatascience.com/overview-of-human-pose-estimation-neural-networks-hrnet-higherhrnet-architectures-and-faq-1954b2f8b249) from Towards Data Science introducing HRNet and HigherHRNet for human pose estimation.
+- [2020/03/13] A longer version has been accepted by TPAMI: [Deep High-Resolution Representation Learning for Visual Recognition](https://arxiv.org/pdf/1908.07919.pdf). It includes more HRNet applications, and the code is available: [semantic segmentation](https://github.com/HRNet/HRNet-Semantic-Segmentation), [object detection](https://github.com/HRNet/HRNet-Object-Detection), [facial landmark detection](https://github.com/HRNet/HRNet-Facial-Landmark-Detection), and [image classification](https://github.com/HRNet/HRNet-Image-Classification).
 - [2020/02/01] We have added demo code for HRNet. Thanks [Alex Simes](https://github.com/alex9311). 
 - Visualization code for showing the pose estimation results. Thanks Depu!
 - [2019/08/27] HigherHRNet is now on [ArXiv](https://arxiv.org/abs/1908.10357), which is a bottom-up approach for human pose estimation powerd by HRNet. We will also release code and models at [Higher-HRNet-Human-Pose-Estimation](https://github.com/HRNet/Higher-HRNet-Human-Pose-Estimation), stay tuned!
@@ -239,6 +242,12 @@ python visualization/plot_coco.py \
 ### Other applications
 Many other dense prediction tasks, such as segmentation, face alignment and object detection, etc. have been benefited by HRNet. More information can be found at [High-Resolution Networks](https://github.com/HRNet).
 
+### Other implementations
+[mmpose](https://github.com/open-mmlab/mmpose) <br/>
+[ModelScope (in Chinese)](https://modelscope.cn/models/damo/cv_hrnetv2w32_body-2d-keypoints_image/summary) <br/>
+[timm](https://huggingface.co/docs/timm/main/en/models/hrnet)
+
+
 ### Citation
 If you use our code or models in your research, please cite with:
 ```
@@ -261,8 +270,7 @@ If you use our code or models in your research, please cite with:
   author={Jingdong Wang and Ke Sun and Tianheng Cheng and 
           Borui Jiang and Chaorui Deng and Yang Zhao and Dong Liu and Yadong Mu and 
           Mingkui Tan and Xinggang Wang and Wenyu Liu and Bin Xiao},
-  journal   = {CoRR},
-  volume    = {abs/1908.07919},
+  journal   = {TPAMI},
   year={2019}
 }
 ```
diff --git a/demo/README.md b/demo/README.md
index 35d590b8..aff81f44 100644
--- a/demo/README.md
+++ b/demo/README.md
@@ -1,41 +1,75 @@
-This demo code is meant to be run on a video and includes a person detector.
-[Nvidia-docker](https://github.com/NVIDIA/nvidia-docker) and GPUs are required.
-It only expects there to be one person in each frame of video, though the code could easily be extended to support multiple people.
+# HRNet inference demo
 
-### Prep
+Run inference with deep-high-resolution-net.pytorch without using Docker.
+
+## Prep
 1. Download the researchers' pretrained pose estimator from [google drive](https://drive.google.com/drive/folders/1hOTihvbyIxsm5ygDpbUuJ7O_tzv4oXjC?usp=sharing) to this directory under `models/`
 2. Put the video file you'd like to infer on in this directory under `videos`
-3. build the docker container in this directory with `./build-docker.sh` (this can take time because it involves compiling opencv)
-4. update the `inference-config.yaml` file to reflect the number of GPUs you have available
+3. (Optional) Build the Docker container in this directory with `./build-docker.sh` (this can take time because it involves compiling OpenCV)
+4. Update the `inference-config.yaml` file to reflect the number of GPUs you have available and which trained model you want to use.
+
+## Running the Model
+### 1. Running on a video
+```
+python demo/inference.py --cfg demo/inference-config.yaml \
+    --videoFile ../../multi_people.mp4 \
+    --writeBoxFrames \
+    --outputDir output \
+    TEST.MODEL_FILE ../models/pytorch/pose_coco/pose_hrnet_w32_256x192.pth 
 
-### Running the Model
-Start your docker container with:
 ```
-nvidia-docker run --rm -it \
-  -v $(pwd)/output:/output \
-  -v $(pwd)/videos:/videos \
-  -v $(pwd)/models:/models \
-  -w /pose_root \
-  hrnet_demo_inference \
-  /bin/bash
+
+The above command will create a video under the *output* directory, many pose images under *output/pose*,
+and a `pose-data.csv` file with the keypoint coordinates of each processed frame.
+Even with a GPU (a GTX 1080 in my case), person detection takes nearly **0.06 sec** and pose estimation
+nearly **0.07 sec** per frame. In total, inference takes about **0.13 sec** per frame, roughly 8 fps.
+So if you need real-time (fps >= 20) pose estimation, you should try another approach.
+
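+If you want to post-process the keypoints, here is a minimal sketch (assuming the command above, so the CSV lands in `output/pose-data.csv`) for loading the generated file:
+
+```python
+# Minimal sketch: read the keypoint CSV written by demo/inference.py.
+# Each row holds x,y pairs for the people detected in one processed frame,
+# so rows can be longer when several people are detected.
+import csv
+
+with open('output/pose-data.csv') as f:
+    reader = csv.reader(f)
+    header = next(reader)  # ['frame', 'nose_x', 'nose_y', ...]
+    poses = [[float(v) for v in row] for row in reader]
+
+print('{} frames with detected poses'.format(len(poses)))
+```
+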
+**===Result===**
+
+Some example output images:
+
+![1 person](inference_1.jpg)
+Fig: 1 person inference
+
+![3 person](inference_3.jpg)
+Fig: 3 person inference
+
+![3 person](inference_5.jpg)
+Fig: 3 person inference
+
+### 2. Demo with more options
+Remember to update `TEST.MODEL_FILE` in `demo/inference-config.yaml` according to your model path.
+
+`demo.py` provides the following functions:
+
+- use `--webcam` when the input is a real-time camera.
+- use `--video [video-path]`  when the input is a video.
+- use `--image [image-path]` when the input is an image.
+- use `--write` to save the image, camera or video result.
+- use `--showFps` to show the fps (this fps includes the detection part).
+- connections between joints (the skeleton) are drawn automatically.
+
+#### (1) the input is a real-time camera
+```
+python demo/demo.py --webcam --showFps --write
 ```
 
-Once the container is running, you can run inference with:
+#### (2) the input is a video
+```
+python demo/demo.py --video test.mp4 --showFps --write
 ```
-python tools/inference.py \
-  --cfg inference-config.yaml \
-  --videoFile /videos/my-video.mp4 \
-  --inferenceFps 10 \
-  --writeBoxFrames \
-  TEST.MODEL_FILE \
-  /models/pytorch/pose_coco/pose_hrnet_w32_384x288.pth
+#### (3) the input is an image
+
+```
+python demo/demo.py --image test.jpg --showFps --write
 ```
 
-The command above will output frames with boxes,
-frames with poses,
-a video with poses,
-and a csv with the keypoint coordinates for each frame.
+**===Result===**
+
+![show_fps](inference_6.jpg)
+
+Fig: show fps
 
-![](hrnet-demo.gif)
+![multi-people](inference_7.jpg)
 
-Original source for demo video above is licensed for `Free for commercial use No attribution required` by [Pixabay](https://pixabay.com/service/license/)
+Fig: multi-people
\ No newline at end of file
diff --git a/demo/_init_paths.py b/demo/_init_paths.py
new file mode 100644
index 00000000..b1aea8fe
--- /dev/null
+++ b/demo/_init_paths.py
@@ -0,0 +1,27 @@
+# ------------------------------------------------------------------------------
+# pose.pytorch
+# Copyright (c) 2018-present Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Written by Bin Xiao (Bin.Xiao@microsoft.com)
+# ------------------------------------------------------------------------------
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os.path as osp
+import sys
+
+
+def add_path(path):
+    if path not in sys.path:
+        sys.path.insert(0, path)
+
+
+this_dir = osp.dirname(__file__)
+
+lib_path = osp.join(this_dir, '..', 'lib')
+add_path(lib_path)
+
+mm_path = osp.join(this_dir, '..', 'lib/poseeval/py-motmetrics')
+add_path(mm_path)
diff --git a/demo/demo.py b/demo/demo.py
new file mode 100644
index 00000000..d482e838
--- /dev/null
+++ b/demo/demo.py
@@ -0,0 +1,343 @@
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import argparse
+import csv
+import os
+import shutil
+
+from PIL import Image
+import torch
+import torch.nn.parallel
+import torch.backends.cudnn as cudnn
+import torch.optim
+import torch.utils.data
+import torch.utils.data.distributed
+import torchvision.transforms as transforms
+import torchvision
+import cv2
+import numpy as np
+import time
+
+
+import _init_paths
+import models
+from config import cfg
+from config import update_config
+from core.function import get_final_preds
+from utils.transforms import get_affine_transform
+
+COCO_KEYPOINT_INDEXES = {
+    0: 'nose',
+    1: 'left_eye',
+    2: 'right_eye',
+    3: 'left_ear',
+    4: 'right_ear',
+    5: 'left_shoulder',
+    6: 'right_shoulder',
+    7: 'left_elbow',
+    8: 'right_elbow',
+    9: 'left_wrist',
+    10: 'right_wrist',
+    11: 'left_hip',
+    12: 'right_hip',
+    13: 'left_knee',
+    14: 'right_knee',
+    15: 'left_ankle',
+    16: 'right_ankle'
+}
+
+COCO_INSTANCE_CATEGORY_NAMES = [
+    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
+    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
+    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
+    'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
+    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
+    'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
+    'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
+    'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
+    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
+    'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
+    'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
+    'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
+]
+
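+# pairs of COCO keypoint indices (see COCO_KEYPOINT_INDEXES) connected when drawing the skeleton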
+SKELETON = [
+    [1,3],[1,0],[2,4],[2,0],[0,5],[0,6],[5,7],[7,9],[6,8],[8,10],[5,11],[6,12],[11,12],[11,13],[13,15],[12,14],[14,16]
+]
+
+CocoColors = [[255, 0, 0], [255, 85, 0], [255, 170, 0], [255, 255, 0], [170, 255, 0], [85, 255, 0], [0, 255, 0],
+              [0, 255, 85], [0, 255, 170], [0, 255, 255], [0, 170, 255], [0, 85, 255], [0, 0, 255], [85, 0, 255],
+              [170, 0, 255], [255, 0, 255], [255, 0, 170], [255, 0, 85]]
+
+NUM_KPTS = 17
+
+CTX = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
+
+def draw_pose(keypoints, img):
+    """draw the keypoints and the skeleton.
+    :param keypoints: array of shape [17, 2]
+    :param img:
+    """
+    assert keypoints.shape == (NUM_KPTS,2)
+    for i in range(len(SKELETON)):
+        kpt_a, kpt_b = SKELETON[i][0], SKELETON[i][1]
+        x_a, y_a = keypoints[kpt_a][0],keypoints[kpt_a][1]
+        x_b, y_b = keypoints[kpt_b][0],keypoints[kpt_b][1] 
+        cv2.circle(img, (int(x_a), int(y_a)), 6, CocoColors[i], -1)
+        cv2.circle(img, (int(x_b), int(y_b)), 6, CocoColors[i], -1)
+        cv2.line(img, (int(x_a), int(y_a)), (int(x_b), int(y_b)), CocoColors[i], 2)
+
+def draw_bbox(box, img):
+    """draw the detected bounding box on the image.
+    :param box: [(x1, y1), (x2, y2)] corners of the box
+    :param img:
+    """
+    cv2.rectangle(img, (int(box[0][0]), int(box[0][1])), (int(box[1][0]), int(box[1][1])),
+                  color=(0, 255, 0), thickness=3)
+
+
+def get_person_detection_boxes(model, img, threshold=0.5):
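+    # run the torchvision detector and return [(x1, y1), (x2, y2)] boxes for detections classified as 'person'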
+    pred = model(img)
+    pred_classes = [COCO_INSTANCE_CATEGORY_NAMES[i]
+                    for i in list(pred[0]['labels'].cpu().numpy())]  # Predicted class names
+    pred_boxes = [[(i[0], i[1]), (i[2], i[3])]
+                  for i in list(pred[0]['boxes'].detach().cpu().numpy())]  # Bounding boxes
+    pred_score = list(pred[0]['scores'].detach().cpu().numpy())
+    if not pred_score or max(pred_score)<threshold:
+        return []
+    # Get list of index with score greater than threshold
+    pred_t = [pred_score.index(x) for x in pred_score if x > threshold][-1]
+    pred_boxes = pred_boxes[:pred_t+1]
+    pred_classes = pred_classes[:pred_t+1]
+
+    person_boxes = []
+    for idx, box in enumerate(pred_boxes):
+        if pred_classes[idx] == 'person':
+            person_boxes.append(box)
+
+    return person_boxes
+
+
+def get_pose_estimation_prediction(pose_model, image, center, scale):
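+    # crop the detected person with an affine transform, run the pose network,
+    # and decode the output heatmaps back to coordinates in the original image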
+    rotation = 0
+
+    # pose estimation transformation
+    trans = get_affine_transform(center, scale, rotation, cfg.MODEL.IMAGE_SIZE)
+    model_input = cv2.warpAffine(
+        image,
+        trans,
+        (int(cfg.MODEL.IMAGE_SIZE[0]), int(cfg.MODEL.IMAGE_SIZE[1])),
+        flags=cv2.INTER_LINEAR)
+    transform = transforms.Compose([
+        transforms.ToTensor(),
+        transforms.Normalize(mean=[0.485, 0.456, 0.406],
+                             std=[0.229, 0.224, 0.225]),
+    ])
+
+    # pose estimation inference
+    model_input = transform(model_input).unsqueeze(0)
+    # switch to evaluate mode
+    pose_model.eval()
+    with torch.no_grad():
+        # compute output heatmap
+        output = pose_model(model_input)
+        preds, _ = get_final_preds(
+            cfg,
+            output.clone().cpu().numpy(),
+            np.asarray([center]),
+            np.asarray([scale]))
+
+        return preds
+
+
+def box_to_center_scale(box, model_image_width, model_image_height):
+    """convert a box to center,scale information required for pose transformation
+    Parameters
+    ----------
+    box : list of tuple
+        list of length 2 with two tuples of floats representing
+        bottom left and top right corner of a box
+    model_image_width : int
+    model_image_height : int
+
+    Returns
+    -------
+    (numpy array, numpy array)
+        Two numpy arrays, coordinates for the center of the box and the scale of the box
+    """
+    center = np.zeros((2), dtype=np.float32)
+
+    bottom_left_corner = box[0]
+    top_right_corner = box[1]
+    box_width = top_right_corner[0]-bottom_left_corner[0]
+    box_height = top_right_corner[1]-bottom_left_corner[1]
+    bottom_left_x = bottom_left_corner[0]
+    bottom_left_y = bottom_left_corner[1]
+    center[0] = bottom_left_x + box_width * 0.5
+    center[1] = bottom_left_y + box_height * 0.5
+
+    aspect_ratio = model_image_width * 1.0 / model_image_height
+    pixel_std = 200
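+    # in this codebase, scale is expressed relative to a 200-pixel reference box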
+
+    if box_width > aspect_ratio * box_height:
+        box_height = box_width * 1.0 / aspect_ratio
+    elif box_width < aspect_ratio * box_height:
+        box_width = box_height * aspect_ratio
+    scale = np.array(
+        [box_width * 1.0 / pixel_std, box_height * 1.0 / pixel_std],
+        dtype=np.float32)
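+    # enlarge the box by 25% so the crop keeps some context around the person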
+    if center[0] != -1:
+        scale = scale * 1.25
+
+    return center, scale
+
+def parse_args():
+    parser = argparse.ArgumentParser(description='Train keypoints network')
+    # general
+    parser.add_argument('--cfg', type=str, default='demo/inference-config.yaml')
+    parser.add_argument('--video', type=str)
+    parser.add_argument('--webcam',action='store_true')
+    parser.add_argument('--image',type=str)
+    parser.add_argument('--write',action='store_true')
+    parser.add_argument('--showFps',action='store_true')
+
+    parser.add_argument('opts',
+                        help='Modify config options using the command-line',
+                        default=None,
+                        nargs=argparse.REMAINDER)
+
+    args = parser.parse_args()
+
+    # args expected by supporting codebase  
+    args.modelDir = ''
+    args.logDir = ''
+    args.dataDir = ''
+    args.prevModelDir = ''
+    return args
+
+
+def main():
+    # cudnn related setting
+    cudnn.benchmark = cfg.CUDNN.BENCHMARK
+    torch.backends.cudnn.deterministic = cfg.CUDNN.DETERMINISTIC
+    torch.backends.cudnn.enabled = cfg.CUDNN.ENABLED
+
+    args = parse_args()
+    update_config(cfg, args)
+
+    box_model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
+    box_model.to(CTX)
+    box_model.eval()
+
+    pose_model = eval('models.'+cfg.MODEL.NAME+'.get_pose_net')(
+        cfg, is_train=False
+    )
+
+    if cfg.TEST.MODEL_FILE:
+        print('=> loading model from {}'.format(cfg.TEST.MODEL_FILE))
+        pose_model.load_state_dict(torch.load(cfg.TEST.MODEL_FILE, map_location=CTX), strict=False)
+    else:
+        print('expected model defined in config at TEST.MODEL_FILE')
+
+    pose_model = torch.nn.DataParallel(pose_model, device_ids=cfg.GPUS)
+    pose_model.to(CTX)
+    pose_model.eval()
+
+    # Load a video, an image, or the webcam
+    if args.webcam:
+        vidcap = cv2.VideoCapture(0)
+    elif args.video:
+        vidcap = cv2.VideoCapture(args.video)
+    elif args.image:
+        image_bgr = cv2.imread(args.image)
+    else:
+        print('please use --video or --webcam or --image to define the input.')
+        return 
+
+    if args.webcam or args.video:
+        if args.write:
+            save_path = 'output.avi'
+            fourcc = cv2.VideoWriter_fourcc(*'XVID')
+            out = cv2.VideoWriter(save_path,fourcc, 24.0, (int(vidcap.get(3)),int(vidcap.get(4))))
+        while True:
+            ret, image_bgr = vidcap.read()
+            if ret:
+                last_time = time.time()
+                image = image_bgr[:, :, [2, 1, 0]]
+
+                input = []
+                img = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
+                img_tensor = torch.from_numpy(img/255.).permute(2,0,1).float().to(CTX)
+                input.append(img_tensor)
+
+                # object detection box
+                pred_boxes = get_person_detection_boxes(box_model, input, threshold=0.9)
+
+                # pose estimation
+                if len(pred_boxes) >= 1:
+                    for box in pred_boxes:
+                        center, scale = box_to_center_scale(box, cfg.MODEL.IMAGE_SIZE[0], cfg.MODEL.IMAGE_SIZE[1])
+                        image_pose = image.copy() if cfg.DATASET.COLOR_RGB else image_bgr.copy()
+                        pose_preds = get_pose_estimation_prediction(pose_model, image_pose, center, scale)
+                        if len(pose_preds)>=1:
+                            for kpt in pose_preds:
+                                draw_pose(kpt,image_bgr) # draw the poses
+
+                if args.showFps:
+                    fps = 1/(time.time()-last_time)
+                    img = cv2.putText(image_bgr, 'fps: '+ "%.2f"%(fps), (25, 40), cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 2)
+
+                if args.write:
+                    out.write(image_bgr)
+
+                cv2.imshow('demo',image_bgr)
+                if cv2.waitKey(1) & 0XFF==ord('q'):
+                    break
+            else:
+                print('cannot load the video.')
+                break
+
+        cv2.destroyAllWindows()
+        vidcap.release()
+        if args.write:
+            print('video has been saved as {}'.format(save_path))
+            out.release()
+
+    else:
+        # estimate on the image
+        last_time = time.time()
+        image = image_bgr[:, :, [2, 1, 0]]
+
+        input = []
+        img = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
+        img_tensor = torch.from_numpy(img/255.).permute(2,0,1).float().to(CTX)
+        input.append(img_tensor)
+
+        # object detection box
+        pred_boxes = get_person_detection_boxes(box_model, input, threshold=0.9)
+
+        # pose estimation
+        if len(pred_boxes) >= 1:
+            for box in pred_boxes:
+                center, scale = box_to_center_scale(box, cfg.MODEL.IMAGE_SIZE[0], cfg.MODEL.IMAGE_SIZE[1])
+                image_pose = image.copy() if cfg.DATASET.COLOR_RGB else image_bgr.copy()
+                pose_preds = get_pose_estimation_prediction(pose_model, image_pose, center, scale)
+                if len(pose_preds)>=1:
+                    for kpt in pose_preds:
+                        draw_pose(kpt,image_bgr) # draw the poses
+        
+        if args.showFps:
+            fps = 1/(time.time()-last_time)
+            img = cv2.putText(image_bgr, 'fps: '+ "%.2f"%(fps), (25, 40), cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 2)
+        
+        if args.write:
+            save_path = 'output.jpg'
+            cv2.imwrite(save_path,image_bgr)
+            print('the result image has been saved as {}'.format(save_path))
+
+        cv2.imshow('demo',image_bgr)
+        if cv2.waitKey(0) & 0XFF==ord('q'):
+            cv2.destroyAllWindows()
+        
+if __name__ == '__main__':
+    main()
diff --git a/demo/inference-config.yaml b/demo/inference-config.yaml
index 9e57cf20..14bce176 100644
--- a/demo/inference-config.yaml
+++ b/demo/inference-config.yaml
@@ -26,7 +26,7 @@ MODEL:
   INIT_WEIGHTS: true
   NAME: pose_hrnet
   NUM_JOINTS: 17
-  PRETRAINED: 'models/pytorch/imagenet/hrnet_w32-36af842e.pth'
+  PRETRAINED: 'models/pytorch/pose_coco/pose_hrnet_w32_384x288.pth'
   TARGET_TYPE: gaussian
   IMAGE_SIZE:
   - 288
@@ -112,7 +112,7 @@ TEST:
   BBOX_THRE: 1.0
   IMAGE_THRE: 0.0
   IN_VIS_THRE: 0.2
-  MODEL_FILE: ''
+  MODEL_FILE: 'models/pytorch/pose_coco/pose_hrnet_w32_384x288.pth'
   NMS_THRE: 1.0
   OKS_THRE: 0.9
   USE_GT_BBOX: true
diff --git a/demo/inference.py b/demo/inference.py
index bee22bf8..efff86a7 100644
--- a/demo/inference.py
+++ b/demo/inference.py
@@ -19,14 +19,20 @@
 import cv2
 import numpy as np
 
+import sys
+import time
+
+# add lib/ to the path relative to this file so the demo works from any working directory
+sys.path.append(os.path.join(os.path.dirname(__file__), "../lib"))
 
-import _init_paths
+# import _init_paths
 import models
 from config import cfg
 from config import update_config
-from core.function import get_final_preds
+from core.inference import get_final_preds
 from utils.transforms import get_affine_transform
 
+CTX = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
+
+
 COCO_KEYPOINT_INDEXES = {
     0: 'nose',
     1: 'left_eye',
@@ -67,57 +73,53 @@ def get_person_detection_boxes(model, img, threshold=0.5):
     pil_image = Image.fromarray(img)  # Load the image
     transform = transforms.Compose([transforms.ToTensor()])  # Defing PyTorch Transform
     transformed_img = transform(pil_image)  # Apply the transform to the image
-    pred = model([transformed_img])  # Pass the image to the model
+    pred = model([transformed_img.to(CTX)])  # Pass the image to the model
+    # Gather predicted classes, boxes and scores for every detection
     pred_classes = [COCO_INSTANCE_CATEGORY_NAMES[i]
-                    for i in list(pred[0]['labels'].numpy())]  # Get the Prediction Score
+                    for i in list(pred[0]['labels'].cpu().numpy())]  # Predicted class names
     pred_boxes = [[(i[0], i[1]), (i[2], i[3])]
-                  for i in list(pred[0]['boxes'].detach().numpy())]  # Bounding boxes
-    pred_score = list(pred[0]['scores'].detach().numpy())
-    if not pred_score:
-        return []
-    # Get list of index with score greater than threshold
-    pred_t = [pred_score.index(x) for x in pred_score if x > threshold][-1]
-    pred_boxes = pred_boxes[:pred_t+1]
-    pred_classes = pred_classes[:pred_t+1]
+                  for i in list(pred[0]['boxes'].cpu().detach().numpy())]  # Bounding boxes
+    pred_scores = list(pred[0]['scores'].cpu().detach().numpy())
 
     person_boxes = []
-    for idx, box in enumerate(pred_boxes):
-        if pred_classes[idx] == 'person':
-            person_boxes.append(box)
+    # Keep boxes that score above the threshold and are classified as person
+    for pred_class, pred_box, pred_score in zip(pred_classes, pred_boxes, pred_scores):
+        if (pred_score > threshold) and (pred_class == 'person'):
+            person_boxes.append(pred_box)
 
     return person_boxes
 
 
-def get_pose_estimation_prediction(pose_model, image, center, scale):
+def get_pose_estimation_prediction(pose_model, image, centers, scales, transform):
     rotation = 0
 
     # pose estimation transformation
-    trans = get_affine_transform(center, scale, rotation, cfg.MODEL.IMAGE_SIZE)
-    model_input = cv2.warpAffine(
-        image,
-        trans,
-        (int(cfg.MODEL.IMAGE_SIZE[0]), int(cfg.MODEL.IMAGE_SIZE[1])),
-        flags=cv2.INTER_LINEAR)
-    transform = transforms.Compose([
-        transforms.ToTensor(),
-        transforms.Normalize(mean=[0.485, 0.456, 0.406],
-                             std=[0.229, 0.224, 0.225]),
-    ])
-
-    # pose estimation inference
-    model_input = transform(model_input).unsqueeze(0)
-    # switch to evaluate mode
-    pose_model.eval()
-    with torch.no_grad():
-        # compute output heatmap
-        output = pose_model(model_input)
-        preds, _ = get_final_preds(
-            cfg,
-            output.clone().cpu().numpy(),
-            np.asarray([center]),
-            np.asarray([scale]))
-
-        return preds
+    model_inputs = []
+    for center, scale in zip(centers, scales):
+        trans = get_affine_transform(center, scale, rotation, cfg.MODEL.IMAGE_SIZE)
+        # Crop smaller image of people
+        model_input = cv2.warpAffine(
+            image,
+            trans,
+            (int(cfg.MODEL.IMAGE_SIZE[0]), int(cfg.MODEL.IMAGE_SIZE[1])),
+            flags=cv2.INTER_LINEAR)
+
+        # HWC image -> CHW tensor
+        model_input = transform(model_input)
+        model_inputs.append(model_input)
+
+    # stack n CHW tensors into an NCHW batch
+    model_inputs = torch.stack(model_inputs)
+
+    # compute output heatmap
+    output = pose_model(model_inputs.to(CTX))
+    coords, _ = get_final_preds(
+        cfg,
+        output.cpu().detach().numpy(),
+        np.asarray(centers),
+        np.asarray(scales))
+
+    return coords
 
 
 def box_to_center_scale(box, model_image_width, model_image_height):
@@ -163,15 +165,11 @@ def box_to_center_scale(box, model_image_width, model_image_height):
 
 
 def prepare_output_dirs(prefix='/output/'):
-    pose_dir = prefix+'poses/'
-    box_dir = prefix+'boxes/'
+    pose_dir = os.path.join(prefix, "pose")
     if os.path.exists(pose_dir) and os.path.isdir(pose_dir):
         shutil.rmtree(pose_dir)
-    if os.path.exists(box_dir) and os.path.isdir(box_dir):
-        shutil.rmtree(box_dir)
     os.makedirs(pose_dir, exist_ok=True)
-    os.makedirs(box_dir, exist_ok=True)
-    return pose_dir, box_dir
+    return pose_dir
 
 
 def parse_args():
@@ -199,6 +197,13 @@ def parse_args():
 
 
 def main():
+    # normalization applied to each person crop before it is fed to the pose network
+    pose_transform = transforms.Compose([
+        transforms.ToTensor(),
+        transforms.Normalize(mean=[0.485, 0.456, 0.406],
+                             std=[0.229, 0.224, 0.225]),
+    ])
+
     # cudnn related setting
     cudnn.benchmark = cfg.CUDNN.BENCHMARK
     torch.backends.cudnn.deterministic = cfg.CUDNN.DETERMINISTIC
@@ -206,13 +211,12 @@ def main():
 
     args = parse_args()
     update_config(cfg, args)
-    pose_dir, box_dir = prepare_output_dirs(args.outputDir)
-    csv_output_filename = args.outputDir+'pose-data.csv'
+    pose_dir = prepare_output_dirs(args.outputDir)
     csv_output_rows = []
 
     box_model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
+    box_model.to(CTX)
     box_model.eval()
-
     pose_model = eval('models.'+cfg.MODEL.NAME+'.get_pose_net')(
         cfg, is_train=False
     )
@@ -223,7 +227,8 @@ def main():
     else:
         print('expected model defined in config at TEST.MODEL_FILE')
 
-    pose_model = torch.nn.DataParallel(pose_model, device_ids=cfg.GPUS).cuda()
+    pose_model.to(CTX)
+    pose_model.eval()
 
     # Loading an video
     vidcap = cv2.VideoCapture(args.videoFile)
@@ -231,68 +236,105 @@ def main():
     if fps < args.inferenceFps:
         print('desired inference fps is '+str(args.inferenceFps)+' but video fps is '+str(fps))
         exit()
-    every_nth_frame = round(fps/args.inferenceFps)
+    skip_frame_cnt = round(fps / args.inferenceFps)
+    frame_width = int(vidcap.get(cv2.CAP_PROP_FRAME_WIDTH))
+    frame_height = int(vidcap.get(cv2.CAP_PROP_FRAME_HEIGHT))
+    outcap = cv2.VideoWriter('{}/{}_pose.avi'.format(args.outputDir, os.path.splitext(os.path.basename(args.videoFile))[0]),
+                             cv2.VideoWriter_fourcc('M', 'J', 'P', 'G'), int(args.inferenceFps), (frame_width, frame_height))
 
-    success, image_bgr = vidcap.read()
     count = 0
+    while vidcap.isOpened():
+        total_now = time.time()
+        ret, image_bgr = vidcap.read()
+        count += 1
 
-    while success:
-        if count % every_nth_frame != 0:
-            success, image_bgr = vidcap.read()
-            count += 1
+        if not ret:
+            break
 
-        image = image_bgr[:, :, [2, 1, 0]]
-        count_str = str(count).zfill(32)
+        if count % skip_frame_cnt != 0:
+            continue
+
+        image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
+
+        # Clone 2 image for person detection and pose estimation
+        if cfg.DATASET.COLOR_RGB:
+            image_per = image_rgb.copy()
+            image_pose = image_rgb.copy()
+        else:
+            image_per = image_bgr.copy()
+            image_pose = image_bgr.copy()
+
+        # Clone 1 image for debugging purpose
+        image_debug = image_bgr.copy()
 
         # object detection box
-        pred_boxes = get_person_detection_boxes(box_model, image, threshold=0.8)
-        if args.writeBoxFrames:
-            image_bgr_box = image_bgr.copy()
-            for box in pred_boxes:
-                cv2.rectangle(image_bgr_box, box[0], box[1], color=(0, 255, 0),
-                              thickness=3)  # Draw Rectangle with the coordinates
-            cv2.imwrite(box_dir+'box%s.jpg' % count_str, image_bgr_box)
+        now = time.time()
+        pred_boxes = get_person_detection_boxes(box_model, image_per, threshold=0.9)
+        then = time.time()
+        print("Find person bbox in: {} sec".format(then - now))
+
+        # No people detected; move on to the next frame
         if not pred_boxes:
-            success, image_bgr = vidcap.read()
             count += 1
             continue
 
-        # pose estimation
-        box = pred_boxes[0]  # assume there is only 1 person
-        center, scale = box_to_center_scale(box, cfg.MODEL.IMAGE_SIZE[0], cfg.MODEL.IMAGE_SIZE[1])
-        image_pose = image.copy() if cfg.DATASET.COLOR_RGB else image_bgr.copy()
-        pose_preds = get_pose_estimation_prediction(pose_model, image_pose, center, scale)
+        if args.writeBoxFrames:
+            for box in pred_boxes:
+                cv2.rectangle(image_debug, (int(box[0][0]), int(box[0][1])), (int(box[1][0]), int(box[1][1])),
+                              color=(0, 255, 0), thickness=3)  # draw the detection box
+
+        # pose estimation : for multiple people
+        centers = []
+        scales = []
+        for box in pred_boxes:
+            center, scale = box_to_center_scale(box, cfg.MODEL.IMAGE_SIZE[0], cfg.MODEL.IMAGE_SIZE[1])
+            centers.append(center)
+            scales.append(scale)
+
+        now = time.time()
+        pose_preds = get_pose_estimation_prediction(pose_model, image_pose, centers, scales, transform=pose_transform)
+        then = time.time()
+        print("Find person pose in: {} sec".format(then - now))
 
         new_csv_row = []
-        for _, mat in enumerate(pose_preds[0]):
-            x_coord, y_coord = int(mat[0]), int(mat[1])
-            cv2.circle(image_bgr, (x_coord, y_coord), 4, (255, 0, 0), 2)
-            new_csv_row.extend([x_coord, y_coord])
+        new_csv_row.append(count)
+        for coords in pose_preds:
+            # Draw each point on image
+            for coord in coords:
+                x_coord, y_coord = int(coord[0]), int(coord[1])
+                cv2.circle(image_debug, (x_coord, y_coord), 4, (255, 0, 0), 2)
+                new_csv_row.extend([x_coord, y_coord])
+
+        total_then = time.time()
+
+        text = "{:03.2f} sec".format(total_then - total_now)
+        cv2.putText(image_debug, text, (100, 50), cv2.FONT_HERSHEY_SIMPLEX,
+                            1, (0, 0, 255), 2, cv2.LINE_AA)
+
+        cv2.imshow("pos", image_debug)
+        if cv2.waitKey(1) & 0xFF == ord('q'):
+            break
 
         csv_output_rows.append(new_csv_row)
-        cv2.imwrite(pose_dir+'pose%s.jpg' % count_str, image_bgr)
+        img_file = os.path.join(pose_dir, 'pose_{:08d}.jpg'.format(count))
+        cv2.imwrite(img_file, image_debug)
+        outcap.write(image_debug)
 
-        # get next frame
-        success, image_bgr = vidcap.read()
-        count += 1
 
     # write csv
     csv_headers = ['frame']
     for keypoint in COCO_KEYPOINT_INDEXES.values():
         csv_headers.extend([keypoint+'_x', keypoint+'_y'])
 
+    csv_output_filename = os.path.join(args.outputDir, 'pose-data.csv')
     with open(csv_output_filename, 'w', newline='') as csvfile:
         csvwriter = csv.writer(csvfile)
         csvwriter.writerow(csv_headers)
         csvwriter.writerows(csv_output_rows)
 
-    os.system("ffmpeg -y -r "
-              + str(args.inferenceFps)
-              + " -pattern_type glob -i '"
-              + pose_dir
-              + "/*.jpg' -c:v libx264 -vf fps="
-              + str(args.inferenceFps)+" -pix_fmt yuv420p /output/movie.mp4")
+    vidcap.release()
+    outcap.release()
+
+    cv2.destroyAllWindows()
 
 
 if __name__ == '__main__':
diff --git a/demo/inference_1.jpg b/demo/inference_1.jpg
new file mode 100644
index 00000000..2ca29d1b
Binary files /dev/null and b/demo/inference_1.jpg differ
diff --git a/demo/inference_3.jpg b/demo/inference_3.jpg
new file mode 100644
index 00000000..7f20915c
Binary files /dev/null and b/demo/inference_3.jpg differ
diff --git a/demo/inference_5.jpg b/demo/inference_5.jpg
new file mode 100644
index 00000000..d7b7117c
Binary files /dev/null and b/demo/inference_5.jpg differ
diff --git a/demo/inference_6.jpg b/demo/inference_6.jpg
new file mode 100644
index 00000000..cc1183c0
Binary files /dev/null and b/demo/inference_6.jpg differ
diff --git a/demo/inference_7.jpg b/demo/inference_7.jpg
new file mode 100644
index 00000000..9629b300
Binary files /dev/null and b/demo/inference_7.jpg differ