|
1 |
| -# Inception-v3 speed: Raspberry Pi 3 vs 2013 MacBook Pro |
| 1 | +# Inception-v3 speed: Raspberry Pi 3 |
| 2 | + |
| 3 | +_Latest update: December 1, 2016; TensorFlow 0.11.0_ |
2 | 4 |
|
3 | 5 | ## About
|
4 | 6 |
|
5 |
| -This file contains some very basic run-time statistics for [TensorFlow's pre-trained Inception-v3 model](https://www.tensorflow.org/versions/r0.7/tutorials/image_recognition/index.html) running on a [Raspberry Pi 3 Model B](https://www.raspberrypi.org/products/raspberry-pi-3-model-b/) as compared to an [Early 2013 15 inch Retina MacBook Pro](https://support.apple.com/kb/SP669?locale=en_US). |
| 7 | +This file contains some very basic run-time statistics for [TensorFlow's pre-trained Inception-v3 model](https://www.tensorflow.org/versions/r0.7/tutorials/image_recognition/index.html) running on a [Raspberry Pi 3 Model B](https://www.raspberrypi.org/products/raspberry-pi-3-model-b/) as compared to an [Early 2013 15 inch Retina MacBook Pro](https://support.apple.com/kb/SP669?locale=en_US) with an Intel i7-3740QM CPU as well as a desktop rig running Ubuntu 14.04 with a Titan X Maxwell GPU and Intel i7-5820K CPU. |
6 | 8 |
|
7 |
| -Out of the box, Inception-v3 is available to run from either a Python script [(tensorflow/models/image/imagenet/classify_image.py)](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/models/image/imagenet) or as a compiled C++ binary [(tensorflow/examples/label_image/main.cc)](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/label_image). To get the rough benchmarks used in this file, I made minor modifications in both files to print out run-time information after processing. The modified files are available in this directory: [classify\_image\_timed.py](classify_image_timed.py) and [main.cc](main.cc) Both tests used the default grace_hopper.jpg image used in the Inception-v3 C++ file. |
| 9 | +To run this benchmark, I use a modified version of the example [classify_image.py script](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/models/image/imagenet). I made minor modifications to collect and print out run-time information after processing. The modified file is available here: [classify\_image\_timed.py](classify_image_timed.py). |
8 | 10 |
|
9 | 11 | ## Summary
|
10 | 12 |
|
11 |
| -* _Build_ refers to the amount of time it took to load and build the Inception-v3 graph from storage |
12 |
| -* _Eval_ refers to the amount of time it took to classify the image once it was loaded into memory |
13 |
| -* _Total_ is the sum of _Build_ and _Eval_ |
| 13 | +* _warmup_runs_ refers to the number of calls to `Session.run` made before benchmarking starts, in order to warm up the model. TensorFlow makes adjustments on the fly, so the first few runs of the model are slower than subsequent ones. |
| 14 | +* A _run_ is the time between the start of a call to `Session.run` and when it returns. We list the best, worst, and average time (averaged over 25 runs) |
| 15 | +* _Build_ is the amount of time spent constructing the Inception model from the protobuf file. |
14 | 16 |
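The measurement described above can be sketched roughly as follows. This is a minimal illustration, not the code in `classify_image_timed.py`: the `benchmark` helper and the callable `fn` (standing in for one `Session.run` call) are hypothetical names, and it assumes Python 3 for `time.perf_counter`.

```python
import time


def benchmark(fn, num_runs=25, warmup_runs=10):
    """Time `fn` and return (best, worst, average) in seconds.

    `fn` is a hypothetical stand-in for a single `Session.run` call.
    """
    if num_runs < 1:
        raise ValueError("num_runs must be at least 1")

    # Warmup calls are executed but deliberately not timed.
    for _ in range(warmup_runs):
        fn()

    # Timed runs: record the wall-clock duration of each call.
    durations = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)

    return min(durations), max(durations), sum(durations) / len(durations)


if __name__ == "__main__":
    # Placeholder workload; replace with the real model call.
    best, worst, average = benchmark(lambda: sum(range(100000)))
    print("best=%.4f worst=%.4f average=%.4f" % (best, worst, average))
```

Setting `warmup_runs=0` in this sketch means the first, slowest call lands in the timed sample, which is the effect visible in the "Worst run" column of the table.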
|
15 | 17 | <table>
|
16 | 18 |
|
17 | 19 | <tr>
|
18 |
| - <td></td> |
19 |
| - <th colspan="3">Python (CPU)</th> |
20 |
| - <th colspan="3">C++ (CPU)</th> |
| 20 | + <th colspan="6"><i>TensorFlow version 0.11.0</i></th> |
21 | 21 | </tr>
|
22 |
| - |
| 22 | + |
23 | 23 | <tr>
|
24 |
| - <td><i>Model</i></td> |
25 |
| - <td><i>Build (sec)</i></td> |
26 |
| - <td><i>Eval (sec)</i></td> |
27 |
| - <td><i>Total (sec)</i></td> |
28 |
| - <td><i>Build (sec)</i></td> |
29 |
| - <td><i>Eval (sec)</i></td> |
30 |
| - <td><i>Total (sec)</i></td> |
| 24 | + <td></td> |
| 25 | + <th><i>Model</i></th> |
| 26 | + <th><i>Best run (sec)</i></th> |
| 27 | + <th><i>Worst run (sec)</i></th> |
| 28 | + <th><i>Average run (sec)</i></th> |
| 29 | + <th><i>Build time (sec)</i></th> |
31 | 30 | </tr>
|
32 |
| - |
| 31 | + |
33 | 32 | <tr>
|
34 |
| - <td>Raspberry Pi 3</td> |
35 |
| - <td>3.496</td> |
36 |
| - <td>11.004</td> |
37 |
| - <td>14.500</td> |
38 |
| - <td>0.436</td> |
39 |
| - <td>6.969</td> |
40 |
| - <td>7.405</td> |
| 33 | + <th rowspan="4"><b>warmup_runs=10</b></th> |
| 34 | + <td><b>Raspberry Pi 3</b></td> |
| 35 | + <td><b>1.8646</b></td> |
| 36 | + <td><b>2.1782</b></td> |
| 37 | + <td><b>1.9805</b></td> |
| 38 | + <td><b>4.8962</b></td> |
41 | 39 | </tr>
|
42 |
| - |
| 40 | + |
43 | 41 | <tr>
|
44 |
| - <td>2013 MacBook Pro</td> |
45 |
| - <td>0.747</td> |
46 |
| - <td>1.421</td> |
47 |
| - <td>2.168</td> |
48 |
| - <td>0.253</td> |
49 |
| - <td>6.036</td> |
50 |
| - <td>6.289</td> |
| 42 | + <td>Intel i7-3740QM (Early 2013 Retina MacBook Pro)</td> |
| 43 | + <td>0.2146</td> |
| 44 | + <td>0.2425</td> |
| 45 | + <td>0.2272</td> |
| 46 | + <td>1.3104</td> |
51 | 47 | </tr>
|
52 |
| - |
| 48 | + |
53 | 49 | <tr>
|
54 |
| - <td>Time increase on Raspberry Pi</td> |
55 |
| - <td>4.68x</td> |
56 |
| - <td><b>7.744x</b></td> |
57 |
| - <td>6.688x</td> |
58 |
| - <td>1.723x</td> |
59 |
| - <td><b>1.155x</b></td> |
60 |
| - <td>1.177x</td> |
| 50 | + <td>Intel i7-5820K (Ubuntu 14.04)</td> |
| 51 | + <td>0.1397</td> |
| 52 | + <td>0.1730</td> |
| 53 | + <td>0.1567</td> |
| 54 | + <td>0.7064</td> |
61 | 55 | </tr>
|
62 |
| - |
63 |
| -</table> |
64 | 56 |
|
65 |
| -### Remarks |
| 57 | + <tr> |
| 58 | + <td>NVIDIA Titan X (Maxwell), Intel i7-5820K (Ubuntu 14.04)</td> |
| 59 | + <td>0.0240</td> |
| 60 | + <td>0.0290</td> |
| 61 | + <td>0.0259</td> |
| 62 | + <td>0.9566</td> |
| 63 | + </tr> |
66 | 64 |
|
67 |
| -* The good-ish news: the RPi3 appears to achieve fair performance relative to the MacBook Pro when running the compiled C++ binary |
68 |
| -* The bad news: The Python version is **really** slow. From just this test, I can't tell if the Python bindings to C++ aren't working properly, but I think it's definitely worth looking into |
69 |
| - * Dan Brickley (@danbri) shared some results when [testing out a camera module on his Raspberry Pi 3](https://twitter.com/danbri/status/709903532216995842). Direct link to Gist [here](https://gist.githubusercontent.com/danbri/ee6323d78ca14e616e4e/raw/6f50a897a59cb25d6c5e8f43fdfb0392fe9945d8/gistfile1.txt) |
70 |
| - * Pete Warden (@petewarden) mentioned that the compiler [may not be using NEON](https://github.com/tensorflow/tensorflow/issues/445#issuecomment-196021885) on the Raspberry Pi 2 while attempting to build TensorFlow. While my tests did not take a minute to run, @danbri's results suggest similar performance to @petewarden's; this may be a first place to look for improvements |
71 |
| -* On Mac, the Python version appears to run _much_ faster than the C++ binary. I'm not quite sure how this happened. I'd like to test on other systems to see if the results hold. |
72 |
| -* During the first run of the C++ binary after booting the system, there was a noticable slowdown during the 'model building' |
| 65 | + <tr> |
| 66 | + <th rowspan="4"><b>warmup_runs=0</b></th> |
| 67 | + <td><b>Raspberry Pi 3</b></td> |
| 68 | + <td><b>1.8541</b></td> |
| 69 | + <td><b>6.3338</b></td> |
| 70 | + <td><b>2.0656</b></td> |
| 71 | + <td><b>4.9755</b></td> |
| 72 | + </tr> |
73 | 73 |
|
74 |
| -## Outputs |
| 74 | + <tr> |
| 75 | + <td>Intel i7-3740QM (Early 2013 Retina MacBook Pro)</td> |
| 76 | + <td>0.2174</td> |
| 77 | + <td>1.3151</td> |
| 78 | + <td>0.2662</td> |
| 79 | + <td>1.2761</td> |
| 80 | + </tr> |
75 | 81 |
|
76 |
| -### Python |
| 82 | + <tr> |
| 83 | + <td>Intel i7-5820K (Ubuntu 14.04)</td> |
| 84 | + <td>0.1435</td> |
| 85 | + <td>0.7027</td> |
| 86 | + <td>0.1750</td> |
| 87 | + <td>0.7103</td> |
| 88 | + </tr> |
77 | 89 |
|
78 |
| -#### Raspberry Pi 3, Raspbian 8.0 |
79 |
| -``` |
80 |
| -giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca (score = 0.89233) |
81 |
| -indri, indris, Indri indri, Indri brevicaudatus (score = 0.00859) |
82 |
| -lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens (score = 0.00264) |
83 |
| -custard apple (score = 0.00141) |
84 |
| -earthstar (score = 0.00107) |
85 |
| -Build graph time: 3.495808 |
86 |
| -Eval time: 11.004332 |
87 |
| -``` |
| 90 | + <tr> |
| 91 | + <td>NVIDIA Titan X (Maxwell), Intel i7-5820K (Ubuntu 14.04)</td> |
| 92 | + <td>0.0232</td> |
| 93 | + <td>1.5800</td> |
| 94 | + <td>0.0871</td> |
| 95 | + <td>0.7659</td> |
| 96 | + </tr> |
88 | 97 |
|
89 |
| -#### Early 2013, 15-inch MacBook Pro (2.7 GHz Intel Core i7), OS X 10.11.1 |
| 98 | +</table> |
90 | 99 |
|
91 |
| -``` |
92 |
| -giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca (score = 0.89233) |
93 |
| -indri, indris, Indri indri, Indri brevicaudatus (score = 0.00859) |
94 |
| -lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens (score = 0.00264) |
95 |
| -custard apple (score = 0.00141) |
96 |
| -earthstar (score = 0.00107) |
97 |
| -Build graph time: 0.746465 |
98 |
| -Eval time: 1.421328 |
99 |
| -``` |
| 100 | +### Remarks |
100 | 101 |
|
101 |
| ---- |
| 102 | +* Test performance has gotten significantly better over the past several releases of TensorFlow, though running Inception on a Raspberry Pi still takes longer than a second when using Python |
| 103 | +* Warming up your `Session` is _crucial_. There have been many issues opened in this repo asking how to improve performance, so here's the number one thing to start with: keep your `Session` persistent to take advantage of automatic optimization tweaks. |
| 104 | +* Along the same lines: do _not_ simply call your Python script from bash every time you want to classify an image. Rebuilding the Inception graph from scratch takes multiple seconds, which can make each classification several times slower (this test doesn't include the time it takes to import `tensorflow`, which is another thing to benchmark). This goes for pretty much any TensorFlow model you use: keep some sort of rudimentary server running that can respond to requests and use a live TensorFlow `Session`. |
| 105 | +* Running the [TensorFlow benchmark tool](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/benchmark) shows sub-second (~500-600ms) average run times for the Raspberry Pi (I'll need to do another write-up with more details). Since this benchmark is run entirely in C++, we'd expect it to run faster than through Python. The question is whether or not all ~1.5 seconds of difference between these tests is entirely due to the communication layer between Python and the C++ core. |
102 | 106 |
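The "build once, classify many times" advice from the remarks can be sketched without TensorFlow at all. Everything here is hypothetical: `PersistentClassifier` and `fake_build` are illustrative names, and `fake_build` only pretends to be the expensive step of loading the Inception graph and creating a `Session`. Python 3 is assumed.

```python
import time


class PersistentClassifier(object):
    """Builds an expensive model once and reuses it for every request."""

    def __init__(self, build_fn):
        start = time.perf_counter()
        self._model = build_fn()  # the expensive, one-time step
        self.build_time = time.perf_counter() - start

    def classify(self, image):
        # Every request reuses the already-built model.
        return self._model(image)


def fake_build():
    """Hypothetical stand-in for building the graph and opening a Session."""
    time.sleep(0.05)  # pretend that building takes a while
    return lambda image: "label-for-%s" % image


if __name__ == "__main__":
    classifier = PersistentClassifier(fake_build)
    for name in ["a.jpg", "b.jpg", "c.jpg"]:
        print(classifier.classify(name))
    print("build cost paid once: %.3f sec" % classifier.build_time)
```

In a real setup the long-lived object would sit behind whatever request mechanism you prefer (a socket, a queue, a small HTTP server); the point of the sketch is only that the build cost is paid once per process rather than once per image.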
|
103 |
| -### C++ |
| 107 | +## About `classify_image_timed.py` |
104 | 108 |
|
105 |
| -#### Raspberry Pi 3, Raspbian 8.0 |
106 |
| -``` |
107 |
| -I tensorflow/examples/label_image/main.cc:210] military uniform (866): 0.647298 |
108 |
| -I tensorflow/examples/label_image/main.cc:210] suit (794): 0.0477194 |
109 |
| -I tensorflow/examples/label_image/main.cc:210] academic gown (896): 0.0232409 |
110 |
| -I tensorflow/examples/label_image/main.cc:210] bow tie (817): 0.0157354 |
111 |
| -I tensorflow/examples/label_image/main.cc:210] bolo tie (940): 0.0145024 |
112 |
| -
|
113 |
| -# First time running after booting system: |
114 |
| -4450 milliseconds to build graph |
115 |
| -7005 milliseconds to evaluate image |
116 |
| -
|
117 |
| -# Subsequent time |
118 |
| -436 milliseconds to build graph |
119 |
| -6969 milliseconds to evaluate image |
120 |
| -``` |
| 109 | +I added two flags to `classify_image_timed.py` that let users change the number of test runs (runs that collect timing information) and the number of "warmup" runs. Pass a number to `--num_runs` or `--warmup_runs` when calling the script: |
121 | 110 |
|
122 |
| -#### Early 2013, 15-inch MacBook Pro (2.7 GHz Intel Core i7), OS X 10.11.1 |
| 111 | +```bash |
| 112 | +# Use a sample size of 100 runs |
| 113 | +$ python classify_image_timed.py --num_runs=100 |
123 | 114 |
|
| 115 | +# Don't include any warmup runs |
| 116 | +$ python classify_image_timed.py --warmup_runs=0 |
124 | 117 | ```
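For readers adding similar flags to their own scripts, here is one way the two options could be parsed. This is a sketch using the standard library's `argparse`, not necessarily how `classify_image_timed.py` defines them. The default of 25 runs matches the sample size quoted in the summary, while the default of 10 warmup runs is an assumption.

```python
import argparse


def parse_args(argv=None):
    """Parse the two timing flags; `argv=None` falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Timing flags (illustrative sketch)")
    parser.add_argument("--num_runs", type=int, default=25,
                        help="number of timed runs to collect")
    parser.add_argument("--warmup_runs", type=int, default=10,
                        help="number of untimed runs before timing starts (assumed default)")
    return parser.parse_args(argv)


# Example: parse an explicit argument list instead of the real command line.
args = parse_args(["--num_runs=100", "--warmup_runs=0"])
print("num_runs=%d warmup_runs=%d" % (args.num_runs, args.warmup_runs))
```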
|
125 |
| -I tensorflow/examples/label_image/main.cc:210] military uniform (866): 0.647299 |
126 |
| -I tensorflow/examples/label_image/main.cc:210] suit (794): 0.0477195 |
127 |
| -I tensorflow/examples/label_image/main.cc:210] academic gown (896): 0.0232407 |
128 |
| -I tensorflow/examples/label_image/main.cc:210] bow tie (817): 0.0157355 |
129 |
| -I tensorflow/examples/label_image/main.cc:210] bolo tie (940): 0.0145023 |
130 |
| -
|
131 |
| -# First running time after booting system: |
132 |
| -468 milliseconds to build graph |
133 |
| -6124 milliseconds to evaluate image |
134 |
| -
|
135 |
| -# Subsequent running times |
136 |
| -253 milliseconds to build graph |
137 |
| -6036 milliseconds to evaluate image |
138 |
| -``` |