---
title: "Stan Notebooks in the Cloud"
author: "Mitzi Morris"
output:
  html_document:
    toc: yes
    toc_depth: 2
  pdf_document:
    toc: yes
    toc_depth: '2'
---
<style type="text/css">
.table { width: 40%; }
blockquote {
padding: 10px 20px;
margin: 0 0 20px;
font-size: 0.95em;
border-left: 5px solid #eee;
}
p.caption {
font-size: 0.9em;
font-style: italic;
margin-right: 10%;
margin-left: 10%;
text-align: justify;
}
</style>
```{r setup, include=FALSE}
knitr::opts_chunk$set(cache=TRUE, message=FALSE, error=FALSE, warning=FALSE, comment=NA, out.width='90%', tidy.opts=list(width.cutoff=60), tidy=TRUE, fig.pos='H')
def.chunk.hook <- knitr::knit_hooks$get("chunk")
knitr::knit_hooks$set(chunk = function(x, options) {
x <- def.chunk.hook(x, options)
ifelse(options$size != "normalsize", paste0("\\", options$size,"\n\n", x, "\n\n \\normalsize"), x)
})
```
Putting your Stan model and data into a Jupyter notebook
lets your audience work through your analysis step by step.
The challenge in a demo, talk, or classroom situation
is getting everyone in the room on the same page
when that page is your presentation notebook and you want it to
display properly and be runnable from their browser.
Unfortunately, the law of conservation of energy mandates that
the easier things are for your audience, the harder they are for you.
This report, with plenty of screenshots, takes you through the tedious details of setting up
a Jupyter Notebook so that anyone with a modern web browser and a Google account
can run your Stan analysis on Google Colaboratory's free cloud servers.
Inasmuch as clouds are always
moving and changing, I've titled this report "cloud-compute-2020".
While the screenshots may have a sell-by date of Q3 2020,
the challenges will remain.
### Introducing CmdStanR and CmdStanPy!
The example notebooks for R and Python in this report use
two new lightweight interfaces to Stan:
[CmdStanR](https://mc-stan.org/cmdstanr/articles/cmdstanr.html) and
[CmdStanPy](https://cmdstanpy.readthedocs.io/en/latest/index.html).
They were developed with the following goals:

- Simplicity and modularity: these packages wrap CmdStan and
just provide functions to compile models, do inference,
and assemble and save the results;
other packages are needed for downstream analysis.
- Keep up with Stan releases: these interfaces can use
any (recent) version of CmdStan, including the current release, Stan 2.22.
- Quick and easy installation: minimal dependencies on other packages and no direct calls to C++.
- Flexible licensing: BSD-3.
### Jupyter 101
A Jupyter notebook consists of blocks of markdown text interleaved
with blocks of code, unifying the exposition of ideas and arguments
with the necessary supporting data, computation, and visualizations.
The three core programming languages supported are
Julia, Python, and R, hence the name, *Ju-Pyt-R*.
In order to author a Jupyter notebook on your machine, you need a local install
of both Python and Jupyter, as outlined in the
[Jupyter installation instructions](https://jupyter.readthedocs.io/en/latest/install.html).
Once the Jupyter server is running, you can then run existing notebooks and create new
notebooks via your favorite web browser.
A notebook document is a JSON file with suffix `.ipynb` which
contains a list of cells, one cell per content block (code, text or image),
and a dictionary of metadata which specifies the kernel used to run
the notebook, here, either R or Python.
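
To make the file layout concrete, here is a minimal sketch that parses a tiny hand-written notebook document and reports its kernel and cell types. The notebook contents and the helper name `summarize_notebook` are made up for illustration; only the top-level keys (`cells`, `metadata`, `kernelspec`) follow the notebook format described above.

```python
import json
from collections import Counter

# A hand-written, minimal notebook document (nbformat 4 layout).
# The cell contents are invented for illustration.
NOTEBOOK_JSON = """
{
  "nbformat": 4,
  "nbformat_minor": 4,
  "metadata": {
    "kernelspec": {"name": "python3", "display_name": "Python 3", "language": "python"}
  },
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["# Hello notebook"]},
    {"cell_type": "code", "metadata": {}, "execution_count": null,
     "outputs": [], "source": ["print('hello')"]}
  ]
}
"""


def summarize_notebook(notebook):
    """Return the kernel name and a count of cells by type."""
    kernel = notebook["metadata"]["kernelspec"]["name"]
    cell_types = Counter(cell["cell_type"] for cell in notebook["cells"])
    return kernel, dict(cell_types)


notebook = json.loads(NOTEBOOK_JSON)
kernel, cell_counts = summarize_notebook(notebook)
print(kernel, cell_counts)
```

The same function should work on any `.ipynb` file read with `json.load`, since the file is plain JSON.
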
When viewed in a browser, the notebook is displayed as an HTML page where
the contents of the text cells are rendered by default while
the code cells are displayed with controls which allow them to be executed independently.
Additional controls allow you to create, edit,
and publish notebooks as HTML or pdf documents.
```{r figHelloNotebook, echo=FALSE, out.width = "75%", fig.cap="Jupyter notebook in the browser, local Jupyter server"}
library(knitr)
include_graphics("img/hello_notebook.png")
```
By distributing an `ipynb` notebook file together with your
Stan model and data, your audience can replicate your analysis on their machine.
But in order to do this, they must have a local install of
the Jupyter notebook server or other IDE that can run the notebook.
This is not always possible; they might
not have enough permissions to install software on their computer
or their machine might not have enough memory or a powerful enough CPU to run Stan.
You could give them access to your machine by running a
[Jupyter notebook server](https://jupyter-notebook.readthedocs.io/en/stable/public_server.html)
for single-user access or
[JupyterHub server](https://jupyterhub.readthedocs.io/en/latest/) for multiple-user access,
but this requires bandwidth, compute power, and careful attention to
[security](https://jupyter-notebook.readthedocs.io/en/stable/public_server.html#notebook-server-security).
Regarding security, here's some wisdom from the [Jupyter blog](https://blog.jupyter.org/public-notebooks-and-security-3058c433c884)
(emphasis added):

> It is important to keep in mind that, since **arbitrary code execution is the main point of IPython and Jupyter**, running a publicly accessible server without authentication or encryption is a very bad idea indeed.

Hence the need for Jupyter servers in the cloud: many instances of
someone else's server where the audience can run your notebook.
### Challenge: Free Servers
It is a truth, universally acknowledged,
that there ain't no such thing as a free lunch.
Nonetheless, as of 2020, Google is providing
free email via Gmail,
free file storage via Google Drive,
and free Jupyter notebook servers via
[Google Colab](https://en.wikipedia.org/wiki/Project_Jupyter#Colaboratory).
To use these services, you must sign up for a [Google account](https://myaccount.google.com/intro).
The Colab server instances are limited, as is the amount of storage on your Google Drive,
and should you use Gmail, Google examines all messages and serves ads accordingly;
there ain't no such thing as a free lunch, Q.E.D.
#### Google Colaboratory 101
Google Colaboratory, nicknamed Colab,
is a free Jupyter notebook environment that runs in the cloud,
i.e., on remote Google servers.
Google makes no promises about VM availability or resources
(see the [Colab FAQ](https://research.google.com/colaboratory/faq.html));
however, you can inspect the virtual machine instance that a notebook
is running on via system commands.
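
As a rough illustration of this kind of inspection, the sketch below collects a few machine facts using only the Python standard library. It is not the contents of the notebook discussed in this report; the function name `vm_specs` and the choice of fields are assumptions made for this example.

```python
import os
import platform
import shutil


def vm_specs(path="/"):
    """Collect basic facts about the machine running this notebook."""
    specs = {
        "os": platform.platform(),
        "cpu_count": os.cpu_count(),
        "disk_free_gb": round(shutil.disk_usage(path).free / 1e9, 1),
    }
    # /proc/meminfo exists on Linux only; skip the RAM entry elsewhere.
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    kib = int(line.split()[1])  # value is reported in KiB
                    specs["ram_gb"] = round(kib * 1024 / 1e9, 1)
                    break
    return specs


specs = vm_specs()
for name, value in specs.items():
    print(f"{name}: {value}")
```
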
To see how this works, I created a notebook called `show_VM_specs.ipynb` which is stored on my
Google Drive in a folder called 'Colab_Notebooks'.
In order to run this notebook, I can either open it with Google Colab or download it
and run it locally.
```{r figColabOpen, echo=FALSE, out.width = "75%", fig.cap="Open a notebook in Google Colab"}
library(knitr)
include_graphics("img/gdrive_open_notebook.png")
```
The Colab UI is just different enough from the Jupyter UI to be confusing.
Execution of each cell is controlled by a "play button" icon.
```{r figColabStep1, echo=FALSE, out.width = "75%", fig.cap="Jupyter notebook in Colab"}
library(knitr)
include_graphics("img/run_notebook_in_colab.png")
```
The above screenshot shows that when I ran this notebook on Colab,
it ran on a VM instance with 12 GB RAM and 2 Intel Xeon 2.2 GHz processors,
running Ubuntu Linux on a file system with 74 GB free disk space.
This is sufficient to use Stan to fit small-to-medium models on small-to-medium datasets;
i.e., good enough for demos and classroom exercises.
Colab is a gateway drug: for large-scale processing pipelines
you'll need to move up to Google Cloud Platform or one of its competitors (AWS, Azure, etc.).
A Colab VM persists for approximately 10 to 12 hours from first spin-up;
therefore, in order to save your work, you must copy files from the VM
to your Google Drive, another cloud storage system, or your local machine.
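
One way to do the copying is a small helper like the sketch below, which copies matching files from a working directory into a destination folder. The function name `save_outputs` and the default `*.csv` pattern are assumptions for this example; nothing here is specific to Colab.

```python
import shutil
from pathlib import Path


def save_outputs(src_dir, dest_dir, pattern="*.csv"):
    """Copy files matching pattern from src_dir into dest_dir.

    Returns the names of the copied files, sorted alphabetically.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(Path(src_dir).glob(pattern)):
        shutil.copy2(src, dest / src.name)
        copied.append(src.name)
    return copied
```

On Colab, the destination could be a folder on your Google Drive after mounting it with `from google.colab import drive; drive.mount('/content/drive')`; the exact folder path beneath the mount point depends on your Drive layout.
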
The notebook itself is saved on your Google Drive.
You can download it to your local machine, or you can save it to GitHub
via the "Save a copy in GitHub" option from the Colab UI "File" tab.
```{r figColabSave, echo=FALSE, out.width = "75%", fig.cap="Saving a notebook to GitHub"}
library(knitr)
include_graphics("img/colab_save_github.png")
```
The save dialog provides the option of adding an "Open in Colab" button
to the saved notebook.
```{r figColabSaveDialog, echo=FALSE, out.width = "75%", fig.cap="Save dialog"}
library(knitr)
include_graphics("img/save_2_github.png")
```
This saves the notebook `show_VM_specs.ipynb` to the
[GitHub repo for this report](https://github.com/stan-dev/example-models/tree/master/knitr/cloud-compute-2020/show_VM_specs.ipynb).
```{r figViewNotebookOnGithub, echo=FALSE, out.width = "75%", fig.cap="Saved notebook in GitHub preview"}
library(knitr)
include_graphics("img/colab_notebook_on_github.png")
```
#### A few words on Colab security and authorization
In order to run a notebook in Colab, you must first be logged into your Google Account.
Once logged in, when you first attempt to run the notebook, you will get a warning
that the notebook may request access to your data or read stored credentials.
```{r figColabWarn, echo=FALSE, out.width = "75%", fig.cap="Security warning for GitHub notebooks"}
library(knitr)
include_graphics("img/colab_warning.png")
```
This is a valid warning!
However, in order for a notebook to access your Google Drive, it must use the
[Google Drive API](https://developers.google.com/drive/api/v3/about-sdk),
and this will be evident by reading through the markdown and Python notebook cells.
So don't take my word that these notebooks are safe;
review the notebook source code before clicking "RUN ANYWAY" (or "CANCEL").
#### Editing files and data on Colab
The Colab interface provides a "files" utility which lets you upload
data and models to your notebook's VM, and also gives you minimal editing facilities
for files on the Colab instance. This is the file folder icon to the left
of the main window.
```{r fig11, echo=FALSE, out.width = "75%", fig.cap="Colab Files Utility"}
library(knitr)
include_graphics("img/colab_files_util_1.png")
```
Clicking on this icon opens the files utility, which shows a view of the files
in the current working directory of the Colab instance. Across the top of
the files utility are controls to upload files from your local computer,
refresh the files view, and connect to your Google Drive.
```{r fig12, echo=FALSE, out.width = "75%", fig.cap="Files utility, controls"}
library(knitr)
include_graphics("img/colab_files_util_2.png")
```
The file upload utility opens a file browser that lets you select one or more files to upload.
```{r fig13, echo=FALSE, out.width = "75%", fig.cap="Files utility file browser"}
library(knitr)
include_graphics("img/colab_files_util_3.png")
```
Once files are uploaded, Colab provides minimal editing facilities for data and text files.
```{r fig15, echo=FALSE, out.width = "75%", fig.cap="Files utility file editor"}
library(knitr)
include_graphics("img/colab_files_util_5.png")
```
Unfortunately, the extension `.stan` is not recognized; however `.json` data files
can be edited.
This makes editing program files in Colab challenging.
Fundamentally, Colab is a notebook viewer, not an IDE.
### Challenge: Fast Spin-up
In a talk, class, or demo situation every minute spent installing
software is a minute less for the presentation itself.
A Colab virtual machine instance has both Python and R already installed,
as well as both the clang and gcc-7 C++ compilers.
But to run the current Stan release (Stan 2.22) from R or Python, you need either
CmdStanR or CmdStanPy as well as a local CmdStan installation.
Both CmdStanR and CmdStanPy are small packages with minimal dependencies
and can be downloaded from their respective repositories and installed
in a matter of seconds.
Both packages provide a function `install_cmdstan` which downloads and
compiles the latest CmdStan release.
Unfortunately, installing CmdStan can take upwards of 10 minutes because
the `install_cmdstan` function downloads a CmdStan release from GitHub,
decompresses and unpacks the tarball, and compiles all C++ code for CmdStan,
Stan, and the math libraries. This must be done every time a fresh VM spins up.
As a Colab VM only persists for at most 12 hours, this means that most of the
time your class, talk, or demo will require waiting on this installation process
before you or your audience can start running Stan.
To avoid this, we've created a CmdStan binary for Colab for the CmdStan 2.23.0 release,
[colab-cmdstan-2.23.0.tar.gz](https://github.com/stan-dev/cmdstan/releases/download/v2.23.0/colab-cmdstan-2.23.0.tar.gz).
From a Colab notebook, downloading and unpacking this set of pre-compiled binaries takes on the order of 10 seconds.
With this shortcut, the Stan install process for Colab consists of three steps:

* Install the CmdStanPy or CmdStanR package.
* Download the pre-compiled CmdStan binaries for Google Colab.
* Register the CmdStan install location in the Python or R session.

To see how this works, I have created two example notebooks for Colab
which run CmdStan's example model `bernoulli.stan`:
[CmdStanPy_Example_Notebook.ipynb](https://github.com/stan-dev/example-models/tree/master/knitr/cloud-compute-2020/CmdStanPy_Example_Notebook.ipynb)
and
[CmdStanR_Example_Notebook.ipynb](https://github.com/stan-dev/example-models/tree/master/knitr/cloud-compute-2020/CmdStanR_Example_Notebook.ipynb).
#### CmdStanR notebook spin-up
The initial cells of the CmdStanR_Example_Notebook follow the steps described above.
To create this notebook, it was necessary to first create an
IR notebook using Jupyter running on my local machine, then upload it to Google Drive.
The first cell in the CmdStanR notebook installs the CmdStanR package, as needed.
CmdStanR is not yet on CRAN, so we install from GitHub instead.
To speed up the install process, only the necessary dependencies
are installed.
```
# Install package CmdStanR from GitHub
library(devtools)
if (!require(cmdstanr)) {
  devtools::install_github("stan-dev/cmdstanr", dependencies=c("Depends", "Imports"))
  library(cmdstanr)
}
```
The following cells download the precompiled CmdStan binaries
and register the path to the CmdStan installation:
```
# Install CmdStan binaries (skip the download if the tarball is already present)
if (!file.exists("colab-cmdstan-2.23.0.tar.gz")) {
  system("wget https://github.com/stan-dev/cmdstan/releases/download/v2.23.0/colab-cmdstan-2.23.0.tar.gz", intern=TRUE)
  system("tar zxf colab-cmdstan-2.23.0.tar.gz", intern=TRUE)
}
# Set cmdstan_path to CmdStan installation
set_cmdstan_path("cmdstan-2.23.0")
```
#### Spinning up a CmdStanPy Notebook
The CmdStanPy_Example_Notebook contains the Python version of the fast spin-up steps.
CmdStanPy requires Python 3, which is the default runtime for new Colab notebooks.
CmdStanPy is a pure-Python package which can be installed from PyPI:
```
!pip install --upgrade cmdstanpy
```
We specify the `--upgrade` flag in order to get the latest version,
in case the Colab Python runtime comes with an older version
of CmdStanPy pre-installed.
We can use Python to download and unpack the precompiled CmdStan binaries:
```
import os
import shutil
import urllib.request

# Install pre-built CmdStan binary
# (faster than compiling from source via install_cmdstan() function)
tgz_file = 'colab-cmdstan-2.23.0.tar.gz'
tgz_url = 'https://github.com/stan-dev/cmdstan/releases/download/v2.23.0/colab-cmdstan-2.23.0.tar.gz'
if not os.path.exists(tgz_file):
    urllib.request.urlretrieve(tgz_url, tgz_file)
    shutil.unpack_archive(tgz_file)
```
The following cells check the CmdStan installation and register its location:
```
# Specify CmdStan location via environment variable
os.environ['CMDSTAN'] = './cmdstan-2.23.0'
# Check CmdStan path
from cmdstanpy import CmdStanModel, cmdstan_path
cmdstan_path()
```
In Colab, once you have executed a code block, hovering over the "run" icon shows the execution time;
here, the CmdStan installation took just over 10 seconds.
```{r figInstallCmdStanBinaries, echo=FALSE, out.width = "75%", fig.cap="Execution time to install precompiled CmdStan binary"}
library(knitr)
include_graphics("img/install_binaries.png")
```
Note that these binaries *might* work on other Ubuntu machines, but definitely won't work on macOS or Windows.
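
Once the CmdStan location is registered, fitting the `bernoulli.stan` example from Python might look like the sketch below. It assumes the CmdStan directory contains the usual `examples/bernoulli` folder with `bernoulli.stan` and `bernoulli.data.json`; the helper names `bernoulli_example_paths` and `fit_bernoulli` are made up for this illustration and are not taken from the example notebooks.

```python
import os


def bernoulli_example_paths(cmdstan_dir):
    """Return paths to the example model and data under a CmdStan directory."""
    example_dir = os.path.join(cmdstan_dir, "examples", "bernoulli")
    stan_file = os.path.join(example_dir, "bernoulli.stan")
    data_file = os.path.join(example_dir, "bernoulli.data.json")
    return stan_file, data_file


def fit_bernoulli(cmdstan_dir):
    """Compile the example model, run the sampler, and return a summary table."""
    # Imported here so the path helper above works without CmdStanPy installed.
    from cmdstanpy import CmdStanModel

    stan_file, data_file = bernoulli_example_paths(cmdstan_dir)
    model = CmdStanModel(stan_file=stan_file)  # compiles the model
    fit = model.sample(data=data_file)         # runs the sampler on the data file
    return fit.summary()
```

In a notebook cell, `fit_bernoulli('./cmdstan-2.23.0')` would then return a summary of the posterior draws for the model's single parameter.
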
### Challenge: Putting the R in Jupyter
This section is for all Stan users who work primarily in R and RStudio.
It shows you how to get your local Jupyter environment set up and how
to translate R Markdown documents to Jupyter notebooks.
In order to author Jupyter notebooks for R, you also need a local install of R as well as
the [IRkernel package](https://irkernel.github.io), the R kernel for Jupyter
(see the [installation instructions](https://irkernel.github.io/installation/)).
Jupyter notebooks for R are similar to RStudio's
[R Markdown documents](https://rmarkdown.rstudio.com),
in that both contain chunks of R code, interleaved with chunks of text.
Under the hood, Jupyter notebooks are JSON documents with suffix `.ipynb`
while R Markdown documents are in R Markdown format and have suffix `.Rmd`.
Although the RStudio IDE provides an R Markdown notebook interface,
notebooks authored via the RStudio IDE are in R Markdown format,
not Jupyter Notebook format.
```{r fig1, echo=FALSE, out.width = "75%", fig.cap="RStudio R Notebook"}
library(knitr)
include_graphics("img/rstudio_notebook.png")
```
The RStudio interface calls this a notebook, but the resulting file is still in R Markdown format:
```{r rmarkdown-notebook, size="footnotesize", echo=FALSE, fig.cap="Notebook file"}
cat(readLines('Untitled.Rmd', n=15), "...", sep="\n");
```
When viewed with Jupyter, this document is treated as a raw text file.
```{r fig2, echo=FALSE, out.width = "75%", fig.cap="R Markdown document in Jupyter browser"}
library(knitr)
include_graphics("img/jupyter_local.png")
```
An extremely useful Python package called [Jupytext](https://jupytext.readthedocs.io/en/latest/)
allows you to convert from R Markdown to Jupyter notebook format and back again.
After [installation](https://jupytext.readthedocs.io/en/latest/install.html),
Jupyter now treats R Markdown documents as notebooks.
```{r fig3, echo=FALSE, out.width = "75%", fig.cap="R Markdown document in Jupyter with Jupytext extension"}
library(knitr)
include_graphics("img/jupyter_juptext.png")
```
Jupytext identified the R Markdown YAML header as its own block, with celltype `raw`,
and converted the next three chunks to notebook cells with celltype `markdown`, `code`, and `markdown`, respectively.
After editing and testing the notebook in the browser, you can save it as
a Jupyter R notebook via the options menu tab **"File"**, selection **"Download as"**, option **"notebook (.ipynb)"**.
```{r fig4, echo=FALSE, out.width = "75%", fig.cap="Convert R Markdown file to Jupyter notebook"}
library(knitr)
include_graphics("img/save_as_ipynb.png")
```
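
The same conversion can also be scripted instead of done through the browser menu. The sketch below wraps the Jupytext command-line tool, assuming `jupytext` is installed and on your `PATH`; the helper names `jupytext_command` and `convert_rmd_to_ipynb` are made up for this example.

```python
import shutil
import subprocess
from pathlib import Path


def jupytext_command(rmd_path):
    """Build the command line that converts an R Markdown file to a notebook."""
    return ["jupytext", "--to", "notebook", str(rmd_path)]


def convert_rmd_to_ipynb(rmd_path):
    """Run Jupytext on rmd_path and return the path of the resulting .ipynb file."""
    if shutil.which("jupytext") is None:
        raise RuntimeError("jupytext is not installed or not on PATH")
    subprocess.run(jupytext_command(rmd_path), check=True)
    return Path(rmd_path).with_suffix(".ipynb")
```

For example, `convert_rmd_to_ipynb("Untitled.Rmd")` should produce `Untitled.ipynb` next to the source file, without any cell outputs.
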
Congratulations, you now have a Jupyter R notebook that you can share with the world!
Anyone who has access to a Jupyter server can recreate and extend your analysis.
#### Putting the R in Colab
In order to run a Jupyter notebook via Colab, it must be available on the web or your Google Drive.
Google Drive lets you create new Python notebooks or upload existing notebooks from your local machine.
To share a Jupyter R notebook with the world, you will need to author the notebook locally and then upload it
to Google Drive:
```{r figGgoogle, echo=FALSE, out.width = "75%", fig.cap="Upload to Google Drive"}
library(knitr)
include_graphics("img/gdrive_upload_notebook.png")
```
Alternatively, you could use [this empty R notebook](https://github.com/stan-dev/example-models/tree/master/knitr/cloud-compute-2020/template_IR_notebook.ipynb), but of course, if you want an R notebook that runs Stan, use the example CmdStanR notebook from the previous section.
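
If you would rather generate an empty R notebook yourself, a sketch along the following lines writes a minimal notebook file whose metadata names the R kernel. This is an assumption-laden illustration, not the contents of the template linked above: the kernelspec values (`ir`, `R`) are the ones IRkernel normally registers, and the function names are hypothetical.

```python
import json


def empty_r_notebook():
    """Return a minimal nbformat-4 notebook dict that requests the R kernel."""
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {
            # Assumed kernelspec values for IRkernel.
            "kernelspec": {"name": "ir", "display_name": "R", "language": "R"},
            "language_info": {"name": "R"},
        },
        "cells": [],
    }


def write_empty_r_notebook(path):
    """Write the empty R notebook to path as JSON."""
    with open(path, "w") as f:
        json.dump(empty_r_notebook(), f, indent=1)
```

Calling `write_empty_r_notebook("my_r_notebook.ipynb")` produces a file that can be uploaded to Google Drive like any other notebook.
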
### Next steps
The example notebooks for CmdStanPy and CmdStanR provide a starting point for
sharing your Stan analysis as a Colab notebook by getting you over the critical
software installation hurdle.
Whatever your analysis, it should include the initial cells in these notebooks
which download the Python and R wrapper interface packages, download the precompiled CmdStan installation,
and set the package's CmdStan path variable accordingly.
After installation, an interesting Stan notebook should include cells to:

- Assemble, plot data
- Compile, display Stan model
- Fit the model to data
- Display inferences
- Graph data and fitted model

An extremely interesting Stan notebook would expand the above to the full Bayesian workflow:

- Read in data
- Graph data
- Fit model 1 to data
- Graph data and fitted model
- Simulate fake data from model 1, fit model 1, and informally check that the inferences are close to the assumed parameter values
- Expand model 1 to create model 2
- Repeat the above steps for model 2
- Compare the two models
### Update, May 2020
We've just added notebooks for both Python and R which correspond to Andrew's blog post on an early
study of Covid-19 prevalence in Santa Clara County, CA:

- https://statmodeling.stat.columbia.edu/2020/05/01/simple-bayesian-analysis-inference-of-coronavirus-infection-rate-from-the-stanford-study-in-santa-clara-county/

The notebooks, models, and data are available from GitHub:

- https://github.com/stan-dev/example-models/tree/master/jupyter/covid-inf-rate