September 26th, 2017

 

Last spring I gave a talk at the New York R Conference and EARL SF titled “The Missing Manual for Running R on Amazon Cloud”. It was aimed at enterprise users, small or large, who want to build out a data science team that can work effectively in the cloud, with all the data ingest, security and usability concerns that come with that space. In recent months I’ve been surprised but overjoyed to see cloud data science championed by citizen data nerds and #rstats community folks in academia.

In an experiment of unknown and unplanned duration, I’ve been leaving my work laptop at work on Friday night, then resisting the urge to go get it on Saturday morning. If I need or want to do anything on the computer, be it R or otherwise, I have to figure out how to do it on my Acer Chromebook 11.

The major things, like working with RStudio Server on AWS, aren’t all that different from how I operate every day at work. On my work machine I’m more likely to “cheat” and use a local-cloud hybrid approach to data management, so I like that the Chromebook forces me to honestly evaluate the usability of the cloud data science system we’ve designed.

It’s the little things that have me feeling constrained on the Chromebook. Taking screenshots, managing them all, editing diagrams and trying to create slide deck presentations are all a bit of a drag. So far I’ve felt more effective switching to my phone when I need to do that sort of thing. Making an Acer Chromebook 11 feel satisfying to operate is probably an entirely lost cause, but there is something really fun about having all the power of the cloud at your fingertips on one of the cheapest little laptops money can buy.

My Weekend Chromebook Tools

  • The whole Google suite of web tools for email, documents and chat.
  • B23 Data Platform: Web service for provisioning Amazon Web Services resources with data science tools (RStudio Server, Jupyter Notebooks, Zeppelin Notebooks, H2O.ai, Spark…) as well as data pipelines. I work on the team that builds and maintains this product.
  • AWS S3 Bucket linking with RStudio Server. It’s useful to have data when I’m doing work, and keeping data in private S3 buckets is probably a better option than keeping it on my Chromebook. We have a similar system for getting other data sources into stacks as well.

 

Slide 9 from “The Missing Manual for Running R in the Cloud”

  • GitHub deployment with RStudio Server. While not a necessary feature, it’s super useful when I just want to automatically “deploy” a repo from GitHub onto the instance or start running a script on a cron schedule right away.

 

GitHub deployment customization for R Server stacks on B23 Data Platform

  • The B23r package, packrat and RStudio projects for saving and restoring projects and packages between working sessions. Without this workflow, I don’t think I’d put up with doing anything in the cloud on a regular basis. It takes too much time and effort to set up all the data, files and R packages I want, every single time I want them. With this system, I use a custom function to send project files out to an AWS S3 bucket and then pull them back down and restore them when I need to work in that project session with that set of packages again.

 

Slide 16 from “The Missing Manual for Running R in the Cloud”

  • AWS Console: Mostly just for editing security groups on the fly. Everything EC2-related is taken care of by B23 Data Platform.
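
To make the save-and-restore workflow from the list above a bit more concrete, here is a minimal sketch of the general idea: bundle a project folder (scripts plus its packrat lockfile) into one archive, put it somewhere durable, and unpack it again in a later session. This is not the B23r function, which isn't shown in this post. It's written in Python with only the standard library so it can run anywhere, a local folder stands in for the private S3 bucket, and the names `save_project` and `restore_project` are hypothetical.

```python
import tarfile
import tempfile
from pathlib import Path


def save_project(project_dir, store_dir):
    """Bundle project_dir into <name>.tar.gz inside store_dir.

    store_dir is a local stand-in for an S3 bucket (an assumption
    made for this sketch); a real setup would upload the archive.
    """
    project_dir = Path(project_dir)
    store_dir = Path(store_dir)
    store_dir.mkdir(parents=True, exist_ok=True)
    archive = store_dir / f"{project_dir.name}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths relative to the project folder name
        tar.add(project_dir, arcname=project_dir.name)
    return archive


def restore_project(name, store_dir, work_dir):
    """Unpack <name>.tar.gz from store_dir into work_dir."""
    archive = Path(store_dir) / f"{name}.tar.gz"
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(work_dir)
    return work_dir / name


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        # A toy project: one script and a placeholder packrat lockfile
        project = tmp / "saturday" / "myproject"
        (project / "packrat").mkdir(parents=True)
        (project / "analysis.R").write_text("print('hello')\n")
        (project / "packrat" / "packrat.lock").write_text("placeholder\n")

        save_project(project, tmp / "bucket")
        restored = restore_project("myproject", tmp / "bucket", tmp / "sunday")
        print(sorted(p.name for p in restored.rglob("*")))
```

Once the folder is back in place, the R side of the restore would be to open the RStudio project and let packrat reinstall the packages recorded in the lockfile.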

About the Author: Kelly O’Briant is a data scientist and lead R package maintainer at B23 LLC. She received her M.S. in Computational Science and Informatics from George Mason University. Kelly is a founder and co-organizer of the Washington DC chapter of R-Ladies Global.