Wisdom—the knowledge of how to use knowledge—is often hard won.
That made 12 hours in the presence of Dr Hadley Wickham, RStudio Chief Scientist and author of many extremely useful R packages, doubly good: not only did a little of Hadley’s wisdom rub off on me, but I also did not have to go through years of hard work, experimentation, implementation, revision and so on to get it.
(My thanks go to Dr Alex Whan for making possible Hadley's visit to Canberra last Friday.)
As well as learning about tidyr, dplyr, shiny, and pipelines for data analysis, I got an insight into how Hadley and his colleagues and collaborators are advancing the R body of knowledge to everyone’s benefit.
When not travelling, Hadley splits his time between working from home and a coworking space in Houston. Currently, RStudio staff are located across the US, affording Hadley a modicum of an increasingly rare and undervalued commodity in the modern workplace: solitude.
I got the impression that life at RStudio strikes a good balance between collaboration, communication and concentration. One of Friday’s delights was to see the fruits of that work environment in the latest release of RStudio.
Thanks to tighter integration with knitr and pandoc, RStudio users can use R Markdown to do reproducible research and communicate it via HTML, PDF or Word. And through the goodness of shiny, RStudio enables users to produce and publish interactive web pages so that we can dynamically explore the impact of different parameters and options on our analyses and visualisations of data.
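To make that concrete, here is a sketch of a minimal R Markdown document (the title and numbers are made up); knitting it in RStudio, or calling `rmarkdown::render()` on the file, produces the HTML, PDF or Word output mentioned above:

````markdown
---
title: "A minimal example"
output: html_document
---

The mean delay in this toy dataset was `r mean(c(10, 25, 30))` minutes.

```{r}
# A code chunk: its code and output both appear in the rendered document
summary(c(10, 25, 30))
```
````

The prose, the code, and the results all live in one file, which is what makes the analysis reproducible.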
Pipes and pipelines have been an incredibly useful feature of the UNIX operating system since the 1970s.
Now, through Stefan Milton Bache’s magrittr (“Ceci n’est pas un pipe”) and Hadley’s dplyr packages, R users can pass data through pipelines of filters, selections, grouping operations, and per-group analyses, making code more succinct, readable and maintainable.
```r
library(nycflights13)  # provides the flights dataset
library(dplyr)

flights %>%
  group_by(dest) %>%
  summarise(mean_delay = mean(arr_delay, na.rm = TRUE), n = n()) %>%
  arrange(desc(mean_delay))
```
…takes airline flight data, then groups it by destination, then calculates the average arrival delay to each destination, then arranges the results in descending order of average delay.
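For contrast, here is a rough base-R version of the same group-summarise-sort pattern, without pipes. It uses a made-up toy dataset (hypothetical destinations and delays) rather than the real flights data, and it makes plain how much bookkeeping the piped version spares us:

```r
# Toy stand-in for the flights data (values are invented for illustration)
flights_toy <- data.frame(
  dest      = c("SYD", "MEL", "SYD", "BNE", "MEL"),
  arr_delay = c(10, 25, 30, 5, 15)
)

# Group by destination and summarise, one named vector at a time
mean_delay <- tapply(flights_toy$arr_delay, flights_toy$dest, mean)
n          <- tapply(flights_toy$arr_delay, flights_toy$dest, length)

# Reassemble into a dataframe and sort by descending mean delay
result <- data.frame(dest       = names(mean_delay),
                     mean_delay = as.vector(mean_delay),
                     n          = as.vector(n))
result <- result[order(-result$mean_delay), ]
result
```

Each intermediate result needs a name and a re-assembly step; the pipeline expresses the same logic as a single left-to-right sentence.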
And here lies deep wisdom about the value of designing code that can mimic or take advantage of natural language. This reduces the cognitive burden both of reading and of writing code and, in my view, increases the odds of correctness.
The other thing that happens with programming idioms that are easy and useful to apply is that they encourage us to program, think, and—in R’s case—analyse in similar ways.
Hadley talked about R’s “dataframe” as a fundamental concept which, if adopted, supports a whole range of downstream analyses very straightforwardly. This reminds me a bit of the way containers revolutionised goods transport: you can pack a huge assortment of things into a container, and doing so makes them easy to handle. I gather that this is part of the philosophy underpinning tidyr.
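A small made-up example of the container idea: once observations are packed one-per-row into a dataframe, the same generic machinery (subsetting, aggregation, modelling, plotting) applies regardless of what the data actually mean.

```r
# Invented measurements in a tidy, one-observation-per-row dataframe
measurements <- data.frame(
  sample   = c("a", "a", "b", "b"),
  variable = c("height", "weight", "height", "weight"),
  value    = c(1.7, 65, 1.8, 72)
)

# Generic tools work without knowing the data are heights and weights:
subset(measurements, variable == "height")          # pick out one variable
aggregate(value ~ variable, measurements, FUN = mean)  # summarise per variable
```

The dataframe is the standard container; the tools are the cranes and trucks.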
For me, and many others, this kind of standardisation and abstraction is incredibly useful, allowing analysts to think more about the analysis of data, and spend less time wrangling. The value of R is increasing as useful programming idioms and forms of expression develop.
Digressing again, this makes me think about the mindset that “all value can be monetised”. I’m not an economist and I’m not an entrepreneur, so this may be a bit naïve, but I venture there are some incredibly valuable things that would cease to be valuable if we sought to realise their value from them. On reflection, “standards” (e.g., 240V, 50Hz electricity… oops! Not standard worldwide!) and “conventions” (e.g., “let’s all drive on the left… until we cross the border”) seem to qualify… I welcome enlightenment on this topic!
This digression brings me to why I feel very positively disposed to RStudio and R in general.
In addition to addressing my interests in data analysis, statistics, and visualisation, R (by which I mean the core development team, package writers and the wider R community) and groups like the RStudio team hit all the elements of David Maister’s “Trust Equation”.
And talking of the wider R community and its contributors, Hadley mentioned Rasmus Bååth’s analysis of naming conventions in R packages, noting that not only were several conventions in use across the 2668 packages on CRAN, but also that 28% of packages mixed three or more naming styles in the one package.
So much for decreasing cognitive burden.
Still I think this diversity is evidence of a language that is both useful and being used.
What other wisdom did I gain last Friday?
- Backticks: these let you refer to otherwise non-syntactic names in R (e.g. `` `my variable` ``), and have apparently been around for a few years
- `browseVignettes(package = "dplyr")` yields much useful information
- `lag()` and `lead()` automagically pad the result with `NA` (or a value of your choosing), unlike `diff()`, which gives an answer one element shorter than the vector it is applied to
- `semi_join(x, y)` returns the rows of `x` that have a match in `y` when joined on the columns in common
- dplyr allows you to have dataframe columns that contain list variables
- You can include images in R Markdown with `![](http://australianbioinformatics.net/storage/downloads/ABN-Clear-80x80.png)`
- `knitr::kable()` will print tables in R Markdown
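The `lag()`/`lead()` padding behaviour is worth a quick demonstration. This is a base-R sketch of what those dplyr functions do (a shifted copy padded with `NA`, keeping the original length), alongside `diff()`, which shortens the result:

```r
x <- c(2, 4, 7, 11)

# Base-R sketches of dplyr's lag() and lead():
lag_sketch  <- c(NA, x[-length(x)])  # shift right, pad the front with NA
lead_sketch <- c(x[-1], NA)          # shift left, pad the end with NA

length(lag_sketch)  # same length as x: 4
length(diff(x))     # one shorter than x: 3
```

Keeping the result the same length as the input is what lets `lag()` and `lead()` slot neatly into dataframe columns inside a dplyr pipeline.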
Although I clearly have a lot more to learn, I look forward to putting the wisdom I have gained into action, and I sign off with many thanks to Hadley Wickham, the RStudio team and Alex Whan for an excellent 12 hours of poweR.
*Disclaimer: I realise using “R” (the name of the free software environment for statistical computing and graphics) in place of “Ah” (the exclamation) is getting a bit tired, but I consider this a kind of consolation for “R” being so damn difficult to Google. Also, I have not read Henry Handel Richardson’s The Getting of Wisdom; I am hoping that it is, in fact, about becoming wiser.*