From S to R: 35 Years of AT&T Leadership in Statistical Computing
Today is a golden age in data analysis. The advent of the web has made it easy to assemble huge data sets. Every click—an online purchase, a like, or a product rating—adds a new data point to some data set somewhere. Multiply these events across millions of users and over weeks and months, and huge ready-made data sets of billions and even trillions of data points almost grow themselves.
But the data by itself means little without the capability to extract useful information; fortunately, data analysts today have a range of exploratory tools that aid data analysis, chief among them interactive visualizations that let users directly manipulate data to quickly “see” the dominant trends and the positive and negative relationships, while quickly locating the anomalies and outliers that reveal much about the data.
Much of the work supporting interactive data analysis and visualization is the result of research ongoing at AT&T for over 35 years. It began when Richard Becker, John Chambers, and Allan Wilks, then at Bell Labs, created S, one of the first programming languages devoted to data analysis, used by researchers and academics (and then their students) to create new methods of data analysis.
Simon Urbanek continues the tradition as both an AT&T researcher and a core developer of R, a data-analysis environment that builds on S. A researcher focused on interactive visualizations, Urbanek uses and extends R. Today he is working with other researchers to create an R collaborative environment that will not only allow multiple users to access and work together on the same sessions, but will also incorporate parallel and distributed processing needed for the immense data sets now commonly available.
Data analysis in the early computing days
Before S, and thirty-five years before R, data analysis was anything but visual or interactive. It was statisticians using mainframes to perform regression, analysis of variance, and other traditional statistical methods that predate computers. Data analysis was half analysis, half programming, when programming meant calling subroutines in a language like Fortran. Every variable had to be declared ahead of time, and implicit rules determined which variables could be floating point and which integers. The analysis returned not a simple answer but pages and pages of a top-down analysis containing the wanted answer somewhere within it.
It wasn’t a good fit. Statisticians as a rule are not programmers. They just want to be able to analyze the data, and not have to worry about calling a subroutine or declaring variables.
There were other difficulties. Fortran and other early languages were created for engineering problems, not for data analysis. Fortran is fast at solving differential equations and linear algebra computations, but this isn’t what statisticians are typically trying to do. Statisticians want to understand relationships among variables. Do patients improve when the dosage is increased? Do students score higher grades when class sizes are reduced? What happens to X when Y changes?
The limited amount of computer memory in early computers also limited the types of statistical analysis that could be done computationally. With their small memories, computers typically handled only a certain portion of the data at one time. One portion would be read in, processed, and then the next data portion would be read in. Computations performed in sequence might work for getting averages, standard deviations, and other linear computations, but not for sorting, finding the median, clustering, or other operations where all data needed to be in memory.
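The distinction survives in R today. A sketch with made-up data chunks shows why a mean can be computed in sequence while a median cannot:

```r
# A mean can be accumulated chunk by chunk: keep only a running
# sum and a count, never the full data set.
chunks <- list(c(2, 4, 6), c(8, 10), c(12))  # data arriving in pieces
total <- 0; n <- 0
for (chunk in chunks) {
  total <- total + sum(chunk)
  n <- n + length(chunk)
}
streaming_mean <- total / n          # 7, identical to the mean of all the data

# A median has no such shortcut: it requires sorting (or selecting
# from) the complete data, so everything must be in memory at once.
all_data <- unlist(chunks)
exact_median <- median(all_data)     # 7, but only because we kept every value
```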
A language for data analysts
In 1975, Richard Becker and John Chambers, statisticians and computer scientists working on data analysis at Bell Labs, were frustrated with having to interrupt their data analysis to do the required programming. They wanted a more interactive process, where they could type an expression at the command line and get back an immediate response, or better yet, a bar graph or other visual.
Packages like SAS and SPSS alleviated the chore of programming. But Becker and Chambers didn’t want a package (a suite of programs). Packages were too limiting; they might work well on standard, well-formulated operations, but they give little control over the analytic process, nor do they make it easy to design new tasks or work outside the supported options. And packages normally return printed output. Becker and Chambers wanted to get back a simple answer in the form of an object that could be passed to and manipulated by other functions.
While a programming language would give them the control they needed, Fortran and other languages of the day were too low-level to express statistical concepts. It was all about programming at the level of individual data elements and writing big, tedious loops.
They started to think about what would be required in a high-level language created specifically for statistics and data analysis. It would need a syntax and notation appropriate for expressing statistical concepts and modeling, one where the basic data type would be the vector, and where everything would be a function. Such a language would give them the interactivity needed to more easily explore data sets while handling the low-level programming tasks, so they could concentrate on the high-level analysis.
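Modern R, which inherits this design, shows what they were after. A sketch with made-up numbers: a whole analysis step becomes a single expression, with no declarations and no loops.

```r
# Vectors are the basic data type: no declarations, no loops.
dosage   <- c(10, 20, 30, 40, 50)
response <- c(1.1, 2.3, 2.8, 4.1, 4.9)

# Everything is a function, even arithmetic, and operations
# apply to whole vectors at once.
scaled <- response / max(response)

# One expression, one immediate answer.
correlation <- cor(dosage, response)
```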
In 1976 they started writing S, one of the first programming languages written for statistics and data analysis. The first S was written in Fortran for running on a Honeywell 6000, and served more as an interface to access subroutines already existing in an in-house Fortran library. The more Becker and Chambers worked on S, the more possibilities they saw, and S gradually evolved into a complete environment specialized for data analysis, with support for graphics and plots.
S was based entirely on functions, allowing for a high level of abstraction. There were functions for regression, clustering, seasonal adjustments, smoothing, and other statistical methods. Becker and Chambers, using their knowledge of both computer science and statistics, paid attention to the details of writing these functions, ensuring the statistical operations were done correctly and in a computationally efficient fashion. Some operations that were once difficult became trivial in S. Anyone could just run an expression with an S function, know that it was done in the right way, and access the results.
As new algorithms were developed, Becker and Chambers would write new functions to keep S up to date on the latest methods. Just as importantly they made it easy for other users to write their own functions by including interfaces to translate between internal S data structures and those of other languages.
The original model for importing algorithms into S: The algorithm is represented as a circle enwrapped by a square interface, allowing it to fit into the square slot provided to S functions.
To support clustering, nearest neighbors, and other modeling techniques where each data point is compared to all other data points, S was designed for in-memory operations—this at a time when minicomputer memories held 64 Kbytes. Though in-memory operations would initially limit what could be done in S, Becker and Chambers took the long view. As computers got bigger, so would memories.
Working without the demands and constraints of releasing a commercial product, Becker and Chambers (and in 1982, Allan Wilks) built the language they wanted and according to their own requirements. Each change required a 3-0 vote. If one couldn’t convince the others that a change was the right way, then it wasn’t the right way. If they didn’t have a good way of implementing something in S, they would wait until they had a way to do it properly. For the longest time, they didn’t include analysis of variance because they didn’t know how to express it in the language.
They also wanted S to keep pace with the coming changes in computing and software. Already colleagues at Bell Labs were working on UNIX and C. Mainframes and punch cards would give way to minicomputers. Realizing there would be more interest in porting a good operating system like UNIX with its powerful shell language and clean and simple file system than in porting S, Becker and Chambers with Wilks rewrote S in C. While others did the hard work of porting UNIX to different machines, S went along for free, reaching a greater number of people, and extending beyond those within Bell Labs.
When AT&T, then still a monopoly, gave universities a free license to use UNIX, S too gained a foothold with academics and students. S began to make an impact on research and teaching and altered how people analyzed, visualized, and manipulated data.
After its breakup in 1984, AT&T ceased to be a monopoly and, in 1987, licensed S to Statistical Sciences, which sold and supported S under the name S-PLUS. In 1993, the license was made exclusive, meaning AT&T was effectively out of the S business, and anyone wanting to use S would have to pay for S-PLUS. (Statistical Sciences was later renamed Insightful and later acquired by TIBCO.)
The last version of the original (free) S was compiled at Bell Labs sometime around 2000.
A new era in statistical computing
Not everyone wanted to pay for S, least of all students. In 1991, two University of Auckland professors, Ross Ihaka and Robert Gentleman, essentially re-implemented S from scratch while incorporating elements from Scheme and called it R. R had the same syntax and grammar as S, and most S programs could run in R (and still can).
Like S, R made it easy to write functions that returned objects, and it had mechanisms for organizing data, running calculations, and creating graphical representations. And like S, R came with a comprehensive library of data analysis functions.
What was different was the starting point. Where S started from nothing, R could take everything provided by S and build from there.
The data analysis environment was different also. A lot had changed in computing and data analysis in the 15 or so years since Chambers and Becker first started working on S. Computers were smaller and more powerful; data sets were bigger, and the data itself was different. Many of the traditional statistical methods like regression didn’t easily scale up, and statisticians themselves were slow to evolve their traditional methods. It was researchers in the new fields of data mining and machine learning who developed the new data analysis techniques of matrix factorization, boosting, clustering, and others, many borrowed or adapted from computer science. It would be hard for one language, one program, or even a complete environment to cover it all.
R succeeds today because it is so easy to extend. If the driving force behind S was interactivity, for R it was extensibility. Partly this extensibility stems from S, which was designed to interface easily with outside elements (specifically, subroutine libraries). R retains this interfacing capability, so it easily incorporates native code as well as objects, analyses, and visualizations created in other programs, giving R users options outside of R. R is extensible also because it’s open source. Where three people once worked on S, the open-source R can draw from a much bigger pool of contributors, since anyone can take R code, fork it, and make changes.
It was not that Ihaka and Gentleman planned out all this from the beginning. They just wanted a tool that would help them teach data analysis to their students. As they started slowly sharing R with others and as word spread, their project got ahead of them. Users started emailing suggestions and requesting fixes. The job became too much for two people, and they started first a mailing list and then a newsgroup to carry discussions about R and R development, giving people a way to contribute patches and code.
The next logical step was for Ihaka and Gentleman to officially make R open source, and in 1995, just four years after the first release of Linux, they did just that under the GNU General Public License. In 1997, responsibility for maintaining the basic, central functionality of R was handed to a small core group of developers. Today the R core group consists of 19 international developers, one of whom is John Chambers. While Chambers is no longer at AT&T, AT&T’s tradition of fundamental contributions to statistical computing is upheld by Simon Urbanek, who is both an AT&T researcher and an R core developer.
R resembles S and maintains the important aspects that work well for data analysis: Everything in R is a function and it is easy to write functions in R, using the intuitive, clear S syntax. All results are returned as objects that can be easily incorporated into the next step. (Objects created in other systems and then imported into R are accessed in the same way R objects are accessed.) It’s easy to write native code in Java, C++, or any other language, and then have R call the code directly, with almost no loss in speed or performance.
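The flavor of this can be sketched in a few lines of R, using the `cars` data set and `lm` regression function that ship with R:

```r
# Fit a regression; the result is an object, not pages of printout.
fit <- lm(dist ~ speed, data = cars)   # cars ships with R

# The object can be picked apart or passed to other functions...
coefs <- coef(fit)        # extract the fitted coefficients
report <- summary(fit)    # a fuller report, itself another object

# ...or fed directly into the next step of the analysis.
resid_sd <- sd(residuals(fit))
```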
But R differs from S in one fundamental way. R has packages.
A package bundles together the R functions and compiled code needed to accomplish a particular task. Anyone wanting to create a new function can simply write it out and save it as an R package. Everything needed to run the function is maintained together with the package, so a function created months earlier can be re-run at any time without having to remember, or dig up notes on, what arguments get passed.
Packages not only make it easy to save functions, but to share them. If someone creates in R a new function or method that solves a problem or performs a particular task (fitting a classification tree, interfacing to a database), it can be handed off to others who want to perform the same task.
And it’s this ability to distribute and share packages that makes R so extensible. If something is missing in R, it can be added via a package. In fact, most of R’s functionality comes from user-created packages. R’s interactive visualization capabilities, for example, come via packages, including rggobi which interfaces R to GGobi, and iPlots (or iPlots eXtreme), which has the benefit of being integrated with R.
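Using a package takes a single call. A minimal sketch (using MASS, a package that ships with standard R installations, so it runs without a network connection; a CRAN package like rggobi would be installed the same way):

```r
# Adding functionality from CRAN is one call:
# install.packages("rggobi")   # would download and install from CRAN

# Attaching a package makes its functions behave like built-ins.
# MASS ships with standard R installations.
library(MASS)

# rlm, a robust regression from MASS, now works like any core function.
model <- rlm(dist ~ speed, data = cars)
```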
R packages do everything from cutting-edge statistical models, genetic data analysis, interfaces to geographical information systems, and economic analysis, to printing labels. Statisticians might create a package for a new data analysis method; biologists, chemists, geneticists, or anyone else can create packages specific for their discipline.
Increasingly, scientific papers are distributed with an R package. It’s usually only a matter of time before the inventor of a new data-analysis technique makes it available via a package, and it’s why a very large number of new methods and developments in scientific research are implemented first in R. And because it’s not necessary to know what’s inside a package to use one, even nonprogrammers can use packages to access the latest data-analysis techniques coming out of academia or industry research departments.
To support packages and make them easy to create and distribute, there is the CRAN (Comprehensive R Archive Network) repository, an automated, structured framework for building packages according to a standardized format.
Anyone wanting to broadly distribute a package as open source can submit it to CRAN, which runs a series of checks to make sure the package executes properly and has documentation; if the package passes the checks, it will be accepted and CRAN will attach the binaries for easy installation. CRAN packages are listed at the R-core repository http://cran.us.r-project.org, where almost 5000 CRAN packages are available for anyone to download, with new packages added daily. Additional packages (which may not meet CRAN requirements) are available on different repositories (see sidebar).
Extending R to the cloud
AT&T researcher Simon Urbanek, a core R developer and creator of iPlots and iPlots eXtreme (as well as the maintainer of R’s Mac port), is working with others at AT&T Research to extend R in ways that go beyond simply sharing packages, and to more directly support the sharing of an analysis itself.
The initial impetus was a pressing need within AT&T Research for a collaborative data-analysis environment that would allow researchers to work together on different aspects of analyzing large data sets. Different groups within AT&T Research—speech, visualization, and network management groups—all have large data sets that they are trying to analyze. Each group works with different models and methods but at times encounters similar problems. The visualization group, for instance, might need to create a document-term matrix with little idea of how to do it, even though this knowledge exists in the speech group. But without a framework for sharing analyses and code, there was no easy way to transfer knowledge from one group to another.
Existing collaborative environments and systems—including IPython Notebook, GitHub, and the R packages knitr and Sweave—either lacked the necessary data analysis functionality or were not web-based (and a web-based system would have made sharing simple).
In a project code-named RCloud, AT&T researchers have taken ideas from existing models and combined them to build a collaborative environment that uses web-based notebooks for documenting and demonstrating a data analysis. These notebooks list which packages and code were used at each step, as well as the data sources and servers, while providing a way for users to insert text and comments for more explanation. The idea is for anyone reviewing the notebook to be able to re-create or independently verify the analysis. As with a package, everything in a notebook is stored (and versioned) together.
Project RCloud notebooks resemble a sort of dynamic content management system that expresses how to link together the various elements of the analysis. The speech group might use a notebook to show how it uses R to pull in data from a data source and pass it to a speech algorithm to do some modeling before passing it outside R to a visualization program to produce plots. It’s actually a natural way to program in a functional language, and it again shows the extensibility of R. The notebook that describes this analysis might be taken by another group and used in a completely different context by swapping out the data sources, changing the packages or code, or specifying a different server.
(To make it easier to swap data sources, R’s previously internal connections were made external so programmers could more easily write or adapt code for any type of data source, whether it’s a file, Hadoop database, or URL. This conversion required Urbanek to make a change to R source code—demonstrating a nice perk to having an in-house R core developer. But everyone else also benefits since these external connections are part of R Release 3.0.0.)
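A small sketch of the connections idea, using a temporary file as the data source (a url() or pipe() connection would be handled identically by the reading function):

```r
# Connections give file, URL, pipe, and socket sources one interface.
tmp <- tempfile()
writeLines(c("speed,dist", "4,2", "7,22"), tmp)

con <- file(tmp)          # could as easily be url("http://...")
data <- read.csv(con)     # read.csv neither knows nor cares what
                          # kind of connection it was handed
```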
RCloud notebooks open in the browser and are thus accessible from anywhere and from any connected device. Someone might close a notebook at work and then re-open it at home or from any other location. Others collaborating on a project can likewise open the same notebook to see what others are doing and insert a new step or modify an existing one. (RCloud and IPython developers are cooperating in finding a way for the two systems to talk to one another so it will soon be possible to run RCloud from IPython notebook.)
Collaborative R. With project RCloud, the user’s location doesn’t matter. Analysis, code, and visualizations are always available via a browser (and appropriate permissions).
As the name implies, RCloud runs on cloud-based servers with the notebooks and code stored in GitHub. With RCloud running in the cloud, users don’t have to care about the resources or machines involved and can rely on the system to scale as needed. (Anyone deploying RCloud can accommodate more users by increasing the number of machines.)
A side effect of running R in the cloud is being able to run RCloud on clusters of machines, and thus take advantage of the expanded cloud-based computational and memory resources, especially for those data sets too large to fit in the memory of a local machine. Running R from the cloud also makes it possible to implement R for distributed and parallel-processing environments, allowing for fast computation and visualization of large-scale data in near real time. Though R is single-threaded (a design inherited from S before there was such a thing as parallel processing), AT&T researchers get around this limitation by using the Rserve package to implement RCloud on an entire cluster, connecting users to one of the machines in the cluster.
Researchers are now working to map data and computation onto specific platforms, especially onto server clusters that will run R tasks in parallel with other data-processing tasks. The research challenge is finding the best way to identify for R which tasks are (or are not) parallelizable so R can automatically subdivide tasks for different machines without the user having to manage these lower-level aspects.
Project RCloud provides a case study of how collaborative coding should work, creating a social coding environment. Code applied to one real-world problem is generally distributed so that anyone with the same or a similar problem can take that analysis and use it. A need solved by one researcher or developer can thus be picked up by another, who may expand the code or take it in a completely new direction, benefitting the broader community.
RCloud now serves as the core technology engine supporting AT&T’s analytics and visualization of data, just as S once did for the old Bell Labs. In a sense, things have come full circle. Created within AT&T, RCloud is an extension of the open-source R, which is itself an extension of the original S, also created within AT&T. In both S and RCloud, the aim was to expand data analysis, though the means to do so has changed radically over 35 years. Where S was the product of a small, tightly focused group of people working without the compromises demanded of a commercial product, R follows the now-proven open-source alternative, and is able to draw on the collective knowledge of many people from diverse scientific backgrounds. When many can and do contribute, the opportunity for extension and augmentation is almost limitless. R benefits immensely from the open-source model, and so does anyone who uses R.
RCloud too will soon be made open source; until then, it is available (with limited documentation) at this GitHub site.
About the researchers
Richard Becker is a Member of Technical Staff in the Statistics Research Department at AT&T Labs – Research. His primary interests are statistical computing, graphics, data analysis, and large datasets. He joined Bell Labs in 1974 and began work the next year with John M. Chambers to create the S language, one of the first programming languages devoted to data analysis. S underwent many revisions over the years and inspired the open-source R environment.
Becker’s pioneering work in dynamic graphics, with William S. Cleveland, resulted in the “brushing” technique for visualizing multi-dimensional data. His recent work uses call detail records to understand human mobility patterns and aid urban planning.
Becker was awarded the AT&T Science and Technology Medal in 1997, the AT&T Strategic Patent Award in 2005 and was named an AT&T Fellow in 2008. He was elected a Fellow of the American Statistical Association (ASA) in 1990.
He co-authored (with John Chambers) S: An Interactive Environment for Data Analysis and Graphics, as well as (with John Chambers and Allan Wilks) The New S Language: A Programming Environment for Data Analysis and Graphics.
Simon Urbanek is a Member of the Statistics Research Department at AT&T Labs – Research. His main interests are interactive data visualization, exploratory model analysis, tree-based methods, model ensembles, and R project development. He is also an R core developer, maintaining the Mac interface for R, and has created several R packages, including iPlots (and iPlots eXtreme, a more efficient form of iPlots) and Cairo, which offer interactive and advanced graphics capabilities for R.
His dissertation at the University of Augsburg was on exploratory model analysis, model comparison, tree models, and ensembles, for which he received a Dr. rer. nat. (summa cum laude) in Statistics, equivalent to a PhD. Before arriving at AT&T Research in 2004, he was an assistant researcher in the department of computer-oriented statistics and data analysis at the University of Augsburg.
Urbanek is author of Exploratory Model Analysis and coauthor (with M. Theus) of Interactive Graphics for Data Analysis: Principles and Examples.
Why use R?
R’s strength is making it easy to create or call ad hoc functions specifically for data analysis, while also providing good graphics and visualization capabilities. For code or graphics done in other programs, R provides a uniform framework.
There are also practical reasons to use R:
The R packages
An R package is a bundle of R functions, data, and compiled code assembled for a specific task and contributed to the general community. Packages require no explicit programming to use (though source code is available for those wanting to modify things).
There are currently 4700+ packages at the CRAN repository that do everything from spatial modeling and big-data support, cutting-edge statistical models, and colored and textured graphs, to printing labels.
Besides CRAN, other package repositories include BioConductor (for analysis in different life science areas, such as tools for microarray, next-generation sequence and genome analysis), RForge (mostly experimental development packages), Omegahat (system interfaces).
Here’s a list of the top 100 downloaded packages.
Data object types in R
R has a wide range of sophisticated object types to represent and manipulate data. Some are core to R and some are specific to certain packages.
Within R, which can be described as a vector-based language, vectors are the most basic building blocks, making R amenable to mathematical notation. (Unlike in many other languages, multiplying a matrix by a number requires no explicit loop over the matrix’s elements.)
Besides vectors, other core R data objects include data frames (the most common data structure in R), arrays, and matrices.
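A short sketch of these core types in action, with made-up values:

```r
# Vectorized arithmetic: no loop needed to scale a matrix.
m  <- matrix(1:6, nrow = 2)   # a 2 x 3 matrix
m2 <- m * 10                  # every element multiplied at once

# Data frames hold columns of different types, like a table.
df <- data.frame(name = c("a", "b"), value = c(1.5, 2.5))
avg <- mean(df$value)         # columns are ordinary vectors
```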
R packages may have other object types, including points, line strings, and polygons.