Into Main Memory: Nanocubes for Interactively Visualizing Billion-Point Data Sets
This article references several publicly available spatiotemporal data sets, including one of 210M tweets and another covering 20 years of US flight data. These data sets and others can be explored and visualized at http://www.nanocubes.net.
An interactive visualization lets analysts quickly understand the dominant patterns and relationships within a large data set.
For exploring large, complex data sets, nothing matches the power of interactive visualizations that let users directly manipulate data. With a few motions or mouse clicks, users can drill down for more detail or zoom out for a broader perspective while quickly seizing on the anomalies and outliers that reveal much about the data.
Unlike static visualizations preconfigured for a certain subset of data, interactive visualizations allow users to control what data is displayed and to incrementally arrange the display to answer specific questions, using the results to immediately select new data to refine the line of questioning. It’s an open-ended, iterative type of visual problem-solving, analogous to hypothesis generation, where one discovery stimulates new questions.
It all depends on smooth, real-time interactions between the user and the underlying visualization system.
But most visualization systems cannot maintain real-time interactivity for data sets with millions or billions of data points. The sheer number of computations needed to transform a query description into a query result prohibits interactivity. Massive parallelism offers a possible solution, but it scales poorly: every increase in data size requires a proportional increase in CPU power.
Also degrading interactivity is the latency introduced when massive data, too large for main memory, is moved to disk storage, slowing access and interfering with real-time interactions.
As visualization researchers, we weren’t willing to sacrifice interactivity even in the face of billion-point data sets. It was clear to us that the best way to achieve interactivity was to move all data into main memory. How to do that would be the challenge.
Nor was size the only issue. With the proliferation of GPS-enabled devices and the attendant interest in location-based services, more data sets have both temporal and spatial dimensions that we also wanted to incorporate into any visualization system we created.
Structures for aggregating data
While we had no preconceived idea in the beginning for how to proceed, the search quickly centered on data cubes, a data structure used in the database world to speed queries. Since analysts often care more about sums and totals than about individual records—the number of tweets made during the Super Bowl, the number of American Airlines flights delayed in Phoenix—the idea behind the data cube is to arrange the data so those answers are easy to compute.
For every possible query, a data cube will aggregate all similar data points—those sharing the same attributes—in a way that makes it easy to compute the aggregation counts needed by analysts. If a query is for the total number of tweets in Chicago for a certain day or hour, a data cube can almost instantly compute the answer.
A data cube aggregates similar data points into a cell to speed querying, making it possible to easily compute answers to a query—for example, the number of English and Spanish tweets in Chicago during a given hour, or the number of all tweets made between 6 and 7 PM.
A data cube is made up of rows and columns, each of which corresponds to a category, or attribute. At the intersection of two or more categories is a cell containing the aggregated data points that meet the attribute criteria for that time period. This data cube shows tweets with three attributes for four US cities over four hours. The highlighted cell contains the aggregated tweets that meet the query Number of tweets in English sent from Chicago between 6:01 and 7:00 PM. The highlighted column contains all tweets between 6 and 7 PM, regardless of location, device, application, or language.
Data cubes are meant to speed queries, not reduce memory requirements, and are generally too large to fit into the main memory of a typical PC. Depending on the particular encoding, a data cube may require more memory than the actual data set itself. Some data cube implementations create a cell for every possible combination of attributes, even if that combination never occurs in the data. If no tweets are sent between 2:30 AM and 3:30 AM in some small rural town, there is still a cell for that particular set of attributes. For data sets with many attributes, the data cube may contain many “empty” cells that nonetheless take up valuable memory.
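As a minimal sketch of the idea (the records and attribute values here are made up for illustration), a sparse aggregation keyed by attribute combinations computes exactly the counts a query needs, while a dense encoding would allocate a cell for every possible combination, occupied or not:

```python
from collections import Counter
from itertools import product

# Hypothetical tweet records: (city, device, language, hour bin)
tweets = [
    ("Chicago", "iPhone",  "English", 18),
    ("Chicago", "iPhone",  "English", 18),
    ("Chicago", "Android", "Spanish", 18),
    ("Phoenix", "iPhone",  "English", 20),
]

# Sparse aggregation: a cell exists only for combinations that occur.
cube = Counter(tweets)

# Number of English iPhone tweets from Chicago between 6 and 7 PM:
assert cube[("Chicago", "iPhone", "English", 18)] == 2

# A naive dense cube allocates a cell for EVERY combination, occurring or not:
cities, devices, langs, hours = (["Chicago", "Phoenix"], ["iPhone", "Android"],
                                 ["English", "Spanish"], range(24))
dense_cells = len(list(product(cities, devices, langs, hours)))
print(dense_cells, "dense cells vs", len(cube), "occupied cells")  # 192 vs 3
```

Even in this toy example the dense layout wastes most of its cells; with realistic attribute counts the waste grows multiplicatively.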
As a smart data cube encoding, a dwarf cube alleviates the space requirements needed in more naïve data cube encodings. By storing only aggregation counts, and not the individual records, a dwarf cube conserves memory since an aggregation count takes far less memory than storing the actual records. (Dwarf cubes won’t let you get back to the individual records directly, but that’s not generally the purpose of visualizations, which are intended to give a high-level overview of all the data. When needed, individual records can be tracked back to the original database.)
A dwarf cube eliminates other drawbacks of the data cube: it avoids empty cells by using a tree structure that creates cells only for those subsets that actually exist in the data set.
Dwarf cubes reduce memory requirements for large data sets but do not efficiently handle spatial and temporal dimensions. The previous graphic of a cube showing four cities treats location almost as a category of data, when in reality analysts want to smoothly move between different locations, whether it’s all of Chicago or just a few city blocks. Smooth zooming and panning from one lat/long coordinate to another requires that location be treated as a continuous type of data. The same is true for time; users might want to go from 15-minute increments to 1-year increments. Dwarf cubes aren’t well-suited for continuous dimensions, but the visualization world has well-defined methods for aggregating across time and moving smoothly between locations.
In a nanocube, we have merged a dwarf cube’s reduced memory footprint and fast querying with the ability to handle spatial and temporal dimensions. A nanocube retains the hierarchical tree structure of a dwarf cube but introduces a layering concept that includes three types of dimensions: spatial, categorical, and temporal, always traversed in that order. Each dimension has a start node and is subdivided into levels, the number of which depends on the data set. A query is a path through this tree.
A nanocube is a smart encoding of a dwarf cube. It avoids empty cells (or nodes), allows paths to jump levels and dimensions, and stores aggregation counts for each unique subset once only, while allowing multiple paths to point to it.
The first dimension to be traversed is the spatial one, which in a nanocube can have up to 25 levels. The start point is the entire-world level, followed by successively more granular levels all the way down to city blocks. (This is similar to Google Maps' 17 spatial layers, with the highest-resolution level translating, depending on the imagery, to about 10-30 meters.)
The nanocube uses the quadtree, a well-known spatial indexing technique in visualization, to enable users to smoothly zoom in or out from one location to another. A quadtree divides the world into four quadrants; each quadrant has four children, and each child likewise has four children of its own. In this way, the world is successively divided into finer tiles, allowing users to visualize spatial data at arbitrary zoom levels.
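A sketch of how such a quadtree addresses a location, using the standard Web Mercator tile math (the function name and the quadrant numbering are our illustrative choices, not the nanocube implementation):

```python
import math

def quadtree_path(lat, lon, levels):
    """Return the quadrant (0-3) chosen at each level for a lat/lon,
    using the standard Web Mercator tiling. Illustrative only."""
    n = 2 ** levels
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    path = []
    for level in range(levels - 1, -1, -1):
        # Each level contributes one bit of x and one bit of y.
        path.append(((y >> level) & 1) * 2 + ((x >> level) & 1))
    return path

# Chicago (41.88 N, 87.63 W) at three zoom levels:
print(quadtree_path(41.88, -87.63, 3))  # [0, 3, 0]
```

Each additional level appends one more quadrant digit, so a deeper path is always a refinement of a shallower one; that prefix property is what makes smooth zooming cheap.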
The number of levels depends on the required resolution for a data set. A data set of tweets or cell towers requires a very high resolution to differentiate closely spaced locations that may be only a few feet from one another. The flight data set requires far less resolution since airports are generally several miles away from one another and can be distinguished by a much coarser resolution.
The categorical dimensions follow the spatial one and contain the attributes appropriate to the data. A Twitter data set might have three categories—device, application, language—and a data set of flight data might have two: departure delay and airline.
The last dimension is the temporal one; it consists of discrete time bins that together compose a time series associated with each node. For each time series, there is an aggregation count, which maps to a memory location. To optimize memory usage, aggregation counts are stored only for actual data occurrences.
Time, like the spatial dimension, has a resolution adjustable for each data set and might be divided into 1-minute, 1-hour, 1-day, or any other temporal increments.
Each query enters at the top node (world) and proceeds downward, traversing within each dimension only the number of levels required by the constraints the query imposes. In the spatial dimension, All tweets in the US continues only to the 2nd spatial level. Once the constraints of the query are fulfilled in one dimension, the query jumps to the start node of the next. For tweets, the next dimension might be device, language, or application. The query All iPhone tweets from the Empire State Building, after traversing 25 levels of the spatial dimension, would then need to traverse the device dimension.
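A toy version of this layered traversal, with a count kept at every node (the node layout and labels are illustrative, not the actual nanocube code):

```python
class Node:
    """Toy tree node: children keyed by label, plus a count of all
    data points that passed through this node during insertion."""
    def __init__(self):
        self.children = {}
        self.count = 0

def insert(root, path):
    """Insert one data point whose full path is spatial quadrants,
    then categorical values, then a time bin."""
    node = root
    node.count += 1
    for label in path:
        node = node.children.setdefault(label, Node())
        node.count += 1

def query(root, path):
    """Follow only as many labels as the query constrains."""
    node = root
    for label in path:
        if label not in node.children:
            return 0  # empty subset: no cell was ever created for it
        node = node.children[label]
    return node.count

root = Node()
# Two tweets: (quadrant, quadrant, device, hour bin)
insert(root, [0, 2, "iPhone", 18])
insert(root, [0, 2, "Android", 18])

print(query(root, [0, 2]))             # 2: all tweets in that spatial tile
print(query(root, [0, 2, "iPhone"]))   # 1: constrain the device dimension too
```

A query that stops partway down a dimension, like the spatial-only query above, simply reads the count at the node where its constraints end.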
Doubling up on memory locations
Nanocubes, like dwarf cubes, don't allow empty cells, conserving memory that data cubes often waste by creating cells for every possible combination of attributes, whether needed or not. But removing empty cells by itself won't necessarily fit a billion-point data set into main memory.
An additional way to conserve memory is to avoid duplicate memory locations that store the same aggregation count for the same set of data points. If multiple queries result in the same set of data points, a nanocube stores the aggregation count for those data points one time in a shared memory location, and each appropriate query points to that same memory location. For example, if a single tweet fulfills two separate queries (All iPhone tweets in German in Australia at 7:00 PM and All iPhone tweets in German in Sydney at 7:00 PM), both queries may well point to the same memory location. (Distinguishing among the aggregation counts themselves—that is, whether an aggregation of 3 refers to a set of Android tweets, a set of iPhone tweets, or any other query—is possible because the attributes get encoded as the query traverses the tree structure.)
Multiple queries, though seemingly unrelated, may point to the same time series and the same memory location.
Sharing memory locations vastly reduces the memory requirements of nanocubes. In one example, the nanocube built on the Twitter data set would have required four times as much memory had memory locations not been shared. While dwarf cubes also share some memory locations, nanocubes search more systematically and rigorously for queries returning the exact same set of data points, extending the search even to queries that seem unrelated. It isn't readily apparent that the query All Android tweets in the eastern US (no time specified) and the query All Android tweets in the US between 8:01 and 9:00 would be fulfilled by the exact same data point, but that is what occurs in the previous figure.
Finding subsets whose content is identical is mathematically difficult, especially in a data set of billions of points: it requires combinatorial reasoning to prove that a set of points doesn't change under various time or spatial constraints. The easier approach, whether queries are related or not, is simply to create two separate aggregation counts and store them in separate memory locations. But that approach wastes valuable memory.
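The sharing idea, as in the Sydney/Australia example above, can be sketched by interning identical time series so that two query paths resolving to the same series end up pointing at one object (the real nanocube builds this sharing structurally during construction; the table here is only an illustration):

```python
# Canonical time series, keyed by their content.
shared = {}

def intern(series):
    """Return a shared copy of this time series: if an identical
    series was stored before, reuse it instead of storing a duplicate."""
    key = tuple(series)
    return shared.setdefault(key, series)

# One tweet at 7 PM fulfils both queries, so both resolve to the
# same four-bin time series.
sydney_iphone_de = intern([0, 0, 1, 0])      # All iPhone tweets in German in Sydney
australia_iphone_de = intern([0, 0, 1, 0])   # All iPhone tweets in German in Australia

# Both queries point at the same object: one memory location, not two.
print(australia_iphone_de is sydney_iphone_de)   # True
print(len(shared))                               # 1
```

Without interning, each query path would carry its own copy of the series; with it, any number of paths can share a single stored aggregation.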
Cumulative totals to reduce the number of queries
Conserving memory was one requirement for the nanocube; another was speeding queries. While a single query is very fast—about 1 millisecond when data is in main memory—analysts often create multiple-step queries such as Number of total flights out of Phoenix during 2007 or Number of tweets in March 2012 vs number in March 2013, each of which requires many computations across a potentially large number of time bins.
The naïve approach to summing the number of points over a multiple-bin window would be to simply query each bin in the sequence to get a count and then sum the counts. Querying 24 bins in this way would require 24 separate queries; assuming 1 millisecond each, 24 queries would take 24 milliseconds. It's only a slight delay, but things can quickly get out of hand. A query for the number of flights out of Phoenix in 1998, assuming 4-hour bins, would require querying 6 bins x 365 days = 2,190 bins, over 2 seconds of delay that noticeably disrupts interactivity.
Rather than separately querying each bin in sequence, a nanocube stores a running total in each bin, adding that bin's count to the total of all previous bins. Thus if bins 1 through 10 each contain 2 points, bin 10 holds an accumulated total of 20, and obtaining this total requires a single query to bin 10. Obtaining the count over a contiguous range of bins requires two queries (one to the bin just before the range begins and one to the last bin in the range) and taking the difference. Flights from Phoenix in July of 2013 requires querying the last time bin of June 2013 and the last time bin of July 2013, then subtracting.
Keeping a cumulative count across contiguous bins (a prefix sum) is a well-known method but is not often applied to temporal data.
Keeping a running total reduces the number of queries. Empty bins between the start and ending bins are ignored.
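A sketch of the running-total scheme, assuming ten bins of two points each (the bin layout is made up for illustration):

```python
from itertools import accumulate

counts = [2] * 10                  # ten time bins, two points in each
cumulative = list(accumulate(counts))

# One query to bin 10 gives the total over bins 1-10.
print(cumulative[9])               # 20

def range_count(cumulative, first, last):
    """Points in bins first..last (1-indexed): two lookups and a subtraction."""
    before = cumulative[first - 2] if first > 1 else 0
    return cumulative[last - 1] - before

print(range_count(cumulative, 3, 7))   # 10: bins 3 through 7
```

Any contiguous window, no matter how many bins it spans, costs the same two lookups.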
Nanocubes from real-world data sets
Our original intent in building nanocubes was to put very large spatiotemporal data sets into main memory in order to achieve interactivity. We have achieved this objective by building on work done previously in the fields of interactive visualizations and databases, particularly dwarf cubes. The innovation is in combining the memory-saving techniques of dwarf cubes with spatiotemporal techniques such as the quadtree algorithm, and by writing more efficient code that systematically hunts for those queries sharing the aggregation counts, thus further reducing memory requirements.
The data-set examples of three and five tweets used in this article are for illustration purposes. In the real world, nanocubes scale well, and we have successfully built nanocubes from a variety of very large data sets, including those detailed in the following table. All fit into main memory. In another example, the nanocube built from the now-defunct Brightkite social site originally contained 4.5M data records (documenting times and locations of check-ins); the resulting nanocube took 1.6 GB of main memory and not quite 4 minutes to build.
*Counts both tree and leaf nodes. As an example, the nanocube of five tweets has 41 nodes: 22 tree nodes + 19 leaf nodes (number of tweets returned by queries).
** How much larger the nanocube would be without the sharing mechanism.
The sharing column illustrates the extent to which sharing memory locations contributes to the small size of nanocubes. Without the sharing mechanism, the nanocube built on the CDR data set would require 18.60 times more memory than it does with sharing.
Because of our interest in visualization, we built nanocubes to support scatterplots, histograms, parallel coordinate plots, and choropleth and heat maps, while providing the standard toolset of filters, zooming, and details-on-demand. We have also made it easy to share nanocubes. In addition to a PC front end, a browser version is available so anyone can explore a nanocube using a recent version of Chrome or Firefox. The convenience of the browser version comes at the cost of a more limited tool set than the PC version, and also a bit more latency since all rendering is done on the server. (For even more convenience, an iPhone version is being planned.)
But nanocubes are not limited to visualizations. Nanocubes could be used to restructure data for almost any query engine or used as a backend for other data cube visualization applications.
The nanocube concept remains a work in progress, and the wish list is long, starting with additional support for streaming data. While building a nanocube is currently single-threaded, we intend to provide a data-parallel building option to speed up the preprocessing time.
New work is being done to support multiple spatial dimensions and additional categories (the current nanocube implementation allows one spatial and one temporal dimension). Additional spatial dimensions will enable building nanocubes from data sets that record a duration with distinct start and end locations and times: airline flights that take off from one airport at a certain time and land at a different airport at a later time; taxicab rides with a pickup time at one location and a drop-off time at another; anonymized CDR data that records the beginning time and location of a call along with its end time and location.
Nanocubes is open-source software. We are excited about the prospect of providing larger communities with the capabilities of nanocubes, as well as a broader collaboration with the open-source visualization community. The nanocubes source code is currently hosted at http://github.com/laurolins/nanocube, and the project website is http://www.nanocubes.net.
For a detailed technical description of nanocubes, see the paper Nanocubes for Real-Time Exploration of Spatiotemporal Datasets.
About the authors
Lauro Lins is a Senior Member of Technical Staff at AT&T Advanced Technology. His research focuses on algorithms, data structures, encodings, and systems for enabling intuitive and scalable data visualization. He joined AT&T in October 2012. He received a BSc/MSc in Computer Science and a PhD in Computational Mathematics from Universidade Federal de Pernambuco (Brazil, 1996-2007). During this period, he also worked as the main software designer and developer of LinSoft, a company that deployed optimization software for industrial operations. From 2008-2010, Lauro worked as a post-doc at the Scientific Computing and Imaging Institute at the University of Utah, and from 2011-2012 he was an Assistant Research Professor at NYU-Poly.
James Klosowski is a Principal Member of Technical Staff in the Information Visualization Research Department at AT&T Labs. His current research focuses on the interactive visualization of large datasets, with an emphasis on algorithms, data structures, and scalable systems. He received a BS degree in computer science and mathematics from Fairfield University in 1992, and MS and PhD degrees in applied mathematics from the State University of New York at Stony Brook in 1994 and 1998, respectively. Upon graduating, he joined the IBM T. J. Watson Research Center as a Research Staff Member, working in the areas of interactive computer graphics and scalable visualization. He has worked at AT&T Labs - Research since 2009.
Carlos Scheidegger joined AT&T Labs - Research in 2009, after finishing his PhD at the University of Utah in the SCI Institute and School of Computing. He is now a Senior Member of the Technical Staff in the Information Visualization Research Department at AT&T Labs. He is interested in large-scale data visualization, exploratory data analysis, and the provenance and data management issues of computational tasks. He has authored over 30 research papers and received best paper awards at VisWeek 2007 and SMI 2008. He is currently working on the next generation of data analysis, collaboration, and visualization infrastructure tools for the world wide web, including the open-source RCloud project.
How to count data points in a visualization
What does it mean to visualize one billion data points? It depends on how and what you count.
Records in a database are easy to count, but visualizations aren't typically constructed from database records. Rather, they are created from precomputed sums of similar records: the number of flights departing Phoenix on March 1, 2006, or the number of tweets sent on New Year's Eve. By storing the aggregated count of all data points sharing the same attributes, a potentially large number of data records can be reduced to a single data point.
It’s these aggregation counts that are used in creating a visualization. And since one aggregation count represents many data points, fewer data points are required to build a visualization.
A data set of 1B data points might be visualized with 200M data points, depending on the number of unique attribute combinations. Two extreme cases illustrate the point: if all 1B records share the exact same attributes—tweets sent this month—they could be visualized with a single data point; if all 1B records were unique, each generated every 1/10th of a second, it might well take 1B data points to create a visualization.
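A small illustration with made-up records of how aggregation shrinks the number of data points a visualization needs:

```python
from collections import Counter

# 1,000 hypothetical records, but only 3 distinct attribute combinations
# (here each record is reduced to a single attribute, the city):
records = ["Chicago"] * 600 + ["Phoenix"] * 300 + ["Austin"] * 100

aggregated = Counter(records)
print(len(records))      # 1000 raw records...
print(len(aggregated))   # ...visualized with just 3 data points
```

The ratio between raw records and aggregated points depends entirely on how many distinct attribute combinations the data contains.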