Data exploration and discovery often begin with "seeing all the data." Effective overviews reveal clusters, gaps, outliers, trends, correlations, and other patterns that merit further consideration. But when we have hundreds of millions of points on a map, how can we see "everything"? Because of overplotting, simply plotting all the points won't work. Moreover, interesting location-based data often includes other attributes such as usage, signal strength, and, in particular, time. The ability to zoom in on maps and to select time ranges and other attributes is a powerful way of gaining insight into patterns for location-based analytics.
Nanocubes enable interactive web-client visualization of time-varying data sets with a billion points. A nanocube server maintains an in-memory data structure of rollup summaries of geospatial data sets at multiple levels of scale. The idea is to exploit the natural sparsity of real-world geospatial data, the possibility of sharing common data blocks, and the efficiency of presenting concise images such as heat maps. The data structure allows very fast queries over subranges, such as map regions, time ranges, and other selections. Details can be found in this research report. Nanocubes were invented by Jim Klosowski, Lauro Lins, and Carlos Scheidegger. They have been tested on data sets such as 220 million geolocated tweets and a set of a billion call detail records from a fraud-avoidance application. We also deployed our prototype to explore the records captured by AT&T's "Mark the Spot" (MTS) smartphone app. MTS allows customers to report locations where there are problems with wireless coverage or service. The data contains several million points reported over about two years, with report types (voice, data, or coverage problems), device types (iPhone, Android, or Windows phone), and timestamps. Interactive time-range selection makes it possible to see how investments in the radio network improved customer service and satisfaction, and where to make further improvements.
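To make the rollup idea concrete, here is a toy sketch (not the actual nanocube structure, which shares subtrees and is far more compact): points are pre-aggregated into quadtree-style tiles at every zoom level, each tile keeping per-hour counts, so a region-and-time query touches a few precomputed cells instead of the raw points. The class and method names are invented for illustration.

```python
from collections import defaultdict

class TinyCube:
    """Toy multiresolution count index in the spirit of a nanocube.
    Real nanocubes share common subtrees to stay compact; this sketch
    just materializes every (level, tile, hour) count."""

    def __init__(self, max_level=8):
        self.max_level = max_level
        # (level, tile_x, tile_y) -> {hour_bin: count}
        self.cells = defaultdict(lambda: defaultdict(int))

    def add(self, x, y, hour):
        """Insert one point; x, y in [0, 1), hour is an integer time bin.
        The point is rolled up into one tile per zoom level."""
        for level in range(self.max_level + 1):
            n = 1 << level                    # tiles per axis at this level
            tile = (level, int(x * n), int(y * n))
            self.cells[tile][hour] += 1

    def count(self, level, tx, ty, h0, h1):
        """Count of points in one tile over the hour range [h0, h1]."""
        hours = self.cells.get((level, tx, ty), {})
        return sum(c for h, c in hours.items() if h0 <= h <= h1)
```

A query for a map region at a given zoom level sums `count` over the visible tiles; because the counts are precomputed at insert time, the cost is independent of the number of raw points.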
How can we discover structure and meaning in large sets of geospatial tracks? Tracks or trajectories describe individual human movement on foot or in vehicles, the paths of freight shipping containers, planes, weather patterns, and animal migrations. When there are thousands of tracks, plotting them all on a map seems to hide more than it reveals. How can we visually analyze trajectories? In 2007, Jae-Gil Lee, Jiawei Han, and Kyu-Young Whang from the University of Illinois showed that it is possible to group large sets of tracks into clusters with similar features. By visualizing clusters, we can better see and understand the underlying patterns. Usually just a few clusters cover most of the useful patterns - a variant of Edward Tufte's "small multiples". This insight was a crucial first step toward useful visualization.
In 2012 and 2013, AT&T researchers and academic partners took the next steps. Traditional clustering methods are based on having or computing distances between tracks. It is challenging even to define the concept of distance between tracks (represented by nonuniform samplings of locations), and the cost of computing distances between all pairs of trajectories is a serious bottleneck. To break this bottleneck, Jim Klosowski and Carlos Scheidegger, collaborating with former AT&T colleague Claudio Silva (now at NYU-Poly) and summer intern Nivan Ferreira, invented Vector Field K-Means trajectory clustering. A key insight is that in all but the simplest examples, no single trajectory adequately represents an entire cluster. Instead, they model clusters as 2-D vector fields derived from the data. They developed an alternating trajectory clustering algorithm: each iteration first recalculates the vector fields (each representing a cluster center), then reclusters the trajectories by matching each to the best-fitting vector field. Initial clusters are generated from randomly chosen sample trajectories. Their method yields impressive results, is insensitive to partial or incomplete trajectories, and scales to hundreds of thousands of tracks. A description of their method, including experiments with human geotracks and hurricane paths, was presented at Eurovis 2013.
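The alternating scheme can be sketched in miniature. This is a deliberately simplified toy, not the published algorithm: each cluster's "vector field" is just the per-grid-cell mean step velocity of its member trajectories, and a trajectory's fit to a field is the squared error between its step velocities and the field. The paper seeds clusters from randomly chosen trajectories; here the seeds are passed explicitly so the example is deterministic. All names (`GRID`, `fit_field`, `vf_kmeans`) are invented for illustration.

```python
GRID = 4  # coarse grid resolution for the toy vector fields

def segments(traj):
    """Yield (grid cell, step velocity) for each step of a trajectory
    whose points lie in the unit square."""
    for (x0, y0), (x1, y1) in zip(traj, traj[1:]):
        yield (int(x0 * GRID), int(y0 * GRID)), (x1 - x0, y1 - y0)

def fit_field(trajs):
    """Cluster 'center': per-cell mean velocity of the member segments."""
    sums = {}
    for t in trajs:
        for cell, (vx, vy) in segments(t):
            sx, sy, n = sums.get(cell, (0.0, 0.0, 0))
            sums[cell] = (sx + vx, sy + vy, n + 1)
    return {c: (sx / n, sy / n) for c, (sx, sy, n) in sums.items()}

def error(traj, field):
    """How badly a field predicts this trajectory's step velocities."""
    e = 0.0
    for cell, (vx, vy) in segments(traj):
        fx, fy = field.get(cell, (0.0, 0.0))
        e += (vx - fx) ** 2 + (vy - fy) ** 2
    return e

def vf_kmeans(trajs, seeds, iters=5):
    """Alternate: reassign each trajectory to the best-fitting field,
    then refit one vector field per cluster from its members."""
    fields = [fit_field([s]) for s in seeds]
    labels = [0] * len(trajs)
    for _ in range(iters):
        labels = [min(range(len(fields)), key=lambda j: error(t, fields[j]))
                  for t in trajs]
        for j in range(len(fields)):
            members = [t for t, l in zip(trajs, labels) if l == j]
            if members:
                fields[j] = fit_field(members)
    return labels, fields
```

On a handful of eastbound and westbound tracks, two iterations suffice to separate the directions, even though the tracks overlap spatially, because the fields disagree on velocity where the tracks share cells.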
Taking this concept further, Shankar Krishnan from AT&T and Amitabh Varshney and Cheuk Yiu Ip from the University of Maryland developed a more general way to think about trajectory clustering. In their method, clusters are represented by kernel density estimates. This handles richer types of trajectories, with additional parameters such as time, speed, size, weight, and uncertainty in sample locations. Their method is less sensitive to missing data or gaps in trajectories, and their solver runs faster.
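The core idea of a density-based cluster representation can be illustrated in a few lines. This is a toy sketch, not the published method: each cluster is modeled as a Gaussian kernel density estimate over its members' sample points, and a trajectory is assigned to the cluster whose density best covers its points (mean log-density). The function names and the fixed bandwidth are assumptions for illustration.

```python
import math

def kde_density(points, x, y, bandwidth=0.1):
    """Gaussian kernel density estimate at (x, y) built from sample
    points; a cluster's 'model' is just this density over its members."""
    h2 = bandwidth ** 2
    total = 0.0
    for px, py in points:
        d2 = (x - px) ** 2 + (y - py) ** 2
        total += math.exp(-d2 / (2 * h2)) / (2 * math.pi * h2)
    return total / len(points)

def assign(traj, cluster_samples):
    """Assign a trajectory to the cluster whose density best covers it,
    scored by mean log-density over the trajectory's points."""
    def score(samples):
        return sum(math.log(kde_density(samples, x, y) + 1e-12)
                   for x, y in traj) / len(traj)
    return max(range(len(cluster_samples)),
               key=lambda j: score(cluster_samples[j]))
```

Because the density is built from all of a cluster's samples, a trajectory with gaps still scores well wherever its surviving points fall in high-density regions, which hints at why a density representation tolerates missing data.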
Through work on trajectory clustering and visualization, AT&T Labs is moving beyond analysis of static locations, to understand dynamic patterns of movement, at orders of magnitude more scale than previous methods.
Social media has brought about a revolution not only in how people communicate, but in how companies are seen by the public, manage relationships with their customers, and respond to emerging issues. Companies want to know what people are saying about them online. This requires exploring and understanding what people are discussing. As in many types of data visualization, an important challenge is to present understandable mid-level views that show useful structure without oversimplifying or getting lost in details.
TwitterScope is an interactive social media analytics tool. It presents an online message stream as a vivid, animated topic map or diagram. The map shows the main topics while they are changing, and allows drill-down to look at the details of individual messages. TwitterScope complements conventional tools for sentiment analysis and topic tracking. The input is typically a stream of messages filtered on keywords or hashtags, such as "#news" or "AT&T iPhone" with expected rates up to dozens or hundreds of messages per minute. The messages are clustered and visualized in a web client. Analysis is performed by Latent Dirichlet Allocation (LDA). The visualization shows Graphviz GMap diagrams (in a geographic map metaphor), and rectangular "compact maps" for more structured layouts. You can try an interactive demo using several fixed keywords. An internal experimental version searches for any keywords entered interactively. The prototype was created by Emden Gansner, Yifan Hu, Stephen North, and intern Xiaotong Liu.
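The LDA step in a pipeline like this can be sketched with a minimal collapsed Gibbs sampler. This is a textbook toy, not TwitterScope's implementation: it infers, for each message, how many of its words belong to each latent topic. Hyperparameters `alpha` and `beta` and all names are assumptions for illustration.

```python
import random

def lda_gibbs(docs, n_topics, vocab, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for Latent Dirichlet Allocation.
    docs: lists of word tokens. Returns per-document topic counts."""
    rng = random.Random(seed)
    V = len(vocab)
    word_id = {w: i for i, w in enumerate(vocab)}
    ndk = [[0] * n_topics for _ in docs]        # doc-topic counts
    nkw = [[0] * V for _ in range(n_topics)]    # topic-word counts
    nk = [0] * n_topics                         # topic totals
    z = []                                      # topic assignment per token
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            t = rng.randrange(n_topics)
            zs.append(t)
            ndk[d][t] += 1; nkw[t][word_id[w]] += 1; nk[t] += 1
        z.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t, wid = z[d][i], word_id[w]
                # remove this token's current assignment, then resample
                ndk[d][t] -= 1; nkw[t][wid] -= 1; nk[t] -= 1
                weights = [(ndk[d][k] + alpha) * (nkw[k][wid] + beta)
                           / (nk[k] + V * beta) for k in range(n_topics)]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][wid] += 1; nk[t] += 1
    return ndk
```

In a streaming setting, the per-document topic mixtures returned here are what drive the map layout: messages with similar mixtures land in the same region of the topic map.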
Statistical computing and interactive graphics changed data exploration forever, but current tools are still based on standalone personal computing. The next step is to bring web search, collaboration, sharing, recommendations, and data publishing to exploratory data analysis. We are basing our work on the popular R statistical programming language. R-in-the-Cloud presents a human interface in web clients, but computation is performed in the cloud. This provides the best of both worlds: high interactivity, sophisticated graphics, and portability across many devices, with the stability, performance, and implicit sharing of a common cloud infrastructure. Workbooks in R-in-the-Cloud are automatically versioned and made persistent. The basic toolkit provides support for publishing, turning live experiments into production websites.
Another innovation is support for indexing and searching over both code and data, with the opportunity to exploit metadata to answer questions like "What packages or data feeds are usually used with this kind of data?", "What packages support the analysis that I need?", and "Who else works on this data?" We are working with external partners in the open-source community to accelerate the introduction of these concepts into other next-generation programming and data analysis environments. The R-in-the-Cloud project is led by Carlos Scheidegger and Simon Urbanek.
Image and Streaming Video Upsampling
Devices like Apple's Retina display and 4K TVs bring exceptional display technology to the mass market, but create new challenges in presenting multimedia content. Often, images and videos need to be shown at higher resolutions than those at which they were recorded or generated. Communication and storage bottlenecks may also limit the resolution of content. Upsampling, or upscaling, generates a larger image or video from a smaller one. The challenge is to do this in a way that does not make images blurry or blocky. A common approach is to copy the smaller image into a larger grid and average nearby pixels to fill in the missing content and smooth the larger image. The problem is that most people can see the resulting glitches and blurring. A solution proposed by Jim Klosowski and Shankar Krishnan is to apply content-specific filters during upscaling. Through high-performance algorithms on a CPU or GPU (parallel graphics processor), their upscaler detects edges, corners, and similar structures in the smaller image and preserves them in the larger image. The results are noticeably sharper and clearer. This work is being presented at IEEE ICIP in September 2013. It is a practical example of applying computation to save storage and networking, improving the web and multimedia experience on future consumer video devices.
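The "common approach" described above is essentially bilinear interpolation. Here is a minimal sketch of that baseline (not the content-aware method from the paper, whose filters are not spelled out here): each target pixel is mapped back into the source grid and filled with a weighted average of its nearest source pixels, which is exactly what blurs edges.

```python
def upsample_bilinear(img, factor):
    """Upscale a 2-D grayscale image (list of rows of numbers) by an
    integer factor, averaging the nearest source pixels. This is the
    blurry baseline that content-aware upscalers aim to improve on."""
    h, w = len(img), len(img[0])
    H, W = h * factor, w * factor
    out = [[0.0] * W for _ in range(H)]
    for Y in range(H):
        for X in range(W):
            # map the target pixel back into source coordinates
            y = min(Y / factor, h - 1)
            x = min(X / factor, w - 1)
            y0, x0 = int(y), int(x)
            y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
            fy, fx = y - y0, x - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bot = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            out[Y][X] = top * (1 - fy) + bot * fy
    return out
```

Upscaling a hard black-to-white edge with this routine produces intermediate gray values at the boundary; an edge-aware upscaler would instead detect the discontinuity and keep it sharp in the output.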