Getting and Understanding the Bigger Picture
Simple, elegant, direct . . . graph visualizations can be informative, too. Without knowing anything about the underlying data and without even trying, you see relationships and patterns. The nodes and links could be Facebook connections, router-to-router links in a network, a transportation distribution hub, the path of a call through a call center, or the connections between objects in a compiled grammar. All are networks—a set of entities and the connections between them—and all can be analyzed through graph visualizations.
There is information in the connections. A glance is enough to identify nodes with the most links, nodes straddling different subgroups, and nodes isolated by their lack of connections. Corporations might look at a graph to verify that marketing and sales are communicating; urban planners, to monitor the interconnectedness, or isolation, of neighborhoods; biologists, to discover interactions between genes; and network analysts, to monitor security.
Graph creation made easy
Graph visualizations are everywhere. They need to be. Without them, there would simply be no way to make sense of the vast amounts of data collected.
Fortunately, graphs are easy to make thanks to automatic graph-drawing programs that take information describing a node and the node’s connections and automatically lay out a topology, handling the low-level details of how to arrange nodes so they don’t overlap and obscure one another. Feed in the data and get a picture.
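For example, Graphviz reads a plain-text description of nodes and edges in its DOT language and computes the layout itself. A minimal, illustrative input (the node names here are made up) might look like:

```dot
// A toy network described in Graphviz's DOT language.
// Rendering it, e.g. with "neato -Tpng net.dot -o net.png",
// lays out the nodes automatically.
graph net {
    hub -- router1;
    hub -- router2;
    router1 -- router2;
    router2 -- server;
}
```

Rendering this with one of Graphviz's layout tools, such as `neato` for undirected graphs, produces the picture with no layout instructions from the user.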
Aesthetics is important not so much for looks . . . but for readability.
A multitude of graph drawing programs is available, many offering add-ons such as editors to manually change the node layout and adjust colors, shapes, and backgrounds to achieve visually appealing graphs.
Aesthetics is important not so much for looks—though some visualizations can be stunning to look at—but for readability. Links that intersect and nodes that overlay one another result in poor readability, and graph visualization programs work hard to minimize the number of link intersections and give enough whitespace around each node to make it stand out from its neighbors.
One method to ensure a good distribution of nodes, force-directed layout, endows each node with a repelling force to push away neighboring nodes that are too close while spring-like attachments on links work to keep connected nodes clustered together.
The nodes find their optimal positions in turn. First one node applies its repelling force, pushing too-close neighbors farther away. Then the next node takes its turn, and then another, until each has had a chance to position itself. The sequence must then be repeated, since later nodes may have intruded into the space of earlier ones; many iterations may be needed before a state of equilibrium is reached.
Force-directed layouts are computationally heavy but work well for graphs of 100 to 200 nodes; at larger scales (100,000 nodes and more), the scheme breaks down under the weight of all the calculations. As the number of nodes increases, the number of calculations grows quadratically: for n nodes, the number of pairwise calculations is on the order of n². What works for small graphs won't work for large ones.
At a company like AT&T, which operates some of the world’s largest networks, the problem of scale is a perennial issue. AT&T must often develop custom solutions to address the scale issue. This is one reason it maintains a sophisticated and wide-ranging research effort.
What works for small graphs won’t work for large ones.
For the visualizations needed for its large data collections, AT&T adapted its own network visualization program, Graphviz, which had originally been developed for graphing software objects. (Because all networks have a similar mathematical definition, there is little difference between rendering software objects and router-to-router links in communications networks, apart from the scale factor.) For the program to handle layouts with millions of nodes entailed key advancements in network visualization, including the introduction of advanced geometric and numerical optimization algorithms. Entirely new approaches, including stress majorization and multilevel optimization, were also required.
Additional algorithmic enhancements were needed to ensure a readable, useful visualization, and Graphviz pioneered techniques in linear programming for aesthetic node placement, drawing curved edge splines around obstacles through numerical optimization, and robust overlap removal methods to preserve the overall structure of layouts while also making it possible to read labels even in tightly packed diagrams.
AT&T has made Graphviz available as open source software, and many other programs incorporate Graphviz as a visualization service for applications as diverse as databases and data analysis, bioinformatics, software engineering, programming languages and tools, and machine learning.
So how does Graphviz handle large datasets?
The first step is to reduce the size of the graph. This is an iterative process in which each step halves the number of nodes: one million nodes becomes 500,000, which becomes 250,000, and so on, all the way down to a manageable 50 or fewer nodes. At that point the graph is small enough for force-directed techniques.
Of course, the way in which the nodes are reduced has to be done carefully and above all uniformly to preserve the original overall structure. This is not easy, and certain substructures, such as star graphs with an inordinate number of connections, must be identified and handled as special cases.
Various filtering and aggregating techniques are used to identify nodes that can be deleted without destroying the overall structure. One method is graph coarsening, which identifies perfectly matched node-link arrangements that can be collapsed without altering the overall topology.
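One common coarsening pass greedily matches each node with an unmatched neighbor and merges each matched pair into a supernode, roughly halving the node count while preserving connectivity. The pure-Python sketch below illustrates the general technique, not Graphviz's actual code:

```python
def coarsen(nodes, edges):
    """One graph-coarsening pass (sketch): greedily match each node with
    an unmatched neighbor and merge each matched pair into a supernode,
    roughly halving the number of nodes while preserving connectivity."""
    neighbors = {n: set() for n in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    merged = {}        # original node -> supernode id (a tuple of originals)
    matched = set()
    for u in nodes:
        if u in matched:
            continue
        partner = next((v for v in neighbors[u] if v not in matched), None)
        if partner is None:
            merged[u] = (u,)                 # no free neighbor: stays as-is
            matched.add(u)
        else:
            merged[u] = merged[partner] = (u, partner)
            matched.update((u, partner))
    # Re-wire edges between supernodes, dropping internal and duplicate ones.
    coarse_nodes = sorted(set(merged.values()))
    coarse_edges = sorted({tuple(sorted((merged[u], merged[v])))
                           for u, v in edges if merged[u] != merged[v]})
    return coarse_nodes, coarse_edges
```

Applying the pass repeatedly shrinks a million-node graph toward the small core that force-directed techniques can handle.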
Laying out large graphs is a hard problem, but viewing them is also hard.
Once a good layout is found through a force-directed algorithm, the graph is then built up again to its original size. For a graph of 1 million nodes, the entire process takes around 15 minutes; considering the number of calculations carried out and the range of operations, this is exceedingly fast. This time can be reduced significantly if a slight reduction in layout quality is allowed.
Large graphs and their information
Laying out large graphs is a hard problem, but viewing them is also hard. Large graphs won't display within the confines of a computer screen. Viewing a small part of the whole at a time is a solution, but not when it's important to see how changes in one part of the graph affect other parts. In communications networks such as those maintained by AT&T, investigating how various elements within the network interact and affect other parts is sometimes the whole point.
The answer from Graphviz is an interactive graph viewer, Smyrna, designed to handle large graphs. In addition to panning and zooming, Smyrna provides a topological fisheye view that magnifies an area of interest while keeping the entire graph in view to maintain the needed context. This is done by collapsing nodes outside the focal area in a way that maintains the overall structure, with more distortion occurring in the nodes farthest from the focal area. The calculations for collapsing nodes are done in real time to ensure smooth transitions as the viewer navigates through the scene.
With Smyrna, the viewer can also navigate through three dimensions, which gives additional space to examine a graph's structure. Occlusions are more of a problem with 3D displays, though it's easy to manipulate the graph to see behind nodes. This ability to directly manipulate the graph structure, to turn it and to move individual nodes, involves the viewer with the graph in a way not possible with 2D graphs.
If you can move an individual node, the next natural step is to click a node for information (node information is very rich in Graphviz), which Smyrna supports. Clicking a node can access underlying data if the visualization associates an action with each node, such as opening a web page where the information can be found.
But more useful for accessing information and analyzing the data is Smyrna's ability to write queries and filter graphs based on attribute information and graph structure. Queries could locate all nodes sharing similar attributes, such as all people with the same last name or birthday, or locate all ISPs that recently handled a certain IP address.
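The general idea of attribute-based filtering can be sketched in a few lines of Python. Smyrna's actual query language differs; the data and the predicate here are hypothetical:

```python
def filter_graph(nodes, edges, predicate):
    """Reduce a graph to the subgraph induced by nodes matching a query.
    `nodes` maps node id -> attribute dict; `predicate` tests attributes."""
    keep = {n for n, attrs in nodes.items() if predicate(attrs)}
    sub_edges = [(u, v) for u, v in edges if u in keep and v in keep]
    return keep, sub_edges

# Hypothetical data: find all people sharing the last name "Hu".
people = {
    1: {"last": "Hu"},
    2: {"last": "Holten"},
    3: {"last": "Hu"},
}
links = [(1, 2), (1, 3), (2, 3)]
match, sub = filter_graph(people, links, lambda a: a["last"] == "Hu")
```

The query's result is itself a graph, so it can be laid out and explored just like the original.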
Filtering a graph reduces a large graph to just the part that is relevant to the particular query.
The distinction between visualization and interface is being blurred.
Until now, node-link layouts have emphasized representing relationships and have too often been the end of a process (feed in the data, see what it looks like, go back to the data). The ability to directly access data, write queries, and closely investigate individual nodes shows how visualizations can become part of the analysis process itself, answering questions such as "Why do some nodes cluster together?" and "Which nodes in the entire graph share the same attributes?"
The distinction between visualization and interface is blurring, and the line between visualization and analysis is being crossed.
For visualizations to truly be part of the analysis, some hard problems remain to be solved. One: networks change constantly, and understanding where these changes occur is an important part of network analysis. But finding changes in large graphs is difficult, for two fundamental reasons.
First, the sheer size of a visualization containing millions of nodes makes it hard to see changes, especially those happening in obscure areas that are not often investigated.
Second, the way in which large graphs are generated, where the initial node placement may be random, can cause a large graph to look different every time it's generated, even when the data changes are small and limited to a few nodes. People build a mental map of what a graph looks like, based on how they first see it, and look for changes by comparing the new layout with their mental map. Minor tweaking to add or delete a few nodes causes many nodes to self-adjust, so the visualization may look much different from the mental map even though the changes are small. New or better algorithms are needed to preserve a node layout so small changes don't result in big changes, and to do so in very fast computing time.
Other problems include handling small-world graphs, in which a high number of nodes are closely related, a common feature of social and communications networks. The resulting tight proximity and the number of shared links make layout particularly difficult. One solution being explored is to represent small-world subgraphs as a single, large node that expands when clicked.
There is no lack of incentive to find solutions to these and other problems of large graphs. Visualizations may be the only way to tackle the analysis of extra-large datasets, which will only become more numerous and bigger as data collection techniques multiply. Solutions will be found, and the only questions are what those solutions will be, and how soon.
Getting the Big Picture
Undirected layouts are used when there is no inherent ordering of the nodes. Graphs arising from communication and online social networks are undirected.
Circular layouts depict connections of nodes closely related to one another, such as those on a LAN.
Radial layouts depict networks focused on some central node, as arises in some social networks.
Matrices can also be visualized as node-link graphs. Yifan Hu of the Labs graphed over 2,000 sparse data sets using Graphviz.
Reducing the number of edge crossings is key to graph readability, and most graph layout algorithms consider the task of arranging nodes to reduce crossings.
But there are other ways of approaching the readability problem.
What if you start with a different premise, such as reducing the amount of ink? Reducing the total length of links would reduce the amount of ink.
In this exercise, the nodes were sorted around the circle in a way that reduces the length of edges. This produced layouts with fewer crossings than when heuristics were used to explicitly reduce crossings. And following an idea based on the work of Danny Holten (Eindhoven University of Technology), lines that traverse the interior share the same path when possible, reducing the perceived number of intersections as well.
The resulting increase in white space reduces the clutter.
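Counting crossings in a circular layout is itself simple: two straight chords cross exactly when their endpoints interleave around the circle. The pure-Python sketch below (node names illustrative) shows how reordering the same four-node ring changes the crossing count:

```python
from itertools import combinations

def circular_crossings(order, edges):
    """Count pairwise chord crossings when nodes are placed around a
    circle in the given order and edges are drawn as straight chords.
    Two chords cross iff their endpoints interleave around the circle."""
    idx = {n: i for i, n in enumerate(order)}
    chords = [tuple(sorted((idx[u], idx[v]))) for u, v in edges]
    crossings = 0
    for (a, b), (c, d) in combinations(chords, 2):
        if a < c < b < d or c < a < d < b:
            crossings += 1
    return crossings

# The same ring of four nodes, under two different circular orderings.
ring = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
```

Placing the ring's nodes in cycle order (A, B, C, D) yields no crossings, while interleaving them (A, C, B, D) forces two chords to cross, which is why the ordering step matters.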
But can the graph be made clearer still?
The solution: forget about saving ink, and try something new.
Rerouting links around the outside of the circle minimizes link crossings dramatically.
It uses more ink of course, but in graphs, clarity is all.