
Managing Data at AT&T scale

by: Divesh Srivastava, Rick Greer, Theodore Johnson, Oliver Spatscheck, Vladislav Shkapenyuk, Wed Jun 06 13:29:00 EDT 2012

How you store and update data and how you search it determines the value of the extracted information and how fast that information is available for responding to current conditions. At massive scale, every aspect of data management has to be evaluated against memory, hardware, and computing limitations.

Move massive amounts of data, and use up resources for processing. Update in large batches, and forfeit the chance for real-time information. Distribute data in various locations on disk, and wait longer to access it. Build ad-hoc, specialized tools, and lose the flexibility to adapt quickly to changing conditions.

For AT&T, analyzing streaming data on its internal network, and doing it in real time, is key to maintaining performance at the network, service, and application levels. Real-time analysis of network data gives an up-to-the-second picture of network health and usage, telling network managers where traffic volume is building, delays are occurring, packets are being dropped, and where a denial-of-service attack is starting. It also gives a view into the customer experience, letting managers “see” when web pages are loading too slowly or game commands aren’t executing fast enough.

By the time data is loaded into tables, verified for accuracy, indexed, and queried, the situation has long since changed.

Other companies also rely on real-time analysis of streaming data. It enables retailers to reduce inventory and precisely track goods anywhere along the supply chain; it enables credit card companies to analyze purchases at the point of sale, before fraud occurs; and it helps financial companies react almost instantaneously to the slightest price fluctuations.

It’s not just businesses chasing competitive advantages. More accurate weather reports, a smarter, more energy-efficient electric grid, and better, less expensive medical care through telehealth will all be possible using real-time analysis of streaming sensor data. In the case of telehealth, with doctors alerted at the first sign of heart attack or stroke, streaming data will save lives.

But only if data management methods are in place to handle massive amounts of streaming data. The traditional method of feeding batches of data into databases and running queries won’t work with data that changes moment by moment. By the time data is loaded into tables, verified for accuracy, indexed, and queried, the situation has long since changed.

A history of streaming data

AT&T’s long history of managing data dates back to when AT&T was a telephone network company; instead of IP packet streams, there were call detail records (CDRs).

CDRs were originally generated for billing purposes, but it was realized fairly early on that they also contained useful information about calling patterns: which days of the week or hours of the day had the heaviest call volume? Where did most calls originate? Being able to compare current patterns with the historical calling record would make it easy to see developing trends and allow the company to get ahead of demand and build up the system where needed. Retrospective analysis of historical information turned out to be enormously important in ensuring the quality of the phone system. But it first required a database capable of storing large amounts of data.

The design of the database and the methods to search and update the information were influenced by the way CDRs arrived: continuously and in ever-increasing volume as the number of phones and then cell phones increased.

Today, network data also arrives continuously, though now it is said to stream. As with CDRs, network data contains information critical to ensuring the quality of the network. Information extracted from network data is important in ensuring high performance at the network, service, and application levels.

The main differences today are the scale, the speed at which events occur on a highly interconnected network, and the potential for problems to rapidly radiate outward. Network data on AT&T’s internal network has to be analyzed in real time so managers can react instantaneously, pulling almost any statistic that helps track down the cause of a problem and correlate events.

But how, out of millions of simultaneous events, do you find just those related to a specific problem? This is the difficulty of network data.

Massive is defined today at AT&T scale as 28 petabytes of data per day, and fast-moving as 80 gigabits/second.

Streaming data is unbounded and a mix of data types, moving at high speeds

Specialized tools and other ad-hoc solutions created specifically to do one thing (monitor for alerts, say, or obtain a certain measurement type) lack flexibility and require network managers to know ahead of time what measurements are needed. But managers often don’t.

Managers need to start with the entire picture and then zero in on a single sequence of events. Is the problem affecting specific applications, or specific servers? What protocols are affected? If half the HTTP packets are dropping, why just some? Why not all? What is the common link among the dropped packets?

If the data were in a database, and not streaming, a network manager could simply query the database. The database would then move the appropriate table from disk into memory (or parts of the table if the table is too large for memory) and perform the query in one or more passes.

Streaming data can’t be queried in the same way. There is no table with data sitting neatly arranged and formatted, only a rush of mixed data that can’t be stopped for multiple scans or even slowed. A mere 10-millisecond delay blocks other data coming down the pipeline. Eventually packets get dropped, maybe those holding the answer.

Nor is there a single answer to a streaming query. Where a database query is asked in the past tense (“How many VoIP packets moved between server A and server B between 9:00 and 9:06?”), a query on live data asks about current conditions: “How many VoIP packets are moving between servers A and B?” The answer differs from one moment to the next.

New methods were needed to handle the unique characteristics of streaming data, leading to data stream management systems (DSMSs). These systems query in a single pass, and they query continuously, as frequently as needed, operating on a stream and producing an output that is itself a stream. Conceptually it’s the query that stays in place, and the data that moves through (in a database environment, the query moves over data sitting in place).

In querying a stream, a DSMS queries data within the boundaries of a window that moves relative to the data. As data continuously flows through, new results are returned at regular intervals so managers have the most up-to-date information.
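To make the windowed model concrete, here is a minimal sketch in Python (illustrative only, not GS Tool code) of a tumbling-window query: it makes a single pass over a packet stream, counts the packets matching a predicate in fixed-width windows, and emits a fresh result each time a window closes. The Packet record and field names are invented for the example.

```python
from collections import namedtuple

# Hypothetical packet record; real streams carry far richer headers.
Packet = namedtuple("Packet", ["ts", "src", "dst", "proto"])

def tumbling_window_count(packets, width_s, predicate):
    """Count packets matching `predicate` in fixed, non-overlapping
    windows of `width_s` seconds, yielding one (window_start, count)
    result as each window closes."""
    window_start, count = None, 0
    for pkt in packets:          # single pass: the query stays put, data moves through
        if window_start is None:
            window_start = pkt.ts
        while pkt.ts >= window_start + width_s:   # current window has closed
            yield window_start, count
            window_start += width_s
            count = 0
        if predicate(pkt):
            count += 1
    if window_start is not None:
        yield window_start, count                 # flush the last window

# Example question: "How many VoIP packets are moving between servers A and B?"
stream = [Packet(t, "A", "B", "RTP" if t % 3 else "HTTP") for t in range(10)]
for start, n in tumbling_window_count(stream, 2, lambda p: p.proto == "RTP"):
    print(f"window starting at {start}s: {n} VoIP packets")
```

The same shape extends to sums, averages, and other aggregates; the essential point is that results are produced continuously as data flows past a standing query.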

A streaming query engine for massive data

GS Tool is AT&T’s own DSMS, optimized for massive amounts of data. Massive is defined today at AT&T scale as 28 petabytes of data per day, and fast-moving as 80 gigabits/second, the data speed of the core backbone. At this scale and speed, processing requirements run up against hardware and computing limitations; even relatively inexpensive computing tasks, like the simple act of moving data, become prohibitively expensive.

GS Tool was designed not only to handle data at this scale but also to provide flexibility. The network is not a static place. New devices and protocols come into play, and different protocols and services dominate at different places in the network. To enable network managers, who are not programmers, to easily write and update queries, GS Tool implements a high-level, declarative query language, GSQL (an SQL variant), so managers can get precisely the information they need: How many HTTP or BGP packets are dropping? What is the delay between two servers? What is the average round trip of a handshake between server A and server B?

By making it easy to query, GS Tool lets managers ask a whole range of questions related to a large set of metrics and measurements. The result is a set of queries continuously applied. On the core network, up to 30 queries are run against each packet.

[pic] GS Tool can sit anywhere on the network, from the relatively low-volume, low-speed edge networks to the core itself, where rates can reach 80 gigabits/second.

Managing the complexity of these queries and handling the scale of data requires carefully examining each step in the process and efficiently using resources.

Reducing the amount of data to be queried is an obvious first step. GS Tool does this through an optimized software prefiltering step, multiple query optimization, which passes on only those packets needed by queries within the query set. This filtering, done early to save processing further up the chain, requires GS Tool to analyze the requirements of each query in the set and identify commonalities. Humans can compare only a small number of queries, not the 20 or 30 queries in a typical set. And it’s more complicated than that, since each human query may have several parts, or predicates. A query for the average round-trip time of a TCP handshake, for example, is actually composed of several subqueries: one each to find the times of the SYN, SYN-ACK, and ACK, and another to compute the total time. Averaging many handshakes requires additional steps.

GS Tool handles the complexity of subdividing the human queries into their component subqueries, translating them into native code, and then comparing and identifying those subqueries with similar processing so tasks can be performed together.
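As a rough illustration of that decomposition (plain Python, not GS Tool’s generated native code), the sketch below matches SYN, SYN-ACK, and ACK events per connection, computes each handshake’s round-trip time, and averages the results; the event tuples and field names are invented.

```python
from collections import defaultdict

# Invented event format: (timestamp_ms, connection_id, tcp_flag)
events = [
    (100, "c1", "SYN"), (120, "c1", "SYNACK"), (121, "c1", "ACK"),
    (200, "c2", "SYN"), (260, "c2", "SYNACK"), (262, "c2", "ACK"),
]

# Subqueries 1-3: record the time of each handshake stage per connection.
stage_times = defaultdict(dict)
for ts, conn, flag in events:
    stage_times[conn].setdefault(flag, ts)     # keep the first occurrence of each stage

# Subquery 4: total handshake time for each completed connection.
rtts = [t["ACK"] - t["SYN"]
        for t in stage_times.values()
        if {"SYN", "SYNACK", "ACK"} <= t.keys()]

# Final aggregation: average over all handshakes seen.
if rtts:
    print(f"average handshake RTT: {sum(rtts) / len(rtts):.1f} ms")
```

When many human queries share pieces like these, the shared pieces need to be computed only once.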

To conserve processing, queries are decomposed into low- and high-level subqueries, according to the amount of processing required. Low-level queries, which include aggregations and other less computationally intensive tasks, are often performed in the buffer on the network interface card, as close to the source as possible. This conserves the CPU cycles otherwise required to move potentially massive amounts of data.

Low-level queries, operating on the filtered stream of packets, produce a much reduced data stream that is forwarded to the more expensive, high-level tasks, which include joins and regular expression matching. High-level queries are performed in memory.
 

Complexity cannot be avoided, but it can be shifted from humans to systems capable of handling it.

[pic] Low-level queries receive a high-volume packet stream and output a low-volume data stream, which continues upstream to high-level queries.

When handling speeds too fast for the computational resources of a single machine, GS Tool splits the stream using specialized hardware. The resulting substreams are distributed to separate core processors, which may apply the same or different queries. If the data volume or complexity of a query is still too great, data will be discarded, but in a controlled way, with GS Tool first shedding data that is less relevant and less processed (and so contains less information), or ignoring data whose information has already been incorporated.
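The splitting step can be thought of as query-aware partitioning: hash on the fields the query set groups by, assumed here to be the flow 5-tuple, so all packets of a flow land on the same processor and can be aggregated there without cross-machine coordination. In GS Tool the split happens in specialized hardware; the sketch below shows only the partitioning logic, in Python, with an invented packet format.

```python
import zlib

def partition_for(pkt, n_partitions):
    """Route a packet to a partition based on the fields the queries
    group by (assumed: the flow 5-tuple), so packets from the same
    flow are always handled by the same core."""
    flow_key = (pkt["src_ip"], pkt["dst_ip"],
                pkt["src_port"], pkt["dst_port"], pkt["proto"])
    digest = zlib.crc32(repr(flow_key).encode())
    return digest % n_partitions

pkt = {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2",
       "src_port": 443, "dst_port": 51812, "proto": "TCP"}
print("send to core", partition_for(pkt, 8))
```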

GS Tool is an exercise in stripped-down efficiency. Non-relevant data is filtered out at the first possible moment and filtering for all queries is done once. Data is not moved unless necessary. Data is aggregated as much as possible before reaching the higher-level, more computationally intensive queries. There is no other way to perform 30 simultaneous queries against 20 million records per second, as GS Tool is regularly asked to do.

These numbers will only increase. GS Tool is pressed from both sides: data is always increasing, and queries continually become more complex as managers push the system for more information, adding new predicates that require GS Tool to create and compare more subqueries.

[pic] A single query set of medium size, subdivided by GS Tool into individual subqueries (aggregations, joins), where the output of one subquery becomes the input for another.

Complexity cannot be avoided, but it can be shifted from humans to systems capable of handling it.

GS Tool, now 10 years old, lets network managers pull the network, service, or application-specific data they need from live data, and do it in real time.

GS Tool, however, does not store data. There is simply too much data moving through the network to capture it all.

Much of the data passing over the network, such as the content of packets and the mapping between public and private IP addresses (where network address translation is employed in AT&T’s network), is either not stored at all or stored only for brief periods of time.

Storing massive amounts of data

Other data, however, is stored for the long term, including router, SNMP, and configuration data. This information allows managers to reconstruct the state of the network at a specific moment, important for the “post-mortems” performed after a major network event. Knowing the configuration of a router, or which cell towers handled a call, or even the direction of the tower’s antenna are all major clues to piecing together exactly how an event unfolded.

Years of network configuration data are stored in data warehouses generated using Daytona, AT&T’s data management system. Daytona was begun over 30 years ago, when AT&T was still a phone company and well before the advent of today’s huge datasets. That Daytona is still in operation today (it is the sole repository of AT&T detail records and provides the database for DarkStar) is due to its ability to scale and to engineering decisions that emphasized simplicity and flexibility while avoiding redundancy.

One example of simplicity in design is Daytona’s implementation of horizontal partitioning, in which data is stored in individual, temporal partitions that are separately indexed.

When new data arrives every hour or so, Daytona simply builds a new horizontal partition on top of existing ones and then indexes the new partition. Data arriving in the next hour or so gets its own partition and its own index. Once created, the partition and index are complete and require no further updating or processing. (Daytona ties all indexes together through additional files.)
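A toy sketch of the pattern, not Daytona internals: each arriving hour of records becomes its own partition with its own small index, and a time-constrained query opens only the partitions whose hour falls in the requested range. Table and field names are invented.

```python
import bisect

class HourlyPartitionedTable:
    """Toy model of horizontal partitioning: one partition (and one
    index) per hour of arriving data; old partitions are never revisited."""

    def __init__(self):
        self.partitions = {}   # hour -> list of records
        self.indexes = {}      # hour -> sorted list of keys for that hour only

    def load_hour(self, hour, records):
        # Build the new partition and its index once; nothing existing changes.
        self.partitions[hour] = records
        self.indexes[hour] = sorted(r["key"] for r in records)

    def query(self, start_hour, end_hour, key):
        # Partition pruning: only hours inside the requested range are opened.
        hits = []
        for hour in range(start_hour, end_hour + 1):
            index = self.indexes.get(hour)
            if not index:
                continue
            pos = bisect.bisect_left(index, key)
            if pos < len(index) and index[pos] == key:   # key present this hour
                hits += [r for r in self.partitions[hour] if r["key"] == key]
        return hits

table = HourlyPartitionedTable()
table.load_hour(9, [{"key": "serverA", "bytes": 100}])
table.load_hour(10, [{"key": "serverA", "bytes": 250}, {"key": "serverB", "bytes": 40}])
print(table.query(10, 10, "serverA"))   # only the 10:00 partition is read
```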
 

[pic] Horizontally partitioning data and indices makes it easy to update tables as new data arrives, while also limiting the search space to be queried.

Daytona’s horizontally partitioned indices contrast with the more usual use of a single index that operates over the entire database. While a single index works well for semi-static databases, a single index would require constant revision and updating for continuously arriving network data.

Horizontally partitioning data has the added benefit of limiting queries to specific partitions. Network data tends to be queried according to time frames. Recent data, for instance, is more likely to be queried than older data, so older partitions don’t need to be touched. Older data, too, is usually compared with data from a similar time frame (data flows in January 2008 vs data flows in January 2009), restricting the search to very few partitions.

Additional indexes can be created for each partition, allowing for more targeted queries, such as separate indexes for all data flows sent from a specific server.

Horizontal partitioning also enables tables to grow easily, in theory, without limit. New data arrives, and a new partition gets added. If a table gets too large for the disk, partitions are split and located on different disks, with a pointer inserted to direct the operating system to the other disk. 

Simplicity of design is carried over into the architecture as well. For one thing, Daytona has no specialized database server. Querying, locking, file access, scheduling, and other database tasks are accomplished using the operating system. This reduces the redundancy of having two large systems (a database server and the operating system) together on the same platform using the same resources.

Daytona is also fast, an absolute necessity for massive data sets. Speed comes both from compiling queries directly into machine code without the extra step of translating them into a representational language, and also from compression (both field and record), which results in having to access fewer bytes from disk.

Live and historical data in one repository

Daytona allows the querying of historical data, and GS Tool allows querying of live data, but having two systems for two data sets prevents the cross-analysis that often helps network managers better understand patterns observed in live data.

Patterns tell a lot, and managers are naturally curious about any newly discovered pattern. Perhaps it is a clue to a stubborn glitch, or an early indication of a developing attack. New patterns may signal new usage, the appearance of a new device with new capabilities. Or it could be that the pattern isn’t new, but simply being noticed for the first time.

Knowing for sure requires comparing live data with the historical record, placing both old and new data in the same repository to be queried using the same tools and query language.

A single repository has other benefits. It allows new data to age seamlessly into the historical record, and it simplifies and speeds all aspects of data management. There’s one place to go for data, and one place to perform all processing (normalization, data scrubbing) while having it done in a controlled way by data experts dedicated to the task.

One of streaming data’s distinguishing characteristics is its temporal nature.

While Daytona could scale to accommodate streaming data, the difficulty was in continuously directing a constant rush of massive data into database tables without interfering in any way with ongoing queries and transactions. Feeding data into database tables normally requires first checking for completeness and accuracy (so transactions don’t fail). This obviously can’t work when massive data of uneven quality and consistency is continually streaming in.

Creating tables for hundreds of data sources is itself a huge challenge. Each table is individually customized for each data source, with the right number of columns, each formatted for a different data element. It takes time and effort, and it requires knowing quite a bit about the data source.

And it doesn’t scale. With hundreds of data sources, it’s simply not possible for humans to know about each data source and set up the requisite tables.

Streaming data also requires real-time querying since the information it contains is most valuable to network managers in its first few minutes, when it is still relevant to current network conditions.

Data-warehousing streaming data

Storing both live and historical data together required combining elements of a DSMS—with its ability to handle streaming temporal data of uneven quality and do it in real time—with elements from the traditional database environment with its emphasis on reliability (necessitating high-quality data) and long-term storage reaching back years.

To provide a unified view of live and historical data (in effect, a stream warehouse), AT&T researchers created DataDepot, an overlay layer on top of Daytona. DataDepot performs the transformations needed to take streaming data, whose characteristics make it incompatible with databases, and feed it into Daytona tables.

Data that’s even a few minutes old is not that helpful from a network monitoring and performance perspective because it’s not reporting current conditions.

To prevent problem data from being ingested, DataDepot feeds data first into raw tables resilient to missing or corrupt data. It’s here that DataDepot normalizes the data, aligns it in time, and performs quality checks by cross-referencing multiple sources. Data remains in the raw tables until stable and verified, at which point it can be loaded into derived tables and indexed with little chance of interfering with transactions. (Multi-version concurrency control ensures updates and queries proceed concurrently.)
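A minimal sketch of that raw-to-derived flow, under assumed rules (the five-minute stability window and the record fields are invented): everything is accepted into a permissive raw store immediately, and only complete records whose time window has settled are promoted into the clean derived store.

```python
import time

RAW, DERIVED = [], []      # stand-ins for DataDepot's raw and derived tables
STABLE_AFTER_S = 300       # assumption: records older than 5 minutes are "stable"

def ingest(record):
    """Accept anything into the raw table; no record is rejected."""
    RAW.append(record)

def promote(now):
    """Move verified, stable records into the derived table."""
    for rec in list(RAW):
        complete = rec.get("bytes") is not None and rec.get("ts") is not None
        stable = complete and (now - rec["ts"]) > STABLE_AFTER_S
        if stable:
            DERIVED.append(rec)
            RAW.remove(rec)    # incomplete or too-fresh records simply stay behind

now = time.time()
ingest({"ts": now - 600, "bytes": 1200})   # old enough and complete -> promoted
ingest({"ts": now - 10, "bytes": 800})     # too fresh -> stays raw, still queryable
ingest({"ts": None, "bytes": 50})          # corrupt -> never promoted
promote(now)
print(len(RAW), "raw,", len(DERIVED), "derived")
```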

Data in raw tables is available for querying so network managers have the real-time information they need.

To handle the speed and rush of streaming data, DataDepot uses a series of scripts to automate all aspects of creating and populating Daytona tables. There are scripts to feed data into tables: taking data supplied by data sources in whatever format it arrives, they define exactly which table the data flows into, which data element goes into each column, and how that data must be transformed from its arriving format into Daytona-readable form. Scripts also propagate updates through the many levels of dependent materialized views.

Collectively, scripts direct 200 million records to 1,000 tables every day, without human intervention or human error. Table definition scripts, which automatically create tables for each new data source, made it possible to go quickly from 10 tables to 600, something simply not possible through manual means.

Even the scripts are auto-created from data DataDepot extracts from user-supplied configuration files. DataDepot scans the file and matches the data elements to those contained in a Perl library, where DataDepot retrieves the appropriate formatting or conversion rules. This information is arranged into what is essentially a template, with blank spaces for unidentified data. A data expert just needs to view the blocked-out script, fill in the missing blanks, and specify the table name.
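The template-filling step might look roughly like the sketch below: field names from a user-supplied config are matched against a small library of known formats, and anything unrecognized becomes a blank for the data expert to fill in. The field library, config format, and generated table definition are invented for illustration; DataDepot’s real machinery is Perl-based and far richer.

```python
# Invented stand-in for the library of known field formats.
FIELD_LIBRARY = {
    "timestamp": "DATETIME",
    "router_id": "VARCHAR(64)",
    "bytes_in":  "BIGINT",
}

def draft_table_script(table_name, config_fields):
    """Produce a draft table definition, leaving a blank for every
    field the library does not recognize."""
    cols = []
    for field in config_fields:
        col_type = FIELD_LIBRARY.get(field, "__FILL_IN_TYPE__")  # blank for the data expert
        cols.append(f"    {field} {col_type}")
    return f"CREATE TABLE {table_name} (\n" + ",\n".join(cols) + "\n);"

# A user-supplied config lists the fields a new data source delivers.
print(draft_table_script("snmp_poller_raw",
                         ["timestamp", "router_id", "bytes_in", "fan_speed"]))
```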

To handle the tremendous amount of processing (formatting, indexing, querying), DataDepot makes intelligent use of resources based on what it “knows” about the data: its age, what tables it’s feeding into, and what other data it’s likely to be queried with. By comparing time-stamp information, DataDepot knows, for instance, what data it has already loaded and won’t waste resources loading duplicated data. And because newly arriving data is often not yet in order or stable (data yet to arrive may override it), DataDepot will hold off processing until the data is more stable and has been verified against other arriving data.

For faster data access and querying, DataDepot writes data in large chunks when it is likely to be queried together, making it more likely the data will be stored in sequential blocks (see sidebar).

DataDepot takes streaming data and applies order to it, aligning data in time, normalizing it, and ensuring accuracy before feeding it into database tables.

Introducing the notion of time

One of streaming data’s distinguishing characteristics is its temporal nature. Each packet has a time stamp—useful information not only for re-ordering out-of-order data, but for enabling the time- or sequence-based operations important for network operations. Such operations include counting or averaging the number of events in a specific time interval (for example, detecting when a loss rate crosses a threshold three times in a specified time) or for detecting when the normal sequence of events is broken, as when Event B precedes Event A.
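For instance, the “loss rate crosses a threshold three times within a specified time” check could be expressed, in rough Python terms, as a single pass over time-stamped loss-rate samples that remembers only the recent upward crossings; the sample data and parameter values are invented.

```python
from collections import deque

def alert_on_crossings(samples, threshold, times, window_s):
    """Yield an alert whenever the loss rate rises above `threshold`
    at least `times` times within any `window_s`-second span.
    `samples` is an iterable of (timestamp_s, loss_rate) pairs."""
    crossings = deque()
    prev_rate = 0.0
    for ts, rate in samples:
        if prev_rate <= threshold < rate:            # an upward crossing
            crossings.append(ts)
        prev_rate = rate
        while crossings and crossings[0] < ts - window_s:
            crossings.popleft()                      # forget old crossings
        if len(crossings) >= times:
            yield ts, len(crossings)

samples = [(0, 0.1), (5, 0.6), (10, 0.2), (15, 0.7), (20, 0.3), (25, 0.8)]
for ts, n in alert_on_crossings(samples, threshold=0.5, times=3, window_s=30):
    print(f"t={ts}s: loss rate crossed the threshold {n} times in the last 30s")
```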

DataDepot introduces the notion of time—lacking in traditional databases—by using Daytona’s horizontal partitioning to create tables of various intervals, according to their time stamp. Real-time information goes into tables that get checked every minute; older, less-critical data goes into tables checked every five minutes, then every 15 minutes, until eventually data is rolled into tables that extend for two or more years, forming a contiguous temporal window.
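Viewed as configuration, the tiering might be summarized roughly as below; the one-, five-, and fifteen-minute refresh intervals come from the description above, while the age boundaries between tiers and the archive refresh rate are assumptions made for the example.

```python
# Assumed tier boundaries; only the 1-, 5-, and 15-minute check intervals
# come from the article, the rest is illustrative.
TIERS = [
    # (tier name, max age of data it holds in seconds, refresh interval in seconds)
    ("realtime", 60 * 60,                60),            # up to 1 hour old, checked every minute
    ("recent",   24 * 60 * 60,           5 * 60),        # up to 1 day old, every 5 minutes
    ("daily",    30 * 24 * 60 * 60,      15 * 60),       # up to 1 month old, every 15 minutes
    ("archive",  2 * 365 * 24 * 60 * 60, 24 * 60 * 60),  # multi-year window, daily
]

def tier_for(age_s):
    """Return the tier (partition family) that data of this age belongs to."""
    for name, max_age, refresh_s in TIERS:
        if age_s <= max_age:
            return name, refresh_s
    return "expired", None       # aged out of the warehouse entirely

print(tier_for(30))              # ('realtime', 60)
print(tier_for(3 * 60 * 60))     # ('recent', 300)
```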

The separation of real-time and historical tables is completely transparent to managers querying the data. For queries that span real-time and historical data, DataDepot queries both tables.

Partitioning is also used by DataDepot to ensure that updates and the recomputations they trigger are limited to specific areas of the warehouse. In DataDepot, data is always coming in and it’s always being queried, and it’s always triggering recomputations that affect other partitions. Intelligent scheduling algorithms take advantage of table metadata to transparently re-organize data while lessening the impact across the warehouse.

Data transforms over time

Time is also important for identifying the relative value of data. Data that’s even a few minutes old is not that helpful from a network monitoring and performance perspective because it’s not reporting current conditions. Week-old data is not going to help fix a current problem or deflect an attack happening now.

DataDepot and Bistro are essentially wrappers that extend the reach of traditional database-era management methods into the age of streaming data.

Current data needs to be detailed so managers know where the data is coming from, where it is going, the types of packets involved, which routers handled it, or which links it passed over.

Older data is less relevant for current conditions but retains information useful for long-term planning and trend analysis. But historical data doesn’t need to be detailed. If network planners are looking at link utilization, it’s the overall volume that matters, not details about where the data was coming from or where it was going or even the packet type. The actual traffic volume may be less important than an average volume. Over many months, even the simple volume datapoint on a given link doesn’t matter since the link itself may no longer exist.

With the scale of new data flooding into the system, it doesn’t make sense to keep data of little value. Only information necessary for long-term planning for the internal AT&T network is archived for long periods of time.

Once the trends have been graphed or the absolute totals or averages recorded, the detailed data backing them up is no longer needed. Knowing what data retains its relevance, what data can be discarded, and the right amount of aggregation are all part of data management.

As data ages and is rolled into tables of ever-longer intervals, the data can be increasingly aggregated to conserve space. Eventually, the data ages out of the system gracefully while the information extracted from it remains, incorporated in the appropriate derived tables. 

Data is transformed over time, going through phases of increasing aggregation and anonymization.

Reaching back to the data sources

Streaming data becomes less actionable with every passing minute, so it’s critical to get data as soon as possible. Though DataDepot is able to update real-time tables within 1-5 minutes of receiving new data, this loading speed means little if the data is already out of date by the time it arrives.

There are often delays in delivering data to DataDepot and other data subscribers. Data generated by SNMP pollers, routers, and other network devices doesn’t normally go directly to the data subscribers who need it. Devices typically forward their data, often via intermediate hops (each one adding propagation delay), to a central location where the data is usually combined with other data into files. At some point these files get delivered to data subscribers. It may take upwards of an hour for data to be delivered to where it’s needed, an eternity when the aim is real-time monitoring.

Not every subscriber needs data in real time and every subscriber needs something different: different types of data in different forms (raw or formatted) and at different update frequencies. Some subscribers arrange to have data delivered; others poll for it. In neither case can subscribers communicate with sources if data doesn’t arrive. If a data source is down, subscribers may poll repeatedly, adding to network traffic and causing a processing backlog for the source when it comes back online, adding further delay.

For a long time, the onus was on data subscribers to find and get data, arranging separately with each data source on terms such as file-naming conventions, file format specifications, frequency of updates, and the handling of tardy or out-of-order data. The policies of the sources are different, the organizations that control the data sources are different, and each has a unique way of distributing data. The result was a series of ad hoc agreements independently created many times over between each subscriber and its sources.

Subscribers were also on their own to figure out the data themselves, not an easy thing when the wanted data comes bundled with other data and when it’s hard to tell what the data is. Data sources don’t necessarily label data or arrange it in any sort of predictable order. Data changes. Sources at any time can adopt new formats, filenames, timestamps, or add or delete information. Subscribers should constantly monitor for changes, or risk an interruption in data, but often don’t. They may not be data-savvy enough to detect (or adjust to) format changes, or have the knowledge or time to hunt for new, better data sources that sometimes come online. The data sources themselves may not fully understand the data and are in any case not always aware of which subscribers downstream get their data.

The whole data feed process, replete with delays and duplication of effort, was an obvious candidate for streamlining and simplification, and the impetus behind Bistro, a data distribution infrastructure to oversee, speed, and simplify all aspects of the data feed process.

[pic] Bistro is a single point of contact between data sources and data subscribers, speeding delivery of data, extracting data according to the needs of individual subscribers, and monitoring for new data sources or changes in data.

Bistro includes communication protocols that enable data sources to directly notify Bistro servers when new data is available. Bistro then immediately retrieves the data for distribution to data subscribers, based on their individual requirements. A flexible specification language allows Bistro to match data files to feeds, perform file normalization and compression, efficiently deliver files, and notify subscribers using a trigger mechanism. Enhanced communication enabled DarkStar, a Bistro subscriber with hundreds of data feeds, to go from hourly updates to updates every few minutes.
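A drastically simplified sketch of that flow: a source notifies the server that a file is ready, the server matches the file against each subscriber’s feed specification (reduced here to a filename pattern), and the matching subscribers are notified. The subscription format and names are invented; Bistro’s real specification language also covers normalization, compression, and delivery.

```python
import fnmatch

# Invented feed specifications: which files each subscriber wants.
SUBSCRIPTIONS = [
    {"subscriber": "DarkStar",  "pattern": "MEMORY_POLLER*.csv.gz", "notify": print},
    {"subscriber": "DataDepot", "pattern": "*.csv.gz",              "notify": print},
]

def on_new_file(filename):
    """Called when a data source announces a new file; fan the file out
    to every subscriber whose feed specification matches it."""
    for sub in SUBSCRIPTIONS:
        if fnmatch.fnmatch(filename, sub["pattern"]):
            sub["notify"](f"{sub['subscriber']}: new file {filename}")

on_new_file("MEMORY_POLLER1_2010092504_51.csv.gz")
```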

More than speeding delivery, Bistro helps subscribers better understand the data so they can make better decisions on when and how to process it. Subscribers often don’t know, for example, if they have all the data needed to begin processing. Do they delay processing in case new data is yet to arrive, or do they begin processing with the risk that they may need to reprocess if more data arrives? It’s a tradeoff between delaying processing and performing unnecessary work.

Bistro provides additional information that can help subscribers make better choices. An analyzer component constantly monitors the data, using pattern recognition to learn what the data looks like, what constitutes a complete batch, how often the files arrive, and other information. Thus Bistro can more accurately infer the end of a batch, and insert markers so subscribers know when to begin processing. (Bistro may soon incorporate machine learning to better determine the end of a batch from file-arrival patterns.)

The analyzer also compares data files against file specifications to detect changes to naming conventions, aggregation mixes, and other specifications. When new files appear, the analyzer extracts what information it can (many times the filename alone—MEMORY_POLLER1_2010092504_51.csv.gz—is enough to identify the device and timestamp), and forwards it to subscribers.
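A hedged sketch of that kind of filename extraction, assuming the <DEVICE>_<YYYYMMDDHH>_<sequence>.csv.gz pattern suggested by the example above:

```python
import re
from datetime import datetime

def parse_poller_filename(name):
    """Extract device and timestamp from a filename of the (assumed) form
    <DEVICE>_<YYYYMMDDHH>_<sequence>.csv.gz."""
    m = re.match(r"(?P<device>.+)_(?P<stamp>\d{10})_(?P<seq>\d+)\.csv\.gz$", name)
    if not m:
        return None     # filename doesn't match the expected convention
    return {
        "device": m.group("device"),
        "hour": datetime.strptime(m.group("stamp"), "%Y%m%d%H"),
        "sequence": int(m.group("seq")),
    }

print(parse_poller_filename("MEMORY_POLLER1_2010092504_51.csv.gz"))
```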

Data from Bistro often flows to DataDepot, and the two programs have been tightly integrated so Bistro can immediately notify DataDepot when new files arrive. 

Reexamining the database

DataDepot and Bistro are essentially wrappers that extend the reach of Daytona and traditional database-era management methods into the age of streaming data, without having to fundamentally change the underlying architecture. But the pressure to continually shorten the time between data creation and data analysis is now too great, and researchers are re-examining the database and the hardware on which it relies to gain speed. Faster data access is needed. The spinning disks typically used for traditional databases require scanning records in sequential order. Searching 1 billion records one after the other to find the 100 of interest is incredibly inefficient and a major bottleneck. Such searches are faster on random-access, solid-state disks, which can read data in any order. The prices for these disks are falling, making them increasingly attractive.

Storage prices are also falling. Where a large enterprise spinning disk sells for $2,000 to $3,000 a terabyte, commodity two-terabyte disks go for as little as $150. At these prices, commodity storage is cheap enough to hold working data, at least for finite periods of time. Clustering hundreds of computers with small, cheap disks starts to make economic sense. And it’s faster. Where a single computer’s internal transfer rate might be 6 gigabits/second, connecting a hundred such machines via Fibre Channel produces a far higher aggregate bandwidth of 600 gigabits/second when all are processing simultaneously.

Clustering has other advantages as an alternative data storage architecture: If one machine goes down, there are backups. Clustering is itself a form of replication.

Swapping out one type of underlying data storage for another would be surprisingly non-disruptive. GS Tool, DataDepot, and Bistro are all designed to perform general processes transferable to other applications and domains. Incoming data might be directed to a single machine or to a cluster of machines, and then received into a data warehouse or an application. For DataDepot, it’s a matter of rewriting scripts; for Bistro, it simply requires an additional subscriber entry in an existing configuration file.

One drawback of a DSMS is its lack of reliability, something that databases do provide. Reliability can be added by running separate instances of GS Tool on parallel streams feeding different clusters.

Another technical challenge is keeping clusters in sync. Just as Bistro can now feed the same data to multiple subscribers, so it can feed the exact same data to multiple clusters, ensuring they receive the same raw data. The harder problem, and the focus of current Bistro research, is replicating data within data warehouses, data that’s been processed and updated multiple times. Researchers are now looking to resolve issues of defining the level of consistency and locking data or recovering after a loss of synchronization, while creating a mechanism to transport files between replicas.

The flexibility designed into AT&T tools extends to the incoming data. The current generation of AT&T’s data-management tools has been designed for streaming data, but what exactly constitutes streaming data? A database—constantly updated as new data enters the front and older data leaves out the back—is itself a sort of slow-moving stream. While the traditional mode has been to selectively apply a query in a known location, running all data through a standing query may be a fresh way of looking at data and driving analytics in a new direction. Even syslogs and other files may be run through such a query, making this information available for cross-referencing with other data.

Preparing for the next data onslaught

Making even more information available is another gain for the analysts, but it’s more stress on the data-management infrastructure, more scale and more complexity on top of existing scale and complexity. But by now the story is a familiar one and so are the strategies for handling more data.

Constant innovation is one, and it’s why AT&T does research on data management systems and data visualization. New data management tools today have to be engineered from the beginning for flexibility, as researchers constantly consider new, faster hardware and more efficient methods that can withstand the scale and speed of massive streaming data. GS Tool has already gone through two major rewrites: version 2 to take advantage of distributed processing, and version 3 to incorporate 64-core processing.

Flexibility is important. It is very hard to design specific tools in advance of future scale or demands that simply cannot be known ahead of time. Tools also have to operate on different data types and in different combinations. GS Tool is now being modified to accept, in addition to IP packets, a processed data stream, making the tool more generic and capable of operating in other domains with other types of data. One possibility might be a series of GS Tools, one after the other, to handle massive streams that would otherwise overwhelm a single GS Tool. Data from the first GS Tool would flow to a second GS Tool, which would extract information not already absorbed. Perhaps there’d be a third, or a fourth GS Tool in the chain.

A closely integrated approach—from data generation to the time the data is packed and aggregated into long-term storage—ensures that a process, done once, is not repeated. If time-stamp information has already been captured, it doesn’t need to be captured again at a later stage. Tools have to be able to communicate with one another to relay information about what processing is already done, thus reducing redundancy, speeding processing, and conserving computing resources.

Knowing more about the data and using the data’s meta-information (its timestamp, filename, header information) is crucial to efficient processing. The age of the data is an important indicator of how likely the data is to change or to be queried, leading to better decisions on when to process and when to delay processing.

It is a constant arms race as existing data sources generate more data and new data sources come online.

Bigger streams are coming. The groundwork for another data deluge is being set as sensors start to be deployed in greater numbers. Now it’s the electric grid; soon it will be wearable medical monitors. In the not-too-distant future, even the most mundane of objects will have its own sensors and spew its own stream of data.

Streaming sensor data, like the network data it resembles, will need to be queried, stored, and analyzed in real time. It won’t be easy to do. But AT&T’s solutions, partially the result of AT&T’s unique telephony history and partially the result of recent, directed research into all aspects of data management, provide one possible roadmap to what is required to store, update, and analyze massive amounts of data. The process is not over.
 


 


 

AT&T Research papers

For more detailed information about the methods and technologies discussed in this article, see the following papers:

Consistency in a Stream Warehouse

Bistro Data Feed Management System

Update Propagation in a Streaming Warehouse

Enabling Real Time Data Analysis

Scheduling Updates in a Real-Time Stream Warehouse
 
Stream warehousing with DataDepot

Query-Aware Partitioning for Monitoring Massive Network Data Streams

Gigascope: A Stream Database for Network Applications

 

Declarative query languages

Declarative query languages make it easy to write queries because they allow you to use a natural-language-like syntax to describe what you want from the data. (Imperative query languages, in contrast, require you to define the sequence of commands the computer needs to follow. Imperative query languages are more efficient from the machine’s point of view, but require more up-front planning and some knowledge of how the computer operates.)

“Declarative” because it lets you tell the computer what you want; imperative programs make you tell the computer how to do it.

Both GS Tool and Daytona implement declarative query languages since the purpose is to make these tools generic—more easily implemented by more people and in more situations. In the case of declarative languages, the human queries get translated into machine language.
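As a toy contrast (Python with an in-memory SQLite table, not GSQL or Cymbal), the declarative version states the result wanted and lets the engine decide how to get it, while the imperative version spells out the steps:

```python
import sqlite3

rows = [("HTTP", 1), ("VoIP", 3), ("VoIP", 2), ("DNS", 1)]

# Declarative: say *what* you want and let the engine decide how.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE packets (proto TEXT, bytes INT)")
db.executemany("INSERT INTO packets VALUES (?, ?)", rows)
print(db.execute("SELECT SUM(bytes) FROM packets WHERE proto = 'VoIP'").fetchone()[0])

# Imperative: say *how* to compute it, step by step.
total = 0
for proto, nbytes in rows:
    if proto == "VoIP":
        total += nbytes
print(total)
```

Both print the same answer; the difference is who decides the execution strategy, the engine or the programmer.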

[pic] Daytona’s Cymbal language (named for the symbolic logic on which it is based).

Daytona data management system

 

Daytona is a prime example of how good engineering design can withstand future, unforeseen demands. Though designed over 30 years ago, Daytona’s efficient architecture, which uses the native operating system rather than a specialized database server, has allowed Daytona to keep pace with today’s scale and complexity. Daytona’s largest database today consists of 30 trillion records.


Spinning disk technology

Traditional databases are designed for spinning disks, where a disk head, mounted on a movable arm, is positioned over data being read or written.

Moving the arm to a new place takes time (several milliseconds), meaning that data distributed in multiple locations takes more time to read than data aligned in sequential blocks.

DataDepot writes data in big chunks so that data likely to be queried together is more likely to be stored contiguously.