DataDepot builds very large warehouses from streaming data feeds while at the same time storing massive amounts of historical data for data mining and analysis. DataDepot, which is a set of table definitions and database management scripts, sits between data sources and a relational database. As information arrives from disparate sources, DataDepot performs the complex steps of extracting the data, transforming it into a common format suited to the data warehouse environment (deriving new information along the way through joins and other operations), and finally loading the transformed data into database tables.
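The extract-transform-load flow described above can be sketched as follows. This is a minimal illustration, not DataDepot's actual code; the feed format, field names, and helper functions are all assumptions.

```python
from datetime import datetime, timezone

def extract(raw_feed):
    """Yield one record per line of a comma-separated feed (illustrative format)."""
    for line in raw_feed.splitlines():
        source, ts, value = line.split(",")
        yield {"source": source, "ts": ts, "value": value}

def transform(record):
    """Normalize each record to a common schema: UTC epoch seconds, numeric value."""
    return {
        "source": record["source"],
        "ts": int(datetime.fromisoformat(record["ts"])
                  .replace(tzinfo=timezone.utc).timestamp()),
        "value": float(record["value"]),
    }

def load(table, record):
    """Append the normalized record to an in-memory stand-in for a database table."""
    table.append(record)

warehouse_table = []
feed = "routerA,2009-06-01T00:00:00,12.5\nrouterB,2009-06-01T00:05:00,7.25"
for rec in extract(feed):
    load(warehouse_table, transform(rec))
```

In a real warehouse the load step would write to relational tables rather than a Python list, but the three-stage separation is the same.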
Transformation and loading happen in real time, so information is almost immediately available, whether for troubleshooting or for comparison and correlation with past data (to detect fraud or identify other inconsistencies). A DataDepot warehouse thus presents a unified view of disparate real-time and historical source data.
Any data stream can be ingested, including network-traffic traces, router alerts, financial tickers, and transaction logs. Incoming data is first fed into raw tables, where DataDepot verifies its accuracy and restores proper temporal order when records arrive out of sequence. Once verified and time-aligned, the data is fed into derived tables, each horizontally partitioned by timestamp so that its partitions form a contiguous temporal window. As newer data arrives, new partitions are created; old data is automatically timed out of the system.
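The partitioning-and-expiry behavior can be sketched as below. The partition width and retention count are illustrative assumptions, not DataDepot parameters.

```python
PARTITION_SECONDS = 300       # five-minute partitions (illustrative)
RETENTION_PARTITIONS = 12     # keep the most recent hour (illustrative)

partitions = {}               # partition start time -> list of records

def ingest(record):
    """Place a time-aligned record into its timestamp partition, creating it on demand."""
    start = record["ts"] - record["ts"] % PARTITION_SECONDS
    partitions.setdefault(start, []).append(record)
    expire(newest=max(partitions))

def expire(newest):
    """Drop partitions that have aged out of the contiguous temporal window."""
    oldest_allowed = newest - (RETENTION_PARTITIONS - 1) * PARTITION_SECONDS
    for start in [s for s in partitions if s < oldest_allowed]:
        del partitions[start]

# Feed ninety minutes of five-minute records; only the newest hour survives.
for t in range(0, 5400, 300):
    ingest({"ts": t, "value": t})
```

New partitions appear as timestamps advance, and partitions older than the window are dropped automatically, mirroring the time-out behavior described above.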
The explicit queries used to create database tables document the data's provenance, so everything is trackable and ordered, without the ad-hoc nature of manually created tables.
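A small SQLite sketch illustrates the idea of a derived table defined by an explicit, recorded query; the table and column names here are invented for illustration and are not from DataDepot.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_alerts (router TEXT, ts INTEGER, severity INTEGER)")
conn.executemany("INSERT INTO raw_alerts VALUES (?, ?, ?)",
                 [("r1", 100, 3), ("r1", 160, 5), ("r2", 130, 1)])

# Because the derived table is created by this explicit query rather than by hand,
# its contents are traceable back to the raw source table.
DEFINING_QUERY = """
    SELECT router, COUNT(*) AS n_alerts, MAX(severity) AS worst
    FROM raw_alerts GROUP BY router
"""
conn.execute("CREATE TABLE alert_summary AS " + DEFINING_QUERY)
rows = conn.execute("SELECT * FROM alert_summary ORDER BY router").fetchall()
```

Storing `DEFINING_QUERY` alongside the table is what makes the provenance self-documenting: anyone can see exactly how `alert_summary` was computed.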
To ensure only the most accurate data is used, DataDepot can cross-reference multiple sources and perform other quality checks such as monitoring for feed errors, ensuring that devices are reporting on schedule, tracking resource usage, and raising alerts about looming problems. If table entries populate other tables, DataDepot ensures they are updated in the right order, that new data replaces old, and that updates don't interfere with queries under way.
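Updating dependent tables "in the right order" amounts to refreshing them in topological order of their dependencies. A minimal sketch, with a hypothetical dependency graph (the table names are invented), using Python 3.9+'s standard `graphlib`:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each table maps to the set of tables it is derived from (illustrative graph).
depends_on = {
    "raw_feed": set(),
    "per_router": {"raw_feed"},
    "network_summary": {"per_router"},
}

# static_order() yields each table only after everything it depends on,
# so raw_feed is refreshed first and network_summary last.
update_order = list(TopologicalSorter(depends_on).static_order())
```

A real system would also coordinate these refreshes with running queries (e.g., via snapshotting or versioned partitions), which this sketch omits.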
So that historical and current data co-exist in the warehouse, DataDepot offers variable partitioning to manage a historical (e.g., two-year) table with real-time (e.g., five-minute) updates. When a collection of partitions ages out of the real-time window, it is rolled up into a coarse partition.
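The rollup step can be sketched as merging fine-grained partitions into coarse ones; the five-minute and one-hour window sizes below are illustrative assumptions.

```python
FINE_SECONDS = 300      # five-minute real-time partitions (illustrative)
COARSE_SECONDS = 3600   # one-hour historical partitions (illustrative)

def roll_up(fine_partitions):
    """Merge fine partitions (start time -> rows) into coarse historical partitions."""
    coarse = {}
    for start, rows in sorted(fine_partitions.items()):
        coarse_start = start - start % COARSE_SECONDS
        coarse.setdefault(coarse_start, []).extend(rows)
    return coarse

# Two hours of five-minute partitions collapse into two one-hour partitions.
fine = {t: [f"row@{t}"] for t in range(0, 7200, FINE_SECONDS)}
coarse = roll_up(fine)
```

Coarser partitions keep the historical table's partition count manageable while the newest data retains fine-grained, real-time partitions.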
DataDepot (using Daytona as its storage manager) is currently being used for five very large warehousing projects within AT&T, including Darkstar, which ingests close to 340 million records a day to organize network data and make it readily accessible to researchers, network operators, and engineers. Data collated in the Darkstar warehouse covers IP, mobility, IPTV, and private line networks and devices, and includes service performance measurements, network performance measurements (end-to-end and local to specific devices, including packet losses, latency measurements, errors, and link utilization), workflow logs (commands typed on the routers), configuration files, syslogs, and network and customer tickets.
Other projects incorporating DataDepot include a DPI Warehouse, PMOSS, and a VoIP Correlation Engine.
Current and future research centers on distributing DataDepot across multiple machines to speed up data ingestion for massive warehouses, such as Darkstar.