Book Interview: Chuck Kalmanek
The following interview is with Chuck Kalmanek, one of the editors of Guide to Reliable Internet Services and Applications, a new book focused on the latest research in network management.
How did the book come about?
Network reliability is an important issue. Business and even the economy absolutely rely on the Internet. When Richard Yang (Associate Professor of Computer Science, Yale University), with whom I was co-chairing a SIGCOMM workshop on network reliability, told me he was thinking about editing a book on network reliability, my immediate reaction was that this would be a really valuable book.
Universities and researchers want to do research on pragmatic networking problems, but without access to real-world data, they can’t really understand the problems.
We at AT&T Research, however, do have access to AT&T networks, which are among the largest in the world. We’ve worked closely with the network engineering and operations organizations in AT&T from the day that AT&T first decided to build an IP network in the 1990s. We know what the issues are and spend a lot of time thinking about how to do things better. I wanted to make some of that accumulated experience available to the academic community and to network designers in large businesses that face similar problems. There’s a real need in the vendor community as well.
In the last decade, as the telecom industry has consolidated and transitioned to IP, some traditional telecom vendors aren’t building products at the level of reliability that they used to. And there are new vendors building software that needs to be carrier-grade, but they don’t come from that tradition, and frankly, I don’t think the engineers and software developers they hire understand what it means to operate services that work well day in and day out, 365 days a year. AT&T does a lot of work to educate our vendors about our needs. My thought was that if some of this knowledge could be captured in a book, it would raise the understanding of the whole industry.
How would you define network management?
Network management is a tremendously broad area, and there isn’t a single definition. It encompasses all functions related to planning, provisioning, operating, monitoring, controlling, and maintaining a large network. End-to-end service providers also need to manage the applications and services that are carried over the network. The best answer is probably “read the book.”
You’ve edited from the viewpoint of AT&T, whose networks are among the largest networks in the world. Would network managers of smaller networks learn anything from reading the book?
There are businesses whose networks are big enough that they face many of the same network management challenges that a company like AT&T does. But at this point, the scale and scope of AT&T’s networks puts us in a class by ourselves. We provide Internet services to businesses and consumers, we provide IPTV service, we run a cellular network, and we’re global – with points of presence in hundreds of countries. But I believe the same methodologies that we use to run our networks can and should be used on smaller networks.
Are there different schools of network management, or different approaches?
My first thought is that there are probably lots of wrong ways to do it, but only one right way.
What impedes good network management?
Designing and running reliable networks requires talented engineers and software developers, and incredible discipline – on the part of equipment providers, network engineers, and operations personnel. To do it well, you need to have a very good understanding of the end-to-end service objectives, and every aspect of the design and operational processes must be tailored to meet them. When it’s not done well, it’s usually because companies don’t have the talent or try to cut corners in some way.
How is AT&T’s approach to network management different now than it was five years ago? How have the types of problems changed?
Many of the principles of running large scale networks have been around for a long time. They grew out of the PSTN [public-switched telephone network]. I would say that there are four major changes that make the problem more challenging (and more interesting) today.
One is that the Internet Protocol was not designed to make networks easy to manage – it was designed to allow networks run by different organizations to be connected together with a minimum of coordination. Manageability had to be “bolted on” to IP after the fact. We’re still working on it.
Second, many services run on top of IP, and these overlay services place requirements on the network. In fact, the IP networks that service providers run today don’t look much like the IP networks of the 1990s: we’ve had to evolve the network technology in fundamental ways to meet the needs of these overlay services. Today, we run virtual private networks for businesses using a technology called MPLS [multi-protocol label switching], and we support IPTV using a technology called IP Multicast. Every one of these advances in the way that IP networks work has led to increased functionality, but also increased complexity in design and management.
Third, the Internet really is continuing to see exponential growth. This pace of change means that we have to constantly evolve the network, while at the same time providing reliable service, twenty-four hours a day, to millions of users.
The latest challenge is the phenomenal growth of mobile data. This wave of innovation isn’t likely to slow down any time soon: people like the ability to use applications, anytime, anywhere; device manufacturers continue to be incredibly creative, and we haven’t even seen the potential impact of machine-to-machine communication on the mobile Internet. It will keep us busy for a long time.
How do you expect network management to change in the next five years?
One area that is very promising and that we’re working on is something we call exploratory data mining. Service providers have gotten very good at finding and fixing what we call “hard failures.” Exploratory data mining is necessary when you want to raise the bar beyond simple reactive network management. In large networks, you need to find and fix recurring low-level performance problems – like running a mild fever – that fly below the operations radar. By correlating lots of different types of data, you can expose the root cause of these low-level or latent problems.
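To make the "mild fever" idea concrete, here is a minimal, purely illustrative sketch (not AT&T's actual tooling) of one ingredient of this kind of analysis: comparing each device's metric series against its own baseline, so that small persistent deviations stand out even when no value ever crosses a hard-failure alarm threshold. The router names, metric names, and threshold are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical polling samples: (router, metric, value). The values and
# names are made up for illustration; a real system would ingest many
# heterogeneous data sources (SNMP counters, syslogs, loss probes, ...).
samples = (
    [("r1", "packet_loss_pct", v) for v in [0.01, 0.02, 0.01, 0.02, 0.01, 0.9]]
    + [("r2", "packet_loss_pct", v) for v in [0.01, 0.01, 0.02, 0.01, 0.02, 0.02]]
)

def low_level_anomalies(samples, z_threshold=2.0):
    """Flag samples that deviate from their own (router, metric) baseline.

    Each series is scored against its own mean and standard deviation,
    so a 'mild fever' on one router is visible even though its absolute
    values would never trip a network-wide hard-failure alarm.
    """
    series = defaultdict(list)
    for router, metric, value in samples:
        series[(router, metric)].append(value)

    flagged = []
    for key, values in series.items():
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:  # perfectly flat series: nothing to flag
            continue
        for v in values:
            if abs(v - mu) / sigma > z_threshold:
                flagged.append((key, v))
    return flagged

print(low_level_anomalies(samples))
```

In this toy data, only router `r1`'s outlying loss sample is flagged; router `r2`'s steady low-level readings are treated as its normal baseline. Correlating such per-source anomalies across data types is where the root-cause analysis the interview describes would begin.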
Another area of interest is cloud computing. We need to develop better techniques for large-scale systems management that can work in the presence of large numbers of servers and unpredictable workloads. A third area relates to mobile devices: how can we make it easier to manage all of the devices that will be connected to networks and have them work well together? Another area is security. Since end systems are vulnerable to security attacks, can we build security into the network to filter out some kinds of attacks or at least prevent some of the consequences?
If there was one thing you could change about the Internet to make network management easier for network providers, what would it be?
There isn’t one thing, there are many things.
Part of the burden falls on equipment vendors to design products that are robust to failures and that degrade gracefully under load. We also need better methods for modeling performance and predicting behavior. It would be great to be able to predict what happens when you add a million new mobile devices, so that you could plan accordingly. Another interesting area is the emergence of programmable routers. We are building a testbed, known as ShadowNet, to see if we can add new functionality to a running network in a way that doesn’t have any chance of adversely impacting existing functions. If we can really do this, it would allow us to grow and upgrade functionality far more easily than we can today. These are just a few ideas.
AT&T Labs Inc.
Table of Contents
I Introduction and Reliable Network Design
1. The Challenges of Building Reliable Networks and Networked Application Services
2. Structural Overview of ISP Networks
II Reliability Modeling and Network Planning
3. Reliability Metrics for Routers in IP Networks
4. Network Performability Evaluation
5. Robust Network Planning
III Interdomain Routing and Overlay Networks
6. Interdomain Routing and Reliability
7. Overlay Networking and Resiliency
IV Configuration Management
8. Network Configuration Management
9. Network Configuration Validation
V Network Measurement
10. Measurements of Data Plane Reliability and Performance
11. Measurements of Control Plane Reliability and Performance
VI Network and Security Management, and Disaster Preparedness
12. Network Management: Fault Management, Performance Management, and Planned Maintenance
13. Network Security – A Service Provider View
14. Disaster Preparedness and Resiliency
VII Reliable Application Services
15. Building Large-Scale, Reliable Network Services
16. Capacity and Performance Engineering for Networked Application Servers: A Case Study in E-mail Platform Planning