A Collaboration: Network Coding + Reliability
In 2009, Robert Calderbank then a professor at Princeton received an Air Force grant to help solve a difficult communications problem: how to make wireless networks more reliable and thus avoid the communications failures that can result from packet loss. For the Air Force, such failures, occurring during critical situations, have the potential to be life-threatening.
Packet loss has long been a big problem on wireless networks where interference, physical obstacles, and distance between network devices all contribute to the problem. A second problem is the limited capacity of wireless networks and the resulting inefficiencies. Over the years, solving the problems of reliability and efficiency has been the focus of much effort and research.
Calderbank himself had long been investigating the problem. A past vice president of AT&T Research with a background in computational mathematics, his particular focus was on coding theory, both to correct for noise (which can interfere with packet delivery) and to compress data for more efficiency. He was also aware of others were approaching the same problems.
The problem of reliabilty
The reliability problem exists because wireless networks are inherently lossy and constrained. In case of packet loss, TCP (the transport protocol that ensures a packet makes it to its destination) depends on end-to-end retransmission to recover lost packets. While this mechanism is sufficient for wired networks where there is little packet loss, it is not optimal for wireless networks, which are lossy.
When wireless networks came into use, TCP’s inability to operate efficiently under significant packet loss was a big problem, and end-to-end retransmission (where the packet is re-sent from the source), TCP’s sole mechanism for recovering a lost packet, can result in substantial reduction in throughput.
TCP is ill-suited for lossy wireless networks in another way: TCP interprets all packet loss as a sign of congestion—whether or not the cause is overburdened links or errors in the channel—and will reduce the transmission rate. While this response is appropriate for reducing congestion, it doesn’t solve the problem of wireless error-caused packet loss and in fact may result in links that sit underutilized while TCP attempts to alleviate non-existent congestion.
One way to avoid the inefficiency of underutilized links is to implement Explicit Congestion Notification (ECN) in TCP. ECN, which K.K. Ramakrishnan of AT&T Research has long proposed (K.K. Ramakrishnan, Raj Jain, "A Binary Feedback Scheme for Congestion Avoidance in Computer Networks with a Connectionless Network Layer") may be used to disambiguate between congestion-caused packet loss and wireless channel errors. Any packet loss not correlated with ECN could therefore be interpreted as due to wireless channel errors.
The possibilities of network coding
The problem of inefficiency is also addressed by network coding, a new way of forwarding packets that was developed independently over the last decade. In network coding, intermediate nodes within the network combine several packets into one coded packet so more packets can simultaneously use the same link without competing for resources. (Packets are combined using XOR or other linear operations; see side bar.)
The coded packet can then be decoded by the receiving node with the help of additional packets or side information. In a wired network, this information is relayed to receiving nodes in separate packets routed over less-congested paths in the network.
In wireless networks, this side information comes for free, thanks to the broadcast and overhearing capabilities inherent in a wireless medium. Because nodes can overhear one another’s transmissions, they can exchange decoding information without additional overhead.
Since intermediate nodes must know what packets the receiving nodes have overheard to know what new packet to send, receiving nodes must communicate their buffer state upstream to access nodes, often by appending this information on their own transmissions. This need to constantly communicate with other nodes adds both communication and computational overhead to the network coding approach.
But the advantage gained is that packets from different flows, which used to compete for scarce resources in bottleneck links, can now share and better utilize the same resources.
In an environment as resource-constrained as wireless networks, this sharing of resources is important.
Coding packets from different flows is known as inter-session coding, while coding packet within the same flow is known as intra-session coding. Inter-session coding improves efficiency (different flows can share the same bottle-necked links). Intra-session coding can add redundancy to a flow (by adding linear combinations of packets) and thus improve reliability in the presence of loss.
Forming a MURI
Calderbank saw network coding and the gain in efficiency as a way to build a better network for the Air Force. But networking coding is still relatively a new approach with solutions that are yet to be widely tested in practice; many aspects still need to be worked out. Implementing network coding would require expertise in networking, information theory, algorithms, and network protocols.
Calderbank in late 2009 formed a Multidiscipline University Research Initiative (MURI) to assemble the needed expertise. A MURI is a well-established framework for university collaborations with guidelines for sharing resources and funding.
"Our project is organized around the idea that transformational change in network management will require extraordinary interdisciplinary breadth," Calderbank said at the time, "in particular the infusion of fundamentally new mathematical ideas from algebraic topology and compressive sensing.”
For networking and application of network coding, he approached three researchers all working independently on different problems associated with network coding on wireless networks: Christina Fragouli, Suhas Diggavi, and Athina Markopoulou, faculty at l’École Polytechnique Fédéral de Lausanne, University of California at Los Angeles, and University of California at Irvine, respectively.
Fragouli and Diggavi were mainly working on theoretical problems and algorithms while Markopoulou along with her graduate student Hulya Seferoglu (also at UC Irvine) was focused more on the practical matter of implementing network coding and integrating it with TCP and other protocols. Specifically, Markopoulou and Seferoglu’s prior work studied inter-session network coding and its cross-layer optimization with TCP ("Network Coding-Aware Queue Management for TCP Flows over Coded Wireless Networks").
With a team in place for the network coding component, the issue of reliability remained, and for this Calderbank arranged for MURI members to attend a kickoff meeting in 2009, held at AT&T Research (Florham Park, NJ) where reliability for wireless networks is very much a practical engineering problem. From AT&T researchers working on the problem, MURI members heard first hand about the latest network research and which methods had the best chance to be deployed in the near future. (It was a homecoming of sorts. Several MURI members had at one time worked at AT&T Research. )
The meeting was followed by a workshop at UCLA in January 2010, this meeting focused specifically on network coding . Also attending was K.K. Ramakrishnan of AT&T Research.
Ramakrishnan had been working to improve the reliability of IP protocols over wireless networks. This work was in collaboration with researchers from the Rensselaer Polytechnic Institute (RPI)—located in Troy, NY—set up through AT&T Research’s VURI program (Virtual University Research Initiative), which facilitates collaborations between AT&T researchers and universities.
(For AT&T Research, collaborations with universities play an important role, since students have the time and inclination to fully and deeply investigate fundamental problems. University collaborations enable AT&T to expand research efforts, while students are given practical problems to solve—and often a ready-made thesis topic—along with the chance to work with experts in their field. Working with AT&T offers one more advantage, access to the tremendous amounts of network data maintained by AT&T.)
The collaboration between Ramakrishnan and RPI in place since 2005 had been looking to ensure reliability through redundancy by appending extra packets to each transmission. Each redundant packet contains enough information to replace any one lost data packet. The mechanism employed is forward error correction (FEC) using Reed-Solomon to encode information from a fixed-length block of packets.
In FEC, redundant packets can take the place of any lost packets
If a loss occurs in the block, the receiving node uses one of the redundant packets to reconstruct the lost information in the decoding process. Therefore, the loss of a data packet doesn’t matter if there is a redundant packet to replace it. It’s the total number of packets received that counts; if a receiving node expects eight packets and receives six data packets and two redundant packets, it’s received the requisite eight.
Since the coding and decoding is done at the end nodes (unlike network coding where the coding is done at intermediate nodes within the network), FEC is sometimes referred to as source coding.
Redundant packets add overhead (there’s more to transmit after all) but because FEC adds reliability, there are far fewer retransmissions. Ramakrishnan and his RPI collaborators further increase efficiency by “right-sizing,” or varying, the amount of redundancy depending on the reliability of the link, using more redundancy for unreliable links and less for reliable ones.
The theoretical meets the practical
Calderbank’s hunch was that Ramakrishnan’s practical work on transport reliability would complement Markopoulou’s work on network coding for better efficiency, and that combining the two methods would yield a more reliable and efficient network.
Seferoglu and Markopoulou at UC Irvine had looked at the interaction of network coding with TCP flows, but had not evaluated adding redundancy. But the team realized that the combination of network coding and packet-level FEC gracefully solved two problems at the same time: first, it reduced loss while improving the efficiency so critical to wireless channels, and second, it simplified network coding by making it unnecessary for nodes to constantly track which packets were transmitted, which were overheard, and by which nodes. When network coding is combined with packet-level FEC, all that is needed to determine the amount of redundancy is a simple percentage (of packets lost).
With the plan set, and with Ramakrishnan agreeing to advise the MURI on the FEC scheme and network architecture, work began in earnest in April 2010 and Seferoglu began working full time on the project.
Progress to date
The initial months were spent resolving the inconsistencies that inevitably occur when combining two methods, each separately evolved.
One of the first was how to handle the burstiness of TCP flows when inter-session network coding depends on a similar number of similar-sized packets from different flows. This was resolved by modifying active-queue management schemes in a way to work best in conjunction with TCP congestion control and wireless network coding, building on prior work at UC Irvine (see here).
Most of the work integrating TCP and network coding was done in the summer 2010 when Seferoglu worked at AT&T Research under the supervision of Ramakrishnan.
The issue of loss was also especially complicated in network coding because loss affects not only the direct links but the overhearing links as well (overhearing depends on good, error-free links). Because the performance of network coding declines on lossy networks, decisions have to be made on what percentage and which flows should be coded together (inter-session network coding) and how much redundancy (in this case in the form of intra-session network coding) is required.
Working through these and other problems resulted in a novel unifying scheme, I2NC, which builds on top of one-hop constructive network coding (COPE) and combined inter-session coding with intra-session redundancy. The team also designed a thin layer between TCP and I2NC, which codes/transmits and acknowledges/decodes packets at the sender and receiver nodes in such a way to make I2NC operations transparent to the TCP protocol. The benefits of I2NC include: bandwidth efficiency (thanks to inter-session coding); resilience to loss (thanks to intra-session redundancy); and reduced protocol overhead (setting the nodes free from the need to communicate with one another and exchange information about which packets they have overheard). A paper on I2NC (“Inter- and Intra- Session Network Coding“) has just been accepted at the IEEE Infocom 2011 conference.
The next step is implementation, and with an offer of help from Air Force Office of Scientific Research (AFOSR)—specifically the Operations Integration branch at Rome, NY—the MURI team has just started implementation at the AFOSCR Emulab testbed.
The next steps
The collaboration is now approaching its one-year mark, and the indications are that network coding with the FEC redundancy will provide both efficiency and reliability in wireless networks, and in a simpler way than was previously thought.
A real test will come as the team begins implementing I2NC on Android smartphones to determine whether I2NC is feasible on devices with limited resources, something that would be very hard to do when nodes were required to track overheard packets.
Certainly implementing network coding on smart phones was not foreseen at the beginning of the project, and it was only by collaboratively fusing two different methods that the necessary gains in efficiency were achieved.
Calderbank is pleased with the way the collaboration is going:
"When we held our workshop at UCLA we discovered two different approaches to improving the rate and resilience of Air Force communications and we started to explore whether the benefits were additive. I am delighted to see that they are additive and that collaboration with AT&T Labs is accelerating the transfer of technology to the Air Force.”
XORing two packets
Access nodes and other network devices see a packet as a string of 0s and 1s. In network coding, the bit strings of two packets are combined using the exclusive OR logical operation, or XOR (symbol ).
XORing assigns a “1” if two bits are different, “0” if the same.
The idea for bit-wise XORing packets in this way was proposed in the paper XORs in the Air: Practical Wireless Network Coding. It works like this:
An access node looks at its input queue to find similar-sized packets going to nearby destinations. The packets may be from different sessions.
The access node opens the packets and XORs the two packets’ bit strings to form a new coded packet:
The packet is decoded at the receiving node using information overheard from nearby nodes.
XORing is the simplest, not the only, method to combine packets (linear combinations are also used)