Problems in Low-Delay Internet Communication: Congestion Management

The Internet has a problem: there’s no agreed-upon reasonable standard for low-delay communication. In particular, congestion control for low-delay communication, typically running over User Datagram Protocol (UDP), is lacking. Real-time communication, especially interactive communication, is at best unreliable and currently depends on over-provisioning of any critical links, especially the last-mile hops (access links and/or local wireless links). Typically we are talking about voice-over-IP (VoIP), but all sorts of other communications are or will want to be real-time, including telerobotics, hardware control, and others.

When standard congestion-control mechanisms are used for communications that need low delay, they build up queues in routers and hosts, which may delay a data stream by seconds with little warning; this can be disruptive for real-time media, and can be fatal for low-delay control mechanisms.

There has been some experimental work on lower-delay congestion-control protocols, though none of it has fully solved the problem. The experiments span the Transmission Control Protocol (TCP) space (such as TCP Vegas and Cx-TCP), the UDP space (TFRC, DCCP, LEDBAT), and, to a lesser extent, the Stream Control Transmission Protocol (SCTP) space (related to the Cx-TCP work).

Due to the increasing need for real-time communication, a number of people are trying to develop standards that will allow appropriate sharing of bandwidth and avoid congestion collapse. This is especially relevant since video and adaptive audio are now often part of media streams, and unlike “classic” fixed-rate VoIP protocols, they can adapt to changing network congestion.

The problem is simple: existing congestion control methods, in particular TCP, are loss-based and, in determining link bandwidth, must force the intermediate routers into a loss state (congestion). For a tail-drop router (the most common), this means maximum delay, because packets are dropped only once the queue is completely full. If this is a small number of milliseconds (ms), this may not be a problem, but combined with even a hint of BufferBloat [1], the delays can quickly make real-time interaction impossible. Delays of 100, 200, or 500ms, or even multiple seconds, are possible.
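To make the scale of the problem concrete, the worst-case queuing delay of a tail-drop queue can be estimated as the time needed to drain a full buffer at the link rate. The following is a minimal sketch; the function name, buffer sizes, and the 1 Mbit/s link rate are illustrative assumptions, not measurements of any particular device.

```python
def full_queue_delay_ms(buffer_bytes: int, link_bits_per_s: float) -> float:
    """Time to drain a completely full FIFO buffer at the link rate, in ms."""
    return buffer_bytes * 8 / link_bits_per_s * 1000.0


if __name__ == "__main__":
    # Hypothetical buffer sizes on an assumed 1 Mbit/s access uplink.
    for buffer_kib in (16, 64, 256):
        delay = full_queue_delay_ms(buffer_kib * 1024, 1_000_000)
        print(f"{buffer_kib:>4} KiB buffer at 1 Mbit/s: {delay:.0f} ms")
```

Under these assumed values, a 64 KiB buffer already adds roughly half a second of delay once a loss-based flow has filled it, and a 256 KiB buffer adds about two seconds.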

Even algorithms such as LEDBAT [2], which was designed as a ‘scavenger’ protocol to make use of ‘extra’ bandwidth while getting out of the way of user-initiated transfers, will engender a typical 100ms of delay in the bottleneck node, which is problematically high for many real-time applications. For example, the ITU recommends in G.114 (and elsewhere) that one-way-delay (mouth-to-ear) be kept below 150ms for best subjective audio quality [3], especially if echo isn’t perfectly controlled. In just one bottleneck node, 100ms is a killer when added to the other delays in a VoIP call. This is especially problematic because LEDBAT flows might be assumed not to interfere with user-initiated VoIP calls, so applications might use them indiscriminately and without user knowledge (e.g., for background updates or backups).

An added complication is the need to work correctly in an environment with effective Explicit Congestion Notification (ECN) or Active Queue Management (AQM), which is a focus of efforts to combat BufferBloat.

Current Efforts

There is an effort underway as a derivative of the rtcweb effort (part of the W3C/IETF joint WebRTC project) to develop and standardize congestion control protocols to deal with those problems as well as possible given the need to compete with TCP flows. With tail-drop routers, large TCP flows may always ‘win’ and force the buffers to fill, but there’s little we can do about that unless or until active queue management (AQM) is the norm. About all we can do is mark real-time packets properly so routers have the option of handling them separately from TCP and other loss-based flows. (A number of access routers/modems do this now for many classic VoIP flows.) It would be nice if a solution (or solutions) could be found for generic UDP flows, Real-time Transport Protocol (RTP) media flows over UDP (which carry some inherent timing information), and TCP/SCTP/etc. flows, though the initial focus is RTP flows.
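As an illustration of what marking real-time packets can look like at the application level, the sketch below sets a Differentiated Services code point on a UDP socket. The choice of Expedited Forwarding (DSCP 46) and the helper names are assumptions made for the example; the text above does not prescribe a particular marking, and whether any router honors the mark depends entirely on the network. The `IP_TOS` socket option is available on Linux and most Unix-like systems; other platforms may ignore or reject it.

```python
import socket

EF_DSCP = 46  # Expedited Forwarding code point, chosen here as an example


def dscp_to_tos(dscp: int) -> int:
    """Place a 6-bit DSCP value in the upper bits of the IPv4 TOS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    return dscp << 2  # the low two bits are left clear for ECN


def open_marked_udp_socket(dscp: int = EF_DSCP) -> socket.socket:
    """Create an IPv4 UDP socket whose outgoing packets carry the given DSCP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock
```

A media sender would create its RTP socket with `open_marked_udp_socket()` and then use it like any other UDP socket.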

Other considerations will include interactions with the TCP slow-start algorithm, the impact of probing for bandwidth availability on other flows, and the advantages (or disadvantages) experienced by relatively late-arriving flows.

This effort is underway on the rtp-congestion list [4], and a birds-of-a-feather (BoF) meeting is proposed for IETF 84 in Vancouver, British Columbia, Canada, with the goal of chartering a working group (WG) to address this problem. There will also be a one-day IAB/IRTF Workshop on Congestion Control on July 28 in Vancouver. See the notice at http://www.iab.org/cc-workshop/.

There are several proposed ways to attack this problem, though more research and simulation are needed. One approach, proposed by Google, is a delay-sensing algorithm that infers the state of the bottleneck router from packet arrival delay deltas. Algorithms in this class are known to work, but their fairness (both among themselves and with TCP) and their ability to adapt to AQM have not been explored yet. Other options would be to use or leverage Cx-TCP or DCCP. LEDBAT in its current form is probably not an option, but a congestion-control algorithm based on the same principles may be a viable candidate.

One important consideration is to develop a coherent strategy for managing the different classes of flows on the Internet, with smooth-as-possible fallbacks when the preferred solution isn’t available (such as when the bottleneck is a tail-drop router).

A Delay-Sensing Congestion-Control Algorithm

The following is based on draft-alvestrand-rtcweb-congestion [5], and is only a high-level description of the actual algorithm. This and other similar algorithms have been in use on the Internet in small amounts since at least 2004.

The basic idea is that we monitor the drift between when a packet was sent and when it was received (one-way-delay). We don’t need to know the actual one-way-delay value (which can be challenging to measure), only changes up or down in the delay value.
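The reason the absolute one-way delay is not needed can be seen in a short sketch. If each packet carries a send timestamp from the sender's clock and is stamped on arrival with the receiver's clock, the difference between successive inter-arrival and inter-departure gaps cancels any fixed clock offset. The function name and the sample trace below are illustrative assumptions, and clock skew (a slowly changing offset) is ignored.

```python
from typing import List, Sequence, Tuple


def delay_deltas(packets: Sequence[Tuple[float, float]]) -> List[float]:
    """Change in one-way delay between consecutive packets, in ms.

    Each packet is (send_time_ms, recv_time_ms); the two timestamps come
    from different, unsynchronized clocks.
    """
    deltas = []
    for (s0, r0), (s1, r1) in zip(packets, packets[1:]):
        # (inter-arrival gap) - (inter-departure gap); a constant clock
        # offset appears in both r0 and r1 and therefore cancels out.
        deltas.append((r1 - r0) - (s1 - s0))
    return deltas


if __name__ == "__main__":
    # Hypothetical trace: packets sent every 20 ms, receiver clock offset
    # by an unknown constant, and a queue that starts to build up.
    trace = [(0.0, 5020.0), (20.0, 5040.0), (40.0, 5063.0), (60.0, 5087.0)]
    print(delay_deltas(trace))  # [0.0, 3.0, 4.0]
```

Positive values mean each packet spent longer in the network than its predecessor, and negative values mean it spent less time.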

An increasing delay (after filtering) implies that the bottleneck node’s queues are increasing in depth. If the flows on the link are relatively static in bandwidth, such as a bandwidth-constrained access link largely occupied by the flow itself, then the signal from the filter will be very clear. More complex environments should show a signal, but the filter may take longer to converge.

Conversely, when the filter shows decreasing one-way-delay, then the assumption is that the bottleneck queues are draining.
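The filtering step can be sketched as follows. This example substitutes a simple exponentially weighted moving average and a fixed threshold for whatever filter a real implementation uses; it is not the filter specified in the draft. The function name, state labels, smoothing factor, and threshold are all assumptions chosen for illustration.

```python
from typing import Iterable, List


def classify_trend(
    deltas: Iterable[float],
    alpha: float = 0.2,         # assumed smoothing factor
    threshold_ms: float = 1.0,  # assumed detection threshold
) -> List[str]:
    """Label each sample by the smoothed trend of the delay deltas."""
    smoothed = 0.0
    states = []
    for delta in deltas:
        # Exponentially weighted moving average of the per-packet deltas.
        smoothed = (1 - alpha) * smoothed + alpha * delta
        if smoothed > threshold_ms:
            states.append("queue-growing")   # delay keeps increasing
        elif smoothed < -threshold_ms:
            states.append("queue-draining")  # delay keeps decreasing
        else:
            states.append("stable")          # jitter around zero
    return states
```

With these assumed parameters, isolated jitter of a fraction of a millisecond stays in the "stable" state, while a sustained run of positive deltas moves the output to "queue-growing" after a few packets. A smaller `alpha` makes the detector more robust to noise but slower to react, which mirrors the convergence trade-off described above.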

This information can be used to estimate the amount of available bandwidth on the bottleneck link, and thereby whether the congestion control algorithm should allow the flow rate to increase, decrease, or remain the same. Other inputs, such as packet loss and Explicit Congestion Notification (ECN) markings, are also necessary, but these do not yet appear in this draft. Part of the research needed is to determine the best response to such inputs.
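One way the three decisions (increase, decrease, or hold) could be driven by the delay signal is sketched below. The state names, the increase and decrease factors, and the rate bounds are hypothetical values chosen for the example; the draft defines its own rules, and loss and ECN inputs are deliberately left out here.

```python
def next_rate_bps(
    current_bps: float,
    incoming_bps: float,            # rate actually measured at the receiver
    state: str,                     # "queue-growing", "queue-draining" or "stable"
    increase_factor: float = 1.05,  # assumed probing step
    decrease_factor: float = 0.9,   # assumed back-off relative to measured rate
    min_bps: float = 30_000,
    max_bps: float = 5_000_000,
) -> float:
    """Pick the next target send rate from the delay-trend state."""
    if state == "queue-growing":
        # Back off to below what the path actually delivered so the
        # bottleneck queue can empty.
        rate = decrease_factor * incoming_bps
    elif state == "queue-draining":
        # Hold the rate while the queue drains.
        rate = current_bps
    elif state == "stable":
        # No queue build-up detected: probe gently for more bandwidth.
        rate = increase_factor * current_bps
    else:
        raise ValueError(f"unknown state: {state!r}")
    # Keep the result within the assumed bounds.
    return min(max(rate, min_bps), max_bps)
```

Basing the decrease on the measured incoming rate rather than on the current target is a design choice in this sketch: when the queue is growing, the receive rate is a direct observation of what the bottleneck can carry.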

Unlike LEDBAT and Cx-TCP, there is no explicit nonzero queuing target; this algorithm attempts to use as much bandwidth as possible while keeping the queuing delay at or close to zero. This implies that it can’t be 100 percent efficient, but in a relatively static situation it can come very close. Another research consideration is determining algorithm efficiency in different scenarios, especially nonstatic scenarios and scenarios with larger aggregations of delay-sensing flows on the bottleneck link.

Future Work

There is ample room for useful research and innovation here. In addition to adapting existing algorithms to fulfill the need and comparing them, there is significant work to be done in improving these proposals, such as adding loss and ECN support to the Google proposal, developing appropriate startup-time heuristics, finding methods to minimize the impact on the current and other flows when probing for additional bandwidth, avoiding “swings” in fairness that can cause a major loss of utility (such as for interactive video calls), and many others.

As mentioned, we hope to charter a WG in Vancouver, British Columbia, Canada, and work in it to develop one or more RFCs to address these issues and move existing proprietary and ad-hoc congestion-control methods into a standardized framework.

References

1. http://www.bufferbloat.net/

2. http://tools.ietf.org/wg/ledbat/

3. http://www.itu.int/rec/T-REC-G.114-200305-I

4. https://sites.google.com/a/alvestrand.com/rtp-congestion/

5. http://tools.ietf.org/html/draft-alvestrand-rtcweb-congestion