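TCP small queues is another mechanism for reducing network congestion. Its goal is to reduce the number of TCP packets in the xmit queues (qdisc and device queues) in order to reduce RTT and cwnd bias, and thereby relieve some congestion problems. Without lowering nominal bandwidth, we can shrink the buffering of each bulk sender: less than 1 ms per Gbit (compared with 50 ms with TSO) and less than 8 ms per 100 Mbit (compared with 132 ms).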

By Jonathan Corbet
July 17, 2012

The "bufferbloat" problem is the result of excessive buffering in the network stack; it leads to long latencies and poor reliability in the network as a whole. Fixing it is a matter of buffering less data in each system between any two endpoints, a task that sounds simple but proves to be more challenging than one might expect. It turns out that buffering can show up in many surprising places in the networking stack; tracking all of these places down and fixing them is not always easy.
A number of bloat-fighting changes have gone into the kernel over the last year. The CoDel queue management algorithm works to prevent packets from building up in router queues over time. At a much lower level, byte queue limits put a cap on the amount of data that can be waiting to go out a specific network interface. Byte queue limits work only at the device queue level, though, while the networking stack has other places, such as the queueing discipline level, where buffering can happen. So there would be value in an implementation that could limit buffering at levels above the device queue.
Eric Dumazet's TCP small queues patch looks like it should be able to fill at least part of that gap. It limits the amount of data that can be queued for transmission by any given socket regardless of where the data is queued, so it shouldn't be fooled by buffers lurking in the queueing, traffic control, or netfilter code. That limit is set by a new sysctl knob found at:

    /proc/sys/net/ipv4/tcp_limit_output_bytes

The default value of this limit is 128KB; it could be set lower on systems where latency is the primary concern.
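To get a feel for what a given limit means in latency terms, the following Python sketch reads the knob and estimates how long one socket's worth of queued data takes to drain at a few link rates. The helper names (read_limit, drain_time_ms) are made up for this example, and the calculation is a rough assumption that ignores protocol overhead and competing flows.

```python
from pathlib import Path

# Path from the article; present only on kernels that carry the patch.
SYSCTL_PATH = "/proc/sys/net/ipv4/tcp_limit_output_bytes"


def read_limit(path=SYSCTL_PATH, default=128 * 1024):
    """Return the per-socket output limit in bytes.

    Falls back to `default` (the 128KB mentioned above) if the file
    is missing or unreadable, for example on an older kernel.
    """
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return default


def drain_time_ms(queued_bytes, link_bits_per_sec):
    """Milliseconds needed to transmit queued_bytes at the raw link rate."""
    return queued_bytes * 8 / link_bits_per_sec * 1000


if __name__ == "__main__":
    limit = read_limit()
    for name, rate in (("100Mbit", 100e6), ("1Gbit", 1e9), ("10Gbit", 10e9)):
        print(f"{limit} bytes at {name}: {drain_time_ms(limit, rate):.2f} ms")
```

Lowering the limit means writing a smaller number to the same file, which requires root privileges.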
The networking stack already tracks the amount of data waiting to be transmitted through any given socket; that value lives in the sk_wmem_alloc field of struct sock. So applying a limit is relatively easy; tcp_write_xmit() need only look to see if sk_wmem_alloc is above the limit. If that is the case, the socket is marked as being throttled and no more packets are queued.
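The check described above can be modeled in a few lines of Python. This is a toy model, not kernel code: ToySock and write_xmit are hypothetical stand-ins for struct sock and tcp_write_xmit(), packets are represented only by their sizes, and the exact comparison the kernel uses may differ from the "above the limit" test shown here.

```python
from dataclasses import dataclass

LIMIT_OUTPUT_BYTES = 128 * 1024  # default value given in the article


@dataclass
class ToySock:
    """Stand-in for struct sock; only the fields this sketch needs."""
    wmem_alloc: int = 0      # bytes queued below the socket (cf. sk_wmem_alloc)
    throttled: bool = False  # set when the limit stops transmission


def write_xmit(sk, pending, limit=LIMIT_OUTPUT_BYTES):
    """Queue packets from `pending` until the socket is above the limit.

    `pending` is a list of packet sizes, consumed from the front.
    Returns the number of packets handed to the lower layers.
    """
    sent = 0
    while pending:
        if sk.wmem_alloc > limit:
            # Too much data already in flight below us: stop queuing.
            sk.throttled = True
            break
        sk.wmem_alloc += pending.pop(0)
        sent += 1
    return sent
```

With five 64KB segments and the default limit, this model queues three of them before throttling, since the check happens before each packet is added.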
The harder part is figuring out when some space opens up and it is possible to add more packets to the queue. The time when queue space becomes free is when a queued packet is freed. So Eric's patch overrides the normal struct sk_buff destructor when an output limit is in effect; the new destructor can check to see whether it is time to queue more data for the relevant socket. The only problem is that this destructor can be called from deep within the network stack with important locks already held, so it cannot queue new data directly. So Eric had to add a new tasklet to do the actual job of queuing new packets.
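The split between the destructor and the tasklet can be sketched as a small simulation. Everything here is an assumption made for illustration: the names (ToySock, packet_destructor, run_tasklet) and the unthrottling condition are invented, and a plain deque stands in for the tasklet's work list. The point is only that the destructor records which socket needs attention, while the actual queuing happens later in a separate step.

```python
from collections import deque


class ToySock:
    """Stand-in for a socket with a per-socket output limit."""

    def __init__(self, limit):
        self.limit = limit
        self.wmem_alloc = 0      # bytes currently queued below the socket
        self.throttled = False
        self.pending = deque()   # packet sizes still waiting in the socket

    def write_xmit(self, device_queue):
        """Queue pending packets until the socket is above its limit."""
        while self.pending:
            if self.wmem_alloc > self.limit:
                self.throttled = True
                return
            size = self.pending.popleft()
            self.wmem_alloc += size
            device_queue.append((self, size))


deferred = deque()  # sockets waiting for the "tasklet" to run


def packet_destructor(sk, size):
    """Runs when a queued packet is freed; must not transmit directly."""
    sk.wmem_alloc -= size
    if sk.throttled and sk.wmem_alloc <= sk.limit:
        sk.throttled = False
        if sk not in deferred:
            deferred.append(sk)  # hand the real work to the tasklet


def run_tasklet(device_queue):
    """Deferred step: resume transmission for every recorded socket."""
    while deferred:
        deferred.popleft().write_xmit(device_queue)
```

In this model, freeing a packet never adds anything to the device queue by itself; new packets appear only once run_tasklet() is called.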
It seems that the patch is having the intended result:

Results on my dev machine (tg3 nic) are really impressive, using standard pfifo_fast, and with or without TSO/GSO. Without reduction of nominal bandwidth. I no longer have 3MBytes backlogged in qdisc by a single netperf session, and both side socket autotuning no longer use 4 Mbytes.

He also ran some tests over a 10Gb link and was able to get full wire speed, even with a relatively small output limit.
There are some outstanding questions, still. For example, Tom Herbert asked about how this mechanism interacts with more complex queuing disciplines; that question will take more time and experimentation to answer. Tom also suggested that the limit could be made dynamic and tied to the lower-level byte queue limits. Still, the patch seems like an obvious win, so it has already been pulled into the net-next tree for the 3.6 kernel. The details can be worked out later, and the feature can always be turned off by default if problems emerge during the 3.6 development cycle.

Tech黑手 - Work Notes

Monday, October 29, 2012

TCP small queues
