Reducing CPU overhead in network programming with zero-copy data transfer: Linux
Tags: linux-kernel, kernel-internals, embedded-linux
Series: Linux Kernel Internals
To increase WiFi throughput, it is important to minimize the amount of processing overhead involved in moving data from the WiFi chip to the application.
Zero-copy direct memory access (DMA) is one such approach. Traditionally, data packets received by the NIC are first copied into a buffer in the kernel space and then copied again into a buffer in the user space before being processed by the application. This double-copying process can result in significant CPU overhead and can limit the maximum throughput that can be achieved.
              +------------+          +------------+
Network Data  |   Buffer   |   DMA    |    DMA     |  Memory
<-------------|   (NIC)    |--------->|   (Host)   |--------> Processing
              +------------+          +------------+
However, zero-copy DMA requires hardware support, and many chips do not provide it.
One example is the Realtek RTL8812AU 802.11ac/a/b/g/n chip, which is commonly used in Wi-Fi adapters and dongles for desktops and laptops. The Linux driver for this chip (rtl8812au) copies each received packet twice on its way to the application. When the Wi-Fi chip receives a packet, the driver first copies it into a receive buffer in kernel space. Then, when an application reads from the network with the read() system call, the data is copied again from the kernel buffer into the user-space buffer. The typical mechanism is to copy the packet from the chip's internal buffer into a socket buffer (skb); the network stack then wakes up any processes that are waiting for this data.
Below is a snippet from the open-source driver: rtl8812au/rtw_recv.c at v5.6.4.2 in the aircrack-ng/rtl8812au repository on GitHub.
static void recvframe_expand_pkt(PADAPTER padapter, union recv_frame *prframe)
{
    pfhdr = &prframe->u.hdr;

    if (pfhdr->attrib.qos)
        shift_sz = 6;
    else
        shift_sz = 0;

    alloc_sz = 1664; /* round (1536 + 24 + 32 + shift_sz + 8) to 128 bytes alignment */

    ppkt = rtw_skb_alloc(alloc_sz);
    if (!ppkt)
        return; /* no way to expand */

    skb_reserve(ppkt, 8 - ((SIZE_PTR)ppkt->data & 7));
    skb_reserve(ppkt, shift_sz);

    /* copy data to new pkt */
    ptr = skb_put(ppkt, pfhdr->len);
    if (ptr)
        _rtw_memcpy(ptr, pfhdr->rx_data, pfhdr->len);

    rtw_skb_free(pfhdr->pkt);

    /* attach new pkt to recvframe */
    pfhdr->pkt = ppkt;
    pfhdr->rx_head = ppkt->head;
    pfhdr->rx_data = ppkt->data;
    pfhdr->rx_tail = skb_tail_pointer(ppkt);
    pfhdr->rx_end = skb_end_pointer(ppkt);
}
Here the recvframe_expand_pkt() function adds a copy to the receive path. It first allocates a new packet buffer, ppkt, with rtw_skb_alloc(), which allocates an skb in kernel space. It then calls skb_reserve() to reserve headroom for alignment, and copies the received packet data from the old skb (pfhdr->pkt) into the new one with _rtw_memcpy(). Both buffers live in kernel space, so this is a kernel-to-kernel copy.
Finally, the function frees the old skb with rtw_skb_free() and attaches the new skb to the receive frame header pfhdr, updating rx_head, rx_data, rx_tail and rx_end to point into the new buffer. The packet data is therefore copied once more inside the kernel before the driver can process it or hand it to user space, in addition to the later copy into the application's buffer.
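The difference between the copying path and a pointer hand-off can be sketched in plain user-space C. This is only an illustration of the shape of the two approaches: struct rx_frame, frame_expand_copy() and frame_detach() are hypothetical names and are not part of the rtl8812au driver.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a receive frame; not the driver's real struct. */
struct rx_frame {
    unsigned char *buf; /* heap buffer that holds the packet */
    size_t len;         /* bytes of valid data in buf */
    size_t cap;         /* allocated size of buf */
};

/*
 * Copying path, shaped like recvframe_expand_pkt(): allocate a larger
 * buffer, memcpy the payload across, free the old buffer.
 * Returns 0 on success, -1 if the frame could not be expanded.
 */
int frame_expand_copy(struct rx_frame *f, size_t new_cap)
{
    unsigned char *nbuf;

    if (new_cap < f->len)
        return -1;
    nbuf = malloc(new_cap);
    if (!nbuf)
        return -1;                  /* no way to expand; old buffer is kept */
    memcpy(nbuf, f->buf, f->len);   /* the extra copy: costs f->len bytes */
    free(f->buf);
    f->buf = nbuf;
    f->cap = new_cap;
    return 0;
}

/*
 * Hand-off path: the buffer was allocated at its final size up front, so
 * passing the packet on is just a pointer move. The caller owns the result.
 */
unsigned char *frame_detach(struct rx_frame *f, size_t *len)
{
    unsigned char *p = f->buf;

    *len = f->len;
    f->buf = NULL;
    f->len = 0;
    f->cap = 0;
    return p;
}
```

The cost of frame_expand_copy() grows with the packet length, while frame_detach() costs the same for any packet size.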
Zero-copy DMA is a mode of DMA transfer in which the Wi-Fi driver accesses the application's buffer directly, with no intermediate copies. This can significantly reduce CPU overhead and increase overall system throughput. To implement zero-copy DMA, the application and the kernel must cooperate so that the driver can reach the application's buffer. This is usually done through shared memory or memory-mapped buffers.
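The shared-memory idea can be sketched in user space with an anonymous shared mapping. This is an analogy under stated assumptions, not driver code: a forked child stands in for the producer (the driver or device), and shared_buf_create() and produce_in_child() are hypothetical helper names.

```c
#define _DEFAULT_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map len bytes that stay shared with any child forked afterwards. */
void *shared_buf_create(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/*
 * The child plays the producer and writes msg straight into the shared
 * buffer. The parent (the consumer) then reads the data in place: nothing
 * is copied from a producer-side buffer into a consumer-side buffer.
 * Returns 0 once the child has exited successfully, -1 otherwise.
 */
int produce_in_child(void *shared, const char *msg, size_t len)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        memcpy(shared, msg, len);   /* producer fills the shared region */
        _exit(0);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

After produce_in_child() returns 0, the parent can read the message directly from the pointer returned by shared_buf_create(). A real design would also need a way to signal which parts of the buffer are ready, which this sketch replaces with waitpid().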
Zero-copy DMA can also be used to transmit data from the application to the NIC. In this case, the application writes its data directly into a buffer that the NIC reads, without intermediate copies. Zero-copy DMA does, however, require careful implementation to ensure that the data is transferred securely and efficiently.
The application's buffer may also need to be properly aligned and sized to avoid performance problems. Overall, zero-copy DMA is a powerful technique for increasing Wi-Fi throughput because it reduces the overhead of moving data between the NIC and the application.
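A minimal sketch of the alignment and sizing step in user-space C follows. The helper names round_up_pow2() and alloc_aligned_buf() are hypothetical, and the alignment a real device needs is hardware-specific. The rounding is the same arithmetic as in the rtl8812au comment above: 1536 + 24 + 32 + 6 + 8 = 1606, which rounds up to 1664 at 128-byte granularity.

```c
#include <stdint.h>
#include <stdlib.h>

/* Round n up to the next multiple of align; align must be a power of two. */
size_t round_up_pow2(size_t n, size_t align)
{
    return (n + align - 1) & ~(align - 1);
}

/*
 * Allocate a buffer whose start address is a multiple of align and whose
 * size is len rounded up to a multiple of align. align must be a power of
 * two and a multiple of sizeof(void *), as posix_memalign() requires.
 * Returns NULL on failure; on success *out_len holds the padded size.
 */
void *alloc_aligned_buf(size_t len, size_t align, size_t *out_len)
{
    void *p = NULL;
    size_t padded = round_up_pow2(len, align);

    if (posix_memalign(&p, align, padded) != 0)
        return NULL;
    *out_len = padded;
    return p;
}
```

For example, alloc_aligned_buf(1606, 128, &n) yields a 128-byte-aligned buffer with n set to 1664. The buffer is released with free().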
By giving the device direct access to the destination buffer, zero-copy DMA can significantly reduce CPU overhead. For example, the ath10k driver for Qualcomm Atheros chipsets uses zero-copy DMA to avoid extra copying overhead.
static int __ath10k_htt_rx_ring_fill_n(struct ath10k_htt *htt, int num)
{
    BUILD_BUG_ON(HTT_RX_RING_FILL_LEVEL >= HTT_RX_RING_SIZE / 2);

    idx = __le32_to_cpu(*htt->rx_ring.alloc_idx.vaddr);
    while (num > 0) {
        skb = dev_alloc_skb(HTT_RX_BUF_SIZE + HTT_RX_DESC_ALIGN);
        if (!skb) {
            ret = -ENOMEM;
            goto fail;
        }
        ...
        paddr = dma_map_single(htt->ar->dev, skb->data,
                               skb->len + skb_tailroom(skb),
                               DMA_FROM_DEVICE);
        if (unlikely(dma_mapping_error(htt->ar->dev, paddr))) {
            dev_kfree_skb_any(skb);
            ret = -ENOMEM;
            goto fail;
        }

        rxcb = ATH10K_SKB_RXCB(skb);
        rxcb->paddr = paddr;
        htt->rx_ring.netbufs_ring[idx] = skb;
        ath10k_htt_set_paddrs_ring(htt, paddr, idx);
        htt->rx_ring.fill_cnt++;
        ...
    }
    ...
    return ret;
}
The __ath10k_htt_rx_ring_fill_n() function allocates a new skb (socket buffer) with dev_alloc_skb(). The buffer is then mapped for DMA with dma_map_single(), which returns a DMA address for it. The data is never copied into a separate DMA buffer. Instead, the DMA address is saved in the skb's control block (reached through the ATH10K_SKB_RXCB macro) and posted to the RX ring with ath10k_htt_set_paddrs_ring(), so the device can write received frames directly into the skb's data area. Once a frame has arrived, the driver looks up the skb for that ring entry and releases the mapping with dma_unmap_single(). That call only unmaps the buffer; it does not free it, and the skb itself is passed on for further processing.
This technique avoids copying received data from one kernel buffer to another, which reduces copying overhead and improves performance. The copy into the application's buffer is a separate step that this code does not change.
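The pattern of posting empty buffers to a ring ahead of time can be sketched as a small user-space simulation. Everything here is hypothetical and simplified: struct rx_ring and its functions are illustrative names, malloc() stands in for dev_alloc_skb() plus dma_map_single(), and the memcpy() inside device_write() stands in for the device's DMA write.

```c
#include <stdlib.h>
#include <string.h>

#define RING_SIZE 4     /* power of two, so the index wraps with a mask */
#define BUF_SIZE  2048

struct rx_ring {
    unsigned char *bufs[RING_SIZE]; /* buffers currently posted to the device */
    size_t lens[RING_SIZE];         /* bytes the device wrote into each slot */
    unsigned alloc_idx;             /* next slot the host will fill */
    unsigned fill_cnt;              /* number of posted buffers */
};

/* Host side: post up to num empty buffers. Returns 0, or -1 if out of memory. */
int ring_fill_n(struct rx_ring *r, int num)
{
    while (num > 0 && r->fill_cnt < RING_SIZE) {
        unsigned char *b = malloc(BUF_SIZE);

        if (!b)
            return -1;
        r->bufs[r->alloc_idx] = b;
        r->lens[r->alloc_idx] = 0;
        r->alloc_idx = (r->alloc_idx + 1) & (RING_SIZE - 1);
        r->fill_cnt++;
        num--;
    }
    return 0;
}

/* Simulated device: write a frame straight into the buffer posted at slot. */
int device_write(struct rx_ring *r, unsigned slot, const void *frame, size_t len)
{
    if (slot >= RING_SIZE || !r->bufs[slot] || len > BUF_SIZE)
        return -1;
    memcpy(r->bufs[slot], frame, len);  /* stands in for the DMA transfer */
    r->lens[slot] = len;
    return 0;
}

/* Host side: take ownership of a filled buffer; the payload is not copied. */
unsigned char *ring_pop(struct rx_ring *r, unsigned slot, size_t *len)
{
    unsigned char *b;

    if (slot >= RING_SIZE || !r->bufs[slot])
        return NULL;
    b = r->bufs[slot];
    *len = r->lens[slot];
    r->bufs[slot] = NULL;
    r->fill_cnt--;
    return b;
}
```

The pointer returned by ring_pop() is the same buffer that ring_fill_n() posted, which is the point of the technique: the host receives the data in the buffer it allocated and never copies it into a second one.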