PCIe Traffic in DPDK Apps

This recipe introduces PCIe Bandwidth metrics used in Intel® VTune™ Profiler to explore the PCIe traffic for a packet forwarding DPDK-based workload.

Content experts: Ilia Kurakin, Roman Khatko

Data plane applications running on systems with 10/40 GbE NICs typically make heavy use of platform I/O capabilities. In particular, they consume much of the bandwidth of the PCIe link that connects the CPU and the Network Interface Card (NIC). For such workloads, it is critical to use the PCIe link efficiently by balancing packet transfers against control communications. Understanding PCIe transfers helps you locate and fix performance issues.

For detailed methodology of the PCIe performance analysis for DPDK-based workloads, see Benchmarking and Analysis of Software Data Planes.

In this recipe, you can explore the stages of packet forwarding with DPDK and theoretical estimations for PCIe bandwidth consumption. Then, you can compare the theoretical estimations with the data collected with Intel® VTune™ Profiler.

  1. INGREDIENTS

  2. DIRECTIONS:

    1. Understand Inbound/Outbound PCIe Bandwidth metrics.

    2. Configure and run Input and Output analysis.

    3. Understand PCIe transfers required for packet forwarding.

    4. Understand PCIe Traffic optimizations.

    5. Estimate PCIe Bandwidth consumption.

    6. Compare PCIe Bandwidth vs Packet Rate.

Ingredients

This section lists the hardware and software tools used for the recipe.

Understand Inbound / Outbound PCIe Bandwidth Metrics

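PCIe transfers may be initiated by either the PCIe device (for example, a NIC) or the CPU. Intel® VTune™ Profiler therefore distinguishes two types of PCIe bandwidth: Inbound PCIe Bandwidth, induced by device transactions that target system memory (Inbound Reads and Inbound Writes), and Outbound PCIe Bandwidth, induced by CPU transactions that target the MMIO space of the device (Outbound Reads and Outbound Writes).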

Configure and Run Input and Output Analysis

To collect Inbound and Outbound PCIe Bandwidth data, use the Input and Output analysis.

In the GUI, create a project, select the Input and Output analysis on the HOW pane, and enable the Analyze PCIe bandwidth option.

On the command line, PCIe Bandwidth collection is controlled by the collect-pcie-bandwidth knob, which is set to true by default. Because the knob is enabled by default, the following command collects PCIe Bandwidth data along with DPDK metrics without setting it explicitly:

amplxe-cl -collect io -knob kernel-stack=false -knob dpdk=true -knob collect-memory-bandwidth=false --target-process my_process

In the Intel® VTune™ Profiler GUI in the collected result, navigate to the Platform tab and focus on the Inbound and Outbound PCIe Bandwidth sections.

Note

Starting with server platforms based on the Intel microarchitecture code name Skylake, PCIe Bandwidth metrics can be collected per-device. Root privileges are required.

Understand PCIe Transfers Required for Packet Forwarding

Packet forwarding with DPDK means receiving a packet (the rx_burst DPDK routine) and then transmitting it (tx_burst). The Core Utilization in DPDK Apps recipe describes how packets are received by means of an Rx queue that contains Rx descriptors. Packet transmission with DPDK works similarly. To transmit packets, a working core uses Tx descriptors, which are 16-byte data structures that store a packet address, size, and other control information. The core allocates the buffer of Tx descriptors, called the Tx queue, in contiguous memory. The Tx queue is handled as a ring buffer and is defined by its length, head, and tail. As with packet receiving, the core prepares new Tx descriptors at the Tx queue tail, and the NIC processes them starting from the head.
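The head/tail behavior of the Tx queue can be illustrated with a toy ring buffer. This is a minimal sketch and not DPDK code: the class name, the method names, and the descriptor fields (addr, size) are hypothetical, and the model keeps one slot unused so that head == tail always means "empty".

```python
from dataclasses import dataclass, field


@dataclass
class TxRing:
    """Toy model of a Tx queue: a fixed-length ring with head and tail indices."""
    length: int
    head: int = 0  # next descriptor the NIC will process
    tail: int = 0  # next slot the core will fill
    descriptors: list = field(default_factory=list)

    def __post_init__(self):
        self.descriptors = [None] * self.length

    def free_slots(self) -> int:
        # One slot stays unused so that head == tail always means "empty".
        return (self.head - self.tail - 1) % self.length

    def core_enqueue(self, packets) -> int:
        """Core side: write descriptors at the tail; return how many were queued."""
        n = min(len(packets), self.free_slots())
        for pkt in packets[:n]:
            self.descriptors[self.tail] = {"addr": pkt["addr"], "size": pkt["size"]}
            self.tail = (self.tail + 1) % self.length
        return n

    def nic_process(self, max_descriptors: int) -> list:
        """NIC side: consume descriptors starting from the head."""
        done = []
        while self.head != self.tail and len(done) < max_descriptors:
            done.append(self.descriptors[self.head])
            self.descriptors[self.head] = None
            self.head = (self.head + 1) % self.length
        return done


ring = TxRing(length=8)
queued = ring.core_enqueue([{"addr": i, "size": 64} for i in range(10)])
sent = ring.nic_process(max_descriptors=4)
print(queued, len(sent), ring.head, ring.tail)
```

In a real driver the core would also notify the NIC by writing the new tail value to a device register, which is the Outbound Write discussed in this recipe.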

For both Rx and Tx queues, the software updates the tail pointers to notify the hardware that new descriptors are available. The tail pointers are stored in NIC registers that are mapped to the MMIO space, so they are updated through Outbound Writes (MMIO Writes). The MMIO address space is uncacheable, so Outbound Writes, and especially Outbound Reads, are very expensive transactions that should be minimized.

For packet forwarding, PCIe transactions go through the following workflow:

  1. The core prepares the Rx queue and starts polling the Rx queue tail.

  2. The NIC reads an Rx descriptor in the Rx queue head (Inbound Read).

  3. The NIC delivers the packet to the address specified in the Rx descriptor (Inbound Write).

  4. The NIC writes back the Rx descriptor to notify the core that the new packet arrived (Inbound Write).

  5. The core processes the packet.

  6. The core frees the Rx descriptor and moves the Rx queue tail pointer (Outbound Write).

  7. The core updates the Tx descriptor in the Tx queue tail.

  8. The core moves the Tx queue tail pointer (Outbound Write).

  9. The NIC reads the Tx descriptor (Inbound Read).

  10. The NIC reads the packet (Inbound Read).

  11. The NIC writes back the Tx descriptor to notify the core that the packet is transmitted and the Tx descriptor can be freed (Inbound Write).
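The workflow above can be tallied by transaction type to see what a single forwarded packet costs on the PCIe link before any optimization. The sketch below only encodes the 11 steps as listed; the tuple layout and the function name are illustrative choices.

```python
from collections import Counter

# (step number, initiator, PCIe transaction type or None) for the 11 steps above.
# Steps with None touch only system memory from the core side and generate
# no PCIe transaction.
WORKFLOW = [
    (1, "core", None),
    (2, "NIC", "Inbound Read"),     # Rx descriptor read
    (3, "NIC", "Inbound Write"),    # packet delivered to memory
    (4, "NIC", "Inbound Write"),    # Rx descriptor write-back
    (5, "core", None),
    (6, "core", "Outbound Write"),  # Rx tail pointer update
    (7, "core", None),
    (8, "core", "Outbound Write"),  # Tx tail pointer update
    (9, "NIC", "Inbound Read"),     # Tx descriptor read
    (10, "NIC", "Inbound Read"),    # packet read
    (11, "NIC", "Inbound Write"),   # Tx descriptor write-back
]


def transactions_per_packet(workflow=WORKFLOW) -> Counter:
    """Count PCIe transactions of each type in the per-packet workflow."""
    return Counter(kind for _, _, kind in workflow if kind is not None)


counts = transactions_per_packet()
for kind, n in sorted(counts.items()):
    print(f"{kind}: {n}")
```

The tally contains no Outbound Reads: in this workflow the core never reads a NIC register.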

Understand PCIe Traffic Optimizations

To increase the maximum packet rate and reduce the latency, the DPDK uses the following optimizations:

In addition, at the platform level all transactions are performed at cache line granularity, so the hardware always tries to coalesce reads and writes to avoid partial cache line transfers.
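To see why coalescing matters, it helps to count how many descriptors share one cache line. The sketch below assumes a 64-byte cache line, which is an assumption not stated in this recipe; the 16-byte Tx descriptor and 32-byte Rx descriptor sizes are the ones used in this recipe. The function names are hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumption: cache line size of the platform


def cache_lines_needed(num_bytes: int, line: int = CACHE_LINE_BYTES) -> int:
    """Number of full cache lines touched by a transfer of num_bytes (ceiling)."""
    return -(-num_bytes // line)


def descriptors_per_line(descriptor_bytes: int, line: int = CACHE_LINE_BYTES) -> int:
    """How many whole descriptors fit into a single cache line."""
    return line // descriptor_bytes


print("Tx descriptors per cache line:", descriptors_per_line(16))
print("Rx descriptors per cache line:", descriptors_per_line(32))
print("Cache lines for a 64-byte packet:", cache_lines_needed(64))
```

Under this assumption, handling descriptors one at a time would move partial cache lines, while handling several adjacent descriptors together fills whole lines.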

Estimate PCIe Bandwidth Consumption

For the packet forwarding case with the optimizations described above, you can use the following equations to estimate PCIe bandwidth consumption at a given packet rate:

The equation for Outbound Write Bandwidth above is valid only when packets are processed in batches of the same size. The formula needs to be refined to stay accurate when packets are transmitted in batches of different sizes.

In the simple forwarding case, the core transmits all the received packets. The testpmd application follows the run-to-completion model, so you can assume that tx_burst operates on the same batches of packets as rx_burst. In other words, the Rx Batch Histogram (see the Core Utilization in DPDK Apps recipe) reflects the statistics of both packet receiving and packet transmitting. Therefore, you can use the Rx Batch Histogram to estimate Outbound Write Bandwidth in the generic case.

Instead of the Tx batch size, consider an "average" Tx batch size, weighted by the number of packets in each peak:

$$\overline{b} = \frac{\sum_i b_i \, p_i}{P}, \qquad p_i = b_i \, n_i, \qquad P = \sum_i p_i$$

where $b_i$ is the batch size, $n_i$ is the number of rx_burst calls with batch size $b_i$, $p_i$ is the number of packets in the $i$-th peak of the Rx Batch Histogram, and $P$ is the total number of packets forwarded. The picture below illustrates this calculation. For this example, the batch histogram has 3 peaks with batch sizes of 8, 10, and 12, and the calculation gives an average batch size of 11.
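The averaging can be sketched in a few lines of Python. This sketch assumes that the "average" is the mean batch size weighted by the number of packets in each histogram peak, which is one reading of the definitions above. The call counts in the example are hypothetical and were chosen only so that peaks at 8, 10, and 12 produce an average of 11; they are not measured data.

```python
def average_batch_size(histogram: dict) -> float:
    """Packet-weighted average batch size.

    histogram maps a batch size to the number of rx_burst calls that
    returned exactly that many packets.
    """
    packets_per_peak = {b: b * n for b, n in histogram.items()}
    total_packets = sum(packets_per_peak.values())
    if total_packets == 0:
        raise ValueError("histogram contains no packets")
    return sum(b * p for b, p in packets_per_peak.items()) / total_packets


# Hypothetical call counts for peaks at batch sizes 8, 10, and 12.
histogram = {8: 1000, 10: 6000, 12: 7000}
print(average_batch_size(histogram))  # 11.0
```

With this weighting, rx_burst calls that return zero packets do not affect the result, because they contribute no packets to any peak.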

For simplicity, assume that the Rx free threshold is greater than the maximum Rx batch size. Then, the final equation for Outbound Write bandwidth is the following:

Compare Estimations vs. Analysis

The charts illustrate a theoretical estimation for PCIe bandwidth and the PCIe bandwidth collected with Intel® VTune™ Profiler for the testpmd app configured as listed in the table below.

Parameter                Value
Packet Size, B           64
Rx Descriptor Size, B    32
RS Bit Threshold         32
Rx Free Threshold        32
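For a feel of the magnitudes involved with this configuration, here is a back-of-the-envelope model in Python. It is a rough sketch built on assumptions and is not the set of equations used for the charts in this recipe. It assumes that each forwarded packet costs one Rx descriptor read, one packet write, one Rx descriptor write-back, one Tx descriptor read, and one packet read. It also assumes that Tx descriptor write-backs happen once per RS Bit Threshold descriptors, that the Rx tail is updated once per Rx Free Threshold packets, and that the Tx tail is updated once per Tx batch. The 4-byte MMIO write size, the default average batch size, and all names are hypothetical, and cache line coalescing is ignored.

```python
def pcie_bandwidth_estimate(
    packet_rate_pps: float,
    packet_size: int = 64,      # bytes, from the configuration above
    rx_desc_size: int = 32,     # bytes, from the configuration above
    tx_desc_size: int = 16,     # bytes, Tx descriptor size stated in this recipe
    rs_bit_thresh: int = 32,    # from the configuration above
    rx_free_thresh: int = 32,   # from the configuration above
    avg_tx_batch: float = 4.0,  # hypothetical; take it from the Rx Batch Histogram
    mmio_write_bytes: int = 4,  # assumption: one 32-bit tail register write
) -> dict:
    """Estimated bandwidth in MB/s (1 MB = 10**6 bytes) per transaction type."""
    per_packet_bytes = {
        # Rx descriptor read + Tx descriptor read + packet read by the NIC
        "Inbound Read": packet_size + rx_desc_size + tx_desc_size,
        # packet write + Rx descriptor write-back + amortized Tx write-back
        "Inbound Write": packet_size + rx_desc_size + tx_desc_size / rs_bit_thresh,
        # amortized Rx and Tx tail pointer updates
        "Outbound Write": mmio_write_bytes * (1 / rx_free_thresh + 1 / avg_tx_batch),
    }
    return {k: v * packet_rate_pps / 1e6 for k, v in per_packet_bytes.items()}


for rate, batch in ((10e6, 4.0), (13e6, 8.0)):
    estimate = pcie_bandwidth_estimate(rate, avg_tx_batch=batch)
    print(f"{rate / 1e6:.0f} Mpps, average batch {batch:.0f}:")
    for kind, mbps in estimate.items():
        print(f"  {kind}: {mbps:.2f} MB/s")
```

In this model the Outbound Write term depends on the average batch size as well as on the packet rate, so a larger average batch can lower the estimated Outbound Write bandwidth even while the packet rate grows.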

The break in the middle of the Outbound Write curve deserves attention. This drop is a consequence of a change in the Batch Histogram caused by the increased packet rate. Before this point, the Batch Histogram has only two peaks, with batch sizes 0 and 4. At this point, a new peak with batch size 8 appears (see the histograms below). According to the equations listed above, this increases the average batch size and therefore reduces the Outbound Write Bandwidth.

For 10 Mpps:

For 13 Mpps:

In general, the theoretical estimations are very close to the data reported by Intel® VTune™ Profiler, though there are some deviations that may be caused by effects that the equations do not take into account.

The data plane application used in this recipe is already well optimized. However, the approach demonstrated here is a solid starting point for I/O-centric performance analysis of real-world applications.

See Also