Ten Commandments for DPDK — Get the most out of your code

Don’t we all want faster packet processing?

DPDK can help you get the most out of your code by boosting packet processing performance and throughput, leaving more computation time for your data plane applications. DPDK, short for Data Plane Development Kit, is an open source project that provides a set of libraries and drivers for fast packet processing.

DPDK offers a fast packet I/O mechanism with low latency and zero copy packet handling. This alone can boost your network application performance by a large margin. Aside from the fast I/O, DPDK also comes with a set of tools and processing methodologies to take your application to a whole different performance level.

Here are Ten Commandments: a short list of best practices that fit the DPDK philosophy and can help speed up your packet processing.

  1. Use DPDK’s thread affinity APIs to minimize, or better yet eliminate, the Linux scheduler overhead. Strive to give each thread of your application a dedicated physical core to maximize thread utilization.
  2. Make use of CPU affinity to give your application a dedicated set of processing cores. Together with thread affinity, this ensures your threads get 100% CPU utilization, with no system threads, interrupts, or other user processes getting in their way.
  3. Strive to utilize the “run to completion” job processing model wherever possible. Avoid L1/L2 cache thrashing across cores by keeping the job processing (along with its data) on a single core.
  4. Make smart use of data prefetch. Break your processing into pseudo-parallel stages to allow data to be fully cached before you access it. Adjust your prefetch stage duration so that you don’t access the data before it finishes prefetching – otherwise it will make the access slower.
  5. Don’t ignore instruction cache optimization. If your code surface grows too big, try to split different job types to different cores to make the subset of used functions fit fully in the core’s instruction cache.
  6. Use lockless data structures and mechanisms to store, access, and synchronize your data. Don’t use blocking synchronization mechanisms (mutex, semaphore), as they are both expensive and unnecessary in a virtually non-preemptive application.
  7. Make use of the SIMD instruction set provided by your CPU. Combined with staged data processing it can do wonders.
  8. Avoid unnecessary synchronization by giving each thread its own set of structures and variables instead of sharing them across threads. Avoid inter-thread dependency as a rule. This will allow your application to achieve linear scalability across cores.
  9. Help the compiler optimize your code by using likely/unlikely branch hints on conditional statements and hot/cold attributes on function declarations. If your compiler doesn’t support these options, put the likely-case conditional block ahead of the unlikely one to reduce branch mispredictions.
  10. Use block allocations instead of dynamically allocated memory. If your block pool is shared across threads, use per-thread object cache to further reduce the allocation/release overhead.
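
To make commandments 6 and 9 a little more concrete, here is a minimal sketch of a lockless single-producer/single-consumer ring with branch hints on the rare paths. It uses only C11 atomics and the GCC/Clang `__builtin_expect` builtin so that it stays self-contained; it is not DPDK code. DPDK ships its own ring library (`rte_ring`) and its own `likely()`/`unlikely()` macros, which you would normally use instead. The names `spsc_ring`, `ring_init`, `ring_enqueue`, `ring_dequeue` and the capacity `RING_SIZE` are hypothetical and chosen for illustration only.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Branch hints (commandment 9): GCC/Clang builtin, plain expression elsewhere. */
#if defined(__GNUC__) || defined(__clang__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

#define RING_SIZE 8u                 /* hypothetical capacity; must be a power of two */
#define RING_MASK (RING_SIZE - 1u)

/* Single-producer/single-consumer ring: no locks, only two atomic indices.
 * Each index sits on its own 64-byte line (assumed cache-line size) so the
 * producer and consumer cores do not invalidate each other's line. */
struct spsc_ring {
    _Alignas(64) atomic_uint head;   /* written only by the producer */
    _Alignas(64) atomic_uint tail;   /* written only by the consumer */
    void *slots[RING_SIZE];
};

static void ring_init(struct spsc_ring *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side: returns false when the ring is full. */
static bool ring_enqueue(struct spsc_ring *r, void *obj)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (unlikely(head - tail == RING_SIZE))
        return false;                /* full: should be rare on a well-sized ring */

    r->slots[head & RING_MASK] = obj;
    /* Release: the slot contents become visible before the new head does. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: returns NULL when the ring is empty. */
static void *ring_dequeue(struct spsc_ring *r)
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (unlikely(head == tail))
        return NULL;                 /* empty */

    void *obj = r->slots[tail & RING_MASK];
    /* Release: the slot may be reused only after it has been read. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return obj;
}
```

In a typical setup, one pinned thread calls `ring_enqueue` and another pinned thread calls `ring_dequeue`; neither ever blocks, which is the point of commandment 6. The sketch is only safe for exactly one producer and one consumer.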

Remember: DPDK is only a tool. The result is what you make of it and how closely you follow these Ten Commandments. It has been shown that, by following these rules, 10 Gbps bi-directional line-rate processing is achievable on a single CPU core (L2/L3 learning forwarding).
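
To get a feel for what line rate demands, the short sketch below computes the per-packet time budget. The frame size is an assumption, since the figure above does not state one: the usage note takes the worst case of minimum-size 64-byte Ethernet frames, each of which also occupies 20 bytes of preamble, start delimiter, and inter-frame gap on the wire. The function names are hypothetical.

```c
/* Per-frame overhead on the wire:
 * 7 B preamble + 1 B start-of-frame delimiter + 12 B inter-frame gap. */
#define ETH_WIRE_OVERHEAD_BYTES 20u

/* Packets per second that one direction of a link carries at a given frame size. */
static double line_rate_pps(double link_bps, unsigned frame_bytes)
{
    return link_bps / (8.0 * (frame_bytes + ETH_WIRE_OVERHEAD_BYTES));
}

/* Nanoseconds available per packet when one core handles `directions` streams. */
static double packet_budget_ns(double link_bps, unsigned frame_bytes,
                               unsigned directions)
{
    return 1e9 / (line_rate_pps(link_bps, frame_bytes) * directions);
}
```

Under these assumptions, `line_rate_pps(10e9, 64)` is about 14.88 million packets per second per direction, and `packet_budget_ns(10e9, 64, 2)` is about 33.6 ns per packet, roughly 100 cycles on a hypothetical 3 GHz core. That is why a single cache miss or lock on the fast path matters so much.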

We’re experts at developing code for faster packet processing. We follow these commandments, and many more best-in-class practices, in our drive to achieve the fastest packet processing possible. We have years of expertise in DPDK and work with companies to implement DPDK solutions. Talk to us if you’re interested in taking advantage of our experience.

By Rami N.
