Performance improvements in ProL2TP 1.5

Fri 24 May 2013
By Tom Parkin

The release notes for ProL2TP 1.5 contain a small line which says:

Include new usl_timer and poll implementation with scalability improvements.

If you were to completely ignore this as you skimmed the release announcement, I wouldn't blame you! It's fairly dry-sounding. But the reality of this small, unassuming line is rather more significant, as it hides a ten-fold improvement in runtime performance. Moving to ProL2TP 1.5 means you can free up CPU cycles for other applications in your embedded Linux box, or scale your ISP deployment further using the same servers.

In this blog post I'm going to take a closer look at performance in ProL2TP 1.5, how we tracked down hotspots in our code, and what we did to fix them.

Benchmarks, damned benchmarks, and statistics

Before starting out to improve ProL2TP performance, we first needed to come up with a metric to measure against. One obvious option would be data throughput, which is what the end user cares about, after all! However, prol2tpd routes data transport through the Linux kernel for efficiency, meaning the data path already scales nicely in terms of throughput.

As such, we looked a little deeper to get our performance metric. Ignoring data transport, the main task of prol2tpd itself is the bringup and maintenance of tunnels and sessions. For some of our customers, this is a one-off cost when the server boots, but for others it is usual to have some churn of tunnels and sessions over time, so the more efficient prol2tpd is here, the better. It made sense, therefore, to focus on tunnel bringup performance as a metric.

Having decided on the metric to benchmark, we needed a consistent and repeatable test which we could re-run easily during development and analysis in order to trace the effect of our improvements. I wrote a simple shell script which creates 10,000 L2TPv3 tunnels using prol2tpd's command line interface. As it runs, the script monitors the rate of tunnel bringup, and slows down if prol2tpd starts to fall behind. The faster prol2tpd can create tunnels, the faster the script creates new tunnels, and the faster it gets to its 10,000 tunnel target.
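The script itself is a straightforward wrapper around prol2tpd's command line interface, so it isn't reproduced here, but its pacing logic looks roughly like the sketch below. The request_tunnel() and count_established() helpers are hypothetical stand-ins for the real CLI calls; the stubs here just simulate a daemon that keeps up instantly.

    /*
     * Sketch of the benchmark's pacing loop.  The real tool is a shell
     * script driving the prol2tpd CLI; request_tunnel() and
     * count_established() are hypothetical stand-ins for those calls.
     */
    #include <stdio.h>
    #include <unistd.h>

    #define TARGET_TUNNELS 10000
    #define MAX_BACKLOG    100            /* how far requests may run ahead */

    static int established;                       /* simulated daemon state */
    static void request_tunnel(void)    { established++; }
    static int  count_established(void) { return established; }

    int main(void)
    {
        int requested = 0;
        unsigned int delay_us = 1000;             /* gap between requests */

        while (count_established() < TARGET_TUNNELS) {
            if (requested - count_established() < MAX_BACKLOG) {
                /* Daemon is keeping up: request another tunnel, go faster. */
                request_tunnel();
                requested++;
                if (delay_us > 100)
                    delay_us -= 10;
            } else {
                /* Daemon is falling behind: back off. */
                delay_us += 1000;
            }
            usleep(delay_us);
        }
        printf("brought up %d tunnels\n", count_established());
        return 0;
    }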

A control dataset

I would rerun the benchmark script many times during the performance analysis process. Before I started in earnest, though, I needed to get an idea of how prol2tpd performed before I made any changes. To find out, I ran the benchmark script using two fairly ancient AMD Athlon 64 systems as an LNS and LAC. These low-powered systems make performance issues easier to pick out, as they run up against their limitations more quickly than more capable hardware would.

I ran the first benchmark using a build of revision 2929 of our source code. Here are the results:

Tunnel bringup for SVN r2929

There are two important points to note from this graph. Firstly, that really is time in hours on the x axis, and we really did run this test for over three hours! Secondly, the system never made it up to 10,000 tunnels. The LNS system ground to a halt at just over 7,000 tunnels, its CPU overwhelmed.

Although system failure is never a nice thing to consider, it's exactly what you want to find for a performance study. Clearly, there were improvements to be made to the software. It was time to break out the analysis tools and get to work.

Where did my CPU cycles go?

Having run the test system into a failure state, I needed to find out where the test machines were spending their time. I started off using the perf tool to get a real-time breakdown of CPU use in the system.

I was easily able to identify the first hotspot in the code. Almost half the system's CPU time was being spent in l2tp_tunnel_find_by_addr! I referred back to the code listing for that function and saw that it was performing a search for a peer IP. The search was using a linear list walk, which wasn't ideal.

By itself this didn't seem enough to cause serious problems, even if it might be a bit slow. I decided to focus on code which was calling l2tp_tunnel_find_by_addr to see whether there were any hints there. Sure enough, I found a problem in l2tp_tunnel_alloc_config_id. This function takes care of allocating a per-tunnel sequential value that uniquely identifies the tunnel between the server and a particular peer. The identifier value is intended for use in SNMP databases. It was calling l2tp_tunnel_find_by_addr once per config id in order to find the first free id for a particular connection. As the tunnel count increased, that behaviour resulted in a lot of l2tp_tunnel_find_by_addr calls, eventually swamping the CPU.
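To make the cost concrete, here is a simplified illustration of that pattern. It is not the actual prol2tpd code, and the structures are invented for the example, but it shows why probing candidate ids with a linear list walk blows up as the tunnel count grows.

    /*
     * Simplified illustration (not the actual prol2tpd code) of the pattern
     * perf highlighted: finding the first free per-peer config id by probing
     * candidate ids one at a time, each probe walking the whole tunnel list.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct tunnel {
        struct tunnel *next;
        uint32_t peer_ip;           /* peer address (IPv4, for brevity) */
        unsigned config_id;         /* sequential per-peer identifier   */
    };

    /* Linear walk over every tunnel: O(n) per call. */
    static bool config_id_in_use(const struct tunnel *all, uint32_t peer_ip,
                                 unsigned id)
    {
        for (const struct tunnel *t = all; t; t = t->next)
            if (t->peer_ip == peer_ip && t->config_id == id)
                return true;
        return false;
    }

    /* Probes ids 1, 2, 3, ... so allocating the n-th id for a peer costs
     * O(n) probes x O(n) walk: quadratic per allocation, and it only gets
     * worse as more tunnels are brought up. */
    static unsigned alloc_config_id(const struct tunnel *all, uint32_t peer_ip)
    {
        unsigned id = 1;
        while (config_id_in_use(all, peer_ip, id))
            id++;
        return id;
    }

    int main(void)
    {
        struct tunnel t2 = { .next = NULL, .peer_ip = 1, .config_id = 2 };
        struct tunnel t1 = { .next = &t2,  .peer_ip = 1, .config_id = 1 };

        printf("next free id for peer 1: %u\n", alloc_config_id(&t1, 1));
        return 0;
    }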

Improvement 1: if you don't use it, you don't pay for it

After some discussion, we decided there were two problems here. Firstly, the algorithm for deriving the SNMP id was inefficient, and needed improvement. Secondly, the SNMP id was hard-coded to be always enabled, whether a particular user needed it or not.

As a first port of call, it made sense to turn off config_id by default, and allow for it to be enabled using an option in the configuration file. Once we'd added this change to prol2tpd, I reran my benchmark.

Tunnel bringup: SVN r2929 vs. r3031

By turning off a feature that is not widely used, we'd cut the runtime massively, from over three hours to just under one hour. This was good progress, but sadly we were still failing to hit the goal of 10,000 tunnels, reaching a maximum of around 8,500 tunnels before the CPU got bogged down. It was time to dig deeper.

Improvement 2: more timely timers

Firing up perf again, I took a look at where CPU time was being used now. This time the culprit appeared to be usl_ord_list_add, which is a function from our USL library for adding an item to an ordered list. The ordered list is implemented as a simple linked list. Ordering is enforced by means of an arbitrary per-item key, with the user supplying a function for comparing keys. Each time a new item is added to the list, a linear search of the list is performed in order to insert the new item in the correct order.
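For illustration, an ordered linked-list insert looks roughly like this. It is a generic sketch rather than the real usl_ord_list code, but it shows where the linear cost comes from.

    #include <stdint.h>
    #include <stdio.h>

    struct item {
        struct item *next;
        uint64_t key;               /* e.g. a timer expiry time */
    };

    /* User-supplied comparison, in the same spirit as usl_ord_list_add(). */
    static int item_cmp(const struct item *a, const struct item *b)
    {
        return (a->key > b->key) - (a->key < b->key);
    }

    /* Linear walk to find the insertion point: O(n) per insert. */
    static void ord_list_add(struct item **head, struct item *new_item)
    {
        struct item **pos = head;

        while (*pos && item_cmp(*pos, new_item) <= 0)
            pos = &(*pos)->next;

        new_item->next = *pos;
        *pos = new_item;
    }

    int main(void)
    {
        struct item a = { .key = 30 }, b = { .key = 10 }, c = { .key = 20 };
        struct item *head = NULL;

        ord_list_add(&head, &a);
        ord_list_add(&head, &b);
        ord_list_add(&head, &c);

        for (struct item *i = head; i; i = i->next)
            printf("%llu\n", (unsigned long long)i->key);
        return 0;
    }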

As before, I took a look at call sites for usl_ord_list_add, and found that the ordered list was being heavily used by USL in its usl_timer code. Each timer in the system was represented by a data structure in a list ordered by expiry time. Since prol2tpd uses a timer per tunnel to monitor things like ACK timeouts, it was clear that ord_list performance would become important as the tunnel count increased.

The most obvious thing to do next was to make the ord_list code faster. But try as I might, I struggled to get much more than a two-fold improvement over the original implementation. This wasn't going to be enough, and I concluded that what we really needed was a more appropriate data structure to act as the back end for the timer implementation. After a bit of reading around, I decided to go with a red-black tree, which is one of the many flavours of balanced search tree. It provides O(log n) performance for insertion, removal, and search, while only taking O(n) space, making it much more scalable than the ordered linked list.
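USL's rbtree itself is out of scope for this post, but the expiry-keyed idea is easy to demonstrate with glibc's tsearch() family, which happens to be implemented as a red-black tree. The sketch below is only a stand-in for the real timer backend: timers are keyed on (expiry, id) so that entries with equal expiry times remain distinct, and insert, lookup, and removal are all O(log n). Error handling and tdestroy() cleanup are omitted for brevity.

    #include <search.h>
    #include <stdint.h>
    #include <stdio.h>

    struct timer {
        uint64_t expiry_ms;     /* absolute expiry time           */
        unsigned id;            /* tie-breaker for equal expiries */
    };

    /* Order by expiry first, then by id, so every key is unique. */
    static int timer_cmp(const void *pa, const void *pb)
    {
        const struct timer *a = pa, *b = pb;

        if (a->expiry_ms != b->expiry_ms)
            return a->expiry_ms < b->expiry_ms ? -1 : 1;
        return (a->id > b->id) - (a->id < b->id);
    }

    int main(void)
    {
        void *root = NULL;      /* tree handle used by tsearch/tfind/tdelete */
        struct timer timers[] = {
            { .expiry_ms = 5000, .id = 1 },
            { .expiry_ms = 1000, .id = 2 },
            { .expiry_ms = 5000, .id = 3 },
        };

        /* O(log n) insertion per timer. */
        for (size_t i = 0; i < sizeof(timers) / sizeof(timers[0]); i++)
            tsearch(&timers[i], &root, timer_cmp);

        /* O(log n) lookup of a specific timer. */
        struct timer key = { .expiry_ms = 1000, .id = 2 };
        struct timer **found = tfind(&key, &root, timer_cmp);
        if (found)
            printf("timer %u expires at %llu ms\n",
                   (*found)->id, (unsigned long long)(*found)->expiry_ms);

        /* O(log n) removal once a timer fires or is cancelled. */
        tdelete(&key, &root, timer_cmp);
        return 0;
    }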

In order to convince myself that I'd chosen a significantly better data structure, and that I hadn't botched the coding, I created some test harnesses to profile the ord_list implementation alongside my new rbtree implementation. The results were very positive!

USL data structure performance

The plot tells the whole story: by switching to an rbtree rather than the ord_list we would gain much better scaling behaviour for addition and lookup, albeit at the expense of a more involved removal algorithm. Encouraged, I reworked the timer code to use the rbtree in place of the ord_list, and re-ran my benchmark.

Tunnel bringup: SVN r2929, r3031, and r3125

Huzzah! I'd cracked the 10,000 tunnel limit! And even better, the bringup was quicker than before, too. Even on old hardware we could bring up 10,000 tunnels in a little over 30 minutes.

Improvement 3: efficient fd handling

Replacing ord_list with an rbtree for the timer code turned out to be a great improvement. Yet I could still see the CPUs working hard during bringup and after the 10,000 tunnels had been constructed. Perhaps there was something more to be done?

I returned to the profiling tools, using sysprof in this case to locate the next bottleneck. It was in USL again, but this time in the code which provides prol2tpd with its core event loop, usl_fd. We were getting down into the very heart of the application now, and changes in this area could be far-reaching indeed.

Diving into the details of the usl_fd module, I could identify a number of inefficiencies. Firstly, the backend storage for file descriptors was based on lists, which meant a number of linear list walks during operation. That wasn't going to scale well. Not only that, but we were generating a pollfd set each time we called the poll system call, rather than only when the module's pool of file descriptors was altered. This seemed like it could be optimised for speed.
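The second point is easy to sketch: rather than rebuilding the pollfd array before every poll() call, keep it up to date as descriptors are registered and removed. The code below is a generic illustration of that approach, not the actual usl_fd implementation.

    #include <poll.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MAX_FDS 16384

    static struct pollfd fds[MAX_FDS];
    static nfds_t nfds;

    /* O(1): append the descriptor when it is registered with the event loop. */
    static int fdset_add(int fd, short events)
    {
        if (nfds >= MAX_FDS)
            return -1;
        fds[nfds].fd = fd;
        fds[nfds].events = events;
        fds[nfds].revents = 0;
        nfds++;
        return 0;
    }

    /* Find the entry, then swap it with the last slot: no per-poll rebuild. */
    static void fdset_remove(int fd)
    {
        for (nfds_t i = 0; i < nfds; i++) {
            if (fds[i].fd == fd) {
                fds[i] = fds[nfds - 1];
                nfds--;
                return;
            }
        }
    }

    int main(void)
    {
        /* The maintained array is passed straight to poll() each iteration. */
        fdset_add(STDIN_FILENO, POLLIN);
        int ready = poll(fds, nfds, 1000);
        printf("%d descriptor(s) ready\n", ready);
        fdset_remove(STDIN_FILENO);
        return 0;
    }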

Rolling up my sleeves I dug into the code, replacing usl_fd's internal list with another rbtree, and reworking the maintenance of the pollfd set. Once my tests were passing, I fired up the benchmark again, certain of a positive result. As the test progressed, my initial optimism turned to concern. After a strong start, system performance was degrading, even to the point of being worse than before!

Tunnel bringup: SVN r2929, r3031, r3125, and r3158

A setback!

What went wrong? Had I plotted the wrong dataset? No, I had not. My enhancements really had made performance worse, against all expectation.

This was a puzzling result, but as I looked more closely at the data I started to see a pattern. On both LNS and LAC machines, the balance of CPU usage had switched. Where we had been mostly working in user space, we were now working far more in kernel space. Looking again with sysprof suggested that we were spending most of our time in the poll system call.

I added some tracing code to track how we were using poll, and discovered that my most recent changes to USL had increased how often we called poll on average, while reducing the average number of active file descriptors per call. The meaning of this finding was subtle! By optimising the code around our calls to poll, the main event loop of prol2tpd was able to call poll more frequently. Furthermore, because the poll calls were more frequent for a constant rate of events, each poll call turned up fewer active fds on average.

But why did that make performance worse? I did some reading around, and found an excellent blog post by Zed Shaw describing exactly the behaviour we were seeing. As his research shows, poll is most efficient when the number of active file descriptors (i.e. fds with events on them) in your total fd set is relatively high: the kernel has to scan the entire pollfd set on every call, so the fewer active fds each call turns up, the more work is done per event. So by improving the performance of our userspace code, we were able to call poll more often, but with fewer active fds. As such, it performed worse per iteration, and we were calling it more!

This was a disappointing setback, but a rather compelling result. By tuning our userspace code, I'd managed to degrade the overall performance of the application, which isn't something you tend to pick up from the textbooks.

Improvement 4: picking the perfect poll

Having traced the root cause of the setback to poll, it seemed like a good time to revisit our use of this particular call. My profiling of our poll usage suggested that we could benefit from considering epoll instead. In general, epoll is more efficient than poll when you have a relatively low proportion of active fds in your overall fd set, which was the situation we were heading toward. To test the hypothesis, I added an abstraction in USL to support either poll or epoll, and fired up the benchmark once more in epoll mode.
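To show the shape of the epoll backend the abstraction can select, here is a minimal, self-contained example; it is not the actual USL code. Descriptors are registered with the kernel once via epoll_ctl(), and each epoll_wait() call returns only the descriptors that actually have events, so a large set of idle fds costs very little per wakeup.

    #include <sys/epoll.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MAX_EVENTS 64

    int main(void)
    {
        int epfd = epoll_create1(0);
        if (epfd < 0) {
            perror("epoll_create1");
            return 1;
        }

        /* Register stdin once; the kernel tracks it from now on. */
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = STDIN_FILENO };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &ev) < 0) {
            perror("epoll_ctl");
            return 1;
        }

        /* Wait up to one second; only ready descriptors are returned. */
        struct epoll_event events[MAX_EVENTS];
        int n = epoll_wait(epfd, events, MAX_EVENTS, 1000);
        for (int i = 0; i < n; i++)
            printf("fd %d is readable\n", events[i].data.fd);

        close(epfd);
        return 0;
    }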

Tunnel bringup: SVN r2929, r3031, r3125, r3158, and r3181

This was an excellent result. The venerable Athlon machines now brought up the 10,000 tunnel configuration in a mere 24 minutes, a 35% improvement on the result we'd achieved following the timer refactor with the rbtree backend, and around an order of magnitude better than the original benchmark. On my single-core LAC machine the CPU was sitting at around 80% idle with 10,000 tunnels up and running.

The bottom of the rabbit hole?

The truth of performance improvements is that you're never really done. After this set of changes, prol2tpd was mostly bound by operations in kernel space, which seemed like a good point to stop digging. But who knows? Perhaps we might tweak the kernel somehow to make things faster still? If we do jump down this particular rabbit hole, we'll be sure to document it here. Watch this space...

For the time being though, that's it. The 1.5 release of ProL2TP is a much faster and more efficient beast than the 1.4 release was. Whether you're running it on a tiny embedded box, or on a huge server in a humming data centre, that can only be a good thing.