Willy Tarreau's stuff: November 2017

The end of the 3.10 branch is a good opportunity to look back at how that worked, and to recall some important rules regarding how to choose a kernel for your products, and the risks associated with buying products running unmaintained kernels.

Four years and a half

Linux Kernel 3.10 was released on June 30th, 2013, or 4.5 years ago. Greg KH decided that this kernel would become a long term supported one (LTS), which means that it would receive fixes for about 2 years after the regular stable cycle (i.e. after the next version is issued). It was expected to be dropped by September 2015, and it's now declared dead today on November 4th, 2017 after 108 maintenance releases.

At HAProxy Technologies (or HapTech for friends), we actively rely on LTS kernels for our ALOHA appliances, as the kernel is the most critical component for a networked product. Each major version of our software is maintained for 3 years and ships with a proven stable kernel. This means that the LTS cycle, despite being much longer than others, is still not enough to ensure smooth maintenance for our products. That's why I've been taking over maintenance of some of these LTS kernels for a while now. Our version 5.5 was maintained till October 2015, explaining why I maintained kernel 2.6.32 for a while, and our version 6.5, issued in October 2014 using the then one-year-old kernel 3.10, was maintained till October 2017, thus in theory I needed to maintain this kernel from September 2015 to October 2017. In practice I pursued 2.6.32 for 9 extra months after our product's end of life as it was also used by Debian, whose kernel team helped me a lot during all this cycle, and in return Greg was kind enough to keep maintaining 3.10 till that date, saving me from having to maintain two kernels at once. Thus I inherited 3.10 on March 16th, 2016 after Greg issued 3.10.101, and I maintained it till the end of October 2017.

A different experience

My experience of 3.10 was very different from the 2.6.32 one. First, as I mentioned, 2.6.32 was heavily used by Debian and Ubuntu. This usage kept some rhythm in the release cycle because we had frequent exchanges with their kernel teams regarding certain fixes and backports. For 3.10 I improved my process and tools and I thought I would release more often but the reality was a lot different. Few distros relied on it so I had to decide to work on it once in a while in order to catch up with other branches. And when you're busy working on other projects you don't always notice that a lot of time has already elapsed, so in 19 months I have only emitted 7 versions (approximately one every 3 months). Second, while 2.6.32 was mostly found in mainstream distros, 3.10 was mostly found in embedded networked products. And here it's scary to see that the vast majority of such products simply don't apply fixes at all and don't follow updates! Were it not for our own products and the hope of serving a few serious users, it would be discouraging to see that while 3.10.108 has just been emitted, you still find 3.10.17, 3.10.49 and 3.10.73 on a lot of devices in the wild!

Why LTS matters

Most consumers don't realize the risks they're taking by buying products running on outdated kernels. Most people don't care; some find it convenient as it allows them to download some applications to "root" their devices by exploiting some unfixed bugs (which then basically become the only bugs the vendors care to fix when they do). Others are just used to rebooting their home router in the basement once in a while because it just hangs every 3 weeks for no apparent reason (but it's a cheap one, surely it's expected). And of course everyone believes the vendors when they claim that they still backport important fixes into their kernels. This is wrong at best and in fact almost always a lie in practice.

First, there's no such notion of "important fixes". Even serious vendors employing several kernel developers got caught missing some apparently unimportant fixes and remaining vulnerable for more than two years after LTS was fixed. So you can imagine the level of quality you may expect from a $60 WiFi router vendor claiming to apply the same practices... The reality is that a bug is a bug, and until it's exploited it's not considered a vulnerability. Most vulnerabilities are first discovered as plain bugs causing a kernel panic, a data corruption, or an application to fail, and are fixed as such. And only a few of such bugs are observed with the eye of someone trying to exploit them and elevated to a vulnerability. Some vulnerabilities are found by researchers actively seeking them, but they represent a tiny part of the bugs we fix every year.

During the 3.10 life cycle, 6597 patches were backported (80% during the first ~3 years Greg maintained it). That's 4.15 per day on average or 29 per week. This simply means that in 4.5 years, we closed 6597 opportunities for malicious people to try to exploit bugs and turn them into profitable vulnerabilities. An interesting observation is that 1310 of them were discovered after the 3rd year, so the common belief of "if it's old, surely it's reliable by now" doesn't work at all there.
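For those who want to check these averages, here is a minimal sketch of the arithmetic. It assumes only a POSIX shell and `awk`; the function name `patch_stats` is made up for the illustration, and the 1588 days are simply the span between the two dates given above.

```shell
#!/bin/sh
# Figures taken from the text above.
patches=6597        # fixes backported over the life of 3.10
late_patches=1310   # fixes merged after the third year
days=1588           # days from 2013-06-30 (release) to 2017-11-04 (end of life)

# Hypothetical helper: print per-day and per-week rates plus the late share.
# $1 = total patches, $2 = patches after year 3, $3 = lifetime in days
patch_stats() {
    awk -v p="$1" -v l="$2" -v d="$3" 'BEGIN {
        printf "%.2f per day, %.0f per week, %.0f%% after year 3\n",
               p / d, 7 * p / d, 100 * l / p
    }'
}

patch_stats "$patches" "$late_patches" "$days"
# prints: 4.15 per day, 29 per week, 20% after year 3
```

The last figure is consistent with the 80% of patches merged during the first three years.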

How do we know that these 6597 patches we merged were the right ones and that we didn't miss some ? That's simple : we don't know! We only have the biggest confidence anyone can have on the subject because LTS kernels are the de-facto reference in terms of kernel backports. First, all the patches that appear there were tagged by their authors for stable backporting when submitted for inclusion, so surely the code's author knows better than anyone else if his fix needs to be backported and how. Some developers even take the time to provide the backports themselves for various stable kernels. LTS maintainers exchange reviews, patches and suggestions for their respective branches, and have access to some unpublished reproducers needed to validate certain sensitive backports. Second, each release goes through public scrutiny and patch authors get a copy of the backports of their work to verify that it's properly done or is not missing a specific patch. Quite often we get some links to extra commits to backport, or a notice about something that will not work correctly due to a difference between mainline and the old kernel, or simply something we did wrong. Third, all stable kernels are built and booted on all supported architectures. That's 121 builds and 84 boots for every single 3.10 version before the version is released. And this process is extremely reliable, because among the 6597 patches we backported, only 9 were later reverted because they were not suited or caused trouble. That's 99.86% of success on average for each release! Who can claim to beat that in their isolated office by secretly deciding which patch is needed and which one is not, and having to perform their backport without the opportunity of the patch's author reviewing the work ? Simple : nobody. In fact it's even worse, by picking only certain fixes, these people can even damage their kernels more than by not picking such fixes, because such fixes rely on other patches to be backported. 
This irresponsible practice must absolutely stop and nobody should ever cherry-pick a selection of patches from stable kernels. All of them are needed.
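The success rate quoted above can be recomputed the same way. This is only a sketch of the arithmetic, assuming `awk` is available; `success_rate` is a name chosen for the example.

```shell
#!/bin/sh
# Hypothetical helper: share of backported patches that were never reverted.
# $1 = backported patches, $2 = patches later reverted
success_rate() {
    awk -v n="$1" -v r="$2" 'BEGIN { printf "%.2f%%\n", 100 * (n - r) / n }'
}

success_rate 6597 9    # prints 99.86%
```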

How to use LTS for your products

LTS kernels are very convenient to use, because only what matters is updated. There's no API change, no unexpected behaviour change, no need to revalidate boot command line, userland nor scripts, no surprises. Sometimes it even causes us gray hair to backport some fixes without any user visible impact. And what's even better is that by using these kernels which experience very few changes, you can have a lot of product-specific patches that will most of the time apply well on top of the latest kernel version. It's also possible to simply merge the new kernel into yours if you're maintaining your own kernel in a Git repository. Most of the time, no human interaction will be needed at all. At HapTech, on top of 3.10 we used to have around 300 patches. We faced a patch conflict 3 times in 3 years, which each time was trivial to fix. And it's important to keep in mind that if you experience a conflict, it means that the code you used to patch (hence that you heavily rely on) used to have a bug, so actually such conflicts tend to be good news for the stability and safety of your product.
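The Git-based approach mentioned above could look roughly like this. It is a sketch under assumptions, not a description of our actual tooling: the remote name `stable` (pointing at a linux-stable clone) and the helper name `merge_stable_tag` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: merge a stable release tag into the current branch of
# a product kernel tree. "stable" is an assumed remote name; adjust it to
# whatever your repository uses.
merge_stable_tag() {
    tag=$1
    # Fetch only the requested tag from the assumed "stable" remote.
    git fetch --quiet stable "refs/tags/$tag:refs/tags/$tag" || return 1
    if git merge --no-edit --quiet "$tag" >/dev/null; then
        echo "merged $tag cleanly"
    else
        # List conflicting files: upstream fixed code that a local patch
        # also touches, so these deserve a careful look.
        git diff --name-only --diff-filter=U
        return 2
    fi
}

# Example, run inside the product tree:
#   merge_stable_tag v3.10.108
```

When the merge stops on a conflict, the listed files are exactly the places where a local patch overlaps an upstream fix.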

We often hear the same comments from some users : "this kernel was issued too recently, let's wait a bit to see if anybody reports a regression". This is fine! I personally prefer users not to trust my work and to review it than them blindly deploying my occasional mistakes if it's too critical for them. As a rule of thumb, if this kernel is supposed to be easy to update (e.g. used on your own machines), better deploy ASAP. But if it's going to be sent to customers where an update might involve finding a moment with the customer, or emitting another version making you look bad, better wait a week or two. What matters is that ultimately all fixes are deployed and that bugs don't stay exposed for too long. How long is too long ? It depends. At HapTech, we emit a new maintenance release every few months or immediately after a sensitive security fix gets merged in one of the components we use (i.e. mostly kernel, haproxy, openssl). When we emit such a release, we always upgrade the 3 of them to the latest maintenance version (in the same branch). This means that the kernels found in the field on our products are in the worst case a few months old. This is orders of magnitude more responsible to customers than dropping an unfixed 3-year-old kernel in the field, exposing them to attackers. And by doing so we've never experienced a single regression caused by a kernel upgrade within the same maintenance branch. This process is safe and proven, and should be adopted by anyone distributing kernels with products. The only thing which may vary is the frequency of updates.

If by educating users we manage to reach the point where no kernel found in the field is more than 6 months old, we'll have significantly improved the stability and safety of devices connected to the internet.
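A rough way to test a deployed kernel against that six-month target, assuming you know when the deployed stable version was released (for instance from its tag in a linux-stable clone). The function name `kernel_too_old` and the 183-day threshold are illustrative choices, not an established rule.

```shell
#!/bin/sh
# Hypothetical check: is a deployed kernel older than about six months?
# $1 = release time of the deployed version, in seconds since the epoch
# $2 = current time, in seconds since the epoch
kernel_too_old() {
    limit=$(( 183 * 86400 ))    # ~6 months, an illustrative threshold
    [ $(( $2 - $1 )) -gt "$limit" ]
}

# Example, inside a linux-stable clone:
#   released=$(git log -1 --format=%ct v3.10.108)
#   kernel_too_old "$released" "$(date +%s)" && echo "update needed"
```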

What you can do to improve the situation

When you buy a product shipped with an outdated kernel, most likely it's because the device needs an update. Once updated, take a look at the version and the build date in a terminal or wherever it appears in the device's interface. For example here on my tablet :

$ uname -a
Linux localhost 3.4.39 #4 SMP PREEMPT Fri Oct 17:48:45 CST 2014 armv7l GNU/Linux

This one is based on 3.4 which is an LTS kernel. That's a good start. To know which kernels benefit from long term maintenance, please visit this kernel.org page. A non-LTS kernel must be considered a very bad sign, as sometimes it implies that the vendor didn't even care to port their local patches to newer kernels. Then it's important to check how old the version is :

$ git show v3.4.39
tag v3.4.39
Tagger: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date:   Fri Apr 5 10:09:01 2013 -0700

This kernel was released 4.5 years ago, while the latest version in the 3.4 branch is 3.4.113, which dates from a year ago. This version was already missing 3357 fixes at the time of the last 3.4 release one year ago! And as can be spotted in the uname output above, the last time it was built (hence had a chance to get a fix backported) was 3 years ago, which proves either a problem with the update process on this device or a total lack of care from the vendor. In short this kernel is totally unreliable and insecure. Such problems must absolutely be reported to the vendor. Some vendors are simply unaware of the problem and will be willing to improve; some have already improved a lot, at least to reduce the number of issues reported on their products. Some will explain that they just ship the kernel provided by their SoC vendor and that they have no added value on top of it, or worse that they don't even understand how it works. These ones at least need to be aware that some SoC vendors are better than others regarding mainlining, and that at least asking for a more recent kernel doesn't cost much and can result in fewer support calls on their side. And others absolutely don't care and must definitely be avoided in the future since there's no chance the products you buy from them will work well with thousands of unfixed bugs.
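The checks described in this section can be sketched in a few shell functions. Everything here is an assumption made for illustration: the function names are invented, the LTS list only contains the branches named in this article (the kernel.org page remains the reference), and the last function must be run inside a clone of the linux-stable repository.

```shell
#!/bin/sh
# Illustrative list only: the branches mentioned in this article.
LTS_BRANCHES="2.6.32 3.4 3.10 4.4 4.9"

# Hypothetical helper: turn a release string into its maintenance branch,
# e.g. 3.4.39 -> 3.4, 2.6.32.71 -> 2.6.32.
kernel_branch() {
    v=${1%%[-+]*}    # drop any local suffix such as "-g1a2b3c4" or "+"
    case $v in
        2.6.*) echo "$v" | cut -d. -f1-3 ;;
        *)     echo "$v" | cut -d. -f1-2 ;;
    esac
}

# Succeeds when the branch is on the illustrative list above.
is_lts_branch() {
    for b in $LTS_BRANCHES; do
        [ "$b" = "$1" ] && return 0
    done
    return 1
}

# Count non-merge commits present in the latest tag but not in the deployed
# one. $1 = deployed tag, $2 = latest tag of the same branch.
missing_fixes() {
    git rev-list --count --no-merges "$1..$2"
}

running=$(uname -r)
branch=$(kernel_branch "$running")
if is_lts_branch "$branch"; then
    echo "$running: branch $branch is on the (illustrative) LTS list"
else
    echo "$running: branch $branch is not on the (illustrative) LTS list"
fi

# Example, inside a linux-stable clone:
#   missing_fixes v3.4.39 v3.4.113
```

The count returned by `missing_fixes` depends on how commits are counted, so it may not match the figure quoted above exactly.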

How bad can an unfixed kernel be ?

Sometimes people tell me that their kernel is not "that old". OK, let's see numbers for some 3.10 kernels still commonly encountered in field, compared to 3.10.108 (which will itself be outdated once released) :

Kernel version	age	# of known unfixed bugs
3.10.108	0	0
3.10.73	17mo	2302
3.10.65	19mo	2641
3.10.49	3yr	3456
3.10.28	3.5yr	4661
3.10.17	4yr	5473

It's unknown how many exploitable vulnerabilities are present in these kernels, however it's certain that all of them are at least locally exploitable, allowing for example a browser plugin to inject malware code into the system and take full control of the device to steal data or participate in internet attacks. And if you don't care about security issues, just think about some of these bugs that I have encountered on various devices running outdated kernels, some of which disappeared after I managed to rebase and rebuild the kernel :

Random freezes and panics of all sorts, some even causing the device to overheat
File-system corruption on the NAND flash bricking the device (until I could reinstall it over the serial port)
File-system bugs causing the NAND flash to be "tortured" on every single block and aging very fast to the point of periodically reporting bad blocks
eMMC bug causing I/O errors and retries to happen every few kilobytes, making the device respond very slowly
SD card failing to enumerate after a few insertion/removal cycles
Memory leaks causing progressive slowdowns and regular crashes
WiFi random packet truncation causing many TCP connections to freeze and DNS to fail to resolve
WiFi disconnections of all sorts
WiFi to Ethernet bridge suddenly causing network packet storms by looping on certain multicast packets
File-system bugs on a NAS causing the disk to be immediately remounted read-only until the next reboot
Ethernet port on a NAS randomly switching between 100 Mbps and 1 Gbps several times a minute
Ethernet port not receiving packets anymore after some time
Webcam driver bug killing the whole USB stack

Sounds familiar ? You can be confident that none of them are considered critical by your product vendor and that the relevant patches have no chance to get backported if they don't follow the official stable kernels (at least because it's hard to spot them as it's often hard to link the cause to the impact).

Should you upgrade to 3.10.108 now ?

The answer is simply "no". 3.10.108 was emitted to "flush the pipe" of known pending fixes still affecting 3.10. It's fine the day it's emitted and possibly outdated the day after. Some late upgraders may consider that it could possibly remain OK for a few weeks or months, around the same interval as between two consecutive 3.10 releases. But that's just "probabilistically" true, because until now, if a high-level vulnerability had been revealed, a new 3.10 would have been emitted immediately after with a fix. Now it won't happen anymore, so you're playing Russian roulette by deploying it. Of course one might think that it's less critical than keeping any other 3.10. But it's far better to upgrade to other stable kernels such as 4.4 which will be maintained till 2022. We at HapTech are using 4.4 and 4.9 in our more recent versions and both of these work very well.

So it's really time to switch now! 3.10 is dead.

2017-11-04

Look back to an end-of-life LTS kernel : 3.10