Dates and Events:
|
OSADL Articles:
2023-11-12 12:00
Open Source License Obligations Checklists even better nowImport the checklists to other tools, create context diffs and merged lists
2022-07-11 12:00
Call for participation in phase #4 of Open Source OPC UA open62541 support projectLetter of Intent fulfills wish list from recent survey
2022-01-13 12:00
Phase #3 of OSADL project on OPC UA PubSub over TSN successfully completedAnother important milestone on the way to interoperable Open Source real-time Ethernet has been reached
2021-02-09 12:00
Open Source OPC UA PubSub over TSN project phase #3 launchedLetter of Intent with call for participation is now available |
"I just made a trivial code cleanup ..."
Crashes included?
Once upon a time there was a Linux kernel programmer who said "I just made a trivial code cleanup ...", and submitted a patch to the Linux Kernel Mailing List from where it was merged into the Linux kernel in mid November 2012. Among others, the programmer found that the element curr of the acpi_ec structure was repeatedly referenced so it would merit to be stored on stack in a separate variable to simplify the code. He, therefore, changed
struct acpi_ec *ec;
if (!ec->curr)
to
struct acpi_ec *ec;
struct transaction *t = ec->curr;
if (!t)
There is nothing wrong with it, and this change probably helps to increase the readability of the code. By the way: Even if the newly introduced variable will be used a couple of times later on, a modern compiler almost certainly will not generate faster and/or shorter code after this change.
Unfortunately, the above code snippets only demonstrate the principle of the code change but do not exactly represent the original situation. In reality, the code contains a lock
spin_lock_irqsave(&ec->lock, flags);
if (!ec->curr)
that the patch moved one line down:
struct transaction *t = ec->curr;
spin_lock_irqsave(&ec->lock, flags);
if (!t)
This new situation now allows another concurrently running kernel thread to update ec->curr and, thus, to invalidate the local copy t – a sincere way to a kernel crash when an attempt is made to access a structure element through the invalid local base pointer. The probability of such a one-line race condition to occur in a particular system certainly is very low, maybe less than once a year. With many million Linux systems installed worldwide, however, such crashes may occur everyday anywhere!
Lessons to be learned
- Do not write sloppy kernel code and hope to find bugs in a short-term test
If you introduce a race, it may take years of testing to trigger it. Think at every new line of kernel code that you write down what happens when it is executed concurrently by another kernel thread. (This is, by no means, restricted to kernel development; with the increasing popularity of multi-threaded programming, application developers will need to be aware of race conditions in the same way as their kernel colleagues.) - Document your code changes with respect to locks
A one-line comment in complex locking situations such as, for example, "/* called with rq->lock held and irqs disabled */" can help a lot when "cleaning up" code and make the intent of locks (or the lack of the same) clear – too often this information is buried in the git commit message or the discussion thread on the respective mailing list. - Do not combine several code upgrade purposes into a single patch
The above-cited patch was titled "Add more debug info and trivial code cleanup". If the patch had been submitted in two parts, one part to cleanup the code and another part to add debug info, it probably would have been obvious that the cleanup was not trivial at all but would lead to kernel crashes. In fact, it did, and these crashes were extremely difficult to debug and have thrown a number of people into deep frustration. The bug was only fixed in the RT patches in December 2013 - more than a year after its introduction! - Do not use suggestive patch titles
It certainly was not a good idea to call the code cleanup "trivial" in the patch header. If the header simply had been "code cleanup", reviewers possibly would have paid some more attention to the patch and discovered the bug.
PS: I certainly do not want to blame the author of the original patch. I know that even the most experienced developers can make mistakes. The purpose of this article rather is to give an example of how easily an apparently trivial code change can obscure a bug and how easily the introduction of a race condition can be overlooked.