Real Time Linux Workshops
15th Real Time Linux Workshop, October 28 to 31, 2013 at the Dipartimento Tecnologie Innovative, Scuola Universitaria Professionale della Svizzera Italiana in Lugano-Manno, Switzerland
"Embers in the ashes" or how to squeeze diagnostic information out of a crashed Linux system
Carsten Emde, Open Source Automation Development Lab (OSADL) eG
The Linux kernel provides a number of very powerful diagnostic tools for tracing and debugging; as a result, most kernel problems can be located and fixed within a couple of hours. In rare cases, however, the kernel may simply stop executing and provide no information about which core is executing which instruction and why it is blocked. Such problems may take months, if not years, to locate and fix. They are often related to a race condition that results from an erroneous locking strategy; consequently, the more often a system is preempted, the higher the probability of such crashes. In other words, a PREEMPT_RT-equipped Linux kernel will suffer from silent crashes more often than the non-preemptive standard Linux kernel.
Given that Linux is planned for use in safety-critical environments such as railway control and driving assistance systems, it is mandatory to fix all Linux kernel bugs that cause a system to randomly stop execution. A prerequisite for such bug fixing is the availability of adequate tools.
A well-known method of investigating silent crashes is the SysRq break mechanism, which can be triggered via the keyboard or even via network ICMP signaling. Such triggers can be used to dump the state of all tasks, inspect timers, force a kernel panic, and so on. But what can we do if the keyboard and the network have crashed as well? This is the kind of situation the non-maskable interrupt (NMI) was invented for. Specific interrupt handlers can be supplied to let the NMI trigger a particular SysRq command if a related input bit is set. The standard parallel port or GPIO control pins can be used for this purpose. A simple parallel port plug is available from OSADL along with a related kernel driver. But what can we do if the industry decides to replace NMI-equipped architectures with chips such as ARM processors that do not have an NMI?
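To make the SysRq commands mentioned above more concrete, the following minimal Python sketch maps a few of them to their standard command keys and writes the key to /proc/sysrq-trigger, the kernel's software interface to SysRq. It is an illustration only and not taken from the paper: the names SYSRQ_COMMANDS and trigger_sysrq are hypothetical, the write requires root privileges, and it obviously only works while user space is still alive, which is exactly the limitation that the NMI-based approach addresses.

```python
from pathlib import Path

# A small subset of the SysRq command keys described in the kernel's
# SysRq documentation. The dictionary keys are illustrative labels only.
SYSRQ_COMMANDS = {
    "dump_tasks": "t",           # dump the state of all tasks
    "dump_blocked_tasks": "w",   # dump tasks in uninterruptible (blocked) state
    "show_cpu_backtraces": "l",  # stack backtrace of all active CPUs
    "show_timers": "q",          # dump armed hrtimers and clockevent devices
    "force_panic": "c",          # deliberately crash the system
}


def trigger_sysrq(action, trigger_path="/proc/sysrq-trigger"):
    """Write the SysRq key for `action` to the trigger file.

    `trigger_path` is a parameter so that the function can be tried
    against an ordinary file; on a real system the default path
    requires root privileges.
    """
    key = SYSRQ_COMMANDS[action]  # raises KeyError for unknown actions
    Path(trigger_path).write_text(key)
    return key
```

Calling trigger_sysrq("dump_tasks") as root would ask the kernel to print the task list to the kernel log; passing a temporary file as trigger_path allows a dry run without touching the running system.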
This paper describes in detail the various options for investigating a silently crashed Linux system, based on examples of recently detected and fixed Linux kernel locking bugs. It also reports on discussions with ARM engineers about how the lack of an NMI in this architecture can be compensated for by other diagnostic tools and strategies. We need to make sure that whenever a Linux system crashes, the embers in its ashes can be used to understand the underlying mechanism of the crash, so that a fix can be provided in a reasonable amount of time.