Bad things come to those who wait
Real-time Linux kernel for AM 335x long-time stable now
One of OSADL's most important tasks is the quality assessment and assurance of Open Source software projects – the top Open Source project being the Linux RTOS kernel aka PREEMPT_RT. This is mainly done in the OSADL test center called QA Farm where "QA" stands for both Quality Assessment and Quality Assurance. Currently, the QA Farm hosts a wide variety of Linux-driven CPU boards:
- ARM, MIPS, PowerPC and x86 base architectures
- Year of CPU design from 1995 to 2013
- Address size 32 bit and 64 bit
- Number of cores from 1 to 32 (1 to 16 per socket)
- With and without hyperthreads
- CPU clock frequency from 133 MHz to 3.467 GHz
- RAM size from 26 MByte to 65.756 GByte
- Linux kernel versions from 2.6.33 to 3.10
- Native and virtualized systems
Quality assessment is based on continuous monitoring of a large number of variables, combined with a threshold-based alarm and escalation mechanism. Quality assurance means that whenever an anomaly is detected, its origin is documented and the problem is (read: should be) fixed, depending on its relevance.
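To make the idea of a threshold-based alarm with escalation more concrete, here is a minimal sketch in Python. It is not the QA Farm's actual monitoring code; the names `Threshold`, `classify` and `should_escalate`, the threshold values and the readings are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical warning and critical limits for one monitored variable."""
    warning: float
    critical: float


def classify(value: float, t: Threshold) -> str:
    """Map a single reading to an alarm state."""
    if value >= t.critical:
        return "critical"
    if value >= t.warning:
        return "warning"
    return "ok"


def should_escalate(states: list[str], repeat: int = 3) -> bool:
    """Escalate once the last `repeat` readings were all outside the ok range."""
    recent = states[-repeat:]
    return len(recent) == repeat and all(s != "ok" for s in recent)


# Made-up limits and readings (for example, a latency in microseconds).
latency_limits = Threshold(warning=50.0, critical=100.0)
readings = [20.0, 55.0, 60.0, 120.0]
states = [classify(v, latency_limits) for v in readings]
print(states, should_escalate(states))
```

Requiring several consecutive out-of-range readings before escalating is one possible design choice; it avoids raising an escalation on a single outlier.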
Quality assessment
Quality assessment is the easier part. The accompanying figure displays, for example, the results of more than 90 repeated latency tests, each based on 100 million wake-up cycles. The worst-case latency was never longer than 23 µs. Such quality assessment is included free of extra charge in the OSADL flat-rate service fee (one board per share). The board that delivered these excellent data was evaluated on behalf of an OSADL member who is using it successfully in industrial devices that rely on such outstanding hard real-time capabilities. It is running an Intel G 850 CPU at 2,900 MHz clock frequency. Initially, some minor fine-tuning was necessary to get interrupts originating from 3D graphics out of the way. Thereafter, the board was perfect and did not require any extra work to make it real-time compliant. Only six off-tree patches were applied in addition to the PREEMPT_RT patch; three of them have already been merged into later kernel versions, and the remaining patches add some monitoring functionality to the kernel. None of them is related to real-time.
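The worst-case latency quoted for the repeated latency tests is simply the largest latency observed in any run. The following sketch shows that reduction, assuming each run is available as a histogram that maps a latency in µs to the number of wake-up cycles measured at that latency. The function names and the sample histograms are invented for illustration and are not OSADL measurement data.

```python
def worst_case_latency(histogram: dict[int, int]) -> int:
    """Return the largest latency (in µs) that occurred at least once in one run."""
    return max(latency for latency, count in histogram.items() if count > 0)


def worst_case_over_runs(runs: list[dict[int, int]]) -> int:
    """Return the largest worst-case latency over all runs."""
    return max(worst_case_latency(h) for h in runs)


# Made-up histograms: latency in µs -> number of cycles with that latency.
runs = [
    {5: 900, 6: 90, 12: 10, 30: 0},  # the empty 30 µs bucket must be ignored
    {5: 950, 7: 45, 19: 5},
    {4: 700, 6: 280, 23: 20},
]
print(worst_case_over_runs(runs))
```

Because a single late wake-up determines the result, the figure becomes more meaningful the more cycles and runs are included.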
Quality assurance
Quality assurance is, by far, the harder part. As mentioned earlier, silent kernel crashes are the hardest thing to deal with, and there is no better way to really get frustrated with kernel development than trying to debug a kernel that silently stops execution at random. An example of such a situation happened some time ago when a PREEMPT_RT kernel was applied to an OMAP4 chip made by OSADL member Texas Instruments, the first multi-core ARM processor of the OSADL QA Farm. The phyCORE-OMAP4460 board was provided by OSADL member Phytec and was placed in rack #7/slot #7. Everything worked well during short measurement periods, and the worst-case latency as a marker of the real-time performance was as good as could be. But when standard continuous monitoring with various load cycles was started, bad things came to us who were waiting and observing the board: The board apparently stopped on average every three to five days, and the only way to bring it back to work was a cold reset. Any attempt to get debug output was doomed to fail. Unfortunately, the JTAG debugger crashed more often than the board, so it was nearly impossible to bring the board to a state where the JTAG debugger survived a board crash. It took nearly five months until, finally, such a condition was met and the JTAG interface could be used to investigate the origin of the crash. This was done using repeated halt and go commands to determine the position of the program counter, which gave the following result:
Repetitions | Program counter |
---|---|
44 | 0xc0491e00 |
50 | 0xc0491e08 |
56 | 0xc0491e10 |
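Repeated halt and go sampling of this kind amounts to building a histogram of program counter values. The sketch below tallies such samples in Python, assuming that the Repetitions column counts how often each address was observed; the helper name `summarize_pc_samples` is hypothetical, and the sample list is rebuilt from the table rather than taken from a debugger log.

```python
from collections import Counter


def summarize_pc_samples(samples: list[int]) -> tuple[Counter, int, int]:
    """Count how often each program counter value was seen.

    Returns the per-address counts, the total number of samples and the
    distance in bytes between the lowest and highest sampled address.
    """
    counts = Counter(samples)
    total = sum(counts.values())
    span = max(counts) - min(counts)
    return counts, total, span


# Rebuild the individual samples from the table above
# (assumption: Repetitions = number of hits per address).
table = {0xC0491E00: 44, 0xC0491E08: 50, 0xC0491E10: 56}
samples = [pc for pc, hits in table.items() for _ in range(hits)]

counts, total, span = summarize_pc_samples(samples)
for pc, hits in sorted(counts.items()):
    print(f"{pc:#010x}  {hits:3d}  {hits / total:.0%}")
print(f"all {total} samples lie within {span} bytes")
```

When every sample falls within a few instructions, as here, the CPU is most likely spinning in a tight loop rather than being halted or executing random code.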
Disassembling the code at these positions revealed:
0xc0491dfc <+120>: ldr r3, [r4]
0xc0491e00 <+124>: cmp r3, #0
0xc0491e04 <+128>: beq 0xc0491da8 <__raw_spin_lock+36>
0xc0491e08 <+132>: ldr r3, [r4, #4]
0xc0491e0c <+136>: cmp r3, #0
0xc0491e10 <+140>: bne 0xc0491dfc <__raw_spin_lock+120>
This code is called from the address space identifier rollover mechanism that works well on uniprocessor ARM CPUs and on mainline Linux but not on real-time multi-core systems, since the cores may be waiting for each other to complete the related interprocessor interrupt action and thus wait forever in a livelock. The window of this race condition is very small, which explains why it took so long and required a continuous high load to trigger. After the problem was fixed in mid-May 2013, the board suddenly became stable as shown in the impressive uptime graph below. The related posting of the report and the applied patch is here.
OK, compared to the repeated and frequent crashes before mid-May, this board works very well now. But aren't there many other chip and board manufacturers of multi-core ARM boards who are waiting for a miracle instead of acting? Shouldn't they also become OSADL members and have their boards fixed? Well, bad things come to those who wait.