Can the versatile Linux kernel provide guaranteed response times for applications?
Linux is a general purpose kernel that can support small embedded devices as well as gargantuan servers... and everything in between! But can this amazingly versatile kernel also provide guaranteed response times for an application on all of these platforms? If your application can tolerate maximum response latencies of 200 microseconds, the answer is a resounding "yes"! (Actually, Linux can do much better than that, but it will require carefully chosen hardware and possibly the assistance of a Linux real-time specialist.)
So why is your Linux application sometimes encountering response latencies much larger than 200us? Being a general purpose kernel means that Linux tries to balance throughput, response times, and scheduling fairness in order to satisfy the "general use case". If one of these aspects must fulfill hard requirements, it is necessary to fine-tune both kernel and application behavior. Let us take a look at the 10 major points that must be considered when developing a Linux system with hard real-time constraints. For each point I also mention a common pitfall for developers new to real-time Linux.
1. Scheduling policies and priorities
If a task needs to fulfill response requirements, that task must be assigned the appropriate scheduling policy and priority. The chrt(1) tool or the function sched_setscheduler(2) can be used for this. For real-time systems, typically SCHED_FIFO is chosen as the policy for the task. Real-time priorities (from 1=low to 98=high) are appropriately chosen depending on the response requirements of this and other real-time tasks on the system. Higher priority tasks will preempt lower priority tasks.
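As a minimal sketch, a task can switch itself to SCHED_FIFO as shown below. The priority value 80 is only an illustrative choice, and changing the scheduling policy requires appropriate privileges (for example CAP_SYS_NICE).

```c
#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct sched_param param = { 0 };

    /* Illustrative priority; choose it based on the response
     * requirements of this and the other real-time tasks. */
    param.sched_priority = 80;

    /* A pid of 0 means "the calling task". */
    if (sched_setscheduler(0, SCHED_FIFO, &param) == -1) {
        perror("sched_setscheduler");
        return 1;
    }

    /* ... real-time work ... */
    return 0;
}
```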
Pitfall: Make sure to disable the RT throttling feature (by writing -1 to /proc/sys/kernel/sched_rt_runtime_us). RT throttling breaks real-time systems by creating dramatic priority inversion scenarios.
For further details be sure to read the scheduling overview sched(7) man page.
2. Isolation
Systems may have more than one real-time task running. If these tasks are scheduled for the same CPU, they could cause excessive response latencies for each other. Pinning the real-time tasks to different exclusive CPU sets can avoid this issue. The taskset(1) tool or the function sched_setaffinity(2) can be used for this.
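A minimal sketch of pinning the calling task to a single CPU with sched_setaffinity(2) follows; CPU 3 is an illustrative choice.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    /* Restrict the calling task to CPU 3 (illustrative choice). */
    CPU_ZERO(&set);
    CPU_SET(3, &set);

    if (sched_setaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_setaffinity");
        return 1;
    }

    /* ... real-time work, now restricted to CPU 3 ... */
    return 0;
}
```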
It is often the case that real-time tasks wait for certain interrupts. An interrupt firing on one CPU while the real-time task it wakes is scheduled on a different CPU is another source of latency. This can be overcome by restricting the hardware interrupt to the same CPU set as the real-time task that depends on it. The CPU affinity mask of a hardware interrupt can be set via the smp_affinity virtual file in /proc/irq/IRQ-NUMBER/.
And finally, it may be desirable to have dedicated CPUs for real-time applications. The isolcpus boot parameter tells the kernel to boot with certain CPUs excluded from the default CPU affinity masks of the system. Using the previously mentioned tools and interfaces, real-time tasks and interrupts can be adjusted to use the isolated CPUs.
Pitfall: Not all hardware supports arbitrary CPU assignments for interrupts. Make sure to check the effective_affinity virtual file to verify the actual setting.
3. Page faults
One of the most devastating situations for a real-time application is the need to assign or "page in" memory. This can be due to the over-commitment behavior of Linux, where allocated or reserved memory is only backed by physical RAM when it is accessed for the first time. Or it can be paging in data from disk (such as text segments) when calling functions for the first time. Whatever the case, it must be avoided in order to meet response requirements.
The first thing a real-time application needs to do is configure glibc to use a single non-shrinkable heap for the application. This guarantees that a pool of readily accessible physical RAM can be made available to the real-time application. This is accomplished using the mallopt(3) function (M_MMAP_MAX=0, M_ARENA_MAX=1, M_TRIM_THRESHOLD=-1).
Next, all allocated and mapped virtual memory must be assigned to physical RAM and locked so that it cannot be reclaimed for other purposes. This is accomplished using the mlockall(2) function (MCL_CURRENT | MCL_FUTURE).
Finally, the amounts of stack and heap needed during the lifetime of the real-time application must be allocated and written to in order to trigger heap and stack assignments to physical RAM. This is known as pre-faulting and is usually accomplished by memsetting a large buffer within a stack frame and allocating, memsetting, and freeing a large heap buffer.
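Putting these three steps together, a minimal sketch might look as follows. The buffer sizes are illustrative placeholders and must be sized for the application's actual worst-case stack and heap usage, and mlockall(2) requires sufficient privileges or memlock limits.

```c
#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/* Illustrative sizes; size these for the application's real needs. */
#define PREFAULT_STACK_SIZE (64 * 1024)
#define PREFAULT_HEAP_SIZE  (16 * 1024 * 1024)

static void prefault_stack(void)
{
    unsigned char buf[PREFAULT_STACK_SIZE];

    /* Touch the stack region so it is faulted in and, thanks to
     * mlockall(), stays resident. */
    memset(buf, 0, sizeof(buf));
}

static void prefault_heap(void)
{
    void *buf = malloc(PREFAULT_HEAP_SIZE);

    if (!buf)
        exit(1);

    /* Touch the heap pages; the memory remains in the non-shrinkable
     * heap pool after free() because of M_TRIM_THRESHOLD=-1. */
    memset(buf, 0, PREFAULT_HEAP_SIZE);
    free(buf);
}

int main(void)
{
    /* 1. Configure glibc for a single, non-shrinkable heap. */
    mallopt(M_MMAP_MAX, 0);
    mallopt(M_ARENA_MAX, 1);
    mallopt(M_TRIM_THRESHOLD, -1);

    /* 2. Lock all current and future mappings into physical RAM. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1) {
        perror("mlockall");
        return 1;
    }

    /* 3. Pre-fault the stack and heap. */
    prefault_stack();
    prefault_heap();

    /* ... real-time work ... */
    return 0;
}
```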
Pitfall: Keep in mind that each thread will have its own stack.
4. Synchronization
Often real-time applications need to access shared resources (such as data within shared memory) that will require synchronization. For this the pthread_mutex must be used. This is the only lock object type with ownership and priority boosting semantics, both of which are critical properties for avoiding priority inversion.
Unfortunately, priority boosting is not enabled by default. This must be done using the pthread_mutexattr_setprotocol(3) function (PTHREAD_PRIO_INHERIT). With this enabled, a lower priority owner will be boosted to the priority of a higher priority waiter in order to allow the contended mutex to be handed over as quickly as possible.
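A minimal sketch of initializing such a priority-inheritance mutex:

```c
#include <pthread.h>

static pthread_mutex_t lock;

static int init_pi_mutex(void)
{
    pthread_mutexattr_t attr;
    int ret;

    ret = pthread_mutexattr_init(&attr);
    if (ret)
        return ret;

    /* Enable priority inheritance so a low priority owner is boosted
     * to the priority of the highest priority waiter. */
    ret = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (!ret)
        ret = pthread_mutex_init(&lock, &attr);

    pthread_mutexattr_destroy(&attr);
    return ret;
}
```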
Pitfall: Be wary of using libraries that might use their own mutexes or other synchronization mechanisms.
5. Notifying with synchronization
A common pattern in real-time applications is to wait for an event and then upon waking access a shared resource. For this pattern there exists the pthread_cond condition variable. This is a wait object associated with a mutex, allowing the kernel to perform priority boosting if necessary when waking high priority tasks.
Pitfall: It is important that a notifying task performs the notification before releasing the associated mutex. This creates the lock contention needed for priority boosting to take effect, avoiding priority inversion scenarios.
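A minimal sketch of this pattern, assuming the mutex was initialized with PTHREAD_PRIO_INHERIT as shown earlier and using an illustrative data_ready predicate:

```c
#include <pthread.h>

static pthread_mutex_t lock;    /* initialized with PTHREAD_PRIO_INHERIT */
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int data_ready;          /* illustrative shared predicate */

/* Called by the notifying (producer) task. */
static void notify_data_ready(void)
{
    pthread_mutex_lock(&lock);
    data_ready = 1;

    /* Signal while still holding the mutex so the woken waiter
     * immediately contends on the lock and priority boosting can
     * take effect. */
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

/* Called by the waiting (real-time consumer) task. */
static void wait_for_data(void)
{
    pthread_mutex_lock(&lock);
    while (!data_ready)
        pthread_cond_wait(&cond, &lock);
    data_ready = 0;

    /* ... access the shared resource ... */
    pthread_mutex_unlock(&lock);
}
```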
6. Cyclic tasks
For cyclic tasks it is important to use dedicated threads that sleep until an absolute time to perform their work. The function clock_nanosleep(2) is best suited for this as it is the only API to guarantee that the real-time task will be woken from the high resolution timer's hardware interrupt context.
Pitfall: Make sure the task sleeps with CLOCK_MONOTONIC and TIMER_ABSTIME. Otherwise the cycle may lead to variance or drift over time.
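A minimal sketch of such a cyclic loop, using an illustrative 1 millisecond period and advancing an absolute wakeup time so that no drift accumulates:

```c
#include <time.h>

#define NSEC_PER_SEC 1000000000L
#define PERIOD_NS    1000000L   /* illustrative 1 ms cycle */

static void cyclic_task(void)
{
    struct timespec next;

    clock_gettime(CLOCK_MONOTONIC, &next);

    for (;;) {
        /* Advance the absolute wakeup time by one period. */
        next.tv_nsec += PERIOD_NS;
        while (next.tv_nsec >= NSEC_PER_SEC) {
            next.tv_nsec -= NSEC_PER_SEC;
            next.tv_sec++;
        }

        /* Sleep until the absolute time; wakeups are not relative
         * to "now", so the cycle does not drift. */
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);

        /* ... do the cyclic work ... */
    }
}
```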
7. Kernel configuration
The Linux kernel supports various preemption models. In order to achieve sub-millisecond maximum response latencies, it is necessary to use the "Fully Preemptible Kernel (Real-Time)" model (CONFIG_PREEMPT_RT). This provides key requirements such as fine-grained interruptibility, determinism, and prioritizing of interrupts. If this model is not available it means that the PREEMPT_RT patch series needs to be applied to the kernel.
Pitfall: Take care to properly configure other features such as CPU frequency scaling that can cause variability in performance of the hardware.
8. Testing
Once the real-time system is ready, it is important to identify the maximum response latencies at various priority levels. This provides a necessary understanding of the types of latencies that different components of the system could experience. The cyclictest(8) tool is excellent at providing this information while also being relatively non-intrusive so as not to hinder the real-time application.
Typically while identifying worst-case latencies, various stress tests should be run in the background. For a real-time system, non-real-time stress testing must have minimal impact on the real-time application. Using tools such as hackbench(8) and stress-ng(1) can be helpful here. It is also important to make sure all the hardware components are getting stressed. The goal of this type of testing is to hit worst-case non-preemptible paths within the kernel.
Pitfall: Be sure to also test latencies when the system is completely idle. Depending on how the kernel is configured, this may actually be the worst-case.
9. Verification
Even if a real-time system appears to be running well and all the timing requirements are being met, it is also important to verify that the real-time application is really behaving as it should. Is priority boosting happening upon lock contention? Is clock_nanosleep() the only API used for cyclic tasks? Are page faults occurring? What is the worst-case response latency that the real-time application is really experiencing?
All these questions can be answered using various kernel features such as ftrace (example tools: trace-cmd(1), kernelshark(1)), perf(1), and eBPF (example tool: bpftrace(8)). These tools allow for live tracing, profiling, and measuring of pretty much anything that happens on the system. Take the time to learn how to use these tools effectively in order to verify that the system is acting the way the developer intended.
Pitfall: ftrace, perf, and eBPF are quite efficient. However, as with any measuring instrument, it is important that it is understood how the act of measuring is affecting the system.
10. External disturbances
Although usually unrelated to Linux, it is important to understand the "features" and limitations of the hardware. Some examples include System Management Interrupts (SMIs), memory bus bandwidth/contention, CPU topology, and cache sharing of CPUs. By understanding how the hardware can negatively affect a real-time system, it is (hopefully) possible to design and implement the software to avoid or work around those issues.
Pitfall: Often hardware is chosen without considering real-time requirements, resulting in great efforts to implement software workarounds. Involving software and hardware engineers when choosing hardware for a real-time system can prove extremely helpful in the long term.
Final remarks...
Although this list is not exhaustive, these 10 major points provide enough information to get under that worst-case 200us response latency I mentioned at the beginning. And really, getting even lower maximum latencies depends less on fine-tuning Linux and more on the chosen hardware and the design, implementation, and build of the real-time application.