@ Documentation/RCU/checklist.txt:213 @ over a rather long period of time, but improvements are always welcome! the rest of the system. 7. As of v4.20, a given kernel implements only one RCU flavor, - which is RCU-sched for PREEMPT=n and RCU-preempt for PREEMPT=y. - If the updater uses call_rcu() or synchronize_rcu(), + which is RCU-sched for PREEMPTION=n and RCU-preempt for + PREEMPTION=y. If the updater uses call_rcu() or synchronize_rcu(), then the corresponding readers my use rcu_read_lock() and rcu_read_unlock(), rcu_read_lock_bh() and rcu_read_unlock_bh(), or any pair of primitives that disables and re-enables preemption, @ Documentation/RCU/stallwarn.txt:23 @ o A CPU looping with preemption disabled. o A CPU looping with bottom halves disabled. -o For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the kernel +o For !CONFIG_PREEMPTION kernels, a CPU looping anywhere in the kernel without invoking schedule(). If the looping in the kernel is really expected and desirable behavior, you might need to add some calls to cond_resched(). @ Documentation/RCU/stallwarn.txt:42 @ o Anything that prevents RCU's grace-period kthreads from running. result in the "rcu_.*kthread starved for" console-log message, which will include additional debugging information. -o A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might +o A CPU-bound real-time task in a CONFIG_PREEMPTION kernel, which might happen to preempt a low-priority task in the middle of an RCU read-side critical section. This is especially damaging if that low-priority task is not permitted to run on any other CPU, @ Documentation/admin-guide/sysctl/vm.rst:131 @ allowed to examine the unevictable lru (mlocked pages) for pages to compact. This should be used on systems where stalls for minor page faults are an acceptable trade for large contiguous free memory. Set to 0 to prevent compaction from moving pages that are unevictable. Default value is 1. +On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due +to compaction, which would block the task from becomming active until the fault +is resolved. dirty_background_bytes @ Documentation/locking/index.rst:16 @ locking mutex-design rt-mutex-design rt-mutex + seqlock spinlocks ww-mutex-design @ Documentation/locking/seqlock.rst:4 @ +====================================== +Sequence counters and sequential locks +====================================== + +Introduction +============ + +Sequence counters are a reader-writer consistency mechanism with +lockless readers (read-only retry loops), and no writer starvation. They +are used for data that's rarely written to (e.g. system time), where the +reader wants a consistent set of information and is willing to retry if +that information changes. + +A data set is consistent when the sequence count at the beginning of the +read side critical section is even and the same sequence count value is +read again at the end of the critical section. The data in the set must +be copied out inside the read side critical section. If the sequence +count has changed between the start and the end of the critical section, +the reader must retry. + +Writers increment the sequence count at the start and the end of their +critical section. After starting the critical section the sequence count +is odd and indicates to the readers that an update is in progress. At +the end of the write side critical section the sequence count becomes +even again which lets readers make progress. + +A sequence counter write side critical section must never be preempted +or interrupted by read side sections. Otherwise the reader will spin for +the entire scheduler tick due to the odd sequence count value and the +interrupted writer. If that reader belongs to a real-time scheduling +class, it can spin forever and the kernel will livelock. + +This mechanism cannot be used if the protected data contains pointers, +as the writer can invalidate a pointer that the reader is following. + +.. _seqcount_t: + +Sequence counters (:c:type:`seqcount_t`) +======================================== + +This is the the raw counting mechanism, which does not protect against +multiple writers. Write side critical sections must thus be serialized +by an external lock. + +If the write serialization primitive is not implicitly disabling +preemption, preemption must be explicitly disabled before entering the +write side section. If the read section can be invoked from hardirq or +softirq contexts, interrupts or bottom halves must also be respectively +disabled before entering the write section. + +If the write serialization mechanism is one of the common kernel locking +primitives, use :ref:`sequence counters with associated locks +<seqcount_locktype_t>` instead. If it's desired to automatically handle +the sequence counter writer serialization and non-preemptibility +requirements, use a :ref:`sequential lock <seqlock_t>`. + +Initialization: + +.. code-block:: c + + /* dynamic */ + seqcount_t foo_seqcount; + seqcount_init(&foo_seqcount); + + /* static */ + static seqcount_t foo_seqcount = SEQCNT_ZERO(foo_seqcount); + + /* C99 struct init */ + struct { + .seq = SEQCNT_ZERO(foo.seq), + } foo; + +Write path: + +.. _seqcount_write_ops: +.. code-block:: c + + /* Serialized context with disabled preemption */ + + write_seqcount_begin(&foo_seqcount); + + /* ... [[write-side critical section]] ... */ + + write_seqcount_end(&foo_seqcount); + +Read path: + +.. _seqcount_read_ops: +.. code-block:: c + + do { + seq = read_seqcount_begin(&foo_seqcount); + + /* ... [[read-side critical section]] ... */ + + } while (read_seqcount_retry(&foo_seqcount, seq)); + +.. _seqcount_locktype_t: + +Sequence counters with associated locks (:c:type:`seqcount_LOCKTYPE_t`) +----------------------------------------------------------------------- + +As :ref:`earlier discussed <seqcount_t>`, seqcount write side critical +sections must be serialized and non-preemptible. This variant of +sequence counters associate the lock used for writer serialization at +the seqcount initialization time. This enables lockdep to validate that +the write side critical section is properly serialized. + +This lock association is a NOOP if lockdep is disabled and has neither +storage nor runtime overhead. If lockdep is enabled, the lock pointer is +stored in struct seqcount and lockdep's "lock is held" assertions are +injected at the beginning of the write side critical section to validate +that it is properly protected. + +For lock types which do not implicitly disable preemption, preemption +protection is enforced in the write side function. + +The following seqcounts with associated locks are defined: + + - :c:type:`seqcount_spinlock_t` + - :c:type:`seqcount_raw_spinlock_t` + - :c:type:`seqcount_rwlock_t` + - :c:type:`seqcount_mutex_t` + - :c:type:`seqcount_ww_mutex_t` + +The plain seqcount read and write APIs branch out to the specific +seqcount_LOCKTYPE_t implementation at compile-time. This avoids kernel +API explosion per each new seqcount LOCKTYPE. + +Initialization (replace "LOCKTYPE" with one of the supported locks): + +.. code-block:: c + + /* dynamic */ + seqcount_LOCKTYPE_t foo_seqcount; + seqcount_LOCKTYPE_init(&foo_seqcount, &lock); + + /* static */ + static seqcount_LOCKTYPE_t foo_seqcount = + SEQCNT_LOCKTYPE_ZERO(foo_seqcount, &lock); + + /* C99 struct init */ + struct { + .seq = SEQCNT_LOCKTYPE_ZERO(foo.seq, &lock), + } foo; + +Write path: same as in :ref:`plain seqcount_t <seqcount_write_ops>`, +while running from a context with the associated LOCKTYPE lock acquired. + +Read path: same as in :ref:`plain seqcount_t <seqcount_read_ops>`. + +.. _seqlock_t: + +Sequential locks (:c:type:`seqlock_t`) +====================================== + +This contains the :ref:`sequence counting mechanism <seqcount_t>` +earlier discussed, plus an embedded spinlock for writer serialization +and non-preemptibility. + +If the read side section can be invoked from hardirq or softirq context, +use the write side function variants which disable interrupts or bottom +halves respectively. + +Initialization: + +.. code-block:: c + + /* dynamic */ + seqlock_t foo_seqlock; + seqlock_init(&foo_seqlock); + + /* static */ + static DEFINE_SEQLOCK(foo_seqlock); + + /* C99 struct init */ + struct { + .seql = __SEQLOCK_UNLOCKED(foo.seql) + } foo; + +Write path: + +.. code-block:: c + + write_seqlock(&foo_seqlock); + + /* ... [[write-side critical section]] ... */ + + write_sequnlock(&foo_seqlock); + +Read path, three categories: + +1. Normal Sequence readers which never block a writer but they must + retry if a writer is in progress by detecting change in the sequence + number. Writers do not wait for a sequence reader. + + .. code-block:: c + + do { + seq = read_seqbegin(&foo_seqlock); + + /* ... [[read-side critical section]] ... */ + + } while (read_seqretry(&foo_seqlock, seq)); + +2. Locking readers which will wait if a writer or another locking reader + is in progress. A locking reader in progress will also block a writer + from entering its critical section. This read lock is + exclusive. Unlike rwlock_t, only one locking reader can acquire it. + + .. code-block:: c + + read_seqlock_excl(&foo_seqlock); + + /* ... [[read-side critical section]] ... */ + + read_sequnlock_excl(&foo_seqlock); + +3. Conditional lockless reader (as in 1), or locking reader (as in 2), + according to a passed marker. This is used to avoid lockless readers + starvation (too much retry loops) in case of a sharp spike in write + activity. First, a lockless read is tried (even marker passed). If + that trial fails (odd sequence counter is returned, which is used as + the next iteration marker), the lockless read is transformed to a + full locking read and no retry loop is necessary. + + .. code-block:: c + + /* marker; even initialization */ + int seq = 0; + do { + read_seqbegin_or_lock(&foo_seqlock, &seq); + + /* ... [[read-side critical section]] ... */ + + } while (need_seqretry(&foo_seqlock, seq)); + done_seqretry(&foo_seqlock, seq); + +API documentation +================= + +.. kernel-doc:: include/linux/seqlock.h @ Documentation/printk-ringbuffer.txt:4 @ +struct printk_ringbuffer +------------------------ +John Ogness <john.ogness@linutronix.de> + +Overview +~~~~~~~~ +As the name suggests, this ring buffer was implemented specifically to serve +the needs of the printk() infrastructure. The ring buffer itself is not +specific to printk and could be used for other purposes. _However_, the +requirements and semantics of printk are rather unique. If you intend to use +this ring buffer for anything other than printk, you need to be very clear on +its features, behavior, and pitfalls. + +Features +^^^^^^^^ +The printk ring buffer has the following features: + +- single global buffer +- resides in initialized data section (available at early boot) +- lockless readers +- supports multiple writers +- supports multiple non-consuming readers +- safe from any context (including NMI) +- groups bytes into variable length blocks (referenced by entries) +- entries tagged with sequence numbers + +Behavior +^^^^^^^^ +Since the printk ring buffer readers are lockless, there exists no +synchronization between readers and writers. Basically writers are the tasks +in control and may overwrite any and all committed data at any time and from +any context. For this reason readers can miss entries if they are overwritten +before the reader was able to access the data. The reader API implementation +is such that reader access to entries is atomic, so there is no risk of +readers having to deal with partial or corrupt data. Also, entries are +tagged with sequence numbers so readers can recognize if entries were missed. + +Writing to the ring buffer consists of 2 steps. First a writer must reserve +an entry of desired size. After this step the writer has exclusive access +to the memory region. Once the data has been written to memory, it needs to +be committed to the ring buffer. After this step the entry has been inserted +into the ring buffer and assigned an appropriate sequence number. + +Once committed, a writer must no longer access the data directly. This is +because the data may have been overwritten and no longer exists. If a +writer must access the data, it should either keep a private copy before +committing the entry or use the reader API to gain access to the data. + +Because of how the data backend is implemented, entries that have been +reserved but not yet committed act as barriers, preventing future writers +from filling the ring buffer beyond the location of the reserved but not +yet committed entry region. For this reason it is *important* that writers +perform both reserve and commit as quickly as possible. Also, be aware that +preemption and local interrupts are disabled and writing to the ring buffer +is processor-reentrant locked during the reserve/commit window. Writers in +NMI contexts can still preempt any other writers, but as long as these +writers do not write a large amount of data with respect to the ring buffer +size, this should not become an issue. + +API +~~~ + +Declaration +^^^^^^^^^^^ +The printk ring buffer can be instantiated as a static structure: + + /* declare a static struct printk_ringbuffer */ + #define DECLARE_STATIC_PRINTKRB(name, szbits, cpulockptr) + +The value of szbits specifies the size of the ring buffer in bits. The +cpulockptr field is a pointer to a prb_cpulock struct that is used to +perform processor-reentrant spin locking for the writers. It is specified +externally because it may be used for multiple ring buffers (or other +code) to synchronize writers without risk of deadlock. + +Here is an example of a declaration of a printk ring buffer specifying a +32KB (2^15) ring buffer: + +.... +DECLARE_STATIC_PRINTKRB_CPULOCK(rb_cpulock); +DECLARE_STATIC_PRINTKRB(rb, 15, &rb_cpulock); +.... + +If writers will be using multiple ring buffers and the ordering of that usage +is not clear, the same prb_cpulock should be used for both ring buffers. + +Writer API +^^^^^^^^^^ +The writer API consists of 2 functions. The first is to reserve an entry in +the ring buffer, the second is to commit that data to the ring buffer. The +reserved entry information is stored within a provided `struct prb_handle`. + + /* reserve an entry */ + char *prb_reserve(struct prb_handle *h, struct printk_ringbuffer *rb, + unsigned int size); + + /* commit a reserved entry to the ring buffer */ + void prb_commit(struct prb_handle *h); + +Here is an example of a function to write data to a ring buffer: + +.... +int write_data(struct printk_ringbuffer *rb, char *data, int size) +{ + struct prb_handle h; + char *buf; + + buf = prb_reserve(&h, rb, size); + if (!buf) + return -1; + memcpy(buf, data, size); + prb_commit(&h); + + return 0; +} +.... + +Pitfalls +++++++++ +Be aware that prb_reserve() can fail. A retry might be successful, but it +depends entirely on whether or not the next part of the ring buffer to +overwrite belongs to reserved but not yet committed entries of other writers. +Writers can use the prb_inc_lost() function to allow readers to notice that a +message was lost. + +Reader API +^^^^^^^^^^ +The reader API utilizes a `struct prb_iterator` to track the reader's +position in the ring buffer. + + /* declare a pre-initialized static iterator for a ring buffer */ + #define DECLARE_STATIC_PRINTKRB_ITER(name, rbaddr) + + /* initialize iterator for a ring buffer (if static macro NOT used) */ + void prb_iter_init(struct prb_iterator *iter, + struct printk_ringbuffer *rb, u64 *seq); + + /* make a deep copy of an iterator */ + void prb_iter_copy(struct prb_iterator *dest, + struct prb_iterator *src); + + /* non-blocking, advance to next entry (and read the data) */ + int prb_iter_next(struct prb_iterator *iter, char *buf, + int size, u64 *seq); + + /* blocking, advance to next entry (and read the data) */ + int prb_iter_wait_next(struct prb_iterator *iter, char *buf, + int size, u64 *seq); + + /* position iterator at the entry seq */ + int prb_iter_seek(struct prb_iterator *iter, u64 seq); + + /* read data at current position */ + int prb_iter_data(struct prb_iterator *iter, char *buf, + int size, u64 *seq); + +Typically prb_iter_data() is not needed because the data can be retrieved +directly with prb_iter_next(). + +Here is an example of a non-blocking function that will read all the data in +a ring buffer: + +.... +void read_all_data(struct printk_ringbuffer *rb, char *buf, int size) +{ + struct prb_iterator iter; + u64 prev_seq = 0; + u64 seq; + int ret; + + prb_iter_init(&iter, rb, NULL); + + for (;;) { + ret = prb_iter_next(&iter, buf, size, &seq); + if (ret > 0) { + if (seq != ++prev_seq) { + /* "seq - prev_seq" entries missed */ + prev_seq = seq; + } + /* process buf here */ + } else if (ret == 0) { + /* hit the end, done */ + break; + } else if (ret < 0) { + /* + * iterator is invalid, a writer overtook us, reset the + * iterator and keep going, entries were missed + */ + prb_iter_init(&iter, rb, NULL); + } + } +} +.... + +Pitfalls +++++++++ +The reader's iterator can become invalid at any time because the reader was +overtaken by a writer. Typically the reader should reset the iterator back +to the current oldest entry (which will be newer than the entry the reader +was at) and continue, noting the number of entries that were missed. + +Utility API +^^^^^^^^^^^ +Several functions are available as convenience for external code. + + /* query the size of the data buffer */ + int prb_buffer_size(struct printk_ringbuffer *rb); + + /* skip a seq number to signify a lost record */ + void prb_inc_lost(struct printk_ringbuffer *rb); + + /* processor-reentrant spin lock */ + void prb_lock(struct prb_cpulock *cpu_lock, unsigned int *cpu_store); + + /* processor-reentrant spin unlock */ + void prb_lock(struct prb_cpulock *cpu_lock, unsigned int *cpu_store); + +Pitfalls +++++++++ +Although the value returned by prb_buffer_size() does represent an absolute +upper bound, the amount of data that can be stored within the ring buffer +is actually less because of the additional storage space of a header for each +entry. + +The prb_lock() and prb_unlock() functions can be used to synchronize between +ring buffer writers and other external activities. The function of a +processor-reentrant spin lock is to disable preemption and local interrupts +and synchronize against other processors. It does *not* protect against +multiple contexts of a single processor, i.e NMI. + +Implementation +~~~~~~~~~~~~~~ +This section describes several of the implementation concepts and details to +help developers better understand the code. + +Entries +^^^^^^^ +All ring buffer data is stored within a single static byte array. The reason +for this is to ensure that any pointers to the data (past and present) will +always point to valid memory. This is important because the lockless readers +may be preempted for long periods of time and when they resume may be working +with expired pointers. + +Entries are identified by start index and size. (The start index plus size +is the start index of the next entry.) The start index is not simply an +offset into the byte array, but rather a logical position (lpos) that maps +directly to byte array offsets. + +For example, for a byte array of 1000, an entry may have have a start index +of 100. Another entry may have a start index of 1100. And yet another 2100. +All of these entry are pointing to the same memory region, but only the most +recent entry is valid. The other entries are pointing to valid memory, but +represent entries that have been overwritten. + +Note that due to overflowing, the most recent entry is not necessarily the one +with the highest lpos value. Indeed, the printk ring buffer initializes its +data such that an overflow happens relatively quickly in order to validate the +handling of this situation. The implementation assumes that an lpos (unsigned +long) will never completely wrap while a reader is preempted. If this were to +become an issue, the seq number (which never wraps) could be used to increase +the robustness of handling this situation. + +Buffer Wrapping +^^^^^^^^^^^^^^^ +If an entry starts near the end of the byte array but would extend beyond it, +a special terminating entry (size = -1) is inserted into the byte array and +the real entry is placed at the beginning of the byte array. This can waste +space at the end of the byte array, but simplifies the implementation by +allowing writers to always work with contiguous buffers. + +Note that the size field is the first 4 bytes of the entry header. Also note +that calc_next() always ensures that there are at least 4 bytes left at the +end of the byte array to allow room for a terminating entry. + +Ring Buffer Pointers +^^^^^^^^^^^^^^^^^^^^ +Three pointers (lpos values) are used to manage the ring buffer: + + - _tail_: points to the oldest entry + - _head_: points to where the next new committed entry will be + - _reserve_: points to where the next new reserved entry will be + +These pointers always maintain a logical ordering: + + tail <= head <= reserve + +The reserve pointer moves forward when a writer reserves a new entry. The +head pointer moves forward when a writer commits a new entry. + +The reserve pointer cannot overwrite the tail pointer in a wrap situation. In +such a situation, the tail pointer must be "pushed forward", thus +invalidating that oldest entry. Readers identify if they are accessing a +valid entry by ensuring their entry pointer is `>= tail && < head`. + +If the tail pointer is equal to the head pointer, it cannot be pushed and any +reserve operation will fail. The only resolution is for writers to commit +their reserved entries. + +Processor-Reentrant Locking +^^^^^^^^^^^^^^^^^^^^^^^^^^^ +The purpose of the processor-reentrant locking is to limit the interruption +scenarios of writers to 2 contexts. This allows for a simplified +implementation where: + +- The reserve/commit window only exists on 1 processor at a time. A reserve + can never fail due to uncommitted entries of other processors. + +- When committing entries, it is trivial to handle the situation when + subsequent entries have already been committed, i.e. managing the head + pointer. + +Performance +~~~~~~~~~~~ +Some basic tests were performed on a quad Intel(R) Xeon(R) CPU E5-2697 v4 at +2.30GHz (36 cores / 72 threads). All tests involved writing a total of +32,000,000 records at an average of 33 bytes each. Each writer was pinned to +its own CPU and would write as fast as it could until a total of 32,000,000 +records were written. All tests involved 2 readers that were both pinned +together to another CPU. Each reader would read as fast as it could and track +how many of the 32,000,000 records it could read. All tests used a ring buffer +of 16KB in size, which holds around 350 records (header + data for each +entry). + +The only difference between the tests is the number of writers (and thus also +the number of records per writer). As more writers are added, the time to +write a record increases. This is because data pointers, modified via cmpxchg, +and global data access in general become more contended. + +1 writer +^^^^^^^^ + runtime: 0m 18s + reader1: 16219900/32000000 (50%) records + reader2: 16141582/32000000 (50%) records + +2 writers +^^^^^^^^^ + runtime: 0m 32s + reader1: 16327957/32000000 (51%) records + reader2: 16313988/32000000 (50%) records + +4 writers +^^^^^^^^^ + runtime: 0m 42s + reader1: 16421642/32000000 (51%) records + reader2: 16417224/32000000 (51%) records + +8 writers +^^^^^^^^^ + runtime: 0m 43s + reader1: 16418300/32000000 (51%) records + reader2: 16432222/32000000 (51%) records + +16 writers +^^^^^^^^^^ + runtime: 0m 54s + reader1: 16539189/32000000 (51%) records + reader2: 16542711/32000000 (51%) records + +32 writers +^^^^^^^^^^ + runtime: 1m 13s + reader1: 16731808/32000000 (52%) records + reader2: 16735119/32000000 (52%) records + +Comments +^^^^^^^^ +It is particularly interesting to compare/contrast the 1-writer and 32-writer +tests. Despite the writing of the 32,000,000 records taking over 4 times +longer, the readers (which perform no cmpxchg) were still unable to keep up. +This shows that the memory contention between the increasing number of CPUs +also has a dramatic effect on readers. + +It should also be noted that in all cases each reader was able to read >=50% +of the records. This means that a single reader would have been able to keep +up with the writer(s) in all cases, becoming slightly easier as more writers +are added. This was the purpose of pinning 2 readers to 1 CPU: to observe how +maximum reader performance changes. @ MAINTAINERS:9824 @ F: arch/*/include/asm/spinlock*.h F: include/linux/rwlock*.h F: include/linux/mutex*.h F: include/linux/rwsem*.h -F: include/linux/seqlock.h +F: include/linux/seqlock*.h F: lib/locking*.[ch] F: kernel/locking/ X: kernel/locking/locktorture.c @ arch/Kconfig:34 @ config OPROFILE tristate "OProfile system profiling" depends on PROFILING depends on HAVE_OPROFILE + depends on !PREEMPT_RT select RING_BUFFER select RING_BUFFER_ALLOW_SWAP help @ arch/alpha/include/asm/spinlock_types.h:5 @ #ifndef _ALPHA_SPINLOCK_TYPES_H #define _ALPHA_SPINLOCK_TYPES_H -#ifndef __LINUX_SPINLOCK_TYPES_H -# error "please don't include this file directly" -#endif - typedef struct { volatile unsigned int lock; } arch_spinlock_t; @ arch/arm/Kconfig:35 @ config ARM select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX select ARCH_OPTIONAL_KERNEL_RWX_DEFAULT if CPU_V7 select ARCH_SUPPORTS_ATOMIC_RMW + select ARCH_SUPPORTS_RT select ARCH_USE_BUILTIN_BSWAP select ARCH_USE_CMPXCHG_LOCKREF select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU @ arch/arm/Kconfig:68 @ config ARM select HARDIRQS_SW_RESEND select HAVE_ARCH_AUDITSYSCALL if AEABI && !OABI_COMPAT select HAVE_ARCH_BITREVERSE if (CPU_32v7M || CPU_32v7) && !CPU_32v6 - select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU + select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU && !PREEMPT_RT select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU select HAVE_ARCH_MMAP_RND_BITS if MMU select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT @ arch/arm/Kconfig:107 @ config ARM select HAVE_PERF_EVENTS select HAVE_PERF_REGS select HAVE_PERF_USER_STACK_DUMP + select HAVE_PREEMPT_LAZY select MMU_GATHER_RCU_TABLE_FREE if SMP && ARM_LPAE select HAVE_REGS_AND_STACK_ACCESS_API select HAVE_RSEQ @ arch/arm/include/asm/irq.h:26 @ #endif #ifndef __ASSEMBLY__ +#include <linux/cpumask.h> + struct irqaction; struct pt_regs; @ arch/arm/include/asm/spinlock_types.h:5 @ #ifndef __ASM_SPINLOCK_TYPES_H #define __ASM_SPINLOCK_TYPES_H -#ifndef __LINUX_SPINLOCK_TYPES_H -# error "please don't include this file directly" -#endif - #define TICKET_SHIFT 16 typedef struct { @ arch/arm/include/asm/switch_to.h:7 @ #include <linux/thread_info.h> +#if defined CONFIG_PREEMPT_RT && defined CONFIG_HIGHMEM +void switch_kmaps(struct task_struct *prev_p, struct task_struct *next_p); +#else +static inline void +switch_kmaps(struct task_struct *prev_p, struct task_struct *next_p) { } +#endif + /* * For v7 SMP cores running a preemptible kernel we may be pre-empted * during a TLB maintenance operation, so execute an inner-shareable dsb @ arch/arm/include/asm/switch_to.h:36 @ extern struct task_struct *__switch_to(struct task_struct *, struct thread_info #define switch_to(prev,next,last) \ do { \ __complete_pending_tlbi(); \ + switch_kmaps(prev, next); \ last = __switch_to(prev,task_thread_info(prev), task_thread_info(next)); \ } while (0) @ arch/arm/include/asm/thread_info.h:49 @ struct cpu_context_save { struct thread_info { unsigned long flags; /* low level flags */ int preempt_count; /* 0 => preemptable, <0 => bug */ + int preempt_lazy_count; /* 0 => preemptable, <0 => bug */ mm_segment_t addr_limit; /* address limit */ struct task_struct *task; /* main task structure */ __u32 cpu; /* cpu */ @ arch/arm/include/asm/thread_info.h:143 @ extern int vfp_restore_user_hwstate(struct user_vfp *, #define TIF_SYSCALL_TRACE 4 /* syscall trace active */ #define TIF_SYSCALL_AUDIT 5 /* syscall auditing active */ #define TIF_SYSCALL_TRACEPOINT 6 /* syscall tracepoint instrumentation */ -#define TIF_SECCOMP 7 /* seccomp syscall filtering active */ +#define TIF_SECCOMP 8 /* seccomp syscall filtering active */ +#define TIF_NEED_RESCHED_LAZY 7 #define TIF_NOHZ 12 /* in adaptive nohz mode */ #define TIF_USING_IWMMXT 17 @ arch/arm/include/asm/thread_info.h:154 @ extern int vfp_restore_user_hwstate(struct user_vfp *, #define _TIF_SIGPENDING (1 << TIF_SIGPENDING) #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME) +#define _TIF_NEED_RESCHED_LAZY (1 << TIF_NEED_RESCHED_LAZY) #define _TIF_UPROBE (1 << TIF_UPROBE) #define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE) #define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT) @ arch/arm/include/asm/thread_info.h:170 @ extern int vfp_restore_user_hwstate(struct user_vfp *, * Change these and you break ASM code in entry-common.S */ #define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \ - _TIF_NOTIFY_RESUME | _TIF_UPROBE) + _TIF_NOTIFY_RESUME | _TIF_UPROBE | \ + _TIF_NEED_RESCHED_LAZY) #endif /* __KERNEL__ */ #endif /* __ASM_ARM_THREAD_INFO_H */ @ arch/arm/kernel/asm-offsets.c:56 @ int main(void) BLANK(); DEFINE(TI_FLAGS, offsetof(struct thread_info, flags)); DEFINE(TI_PREEMPT, offsetof(struct thread_info, preempt_count)); + DEFINE(TI_PREEMPT_LAZY, offsetof(struct thread_info, preempt_lazy_count)); DEFINE(TI_ADDR_LIMIT, offsetof(struct thread_info, addr_limit)); DEFINE(TI_TASK, offsetof(struct thread_info, task)); DEFINE(TI_CPU, offsetof(struct thread_info, cpu)); @ arch/arm/kernel/entry-armv.S:209 @ ENDPROC(__dabt_svc) #ifdef CONFIG_PREEMPTION ldr r8, [tsk, #TI_PREEMPT] @ get preempt count - ldr r0, [tsk, #TI_FLAGS] @ get flags teq r8, #0 @ if preempt count != 0 + bne 1f @ return from exeption + ldr r0, [tsk, #TI_FLAGS] @ get flags + tst r0, #_TIF_NEED_RESCHED @ if NEED_RESCHED is set + blne svc_preempt @ preempt! + + ldr r8, [tsk, #TI_PREEMPT_LAZY] @ get preempt lazy count + teq r8, #0 @ if preempt lazy count != 0 movne r0, #0 @ force flags to 0 - tst r0, #_TIF_NEED_RESCHED + tst r0, #_TIF_NEED_RESCHED_LAZY blne svc_preempt +1: #endif svc_exit r5, irq = 1 @ return from exception @ arch/arm/kernel/entry-armv.S:235 @ ENDPROC(__irq_svc) 1: bl preempt_schedule_irq @ irq en/disable is done inside ldr r0, [tsk, #TI_FLAGS] @ get new tasks TI_FLAGS tst r0, #_TIF_NEED_RESCHED + bne 1b + tst r0, #_TIF_NEED_RESCHED_LAZY reteq r8 @ go again - b 1b + ldr r0, [tsk, #TI_PREEMPT_LAZY] @ get preempt lazy count + teq r0, #0 @ if preempt lazy count != 0 + beq 1b + ret r8 @ go again + #endif __und_fault: @ arch/arm/kernel/entry-common.S:56 @ saved_pc .req lr cmp r2, #TASK_SIZE blne addr_limit_check_failed ldr r1, [tsk, #TI_FLAGS] @ re-check for syscall tracing - tst r1, #_TIF_SYSCALL_WORK | _TIF_WORK_MASK + tst r1, #((_TIF_SYSCALL_WORK | _TIF_WORK_MASK) & ~_TIF_SECCOMP) + bne fast_work_pending + tst r1, #_TIF_SECCOMP bne fast_work_pending @ arch/arm/kernel/entry-common.S:95 @ ENDPROC(ret_fast_syscall) cmp r2, #TASK_SIZE blne addr_limit_check_failed ldr r1, [tsk, #TI_FLAGS] @ re-check for syscall tracing - tst r1, #_TIF_SYSCALL_WORK | _TIF_WORK_MASK + tst r1, #((_TIF_SYSCALL_WORK | _TIF_WORK_MASK) & ~_TIF_SECCOMP) + bne do_slower_path + tst r1, #_TIF_SECCOMP beq no_work_pending +do_slower_path: UNWIND(.fnend ) ENDPROC(ret_fast_syscall) @ arch/arm/kernel/signal.c:652 @ do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall) */ trace_hardirqs_off(); do { - if (likely(thread_flags & _TIF_NEED_RESCHED)) { + if (likely(thread_flags & (_TIF_NEED_RESCHED | + _TIF_NEED_RESCHED_LAZY))) { schedule(); } else { if (unlikely(!user_mode(regs))) @ arch/arm/kernel/smp.c:685 @ void handle_IPI(int ipinr, struct pt_regs *regs) break; case IPI_CPU_BACKTRACE: - printk_nmi_enter(); irq_enter(); nmi_cpu_backtrace(regs); irq_exit(); - printk_nmi_exit(); break; default: @ arch/arm/mm/fault.c:417 @ do_translation_fault(unsigned long addr, unsigned int fsr, if (addr < TASK_SIZE) return do_page_fault(addr, fsr, regs); + if (interrupts_enabled(regs)) + local_irq_enable(); + if (user_mode(regs)) goto bad_area; @ arch/arm/mm/fault.c:487 @ do_translation_fault(unsigned long addr, unsigned int fsr, static int do_sect_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) { + if (interrupts_enabled(regs)) + local_irq_enable(); + do_bad_area(addr, fsr, regs); return 0; } @ arch/arm/mm/highmem.c:34 @ static inline pte_t get_fixmap_pte(unsigned long vaddr) return *ptep; } +static unsigned int fixmap_idx(int type) +{ + return FIX_KMAP_BEGIN + type + KM_TYPE_NR * smp_processor_id(); +} + void *kmap(struct page *page) { might_sleep(); @ arch/arm/mm/highmem.c:59 @ EXPORT_SYMBOL(kunmap); void *kmap_atomic(struct page *page) { + pte_t pte = mk_pte(page, kmap_prot); unsigned int idx; unsigned long vaddr; void *kmap; int type; - preempt_disable(); + preempt_disable_nort(); pagefault_disable(); if (!PageHighMem(page)) return page_address(page); @ arch/arm/mm/highmem.c:85 @ void *kmap_atomic(struct page *page) type = kmap_atomic_idx_push(); - idx = FIX_KMAP_BEGIN + type + KM_TYPE_NR * smp_processor_id(); + idx = fixmap_idx(type); vaddr = __fix_to_virt(idx); #ifdef CONFIG_DEBUG_HIGHMEM /* @ arch/arm/mm/highmem.c:99 @ void *kmap_atomic(struct page *page) * in place, so the contained TLB flush ensures the TLB is updated * with the new mapping. */ - set_fixmap_pte(idx, mk_pte(page, kmap_prot)); +#ifdef CONFIG_PREEMPT_RT + current->kmap_pte[type] = pte; +#endif + set_fixmap_pte(idx, pte); return (void *)vaddr; } @ arch/arm/mm/highmem.c:115 @ void __kunmap_atomic(void *kvaddr) if (kvaddr >= (void *)FIXADDR_START) { type = kmap_atomic_idx(); - idx = FIX_KMAP_BEGIN + type + KM_TYPE_NR * smp_processor_id(); + idx = fixmap_idx(type); if (cache_is_vivt()) __cpuc_flush_dcache_area((void *)vaddr, PAGE_SIZE); +#ifdef CONFIG_PREEMPT_RT + current->kmap_pte[type] = __pte(0); +#endif #ifdef CONFIG_DEBUG_HIGHMEM BUG_ON(vaddr != __fix_to_virt(idx)); - set_fixmap_pte(idx, __pte(0)); #else (void) idx; /* to kill a warning */ #endif + set_fixmap_pte(idx, __pte(0)); kmap_atomic_idx_pop(); } else if (vaddr >= PKMAP_ADDR(0) && vaddr < PKMAP_ADDR(LAST_PKMAP)) { /* this address was obtained through kmap_high_get() */ kunmap_high(pte_page(pkmap_page_table[PKMAP_NR(vaddr)])); } pagefault_enable(); - preempt_enable(); + preempt_enable_nort(); } EXPORT_SYMBOL(__kunmap_atomic); void *kmap_atomic_pfn(unsigned long pfn) { + pte_t pte = pfn_pte(pfn, kmap_prot); unsigned long vaddr; int idx, type; struct page *page = pfn_to_page(pfn); - preempt_disable(); + preempt_disable_nort(); pagefault_disable(); if (!PageHighMem(page)) return page_address(page); type = kmap_atomic_idx_push(); - idx = FIX_KMAP_BEGIN + type + KM_TYPE_NR * smp_processor_id(); + idx = fixmap_idx(type); vaddr = __fix_to_virt(idx); #ifdef CONFIG_DEBUG_HIGHMEM BUG_ON(!pte_none(get_fixmap_pte(vaddr))); #endif - set_fixmap_pte(idx, pfn_pte(pfn, kmap_prot)); +#ifdef CONFIG_PREEMPT_RT + current->kmap_pte[type] = pte; +#endif + set_fixmap_pte(idx, pte); return (void *)vaddr; } +#if defined CONFIG_PREEMPT_RT +void switch_kmaps(struct task_struct *prev_p, struct task_struct *next_p) +{ + int i; + + /* + * Clear @prev's kmap_atomic mappings + */ + for (i = 0; i < prev_p->kmap_idx; i++) { + int idx = fixmap_idx(i); + + set_fixmap_pte(idx, __pte(0)); + } + /* + * Restore @next_p's kmap_atomic mappings + */ + for (i = 0; i < next_p->kmap_idx; i++) { + int idx = fixmap_idx(i); + + if (!pte_none(next_p->kmap_pte[i])) + set_fixmap_pte(idx, next_p->kmap_pte[i]); + } +} +#endif @ arch/arm64/Kconfig:72 @ config ARM64 select ARCH_SUPPORTS_ATOMIC_RMW select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && (GCC_VERSION >= 50000 || CC_IS_CLANG) select ARCH_SUPPORTS_NUMA_BALANCING + select ARCH_SUPPORTS_RT select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT select ARCH_WANT_DEFAULT_BPF_JIT select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT @ arch/arm64/Kconfig:167 @ config ARM64 select HAVE_PERF_EVENTS select HAVE_PERF_REGS select HAVE_PERF_USER_STACK_DUMP + select HAVE_PREEMPT_LAZY select HAVE_REGS_AND_STACK_ACCESS_API select HAVE_FUNCTION_ARG_ACCESS_API select HAVE_FUTEX_CMPXCHG if FUTEX @ arch/arm64/include/asm/preempt.h:73 @ static inline bool __preempt_count_dec_and_test(void) * interrupt occurring between the non-atomic READ_ONCE/WRITE_ONCE * pair. */ - return !pc || !READ_ONCE(ti->preempt_count); + if (!pc || !READ_ONCE(ti->preempt_count)) + return true; +#ifdef CONFIG_PREEMPT_LAZY + if ((pc & ~PREEMPT_NEED_RESCHED)) + return false; + if (current_thread_info()->preempt_lazy_count) + return false; + return test_thread_flag(TIF_NEED_RESCHED_LAZY); +#else + return false; +#endif } static inline bool should_resched(int preempt_offset) { +#ifdef CONFIG_PREEMPT_LAZY + u64 pc = READ_ONCE(current_thread_info()->preempt_count); + if (pc == preempt_offset) + return true; + + if ((pc & ~PREEMPT_NEED_RESCHED) != preempt_offset) + return false; + + if (current_thread_info()->preempt_lazy_count) + return false; + return test_thread_flag(TIF_NEED_RESCHED_LAZY); +#else u64 pc = READ_ONCE(current_thread_info()->preempt_count); return pc == preempt_offset; +#endif } #ifdef CONFIG_PREEMPTION @ arch/arm64/include/asm/spinlock_types.h:8 @ #ifndef __ASM_SPINLOCK_TYPES_H #define __ASM_SPINLOCK_TYPES_H -#if !defined(__LINUX_SPINLOCK_TYPES_H) && !defined(__ASM_SPINLOCK_H) -# error "please don't include this file directly" -#endif - #include <asm-generic/qspinlock_types.h> #include <asm-generic/qrwlock_types.h> @ arch/arm64/include/asm/thread_info.h:32 @ struct thread_info { #ifdef CONFIG_ARM64_SW_TTBR0_PAN u64 ttbr0; /* saved TTBR0_EL1 */ #endif + int preempt_lazy_count; /* 0 => preemptable, <0 => bug */ union { u64 preempt_count; /* 0 => preemptible, <0 => bug */ struct { @ arch/arm64/include/asm/thread_info.h:67 @ void arch_release_task_struct(struct task_struct *tsk); #define TIF_FOREIGN_FPSTATE 3 /* CPU's FP state is not current's */ #define TIF_UPROBE 4 /* uprobe breakpoint or singlestep */ #define TIF_FSCHECK 5 /* Check FS is USER_DS on return */ +#define TIF_NEED_RESCHED_LAZY 6 #define TIF_NOHZ 7 #define TIF_SYSCALL_TRACE 8 /* syscall trace active */ #define TIF_SYSCALL_AUDIT 9 /* syscall auditing */ @ arch/arm64/include/asm/thread_info.h:88 @ void arch_release_task_struct(struct task_struct *tsk); #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME) #define _TIF_FOREIGN_FPSTATE (1 << TIF_FOREIGN_FPSTATE) +#define _TIF_NEED_RESCHED_LAZY (1 << TIF_NEED_RESCHED_LAZY) #define _TIF_NOHZ (1 << TIF_NOHZ) #define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE) #define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT) @ arch/arm64/include/asm/thread_info.h:102 @ void arch_release_task_struct(struct task_struct *tsk); #define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \ _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \ - _TIF_UPROBE | _TIF_FSCHECK) + _TIF_UPROBE | _TIF_FSCHECK | _TIF_NEED_RESCHED_LAZY) +#define _TIF_NEED_RESCHED_MASK (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY) #define _TIF_SYSCALL_WORK (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \ _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \ _TIF_NOHZ | _TIF_SYSCALL_EMU) @ arch/arm64/kernel/asm-offsets.c:33 @ int main(void) BLANK(); DEFINE(TSK_TI_FLAGS, offsetof(struct task_struct, thread_info.flags)); DEFINE(TSK_TI_PREEMPT, offsetof(struct task_struct, thread_info.preempt_count)); + DEFINE(TSK_TI_PREEMPT_LAZY, offsetof(struct task_struct, thread_info.preempt_lazy_count)); DEFINE(TSK_TI_ADDR_LIMIT, offsetof(struct task_struct, thread_info.addr_limit)); #ifdef CONFIG_ARM64_SW_TTBR0_PAN DEFINE(TSK_TI_TTBR0, offsetof(struct task_struct, thread_info.ttbr0)); @ arch/arm64/kernel/entry.S:615 @ alternative_if ARM64_HAS_IRQ_PRIO_MASKING mrs x0, daif orr x24, x24, x0 alternative_else_nop_endif - cbnz x24, 1f // preempt count != 0 || NMI return path - bl arm64_preempt_schedule_irq // irq en/disable is done inside + + cbz x24, 1f // (need_resched + count) == 0 + cbnz w24, 2f // count != 0 + + ldr w24, [tsk, #TSK_TI_PREEMPT_LAZY] // get preempt lazy count + cbnz w24, 2f // preempt lazy count != 0 + + ldr x0, [tsk, #TSK_TI_FLAGS] // get flags + tbz x0, #TIF_NEED_RESCHED_LAZY, 2f // needs rescheduling? 1: + bl arm64_preempt_schedule_irq // irq en/disable is done inside +2: #endif #ifdef CONFIG_ARM64_PSEUDO_NMI @ arch/arm64/kernel/fpsimd.c:216 @ static void sve_free(struct task_struct *task) __sve_free(task); } +static void *sve_free_atomic(struct task_struct *task) +{ + void *sve_state = task->thread.sve_state; + + WARN_ON(test_tsk_thread_flag(task, TIF_SVE)); + + task->thread.sve_state = NULL; + return sve_state; +} + /* * TIF_SVE controls whether a task can use SVE without trapping while * in userspace, and also the way a task's FPSIMD/SVE state is stored @ arch/arm64/kernel/fpsimd.c:1023 @ void fpsimd_thread_switch(struct task_struct *next) void fpsimd_flush_thread(void) { int vl, supported_vl; + void *mem = NULL; if (!system_supports_fpsimd()) return; @ arch/arm64/kernel/fpsimd.c:1036 @ void fpsimd_flush_thread(void) if (system_supports_sve()) { clear_thread_flag(TIF_SVE); - sve_free(current); + mem = sve_free_atomic(current); /* * Reset the task vector length as required. @ arch/arm64/kernel/fpsimd.c:1070 @ void fpsimd_flush_thread(void) } put_cpu_fpsimd_context(); + kfree(mem); } /* @ arch/arm64/kernel/signal.c:915 @ asmlinkage void do_notify_resume(struct pt_regs *regs, /* Check valid user FS if needed */ addr_limit_user_check(); - if (thread_flags & _TIF_NEED_RESCHED) { + if (thread_flags & _TIF_NEED_RESCHED_MASK) { /* Unmask Debug and SError for the next task */ local_daif_restore(DAIF_PROCCTX_NOIRQ); @ arch/hexagon/include/asm/spinlock_types.h:11 @ #ifndef _ASM_SPINLOCK_TYPES_H #define _ASM_SPINLOCK_TYPES_H -#ifndef __LINUX_SPINLOCK_TYPES_H -# error "please don't include this file directly" -#endif - typedef struct { volatile unsigned int lock; } arch_spinlock_t; @ arch/ia64/include/asm/spinlock_types.h:5 @ #ifndef _ASM_IA64_SPINLOCK_TYPES_H #define _ASM_IA64_SPINLOCK_TYPES_H -#ifndef __LINUX_SPINLOCK_TYPES_H -# error "please don't include this file directly" -#endif - typedef struct { volatile unsigned int lock; } arch_spinlock_t; @ arch/mips/Kconfig:2640 @ config MIPS_CRC_SUPPORT # config HIGHMEM bool "High Memory Support" - depends on 32BIT && CPU_SUPPORTS_HIGHMEM && SYS_SUPPORTS_HIGHMEM && !CPU_MIPS32_3_5_EVA + depends on 32BIT && CPU_SUPPORTS_HIGHMEM && SYS_SUPPORTS_HIGHMEM && !CPU_MIPS32_3_5_EVA && !PREEMPT_RT config CPU_SUPPORTS_HIGHMEM bool @ arch/powerpc/Kconfig:144 @ config PPC select ARCH_MIGHT_HAVE_PC_SERIO select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX select ARCH_SUPPORTS_ATOMIC_RMW + select ARCH_SUPPORTS_RT select ARCH_USE_BUILTIN_BSWAP select ARCH_USE_CMPXCHG_LOCKREF if PPC64 select ARCH_WANT_IPC_PARSE_VERSION @ arch/powerpc/Kconfig:224 @ config PPC select HAVE_HARDLOCKUP_DETECTOR_PERF if PERF_EVENTS && HAVE_PERF_EVENTS_NMI && !HAVE_HARDLOCKUP_DETECTOR_ARCH select HAVE_PERF_REGS select HAVE_PERF_USER_STACK_DUMP + select HAVE_PREEMPT_LAZY select MMU_GATHER_RCU_TABLE_FREE select MMU_GATHER_PAGE_SIZE select HAVE_REGS_AND_STACK_ACCESS_API @ arch/powerpc/Kconfig:402 @ menu "Kernel options" config HIGHMEM bool "High memory support" - depends on PPC32 + depends on PPC32 && !PREEMPT_RT source "kernel/Kconfig.hz" @ arch/powerpc/include/asm/spinlock_types.h:5 @ #ifndef _ASM_POWERPC_SPINLOCK_TYPES_H #define _ASM_POWERPC_SPINLOCK_TYPES_H -#ifndef __LINUX_SPINLOCK_TYPES_H -# error "please don't include this file directly" -#endif - typedef struct { volatile unsigned int slock; } arch_spinlock_t; @ arch/powerpc/include/asm/stackprotector.h:27 @ static __always_inline void boot_init_stack_canary(void) unsigned long canary; /* Try to get a semi random initial value. */ +#ifdef CONFIG_PREEMPT_RT + canary = (unsigned long)&canary; +#else canary = get_random_canary(); +#endif canary ^= mftb(); canary ^= LINUX_VERSION_CODE; canary &= CANARY_MASK; @ arch/powerpc/include/asm/thread_info.h:51 @ struct thread_info { int preempt_count; /* 0 => preemptable, <0 => BUG */ + int preempt_lazy_count; /* 0 => preemptable, + <0 => BUG */ unsigned long local_flags; /* private flags for thread */ #ifdef CONFIG_LIVEPATCH unsigned long *livepatch_sp; @ arch/powerpc/include/asm/thread_info.h:103 @ void arch_setup_new_exec(void); #define TIF_SINGLESTEP 8 /* singlestepping active */ #define TIF_NOHZ 9 /* in adaptive nohz mode */ #define TIF_SECCOMP 10 /* secure computing */ -#define TIF_RESTOREALL 11 /* Restore all regs (implies NOERROR) */ -#define TIF_NOERROR 12 /* Force successful syscall return */ + +#define TIF_NEED_RESCHED_LAZY 11 /* lazy rescheduling necessary */ +#define TIF_SYSCALL_TRACEPOINT 12 /* syscall tracepoint instrumentation */ + #define TIF_NOTIFY_RESUME 13 /* callback before returning to user */ #define TIF_UPROBE 14 /* breakpointed or single-stepping */ -#define TIF_SYSCALL_TRACEPOINT 15 /* syscall tracepoint instrumentation */ #define TIF_EMULATE_STACK_STORE 16 /* Is an instruction emulation for stack store? */ #define TIF_MEMDIE 17 /* is terminating due to OOM killer */ @ arch/powerpc/include/asm/thread_info.h:117 @ void arch_setup_new_exec(void); #endif #define TIF_POLLING_NRFLAG 19 /* true if poll_idle() is polling TIF_NEED_RESCHED */ #define TIF_32BIT 20 /* 32 bit binary */ +#define TIF_RESTOREALL 21 /* Restore all regs (implies NOERROR) */ +#define TIF_NOERROR 22 /* Force successful syscall return */ + /* as above, but as bit values */ #define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE) @ arch/powerpc/include/asm/thread_info.h:139 @ void arch_setup_new_exec(void); #define _TIF_SYSCALL_TRACEPOINT (1<<TIF_SYSCALL_TRACEPOINT) #define _TIF_EMULATE_STACK_STORE (1<<TIF_EMULATE_STACK_STORE) #define _TIF_NOHZ (1<<TIF_NOHZ) +#define _TIF_NEED_RESCHED_LAZY (1<<TIF_NEED_RESCHED_LAZY) #define _TIF_FSCHECK (1<<TIF_FSCHECK) #define _TIF_SYSCALL_EMU (1<<TIF_SYSCALL_EMU) #define _TIF_SYSCALL_DOTRACE (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \ @ arch/powerpc/include/asm/thread_info.h:149 @ void arch_setup_new_exec(void); #define _TIF_USER_WORK_MASK (_TIF_SIGPENDING | _TIF_NEED_RESCHED | \ _TIF_NOTIFY_RESUME | _TIF_UPROBE | \ _TIF_RESTORE_TM | _TIF_PATCH_PENDING | \ - _TIF_FSCHECK) + _TIF_FSCHECK | _TIF_NEED_RESCHED_LAZY) #define _TIF_PERSYSCALL_MASK (_TIF_RESTOREALL|_TIF_NOERROR) +#define _TIF_NEED_RESCHED_MASK (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY) /* Bits in local_flags */ /* Don't move TLF_NAPPING without adjusting the code in entry_32.S */ @ arch/powerpc/kernel/asm-offsets.c:188 @ int main(void) OFFSET(TI_FLAGS, thread_info, flags); OFFSET(TI_LOCAL_FLAGS, thread_info, local_flags); OFFSET(TI_PREEMPT, thread_info, preempt_count); + OFFSET(TI_PREEMPT_LAZY, thread_info, preempt_lazy_count); #ifdef CONFIG_PPC64 OFFSET(DCACHEL1BLOCKSIZE, ppc64_caches, l1d.block_size); @ arch/powerpc/kernel/entry_32.S:410 @ _GLOBAL(DoSyscall) mtmsr r10 lwz r9,TI_FLAGS(r2) li r8,-MAX_ERRNO - andi. r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK) + lis r0,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)@h + ori r0,r0, (_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)@l + and. r0,r9,r0 bne- syscall_exit_work cmplw 0,r3,r8 blt+ syscall_exit_cont @ arch/powerpc/kernel/entry_32.S:527 @ END_FTR_SECTION_IFSET(CPU_FTR_NEED_PAIRED_STWCX) b syscall_dotrace_cont syscall_exit_work: - andi. r0,r9,_TIF_RESTOREALL + andis. r0,r9,_TIF_RESTOREALL@h beq+ 0f REST_NVGPRS(r1) b 2f 0: cmplw 0,r3,r8 blt+ 1f - andi. r0,r9,_TIF_NOERROR + andis. r0,r9,_TIF_NOERROR@h bne- 1f lwz r11,_CCR(r1) /* Load CR */ neg r3,r3 @ arch/powerpc/kernel/entry_32.S:542 @ END_FTR_SECTION_IFSET(CPU_FTR_NEED_PAIRED_STWCX) 1: stw r6,RESULT(r1) /* Save result */ stw r3,GPR3(r1) /* Update return value */ -2: andi. r0,r9,(_TIF_PERSYSCALL_MASK) +2: andis. r0,r9,(_TIF_PERSYSCALL_MASK)@h beq 4f /* Clear per-syscall TIF flags if any are set. */ - li r11,_TIF_PERSYSCALL_MASK + lis r11,(_TIF_PERSYSCALL_MASK)@h addi r12,r2,TI_FLAGS 3: lwarx r8,0,r12 andc r8,r8,r11 @ arch/powerpc/kernel/entry_32.S:914 @ user_exc_return: /* r10 contains MSR_KERNEL here */ cmpwi 0,r0,0 /* if non-zero, just restore regs and return */ bne restore_kuap andi. r8,r8,_TIF_NEED_RESCHED + bne+ 1f + lwz r0,TI_PREEMPT_LAZY(r2) + cmpwi 0,r0,0 /* if non-zero, just restore regs and return */ + bne restore_kuap + lwz r0,TI_FLAGS(r2) + andi. r0,r0,_TIF_NEED_RESCHED_LAZY beq+ restore_kuap +1: lwz r3,_MSR(r1) andi. r0,r3,MSR_EE /* interrupts off? */ beq restore_kuap /* don't schedule if so */ @ arch/powerpc/kernel/entry_32.S:1242 @ END_FTR_SECTION_IFSET(CPU_FTR_NEED_PAIRED_STWCX) #endif /* !(CONFIG_4xx || CONFIG_BOOKE) */ do_work: /* r10 contains MSR_KERNEL here */ - andi. r0,r9,_TIF_NEED_RESCHED + andi. r0,r9,_TIF_NEED_RESCHED_MASK beq do_user_signal do_resched: /* r10 contains MSR_KERNEL here */ @ arch/powerpc/kernel/entry_32.S:1263 @ do_resched: /* r10 contains MSR_KERNEL here */ SYNC mtmsr r10 /* disable interrupts */ lwz r9,TI_FLAGS(r2) - andi. r0,r9,_TIF_NEED_RESCHED + andi. r0,r9,_TIF_NEED_RESCHED_MASK bne- do_resched andi. r0,r9,_TIF_USER_WORK_MASK beq restore_user @ arch/powerpc/kernel/entry_64.S:243 @ system_call: /* label this so stack traces look sane */ ld r9,TI_FLAGS(r12) li r11,-MAX_ERRNO - andi. r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK) + lis r0,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)@h + ori r0,r0,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)@l + and. r0,r9,r0 bne- .Lsyscall_exit_work andi. r0,r8,MSR_FP @ arch/powerpc/kernel/entry_64.S:368 @ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR) /* If TIF_RESTOREALL is set, don't scribble on either r3 or ccr. If TIF_NOERROR is set, just save r3 as it is. */ - andi. r0,r9,_TIF_RESTOREALL + andis. r0,r9,_TIF_RESTOREALL@h beq+ 0f REST_NVGPRS(r1) b 2f 0: cmpld r3,r11 /* r11 is -MAX_ERRNO */ blt+ 1f - andi. r0,r9,_TIF_NOERROR + andis. r0,r9,_TIF_NOERROR@h bne- 1f ld r5,_CCR(r1) neg r3,r3 oris r5,r5,0x1000 /* Set SO bit in CR */ std r5,_CCR(r1) 1: std r3,GPR3(r1) -2: andi. r0,r9,(_TIF_PERSYSCALL_MASK) +2: andis. r0,r9,(_TIF_PERSYSCALL_MASK)@h beq 4f /* Clear per-syscall TIF flags if any are set. */ - li r11,_TIF_PERSYSCALL_MASK + lis r11,(_TIF_PERSYSCALL_MASK)@h addi r12,r12,TI_FLAGS 3: ldarx r10,0,r12 andc r10,r10,r11 @ arch/powerpc/kernel/entry_64.S:789 @ _GLOBAL(ret_from_except_lite) bl restore_math b restore #endif -1: andi. r0,r4,_TIF_NEED_RESCHED +1: andi. r0,r4,_TIF_NEED_RESCHED_MASK beq 2f bl restore_interrupts SCHEDULE_USER @ arch/powerpc/kernel/entry_64.S:851 @ _GLOBAL(ret_from_except_lite) #ifdef CONFIG_PREEMPTION /* Check if we need to preempt */ - andi. r0,r4,_TIF_NEED_RESCHED - beq+ restore - /* Check that preempt_count() == 0 and interrupts are enabled */ lwz r8,TI_PREEMPT(r9) + cmpwi 0,r8,0 /* if non-zero, just restore regs and return */ + bne restore + andi. r0,r4,_TIF_NEED_RESCHED + bne+ check_count + + andi. r0,r4,_TIF_NEED_RESCHED_LAZY + beq+ restore + lwz r8,TI_PREEMPT_LAZY(r9) + + /* Check that preempt_count() == 0 and interrupts are enabled */ +check_count: cmpwi cr0,r8,0 bne restore ld r0,SOFTE(r1) @ arch/powerpc/kernel/irq.c:704 @ void *mcheckirq_ctx[NR_CPUS] __read_mostly; void *softirq_ctx[NR_CPUS] __read_mostly; void *hardirq_ctx[NR_CPUS] __read_mostly; +#ifndef CONFIG_PREEMPT_RT void do_softirq_own_stack(void) { call_do_softirq(softirq_ctx[smp_processor_id()]); } +#endif irq_hw_number_t virq_to_hw(unsigned int virq) { @ arch/powerpc/kernel/misc_32.S:34 @ * We store the saved ksp_limit in the unused part * of the STACK_FRAME_OVERHEAD */ +#ifndef CONFIG_PREEMPT_RT _GLOBAL(call_do_softirq) mflr r0 stw r0,4(r1) @ arch/powerpc/kernel/misc_32.S:50 @ _GLOBAL(call_do_softirq) stw r10,THREAD+KSP_LIMIT(r2) mtlr r0 blr +#endif /* * void call_do_irq(struct pt_regs *regs, void *sp); @ arch/powerpc/kernel/misc_64.S:30 @ .text +#ifndef CONFIG_PREEMPT_RT _GLOBAL(call_do_softirq) mflr r0 std r0,16(r1) @ arch/powerpc/kernel/misc_64.S:41 @ _GLOBAL(call_do_softirq) ld r0,16(r1) mtlr r0 blr +#endif _GLOBAL(call_do_irq) mflr r0 @ arch/powerpc/kernel/traps.c:174 @ extern void panic_flush_kmsg_start(void) extern void panic_flush_kmsg_end(void) { - printk_safe_flush_on_panic(); kmsg_dump(KMSG_DUMP_PANIC); bust_spinlocks(0); debug_locks_off(); @ arch/powerpc/kernel/traps.c:263 @ static char *get_mmu_str(void) static int __die(const char *str, struct pt_regs *regs, long err) { + const char *pr = ""; + printk("Oops: %s, sig: %ld [#%d]\n", str, err, ++die_counter); + if (IS_ENABLED(CONFIG_PREEMPTION)) + pr = IS_ENABLED(CONFIG_PREEMPT_RT) ? " PREEMPT_RT" : " PREEMPT"; + printk("%s PAGE_SIZE=%luK%s%s%s%s%s%s %s\n", IS_ENABLED(CONFIG_CPU_LITTLE_ENDIAN) ? "LE" : "BE", PAGE_SIZE / 1024, get_mmu_str(), - IS_ENABLED(CONFIG_PREEMPT) ? " PREEMPT" : "", + pr, IS_ENABLED(CONFIG_SMP) ? " SMP" : "", IS_ENABLED(CONFIG_SMP) ? (" NR_CPUS=" __stringify(NR_CPUS)) : "", debug_pagealloc_enabled() ? " DEBUG_PAGEALLOC" : "", @ arch/powerpc/kernel/watchdog.c:184 @ static void watchdog_smp_panic(int cpu, u64 tb) wd_smp_unlock(&flags); - printk_safe_flush(); - /* - * printk_safe_flush() seems to require another print - * before anything actually goes out to console. - */ if (sysctl_hardlockup_all_cpu_backtrace) trigger_allbutself_cpu_backtrace(); @ arch/powerpc/kvm/Kconfig:181 @ config KVM_E500MC config KVM_MPIC bool "KVM in-kernel MPIC emulation" depends on KVM && E500 + depends on !PREEMPT_RT select HAVE_KVM_IRQCHIP select HAVE_KVM_IRQFD select HAVE_KVM_IRQ_ROUTING @ arch/powerpc/platforms/ps3/device-init.c:741 @ static int ps3_notification_read_write(struct ps3_notification_device *dev, } pr_debug("%s:%u: notification %s issued\n", __func__, __LINE__, op); - res = wait_event_interruptible(dev->done.wait, - dev->done.done || kthread_should_stop()); + res = swait_event_interruptible_exclusive(dev->done.wait, + dev->done.done || kthread_should_stop()); if (kthread_should_stop()) res = -EINTR; if (res) { @ arch/powerpc/platforms/pseries/iommu.c:27 @ #include <linux/of.h> #include <linux/iommu.h> #include <linux/rculist.h> +#include <linux/locallock.h> #include <asm/io.h> #include <asm/prom.h> #include <asm/rtas.h> @ arch/powerpc/platforms/pseries/iommu.c:181 @ static int tce_build_pSeriesLP(unsigned long liobn, long tcenum, long tceshift, } static DEFINE_PER_CPU(__be64 *, tce_page); +static DEFINE_LOCAL_IRQ_LOCK(tcp_page_lock); static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages, unsigned long uaddr, @ arch/powerpc/platforms/pseries/iommu.c:203 @ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum, direction, attrs); } - local_irq_save(flags); /* to protect tcep and the page behind it */ + /* to protect tcep and the page behind it */ + local_lock_irqsave(tcp_page_lock, flags); tcep = __this_cpu_read(tce_page); @ arch/powerpc/platforms/pseries/iommu.c:215 @ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum, tcep = (__be64 *)__get_free_page(GFP_ATOMIC); /* If allocation fails, fall back to the loop implementation */ if (!tcep) { - local_irq_restore(flags); + local_unlock_irqrestore(tcp_page_lock, flags); return tce_build_pSeriesLP(tbl->it_index, tcenum, tbl->it_page_shift, npages, uaddr, direction, attrs); @ arch/powerpc/platforms/pseries/iommu.c:250 @ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum, tcenum += limit; } while (npages > 0 && !rc); - local_irq_restore(flags); + local_unlock_irqrestore(tcp_page_lock, flags); if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) { ret = (int)rc; @ arch/powerpc/platforms/pseries/iommu.c:421 @ static int tce_setrange_multi_pSeriesLP(unsigned long start_pfn, DMA_BIDIRECTIONAL, 0); } - local_irq_disable(); /* to protect tcep and the page behind it */ + /* to protect tcep and the page behind it */ + local_lock_irq(tcp_page_lock); tcep = __this_cpu_read(tce_page); if (!tcep) { tcep = (__be64 *)__get_free_page(GFP_ATOMIC); if (!tcep) { - local_irq_enable(); + local_unlock_irq(tcp_page_lock); return -ENOMEM; } __this_cpu_write(tce_page, tcep); @ arch/powerpc/platforms/pseries/iommu.c:474 @ static int tce_setrange_multi_pSeriesLP(unsigned long start_pfn, /* error cleanup: caller will clear whole range */ - local_irq_enable(); + local_unlock_irq(tcp_page_lock); return rc; } @ arch/s390/include/asm/spinlock_types.h:5 @ #ifndef __ASM_SPINLOCK_TYPES_H #define __ASM_SPINLOCK_TYPES_H -#ifndef __LINUX_SPINLOCK_TYPES_H -# error "please don't include this file directly" -#endif - typedef struct { int lock; } __attribute__ ((aligned (4))) arch_spinlock_t; @ arch/sh/include/asm/spinlock_types.h:5 @ #ifndef __ASM_SH_SPINLOCK_TYPES_H #define __ASM_SH_SPINLOCK_TYPES_H -#ifndef __LINUX_SPINLOCK_TYPES_H -# error "please don't include this file directly" -#endif - typedef struct { volatile unsigned int lock; } arch_spinlock_t; @ arch/sh/kernel/irq.c:151 @ void irq_ctx_exit(int cpu) hardirq_ctx[cpu] = NULL; } +#ifndef CONFIG_PREEMPT_RT void do_softirq_own_stack(void) { struct thread_info *curctx; @ arch/sh/kernel/irq.c:179 @ void do_softirq_own_stack(void) "r5", "r6", "r7", "r8", "r9", "r15", "t", "pr" ); } +#endif #else static inline void handle_one_irq(unsigned int irq) { @ arch/sparc/kernel/irq_64.c:857 @ void __irq_entry handler_irq(int pil, struct pt_regs *regs) set_irq_regs(old_regs); } +#ifndef CONFIG_PREEMPT_RT void do_softirq_own_stack(void) { void *orig_sp, *sp = softirq_stack[smp_processor_id()]; @ arch/sparc/kernel/irq_64.c:872 @ void do_softirq_own_stack(void) __asm__ __volatile__("mov %0, %%sp" : : "r" (orig_sp)); } +#endif #ifdef CONFIG_HOTPLUG_CPU void fixup_irqs(void) @ arch/x86/Kconfig:93 @ config X86 select ARCH_SUPPORTS_ACPI select ARCH_SUPPORTS_ATOMIC_RMW select ARCH_SUPPORTS_NUMA_BALANCING if X86_64 + select ARCH_SUPPORTS_RT select ARCH_USE_BUILTIN_BSWAP select ARCH_USE_QUEUED_RWLOCKS select ARCH_USE_QUEUED_SPINLOCKS @ arch/x86/Kconfig:139 @ config X86 select HAVE_ALIGNED_STRUCT_PAGE if SLUB select HAVE_ARCH_AUDITSYSCALL select HAVE_ARCH_HUGE_VMAP if X86_64 || X86_PAE - select HAVE_ARCH_JUMP_LABEL - select HAVE_ARCH_JUMP_LABEL_RELATIVE + select HAVE_ARCH_JUMP_LABEL if !PREEMPT_RT + select HAVE_ARCH_JUMP_LABEL_RELATIVE if !PREEMPT_RT select HAVE_ARCH_KASAN if X86_64 select HAVE_ARCH_KASAN_VMALLOC if X86_64 select HAVE_ARCH_KGDB @ arch/x86/Kconfig:208 @ config X86 select HAVE_PCI select HAVE_PERF_REGS select HAVE_PERF_USER_STACK_DUMP + select HAVE_PREEMPT_LAZY select MMU_GATHER_RCU_TABLE_FREE if PARAVIRT select HAVE_REGS_AND_STACK_ACCESS_API select HAVE_RELIABLE_STACKTRACE if X86_64 && (UNWINDER_FRAME_POINTER || UNWINDER_ORC) && STACK_VALIDATION @ arch/x86/crypto/aesni-intel_glue.c:385 @ static int ecb_encrypt(struct skcipher_request *req) err = skcipher_walk_virt(&walk, req, true); - kernel_fpu_begin(); while ((nbytes = walk.nbytes)) { + kernel_fpu_begin(); aesni_ecb_enc(ctx, walk.dst.virt.addr, walk.src.virt.addr, nbytes & AES_BLOCK_MASK); + kernel_fpu_end(); nbytes &= AES_BLOCK_SIZE - 1; err = skcipher_walk_done(&walk, nbytes); } - kernel_fpu_end(); return err; } @ arch/x86/crypto/aesni-intel_glue.c:407 @ static int ecb_decrypt(struct skcipher_request *req) err = skcipher_walk_virt(&walk, req, true); - kernel_fpu_begin(); while ((nbytes = walk.nbytes)) { + kernel_fpu_begin(); aesni_ecb_dec(ctx, walk.dst.virt.addr, walk.src.virt.addr, nbytes & AES_BLOCK_MASK); + kernel_fpu_end(); nbytes &= AES_BLOCK_SIZE - 1; err = skcipher_walk_done(&walk, nbytes); } - kernel_fpu_end(); return err; } @ arch/x86/crypto/aesni-intel_glue.c:429 @ static int cbc_encrypt(struct skcipher_request *req) err = skcipher_walk_virt(&walk, req, true); - kernel_fpu_begin(); while ((nbytes = walk.nbytes)) { + kernel_fpu_begin(); aesni_cbc_enc(ctx, walk.dst.virt.addr, walk.src.virt.addr, nbytes & AES_BLOCK_MASK, walk.iv); + kernel_fpu_end(); nbytes &= AES_BLOCK_SIZE - 1; err = skcipher_walk_done(&walk, nbytes); } - kernel_fpu_end(); return err; } @ arch/x86/crypto/aesni-intel_glue.c:451 @ static int cbc_decrypt(struct skcipher_request *req) err = skcipher_walk_virt(&walk, req, true); - kernel_fpu_begin(); while ((nbytes = walk.nbytes)) { + kernel_fpu_begin(); aesni_cbc_dec(ctx, walk.dst.virt.addr, walk.src.virt.addr, nbytes & AES_BLOCK_MASK, walk.iv); + kernel_fpu_end(); nbytes &= AES_BLOCK_SIZE - 1; err = skcipher_walk_done(&walk, nbytes); } - kernel_fpu_end(); return err; } @ arch/x86/crypto/aesni-intel_glue.c:508 @ static int ctr_crypt(struct skcipher_request *req) err = skcipher_walk_virt(&walk, req, true); - kernel_fpu_begin(); while ((nbytes = walk.nbytes) >= AES_BLOCK_SIZE) { + kernel_fpu_begin(); aesni_ctr_enc_tfm(ctx, walk.dst.virt.addr, walk.src.virt.addr, nbytes & AES_BLOCK_MASK, walk.iv); + kernel_fpu_end(); nbytes &= AES_BLOCK_SIZE - 1; err = skcipher_walk_done(&walk, nbytes); } if (walk.nbytes) { + kernel_fpu_begin(); ctr_crypt_final(ctx, &walk); + kernel_fpu_end(); err = skcipher_walk_done(&walk, 0); } - kernel_fpu_end(); return err; } @ arch/x86/crypto/cast5_avx_glue.c:49 @ static inline void cast5_fpu_end(bool fpu_enabled) static int ecb_crypt(struct skcipher_request *req, bool enc) { - bool fpu_enabled = false; + bool fpu_enabled; struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req); struct cast5_ctx *ctx = crypto_skcipher_ctx(tfm); struct skcipher_walk walk; @ arch/x86/crypto/cast5_avx_glue.c:64 @ static int ecb_crypt(struct skcipher_request *req, bool enc) u8 *wsrc = walk.src.virt.addr; u8 *wdst = walk.dst.virt.addr; - fpu_enabled = cast5_fpu_begin(fpu_enabled, &walk, nbytes); + fpu_enabled = cast5_fpu_begin(false, &walk, nbytes); /* Process multi-block batch */ if (nbytes >= bsize * CAST5_PARALLEL_BLOCKS) { @ arch/x86/crypto/cast5_avx_glue.c:93 @ static int ecb_crypt(struct skcipher_request *req, bool enc) } while (nbytes >= bsize); done: + cast5_fpu_end(fpu_enabled); err = skcipher_walk_done(&walk, nbytes); } - - cast5_fpu_end(fpu_enabled); return err; } @ arch/x86/crypto/cast5_avx_glue.c:199 @ static int cbc_decrypt(struct skcipher_request *req) { struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req); struct cast5_ctx *ctx = crypto_skcipher_ctx(tfm); - bool fpu_enabled = false; + bool fpu_enabled; struct skcipher_walk walk; unsigned int nbytes; int err; @ arch/x86/crypto/cast5_avx_glue.c:207 @ static int cbc_decrypt(struct skcipher_request *req) err = skcipher_walk_virt(&walk, req, false); while ((nbytes = walk.nbytes)) { - fpu_enabled = cast5_fpu_begin(fpu_enabled, &walk, nbytes); + fpu_enabled = cast5_fpu_begin(false, &walk, nbytes); nbytes = __cbc_decrypt(ctx, &walk); + cast5_fpu_end(fpu_enabled); err = skcipher_walk_done(&walk, nbytes); } - - cast5_fpu_end(fpu_enabled); return err; } @ arch/x86/crypto/cast5_avx_glue.c:278 @ static int ctr_crypt(struct skcipher_request *req) { struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req); struct cast5_ctx *ctx = crypto_skcipher_ctx(tfm); - bool fpu_enabled = false; + bool fpu_enabled; struct skcipher_walk walk; unsigned int nbytes; int err; @ arch/x86/crypto/cast5_avx_glue.c:286 @ static int ctr_crypt(struct skcipher_request *req) err = skcipher_walk_virt(&walk, req, false); while ((nbytes = walk.nbytes) >= CAST5_BLOCK_SIZE) { - fpu_enabled = cast5_fpu_begin(fpu_enabled, &walk, nbytes); + fpu_enabled = cast5_fpu_begin(false, &walk, nbytes); nbytes = __ctr_crypt(&walk, ctx); + cast5_fpu_end(fpu_enabled); err = skcipher_walk_done(&walk, nbytes); } - cast5_fpu_end(fpu_enabled); - if (walk.nbytes) { ctr_crypt_final(&walk, ctx); err = skcipher_walk_done(&walk, 0); @ arch/x86/crypto/glue_helper.c:27 @ int glue_ecb_req_128bit(const struct common_glue_ctx *gctx, void *ctx = crypto_skcipher_ctx(crypto_skcipher_reqtfm(req)); const unsigned int bsize = 128 / 8; struct skcipher_walk walk; - bool fpu_enabled = false; + bool fpu_enabled; unsigned int nbytes; int err; @ arch/x86/crypto/glue_helper.c:40 @ int glue_ecb_req_128bit(const struct common_glue_ctx *gctx, unsigned int i; fpu_enabled = glue_fpu_begin(bsize, gctx->fpu_blocks_limit, - &walk, fpu_enabled, nbytes); + &walk, false, nbytes); for (i = 0; i < gctx->num_funcs; i++) { func_bytes = bsize * gctx->funcs[i].num_blocks; @ arch/x86/crypto/glue_helper.c:58 @ int glue_ecb_req_128bit(const struct common_glue_ctx *gctx, if (nbytes < bsize) break; } + glue_fpu_end(fpu_enabled); err = skcipher_walk_done(&walk, nbytes); } - - glue_fpu_end(fpu_enabled); return err; } EXPORT_SYMBOL_GPL(glue_ecb_req_128bit); @ arch/x86/crypto/glue_helper.c:103 @ int glue_cbc_decrypt_req_128bit(const struct common_glue_ctx *gctx, void *ctx = crypto_skcipher_ctx(crypto_skcipher_reqtfm(req)); const unsigned int bsize = 128 / 8; struct skcipher_walk walk; - bool fpu_enabled = false; + bool fpu_enabled; unsigned int nbytes; int err; @ arch/x86/crypto/glue_helper.c:117 @ int glue_cbc_decrypt_req_128bit(const struct common_glue_ctx *gctx, u128 last_iv; fpu_enabled = glue_fpu_begin(bsize, gctx->fpu_blocks_limit, - &walk, fpu_enabled, nbytes); + &walk, false, nbytes); /* Start of the last block. */ src += nbytes / bsize - 1; dst += nbytes / bsize - 1; @ arch/x86/crypto/glue_helper.c:150 @ int glue_cbc_decrypt_req_128bit(const struct common_glue_ctx *gctx, done: u128_xor(dst, dst, (u128 *)walk.iv); *(u128 *)walk.iv = last_iv; + glue_fpu_end(fpu_enabled); err = skcipher_walk_done(&walk, nbytes); } - glue_fpu_end(fpu_enabled); return err; } EXPORT_SYMBOL_GPL(glue_cbc_decrypt_req_128bit); @ arch/x86/crypto/glue_helper.c:164 @ int glue_ctr_req_128bit(const struct common_glue_ctx *gctx, void *ctx = crypto_skcipher_ctx(crypto_skcipher_reqtfm(req)); const unsigned int bsize = 128 / 8; struct skcipher_walk walk; - bool fpu_enabled = false; + bool fpu_enabled; unsigned int nbytes; int err; @ arch/x86/crypto/glue_helper.c:178 @ int glue_ctr_req_128bit(const struct common_glue_ctx *gctx, le128 ctrblk; fpu_enabled = glue_fpu_begin(bsize, gctx->fpu_blocks_limit, - &walk, fpu_enabled, nbytes); + &walk, false, nbytes); be128_to_le128(&ctrblk, (be128 *)walk.iv); @ arch/x86/crypto/glue_helper.c:204 @ int glue_ctr_req_128bit(const struct common_glue_ctx *gctx, } le128_to_be128((be128 *)walk.iv, &ctrblk); + glue_fpu_end(fpu_enabled); err = skcipher_walk_done(&walk, nbytes); } - glue_fpu_end(fpu_enabled); - if (nbytes) { le128 ctrblk; u128 tmp; @ arch/x86/crypto/glue_helper.c:307 @ int glue_xts_req_128bit(const struct common_glue_ctx *gctx, tweak_fn(tweak_ctx, walk.iv, walk.iv); while (nbytes) { + fpu_enabled = glue_fpu_begin(bsize, gctx->fpu_blocks_limit, + &walk, fpu_enabled, + nbytes < bsize ? bsize : nbytes); nbytes = __glue_xts_req_128bit(gctx, crypt_ctx, &walk); + glue_fpu_end(fpu_enabled); + fpu_enabled = false; + err = skcipher_walk_done(&walk, nbytes); nbytes = walk.nbytes; } @ arch/x86/entry/common.c:134 @ static long syscall_trace_enter(struct pt_regs *regs) #define EXIT_TO_USERMODE_LOOP_FLAGS \ (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \ - _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY | _TIF_PATCH_PENDING) + _TIF_NEED_RESCHED_MASK | _TIF_USER_RETURN_NOTIFY | _TIF_PATCH_PENDING) static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags) { @ arch/x86/entry/common.c:149 @ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags) /* We have work to do. */ local_irq_enable(); - if (cached_flags & _TIF_NEED_RESCHED) + if (cached_flags & _TIF_NEED_RESCHED_MASK) schedule(); +#ifdef ARCH_RT_DELAYS_SIGNAL_SEND + if (unlikely(current->forced_info.si_signo)) { + struct task_struct *t = current; + force_sig_info(&t->forced_info); + t->forced_info.si_signo = 0; + } +#endif if (cached_flags & _TIF_UPROBE) uprobe_notify_resume(regs); @ arch/x86/entry/entry_32.S:1113 @ SYM_FUNC_START(entry_INT80_32) restore_all_kernel: #ifdef CONFIG_PREEMPTION DISABLE_INTERRUPTS(CLBR_ANY) + # preempt count == 0 + NEED_RS set? cmpl $0, PER_CPU_VAR(__preempt_count) +#ifndef CONFIG_PREEMPT_LAZY jnz .Lno_preempt +#else + jz test_int_off + + # atleast preempt count == 0 ? + cmpl $_PREEMPT_ENABLED,PER_CPU_VAR(__preempt_count) + jne .Lno_preempt + + movl PER_CPU_VAR(current_task), %ebp + cmpl $0,TASK_TI_preempt_lazy_count(%ebp) # non-zero preempt_lazy_count ? + jnz .Lno_preempt + + testl $_TIF_NEED_RESCHED_LAZY, TASK_TI_flags(%ebp) + jz .Lno_preempt + +test_int_off: +#endif testl $X86_EFLAGS_IF, PT_EFLAGS(%esp) # interrupts off (exception path) ? jz .Lno_preempt call preempt_schedule_irq @ arch/x86/entry/entry_64.S:673 @ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL) btl $9, EFLAGS(%rsp) /* were interrupts off? */ jnc 1f cmpl $0, PER_CPU_VAR(__preempt_count) +#ifndef CONFIG_PREEMPT_LAZY jnz 1f +#else + jz do_preempt_schedule_irq + + # atleast preempt count == 0 ? + cmpl $_PREEMPT_ENABLED,PER_CPU_VAR(__preempt_count) + jnz 1f + + movq PER_CPU_VAR(current_task), %rcx + cmpl $0, TASK_TI_preempt_lazy_count(%rcx) + jnz 1f + + btl $TIF_NEED_RESCHED_LAZY,TASK_TI_flags(%rcx) + jnc 1f +do_preempt_schedule_irq: +#endif call preempt_schedule_irq 1: #endif @ arch/x86/entry/entry_64.S:1093 @ SYM_CODE_START_LOCAL_NOALIGN(.Lbad_gs) SYM_CODE_END(.Lbad_gs) .previous +#ifndef CONFIG_PREEMPT_RT /* Call softirq on interrupt stack. Interrupts are off. */ SYM_FUNC_START(do_softirq_own_stack) pushq %rbp @ arch/x86/entry/entry_64.S:1104 @ SYM_FUNC_START(do_softirq_own_stack) leaveq ret SYM_FUNC_END(do_softirq_own_stack) +#endif #ifdef CONFIG_XEN_PV idtentry hypervisor_callback xen_do_hypervisor_callback has_error_code=0 @ arch/x86/include/asm/fpu/api.h:26 @ extern void kernel_fpu_begin(void); extern void kernel_fpu_end(void); extern bool irq_fpu_usable(void); extern void fpregs_mark_activate(void); +extern void kernel_fpu_resched(void); /* * Use fpregs_lock() while editing CPU's FPU registers or fpu->state. @ arch/x86/include/asm/preempt.h:92 @ static __always_inline void __preempt_count_sub(int val) * a decrement which hits zero means we have no preempt_count and should * reschedule. */ -static __always_inline bool __preempt_count_dec_and_test(void) +static __always_inline bool ____preempt_count_dec_and_test(void) { return GEN_UNARY_RMWcc("decl", __preempt_count, e, __percpu_arg([var])); } +static __always_inline bool __preempt_count_dec_and_test(void) +{ + if (____preempt_count_dec_and_test()) + return true; +#ifdef CONFIG_PREEMPT_LAZY + if (preempt_count()) + return false; + if (current_thread_info()->preempt_lazy_count) + return false; + return test_thread_flag(TIF_NEED_RESCHED_LAZY); +#else + return false; +#endif +} + /* * Returns true when we need to resched and can (barring IRQ state). */ static __always_inline bool should_resched(int preempt_offset) { +#ifdef CONFIG_PREEMPT_LAZY + u32 tmp; + tmp = raw_cpu_read_4(__preempt_count); + if (tmp == preempt_offset) + return true; + + /* preempt count == 0 ? */ + tmp &= ~PREEMPT_NEED_RESCHED; + if (tmp != preempt_offset) + return false; + /* XXX PREEMPT_LOCK_OFFSET */ + if (current_thread_info()->preempt_lazy_count) + return false; + return test_thread_flag(TIF_NEED_RESCHED_LAZY); +#else return unlikely(raw_cpu_read_4(__preempt_count) == preempt_offset); +#endif } #ifdef CONFIG_PREEMPTION @ arch/x86/include/asm/signal.h:31 @ typedef struct { #define SA_IA32_ABI 0x02000000u #define SA_X32_ABI 0x01000000u +/* + * Because some traps use the IST stack, we must keep preemption + * disabled while calling do_trap(), but do_trap() may call + * force_sig_info() which will grab the signal spin_locks for the + * task, which in PREEMPT_RT are mutexes. By defining + * ARCH_RT_DELAYS_SIGNAL_SEND the force_sig_info() will set + * TIF_NOTIFY_RESUME and set up the signal to be sent on exit of the + * trap. + */ +#if defined(CONFIG_PREEMPT_RT) +#define ARCH_RT_DELAYS_SIGNAL_SEND +#endif + #ifndef CONFIG_COMPAT typedef sigset_t compat_sigset_t; #endif @ arch/x86/include/asm/stackprotector.h:68 @ */ static __always_inline void boot_init_stack_canary(void) { - u64 canary; + u64 uninitialized_var(canary); u64 tsc; #ifdef CONFIG_X86_64 @ arch/x86/include/asm/stackprotector.h:79 @ static __always_inline void boot_init_stack_canary(void) * of randomness. The TSC only matters for very early init, * there it already has some randomness on most systems. Later * on during the bootup the random pool has true entropy too. + * For preempt-rt we need to weaken the randomness a bit, as + * we can't call into the random generator from atomic context + * due to locking constraints. We just leave canary + * uninitialized and use the TSC based randomness on top of it. */ +#ifndef CONFIG_PREEMPT_RT get_random_bytes(&canary, sizeof(canary)); +#endif tsc = rdtsc(); canary += tsc + (tsc << 32UL); canary &= CANARY_MASK; @ arch/x86/include/asm/thread_info.h:59 @ struct task_struct; struct thread_info { unsigned long flags; /* low level flags */ u32 status; /* thread synchronous flags */ + int preempt_lazy_count; /* 0 => lazy preemptable + <0 => BUG */ }; #define INIT_THREAD_INFO(tsk) \ { \ .flags = 0, \ + .preempt_lazy_count = 0, \ } #else /* !__ASSEMBLY__ */ #include <asm/asm-offsets.h> +#define GET_THREAD_INFO(reg) \ + _ASM_MOV PER_CPU_VAR(cpu_current_top_of_stack),reg ; \ + _ASM_SUB $(THREAD_SIZE),reg ; + #endif /* @ arch/x86/include/asm/thread_info.h:102 @ struct thread_info { #define TIF_NOCPUID 15 /* CPUID is not accessible in userland */ #define TIF_NOTSC 16 /* TSC is not accessible in userland */ #define TIF_IA32 17 /* IA32 compatibility process */ +#define TIF_NEED_RESCHED_LAZY 18 /* lazy rescheduling necessary */ #define TIF_NOHZ 19 /* in adaptive nohz mode */ #define TIF_MEMDIE 20 /* is terminating due to OOM killer */ #define TIF_POLLING_NRFLAG 21 /* idle is polling for TIF_NEED_RESCHED */ @ arch/x86/include/asm/thread_info.h:133 @ struct thread_info { #define _TIF_NOCPUID (1 << TIF_NOCPUID) #define _TIF_NOTSC (1 << TIF_NOTSC) #define _TIF_IA32 (1 << TIF_IA32) +#define _TIF_NEED_RESCHED_LAZY (1 << TIF_NEED_RESCHED_LAZY) #define _TIF_NOHZ (1 << TIF_NOHZ) #define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG) #define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP) @ arch/x86/include/asm/thread_info.h:177 @ struct thread_info { #define _TIF_WORK_CTXSW_NEXT (_TIF_WORK_CTXSW) +#define _TIF_NEED_RESCHED_MASK (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY) + #define STACK_WARN (THREAD_SIZE/8) /* @ arch/x86/kernel/asm-offsets.c:41 @ static void __used common(void) #endif BLANK(); +#ifdef CONFIG_PREEMPT_LAZY + OFFSET(TASK_TI_flags, task_struct, thread_info.flags); + OFFSET(TASK_TI_preempt_lazy_count, task_struct, thread_info.preempt_lazy_count); +#endif OFFSET(TASK_addr_limit, task_struct, thread.addr_limit); BLANK(); @ arch/x86/kernel/asm-offsets.c:99 @ static void __used common(void) BLANK(); DEFINE(PTREGS_SIZE, sizeof(struct pt_regs)); + DEFINE(_PREEMPT_ENABLED, PREEMPT_ENABLED); /* TLB state for the entry code */ OFFSET(TLB_STATE_user_pcid_flush_mask, tlb_state, user_pcid_flush_mask); @ arch/x86/kernel/cpu/mshyperv.c:80 @ EXPORT_SYMBOL_GPL(hv_remove_vmbus_irq); __visible void __irq_entry hv_stimer0_vector_handler(struct pt_regs *regs) { struct pt_regs *old_regs = set_irq_regs(regs); + u64 ip = regs ? instruction_pointer(regs) : 0; entering_irq(); inc_irq_stat(hyperv_stimer0_count); if (hv_stimer0_handler) hv_stimer0_handler(); - add_interrupt_randomness(HYPERV_STIMER0_VECTOR, 0); + add_interrupt_randomness(HYPERV_STIMER0_VECTOR, 0, ip); ack_APIC_irq(); exiting_irq(); @ arch/x86/kernel/fpu/core.c:116 @ void kernel_fpu_end(void) } EXPORT_SYMBOL_GPL(kernel_fpu_end); +void kernel_fpu_resched(void) +{ + WARN_ON_FPU(!this_cpu_read(in_kernel_fpu)); + + if (should_resched(PREEMPT_OFFSET)) { + kernel_fpu_end(); + cond_resched(); + kernel_fpu_begin(); + } +} +EXPORT_SYMBOL_GPL(kernel_fpu_resched); + /* * Save the FPU state (mark it for reload if necessary): * @ arch/x86/kernel/irq_32.c:134 @ int irq_init_percpu_irqstack(unsigned int cpu) return 0; } +#ifndef CONFIG_PREEMPT_RT void do_softirq_own_stack(void) { struct irq_stack *irqstk; @ arch/x86/kernel/irq_32.c:151 @ void do_softirq_own_stack(void) call_on_stack(__do_softirq, isp); } +#endif void handle_irq(struct irq_desc *desc, struct pt_regs *regs) { @ arch/x86/kernel/process_32.c:41 @ #include <linux/io.h> #include <linux/kdebug.h> #include <linux/syscalls.h> +#include <linux/highmem.h> #include <asm/pgtable.h> #include <asm/ldt.h> @ arch/x86/kernel/process_32.c:131 @ start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp) } EXPORT_SYMBOL_GPL(start_thread); +#ifdef CONFIG_PREEMPT_RT +static void switch_kmaps(struct task_struct *prev_p, struct task_struct *next_p) +{ + int i; + + /* + * Clear @prev's kmap_atomic mappings + */ + for (i = 0; i < prev_p->kmap_idx; i++) { + int idx = i + KM_TYPE_NR * smp_processor_id(); + pte_t *ptep = kmap_pte - idx; + + kpte_clear_flush(ptep, __fix_to_virt(FIX_KMAP_BEGIN + idx)); + } + /* + * Restore @next_p's kmap_atomic mappings + */ + for (i = 0; i < next_p->kmap_idx; i++) { + int idx = i + KM_TYPE_NR * smp_processor_id(); + + if (!pte_none(next_p->kmap_pte[i])) + set_pte(kmap_pte - idx, next_p->kmap_pte[i]); + } +} +#else +static inline void +switch_kmaps(struct task_struct *prev_p, struct task_struct *next_p) { } +#endif + /* * switch_to(x,y) should switch tasks from x to y. @ arch/x86/kernel/process_32.c:221 @ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) switch_to_extra(prev_p, next_p); + switch_kmaps(prev_p, next_p); + /* * Leave lazy mode, flushing any hypercalls made here. * This must be done before restoring TLS segments so @ arch/x86/kvm/x86.c:7353 @ int kvm_arch_init(void *opaque) goto out; } +#ifdef CONFIG_PREEMPT_RT + if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) { + pr_err("RT requires X86_FEATURE_CONSTANT_TSC\n"); + r = -EOPNOTSUPP; + goto out; + } +#endif + r = -ENOMEM; x86_fpu_cache = kmem_cache_create("x86_fpu", sizeof(struct fpu), __alignof__(struct fpu), SLAB_ACCOUNT, @ arch/x86/mm/highmem_32.c:36 @ EXPORT_SYMBOL(kunmap); */ void *kmap_atomic_prot(struct page *page, pgprot_t prot) { + pte_t pte = mk_pte(page, prot); unsigned long vaddr; int idx, type; - preempt_disable(); + preempt_disable_nort(); pagefault_disable(); if (!PageHighMem(page)) @ arch/x86/mm/highmem_32.c:50 @ void *kmap_atomic_prot(struct page *page, pgprot_t prot) idx = type + KM_TYPE_NR*smp_processor_id(); vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx); BUG_ON(!pte_none(*(kmap_pte-idx))); - set_pte(kmap_pte-idx, mk_pte(page, prot)); +#ifdef CONFIG_PREEMPT_RT + current->kmap_pte[type] = pte; +#endif + set_pte(kmap_pte-idx, pte); arch_flush_lazy_mmu_mode(); return (void *)vaddr; @ arch/x86/mm/highmem_32.c:96 @ void __kunmap_atomic(void *kvaddr) * is a bad idea also, in case the page changes cacheability * attributes or becomes a protected page in a hypervisor. */ +#ifdef CONFIG_PREEMPT_RT + current->kmap_pte[type] = __pte(0); +#endif kpte_clear_flush(kmap_pte-idx, vaddr); kmap_atomic_idx_pop(); arch_flush_lazy_mmu_mode(); @ arch/x86/mm/highmem_32.c:111 @ void __kunmap_atomic(void *kvaddr) #endif pagefault_enable(); - preempt_enable(); + preempt_enable_nort(); } EXPORT_SYMBOL(__kunmap_atomic); @ arch/x86/mm/iomap_32.c:49 @ EXPORT_SYMBOL_GPL(iomap_free); void *kmap_atomic_prot_pfn(unsigned long pfn, pgprot_t prot) { + pte_t pte = pfn_pte(pfn, prot); unsigned long vaddr; int idx, type; @ arch/x86/mm/iomap_32.c:59 @ void *kmap_atomic_prot_pfn(unsigned long pfn, pgprot_t prot) type = kmap_atomic_idx_push(); idx = type + KM_TYPE_NR * smp_processor_id(); vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx); - set_pte(kmap_pte - idx, pfn_pte(pfn, prot)); + WARN_ON(!pte_none(*(kmap_pte - idx))); + +#ifdef CONFIG_PREEMPT_RT + current->kmap_pte[type] = pte; +#endif + set_pte(kmap_pte - idx, pte); arch_flush_lazy_mmu_mode(); return (void *)vaddr; @ arch/x86/mm/iomap_32.c:115 @ iounmap_atomic(void __iomem *kvaddr) * is a bad idea also, in case the page changes cacheability * attributes or becomes a protected page in a hypervisor. */ +#ifdef CONFIG_PREEMPT_RT + current->kmap_pte[type] = __pte(0); +#endif kpte_clear_flush(kmap_pte-idx, vaddr); kmap_atomic_idx_pop(); } @ arch/xtensa/include/asm/spinlock_types.h:5 @ #ifndef __ASM_SPINLOCK_TYPES_H #define __ASM_SPINLOCK_TYPES_H -#if !defined(__LINUX_SPINLOCK_TYPES_H) && !defined(__ASM_SPINLOCK_H) -# error "please don't include this file directly" -#endif - #include <asm-generic/qspinlock_types.h> #include <asm-generic/qrwlock_types.h> @ block/blk-ioc.c:12 @ #include <linux/blkdev.h> #include <linux/slab.h> #include <linux/sched/task.h> +#include <linux/delay.h> #include "blk.h" @ block/blk-ioc.c:120 @ static void ioc_release_fn(struct work_struct *work) spin_unlock(&q->queue_lock); } else { spin_unlock_irqrestore(&ioc->lock, flags); - cpu_relax(); + cpu_chill(); spin_lock_irqsave_nested(&ioc->lock, flags, 1); } } @ block/blk-iocost.c:411 @ struct ioc { enum ioc_running running; atomic64_t vtime_rate; - seqcount_t period_seqcount; + seqcount_spinlock_t period_seqcount; u32 period_at; /* wallclock starttime */ u64 period_at_vtime; /* vtime starttime */ @ block/blk-iocost.c:878 @ static void ioc_now(struct ioc *ioc, struct ioc_now *now) static void ioc_start_period(struct ioc *ioc, struct ioc_now *now) { - lockdep_assert_held(&ioc->lock); WARN_ON_ONCE(ioc->running != IOC_RUNNING); write_seqcount_begin(&ioc->period_seqcount); @ block/blk-iocost.c:1963 @ static int blk_iocost_init(struct request_queue *q) ioc->running = IOC_IDLE; atomic64_set(&ioc->vtime_rate, VTIME_PER_USEC); - seqcount_init(&ioc->period_seqcount); + seqcount_spinlock_init(&ioc->period_seqcount, &ioc->lock); ioc->period_at = ktime_to_us(ktime_get()); atomic64_set(&ioc->cur_period, 0); atomic_set(&ioc->hweight_gen, 0); @ block/blk-mq.c:592 @ static void __blk_mq_complete_request(struct request *rq) return; } - cpu = get_cpu(); + cpu = get_cpu_light(); + /* + * Avoid SMP function calls for completions because they acquire + * sleeping spinlocks on RT. + */ +#ifdef CONFIG_PREEMPT_RT + shared = true; +#else if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags)) shared = cpus_share_cache(cpu, ctx->cpu); +#endif if (cpu != ctx->cpu && !shared && cpu_online(ctx->cpu)) { rq->csd.func = __blk_mq_complete_request_remote; @ block/blk-mq.c:612 @ static void __blk_mq_complete_request(struct request *rq) } else { q->mq_ops->complete(rq); } - put_cpu(); + put_cpu_light(); } static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx) @ block/blk-mq.c:1467 @ static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async, return; if (!async && !(hctx->flags & BLK_MQ_F_BLOCKING)) { - int cpu = get_cpu(); + int cpu = get_cpu_light(); if (cpumask_test_cpu(cpu, hctx->cpumask)) { __blk_mq_run_hw_queue(hctx); - put_cpu(); + put_cpu_light(); return; } - put_cpu(); + put_cpu_light(); } kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work, @ block/blk-softirq.c:90 @ static int blk_softirq_cpu_dead(unsigned int cpu) this_cpu_ptr(&blk_cpu_done)); raise_softirq_irqoff(BLOCK_SOFTIRQ); local_irq_enable(); + preempt_check_resched_rt(); return 0; } @ block/blk-softirq.c:142 @ void __blk_complete_request(struct request *req) goto do_local; local_irq_restore(flags); + preempt_check_resched_rt(); } static __init int blk_softirq_init(void) @ crypto/cryptd.c:39 @ static struct workqueue_struct *cryptd_wq; struct cryptd_cpu_queue { struct crypto_queue queue; struct work_struct work; + spinlock_t qlock; }; struct cryptd_queue { @ crypto/cryptd.c:109 @ static int cryptd_init_queue(struct cryptd_queue *queue, cpu_queue = per_cpu_ptr(queue->cpu_queue, cpu); crypto_init_queue(&cpu_queue->queue, max_cpu_qlen); INIT_WORK(&cpu_queue->work, cryptd_queue_worker); + spin_lock_init(&cpu_queue->qlock); } pr_info("cryptd: max_cpu_qlen set to %d\n", max_cpu_qlen); return 0; @ crypto/cryptd.c:134 @ static int cryptd_enqueue_request(struct cryptd_queue *queue, struct cryptd_cpu_queue *cpu_queue; refcount_t *refcnt; - cpu = get_cpu(); - cpu_queue = this_cpu_ptr(queue->cpu_queue); + cpu_queue = raw_cpu_ptr(queue->cpu_queue); + spin_lock_bh(&cpu_queue->qlock); + cpu = smp_processor_id(); + err = crypto_enqueue_request(&cpu_queue->queue, request); refcnt = crypto_tfm_ctx(request->tfm); @ crypto/cryptd.c:153 @ static int cryptd_enqueue_request(struct cryptd_queue *queue, refcount_inc(refcnt); out_put_cpu: - put_cpu(); + spin_unlock_bh(&cpu_queue->qlock); return err; } @ crypto/cryptd.c:169 @ static void cryptd_queue_worker(struct work_struct *work) cpu_queue = container_of(work, struct cryptd_cpu_queue, work); /* * Only handle one request at a time to avoid hogging crypto workqueue. - * preempt_disable/enable is used to prevent being preempted by - * cryptd_enqueue_request(). local_bh_disable/enable is used to prevent - * cryptd_enqueue_request() being accessed from software interrupts. */ - local_bh_disable(); - preempt_disable(); + spin_lock_bh(&cpu_queue->qlock); backlog = crypto_get_backlog(&cpu_queue->queue); req = crypto_dequeue_request(&cpu_queue->queue); - preempt_enable(); - local_bh_enable(); + spin_unlock_bh(&cpu_queue->qlock); if (!req) return; @ drivers/block/zram/zcomp.c:116 @ ssize_t zcomp_available_show(const char *comp, char *buf) struct zcomp_strm *zcomp_stream_get(struct zcomp *comp) { - return *get_cpu_ptr(comp->stream); + struct zcomp_strm *zstrm; + + zstrm = *get_local_ptr(comp->stream); + spin_lock(&zstrm->zcomp_lock); + return zstrm; } void zcomp_stream_put(struct zcomp *comp) { - put_cpu_ptr(comp->stream); + struct zcomp_strm *zstrm; + + zstrm = *this_cpu_ptr(comp->stream); + spin_unlock(&zstrm->zcomp_lock); + put_local_ptr(zstrm); } int zcomp_compress(struct zcomp_strm *zstrm, @ drivers/block/zram/zcomp.c:179 @ int zcomp_cpu_up_prepare(unsigned int cpu, struct hlist_node *node) pr_err("Can't allocate a compression stream\n"); return -ENOMEM; } + spin_lock_init(&zstrm->zcomp_lock); *per_cpu_ptr(comp->stream, cpu) = zstrm; return 0; } @ drivers/block/zram/zcomp.h:13 @ struct zcomp_strm { /* compression/decompression buffer */ void *buffer; struct crypto_comp *tfm; + spinlock_t zcomp_lock; }; /* dynamic per-device compression frontend */ @ drivers/block/zram/zram_drv.c:58 @ static void zram_free_page(struct zram *zram, size_t index); static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec, u32 index, int offset, struct bio *bio); +#ifdef CONFIG_PREEMPT_RT +static void zram_meta_init_table_locks(struct zram *zram, size_t num_pages) +{ + size_t index; + + for (index = 0; index < num_pages; index++) + spin_lock_init(&zram->table[index].lock); +} + +static int zram_slot_trylock(struct zram *zram, u32 index) +{ + int ret; + + ret = spin_trylock(&zram->table[index].lock); + if (ret) + __set_bit(ZRAM_LOCK, &zram->table[index].flags); + return ret; +} + +static void zram_slot_lock(struct zram *zram, u32 index) +{ + spin_lock(&zram->table[index].lock); + __set_bit(ZRAM_LOCK, &zram->table[index].flags); +} + +static void zram_slot_unlock(struct zram *zram, u32 index) +{ + __clear_bit(ZRAM_LOCK, &zram->table[index].flags); + spin_unlock(&zram->table[index].lock); +} + +#else + +static void zram_meta_init_table_locks(struct zram *zram, size_t num_pages) { } static int zram_slot_trylock(struct zram *zram, u32 index) { @ drivers/block/zram/zram_drv.c:107 @ static void zram_slot_unlock(struct zram *zram, u32 index) { bit_spin_unlock(ZRAM_LOCK, &zram->table[index].flags); } +#endif static inline bool init_done(struct zram *zram) { @ drivers/block/zram/zram_drv.c:1195 @ static bool zram_meta_alloc(struct zram *zram, u64 disksize) if (!huge_class_size) huge_class_size = zs_huge_class_size(zram->mem_pool); + zram_meta_init_table_locks(zram, num_pages); return true; } @ drivers/block/zram/zram_drv.c:1258 @ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index, unsigned long handle; unsigned int size; void *src, *dst; + struct zcomp_strm *zstrm; zram_slot_lock(zram, index); if (zram_test_flag(zram, index, ZRAM_WB)) { @ drivers/block/zram/zram_drv.c:1289 @ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index, size = zram_get_obj_size(zram, index); + zstrm = zcomp_stream_get(zram->comp); src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO); if (size == PAGE_SIZE) { dst = kmap_atomic(page); @ drivers/block/zram/zram_drv.c:1297 @ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index, kunmap_atomic(dst); ret = 0; } else { - struct zcomp_strm *zstrm = zcomp_stream_get(zram->comp); dst = kmap_atomic(page); ret = zcomp_decompress(zstrm, src, size, dst); kunmap_atomic(dst); - zcomp_stream_put(zram->comp); } zs_unmap_object(zram->mem_pool, handle); + zcomp_stream_put(zram->comp); zram_slot_unlock(zram, index); /* Should NEVER happen. Return bio error if it does. */ @ drivers/block/zram/zram_drv.h:66 @ struct zram_table_entry { unsigned long element; }; unsigned long flags; + spinlock_t lock; #ifdef CONFIG_ZRAM_MEMORY_TRACKING ktime_t ac_time; #endif @ drivers/char/random.c:1227 @ static __u32 get_reg(struct fast_pool *f, struct pt_regs *regs) return *ptr; } -void add_interrupt_randomness(int irq, int irq_flags) +void add_interrupt_randomness(int irq, int irq_flags, __u64 ip) { struct entropy_store *r; struct fast_pool *fast_pool = this_cpu_ptr(&irq_randomness); - struct pt_regs *regs = get_irq_regs(); unsigned long now = jiffies; cycles_t cycles = random_get_entropy(); __u32 c_high, j_high; - __u64 ip; unsigned long seed; int credit = 0; if (cycles == 0) - cycles = get_reg(fast_pool, regs); + cycles = get_reg(fast_pool, NULL); c_high = (sizeof(cycles) > 4) ? cycles >> 32 : 0; j_high = (sizeof(now) > 4) ? now >> 32 : 0; fast_pool->pool[0] ^= cycles ^ j_high ^ irq; fast_pool->pool[1] ^= now ^ c_high; - ip = regs ? instruction_pointer(regs) : _RET_IP_; + if (!ip) + ip = _RET_IP_; fast_pool->pool[2] ^= ip; fast_pool->pool[3] ^= (sizeof(ip) > 4) ? ip >> 32 : - get_reg(fast_pool, regs); + get_reg(fast_pool, NULL); fast_mix(fast_pool); add_interrupt_bench(cycles); @ drivers/char/tpm/tpm-dev-common.c:23 @ #include "tpm-dev.h" static struct workqueue_struct *tpm_dev_wq; -static DEFINE_MUTEX(tpm_dev_wq_lock); static ssize_t tpm_dev_transmit(struct tpm_chip *chip, struct tpm_space *space, u8 *buf, size_t bufsiz) @ drivers/char/tpm/tpm_tis.c:52 @ static inline struct tpm_tis_tcg_phy *to_tpm_tis_tcg_phy(struct tpm_tis_data *da return container_of(data, struct tpm_tis_tcg_phy, priv); } +#ifdef CONFIG_PREEMPT_RT +/* + * Flushes previous write operations to chip so that a subsequent + * ioread*()s won't stall a cpu. + */ +static inline void tpm_tis_flush(void __iomem *iobase) +{ + ioread8(iobase + TPM_ACCESS(0)); +} +#else +#define tpm_tis_flush(iobase) do { } while (0) +#endif + +static inline void tpm_tis_iowrite8(u8 b, void __iomem *iobase, u32 addr) +{ + iowrite8(b, iobase + addr); + tpm_tis_flush(iobase); +} + +static inline void tpm_tis_iowrite32(u32 b, void __iomem *iobase, u32 addr) +{ + iowrite32(b, iobase + addr); + tpm_tis_flush(iobase); +} + static bool interrupts = true; module_param(interrupts, bool, 0444); MODULE_PARM_DESC(interrupts, "Enable interrupts"); @ drivers/char/tpm/tpm_tis.c:174 @ static int tpm_tcg_write_bytes(struct tpm_tis_data *data, u32 addr, u16 len, struct tpm_tis_tcg_phy *phy = to_tpm_tis_tcg_phy(data); while (len--) - iowrite8(*value++, phy->iobase + addr); + tpm_tis_iowrite8(*value++, phy->iobase, addr); return 0; } @ drivers/char/tpm/tpm_tis.c:201 @ static int tpm_tcg_write32(struct tpm_tis_data *data, u32 addr, u32 value) { struct tpm_tis_tcg_phy *phy = to_tpm_tis_tcg_phy(data); - iowrite32(value, phy->iobase + addr); + tpm_tis_iowrite32(value, phy->iobase, addr); return 0; } @ drivers/clocksource/Kconfig:437 @ config ATMEL_TCB_CLKSRC help Support for Timer Counter Blocks on Atmel SoCs. +config ATMEL_TCB_CLKSRC_USE_SLOW_CLOCK + bool "TC Block use 32 KiHz clock" + depends on ATMEL_TCB_CLKSRC + default y + help + Select this to use 32 KiHz base clock rate as TC block clock. + config CLKSRC_EXYNOS_MCT bool "Exynos multi core timer driver" if COMPILE_TEST depends on ARM || ARM64 @ drivers/clocksource/timer-atmel-tcb.c:31 @ * this 32 bit free-running counter. the second channel is not used. * * - The third channel may be used to provide a 16-bit clockevent - * source, used in either periodic or oneshot mode. This runs - * at 32 KiHZ, and can handle delays of up to two seconds. + * source, used in either periodic or oneshot mode. * * REVISIT behavior during system suspend states... we should disable * all clocks and save the power. Easily done for clockevent devices, @ drivers/clocksource/timer-atmel-tcb.c:145 @ static unsigned long notrace tc_delay_timer_read32(void) struct tc_clkevt_device { struct clock_event_device clkevt; struct clk *clk; + bool clk_enabled; + u32 freq; void __iomem *regs; }; @ drivers/clocksource/timer-atmel-tcb.c:155 @ static struct tc_clkevt_device *to_tc_clkevt(struct clock_event_device *clkevt) return container_of(clkevt, struct tc_clkevt_device, clkevt); } -/* For now, we always use the 32K clock ... this optimizes for NO_HZ, - * because using one of the divided clocks would usually mean the - * tick rate can never be less than several dozen Hz (vs 0.5 Hz). - * - * A divided clock could be good for high resolution timers, since - * 30.5 usec resolution can seem "low". - */ static u32 timer_clock; +static void tc_clk_disable(struct clock_event_device *d) +{ + struct tc_clkevt_device *tcd = to_tc_clkevt(d); + + clk_disable(tcd->clk); + tcd->clk_enabled = false; +} + +static void tc_clk_enable(struct clock_event_device *d) +{ + struct tc_clkevt_device *tcd = to_tc_clkevt(d); + + if (tcd->clk_enabled) + return; + clk_enable(tcd->clk); + tcd->clk_enabled = true; +} + static int tc_shutdown(struct clock_event_device *d) { struct tc_clkevt_device *tcd = to_tc_clkevt(d); @ drivers/clocksource/timer-atmel-tcb.c:182 @ static int tc_shutdown(struct clock_event_device *d) writel(0xff, regs + ATMEL_TC_REG(2, IDR)); writel(ATMEL_TC_CLKDIS, regs + ATMEL_TC_REG(2, CCR)); + return 0; +} + +static int tc_shutdown_clk_off(struct clock_event_device *d) +{ + tc_shutdown(d); if (!clockevent_state_detached(d)) - clk_disable(tcd->clk); + tc_clk_disable(d); return 0; } @ drivers/clocksource/timer-atmel-tcb.c:202 @ static int tc_set_oneshot(struct clock_event_device *d) if (clockevent_state_oneshot(d) || clockevent_state_periodic(d)) tc_shutdown(d); - clk_enable(tcd->clk); + tc_clk_enable(d); - /* slow clock, count up to RC, then irq and stop */ + /* count up to RC, then irq and stop */ writel(timer_clock | ATMEL_TC_CPCSTOP | ATMEL_TC_WAVE | ATMEL_TC_WAVESEL_UP_AUTO, regs + ATMEL_TC_REG(2, CMR)); writel(ATMEL_TC_CPCS, regs + ATMEL_TC_REG(2, IER)); @ drivers/clocksource/timer-atmel-tcb.c:224 @ static int tc_set_periodic(struct clock_event_device *d) /* By not making the gentime core emulate periodic mode on top * of oneshot, we get lower overhead and improved accuracy. */ - clk_enable(tcd->clk); + tc_clk_enable(d); - /* slow clock, count up to RC, then irq and restart */ + /* count up to RC, then irq and restart */ writel(timer_clock | ATMEL_TC_WAVE | ATMEL_TC_WAVESEL_UP_AUTO, regs + ATMEL_TC_REG(2, CMR)); - writel((32768 + HZ / 2) / HZ, tcaddr + ATMEL_TC_REG(2, RC)); + writel((tcd->freq + HZ / 2) / HZ, tcaddr + ATMEL_TC_REG(2, RC)); /* Enable clock and interrupts on RC compare */ writel(ATMEL_TC_CPCS, regs + ATMEL_TC_REG(2, IER)); @ drivers/clocksource/timer-atmel-tcb.c:255 @ static struct tc_clkevt_device clkevt = { .features = CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_ONESHOT, /* Should be lower than at91rm9200's system timer */ +#ifdef CONFIG_ATMEL_TCB_CLKSRC_USE_SLOW_CLOCK .rating = 125, +#else + .rating = 200, +#endif .set_next_event = tc_next_event, - .set_state_shutdown = tc_shutdown, + .set_state_shutdown = tc_shutdown_clk_off, .set_state_periodic = tc_set_periodic, .set_state_oneshot = tc_set_oneshot, }, @ drivers/clocksource/timer-atmel-tcb.c:281 @ static irqreturn_t ch2_irq(int irq, void *handle) return IRQ_NONE; } -static int __init setup_clkevents(struct atmel_tc *tc, int clk32k_divisor_idx) +static const u8 atmel_tcb_divisors[5] = { 2, 8, 32, 128, 0, }; + +static int __init setup_clkevents(struct atmel_tc *tc, int divisor_idx) { + unsigned divisor = atmel_tcb_divisors[divisor_idx]; int ret; struct clk *t2_clk = tc->clk[2]; int irq = tc->irq[2]; @ drivers/clocksource/timer-atmel-tcb.c:306 @ static int __init setup_clkevents(struct atmel_tc *tc, int clk32k_divisor_idx) clkevt.regs = tc->regs; clkevt.clk = t2_clk; - timer_clock = clk32k_divisor_idx; + timer_clock = divisor_idx; + if (!divisor) + clkevt.freq = 32768; + else + clkevt.freq = clk_get_rate(t2_clk) / divisor; clkevt.clkevt.cpumask = cpumask_of(0); @ drivers/clocksource/timer-atmel-tcb.c:321 @ static int __init setup_clkevents(struct atmel_tc *tc, int clk32k_divisor_idx) return ret; } - clockevents_config_and_register(&clkevt.clkevt, 32768, 1, 0xffff); + clockevents_config_and_register(&clkevt.clkevt, clkevt.freq, 1, 0xffff); return ret; } @ drivers/clocksource/timer-atmel-tcb.c:378 @ static void __init tcb_setup_single_chan(struct atmel_tc *tc, int mck_divisor_id writel(ATMEL_TC_SYNC, tcaddr + ATMEL_TC_BCR); } -static const u8 atmel_tcb_divisors[5] = { 2, 8, 32, 128, 0, }; - static const struct of_device_id atmel_tcb_of_match[] = { { .compatible = "atmel,at91rm9200-tcb", .data = (void *)16, }, { .compatible = "atmel,at91sam9x5-tcb", .data = (void *)32, }, @ drivers/clocksource/timer-atmel-tcb.c:497 @ static int __init tcb_clksrc_init(struct device_node *node) goto err_disable_t1; /* channel 2: periodic and oneshot timer support */ +#ifdef CONFIG_ATMEL_TCB_CLKSRC_USE_SLOW_CLOCK ret = setup_clkevents(&tc, clk32k_divisor_idx); +#else + ret = setup_clkevents(&tc, best_divisor_idx); +#endif if (ret) goto err_unregister_clksrc; @ drivers/connector/cn_proc.c:21 @ #include <linux/pid_namespace.h> #include <linux/cn_proc.h> +#include <linux/locallock.h> /* * Size of a cn_msg followed by a proc_event structure. Since the @ drivers/connector/cn_proc.c:44 @ static struct cb_id cn_proc_event_id = { CN_IDX_PROC, CN_VAL_PROC }; /* proc_event_counts is used as the sequence number of the netlink message */ static DEFINE_PER_CPU(__u32, proc_event_counts) = { 0 }; +static DEFINE_LOCAL_IRQ_LOCK(send_msg_lock); static inline void send_msg(struct cn_msg *msg) { - preempt_disable(); + local_lock(send_msg_lock); msg->seq = __this_cpu_inc_return(proc_event_counts) - 1; ((struct proc_event *)msg->data)->cpu = smp_processor_id(); @ drivers/connector/cn_proc.c:61 @ static inline void send_msg(struct cn_msg *msg) */ cn_netlink_send(msg, 0, CN_IDX_PROC, GFP_NOWAIT); - preempt_enable(); + local_unlock(send_msg_lock); } void proc_fork_connector(struct task_struct *task) @ drivers/dma-buf/dma-resv.c:53 @ DEFINE_WD_CLASS(reservation_ww_class); EXPORT_SYMBOL(reservation_ww_class); -struct lock_class_key reservation_seqcount_class; -EXPORT_SYMBOL(reservation_seqcount_class); - -const char reservation_seqcount_string[] = "reservation_seqcount"; -EXPORT_SYMBOL(reservation_seqcount_string); - /** * dma_resv_list_alloc - allocate fence list * @shared_max: number of fences we need space for @ drivers/dma-buf/dma-resv.c:131 @ subsys_initcall(dma_resv_lockdep); void dma_resv_init(struct dma_resv *obj) { ww_mutex_init(&obj->lock, &reservation_ww_class); + seqcount_ww_mutex_init(&obj->seq, &obj->lock); - __seqcount_init(&obj->seq, reservation_seqcount_string, - &reservation_seqcount_class); RCU_INIT_POINTER(obj->fence, NULL); RCU_INIT_POINTER(obj->fence_excl, NULL); } @ drivers/dma-buf/dma-resv.c:262 @ void dma_resv_add_shared_fence(struct dma_resv *obj, struct dma_fence *fence) fobj = dma_resv_get_list(obj); count = fobj->shared_count; - preempt_disable(); write_seqcount_begin(&obj->seq); for (i = 0; i < count; ++i) { @ drivers/dma-buf/dma-resv.c:283 @ void dma_resv_add_shared_fence(struct dma_resv *obj, struct dma_fence *fence) smp_store_mb(fobj->shared_count, count); write_seqcount_end(&obj->seq); - preempt_enable(); dma_fence_put(old); } EXPORT_SYMBOL(dma_resv_add_shared_fence); @ drivers/dma-buf/dma-resv.c:309 @ void dma_resv_add_excl_fence(struct dma_resv *obj, struct dma_fence *fence) if (fence) dma_fence_get(fence); - preempt_disable(); write_seqcount_begin(&obj->seq); /* write_seqcount_begin provides the necessary memory barrier */ RCU_INIT_POINTER(obj->fence_excl, fence); if (old) old->shared_count = 0; write_seqcount_end(&obj->seq); - preempt_enable(); /* inplace update, no shared fences */ while (i--) @ drivers/dma-buf/dma-resv.c:392 @ int dma_resv_copy_fences(struct dma_resv *dst, struct dma_resv *src) src_list = dma_resv_get_list(dst); old = dma_resv_get_excl(dst); - preempt_disable(); write_seqcount_begin(&dst->seq); /* write_seqcount_begin provides the necessary memory barrier */ RCU_INIT_POINTER(dst->fence_excl, new); RCU_INIT_POINTER(dst->fence, dst_list); write_seqcount_end(&dst->seq); - preempt_enable(); dma_resv_list_free(src_list); dma_fence_put(old); @ drivers/firmware/efi/efi.c:71 @ struct mm_struct efi_mm = { struct workqueue_struct *efi_rts_wq; -static bool disable_runtime; +static bool disable_runtime = IS_ENABLED(CONFIG_PREEMPT_RT); static int __init setup_noefi(char *arg) { disable_runtime = true; @ drivers/firmware/efi/efi.c:102 @ static int __init parse_efi_cmdline(char *str) if (parse_option_str(str, "noruntime")) disable_runtime = true; + if (parse_option_str(str, "runtime")) + disable_runtime = false; + if (parse_option_str(str, "nosoftreserve")) set_bit(EFI_MEM_NO_SOFT_RESERVE, &efi.flags); @ drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:260 @ static int amdgpu_amdkfd_remove_eviction_fence(struct amdgpu_bo *bo, new->shared_count = k; /* Install the new fence list, seqcount provides the barriers */ - preempt_disable(); write_seqcount_begin(&resv->seq); RCU_INIT_POINTER(resv->fence, new); write_seqcount_end(&resv->seq); - preempt_enable(); /* Drop the references to the removed fences or move them to ef_list */ for (i = j, k = 0; i < old->shared_count; ++i) { @ drivers/gpu/drm/i915/display/intel_sprite.c:41 @ #include <drm/drm_plane_helper.h> #include <drm/drm_rect.h> #include <drm/i915_drm.h> +#include <linux/locallock.h> #include "i915_drv.h" #include "i915_trace.h" @ drivers/gpu/drm/i915/display/intel_sprite.c:71 @ int intel_usecs_to_scanlines(const struct drm_display_mode *adjusted_mode, #define VBLANK_EVASION_TIME_US 100 #endif +static DEFINE_LOCAL_IRQ_LOCK(pipe_update_lock); + /** * intel_pipe_update_start() - start update of a set of display registers * @new_crtc_state: the new crtc state @ drivers/gpu/drm/i915/display/intel_sprite.c:122 @ void intel_pipe_update_start(const struct intel_crtc_state *new_crtc_state) DRM_ERROR("PSR idle timed out 0x%x, atomic update may fail\n", psr_status); - local_irq_disable(); + local_lock_irq(pipe_update_lock); crtc->debug.min_vbl = min; crtc->debug.max_vbl = max; @ drivers/gpu/drm/i915/display/intel_sprite.c:146 @ void intel_pipe_update_start(const struct intel_crtc_state *new_crtc_state) break; } - local_irq_enable(); + local_unlock_irq(pipe_update_lock); timeout = schedule_timeout(timeout); - local_irq_disable(); + local_lock_irq(pipe_update_lock); } finish_wait(wq, &wait); @ drivers/gpu/drm/i915/display/intel_sprite.c:183 @ void intel_pipe_update_start(const struct intel_crtc_state *new_crtc_state) return; irq_disable: - local_irq_disable(); + local_lock_irq(pipe_update_lock); } /** @ drivers/gpu/drm/i915/display/intel_sprite.c:220 @ void intel_pipe_update_end(struct intel_crtc_state *new_crtc_state) new_crtc_state->uapi.event = NULL; } - local_irq_enable(); + local_unlock_irq(pipe_update_lock); if (intel_vgpu_active(dev_priv)) return; @ drivers/gpu/drm/i915/gt/intel_engine_pm.c:67 @ static int __engine_unpark(struct intel_wakeref *wf) } #if IS_ENABLED(CONFIG_LOCKDEP) +#include <linux/locallock.h> + +static DEFINE_LOCAL_IRQ_LOCK(timeline_lock); static inline unsigned long __timeline_mark_lock(struct intel_context *ce) { unsigned long flags; - local_irq_save(flags); + local_lock_irqsave(timeline_lock, flags); mutex_acquire(&ce->timeline->mutex.dep_map, 2, 0, _THIS_IP_); return flags; @ drivers/gpu/drm/i915/gt/intel_engine_pm.c:85 @ static inline void __timeline_mark_unlock(struct intel_context *ce, unsigned long flags) { mutex_release(&ce->timeline->mutex.dep_map, _THIS_IP_); - local_irq_restore(flags); + local_unlock_irqrestore(timeline_lock, flags); } #else @ drivers/gpu/drm/i915/i915_irq.c:806 @ bool i915_get_crtc_scanoutpos(struct drm_device *dev, unsigned int index, spin_lock_irqsave(&dev_priv->uncore.lock, irqflags); /* preempt_disable_rt() should go right here in PREEMPT_RT patchset. */ + preempt_disable_rt(); /* Get optional system timestamp before query. */ if (stime) @ drivers/gpu/drm/i915/i915_irq.c:858 @ bool i915_get_crtc_scanoutpos(struct drm_device *dev, unsigned int index, *etime = ktime_get(); /* preempt_enable_rt() should go right here in PREEMPT_RT patchset. */ + preempt_enable_rt(); spin_unlock_irqrestore(&dev_priv->uncore.lock, irqflags); @ drivers/gpu/drm/i915/i915_trace.h:5 @ #if !defined(_I915_TRACE_H_) || defined(TRACE_HEADER_MULTI_READ) #define _I915_TRACE_H_ +#ifdef CONFIG_PREEMPT_RT +#define NOTRACE +#endif + #include <linux/stringify.h> #include <linux/types.h> #include <linux/tracepoint.h> @ drivers/gpu/drm/i915/i915_trace.h:723 @ DEFINE_EVENT(i915_request, i915_request_add, TP_ARGS(rq) ); -#if defined(CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS) +#if defined(CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS) && !defined(NOTRACE) DEFINE_EVENT(i915_request, i915_request_submit, TP_PROTO(struct i915_request *rq), TP_ARGS(rq) @ drivers/gpu/drm/radeon/radeon_display.c:1815 @ int radeon_get_crtc_scanoutpos(struct drm_device *dev, unsigned int pipe, struct radeon_device *rdev = dev->dev_private; /* preempt_disable_rt() should go right here in PREEMPT_RT patchset. */ + preempt_disable_rt(); /* Get optional system timestamp before query. */ if (stime) @ drivers/gpu/drm/radeon/radeon_display.c:1908 @ int radeon_get_crtc_scanoutpos(struct drm_device *dev, unsigned int pipe, *etime = ktime_get(); /* preempt_enable_rt() should go right here in PREEMPT_RT patchset. */ + preempt_enable_rt(); /* Decode into vertical and horizontal scanout position. */ *vpos = position & 0x1fff; @ drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:482 @ struct vmw_private { bool assume_16bpp; bool has_sm4_1; - /* - * VGA registers. - */ - - struct vmw_vga_topology_state vga_save[VMWGFX_MAX_DISPLAYS]; - uint32_t vga_width; - uint32_t vga_height; - uint32_t vga_bpp; - uint32_t vga_bpl; - uint32_t vga_pitchlock; - - uint32_t num_displays; - /* * Framebuffer info. */ @ drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:890 @ extern void vmw_fifo_commit(struct vmw_private *dev_priv, uint32_t bytes); extern void vmw_fifo_commit_flush(struct vmw_private *dev_priv, uint32_t bytes); extern int vmw_fifo_send_fence(struct vmw_private *dev_priv, uint32_t *seqno); -extern void vmw_fifo_ping_host_locked(struct vmw_private *, uint32_t reason); extern void vmw_fifo_ping_host(struct vmw_private *dev_priv, uint32_t reason); extern bool vmw_fifo_have_3d(struct vmw_private *dev_priv); extern bool vmw_fifo_have_pitchlock(struct vmw_private *dev_priv); @ drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:936 @ extern struct ttm_placement vmw_mob_placement; extern struct ttm_placement vmw_mob_ne_placement; extern struct ttm_placement vmw_nonfixed_placement; extern struct ttm_bo_driver vmw_bo_driver; -extern int vmw_dma_quiescent(struct drm_device *dev); extern int vmw_bo_map_dma(struct ttm_buffer_object *bo); extern void vmw_bo_unmap_dma(struct ttm_buffer_object *bo); extern const struct vmw_sg_table * @ drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:1073 @ int vmw_fb_on(struct vmw_private *vmw_priv); int vmw_kms_init(struct vmw_private *dev_priv); int vmw_kms_close(struct vmw_private *dev_priv); -int vmw_kms_save_vga(struct vmw_private *vmw_priv); -int vmw_kms_restore_vga(struct vmw_private *vmw_priv); int vmw_kms_cursor_bypass_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv); void vmw_kms_cursor_post_execbuf(struct vmw_private *dev_priv); @ drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:1125 @ int vmw_overlay_init(struct vmw_private *dev_priv); int vmw_overlay_close(struct vmw_private *dev_priv); int vmw_overlay_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv); -int vmw_overlay_stop_all(struct vmw_private *dev_priv); int vmw_overlay_resume_all(struct vmw_private *dev_priv); int vmw_overlay_pause_all(struct vmw_private *dev_priv); int vmw_overlay_claim(struct vmw_private *dev_priv, uint32_t *out); @ drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:1171 @ extern void vmw_otables_takedown(struct vmw_private *dev_priv); extern const struct vmw_user_resource_conv *user_context_converter; -extern int vmw_context_check(struct vmw_private *dev_priv, - struct ttm_object_file *tfile, - int id, - struct vmw_resource **p_res); extern int vmw_context_define_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv); extern int vmw_extended_context_define_ioctl(struct drm_device *dev, void *data, @ drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:1200 @ vmw_context_get_dx_query_mob(struct vmw_resource *ctx_res); extern const struct vmw_user_resource_conv *user_surface_converter; -extern void vmw_surface_res_free(struct vmw_resource *res); extern int vmw_surface_destroy_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv); extern int vmw_surface_define_ioctl(struct drm_device *dev, void *data, @ drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:1210 @ extern int vmw_gb_surface_define_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv); extern int vmw_gb_surface_reference_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv); -extern int vmw_surface_check(struct vmw_private *dev_priv, - struct ttm_object_file *tfile, - uint32_t handle, int *id); -extern int vmw_surface_validate(struct vmw_private *dev_priv, - struct vmw_surface *srf); int vmw_surface_gb_priv_define(struct drm_device *dev, uint32_t user_accounting_size, SVGA3dSurfaceAllFlags svga3d_flags, @ drivers/gpu/drm/vmwgfx/vmwgfx_fifo.c:172 @ void vmw_fifo_ping_host(struct vmw_private *dev_priv, uint32_t reason) { u32 *fifo_mem = dev_priv->mmio_virt; - preempt_disable(); if (cmpxchg(fifo_mem + SVGA_FIFO_BUSY, 0, 1) == 0) vmw_write(dev_priv, SVGA_REG_SYNC, reason); - preempt_enable(); } void vmw_fifo_release(struct vmw_private *dev_priv, struct vmw_fifo_state *fifo) @ drivers/gpu/drm/vmwgfx/vmwgfx_kms.c:1900 @ int vmw_kms_write_svga(struct vmw_private *vmw_priv, return 0; } -int vmw_kms_save_vga(struct vmw_private *vmw_priv) -{ - struct vmw_vga_topology_state *save; - uint32_t i; - - vmw_priv->vga_width = vmw_read(vmw_priv, SVGA_REG_WIDTH); - vmw_priv->vga_height = vmw_read(vmw_priv, SVGA_REG_HEIGHT); - vmw_priv->vga_bpp = vmw_read(vmw_priv, SVGA_REG_BITS_PER_PIXEL); - if (vmw_priv->capabilities & SVGA_CAP_PITCHLOCK) - vmw_priv->vga_pitchlock = - vmw_read(vmw_priv, SVGA_REG_PITCHLOCK); - else if (vmw_fifo_have_pitchlock(vmw_priv)) - vmw_priv->vga_pitchlock = vmw_mmio_read(vmw_priv->mmio_virt + - SVGA_FIFO_PITCHLOCK); - - if (!(vmw_priv->capabilities & SVGA_CAP_DISPLAY_TOPOLOGY)) - return 0; - - vmw_priv->num_displays = vmw_read(vmw_priv, - SVGA_REG_NUM_GUEST_DISPLAYS); - - if (vmw_priv->num_displays == 0) - vmw_priv->num_displays = 1; - - for (i = 0; i < vmw_priv->num_displays; ++i) { - save = &vmw_priv->vga_save[i]; - vmw_write(vmw_priv, SVGA_REG_DISPLAY_ID, i); - save->primary = vmw_read(vmw_priv, SVGA_REG_DISPLAY_IS_PRIMARY); - save->pos_x = vmw_read(vmw_priv, SVGA_REG_DISPLAY_POSITION_X); - save->pos_y = vmw_read(vmw_priv, SVGA_REG_DISPLAY_POSITION_Y); - save->width = vmw_read(vmw_priv, SVGA_REG_DISPLAY_WIDTH); - save->height = vmw_read(vmw_priv, SVGA_REG_DISPLAY_HEIGHT); - vmw_write(vmw_priv, SVGA_REG_DISPLAY_ID, SVGA_ID_INVALID); - if (i == 0 && vmw_priv->num_displays == 1 && - save->width == 0 && save->height == 0) { - - /* - * It should be fairly safe to assume that these - * values are uninitialized. - */ - - save->width = vmw_priv->vga_width - save->pos_x; - save->height = vmw_priv->vga_height - save->pos_y; - } - } - - return 0; -} - -int vmw_kms_restore_vga(struct vmw_private *vmw_priv) -{ - struct vmw_vga_topology_state *save; - uint32_t i; - - vmw_write(vmw_priv, SVGA_REG_WIDTH, vmw_priv->vga_width); - vmw_write(vmw_priv, SVGA_REG_HEIGHT, vmw_priv->vga_height); - vmw_write(vmw_priv, SVGA_REG_BITS_PER_PIXEL, vmw_priv->vga_bpp); - if (vmw_priv->capabilities & SVGA_CAP_PITCHLOCK) - vmw_write(vmw_priv, SVGA_REG_PITCHLOCK, - vmw_priv->vga_pitchlock); - else if (vmw_fifo_have_pitchlock(vmw_priv)) - vmw_mmio_write(vmw_priv->vga_pitchlock, - vmw_priv->mmio_virt + SVGA_FIFO_PITCHLOCK); - - if (!(vmw_priv->capabilities & SVGA_CAP_DISPLAY_TOPOLOGY)) - return 0; - - for (i = 0; i < vmw_priv->num_displays; ++i) { - save = &vmw_priv->vga_save[i]; - vmw_write(vmw_priv, SVGA_REG_DISPLAY_ID, i); - vmw_write(vmw_priv, SVGA_REG_DISPLAY_IS_PRIMARY, save->primary); - vmw_write(vmw_priv, SVGA_REG_DISPLAY_POSITION_X, save->pos_x); - vmw_write(vmw_priv, SVGA_REG_DISPLAY_POSITION_Y, save->pos_y); - vmw_write(vmw_priv, SVGA_REG_DISPLAY_WIDTH, save->width); - vmw_write(vmw_priv, SVGA_REG_DISPLAY_HEIGHT, save->height); - vmw_write(vmw_priv, SVGA_REG_DISPLAY_ID, SVGA_ID_INVALID); - } - - return 0; -} - bool vmw_kms_validate_mode_vram(struct vmw_private *dev_priv, uint32_t pitch, uint32_t height) @ drivers/gpu/drm/vmwgfx/vmwgfx_overlay.c:356 @ static int vmw_overlay_update_stream(struct vmw_private *dev_priv, return 0; } -/** - * Stop all streams. - * - * Used by the fb code when starting. - * - * Takes the overlay lock. - */ -int vmw_overlay_stop_all(struct vmw_private *dev_priv) -{ - struct vmw_overlay *overlay = dev_priv->overlay_priv; - int i, ret; - - if (!overlay) - return 0; - - mutex_lock(&overlay->mutex); - - for (i = 0; i < VMW_MAX_NUM_STREAMS; i++) { - struct vmw_stream *stream = &overlay->stream[i]; - if (!stream->buf) - continue; - - ret = vmw_overlay_stop(dev_priv, i, false, false); - WARN_ON(ret != 0); - } - - mutex_unlock(&overlay->mutex); - - return 0; -} - /** * Try to resume all paused streams. * @ drivers/hv/hyperv_vmbus.h:21 @ #include <linux/atomic.h> #include <linux/hyperv.h> #include <linux/interrupt.h> +#include <linux/irq.h> #include "hv_trace.h" @ drivers/hv/vmbus_drv.c:25 @ #include <linux/clockchips.h> #include <linux/cpu.h> #include <linux/sched/task_stack.h> +#include <linux/irq.h> #include <asm/mshyperv.h> #include <linux/delay.h> @ drivers/hv/vmbus_drv.c:1251 @ static void vmbus_isr(void) void *page_addr = hv_cpu->synic_event_page; struct hv_message *msg; union hv_synic_event_flags *event; + struct pt_regs *regs = get_irq_regs(); + u64 ip = regs ? instruction_pointer(regs) : 0; bool handled = false; if (unlikely(page_addr == NULL)) @ drivers/hv/vmbus_drv.c:1297 @ static void vmbus_isr(void) tasklet_schedule(&hv_cpu->msg_dpc); } - add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR, 0); + add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR, 0, ip); } /* @ drivers/leds/trigger/Kconfig:67 @ config LEDS_TRIGGER_BACKLIGHT config LEDS_TRIGGER_CPU bool "LED CPU Trigger" + depends on !PREEMPT_RT help This allows LEDs to be controlled by active CPUs. This shows the active CPUs across an array of LEDs so you can see which @ drivers/md/raid5.c:2061 @ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request) struct raid5_percpu *percpu; unsigned long cpu; - cpu = get_cpu(); + cpu = get_cpu_light(); percpu = per_cpu_ptr(conf->percpu, cpu); + spin_lock(&percpu->lock); if (test_bit(STRIPE_OP_BIOFILL, &ops_request)) { ops_run_biofill(sh); overlap_clear++; @ drivers/md/raid5.c:2122 @ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request) if (test_and_clear_bit(R5_Overlap, &dev->flags)) wake_up(&sh->raid_conf->wait_for_overlap); } - put_cpu(); + spin_unlock(&percpu->lock); + put_cpu_light(); } static void free_stripe(struct kmem_cache *sc, struct stripe_head *sh) @ drivers/md/raid5.c:6820 @ static int raid456_cpu_up_prepare(unsigned int cpu, struct hlist_node *node) __func__, cpu); return -ENOMEM; } + spin_lock_init(&per_cpu_ptr(conf->percpu, cpu)->lock); return 0; } @ drivers/md/raid5.c:6935 @ static struct r5conf *setup_conf(struct mddev *mddev) } else goto abort; spin_lock_init(&conf->device_lock); - seqcount_init(&conf->gen_lock); + seqcount_spinlock_init(&conf->gen_lock, &conf->device_lock); mutex_init(&conf->cache_size_mutex); init_waitqueue_head(&conf->wait_for_quiescent); init_waitqueue_head(&conf->wait_for_stripe); @ drivers/md/raid5.h:592 @ struct r5conf { int prev_chunk_sectors; int prev_algo; short generation; /* increments with every reshape */ - seqcount_t gen_lock; /* lock against generation changes */ + seqcount_spinlock_t gen_lock; /* lock against generation changes */ unsigned long reshape_checkpoint; /* Time we last updated * metadata */ long long min_offset_diff; /* minimum difference between @ drivers/md/raid5.h:637 @ struct r5conf { int recovery_disabled; /* per cpu variables */ struct raid5_percpu { + spinlock_t lock; /* Protection for -RT */ struct page *spare_page; /* Used when checking P/Q in raid6 */ void *scribble; /* space for constructing buffer * lists and performing address @ drivers/net/phy/fixed_phy.c:22 @ #include <linux/slab.h> #include <linux/of.h> #include <linux/gpio/consumer.h> -#include <linux/seqlock.h> #include <linux/idr.h> #include <linux/netdevice.h> #include <linux/linkmode.h> @ drivers/net/phy/fixed_phy.c:36 @ struct fixed_mdio_bus { struct fixed_phy { int addr; struct phy_device *phydev; - seqcount_t seqcount; struct fixed_phy_status status; bool no_carrier; int (*link_update)(struct net_device *, struct fixed_phy_status *); @ drivers/net/phy/fixed_phy.c:81 @ static int fixed_mdio_read(struct mii_bus *bus, int phy_addr, int reg_num) list_for_each_entry(fp, &fmb->phys, node) { if (fp->addr == phy_addr) { struct fixed_phy_status state; - int s; - do { - s = read_seqcount_begin(&fp->seqcount); - fp->status.link = !fp->no_carrier; - /* Issue callback if user registered it. */ - if (fp->link_update) - fp->link_update(fp->phydev->attached_dev, - &fp->status); - /* Check the GPIO for change in status */ - fixed_phy_update(fp); - state = fp->status; - } while (read_seqcount_retry(&fp->seqcount, s)); + fp->status.link = !fp->no_carrier; + + /* Issue callback if user registered it. */ + if (fp->link_update) + fp->link_update(fp->phydev->attached_dev, + &fp->status); + + /* Check the GPIO for change in status */ + fixed_phy_update(fp); + state = fp->status; return swphy_read_reg(reg_num, &state); } @ drivers/net/phy/fixed_phy.c:149 @ static int fixed_phy_add_gpiod(unsigned int irq, int phy_addr, if (!fp) return -ENOMEM; - seqcount_init(&fp->seqcount); - if (irq != PHY_POLL) fmb->mii_bus->irq[phy_addr] = irq; @ drivers/net/phy/mdio_bus.c:743 @ EXPORT_SYMBOL(mdiobus_scan); static void mdiobus_stats_acct(struct mdio_bus_stats *stats, bool op, int ret) { + preempt_disable(); u64_stats_update_begin(&stats->syncp); u64_stats_inc(&stats->transfers); @ drivers/net/phy/mdio_bus.c:758 @ static void mdiobus_stats_acct(struct mdio_bus_stats *stats, bool op, int ret) u64_stats_inc(&stats->writes); out: u64_stats_update_end(&stats->syncp); + preempt_enable(); } /** @ drivers/net/wireless/intersil/orinoco/orinoco_usb.c:696 @ static void ezusb_req_ctx_wait(struct ezusb_priv *upriv, while (!ctx->done.done && msecs--) udelay(1000); } else { - wait_event_interruptible(ctx->done.wait, - ctx->done.done); + swait_event_interruptible_exclusive(ctx->done.wait, + ctx->done.done); } break; default: @ drivers/pci/switch/switchtec.c:55 @ struct switchtec_user { enum mrpc_state state; - struct completion comp; + wait_queue_head_t cmd_comp; struct kref kref; struct list_head list; + bool cmd_done; u32 cmd; u32 status; u32 return_code; @ drivers/pci/switch/switchtec.c:81 @ static struct switchtec_user *stuser_create(struct switchtec_dev *stdev) stuser->stdev = stdev; kref_init(&stuser->kref); INIT_LIST_HEAD(&stuser->list); - init_completion(&stuser->comp); + init_waitqueue_head(&stuser->cmd_comp); stuser->event_cnt = atomic_read(&stdev->event_cnt); dev_dbg(&stdev->dev, "%s: %p\n", __func__, stuser); @ drivers/pci/switch/switchtec.c:179 @ static int mrpc_queue_cmd(struct switchtec_user *stuser) kref_get(&stuser->kref); stuser->read_len = sizeof(stuser->data); stuser_set_state(stuser, MRPC_QUEUED); - reinit_completion(&stuser->comp); + stuser->cmd_done = false; list_add_tail(&stuser->list, &stdev->mrpc_queue); mrpc_cmd_submit(stdev); @ drivers/pci/switch/switchtec.c:226 @ static void mrpc_complete_cmd(struct switchtec_dev *stdev) memcpy_fromio(stuser->data, &stdev->mmio_mrpc->output_data, stuser->read_len); out: - complete_all(&stuser->comp); + stuser->cmd_done = true; + wake_up_interruptible(&stuser->cmd_comp); list_del_init(&stuser->list); stuser_put(stuser); stdev->mrpc_busy = 0; @ drivers/pci/switch/switchtec.c:534 @ static ssize_t switchtec_dev_read(struct file *filp, char __user *data, mutex_unlock(&stdev->mrpc_mutex); if (filp->f_flags & O_NONBLOCK) { - if (!try_wait_for_completion(&stuser->comp)) + if (!READ_ONCE(stuser->cmd_done)) return -EAGAIN; } else { - rc = wait_for_completion_interruptible(&stuser->comp); + rc = wait_event_interruptible(stuser->cmd_comp, + stuser->cmd_done); if (rc < 0) return rc; } @ drivers/pci/switch/switchtec.c:586 @ static __poll_t switchtec_dev_poll(struct file *filp, poll_table *wait) struct switchtec_dev *stdev = stuser->stdev; __poll_t ret = 0; - poll_wait(filp, &stuser->comp.wait, wait); + poll_wait(filp, &stuser->cmd_comp, wait); poll_wait(filp, &stdev->event_wq, wait); if (lock_mutex_and_test_alive(stdev)) @ drivers/pci/switch/switchtec.c:594 @ static __poll_t switchtec_dev_poll(struct file *filp, poll_table *wait) mutex_unlock(&stdev->mrpc_mutex); - if (try_wait_for_completion(&stuser->comp)) + if (READ_ONCE(stuser->cmd_done)) ret |= EPOLLIN | EPOLLRDNORM; if (stuser->event_cnt != atomic_read(&stdev->event_cnt)) @ drivers/pci/switch/switchtec.c:1278 @ static void stdev_kill(struct switchtec_dev *stdev) /* Wake up and kill any users waiting on an MRPC request */ list_for_each_entry_safe(stuser, tmpuser, &stdev->mrpc_queue, list) { - complete_all(&stuser->comp); + stuser->cmd_done = true; + wake_up_interruptible(&stuser->cmd_comp); list_del_init(&stuser->list); stuser_put(stuser); } @ drivers/scsi/fcoe/fcoe.c:1455 @ static int fcoe_rcv(struct sk_buff *skb, struct net_device *netdev, static int fcoe_alloc_paged_crc_eof(struct sk_buff *skb, int tlen) { struct fcoe_percpu_s *fps; - int rc; + int rc, cpu = get_cpu_light(); - fps = &get_cpu_var(fcoe_percpu); + fps = &per_cpu(fcoe_percpu, cpu); rc = fcoe_get_paged_crc_eof(skb, tlen, fps); - put_cpu_var(fcoe_percpu); + put_cpu_light(); return rc; } @ drivers/scsi/fcoe/fcoe.c:1644 @ static inline int fcoe_filter_frames(struct fc_lport *lport, return 0; } - stats = per_cpu_ptr(lport->stats, get_cpu()); + stats = per_cpu_ptr(lport->stats, get_cpu_light()); stats->InvalidCRCCount++; if (stats->InvalidCRCCount < 5) printk(KERN_WARNING "fcoe: dropping frame with CRC error\n"); - put_cpu(); + put_cpu_light(); return -EINVAL; } @ drivers/scsi/fcoe/fcoe.c:1689 @ static void fcoe_recv_frame(struct sk_buff *skb) */ hp = (struct fcoe_hdr *) skb_network_header(skb); - stats = per_cpu_ptr(lport->stats, get_cpu()); + stats = per_cpu_ptr(lport->stats, get_cpu_light()); if (unlikely(FC_FCOE_DECAPS_VER(hp) != FC_FCOE_VER)) { if (stats->ErrorFrames < 5) printk(KERN_WARNING "fcoe: FCoE version " @ drivers/scsi/fcoe/fcoe.c:1721 @ static void fcoe_recv_frame(struct sk_buff *skb) goto drop; if (!fcoe_filter_frames(lport, fp)) { - put_cpu(); + put_cpu_light(); fc_exch_recv(lport, fp); return; } drop: stats->ErrorFrames++; - put_cpu(); + put_cpu_light(); kfree_skb(skb); } @ drivers/scsi/fcoe/fcoe_ctlr.c:829 @ static unsigned long fcoe_ctlr_age_fcfs(struct fcoe_ctlr *fip) INIT_LIST_HEAD(&del_list); - stats = per_cpu_ptr(fip->lp->stats, get_cpu()); + stats = per_cpu_ptr(fip->lp->stats, get_cpu_light()); list_for_each_entry_safe(fcf, next, &fip->fcfs, list) { deadline = fcf->time + fcf->fka_period + fcf->fka_period / 2; @ drivers/scsi/fcoe/fcoe_ctlr.c:865 @ static unsigned long fcoe_ctlr_age_fcfs(struct fcoe_ctlr *fip) sel_time = fcf->time; } } - put_cpu(); + put_cpu_light(); list_for_each_entry_safe(fcf, next, &del_list, list) { /* Removes fcf from current list */ @ drivers/scsi/libfc/fc_exch.c:824 @ static struct fc_exch *fc_exch_em_alloc(struct fc_lport *lport, } memset(ep, 0, sizeof(*ep)); - cpu = get_cpu(); + cpu = get_cpu_light(); pool = per_cpu_ptr(mp->pool, cpu); spin_lock_bh(&pool->lock); - put_cpu(); + put_cpu_light(); /* peek cache of free slot */ if (pool->left != FC_XID_UNKNOWN) { @ drivers/thermal/intel/x86_pkg_temp_thermal.c:66 @ static int max_id __read_mostly; /* Array of zone pointers */ static struct zone_device **zones; /* Serializes interrupt notification, work and hotplug */ -static DEFINE_SPINLOCK(pkg_temp_lock); +static DEFINE_RAW_SPINLOCK(pkg_temp_lock); /* Protects zone operation in the work function against hotplug removal */ static DEFINE_MUTEX(thermal_zone_mutex); @ drivers/thermal/intel/x86_pkg_temp_thermal.c:269 @ static void pkg_temp_thermal_threshold_work_fn(struct work_struct *work) u64 msr_val, wr_val; mutex_lock(&thermal_zone_mutex); - spin_lock_irq(&pkg_temp_lock); + raw_spin_lock_irq(&pkg_temp_lock); ++pkg_work_cnt; zonedev = pkg_temp_thermal_get_dev(cpu); if (!zonedev) { - spin_unlock_irq(&pkg_temp_lock); + raw_spin_unlock_irq(&pkg_temp_lock); mutex_unlock(&thermal_zone_mutex); return; } @ drivers/thermal/intel/x86_pkg_temp_thermal.c:288 @ static void pkg_temp_thermal_threshold_work_fn(struct work_struct *work) } enable_pkg_thres_interrupt(); - spin_unlock_irq(&pkg_temp_lock); + raw_spin_unlock_irq(&pkg_temp_lock); /* * If tzone is not NULL, then thermal_zone_mutex will prevent the @ drivers/thermal/intel/x86_pkg_temp_thermal.c:313 @ static int pkg_thermal_notify(u64 msr_val) struct zone_device *zonedev; unsigned long flags; - spin_lock_irqsave(&pkg_temp_lock, flags); + raw_spin_lock_irqsave(&pkg_temp_lock, flags); ++pkg_interrupt_cnt; disable_pkg_thres_interrupt(); @ drivers/thermal/intel/x86_pkg_temp_thermal.c:325 @ static int pkg_thermal_notify(u64 msr_val) pkg_thermal_schedule_work(zonedev->cpu, &zonedev->work); } - spin_unlock_irqrestore(&pkg_temp_lock, flags); + raw_spin_unlock_irqrestore(&pkg_temp_lock, flags); return 0; } @ drivers/thermal/intel/x86_pkg_temp_thermal.c:371 @ static int pkg_temp_thermal_device_add(unsigned int cpu) zonedev->msr_pkg_therm_high); cpumask_set_cpu(cpu, &zonedev->cpumask); - spin_lock_irq(&pkg_temp_lock); + raw_spin_lock_irq(&pkg_temp_lock); zones[id] = zonedev; - spin_unlock_irq(&pkg_temp_lock); + raw_spin_unlock_irq(&pkg_temp_lock); return 0; } @ drivers/thermal/intel/x86_pkg_temp_thermal.c:410 @ static int pkg_thermal_cpu_offline(unsigned int cpu) } /* Protect against work and interrupts */ - spin_lock_irq(&pkg_temp_lock); + raw_spin_lock_irq(&pkg_temp_lock); /* * Check whether this cpu was the current target and store the new @ drivers/thermal/intel/x86_pkg_temp_thermal.c:442 @ static int pkg_thermal_cpu_offline(unsigned int cpu) * To cancel the work we need to drop the lock, otherwise * we might deadlock if the work needs to be flushed. */ - spin_unlock_irq(&pkg_temp_lock); + raw_spin_unlock_irq(&pkg_temp_lock); cancel_delayed_work_sync(&zonedev->work); - spin_lock_irq(&pkg_temp_lock); + raw_spin_lock_irq(&pkg_temp_lock); /* * If this is not the last cpu in the package and the work * did not run after we dropped the lock above, then we @ drivers/thermal/intel/x86_pkg_temp_thermal.c:455 @ static int pkg_thermal_cpu_offline(unsigned int cpu) pkg_thermal_schedule_work(target, &zonedev->work); } - spin_unlock_irq(&pkg_temp_lock); + raw_spin_unlock_irq(&pkg_temp_lock); /* Final cleanup if this is the last cpu */ if (lastcpu) @ drivers/tty/serial/8250/8250.h:133 @ static inline void serial_dl_write(struct uart_8250_port *up, int value) up->dl_write(up, value); } +static inline void serial8250_set_IER(struct uart_8250_port *up, + unsigned char ier) +{ + struct uart_port *port = &up->port; + unsigned int flags; + bool is_console; + + is_console = uart_console(port); + + if (is_console) + console_atomic_lock(&flags); + + serial_out(up, UART_IER, ier); + + if (is_console) + console_atomic_unlock(flags); +} + +static inline unsigned char serial8250_clear_IER(struct uart_8250_port *up) +{ + struct uart_port *port = &up->port; + unsigned int clearval = 0; + unsigned int prior; + unsigned int flags; + bool is_console; + + is_console = uart_console(port); + + if (up->capabilities & UART_CAP_UUE) + clearval = UART_IER_UUE; + + if (is_console) + console_atomic_lock(&flags); + + prior = serial_port_in(port, UART_IER); + serial_port_out(port, UART_IER, clearval); + + if (is_console) + console_atomic_unlock(flags); + + return prior; +} + static inline bool serial8250_set_THRI(struct uart_8250_port *up) { if (up->ier & UART_IER_THRI) return false; up->ier |= UART_IER_THRI; - serial_out(up, UART_IER, up->ier); + serial8250_set_IER(up, up->ier); return true; } @ drivers/tty/serial/8250/8250.h:190 @ static inline bool serial8250_clear_THRI(struct uart_8250_port *up) if (!(up->ier & UART_IER_THRI)) return false; up->ier &= ~UART_IER_THRI; - serial_out(up, UART_IER, up->ier); + serial8250_set_IER(up, up->ier); return true; } @ drivers/tty/serial/8250/8250_core.c:277 @ static void serial8250_backup_timeout(struct timer_list *t) * Must disable interrupts or else we risk racing with the interrupt * based handler. */ - if (up->port.irq) { - ier = serial_in(up, UART_IER); - serial_out(up, UART_IER, 0); - } + if (up->port.irq) + ier = serial8250_clear_IER(up); iir = serial_in(up, UART_IIR); @ drivers/tty/serial/8250/8250_core.c:301 @ static void serial8250_backup_timeout(struct timer_list *t) serial8250_tx_chars(up); if (up->port.irq) - serial_out(up, UART_IER, ier); + serial8250_set_IER(up, ier); spin_unlock_irqrestore(&up->port.lock, flags); @ drivers/tty/serial/8250/8250_core.c:579 @ serial8250_register_ports(struct uart_driver *drv, struct device *dev) #ifdef CONFIG_SERIAL_8250_CONSOLE +static void univ8250_console_write_atomic(struct console *co, const char *s, + unsigned int count) +{ + struct uart_8250_port *up = &serial8250_ports[co->index]; + + serial8250_console_write_atomic(up, s, count); +} + static void univ8250_console_write(struct console *co, const char *s, unsigned int count) { @ drivers/tty/serial/8250/8250_core.c:672 @ static int univ8250_console_match(struct console *co, char *name, int idx, static struct console univ8250_console = { .name = "ttyS", + .write_atomic = univ8250_console_write_atomic, .write = univ8250_console_write, .device = uart_console_device, .setup = univ8250_console_setup, @ drivers/tty/serial/8250/8250_fsl.c:56 @ int fsl8250_handle_irq(struct uart_port *port) /* Stop processing interrupts on input overrun */ if ((orig_lsr & UART_LSR_OE) && (up->overrun_backoff_time_ms > 0)) { + unsigned int ca_flags; unsigned long delay; + bool is_console; + is_console = uart_console(port); + + if (is_console) + console_atomic_lock(&ca_flags); up->ier = port->serial_in(port, UART_IER); + if (is_console) + console_atomic_unlock(ca_flags); + if (up->ier & (UART_IER_RLSI | UART_IER_RDI)) { port->ops->stop_rx(port); } else { @ drivers/tty/serial/8250/8250_ingenic.c:149 @ OF_EARLYCON_DECLARE(x1000_uart, "ingenic,x1000-uart", static void ingenic_uart_serial_out(struct uart_port *p, int offset, int value) { + unsigned int flags; + bool is_console; int ier; switch (offset) { @ drivers/tty/serial/8250/8250_ingenic.c:172 @ static void ingenic_uart_serial_out(struct uart_port *p, int offset, int value) * If we have enabled modem status IRQs we should enable * modem mode. */ + is_console = uart_console(p); + if (is_console) + console_atomic_lock(&flags); ier = p->serial_in(p, UART_IER); + if (is_console) + console_atomic_unlock(flags); if (ier & UART_IER_MSI) value |= UART_MCR_MDCE | UART_MCR_FCM; @ drivers/tty/serial/8250/8250_mtk.c:215 @ static void mtk8250_shutdown(struct uart_port *port) static void mtk8250_disable_intrs(struct uart_8250_port *up, int mask) { - serial_out(up, UART_IER, serial_in(up, UART_IER) & (~mask)); + struct uart_port *port = &up->port; + unsigned int flags; + unsigned int ier; + bool is_console; + + is_console = uart_console(port); + + if (is_console) + console_atomic_lock(&flags); + + ier = serial_in(up, UART_IER); + serial_out(up, UART_IER, ier & (~mask)); + + if (is_console) + console_atomic_unlock(flags); } static void mtk8250_enable_intrs(struct uart_8250_port *up, int mask) { - serial_out(up, UART_IER, serial_in(up, UART_IER) | mask); + struct uart_port *port = &up->port; + unsigned int flags; + unsigned int ier; + + if (uart_console(port)) + console_atomic_lock(&flags); + + ier = serial_in(up, UART_IER); + serial_out(up, UART_IER, ier | mask); + + if (uart_console(port)) + console_atomic_unlock(flags); } static void mtk8250_set_flow_ctrl(struct uart_8250_port *up, int mode) @ drivers/tty/serial/8250/8250_port.c:720 @ static void serial8250_set_sleep(struct uart_8250_port *p, int sleep) serial_out(p, UART_EFR, UART_EFR_ECB); serial_out(p, UART_LCR, 0); } - serial_out(p, UART_IER, sleep ? UART_IERX_SLEEP : 0); + serial8250_set_IER(p, sleep ? UART_IERX_SLEEP : 0); if (p->capabilities & UART_CAP_EFR) { serial_out(p, UART_LCR, UART_LCR_CONF_MODE_B); serial_out(p, UART_EFR, efr); @ drivers/tty/serial/8250/8250_port.c:1392 @ static void serial8250_stop_rx(struct uart_port *port) up->ier &= ~(UART_IER_RLSI | UART_IER_RDI); up->port.read_status_mask &= ~UART_LSR_DR; - serial_port_out(port, UART_IER, up->ier); + serial8250_set_IER(up, up->ier); serial8250_rpm_put(up); } @ drivers/tty/serial/8250/8250_port.c:1410 @ static void __do_stop_tx_rs485(struct uart_8250_port *p) serial8250_clear_and_reinit_fifos(p); p->ier |= UART_IER_RLSI | UART_IER_RDI; - serial_port_out(&p->port, UART_IER, p->ier); + serial8250_set_IER(p, p->ier); } } static enum hrtimer_restart serial8250_em485_handle_stop_tx(struct hrtimer *t) @ drivers/tty/serial/8250/8250_port.c:1618 @ static void serial8250_disable_ms(struct uart_port *port) mctrl_gpio_disable_ms(up->gpios); up->ier &= ~UART_IER_MSI; - serial_port_out(port, UART_IER, up->ier); + serial8250_set_IER(up, up->ier); } static void serial8250_enable_ms(struct uart_port *port) @ drivers/tty/serial/8250/8250_port.c:1634 @ static void serial8250_enable_ms(struct uart_port *port) up->ier |= UART_IER_MSI; serial8250_rpm_get(up); - serial_port_out(port, UART_IER, up->ier); + serial8250_set_IER(up, up->ier); serial8250_rpm_put(up); } @ drivers/tty/serial/8250/8250_port.c:2028 @ static void serial8250_put_poll_char(struct uart_port *port, struct uart_8250_port *up = up_to_u8250p(port); serial8250_rpm_get(up); - /* - * First save the IER then disable the interrupts - */ - ier = serial_port_in(port, UART_IER); - if (up->capabilities & UART_CAP_UUE) - serial_port_out(port, UART_IER, UART_IER_UUE); - else - serial_port_out(port, UART_IER, 0); + ier = serial8250_clear_IER(up); wait_for_xmitr(up, BOTH_EMPTY); /* @ drivers/tty/serial/8250/8250_port.c:2041 @ static void serial8250_put_poll_char(struct uart_port *port, * and restore the IER */ wait_for_xmitr(up, BOTH_EMPTY); - serial_port_out(port, UART_IER, ier); + serial8250_set_IER(up, ier); serial8250_rpm_put(up); } @ drivers/tty/serial/8250/8250_port.c:2339 @ void serial8250_do_shutdown(struct uart_port *port) */ spin_lock_irqsave(&port->lock, flags); up->ier = 0; - serial_port_out(port, UART_IER, 0); + serial8250_set_IER(up, 0); spin_unlock_irqrestore(&port->lock, flags); synchronize_irq(port->irq); @ drivers/tty/serial/8250/8250_port.c:2624 @ serial8250_do_set_termios(struct uart_port *port, struct ktermios *termios, if (up->capabilities & UART_CAP_RTOIE) up->ier |= UART_IER_RTOIE; - serial_port_out(port, UART_IER, up->ier); + serial8250_set_IER(up, up->ier); if (up->capabilities & UART_CAP_EFR) { unsigned char efr = 0; @ drivers/tty/serial/8250/8250_port.c:3089 @ EXPORT_SYMBOL_GPL(serial8250_set_defaults); #ifdef CONFIG_SERIAL_8250_CONSOLE -static void serial8250_console_putchar(struct uart_port *port, int ch) +static void serial8250_console_putchar_locked(struct uart_port *port, int ch) { struct uart_8250_port *up = up_to_u8250p(port); @ drivers/tty/serial/8250/8250_port.c:3097 @ static void serial8250_console_putchar(struct uart_port *port, int ch) serial_port_out(port, UART_TX, ch); } +static void serial8250_console_putchar(struct uart_port *port, int ch) +{ + struct uart_8250_port *up = up_to_u8250p(port); + unsigned int flags; + + wait_for_xmitr(up, UART_LSR_THRE); + + console_atomic_lock(&flags); + serial8250_console_putchar_locked(port, ch); + console_atomic_unlock(flags); +} + /* * Restore serial console when h/w power-off detected */ @ drivers/tty/serial/8250/8250_port.c:3130 @ static void serial8250_console_restore(struct uart_8250_port *up) serial8250_out_MCR(up, UART_MCR_DTR | UART_MCR_RTS); } +void serial8250_console_write_atomic(struct uart_8250_port *up, + const char *s, unsigned int count) +{ + struct uart_port *port = &up->port; + unsigned int flags; + unsigned int ier; + + console_atomic_lock(&flags); + + touch_nmi_watchdog(); + + ier = serial8250_clear_IER(up); + + if (atomic_fetch_inc(&up->console_printing)) { + uart_console_write(port, "\n", 1, + serial8250_console_putchar_locked); + } + uart_console_write(port, s, count, serial8250_console_putchar_locked); + atomic_dec(&up->console_printing); + + wait_for_xmitr(up, BOTH_EMPTY); + serial8250_set_IER(up, ier); + + console_atomic_unlock(flags); +} + /* * Print a string to the serial port trying not to disturb * any possible real use of the port... @ drivers/tty/serial/8250/8250_port.c:3168 @ void serial8250_console_write(struct uart_8250_port *up, const char *s, struct uart_port *port = &up->port; unsigned long flags; unsigned int ier; - int locked = 1; touch_nmi_watchdog(); serial8250_rpm_get(up); + spin_lock_irqsave(&port->lock, flags); - if (oops_in_progress) - locked = spin_trylock_irqsave(&port->lock, flags); - else - spin_lock_irqsave(&port->lock, flags); - - /* - * First save the IER then disable the interrupts - */ - ier = serial_port_in(port, UART_IER); - - if (up->capabilities & UART_CAP_UUE) - serial_port_out(port, UART_IER, UART_IER_UUE); - else - serial_port_out(port, UART_IER, 0); + ier = serial8250_clear_IER(up); /* check scratch reg to see if port powered off during system sleep */ if (up->canary && (up->canary != serial_port_in(port, UART_SCR))) { @ drivers/tty/serial/8250/8250_port.c:3182 @ void serial8250_console_write(struct uart_8250_port *up, const char *s, up->canary = 0; } + atomic_inc(&up->console_printing); uart_console_write(port, s, count, serial8250_console_putchar); + atomic_dec(&up->console_printing); /* * Finally, wait for transmitter to become empty * and restore the IER */ wait_for_xmitr(up, BOTH_EMPTY); - serial_port_out(port, UART_IER, ier); + serial8250_set_IER(up, ier); /* * The receive handling will happen properly because the @ drivers/tty/serial/8250/8250_port.c:3203 @ void serial8250_console_write(struct uart_8250_port *up, const char *s, if (up->msr_saved_flags) serial8250_modem_status(up); - if (locked) - spin_unlock_irqrestore(&port->lock, flags); + spin_unlock_irqrestore(&port->lock, flags); serial8250_rpm_put(up); } @ drivers/tty/serial/8250/8250_port.c:3224 @ static unsigned int probe_baud(struct uart_port *port) int serial8250_console_setup(struct uart_port *port, char *options, bool probe) { + struct uart_8250_port *up = up_to_u8250p(port); int baud = 9600; int bits = 8; int parity = 'n'; @ drivers/tty/serial/8250/8250_port.c:3233 @ int serial8250_console_setup(struct uart_port *port, char *options, bool probe) if (!port->iobase && !port->membase) return -ENODEV; + atomic_set(&up->console_printing, 0); + if (options) uart_parse_options(options, &baud, &parity, &bits, &flow); else if (probe) @ drivers/tty/serial/amba-pl011.c:2201 @ pl011_console_write(struct console *co, const char *s, unsigned int count) { struct uart_amba_port *uap = amba_ports[co->index]; unsigned int old_cr = 0, new_cr; - unsigned long flags; + unsigned long flags = 0; int locked = 1; clk_enable(uap->clk); - local_irq_save(flags); + /* + * local_irq_save(flags); + * + * This local_irq_save() is nonsense. If we come in via sysrq + * handling then interrupts are already disabled. Aside of + * that the port.sysrq check is racy on SMP regardless. + */ if (uap->port.sysrq) locked = 0; else if (oops_in_progress) - locked = spin_trylock(&uap->port.lock); + locked = spin_trylock_irqsave(&uap->port.lock, flags); else - spin_lock(&uap->port.lock); + spin_lock_irqsave(&uap->port.lock, flags); /* * First save the CR then disable the interrupts @ drivers/tty/serial/amba-pl011.c:2244 @ pl011_console_write(struct console *co, const char *s, unsigned int count) pl011_write(old_cr, uap, REG_CR); if (locked) - spin_unlock(&uap->port.lock); - local_irq_restore(flags); + spin_unlock_irqrestore(&uap->port.lock, flags); clk_disable(uap->clk); } @ drivers/tty/serial/omap-serial.c:1309 @ serial_omap_console_write(struct console *co, const char *s, pm_runtime_get_sync(up->dev); - local_irq_save(flags); - if (up->port.sysrq) - locked = 0; - else if (oops_in_progress) - locked = spin_trylock(&up->port.lock); + if (up->port.sysrq || oops_in_progress) + locked = spin_trylock_irqsave(&up->port.lock, flags); else - spin_lock(&up->port.lock); + spin_lock_irqsave(&up->port.lock, flags); /* * First save the IER then disable the interrupts @ drivers/tty/serial/omap-serial.c:1341 @ serial_omap_console_write(struct console *co, const char *s, pm_runtime_mark_last_busy(up->dev); pm_runtime_put_autosuspend(up->dev); if (locked) - spin_unlock(&up->port.lock); - local_irq_restore(flags); + spin_unlock_irqrestore(&up->port.lock, flags); } static int __init @ drivers/usb/gadget/function/f_fs.c:1707 @ static void ffs_data_put(struct ffs_data *ffs) pr_info("%s(): freeing\n", __func__); ffs_data_clear(ffs); BUG_ON(waitqueue_active(&ffs->ev.waitq) || - waitqueue_active(&ffs->ep0req_completion.wait) || + swait_active(&ffs->ep0req_completion.wait) || waitqueue_active(&ffs->wait)); destroy_workqueue(ffs->io_completion_wq); kfree(ffs->dev_name); @ drivers/usb/gadget/legacy/inode.c:347 @ ep_io (struct ep_data *epdata, void *buf, unsigned len) spin_unlock_irq (&epdata->dev->lock); if (likely (value == 0)) { - value = wait_event_interruptible (done.wait, done.done); + value = swait_event_interruptible_exclusive(done.wait, done.done); if (value != 0) { spin_lock_irq (&epdata->dev->lock); if (likely (epdata->ep != NULL)) { @ drivers/usb/gadget/legacy/inode.c:356 @ ep_io (struct ep_data *epdata, void *buf, unsigned len) usb_ep_dequeue (epdata->ep, epdata->req); spin_unlock_irq (&epdata->dev->lock); - wait_event (done.wait, done.done); + swait_event_exclusive(done.wait, done.done); if (epdata->status == -ECONNRESET) epdata->status = -EINTR; } else { @ fs/afs/dir_silly.c:213 @ int afs_silly_iput(struct dentry *dentry, struct inode *inode) struct dentry *alias; int ret; - DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq); + DECLARE_SWAIT_QUEUE_HEAD_ONSTACK(wq); _enter("%p{%pd},%llx", dentry, dentry, vnode->fid.vnode); @ fs/buffer.c:277 @ static void end_buffer_async_read(struct buffer_head *bh, int uptodate) * decide that the page is now completely done. */ first = page_buffers(page); - local_irq_save(flags); - bit_spin_lock(BH_Uptodate_Lock, &first->b_state); + spin_lock_irqsave(&first->b_uptodate_lock, flags); clear_buffer_async_read(bh); unlock_buffer(bh); tmp = bh; @ fs/buffer.c:290 @ static void end_buffer_async_read(struct buffer_head *bh, int uptodate) } tmp = tmp->b_this_page; } while (tmp != bh); - bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); - local_irq_restore(flags); + spin_unlock_irqrestore(&first->b_uptodate_lock, flags); /* * If none of the buffers had errors and they are all @ fs/buffer.c:302 @ static void end_buffer_async_read(struct buffer_head *bh, int uptodate) return; still_busy: - bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); - local_irq_restore(flags); + spin_unlock_irqrestore(&first->b_uptodate_lock, flags); return; } @ fs/buffer.c:371 @ void end_buffer_async_write(struct buffer_head *bh, int uptodate) } first = page_buffers(page); - local_irq_save(flags); - bit_spin_lock(BH_Uptodate_Lock, &first->b_state); + spin_lock_irqsave(&first->b_uptodate_lock, flags); clear_buffer_async_write(bh); unlock_buffer(bh); @ fs/buffer.c:383 @ void end_buffer_async_write(struct buffer_head *bh, int uptodate) } tmp = tmp->b_this_page; } - bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); - local_irq_restore(flags); + spin_unlock_irqrestore(&first->b_uptodate_lock, flags); end_page_writeback(page); return; still_busy: - bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); - local_irq_restore(flags); + spin_unlock_irqrestore(&first->b_uptodate_lock, flags); return; } EXPORT_SYMBOL(end_buffer_async_write); @ fs/buffer.c:3393 @ struct buffer_head *alloc_buffer_head(gfp_t gfp_flags) struct buffer_head *ret = kmem_cache_zalloc(bh_cachep, gfp_flags); if (ret) { INIT_LIST_HEAD(&ret->b_assoc_buffers); + spin_lock_init(&ret->b_uptodate_lock); preempt_disable(); __this_cpu_inc(bh_accounting.nr); recalc_bh_state(); @ fs/cifs/readdir.c:83 @ cifs_prime_dcache(struct dentry *parent, struct qstr *name, struct inode *inode; struct super_block *sb = parent->d_sb; struct cifs_sb_info *cifs_sb = CIFS_SB(sb); - DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq); + DECLARE_SWAIT_QUEUE_HEAD_ONSTACK(wq); cifs_dbg(FYI, "%s: for %s\n", __func__, name->name); @ fs/dcache.c:1730 @ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name) dentry->d_lockref.count = 1; dentry->d_flags = 0; spin_lock_init(&dentry->d_lock); - seqcount_init(&dentry->d_seq); + seqcount_spinlock_init(&dentry->d_seq, &dentry->d_lock); dentry->d_inode = NULL; dentry->d_parent = dentry; dentry->d_sb = sb; @ fs/dcache.c:2487 @ EXPORT_SYMBOL(d_rehash); static inline unsigned start_dir_add(struct inode *dir) { + preempt_disable_rt(); for (;;) { - unsigned n = dir->i_dir_seq; - if (!(n & 1) && cmpxchg(&dir->i_dir_seq, n, n + 1) == n) + unsigned n = dir->__i_dir_seq; + if (!(n & 1) && cmpxchg(&dir->__i_dir_seq, n, n + 1) == n) return n; cpu_relax(); } @ fs/dcache.c:2498 @ static inline unsigned start_dir_add(struct inode *dir) static inline void end_dir_add(struct inode *dir, unsigned n) { - smp_store_release(&dir->i_dir_seq, n + 2); + smp_store_release(&dir->__i_dir_seq, n + 2); + preempt_enable_rt(); } static void d_wait_lookup(struct dentry *dentry) { - if (d_in_lookup(dentry)) { - DECLARE_WAITQUEUE(wait, current); - add_wait_queue(dentry->d_wait, &wait); - do { - set_current_state(TASK_UNINTERRUPTIBLE); - spin_unlock(&dentry->d_lock); - schedule(); - spin_lock(&dentry->d_lock); - } while (d_in_lookup(dentry)); - } + struct swait_queue __wait; + + if (!d_in_lookup(dentry)) + return; + + INIT_LIST_HEAD(&__wait.task_list); + do { + prepare_to_swait_exclusive(dentry->d_wait, &__wait, TASK_UNINTERRUPTIBLE); + spin_unlock(&dentry->d_lock); + schedule(); + spin_lock(&dentry->d_lock); + } while (d_in_lookup(dentry)); + finish_swait(dentry->d_wait, &__wait); } struct dentry *d_alloc_parallel(struct dentry *parent, const struct qstr *name, - wait_queue_head_t *wq) + struct swait_queue_head *wq) { unsigned int hash = name->hash; struct hlist_bl_head *b = in_lookup_hash(parent, hash); @ fs/dcache.c:2535 @ struct dentry *d_alloc_parallel(struct dentry *parent, retry: rcu_read_lock(); - seq = smp_load_acquire(&parent->d_inode->i_dir_seq); + seq = smp_load_acquire(&parent->d_inode->__i_dir_seq); r_seq = read_seqbegin(&rename_lock); dentry = __d_lookup_rcu(parent, name, &d_seq); if (unlikely(dentry)) { @ fs/dcache.c:2563 @ struct dentry *d_alloc_parallel(struct dentry *parent, } hlist_bl_lock(b); - if (unlikely(READ_ONCE(parent->d_inode->i_dir_seq) != seq)) { + if (unlikely(READ_ONCE(parent->d_inode->__i_dir_seq) != seq)) { hlist_bl_unlock(b); rcu_read_unlock(); goto retry; @ fs/dcache.c:2636 @ void __d_lookup_done(struct dentry *dentry) hlist_bl_lock(b); dentry->d_flags &= ~DCACHE_PAR_LOOKUP; __hlist_bl_del(&dentry->d_u.d_in_lookup_hash); - wake_up_all(dentry->d_wait); + swake_up_all(dentry->d_wait); dentry->d_wait = NULL; hlist_bl_unlock(b); INIT_HLIST_NODE(&dentry->d_u.d_alias); @ fs/eventpoll.c:222 @ struct eventpoll { /* used to optimize loop detection check */ int visited; - struct list_head visited_list_link; #ifdef CONFIG_NET_RX_BUSY_POLL /* used to track busy poll napi_id */ unsigned int napi_id; #endif + + struct list_head visited_list_link; + +#ifdef CONFIG_DEBUG_LOCK_ALLOC + /* used to track wakeup nests for lockdep validation */ + u8 nests; +#endif }; /* Wait structure used by the poll hooks */ @ fs/eventpoll.c:554 @ static int ep_call_nested(struct nested_calls *ncalls, */ #ifdef CONFIG_DEBUG_LOCK_ALLOC -static DEFINE_PER_CPU(int, wakeup_nest); - -static void ep_poll_safewake(wait_queue_head_t *wq) +static void ep_poll_safewake(struct eventpoll *ep, struct epitem *epi) { + struct eventpoll *ep_src; unsigned long flags; - int subclass; + u8 nests = 0; - local_irq_save(flags); - preempt_disable(); - subclass = __this_cpu_read(wakeup_nest); - spin_lock_nested(&wq->lock, subclass + 1); - __this_cpu_inc(wakeup_nest); - wake_up_locked_poll(wq, POLLIN); - __this_cpu_dec(wakeup_nest); - spin_unlock(&wq->lock); - local_irq_restore(flags); - preempt_enable(); + /* + * If we are not being call from ep_poll_callback(), epi is + * NULL and we are at the first level of nesting, 0. Otherwise, + * we are being called from ep_poll_callback() and if a previous + * wakeup source is not an epoll file itself, we are at depth + * 1 since the wakeup source is depth 0. If the wakeup source + * is a previous epoll file in the wakeup chain then we use its + * nests value and record ours as nests + 1. The previous epoll + * file nests value is stable since its already holding its + * own poll_wait.lock. + */ + if (epi) { + if ((is_file_epoll(epi->ffd.file))) { + ep_src = epi->ffd.file->private_data; + nests = ep_src->nests; + } else { + nests = 1; + } + } + spin_lock_irqsave_nested(&ep->poll_wait.lock, flags, nests); + ep->nests = nests + 1; + wake_up_locked_poll(&ep->poll_wait, EPOLLIN); + ep->nests = 0; + spin_unlock_irqrestore(&ep->poll_wait.lock, flags); } #else -static void ep_poll_safewake(wait_queue_head_t *wq) +static void ep_poll_safewake(struct eventpoll *ep, struct epitem *epi) { - wake_up_poll(wq, EPOLLIN); + wake_up_poll(&ep->poll_wait, EPOLLIN); } #endif @ fs/eventpoll.c:811 @ static void ep_free(struct eventpoll *ep) /* We need to release all tasks waiting for these file */ if (waitqueue_active(&ep->poll_wait)) - ep_poll_safewake(&ep->poll_wait); + ep_poll_safewake(ep, NULL); /* * We need to lock this because we could be hit by @ fs/eventpoll.c:1280 @ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v /* We have to call this outside the lock */ if (pwake) - ep_poll_safewake(&ep->poll_wait); + ep_poll_safewake(ep, epi); if (!(epi->event.events & EPOLLEXCLUSIVE)) ewake = 1; @ fs/eventpoll.c:1584 @ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event, /* We have to call this outside the lock */ if (pwake) - ep_poll_safewake(&ep->poll_wait); + ep_poll_safewake(ep, NULL); return 0; @ fs/eventpoll.c:1688 @ static int ep_modify(struct eventpoll *ep, struct epitem *epi, /* We have to call this outside the lock */ if (pwake) - ep_poll_safewake(&ep->poll_wait); + ep_poll_safewake(ep, NULL); return 0; } @ fs/ext4/page-io.c:128 @ static void ext4_finish_bio(struct bio *bio) } bh = head = page_buffers(page); /* - * We check all buffers in the page under BH_Uptodate_Lock + * We check all buffers in the page under b_uptodate_lock * to avoid races with other end io clearing async_write flags */ - local_irq_save(flags); - bit_spin_lock(BH_Uptodate_Lock, &head->b_state); + spin_lock_irqsave(&head->b_uptodate_lock, flags); do { if (bh_offset(bh) < bio_start || bh_offset(bh) + bh->b_size > bio_end) { @ fs/ext4/page-io.c:143 @ static void ext4_finish_bio(struct bio *bio) if (bio->bi_status) buffer_io_error(bh); } while ((bh = bh->b_this_page) != head); - bit_spin_unlock(BH_Uptodate_Lock, &head->b_state); - local_irq_restore(flags); + spin_unlock_irqrestore(&head->b_uptodate_lock, flags); if (!under_io) { fscrypt_free_bounce_page(bounce_page); end_page_writeback(page); @ fs/fs_struct.c:120 @ struct fs_struct *copy_fs_struct(struct fs_struct *old) fs->users = 1; fs->in_exec = 0; spin_lock_init(&fs->lock); - seqcount_init(&fs->seq); + seqcount_spinlock_init(&fs->seq, &fs->lock); fs->umask = old->umask; spin_lock(&old->lock); @ fs/fs_struct.c:166 @ EXPORT_SYMBOL(current_umask); struct fs_struct init_fs = { .users = 1, .lock = __SPIN_LOCK_UNLOCKED(init_fs.lock), - .seq = SEQCNT_ZERO(init_fs.seq), + .seq = SEQCNT_SPINLOCK_ZERO(init_fs.seq, &init_fs.lock), .umask = 0022, }; @ fs/fuse/readdir.c:161 @ static int fuse_direntplus_link(struct file *file, struct inode *dir = d_inode(parent); struct fuse_conn *fc; struct inode *inode; - DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq); + DECLARE_SWAIT_QUEUE_HEAD_ONSTACK(wq); if (!o->nodeid) { /* @ fs/inode.c:161 @ int inode_init_always(struct super_block *sb, struct inode *inode) inode->i_bdev = NULL; inode->i_cdev = NULL; inode->i_link = NULL; - inode->i_dir_seq = 0; + inode->__i_dir_seq = 0; inode->i_rdev = 0; inode->dirtied_when = 0; @ fs/namei.c:1736 @ static struct dentry *__lookup_slow(const struct qstr *name, { struct dentry *dentry, *old; struct inode *inode = dir->d_inode; - DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq); + DECLARE_SWAIT_QUEUE_HEAD_ONSTACK(wq); /* Don't go there if it's already dead */ if (unlikely(IS_DEADDIR(inode))) @ fs/namei.c:3213 @ static int lookup_open(struct nameidata *nd, struct path *path, struct dentry *dentry; int error, create_error = 0; umode_t mode = op->mode; - DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq); + DECLARE_SWAIT_QUEUE_HEAD_ONSTACK(wq); if (unlikely(IS_DEADDIR(dir_inode))) return -ENOENT; @ fs/namespace.c:17 @ #include <linux/mnt_namespace.h> #include <linux/user_namespace.h> #include <linux/namei.h> +#include <linux/delay.h> #include <linux/security.h> #include <linux/cred.h> #include <linux/idr.h> @ fs/namespace.c:325 @ int __mnt_want_write(struct vfsmount *m) * incremented count after it has set MNT_WRITE_HOLD. */ smp_mb(); - while (READ_ONCE(mnt->mnt.mnt_flags) & MNT_WRITE_HOLD) - cpu_relax(); + while (READ_ONCE(mnt->mnt.mnt_flags) & MNT_WRITE_HOLD) { + preempt_enable(); + cpu_chill(); + preempt_disable(); + } /* * After the slowpath clears MNT_WRITE_HOLD, mnt_is_readonly will * be set to match its requirements. So we must not load that until @ fs/nfs/dir.c:464 @ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry, unsigned long dir_verifier) { struct qstr filename = QSTR_INIT(entry->name, entry->len); - DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq); + DECLARE_SWAIT_QUEUE_HEAD_ONSTACK(wq); struct dentry *dentry; struct dentry *alias; struct inode *inode; @ fs/nfs/dir.c:1638 @ int nfs_atomic_open(struct inode *dir, struct dentry *dentry, struct file *file, unsigned open_flags, umode_t mode) { - DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq); + DECLARE_SWAIT_QUEUE_HEAD_ONSTACK(wq); struct nfs_open_context *ctx; struct dentry *res; struct iattr attr = { .ia_valid = ATTR_OPEN }; @ fs/nfs/nfs4_fs.h:118 @ struct nfs4_state_owner { unsigned long so_flags; struct list_head so_states; struct nfs_seqid_counter so_seqid; - seqcount_t so_reclaim_seqcount; + seqcount_spinlock_t so_reclaim_seqcount; struct mutex so_delegreturn_mutex; }; @ fs/nfs/nfs4state.c:512 @ nfs4_alloc_state_owner(struct nfs_server *server, nfs4_init_seqid_counter(&sp->so_seqid); atomic_set(&sp->so_count, 1); INIT_LIST_HEAD(&sp->so_lru); - seqcount_init(&sp->so_reclaim_seqcount); + seqcount_spinlock_init(&sp->so_reclaim_seqcount, &sp->so_lock); mutex_init(&sp->so_delegreturn_mutex); return sp; } @ fs/nfs/unlink.c:16 @ #include <linux/sunrpc/clnt.h> #include <linux/nfs_fs.h> #include <linux/sched.h> -#include <linux/wait.h> +#include <linux/swait.h> #include <linux/namei.h> #include <linux/fsnotify.h> @ fs/nfs/unlink.c:183 @ nfs_async_unlink(struct dentry *dentry, const struct qstr *name) data->cred = get_current_cred(); data->res.dir_attr = &data->dir_attr; - init_waitqueue_head(&data->wq); + init_swait_queue_head(&data->wq); status = -EBUSY; spin_lock(&dentry->d_lock); @ fs/ntfs/aops.c:95 @ static void ntfs_end_buffer_async_read(struct buffer_head *bh, int uptodate) "0x%llx.", (unsigned long long)bh->b_blocknr); } first = page_buffers(page); - local_irq_save(flags); - bit_spin_lock(BH_Uptodate_Lock, &first->b_state); + spin_lock_irqsave(&first->b_uptodate_lock, flags); clear_buffer_async_read(bh); unlock_buffer(bh); tmp = bh; @ fs/ntfs/aops.c:110 @ static void ntfs_end_buffer_async_read(struct buffer_head *bh, int uptodate) } tmp = tmp->b_this_page; } while (tmp != bh); - bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); - local_irq_restore(flags); + spin_unlock_irqrestore(&first->b_uptodate_lock, flags); /* * If none of the buffers had errors then we can set the page uptodate, * but we first have to perform the post read mst fixups, if the @ fs/ntfs/aops.c:143 @ static void ntfs_end_buffer_async_read(struct buffer_head *bh, int uptodate) unlock_page(page); return; still_busy: - bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); - local_irq_restore(flags); + spin_unlock_irqrestore(&first->b_uptodate_lock, flags); return; } @ fs/proc/base.c:99 @ #include <linux/posix-timers.h> #include <linux/time_namespace.h> #include <linux/resctrl.h> +#include <linux/swait.h> #include <trace/events/oom.h> #include "internal.h" #include "fd.h" @ fs/proc/base.c:1998 @ bool proc_fill_cache(struct file *file, struct dir_context *ctx, child = d_hash_and_lookup(dir, &qname); if (!child) { - DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq); + DECLARE_SWAIT_QUEUE_HEAD_ONSTACK(wq); child = d_alloc_parallel(dir, &qname, &wq); if (IS_ERR(child)) goto end_instantiate; @ fs/proc/kmsg.c:21 @ #include <linux/uaccess.h> #include <asm/io.h> -extern wait_queue_head_t log_wait; - static int kmsg_open(struct inode * inode, struct file * file) { return do_syslog(SYSLOG_ACTION_OPEN, NULL, 0, SYSLOG_FROM_PROC); @ fs/proc/kmsg.c:43 @ static ssize_t kmsg_read(struct file *file, char __user *buf, static __poll_t kmsg_poll(struct file *file, poll_table *wait) { - poll_wait(file, &log_wait, wait); + poll_wait(file, printk_wait_queue(), wait); if (do_syslog(SYSLOG_ACTION_SIZE_UNREAD, NULL, 0, SYSLOG_FROM_PROC)) return EPOLLIN | EPOLLRDNORM; return 0; @ fs/proc/proc_sysctl.c:705 @ static bool proc_sys_fill_cache(struct file *file, child = d_lookup(dir, &qname); if (!child) { - DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq); + DECLARE_SWAIT_QUEUE_HEAD_ONSTACK(wq); child = d_alloc_parallel(dir, &qname, &wq); if (IS_ERR(child)) return false; @ fs/squashfs/decompressor_multi_percpu.c:11 @ #include <linux/slab.h> #include <linux/percpu.h> #include <linux/buffer_head.h> +#include <linux/locallock.h> #include "squashfs_fs.h" #include "squashfs_fs_sb.h" @ fs/squashfs/decompressor_multi_percpu.c:27 @ struct squashfs_stream { void *stream; }; +static DEFINE_LOCAL_IRQ_LOCK(stream_lock); + void *squashfs_decompressor_create(struct squashfs_sb_info *msblk, void *comp_opts) { @ fs/squashfs/decompressor_multi_percpu.c:83 @ int squashfs_decompress(struct squashfs_sb_info *msblk, struct buffer_head **bh, { struct squashfs_stream __percpu *percpu = (struct squashfs_stream __percpu *) msblk->stream; - struct squashfs_stream *stream = get_cpu_ptr(percpu); - int res = msblk->decompressor->decompress(msblk, stream->stream, bh, b, - offset, length, output); - put_cpu_ptr(stream); + struct squashfs_stream *stream; + int res; + + stream = get_locked_ptr(stream_lock, percpu); + + res = msblk->decompressor->decompress(msblk, stream->stream, bh, b, + offset, length, output); + + put_locked_ptr(stream_lock, stream); if (res < 0) ERROR("%s decompression failed, data probably corrupt\n", @ fs/userfaultfd.c:64 @ struct userfaultfd_ctx { /* waitqueue head for events */ wait_queue_head_t event_wqh; /* a refile sequence protected by fault_pending_wqh lock */ - struct seqcount refile_seq; + seqcount_spinlock_t refile_seq; /* pseudo fd refcounting */ refcount_t refcount; /* userfaultfd syscall flags */ @ fs/userfaultfd.c:1943 @ static void init_once_userfaultfd_ctx(void *mem) init_waitqueue_head(&ctx->fault_wqh); init_waitqueue_head(&ctx->event_wqh); init_waitqueue_head(&ctx->fd_wqh); - seqcount_init(&ctx->refile_seq); + seqcount_spinlock_init(&ctx->refile_seq, &ctx->fault_pending_wqh.lock); } SYSCALL_DEFINE1(userfaultfd, int, flags) @ include/linux/bottom_half.h:7 @ #include <linux/preempt.h> +#ifdef CONFIG_PREEMPT_RT +extern void __local_bh_disable_ip(unsigned long ip, unsigned int cnt); +#else + #ifdef CONFIG_TRACE_IRQFLAGS extern void __local_bh_disable_ip(unsigned long ip, unsigned int cnt); #else @ include/linux/bottom_half.h:20 @ static __always_inline void __local_bh_disable_ip(unsigned long ip, unsigned int barrier(); } #endif +#endif static inline void local_bh_disable(void) { @ include/linux/bpf.h:889 @ int bpf_prog_array_copy(struct bpf_prog_array *old_array, struct bpf_prog *_prog; \ struct bpf_prog_array *_array; \ u32 _ret = 1; \ - preempt_disable(); \ + migrate_disable(); \ rcu_read_lock(); \ _array = rcu_dereference(array); \ if (unlikely(check_non_null && !_array))\ @ include/linux/bpf.h:902 @ int bpf_prog_array_copy(struct bpf_prog_array *old_array, } \ _out: \ rcu_read_unlock(); \ - preempt_enable(); \ + migrate_enable(); \ _ret; \ }) @ include/linux/bpf.h:936 @ _out: \ u32 ret; \ u32 _ret = 1; \ u32 _cn = 0; \ - preempt_disable(); \ + migrate_disable(); \ rcu_read_lock(); \ _array = rcu_dereference(array); \ _item = &_array->items[0]; \ @ include/linux/bpf.h:948 @ _out: \ _item++; \ } \ rcu_read_unlock(); \ - preempt_enable(); \ + migrate_enable(); \ if (_ret) \ _ret = (_cn ? NET_XMIT_CN : NET_XMIT_SUCCESS); \ else \ @ include/linux/bpf.h:965 @ _out: \ #ifdef CONFIG_BPF_SYSCALL DECLARE_PER_CPU(int, bpf_prog_active); +/* + * Block execution of BPF programs attached to instrumentation (perf, + * kprobes, tracepoints) to prevent deadlocks on map operations as any of + * these events can happen inside a region which holds a map bucket lock + * and can deadlock on it. + * + * Use the preemption safe inc/dec variants on RT because migrate disable + * is preemptible on RT and preemption in the middle of the RMW operation + * might lead to inconsistent state. Use the raw variants for non RT + * kernels as migrate_disable() maps to preempt_disable() so the slightly + * more expensive save operation can be avoided. + */ +static inline void bpf_disable_instrumentation(void) +{ + migrate_disable(); + if (IS_ENABLED(CONFIG_PREEMPT_RT)) + this_cpu_inc(bpf_prog_active); + else + __this_cpu_inc(bpf_prog_active); +} + +static inline void bpf_enable_instrumentation(void) +{ + if (IS_ENABLED(CONFIG_PREEMPT_RT)) + this_cpu_dec(bpf_prog_active); + else + __this_cpu_dec(bpf_prog_active); + migrate_enable(); +} + extern const struct file_operations bpf_map_fops; extern const struct file_operations bpf_prog_fops; @ include/linux/buffer_head.h:25 @ enum bh_state_bits { BH_Dirty, /* Is dirty */ BH_Lock, /* Is locked */ BH_Req, /* Has been submitted for I/O */ - BH_Uptodate_Lock,/* Used by the first bh in a page, to serialise - * IO completion of other buffers in the page - */ BH_Mapped, /* Has a disk mapping */ BH_New, /* Disk mapping was newly created by get_block */ @ include/linux/buffer_head.h:76 @ struct buffer_head { struct address_space *b_assoc_map; /* mapping this buffer is associated with */ atomic_t b_count; /* users using this buffer_head */ + spinlock_t b_uptodate_lock; /* Used by the first bh in a page, to + * serialise IO completion of other + * buffers in the page */ }; /* @ include/linux/completion.h:12 @ * See kernel/sched/completion.c for details. */ -#include <linux/wait.h> +#include <linux/swait.h> /* * struct completion - structure used to maintain state for a "completion" @ include/linux/completion.h:28 @ */ struct completion { unsigned int done; - wait_queue_head_t wait; + struct swait_queue_head wait; }; #define init_completion_map(x, m) __init_completion(x) @ include/linux/completion.h:37 @ static inline void complete_acquire(struct completion *x) {} static inline void complete_release(struct completion *x) {} #define COMPLETION_INITIALIZER(work) \ - { 0, __WAIT_QUEUE_HEAD_INITIALIZER((work).wait) } + { 0, __SWAIT_QUEUE_HEAD_INITIALIZER((work).wait) } #define COMPLETION_INITIALIZER_ONSTACK_MAP(work, map) \ (*({ init_completion_map(&(work), &(map)); &(work); })) @ include/linux/completion.h:88 @ static inline void complete_release(struct completion *x) {} static inline void __init_completion(struct completion *x) { x->done = 0; - init_waitqueue_head(&x->wait); + init_swait_queue_head(&x->wait); } /** @ include/linux/console.h:147 @ static inline int con_debug_leave(void) struct console { char name[16]; void (*write)(struct console *, const char *, unsigned); + void (*write_atomic)(struct console *, const char *, unsigned); int (*read)(struct console *, char *, unsigned); struct tty_driver *(*device)(struct console *, int *); void (*unblank)(void); @ include/linux/console.h:156 @ struct console { short flags; short index; int cflag; + unsigned long printk_seq; + int wrote_history; void *data; struct console *next; }; @ include/linux/console.h:238 @ extern void console_init(void); void dummycon_register_output_notifier(struct notifier_block *nb); void dummycon_unregister_output_notifier(struct notifier_block *nb); +extern void console_atomic_lock(unsigned int *flags); +extern void console_atomic_unlock(unsigned int flags); + #endif /* _LINUX_CONSOLE_H */ @ include/linux/dcache.h:92 @ extern struct dentry_stat_t dentry_stat; struct dentry { /* RCU lookup touched fields */ unsigned int d_flags; /* protected by d_lock */ - seqcount_t d_seq; /* per dentry seqlock */ + seqcount_spinlock_t d_seq; /* per dentry seqlock */ struct hlist_bl_node d_hash; /* lookup hash list */ struct dentry *d_parent; /* parent directory */ struct qstr d_name; @ include/linux/dcache.h:109 @ struct dentry { union { struct list_head d_lru; /* LRU list */ - wait_queue_head_t *d_wait; /* in-lookup ones only */ + struct swait_queue_head *d_wait; /* in-lookup ones only */ }; struct list_head d_child; /* child of parent list */ struct list_head d_subdirs; /* our children */ @ include/linux/dcache.h:239 @ extern void d_set_d_op(struct dentry *dentry, const struct dentry_operations *op extern struct dentry * d_alloc(struct dentry *, const struct qstr *); extern struct dentry * d_alloc_anon(struct super_block *); extern struct dentry * d_alloc_parallel(struct dentry *, const struct qstr *, - wait_queue_head_t *); + struct swait_queue_head *); extern struct dentry * d_splice_alias(struct inode *, struct dentry *); extern struct dentry * d_add_ci(struct dentry *, struct inode *, struct qstr *); extern struct dentry * d_exact_alias(struct dentry *, struct inode *); @ include/linux/delay.h:68 @ static inline void ssleep(unsigned int seconds) msleep(seconds * 1000); } +#ifdef CONFIG_PREEMPT_RT +extern void cpu_chill(void); +#else +# define cpu_chill() cpu_relax() +#endif + #endif /* defined(_LINUX_DELAY_H) */ @ include/linux/dma-resv.h:49 @ #include <linux/rcupdate.h> extern struct ww_class reservation_ww_class; -extern struct lock_class_key reservation_seqcount_class; -extern const char reservation_seqcount_string[]; /** * struct dma_resv_list - a list of shared fences @ include/linux/dma-resv.h:72 @ struct dma_resv_list { */ struct dma_resv { struct ww_mutex lock; - seqcount_t seq; + seqcount_ww_mutex_t seq; struct dma_fence __rcu *fence_excl; struct dma_resv_list __rcu *fence; @ include/linux/filter.h:564 @ DECLARE_STATIC_KEY_FALSE(bpf_stats_enabled_key); #define __BPF_PROG_RUN(prog, ctx, dfunc) ({ \ u32 ret; \ - cant_sleep(); \ + cant_migrate(); \ if (static_branch_unlikely(&bpf_stats_enabled_key)) { \ struct bpf_prog_stats *stats; \ u64 start = sched_clock(); \ @ include/linux/filter.h:579 @ DECLARE_STATIC_KEY_FALSE(bpf_stats_enabled_key); } \ ret; }) -#define BPF_PROG_RUN(prog, ctx) __BPF_PROG_RUN(prog, ctx, \ - bpf_dispatcher_nopfunc) +#define BPF_PROG_RUN(prog, ctx) \ + __BPF_PROG_RUN(prog, ctx, bpf_dispatcher_nopfunc) + +/* + * Use in preemptible and therefore migratable context to make sure that + * the execution of the BPF program runs on one CPU. + * + * This uses migrate_disable/enable() explicitly to document that the + * invocation of a BPF program does not require reentrancy protection + * against a BPF program which is invoked from a preempting task. + * + * For non RT enabled kernels migrate_disable/enable() maps to + * preempt_disable/enable(), i.e. it disables also preemption. + */ +static inline u32 bpf_prog_run_pin_on_cpu(const struct bpf_prog *prog, + const void *ctx) +{ + u32 ret; + + migrate_disable(); + ret = __BPF_PROG_RUN(prog, ctx, bpf_dispatcher_nopfunc); + migrate_enable(); + return ret; +} #define BPF_SKB_CB_LEN QDISC_CB_PRIV_LEN @ include/linux/filter.h:680 @ static inline u8 *bpf_skb_cb(struct sk_buff *skb) return qdisc_skb_cb(skb)->data; } +/* Must be invoked with migration disabled */ static inline u32 __bpf_prog_run_save_cb(const struct bpf_prog *prog, struct sk_buff *skb) { @ include/linux/filter.h:706 @ static inline u32 bpf_prog_run_save_cb(const struct bpf_prog *prog, { u32 res; - preempt_disable(); + migrate_disable(); res = __bpf_prog_run_save_cb(prog, skb); - preempt_enable(); + migrate_enable(); return res; } @ include/linux/filter.h:721 @ static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog, if (unlikely(prog->cb_access)) memset(cb_data, 0, BPF_SKB_CB_LEN); - preempt_disable(); - res = BPF_PROG_RUN(prog, skb); - preempt_enable(); + res = bpf_prog_run_pin_on_cpu(prog, skb); return res; } @ include/linux/fs.h:720 @ struct inode { struct block_device *i_bdev; struct cdev *i_cdev; char *i_link; - unsigned i_dir_seq; + unsigned __i_dir_seq; }; __u32 i_generation; @ include/linux/fs_struct.h:12 @ struct fs_struct { int users; spinlock_t lock; - seqcount_t seq; + seqcount_spinlock_t seq; int umask; int in_exec; struct path root, pwd; @ include/linux/genhd.h:755 @ static inline sector_t part_nr_sects_read(struct hd_struct *part) static inline void part_nr_sects_write(struct hd_struct *part, sector_t size) { #if BITS_PER_LONG==32 && defined(CONFIG_SMP) + preempt_disable(); write_seqcount_begin(&part->nr_sects_seq); part->nr_sects = size; write_seqcount_end(&part->nr_sects_seq); + preempt_enable(); #elif BITS_PER_LONG==32 && defined(CONFIG_PREEMPTION) preempt_disable(); part->nr_sects = size; @ include/linux/hardirq.h:71 @ extern void irq_exit(void); #define nmi_enter() \ do { \ arch_nmi_enter(); \ - printk_nmi_enter(); \ lockdep_off(); \ ftrace_nmi_enter(); \ BUG_ON(in_nmi()); \ @ include/linux/hardirq.h:87 @ extern void irq_exit(void); preempt_count_sub(NMI_OFFSET + HARDIRQ_OFFSET); \ ftrace_nmi_exit(); \ lockdep_on(); \ - printk_nmi_exit(); \ arch_nmi_exit(); \ } while (0) @ include/linux/highmem.h:11 @ #include <linux/mm.h> #include <linux/uaccess.h> #include <linux/hardirq.h> +#include <linux/sched.h> #include <asm/cacheflush.h> @ include/linux/highmem.h:94 @ static inline void kunmap(struct page *page) static inline void *kmap_atomic(struct page *page) { - preempt_disable(); + preempt_disable_nort(); pagefault_disable(); return page_address(page); } @ include/linux/highmem.h:103 @ static inline void *kmap_atomic(struct page *page) static inline void __kunmap_atomic(void *addr) { pagefault_enable(); - preempt_enable(); + preempt_enable_nort(); } #define kmap_atomic_pfn(pfn) kmap_atomic(pfn_to_page(pfn)) @ include/linux/highmem.h:115 @ static inline void __kunmap_atomic(void *addr) #if defined(CONFIG_HIGHMEM) || defined(CONFIG_X86_32) +#ifndef CONFIG_PREEMPT_RT DECLARE_PER_CPU(int, __kmap_atomic_idx); +#endif static inline int kmap_atomic_idx_push(void) { +#ifndef CONFIG_PREEMPT_RT int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1; -#ifdef CONFIG_DEBUG_HIGHMEM +# ifdef CONFIG_DEBUG_HIGHMEM WARN_ON_ONCE(in_irq() && !irqs_disabled()); BUG_ON(idx >= KM_TYPE_NR); -#endif +# endif return idx; +#else + current->kmap_idx++; + BUG_ON(current->kmap_idx > KM_TYPE_NR); + return current->kmap_idx - 1; +#endif } static inline int kmap_atomic_idx(void) { +#ifndef CONFIG_PREEMPT_RT return __this_cpu_read(__kmap_atomic_idx) - 1; +#else + return current->kmap_idx - 1; +#endif } static inline void kmap_atomic_idx_pop(void) { -#ifdef CONFIG_DEBUG_HIGHMEM +#ifndef CONFIG_PREEMPT_RT +# ifdef CONFIG_DEBUG_HIGHMEM int idx = __this_cpu_dec_return(__kmap_atomic_idx); BUG_ON(idx < 0); -#else +# else __this_cpu_dec(__kmap_atomic_idx); +# endif +#else + current->kmap_idx--; +# ifdef CONFIG_DEBUG_HIGHMEM + BUG_ON(current->kmap_idx < 0); +# endif #endif } @ include/linux/hrtimer.h:162 @ struct hrtimer_clock_base { struct hrtimer_cpu_base *cpu_base; unsigned int index; clockid_t clockid; - seqcount_t seq; + seqcount_raw_spinlock_t seq; struct hrtimer *running; struct timerqueue_head active; ktime_t (*get_time)(void); @ include/linux/idr.h:172 @ static inline bool idr_is_empty(const struct idr *idr) * Each idr_preload() should be matched with an invocation of this * function. See idr_preload() for details. */ -static inline void idr_preload_end(void) -{ - preempt_enable(); -} +void idr_preload_end(void); /** * idr_for_each_entry() - Iterate over an IDR's elements of a given type. @ include/linux/interrupt.h:561 @ struct softirq_action asmlinkage void do_softirq(void); asmlinkage void __do_softirq(void); -#ifdef __ARCH_HAS_DO_SOFTIRQ +#if defined(__ARCH_HAS_DO_SOFTIRQ) && !defined(CONFIG_PREEMPT_RT) void do_softirq_own_stack(void); #else static inline void do_softirq_own_stack(void) @ include/linux/interrupt.h:576 @ extern void __raise_softirq_irqoff(unsigned int nr); extern void raise_softirq_irqoff(unsigned int nr); extern void raise_softirq(unsigned int nr); +extern void softirq_check_pending_idle(void); DECLARE_PER_CPU(struct task_struct *, ksoftirqd); @ include/linux/interrupt.h:641 @ static inline void tasklet_unlock(struct tasklet_struct *t) static inline void tasklet_unlock_wait(struct tasklet_struct *t) { - while (test_bit(TASKLET_STATE_RUN, &(t)->state)) { barrier(); } + while (test_bit(TASKLET_STATE_RUN, &(t)->state)) { + local_bh_disable(); + local_bh_enable(); + } } #else #define tasklet_trylock(t) 1 @ include/linux/irq_work.h:21 @ /* Doesn't want IPI, wait for tick: */ #define IRQ_WORK_LAZY BIT(2) +/* Run hard IRQ context, even on RT */ +#define IRQ_WORK_HARD_IRQ BIT(3) #define IRQ_WORK_CLAIMED (IRQ_WORK_PENDING | IRQ_WORK_BUSY) @ include/linux/irq_work.h:61 @ static inline bool irq_work_needs_cpu(void) { return false; } static inline void irq_work_run(void) { } #endif +#if defined(CONFIG_IRQ_WORK) && defined(CONFIG_PREEMPT_RT) +void irq_work_tick_soft(void); +#else +static inline void irq_work_tick_soft(void) { } +#endif + #endif /* _LINUX_IRQ_WORK_H */ @ include/linux/irqdesc.h:75 @ struct irq_desc { unsigned int irqs_unhandled; atomic_t threads_handled; int threads_handled_last; + u64 random_ip; raw_spinlock_t lock; struct cpumask *percpu_enabled; const struct cpumask *percpu_affinity; @ include/linux/irqflags.h:46 @ do { \ do { \ current->hardirq_context--; \ } while (0) -# define lockdep_softirq_enter() \ -do { \ - current->softirq_context++; \ -} while (0) -# define lockdep_softirq_exit() \ -do { \ - current->softirq_context--; \ -} while (0) #else # define trace_hardirqs_on() do { } while (0) # define trace_hardirqs_off() do { } while (0) @ include/linux/irqflags.h:59 @ do { \ # define lockdep_softirq_exit() do { } while (0) #endif +#if defined(CONFIG_TRACE_IRQFLAGS) && !defined(CONFIG_PREEMPT_RT) +# define lockdep_softirq_enter() \ +do { \ + current->softirq_context++; \ +} while (0) +# define lockdep_softirq_exit() \ +do { \ + current->softirq_context--; \ +} while (0) + +#else +# define lockdep_softirq_enter() do { } while (0) +# define lockdep_softirq_exit() do { } while (0) +#endif + #if defined(CONFIG_IRQSOFF_TRACER) || \ defined(CONFIG_PREEMPT_TRACER) extern void stop_critical_timings(void); @ include/linux/kernel.h:221 @ extern void __cant_sleep(const char *file, int line, int preempt_offset); */ # define might_sleep() \ do { __might_sleep(__FILE__, __LINE__, 0); might_resched(); } while (0) + +# define might_sleep_no_state_check() \ + do { ___might_sleep(__FILE__, __LINE__, 0); might_resched(); } while (0) + /** * cant_sleep - annotation for functions that cannot sleep * @ include/linux/kernel.h:256 @ extern void __cant_sleep(const char *file, int line, int preempt_offset); static inline void __might_sleep(const char *file, int line, int preempt_offset) { } # define might_sleep() do { might_resched(); } while (0) +# define might_sleep_no_state_check() do { might_resched(); } while (0) # define cant_sleep() do { } while (0) # define sched_annotate_sleep() do { } while (0) # define non_block_start() do { } while (0) @ include/linux/kernel.h:265 @ extern void __cant_sleep(const char *file, int line, int preempt_offset); #define might_sleep_if(cond) do { if (cond) might_sleep(); } while (0) +#ifndef CONFIG_PREEMPT_RT +# define cant_migrate() cant_sleep() +#else + /* Placeholder for now */ +# define cant_migrate() do { } while (0) +#endif + /** * abs - return absolute value of an argument * @x: the value. If it is unsigned type, it is converted to signed type first. @ include/linux/kmsg_dump.h:49 @ struct kmsg_dumper { bool registered; /* private state of the kmsg iterator */ - u32 cur_idx; - u32 next_idx; - u64 cur_seq; - u64 next_seq; + u64 line_seq; + u64 buffer_end_seq; }; #ifdef CONFIG_PRINTK @ include/linux/kvm_irqfd.h:45 @ struct kvm_kernel_irqfd { wait_queue_entry_t wait; /* Update side is protected by irqfds.lock */ struct kvm_kernel_irq_routing_entry irq_entry; - seqcount_t irq_entry_sc; + seqcount_spinlock_t irq_entry_sc; /* Used for level IRQ fast-path */ int gsi; struct work_struct inject; @ include/linux/locallock.h:4 @ +#ifndef _LINUX_LOCALLOCK_H +#define _LINUX_LOCALLOCK_H + +#include <linux/percpu.h> +#include <linux/spinlock.h> +#include <asm/current.h> + +#ifdef CONFIG_PREEMPT_RT + +#ifdef CONFIG_DEBUG_SPINLOCK +# define LL_WARN(cond) WARN_ON(cond) +#else +# define LL_WARN(cond) do { } while (0) +#endif + +/* + * per cpu lock based substitute for local_irq_*() + */ +struct local_irq_lock { + spinlock_t lock; + struct task_struct *owner; + int nestcnt; + unsigned long flags; +}; + +#define DEFINE_LOCAL_IRQ_LOCK(lvar) \ + DEFINE_PER_CPU(struct local_irq_lock, lvar) = { \ + .lock = __SPIN_LOCK_UNLOCKED((lvar).lock) } + +#define DECLARE_LOCAL_IRQ_LOCK(lvar) \ + DECLARE_PER_CPU(struct local_irq_lock, lvar) + +#define local_irq_lock_init(lvar) \ + do { \ + int __cpu; \ + for_each_possible_cpu(__cpu) \ + spin_lock_init(&per_cpu(lvar, __cpu).lock); \ + } while (0) + +static inline void __local_lock(struct local_irq_lock *lv) +{ + if (lv->owner != current) { + spin_lock(&lv->lock); + LL_WARN(lv->owner); + LL_WARN(lv->nestcnt); + lv->owner = current; + } + lv->nestcnt++; +} + +#define local_lock(lvar) \ + do { __local_lock(&get_local_var(lvar)); } while (0) + +static inline int __local_trylock(struct local_irq_lock *lv) +{ + if (lv->owner != current && spin_trylock(&lv->lock)) { + LL_WARN(lv->owner); + LL_WARN(lv->nestcnt); + lv->owner = current; + lv->nestcnt = 1; + return 1; + } else if (lv->owner == current) { + lv->nestcnt++; + return 1; + } + return 0; +} + +#define local_trylock(lvar) \ + ({ \ + int __locked; \ + __locked = __local_trylock(&get_local_var(lvar)); \ + if (!__locked) \ + put_local_var(lvar); \ + __locked; \ + }) + +static inline void __local_unlock(struct local_irq_lock *lv) +{ + LL_WARN(lv->nestcnt == 0); + LL_WARN(lv->owner != current); + if (--lv->nestcnt) + return; + + lv->owner = NULL; + spin_unlock(&lv->lock); +} + +#define local_unlock(lvar) \ + do { \ + __local_unlock(this_cpu_ptr(&lvar)); \ + put_local_var(lvar); \ + } while (0) + +static inline void __local_lock_irq(struct local_irq_lock *lv) +{ + spin_lock_irqsave(&lv->lock, lv->flags); + LL_WARN(lv->owner); + LL_WARN(lv->nestcnt); + lv->owner = current; + lv->nestcnt = 1; +} + +#define local_lock_irq(lvar) \ + do { __local_lock_irq(&get_local_var(lvar)); } while (0) + +static inline void __local_unlock_irq(struct local_irq_lock *lv) +{ + LL_WARN(!lv->nestcnt); + LL_WARN(lv->owner != current); + lv->owner = NULL; + lv->nestcnt = 0; + spin_unlock_irq(&lv->lock); +} + +#define local_unlock_irq(lvar) \ + do { \ + __local_unlock_irq(this_cpu_ptr(&lvar)); \ + put_local_var(lvar); \ + } while (0) + +static inline int __local_lock_irqsave(struct local_irq_lock *lv) +{ + if (lv->owner != current) { + __local_lock_irq(lv); + return 0; + } else { + lv->nestcnt++; + return 1; + } +} + +#define local_lock_irqsave(lvar, _flags) \ + do { \ + if (__local_lock_irqsave(&get_local_var(lvar))) \ + put_local_var(lvar); \ + _flags = __this_cpu_read(lvar.flags); \ + } while (0) + +static inline int __local_unlock_irqrestore(struct local_irq_lock *lv, + unsigned long flags) +{ + LL_WARN(!lv->nestcnt); + LL_WARN(lv->owner != current); + if (--lv->nestcnt) + return 0; + + lv->owner = NULL; + spin_unlock_irqrestore(&lv->lock, lv->flags); + return 1; +} + +#define local_unlock_irqrestore(lvar, flags) \ + do { \ + if (__local_unlock_irqrestore(this_cpu_ptr(&lvar), flags)) \ + put_local_var(lvar); \ + } while (0) + +#define local_spin_trylock_irq(lvar, lock) \ + ({ \ + int __locked; \ + local_lock_irq(lvar); \ + __locked = spin_trylock(lock); \ + if (!__locked) \ + local_unlock_irq(lvar); \ + __locked; \ + }) + +#define local_spin_lock_irq(lvar, lock) \ + do { \ + local_lock_irq(lvar); \ + spin_lock(lock); \ + } while (0) + +#define local_spin_unlock_irq(lvar, lock) \ + do { \ + spin_unlock(lock); \ + local_unlock_irq(lvar); \ + } while (0) + +#define local_spin_lock_irqsave(lvar, lock, flags) \ + do { \ + local_lock_irqsave(lvar, flags); \ + spin_lock(lock); \ + } while (0) + +#define local_spin_unlock_irqrestore(lvar, lock, flags) \ + do { \ + spin_unlock(lock); \ + local_unlock_irqrestore(lvar, flags); \ + } while (0) + +#define get_locked_var(lvar, var) \ + (*({ \ + local_lock(lvar); \ + this_cpu_ptr(&var); \ + })) + +#define put_locked_var(lvar, var) local_unlock(lvar); + +#define get_locked_ptr(lvar, var) \ + ({ \ + local_lock(lvar); \ + this_cpu_ptr(var); \ + }) + +#define put_locked_ptr(lvar, var) local_unlock(lvar); + +#define local_lock_cpu(lvar) \ + ({ \ + local_lock(lvar); \ + smp_processor_id(); \ + }) + +#define local_unlock_cpu(lvar) local_unlock(lvar) + +#else /* PREEMPT_RT */ + +#define DEFINE_LOCAL_IRQ_LOCK(lvar) __typeof__(const int) lvar +#define DECLARE_LOCAL_IRQ_LOCK(lvar) extern __typeof__(const int) lvar + +static inline void local_irq_lock_init(int lvar) { } + +#define local_trylock(lvar) \ + ({ \ + preempt_disable(); \ + 1; \ + }) + +#define local_lock(lvar) preempt_disable() +#define local_unlock(lvar) preempt_enable() +#define local_lock_irq(lvar) local_irq_disable() +#define local_lock_irq_on(lvar, cpu) local_irq_disable() +#define local_unlock_irq(lvar) local_irq_enable() +#define local_unlock_irq_on(lvar, cpu) local_irq_enable() +#define local_lock_irqsave(lvar, flags) local_irq_save(flags) +#define local_unlock_irqrestore(lvar, flags) local_irq_restore(flags) + +#define local_spin_trylock_irq(lvar, lock) spin_trylock_irq(lock) +#define local_spin_lock_irq(lvar, lock) spin_lock_irq(lock) +#define local_spin_unlock_irq(lvar, lock) spin_unlock_irq(lock) +#define local_spin_lock_irqsave(lvar, lock, flags) \ + spin_lock_irqsave(lock, flags) +#define local_spin_unlock_irqrestore(lvar, lock, flags) \ + spin_unlock_irqrestore(lock, flags) + +#define get_locked_var(lvar, var) get_cpu_var(var) +#define put_locked_var(lvar, var) put_cpu_var(var) +#define get_locked_ptr(lvar, var) get_cpu_ptr(var) +#define put_locked_ptr(lvar, var) put_cpu_ptr(var) + +#define local_lock_cpu(lvar) get_cpu() +#define local_unlock_cpu(lvar) put_cpu() + +#endif + +#endif @ include/linux/mm_types.h:15 @ #include <linux/completion.h> #include <linux/cpumask.h> #include <linux/uprobes.h> +#include <linux/rcupdate.h> #include <linux/page-flags-layout.h> #include <linux/workqueue.h> @ include/linux/mm_types.h:529 @ struct mm_struct { bool tlb_flush_batched; #endif struct uprobes_state uprobes_state; +#ifdef CONFIG_PREEMPT_RT + struct rcu_head delayed_drop; +#endif #ifdef CONFIG_HUGETLB_PAGE atomic_long_t hugetlb_usage; #endif @ include/linux/mutex.h:25 @ struct ww_acquire_ctx; +#ifdef CONFIG_DEBUG_LOCK_ALLOC +# define __DEP_MAP_MUTEX_INITIALIZER(lockname) \ + , .dep_map = { .name = #lockname } +#else +# define __DEP_MAP_MUTEX_INITIALIZER(lockname) +#endif + +#ifdef CONFIG_PREEMPT_RT +# include <linux/mutex_rt.h> +#else + /* * Simple, straightforward mutexes with strict semantics: * @ include/linux/mutex.h:122 @ do { \ __mutex_init((mutex), #mutex, &__key); \ } while (0) -#ifdef CONFIG_DEBUG_LOCK_ALLOC -# define __DEP_MAP_MUTEX_INITIALIZER(lockname) \ - , .dep_map = { .name = #lockname } -#else -# define __DEP_MAP_MUTEX_INITIALIZER(lockname) -#endif - #define __MUTEX_INITIALIZER(lockname) \ { .owner = ATOMIC_LONG_INIT(0) \ , .wait_lock = __SPIN_LOCK_UNLOCKED(lockname.wait_lock) \ @ include/linux/mutex.h:217 @ enum mutex_trylock_recursive_enum { extern /* __deprecated */ __must_check enum mutex_trylock_recursive_enum mutex_trylock_recursive(struct mutex *lock); +#endif /* !PREEMPT_RT */ + #endif /* __LINUX_MUTEX_H */ @ include/linux/mutex_rt.h:4 @ +#ifndef __LINUX_MUTEX_RT_H +#define __LINUX_MUTEX_RT_H + +#ifndef __LINUX_MUTEX_H +#error "Please include mutex.h" +#endif + +#include <linux/rtmutex.h> + +/* FIXME: Just for __lockfunc */ +#include <linux/spinlock.h> + +struct mutex { + struct rt_mutex lock; +#ifdef CONFIG_DEBUG_LOCK_ALLOC + struct lockdep_map dep_map; +#endif +}; + +#define __MUTEX_INITIALIZER(mutexname) \ + { \ + .lock = __RT_MUTEX_INITIALIZER(mutexname.lock) \ + __DEP_MAP_MUTEX_INITIALIZER(mutexname) \ + } + +#define DEFINE_MUTEX(mutexname) \ + struct mutex mutexname = __MUTEX_INITIALIZER(mutexname) + +extern void __mutex_do_init(struct mutex *lock, const char *name, struct lock_class_key *key); +extern void __lockfunc _mutex_lock(struct mutex *lock); +extern void __lockfunc _mutex_lock_io(struct mutex *lock); +extern void __lockfunc _mutex_lock_io_nested(struct mutex *lock, int subclass); +extern int __lockfunc _mutex_lock_interruptible(struct mutex *lock); +extern int __lockfunc _mutex_lock_killable(struct mutex *lock); +extern void __lockfunc _mutex_lock_nested(struct mutex *lock, int subclass); +extern void __lockfunc _mutex_lock_nest_lock(struct mutex *lock, struct lockdep_map *nest_lock); +extern int __lockfunc _mutex_lock_interruptible_nested(struct mutex *lock, int subclass); +extern int __lockfunc _mutex_lock_killable_nested(struct mutex *lock, int subclass); +extern int __lockfunc _mutex_trylock(struct mutex *lock); +extern void __lockfunc _mutex_unlock(struct mutex *lock); + +#define mutex_is_locked(l) rt_mutex_is_locked(&(l)->lock) +#define mutex_lock(l) _mutex_lock(l) +#define mutex_lock_interruptible(l) _mutex_lock_interruptible(l) +#define mutex_lock_killable(l) _mutex_lock_killable(l) +#define mutex_trylock(l) _mutex_trylock(l) +#define mutex_unlock(l) _mutex_unlock(l) +#define mutex_lock_io(l) _mutex_lock_io(l); + +#define __mutex_owner(l) ((l)->lock.owner) + +#ifdef CONFIG_DEBUG_MUTEXES +#define mutex_destroy(l) rt_mutex_destroy(&(l)->lock) +#else +static inline void mutex_destroy(struct mutex *lock) {} +#endif + +#ifdef CONFIG_DEBUG_LOCK_ALLOC +# define mutex_lock_nested(l, s) _mutex_lock_nested(l, s) +# define mutex_lock_interruptible_nested(l, s) \ + _mutex_lock_interruptible_nested(l, s) +# define mutex_lock_killable_nested(l, s) \ + _mutex_lock_killable_nested(l, s) +# define mutex_lock_io_nested(l, s) _mutex_lock_io_nested(l, s) + +# define mutex_lock_nest_lock(lock, nest_lock) \ +do { \ + typecheck(struct lockdep_map *, &(nest_lock)->dep_map); \ + _mutex_lock_nest_lock(lock, &(nest_lock)->dep_map); \ +} while (0) + +#else +# define mutex_lock_nested(l, s) _mutex_lock(l) +# define mutex_lock_interruptible_nested(l, s) \ + _mutex_lock_interruptible(l) +# define mutex_lock_killable_nested(l, s) \ + _mutex_lock_killable(l) +# define mutex_lock_nest_lock(lock, nest_lock) mutex_lock(lock) +# define mutex_lock_io_nested(l, s) _mutex_lock_io(l) +#endif + +# define mutex_init(mutex) \ +do { \ + static struct lock_class_key __key; \ + \ + rt_mutex_init(&(mutex)->lock); \ + __mutex_do_init((mutex), #mutex, &__key); \ +} while (0) + +# define __mutex_init(mutex, name, key) \ +do { \ + rt_mutex_init(&(mutex)->lock); \ + __mutex_do_init((mutex), name, key); \ +} while (0) + +/** + * These values are chosen such that FAIL and SUCCESS match the + * values of the regular mutex_trylock(). + */ +enum mutex_trylock_recursive_enum { + MUTEX_TRYLOCK_FAILED = 0, + MUTEX_TRYLOCK_SUCCESS = 1, + MUTEX_TRYLOCK_RECURSIVE, +}; +/** + * mutex_trylock_recursive - trylock variant that allows recursive locking + * @lock: mutex to be locked + * + * This function should not be used, _ever_. It is purely for hysterical GEM + * raisins, and once those are gone this will be removed. + * + * Returns: + * MUTEX_TRYLOCK_FAILED - trylock failed, + * MUTEX_TRYLOCK_SUCCESS - lock acquired, + * MUTEX_TRYLOCK_RECURSIVE - we already owned the lock. + */ +int __rt_mutex_owner_current(struct rt_mutex *lock); + +static inline /* __deprecated */ __must_check enum mutex_trylock_recursive_enum +mutex_trylock_recursive(struct mutex *lock) +{ + if (unlikely(__rt_mutex_owner_current(&lock->lock))) + return MUTEX_TRYLOCK_RECURSIVE; + + return mutex_trylock(lock); +} + +extern int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock); + +#endif @ include/linux/netdevice.h:3093 @ struct softnet_data { unsigned int dropped; struct sk_buff_head input_pkt_queue; struct napi_struct backlog; + struct sk_buff_head tofree_queue; }; @ include/linux/nfs_xdr.h:1616 @ struct nfs_unlinkdata { struct nfs_removeargs args; struct nfs_removeres res; struct dentry *dentry; - wait_queue_head_t wq; + struct swait_queue_head wq; const struct cred *cred; struct nfs_fattr dir_attr; long timeout; @ include/linux/percpu-rwsem.h:6 @ #define _LINUX_PERCPU_RWSEM_H #include <linux/atomic.h> -#include <linux/rwsem.h> #include <linux/percpu.h> #include <linux/rcuwait.h> +#include <linux/wait.h> #include <linux/rcu_sync.h> #include <linux/lockdep.h> struct percpu_rw_semaphore { struct rcu_sync rss; unsigned int __percpu *read_count; - struct rw_semaphore rw_sem; /* slowpath */ - struct rcuwait writer; /* blocked writer */ - int readers_block; + struct rcuwait writer; + wait_queue_head_t waiters; + atomic_t block; +#ifdef CONFIG_DEBUG_LOCK_ALLOC + struct lockdep_map dep_map; +#endif }; +#ifdef CONFIG_DEBUG_LOCK_ALLOC +#define __PERCPU_RWSEM_DEP_MAP_INIT(lockname) .dep_map = { .name = #lockname }, +#else +#define __PERCPU_RWSEM_DEP_MAP_INIT(lockname) +#endif + #define __DEFINE_PERCPU_RWSEM(name, is_static) \ static DEFINE_PER_CPU(unsigned int, __percpu_rwsem_rc_##name); \ is_static struct percpu_rw_semaphore name = { \ .rss = __RCU_SYNC_INITIALIZER(name.rss), \ .read_count = &__percpu_rwsem_rc_##name, \ - .rw_sem = __RWSEM_INITIALIZER(name.rw_sem), \ .writer = __RCUWAIT_INITIALIZER(name.writer), \ + .waiters = __WAIT_QUEUE_HEAD_INITIALIZER(name.waiters), \ + .block = ATOMIC_INIT(0), \ + __PERCPU_RWSEM_DEP_MAP_INIT(name) \ } + #define DEFINE_PERCPU_RWSEM(name) \ __DEFINE_PERCPU_RWSEM(name, /* not static */) #define DEFINE_STATIC_PERCPU_RWSEM(name) \ __DEFINE_PERCPU_RWSEM(name, static) -extern int __percpu_down_read(struct percpu_rw_semaphore *, int); -extern void __percpu_up_read(struct percpu_rw_semaphore *); +extern bool __percpu_down_read(struct percpu_rw_semaphore *, bool); static inline void percpu_down_read(struct percpu_rw_semaphore *sem) { might_sleep(); - rwsem_acquire_read(&sem->rw_sem.dep_map, 0, 0, _RET_IP_); + rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_); preempt_disable(); /* @ include/linux/percpu-rwsem.h:62 @ static inline void percpu_down_read(struct percpu_rw_semaphore *sem) * and that once the synchronize_rcu() is done, the writer will see * anything we did within this RCU-sched read-size critical section. */ - __this_cpu_inc(*sem->read_count); - if (unlikely(!rcu_sync_is_idle(&sem->rss))) + if (likely(rcu_sync_is_idle(&sem->rss))) + __this_cpu_inc(*sem->read_count); + else __percpu_down_read(sem, false); /* Unconditional memory barrier */ /* * The preempt_enable() prevents the compiler from @ include/linux/percpu-rwsem.h:73 @ static inline void percpu_down_read(struct percpu_rw_semaphore *sem) preempt_enable(); } -static inline int percpu_down_read_trylock(struct percpu_rw_semaphore *sem) +static inline bool percpu_down_read_trylock(struct percpu_rw_semaphore *sem) { - int ret = 1; + bool ret = true; preempt_disable(); /* * Same as in percpu_down_read(). */ - __this_cpu_inc(*sem->read_count); - if (unlikely(!rcu_sync_is_idle(&sem->rss))) + if (likely(rcu_sync_is_idle(&sem->rss))) + __this_cpu_inc(*sem->read_count); + else ret = __percpu_down_read(sem, true); /* Unconditional memory barrier */ preempt_enable(); /* @ include/linux/percpu-rwsem.h:92 @ static inline int percpu_down_read_trylock(struct percpu_rw_semaphore *sem) */ if (ret) - rwsem_acquire_read(&sem->rw_sem.dep_map, 0, 1, _RET_IP_); + rwsem_acquire_read(&sem->dep_map, 0, 1, _RET_IP_); return ret; } static inline void percpu_up_read(struct percpu_rw_semaphore *sem) { + rwsem_release(&sem->dep_map, _RET_IP_); + preempt_disable(); /* * Same as in percpu_down_read(). */ - if (likely(rcu_sync_is_idle(&sem->rss))) + if (likely(rcu_sync_is_idle(&sem->rss))) { __this_cpu_dec(*sem->read_count); - else - __percpu_up_read(sem); /* Unconditional memory barrier */ + } else { + /* + * slowpath; reader will only ever wake a single blocked + * writer. + */ + smp_mb(); /* B matches C */ + /* + * In other words, if they see our decrement (presumably to + * aggregate zero, as that is the only time it matters) they + * will also see our critical section. + */ + __this_cpu_dec(*sem->read_count); + rcuwait_wake_up(&sem->writer); + } preempt_enable(); - - rwsem_release(&sem->rw_sem.dep_map, _RET_IP_); } extern void percpu_down_write(struct percpu_rw_semaphore *); @ include/linux/percpu-rwsem.h:138 @ extern void percpu_free_rwsem(struct percpu_rw_semaphore *); __percpu_init_rwsem(sem, #sem, &rwsem_key); \ }) -#define percpu_rwsem_is_held(sem) lockdep_is_held(&(sem)->rw_sem) - -#define percpu_rwsem_assert_held(sem) \ - lockdep_assert_held(&(sem)->rw_sem) +#define percpu_rwsem_is_held(sem) lockdep_is_held(sem) +#define percpu_rwsem_assert_held(sem) lockdep_assert_held(sem) static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem, bool read, unsigned long ip) { - lock_release(&sem->rw_sem.dep_map, ip); -#ifdef CONFIG_RWSEM_SPIN_ON_OWNER - if (!read) - atomic_long_set(&sem->rw_sem.owner, RWSEM_OWNER_UNKNOWN); -#endif + lock_release(&sem->dep_map, ip); } static inline void percpu_rwsem_acquire(struct percpu_rw_semaphore *sem, bool read, unsigned long ip) { - lock_acquire(&sem->rw_sem.dep_map, 0, 1, read, 1, NULL, ip); -#ifdef CONFIG_RWSEM_SPIN_ON_OWNER - if (!read) - atomic_long_set(&sem->rw_sem.owner, (long)current); -#endif + lock_acquire(&sem->dep_map, 0, 1, read, 1, NULL, ip); } #endif @ include/linux/percpu.h:22 @ #define PERCPU_MODULE_RESERVE 0 #endif +#ifdef CONFIG_PREEMPT_RT + +#define get_local_var(var) (*({ \ + migrate_disable(); \ + this_cpu_ptr(&var); })) + +#define put_local_var(var) do { \ + (void)&(var); \ + migrate_enable(); \ +} while (0) + +# define get_local_ptr(var) ({ \ + migrate_disable(); \ + this_cpu_ptr(var); }) + +# define put_local_ptr(var) do { \ + (void)(var); \ + migrate_enable(); \ +} while (0) + +#else + +#define get_local_var(var) get_cpu_var(var) +#define put_local_var(var) put_cpu_var(var) +#define get_local_ptr(var) get_cpu_ptr(var) +#define put_local_ptr(var) put_cpu_ptr(var) + +#endif + /* minimum unit size, also is the maximum supported allocation size */ #define PCPU_MIN_UNIT_SIZE PFN_ALIGN(32 << 10) @ include/linux/pid.h:6 @ #define _LINUX_PID_H #include <linux/rculist.h> +#include <linux/atomic.h> #include <linux/wait.h> #include <linux/refcount.h> @ include/linux/posix-timers.h:75 @ struct cpu_timer { struct task_struct *task; struct list_head elist; int firing; + int firing_cpu; }; static inline bool cpu_timer_enqueue(struct timerqueue_head *head, @ include/linux/posix-timers.h:127 @ struct posix_cputimers { struct posix_cputimer_base bases[CPUCLOCK_MAX]; unsigned int timers_active; unsigned int expiry_active; +#ifdef CONFIG_PREEMPT_RT + struct task_struct *posix_timer_list; +#endif }; static inline void posix_cputimers_init(struct posix_cputimers *pct) @ include/linux/posix-timers.h:159 @ static inline void posix_cputimers_rt_watchdog(struct posix_cputimers *pct, INIT_CPU_TIMERBASE(b[2]), \ } +#ifdef CONFIG_PREEMPT_RT +# define INIT_TIMER_LIST .posix_timer_list = NULL, +#else +# define INIT_TIMER_LIST +#endif + #define INIT_CPU_TIMERS(s) \ .posix_cputimers = { \ .bases = INIT_CPU_TIMERBASES(s.posix_cputimers.bases), \ + INIT_TIMER_LIST \ }, #else struct posix_cputimers { }; @ include/linux/preempt.h:81 @ #include <asm/preempt.h> #define hardirq_count() (preempt_count() & HARDIRQ_MASK) -#define softirq_count() (preempt_count() & SOFTIRQ_MASK) #define irq_count() (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK \ | NMI_MASK)) - /* * Are we doing bottom half or hardware interrupt processing? * @ include/linux/preempt.h:97 @ * should not be used in new code. */ #define in_irq() (hardirq_count()) -#define in_softirq() (softirq_count()) #define in_interrupt() (irq_count()) -#define in_serving_softirq() (softirq_count() & SOFTIRQ_OFFSET) #define in_nmi() (preempt_count() & NMI_MASK) #define in_task() (!(preempt_count() & \ (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET))) +#ifdef CONFIG_PREEMPT_RT + +#define softirq_count() ((long)current->softirq_count) +#define in_softirq() (softirq_count()) +#define in_serving_softirq() (current->softirq_count & SOFTIRQ_OFFSET) + +#else + +#define softirq_count() (preempt_count() & SOFTIRQ_MASK) +#define in_softirq() (softirq_count()) +#define in_serving_softirq() (softirq_count() & SOFTIRQ_OFFSET) + +#endif /* * The preempt_count offset after preempt_disable(); @ include/linux/preempt.h:127 @ /* * The preempt_count offset after spin_lock() */ +#if !defined(CONFIG_PREEMPT_RT) #define PREEMPT_LOCK_OFFSET PREEMPT_DISABLE_OFFSET +#else +#define PREEMPT_LOCK_OFFSET 0 +#endif /* * The preempt_count offset needed for things like: @ include/linux/preempt.h:180 @ extern void preempt_count_sub(int val); #define preempt_count_inc() preempt_count_add(1) #define preempt_count_dec() preempt_count_sub(1) +#ifdef CONFIG_PREEMPT_LAZY +#define add_preempt_lazy_count(val) do { preempt_lazy_count() += (val); } while (0) +#define sub_preempt_lazy_count(val) do { preempt_lazy_count() -= (val); } while (0) +#define inc_preempt_lazy_count() add_preempt_lazy_count(1) +#define dec_preempt_lazy_count() sub_preempt_lazy_count(1) +#define preempt_lazy_count() (current_thread_info()->preempt_lazy_count) +#else +#define add_preempt_lazy_count(val) do { } while (0) +#define sub_preempt_lazy_count(val) do { } while (0) +#define inc_preempt_lazy_count() do { } while (0) +#define dec_preempt_lazy_count() do { } while (0) +#define preempt_lazy_count() (0) +#endif + #ifdef CONFIG_PREEMPT_COUNT #define preempt_disable() \ @ include/linux/preempt.h:202 @ do { \ barrier(); \ } while (0) +#define preempt_lazy_disable() \ +do { \ + inc_preempt_lazy_count(); \ + barrier(); \ +} while (0) + #define sched_preempt_enable_no_resched() \ do { \ barrier(); \ preempt_count_dec(); \ } while (0) -#define preempt_enable_no_resched() sched_preempt_enable_no_resched() +#ifdef CONFIG_PREEMPT_RT +# define preempt_enable_no_resched() sched_preempt_enable_no_resched() +# define preempt_check_resched_rt() preempt_check_resched() +#else +# define preempt_enable_no_resched() preempt_enable() +# define preempt_check_resched_rt() barrier(); +#endif #define preemptible() (preempt_count() == 0 && !irqs_disabled()) @ include/linux/preempt.h:245 @ do { \ __preempt_schedule(); \ } while (0) +#define preempt_lazy_enable() \ +do { \ + dec_preempt_lazy_count(); \ + barrier(); \ + preempt_check_resched(); \ +} while (0) + #else /* !CONFIG_PREEMPTION */ #define preempt_enable() \ do { \ @ include/linux/preempt.h:259 @ do { \ preempt_count_dec(); \ } while (0) +#define preempt_lazy_enable() \ +do { \ + dec_preempt_lazy_count(); \ + barrier(); \ +} while (0) + #define preempt_enable_notrace() \ do { \ barrier(); \ @ include/linux/preempt.h:303 @ do { \ #define preempt_disable_notrace() barrier() #define preempt_enable_no_resched_notrace() barrier() #define preempt_enable_notrace() barrier() +#define preempt_check_resched_rt() barrier() #define preemptible() 0 #endif /* CONFIG_PREEMPT_COUNT */ @ include/linux/preempt.h:324 @ do { \ } while (0) #define preempt_fold_need_resched() \ do { \ - if (tif_need_resched()) \ + if (tif_need_resched_now()) \ set_preempt_need_resched(); \ } while (0) +#ifdef CONFIG_PREEMPT_RT +# define preempt_disable_rt() preempt_disable() +# define preempt_enable_rt() preempt_enable() +# define preempt_disable_nort() barrier() +# define preempt_enable_nort() barrier() +#else +# define preempt_disable_rt() barrier() +# define preempt_enable_rt() barrier() +# define preempt_disable_nort() preempt_disable() +# define preempt_enable_nort() preempt_enable() +#endif + #ifdef CONFIG_PREEMPT_NOTIFIERS struct preempt_notifier; @ include/linux/preempt.h:390 @ static inline void preempt_notifier_init(struct preempt_notifier *notifier, #endif +#if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT) + +extern void migrate_disable(void); +extern void migrate_enable(void); + +int __migrate_disabled(struct task_struct *p); + +#elif !defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT) + +extern void migrate_disable(void); +extern void migrate_enable(void); +static inline int __migrate_disabled(struct task_struct *p) +{ + return 0; +} + +#else +/** + * migrate_disable - Prevent migration of the current task + * + * Maps to preempt_disable() which also disables preemption. Use + * migrate_disable() to annotate that the intent is to prevent migration, + * but not necessarily preemption. + * + * Can be invoked nested like preempt_disable() and needs the corresponding + * number of migrate_enable() invocations. + */ +static __always_inline void migrate_disable(void) +{ + preempt_disable(); +} + +/** + * migrate_enable - Allow migration of the current task + * + * Counterpart to migrate_disable(). + * + * As migrate_disable() can be invoked nested, only the outermost invocation + * reenables migration. + * + * Currently mapped to preempt_enable(). + */ +static __always_inline void migrate_enable(void) +{ + preempt_enable(); +} + +static inline int __migrate_disabled(struct task_struct *p) +{ + return 0; +} +#endif #endif /* __LINUX_PREEMPT_H */ @ include/linux/printk.h:61 @ static inline const char *printk_skip_headers(const char *buffer) */ #define CONSOLE_LOGLEVEL_DEFAULT CONFIG_CONSOLE_LOGLEVEL_DEFAULT #define CONSOLE_LOGLEVEL_QUIET CONFIG_CONSOLE_LOGLEVEL_QUIET +#define CONSOLE_LOGLEVEL_EMERGENCY CONFIG_CONSOLE_LOGLEVEL_EMERGENCY extern int console_printk[]; @ include/linux/printk.h:69 @ extern int console_printk[]; #define default_message_loglevel (console_printk[1]) #define minimum_console_loglevel (console_printk[2]) #define default_console_loglevel (console_printk[3]) +#define emergency_console_loglevel (console_printk[4]) static inline void console_silent(void) { @ include/linux/printk.h:151 @ static inline __printf(1, 2) __cold void early_printk(const char *s, ...) { } #endif -#ifdef CONFIG_PRINTK_NMI -extern void printk_nmi_enter(void); -extern void printk_nmi_exit(void); -extern void printk_nmi_direct_enter(void); -extern void printk_nmi_direct_exit(void); -#else -static inline void printk_nmi_enter(void) { } -static inline void printk_nmi_exit(void) { } -static inline void printk_nmi_direct_enter(void) { } -static inline void printk_nmi_direct_exit(void) { } -#endif /* PRINTK_NMI */ - #ifdef CONFIG_PRINTK asmlinkage __printf(5, 0) int vprintk_emit(int facility, int level, @ include/linux/printk.h:195 @ __printf(1, 2) void dump_stack_set_arch_desc(const char *fmt, ...); void dump_stack_print_info(const char *log_lvl); void show_regs_print_info(const char *log_lvl); extern asmlinkage void dump_stack(void) __cold; -extern void printk_safe_flush(void); -extern void printk_safe_flush_on_panic(void); +struct wait_queue_head *printk_wait_queue(void); #else static inline __printf(1, 0) int vprintk(const char *s, va_list args) @ include/linux/printk.h:260 @ static inline void dump_stack(void) { } -static inline void printk_safe_flush(void) -{ -} - -static inline void printk_safe_flush_on_panic(void) -{ -} #endif extern int kptr_restrict; @ include/linux/printk_ringbuffer.h:4 @ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_PRINTK_RINGBUFFER_H +#define _LINUX_PRINTK_RINGBUFFER_H + +#include <linux/irq_work.h> +#include <linux/atomic.h> +#include <linux/percpu.h> +#include <linux/wait.h> + +struct prb_cpulock { + atomic_t owner; + unsigned long __percpu *irqflags; +}; + +struct printk_ringbuffer { + void *buffer; + unsigned int size_bits; + + u64 seq; + atomic_long_t lost; + + atomic_long_t tail; + atomic_long_t head; + atomic_long_t reserve; + + struct prb_cpulock *cpulock; + atomic_t ctx; + + struct wait_queue_head *wq; + atomic_long_t wq_counter; + struct irq_work *wq_work; +}; + +struct prb_entry { + unsigned int size; + u64 seq; + char data[0]; +}; + +struct prb_handle { + struct printk_ringbuffer *rb; + unsigned int cpu; + struct prb_entry *entry; +}; + +#define DECLARE_STATIC_PRINTKRB_CPULOCK(name) \ +static DEFINE_PER_CPU(unsigned long, _##name##_percpu_irqflags); \ +static struct prb_cpulock name = { \ + .owner = ATOMIC_INIT(-1), \ + .irqflags = &_##name##_percpu_irqflags, \ +} + +#define PRB_INIT ((unsigned long)-1) + +#define DECLARE_STATIC_PRINTKRB_ITER(name, rbaddr) \ +static struct prb_iterator name = { \ + .rb = rbaddr, \ + .lpos = PRB_INIT, \ +} + +struct prb_iterator { + struct printk_ringbuffer *rb; + unsigned long lpos; +}; + +#define DECLARE_STATIC_PRINTKRB(name, szbits, cpulockptr) \ +static char _##name##_buffer[1 << (szbits)] \ + __aligned(__alignof__(long)); \ +static DECLARE_WAIT_QUEUE_HEAD(_##name##_wait); \ +static void _##name##_wake_work_func(struct irq_work *irq_work) \ +{ \ + wake_up_interruptible_all(&_##name##_wait); \ +} \ +static struct irq_work _##name##_wake_work = { \ + .func = _##name##_wake_work_func, \ + .flags = ATOMIC_INIT(IRQ_WORK_LAZY), \ +}; \ +static struct printk_ringbuffer name = { \ + .buffer = &_##name##_buffer[0], \ + .size_bits = szbits, \ + .seq = 0, \ + .lost = ATOMIC_LONG_INIT(0), \ + .tail = ATOMIC_LONG_INIT(-111 * sizeof(long)), \ + .head = ATOMIC_LONG_INIT(-111 * sizeof(long)), \ + .reserve = ATOMIC_LONG_INIT(-111 * sizeof(long)), \ + .cpulock = cpulockptr, \ + .ctx = ATOMIC_INIT(0), \ + .wq = &_##name##_wait, \ + .wq_counter = ATOMIC_LONG_INIT(0), \ + .wq_work = &_##name##_wake_work, \ +} + +/* writer interface */ +char *prb_reserve(struct prb_handle *h, struct printk_ringbuffer *rb, + unsigned int size); +void prb_commit(struct prb_handle *h); + +/* reader interface */ +void prb_iter_init(struct prb_iterator *iter, struct printk_ringbuffer *rb, + u64 *seq); +void prb_iter_copy(struct prb_iterator *dest, struct prb_iterator *src); +int prb_iter_next(struct prb_iterator *iter, char *buf, int size, u64 *seq); +int prb_iter_wait_next(struct prb_iterator *iter, char *buf, int size, + u64 *seq); +int prb_iter_seek(struct prb_iterator *iter, u64 seq); +int prb_iter_data(struct prb_iterator *iter, char *buf, int size, u64 *seq); + +/* utility functions */ +int prb_buffer_size(struct printk_ringbuffer *rb); +void prb_inc_lost(struct printk_ringbuffer *rb); +void prb_lock(struct prb_cpulock *cpu_lock, unsigned int *cpu_store); +void prb_unlock(struct prb_cpulock *cpu_lock, unsigned int cpu_store); + +#endif /*_LINUX_PRINTK_RINGBUFFER_H */ @ include/linux/radix-tree.h:229 @ unsigned int radix_tree_gang_lookup(const struct radix_tree_root *, unsigned int max_items); int radix_tree_preload(gfp_t gfp_mask); int radix_tree_maybe_preload(gfp_t gfp_mask); +void radix_tree_preload_end(void); void radix_tree_init(void); void *radix_tree_tag_set(struct radix_tree_root *, unsigned long index, unsigned int tag); @ include/linux/radix-tree.h:247 @ unsigned int radix_tree_gang_lookup_tag_slot(const struct radix_tree_root *, unsigned int max_items, unsigned int tag); int radix_tree_tagged(const struct radix_tree_root *, unsigned int tag); -static inline void radix_tree_preload_end(void) -{ - preempt_enable(); -} - void __rcu **idr_get_free(struct radix_tree_root *root, struct radix_tree_iter *iter, gfp_t gfp, unsigned long max); @ include/linux/random.h:36 @ static inline void add_latent_entropy(void) {} extern void add_input_randomness(unsigned int type, unsigned int code, unsigned int value) __latent_entropy; -extern void add_interrupt_randomness(int irq, int irq_flags) __latent_entropy; +extern void add_interrupt_randomness(int irq, int irq_flags, __u64 ip) __latent_entropy; extern void get_random_bytes(void *buf, int nbytes); extern int wait_for_random_bytes(void); @ include/linux/ratelimit.h:62 @ static inline void ratelimit_state_exit(struct ratelimit_state *rs) return; if (rs->missed) { - pr_warn("%s: %d output lines suppressed due to ratelimiting\n", + pr_info("%s: %d output lines suppressed due to ratelimiting\n", current->comm, rs->missed); rs->missed = 0; } @ include/linux/rbtree.h:22 @ #include <linux/kernel.h> #include <linux/stddef.h> -#include <linux/rcupdate.h> +#include <linux/rcu_assign_pointer.h> struct rb_node { unsigned long __rb_parent_color; @ include/linux/rcu_assign_pointer.h:4 @ +/* SPDX-License-Identifier: GPL-2.0+ */ +#ifndef __LINUX_RCU_ASSIGN_POINTER_H__ +#define __LINUX_RCU_ASSIGN_POINTER_H__ +#include <linux/compiler.h> +#include <asm/barrier.h> + +#ifdef __CHECKER__ +#define rcu_check_sparse(p, space) \ + ((void)(((typeof(*p) space *)p) == p)) +#else /* #ifdef __CHECKER__ */ +#define rcu_check_sparse(p, space) +#endif /* #else #ifdef __CHECKER__ */ + +/** + * RCU_INITIALIZER() - statically initialize an RCU-protected global variable + * @v: The value to statically initialize with. + */ +#define RCU_INITIALIZER(v) (typeof(*(v)) __force __rcu *)(v) + +/** + * rcu_assign_pointer() - assign to RCU-protected pointer + * @p: pointer to assign to + * @v: value to assign (publish) + * + * Assigns the specified value to the specified RCU-protected + * pointer, ensuring that any concurrent RCU readers will see + * any prior initialization. + * + * Inserts memory barriers on architectures that require them + * (which is most of them), and also prevents the compiler from + * reordering the code that initializes the structure after the pointer + * assignment. More importantly, this call documents which pointers + * will be dereferenced by RCU read-side code. + * + * In some special cases, you may use RCU_INIT_POINTER() instead + * of rcu_assign_pointer(). RCU_INIT_POINTER() is a bit faster due + * to the fact that it does not constrain either the CPU or the compiler. + * That said, using RCU_INIT_POINTER() when you should have used + * rcu_assign_pointer() is a very bad thing that results in + * impossible-to-diagnose memory corruption. So please be careful. + * See the RCU_INIT_POINTER() comment header for details. + * + * Note that rcu_assign_pointer() evaluates each of its arguments only + * once, appearances notwithstanding. One of the "extra" evaluations + * is in typeof() and the other visible only to sparse (__CHECKER__), + * neither of which actually execute the argument. As with most cpp + * macros, this execute-arguments-only-once property is important, so + * please be careful when making changes to rcu_assign_pointer() and the + * other macros that it invokes. + */ +#define rcu_assign_pointer(p, v) \ +do { \ + uintptr_t _r_a_p__v = (uintptr_t)(v); \ + rcu_check_sparse(p, __rcu); \ + \ + if (__builtin_constant_p(v) && (_r_a_p__v) == (uintptr_t)NULL) \ + WRITE_ONCE((p), (typeof(p))(_r_a_p__v)); \ + else \ + smp_store_release(&p, RCU_INITIALIZER((typeof(p))_r_a_p__v)); \ +} while (0) + +#endif @ include/linux/rcupdate.h:32 @ #include <linux/lockdep.h> #include <asm/processor.h> #include <linux/cpumask.h> +#include <linux/rcu_assign_pointer.h> #define ULONG_CMP_GE(a, b) (ULONG_MAX / 2 >= (a) - (b)) #define ULONG_CMP_LT(a, b) (ULONG_MAX / 2 < (a) - (b)) @ include/linux/rcupdate.h:55 @ void __rcu_read_unlock(void); * types of kernel builds, the rcu_read_lock() nesting depth is unknowable. */ #define rcu_preempt_depth() (current->rcu_read_lock_nesting) +#ifndef CONFIG_PREEMPT_RT +#define sched_rcu_preempt_depth() rcu_preempt_depth() +#else +static inline int sched_rcu_preempt_depth(void) { return 0; } +#endif #else /* #ifdef CONFIG_PREEMPT_RCU */ @ include/linux/rcupdate.h:78 @ static inline int rcu_preempt_depth(void) return 0; } +#define sched_rcu_preempt_depth() rcu_preempt_depth() + #endif /* #else #ifdef CONFIG_PREEMPT_RCU */ /* Internal to kernel */ @ include/linux/rcupdate.h:290 @ static inline void rcu_preempt_sleep_check(void) { } #define rcu_sleep_check() \ do { \ rcu_preempt_sleep_check(); \ - RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map), \ + if (!IS_ENABLED(CONFIG_PREEMPT_RT)) \ + RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map), \ "Illegal context switch in RCU-bh read-side critical section"); \ RCU_LOCKDEP_WARN(lock_is_held(&rcu_sched_lock_map), \ "Illegal context switch in RCU-sched read-side critical section"); \ @ include/linux/rcupdate.h:312 @ static inline void rcu_preempt_sleep_check(void) { } * (e.g., __srcu), should this make sense in the future. */ -#ifdef __CHECKER__ -#define rcu_check_sparse(p, space) \ - ((void)(((typeof(*p) space *)p) == p)) -#else /* #ifdef __CHECKER__ */ -#define rcu_check_sparse(p, space) -#endif /* #else #ifdef __CHECKER__ */ - #define __rcu_access_pointer(p, space) \ ({ \ typeof(*p) *_________p1 = (typeof(*p) *__force)READ_ONCE(p); \ @ include/linux/rcupdate.h:339 @ static inline void rcu_preempt_sleep_check(void) { } ((typeof(*p) __force __kernel *)(________p1)); \ }) -/** - * RCU_INITIALIZER() - statically initialize an RCU-protected global variable - * @v: The value to statically initialize with. - */ -#define RCU_INITIALIZER(v) (typeof(*(v)) __force __rcu *)(v) - -/** - * rcu_assign_pointer() - assign to RCU-protected pointer - * @p: pointer to assign to - * @v: value to assign (publish) - * - * Assigns the specified value to the specified RCU-protected - * pointer, ensuring that any concurrent RCU readers will see - * any prior initialization. - * - * Inserts memory barriers on architectures that require them - * (which is most of them), and also prevents the compiler from - * reordering the code that initializes the structure after the pointer - * assignment. More importantly, this call documents which pointers - * will be dereferenced by RCU read-side code. - * - * In some special cases, you may use RCU_INIT_POINTER() instead - * of rcu_assign_pointer(). RCU_INIT_POINTER() is a bit faster due - * to the fact that it does not constrain either the CPU or the compiler. - * That said, using RCU_INIT_POINTER() when you should have used - * rcu_assign_pointer() is a very bad thing that results in - * impossible-to-diagnose memory corruption. So please be careful. - * See the RCU_INIT_POINTER() comment header for details. - * - * Note that rcu_assign_pointer() evaluates each of its arguments only - * once, appearances notwithstanding. One of the "extra" evaluations - * is in typeof() and the other visible only to sparse (__CHECKER__), - * neither of which actually execute the argument. As with most cpp - * macros, this execute-arguments-only-once property is important, so - * please be careful when making changes to rcu_assign_pointer() and the - * other macros that it invokes. - */ -#define rcu_assign_pointer(p, v) \ -do { \ - uintptr_t _r_a_p__v = (uintptr_t)(v); \ - rcu_check_sparse(p, __rcu); \ - \ - if (__builtin_constant_p(v) && (_r_a_p__v) == (uintptr_t)NULL) \ - WRITE_ONCE((p), (typeof(p))(_r_a_p__v)); \ - else \ - smp_store_release(&p, RCU_INITIALIZER((typeof(p))_r_a_p__v)); \ -} while (0) - /** * rcu_replace_pointer() - replace an RCU pointer, returning its old value * @rcu_ptr: RCU pointer, whose old value is returned @ include/linux/rtmutex.h:17 @ #define __LINUX_RT_MUTEX_H #include <linux/linkage.h> +#include <linux/spinlock_types_raw.h> #include <linux/rbtree.h> -#include <linux/spinlock_types.h> extern int max_lock_depth; /* for sysctl */ +#ifdef CONFIG_DEBUG_MUTEXES +#include <linux/debug_locks.h> +#endif + /** * The rt_mutex structure * @ include/linux/rtmutex.h:38 @ struct rt_mutex { raw_spinlock_t wait_lock; struct rb_root_cached waiters; struct task_struct *owner; -#ifdef CONFIG_DEBUG_RT_MUTEXES int save_state; +#ifdef CONFIG_DEBUG_RT_MUTEXES const char *name, *file; int line; void *magic; @ include/linux/rtmutex.h:89 @ do { \ #define __DEP_MAP_RT_MUTEX_INITIALIZER(mutexname) #endif -#define __RT_MUTEX_INITIALIZER(mutexname) \ - { .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(mutexname.wait_lock) \ +#define __RT_MUTEX_INITIALIZER_PLAIN(mutexname) \ + .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(mutexname.wait_lock) \ , .waiters = RB_ROOT_CACHED \ , .owner = NULL \ __DEBUG_RT_MUTEX_INITIALIZER(mutexname) \ - __DEP_MAP_RT_MUTEX_INITIALIZER(mutexname)} + __DEP_MAP_RT_MUTEX_INITIALIZER(mutexname) + +#define __RT_MUTEX_INITIALIZER(mutexname) \ + { __RT_MUTEX_INITIALIZER_PLAIN(mutexname) } #define DEFINE_RT_MUTEX(mutexname) \ struct rt_mutex mutexname = __RT_MUTEX_INITIALIZER(mutexname) +#define __RT_MUTEX_INITIALIZER_SAVE_STATE(mutexname) \ + { __RT_MUTEX_INITIALIZER_PLAIN(mutexname) \ + , .save_state = 1 } + /** * rt_mutex_is_locked - is the mutex locked * @lock: the mutex to be queried @ include/linux/rtmutex.h:129 @ extern void rt_mutex_lock(struct rt_mutex *lock); #endif extern int rt_mutex_lock_interruptible(struct rt_mutex *lock); +extern int rt_mutex_lock_killable(struct rt_mutex *lock); extern int rt_mutex_timed_lock(struct rt_mutex *lock, struct hrtimer_sleeper *timeout); @ include/linux/rwlock_rt.h:4 @ +#ifndef __LINUX_RWLOCK_RT_H +#define __LINUX_RWLOCK_RT_H + +#ifndef __LINUX_SPINLOCK_H +#error Do not include directly. Use spinlock.h +#endif + +extern void __lockfunc rt_write_lock(rwlock_t *rwlock); +extern void __lockfunc rt_read_lock(rwlock_t *rwlock); +extern int __lockfunc rt_write_trylock(rwlock_t *rwlock); +extern int __lockfunc rt_read_trylock(rwlock_t *rwlock); +extern void __lockfunc rt_write_unlock(rwlock_t *rwlock); +extern void __lockfunc rt_read_unlock(rwlock_t *rwlock); +extern int __lockfunc rt_read_can_lock(rwlock_t *rwlock); +extern int __lockfunc rt_write_can_lock(rwlock_t *rwlock); +extern void __rt_rwlock_init(rwlock_t *rwlock, char *name, struct lock_class_key *key); + +#define read_can_lock(rwlock) rt_read_can_lock(rwlock) +#define write_can_lock(rwlock) rt_write_can_lock(rwlock) + +#define read_trylock(lock) __cond_lock(lock, rt_read_trylock(lock)) +#define write_trylock(lock) __cond_lock(lock, rt_write_trylock(lock)) + +static inline int __write_trylock_rt_irqsave(rwlock_t *lock, unsigned long *flags) +{ + /* XXX ARCH_IRQ_ENABLED */ + *flags = 0; + return rt_write_trylock(lock); +} + +#define write_trylock_irqsave(lock, flags) \ + __cond_lock(lock, __write_trylock_rt_irqsave(lock, &(flags))) + +#define read_lock_irqsave(lock, flags) \ + do { \ + typecheck(unsigned long, flags); \ + rt_read_lock(lock); \ + flags = 0; \ + } while (0) + +#define write_lock_irqsave(lock, flags) \ + do { \ + typecheck(unsigned long, flags); \ + rt_write_lock(lock); \ + flags = 0; \ + } while (0) + +#define read_lock(lock) rt_read_lock(lock) + +#define read_lock_bh(lock) \ + do { \ + local_bh_disable(); \ + rt_read_lock(lock); \ + } while (0) + +#define read_lock_irq(lock) read_lock(lock) + +#define write_lock(lock) rt_write_lock(lock) + +#define write_lock_bh(lock) \ + do { \ + local_bh_disable(); \ + rt_write_lock(lock); \ + } while (0) + +#define write_lock_irq(lock) write_lock(lock) + +#define read_unlock(lock) rt_read_unlock(lock) + +#define read_unlock_bh(lock) \ + do { \ + rt_read_unlock(lock); \ + local_bh_enable(); \ + } while (0) + +#define read_unlock_irq(lock) read_unlock(lock) + +#define write_unlock(lock) rt_write_unlock(lock) + +#define write_unlock_bh(lock) \ + do { \ + rt_write_unlock(lock); \ + local_bh_enable(); \ + } while (0) + +#define write_unlock_irq(lock) write_unlock(lock) + +#define read_unlock_irqrestore(lock, flags) \ + do { \ + typecheck(unsigned long, flags); \ + (void) flags; \ + rt_read_unlock(lock); \ + } while (0) + +#define write_unlock_irqrestore(lock, flags) \ + do { \ + typecheck(unsigned long, flags); \ + (void) flags; \ + rt_write_unlock(lock); \ + } while (0) + +#define rwlock_init(rwl) \ +do { \ + static struct lock_class_key __key; \ + \ + __rt_rwlock_init(rwl, #rwl, &__key); \ +} while (0) + +/* + * Internal functions made global for CPU pinning + */ +void __read_rt_lock(struct rt_rw_lock *lock); +int __read_rt_trylock(struct rt_rw_lock *lock); +void __write_rt_lock(struct rt_rw_lock *lock); +int __write_rt_trylock(struct rt_rw_lock *lock); +void __read_rt_unlock(struct rt_rw_lock *lock); +void __write_rt_unlock(struct rt_rw_lock *lock); + +#endif @ include/linux/rwlock_types.h:4 @ #ifndef __LINUX_RWLOCK_TYPES_H #define __LINUX_RWLOCK_TYPES_H +#if !defined(__LINUX_SPINLOCK_TYPES_H) +# error "Do not include directly, include spinlock_types.h" +#endif + /* * include/linux/rwlock_types.h - generic rwlock type definitions * and initializers @ include/linux/rwlock_types_rt.h:4 @ +#ifndef __LINUX_RWLOCK_TYPES_RT_H +#define __LINUX_RWLOCK_TYPES_RT_H + +#ifndef __LINUX_SPINLOCK_TYPES_H +#error "Do not include directly. Include spinlock_types.h instead" +#endif + +#ifdef CONFIG_DEBUG_LOCK_ALLOC +# define RW_DEP_MAP_INIT(lockname) .dep_map = { .name = #lockname } +#else +# define RW_DEP_MAP_INIT(lockname) +#endif + +typedef struct rt_rw_lock rwlock_t; + +#define __RW_LOCK_UNLOCKED(name) __RWLOCK_RT_INITIALIZER(name) + +#define DEFINE_RWLOCK(name) \ + rwlock_t name = __RW_LOCK_UNLOCKED(name) + +/* + * A reader biased implementation primarily for CPU pinning. + * + * Can be selected as general replacement for the single reader RT rwlock + * variant + */ +struct rt_rw_lock { + struct rt_mutex rtmutex; + atomic_t readers; +#ifdef CONFIG_DEBUG_LOCK_ALLOC + struct lockdep_map dep_map; +#endif +}; + +#define READER_BIAS (1U << 31) +#define WRITER_BIAS (1U << 30) + +#define __RWLOCK_RT_INITIALIZER(name) \ +{ \ + .readers = ATOMIC_INIT(READER_BIAS), \ + .rtmutex = __RT_MUTEX_INITIALIZER_SAVE_STATE(name.rtmutex), \ + RW_DEP_MAP_INIT(name) \ +} + +void __rwlock_biased_rt_init(struct rt_rw_lock *lock, const char *name, + struct lock_class_key *key); + +#define rwlock_biased_rt_init(rwlock) \ + do { \ + static struct lock_class_key __key; \ + \ + __rwlock_biased_rt_init((rwlock), #rwlock, &__key); \ + } while (0) + +#endif @ include/linux/rwsem-rt.h:4 @ +#ifndef _LINUX_RWSEM_RT_H +#define _LINUX_RWSEM_RT_H + +#ifndef _LINUX_RWSEM_H +#error "Include rwsem.h" +#endif + +#include <linux/rtmutex.h> +#include <linux/swait.h> + +#define READER_BIAS (1U << 31) +#define WRITER_BIAS (1U << 30) + +struct rw_semaphore { + atomic_t readers; + struct rt_mutex rtmutex; +#ifdef CONFIG_DEBUG_LOCK_ALLOC + struct lockdep_map dep_map; +#endif +}; + +#define __RWSEM_INITIALIZER(name) \ +{ \ + .readers = ATOMIC_INIT(READER_BIAS), \ + .rtmutex = __RT_MUTEX_INITIALIZER(name.rtmutex), \ + RW_DEP_MAP_INIT(name) \ +} + +#define DECLARE_RWSEM(lockname) \ + struct rw_semaphore lockname = __RWSEM_INITIALIZER(lockname) + +extern void __rwsem_init(struct rw_semaphore *rwsem, const char *name, + struct lock_class_key *key); + +#define __init_rwsem(sem, name, key) \ +do { \ + rt_mutex_init(&(sem)->rtmutex); \ + __rwsem_init((sem), (name), (key)); \ +} while (0) + +#define init_rwsem(sem) \ +do { \ + static struct lock_class_key __key; \ + \ + __init_rwsem((sem), #sem, &__key); \ +} while (0) + +static inline int rwsem_is_locked(struct rw_semaphore *sem) +{ + return atomic_read(&sem->readers) != READER_BIAS; +} + +static inline int rwsem_is_contended(struct rw_semaphore *sem) +{ + return atomic_read(&sem->readers) > 0; +} + +extern void __down_read(struct rw_semaphore *sem); +extern int __down_read_killable(struct rw_semaphore *sem); +extern int __down_read_trylock(struct rw_semaphore *sem); +extern void __down_write(struct rw_semaphore *sem); +extern int __must_check __down_write_killable(struct rw_semaphore *sem); +extern int __down_write_trylock(struct rw_semaphore *sem); +extern void __up_read(struct rw_semaphore *sem); +extern void __up_write(struct rw_semaphore *sem); +extern void __downgrade_write(struct rw_semaphore *sem); + +#endif @ include/linux/rwsem.h:19 @ #include <linux/spinlock.h> #include <linux/atomic.h> #include <linux/err.h> + +#ifdef CONFIG_PREEMPT_RT +#include <linux/rwsem-rt.h> +#else /* PREEMPT_RT */ + #ifdef CONFIG_RWSEM_SPIN_ON_OWNER #include <linux/osq_lock.h> #endif @ include/linux/rwsem.h:61 @ struct rw_semaphore { #endif }; -/* - * Setting all bits of the owner field except bit 0 will indicate - * that the rwsem is writer-owned with an unknown owner. - */ -#define RWSEM_OWNER_UNKNOWN (-2L) - /* In all implementations count != 0 means locked */ static inline int rwsem_is_locked(struct rw_semaphore *sem) { @ include/linux/rwsem.h:123 @ static inline int rwsem_is_contended(struct rw_semaphore *sem) return !list_empty(&sem->wait_list); } +#endif /* !PREEMPT_RT */ + +/* + * The functions below are the same for all rwsem implementations including + * the RT specific variant. + */ + /* * lock for reading */ @ include/linux/sched.h:34 @ #include <linux/task_io_accounting.h> #include <linux/posix-timers.h> #include <linux/rseq.h> +#include <asm/kmap_types.h> /* task_struct member predeclarations (sorted alphabetically): */ struct audit_context; @ include/linux/sched.h:111 @ struct task_group; __TASK_TRACED | EXIT_DEAD | EXIT_ZOMBIE | \ TASK_PARKED) -#define task_is_traced(task) ((task->state & __TASK_TRACED) != 0) - #define task_is_stopped(task) ((task->state & __TASK_STOPPED) != 0) -#define task_is_stopped_or_traced(task) ((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0) - #define task_contributes_to_load(task) ((task->state & TASK_UNINTERRUPTIBLE) != 0 && \ (task->flags & PF_FROZEN) == 0 && \ (task->state & TASK_NOLOAD) == 0) @ include/linux/sched.h:140 @ struct task_group; smp_store_mb(current->state, (state_value)); \ } while (0) +#define __set_current_state_no_track(state_value) \ + current->state = (state_value); + #define set_special_state(state_value) \ do { \ unsigned long flags; /* may shadow */ \ @ include/linux/sched.h:152 @ struct task_group; current->state = (state_value); \ raw_spin_unlock_irqrestore(¤t->pi_lock, flags); \ } while (0) + #else /* * set_current_state() includes a barrier so that the write of current->state @ include/linux/sched.h:197 @ struct task_group; #define set_current_state(state_value) \ smp_store_mb(current->state, (state_value)) +#define __set_current_state_no_track(state_value) \ + __set_current_state(state_value) + /* * set_special_state() should be used for those states when the blocking task * can not use the regular condition based wait-loop. In that case we must @ include/linux/sched.h:237 @ extern void io_schedule_finish(int token); extern long io_schedule_timeout(long timeout); extern void io_schedule(void); +int cpu_nr_pinned(int cpu); + /** * struct prev_cputime - snapshot of system and user cputime * @utime: time spent in user mode @ include/linux/sched.h:645 @ struct task_struct { #endif /* -1 unrunnable, 0 runnable, >0 stopped: */ volatile long state; + /* saved state for "spinlock sleepers" */ + volatile long saved_state; /* * This begins the randomizable portion of task_struct. Only @ include/linux/sched.h:716 @ struct task_struct { int nr_cpus_allowed; const cpumask_t *cpus_ptr; cpumask_t cpus_mask; +#if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT) + int migrate_disable; + bool migrate_disable_scheduled; +# ifdef CONFIG_SCHED_DEBUG + int pinned_on_cpu; +# endif +#elif !defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT) +# ifdef CONFIG_SCHED_DEBUG + int migrate_disable; +# endif +#endif +#ifdef CONFIG_PREEMPT_RT + int sleeping_lock; +#endif #ifdef CONFIG_PREEMPT_RCU int rcu_read_lock_nesting; @ include/linux/sched.h:943 @ struct task_struct { /* Signal handlers: */ struct signal_struct *signal; struct sighand_struct __rcu *sighand; + struct sigqueue *sigqueue_cache; sigset_t blocked; sigset_t real_blocked; /* Restored if set_restore_sigmask() was used: */ sigset_t saved_sigmask; struct sigpending pending; +#ifdef CONFIG_PREEMPT_RT + /* TODO: move me into ->restart_block ? */ + struct kernel_siginfo forced_info; +#endif unsigned long sas_ss_sp; size_t sas_ss_size; unsigned int sas_ss_flags; @ include/linux/sched.h:979 @ struct task_struct { raw_spinlock_t pi_lock; struct wake_q_node wake_q; + struct wake_q_node wake_q_sleeper; #ifdef CONFIG_RT_MUTEXES /* PI waiters blocked on a rt_mutex held by this task: */ @ include/linux/sched.h:1014 @ struct task_struct { int softirqs_enabled; int softirq_context; #endif +#ifdef CONFIG_PREEMPT_RT + int softirq_count; +#endif #ifdef CONFIG_LOCKDEP # define MAX_LOCK_DEPTH 48UL @ include/linux/sched.h:1072 @ struct task_struct { /* Protected by ->alloc_lock: */ nodemask_t mems_allowed; /* Seqence number to catch updates: */ - seqcount_t mems_allowed_seq; + seqcount_spinlock_t mems_allowed_seq; int cpuset_mem_spread_rotor; int cpuset_slab_spread_rotor; #endif @ include/linux/sched.h:1288 @ struct task_struct { unsigned int sequential_io; unsigned int sequential_io_avg; #endif +#ifdef CONFIG_PREEMPT_RT +# if defined CONFIG_HIGHMEM || defined CONFIG_X86_32 + int kmap_idx; + pte_t kmap_pte[KM_TYPE_NR]; +# endif +#endif #ifdef CONFIG_DEBUG_ATOMIC_SLEEP unsigned long task_state_change; #endif @ include/linux/sched.h:1726 @ extern struct task_struct *find_get_task_by_vpid(pid_t nr); extern int wake_up_state(struct task_struct *tsk, unsigned int state); extern int wake_up_process(struct task_struct *tsk); +extern int wake_up_lock_sleeper(struct task_struct *tsk); extern void wake_up_new_task(struct task_struct *tsk); #ifdef CONFIG_SMP @ include/linux/sched.h:1809 @ static inline int test_tsk_need_resched(struct task_struct *tsk) return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED)); } +#ifdef CONFIG_PREEMPT_LAZY +static inline void set_tsk_need_resched_lazy(struct task_struct *tsk) +{ + set_tsk_thread_flag(tsk,TIF_NEED_RESCHED_LAZY); +} + +static inline void clear_tsk_need_resched_lazy(struct task_struct *tsk) +{ + clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED_LAZY); +} + +static inline int test_tsk_need_resched_lazy(struct task_struct *tsk) +{ + return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED_LAZY)); +} + +static inline int need_resched_lazy(void) +{ + return test_thread_flag(TIF_NEED_RESCHED_LAZY); +} + +static inline int need_resched_now(void) +{ + return test_thread_flag(TIF_NEED_RESCHED); +} + +#else +static inline void clear_tsk_need_resched_lazy(struct task_struct *tsk) { } +static inline int need_resched_lazy(void) { return 0; } + +static inline int need_resched_now(void) +{ + return test_thread_flag(TIF_NEED_RESCHED); +} + +#endif + + +static inline bool __task_is_stopped_or_traced(struct task_struct *task) +{ + if (task->state & (__TASK_STOPPED | __TASK_TRACED)) + return true; +#ifdef CONFIG_PREEMPT_RT + if (task->saved_state & (__TASK_STOPPED | __TASK_TRACED)) + return true; +#endif + return false; +} + +static inline bool task_is_stopped_or_traced(struct task_struct *task) +{ + bool traced_stopped; + +#ifdef CONFIG_PREEMPT_RT + unsigned long flags; + + raw_spin_lock_irqsave(&task->pi_lock, flags); + traced_stopped = __task_is_stopped_or_traced(task); + raw_spin_unlock_irqrestore(&task->pi_lock, flags); +#else + traced_stopped = __task_is_stopped_or_traced(task); +#endif + return traced_stopped; +} + +static inline bool task_is_traced(struct task_struct *task) +{ + bool traced = false; + + if (task->state & __TASK_TRACED) + return true; +#ifdef CONFIG_PREEMPT_RT + /* in case the task is sleeping on tasklist_lock */ + raw_spin_lock_irq(&task->pi_lock); + if (task->state & __TASK_TRACED) + traced = true; + else if (task->saved_state & __TASK_TRACED) + traced = true; + raw_spin_unlock_irq(&task->pi_lock); +#endif + return traced; +} + /* * cond_resched() and cond_resched_lock(): latency reduction via * explicit rescheduling in places that are safe. The return @ include/linux/sched.h:1944 @ static __always_inline bool need_resched(void) return unlikely(tif_need_resched()); } +#ifdef CONFIG_PREEMPT_RT +static inline void sleeping_lock_inc(void) +{ + current->sleeping_lock++; +} + +static inline void sleeping_lock_dec(void) +{ + current->sleeping_lock--; +} + +#else + +static inline void sleeping_lock_inc(void) { } +static inline void sleeping_lock_dec(void) { } +#endif + /* * Wrappers for p->thread_info->cpu access. No-op on UP. */ @ include/linux/sched.h:2152 @ int sched_trace_rq_cpu(struct rq *rq); const struct cpumask *sched_trace_rd_span(struct root_domain *rd); +extern struct task_struct *takedown_cpu_task; + #endif @ include/linux/sched/mm.h:52 @ static inline void mmdrop(struct mm_struct *mm) __mmdrop(mm); } +#ifdef CONFIG_PREEMPT_RT +extern void __mmdrop_delayed(struct rcu_head *rhp); +static inline void mmdrop_delayed(struct mm_struct *mm) +{ + if (atomic_dec_and_test(&mm->mm_count)) + call_rcu(&mm->delayed_drop, __mmdrop_delayed); +} +#else +# define mmdrop_delayed(mm) mmdrop(mm) +#endif + /* * This has to be called after a get_task_mm()/mmget_not_zero() * followed by taking the mmap_sem for writing before modifying the @ include/linux/sched/wake_q.h:61 @ static inline bool wake_q_empty(struct wake_q_head *head) extern void wake_q_add(struct wake_q_head *head, struct task_struct *task); extern void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task); -extern void wake_up_q(struct wake_q_head *head); +extern void wake_q_add_sleeper(struct wake_q_head *head, struct task_struct *task); +extern void __wake_up_q(struct wake_q_head *head, bool sleeper); + +static inline void wake_up_q(struct wake_q_head *head) +{ + __wake_up_q(head, false); +} + +static inline void wake_up_q_sleeper(struct wake_q_head *head) +{ + __wake_up_q(head, true); +} #endif /* _LINUX_SCHED_WAKE_Q_H */ @ include/linux/seqlock.h:4 @ /* SPDX-License-Identifier: GPL-2.0 */ #ifndef __LINUX_SEQLOCK_H #define __LINUX_SEQLOCK_H + /* - * Reader/writer consistent mechanism without starving writers. This type of - * lock for data where the reader wants a consistent set of information - * and is willing to retry if the information changes. There are two types - * of readers: - * 1. Sequence readers which never block a writer but they may have to retry - * if a writer is in progress by detecting change in sequence number. - * Writers do not wait for a sequence reader. - * 2. Locking readers which will wait if a writer or another locking reader - * is in progress. A locking reader in progress will also block a writer - * from going forward. Unlike the regular rwlock, the read lock here is - * exclusive so that only one locking reader can get it. + * seqcount_t / seqlock_t - a reader-writer consistency mechanism with + * lockless readers (read-only retry loops), and no writer starvation. * - * This is not as cache friendly as brlock. Also, this may not work well - * for data that contains pointers, because any writer could - * invalidate a pointer that a reader was following. + * See Documentation/locking/seqlock.rst for full description. * - * Expected non-blocking reader usage: - * do { - * seq = read_seqbegin(&foo); - * ... - * } while (read_seqretry(&foo, seq)); - * - * - * On non-SMP the spin locks disappear but the writer still needs - * to increment the sequence variables because an interrupt routine could - * change the state of the data. - * - * Based on x86_64 vsyscall gettimeofday - * by Keith Owens and Andrea Arcangeli + * Copyrights: + * - Based on x86_64 vsyscall gettimeofday: Keith Owens, Andrea Arcangeli */ #include <linux/spinlock.h> @ include/linux/seqlock.h:22 @ #include <asm/processor.h> /* - * Version using sequence counter only. - * This can be used when code has its own mutex protecting the - * updating starting before the write_seqcountbeqin() and ending - * after the write_seqcount_end(). + * Sequence counters (seqcount_t) + * + * This is the raw counting mechanism, without any writer protection. + * + * Write side critical sections must be serialized and non-preemptible. + * + * If readers can be invoked from hardirq or softirq contexts, + * interrupts or bottom halves must also be respectively disabled before + * entering the write section. + * + * This mechanism can't be used if the protected data contains pointers, + * as the writer can invalidate a pointer that a reader is following. + * + * If the write serialization mechanism is one of the common kernel + * locking primitives, use a sequence counter with associated lock + * (seqcount_LOCKTYPE_t) instead. + * + * If it's desired to automatically handle the sequence counter writer + * serialization and non-preemptibility requirements, use a sequential + * lock (seqlock_t) instead. + * + * See Documentation/locking/seqlock.rst */ typedef struct seqcount { unsigned sequence; @ include/linux/seqlock.h:66 @ static inline void __seqcount_init(seqcount_t *s, const char *name, # define SEQCOUNT_DEP_MAP_INIT(lockname) \ .dep_map = { .name = #lockname } \ +/** + * seqcount_init() - runtime initializer for seqcount_t + * @s: Pointer to the &typedef seqcount_t instance + */ # define seqcount_init(s) \ do { \ static struct lock_class_key __key; \ @ include/linux/seqlock.h:93 @ static inline void seqcount_lockdep_reader_access(const seqcount_t *s) # define seqcount_lockdep_reader_access(x) #endif -#define SEQCNT_ZERO(lockname) { .sequence = 0, SEQCOUNT_DEP_MAP_INIT(lockname)} - +/** + * SEQCNT_ZERO() - static initializer for seqcount_t + * @name: Name of the &typedef seqcount_t instance + */ +#define SEQCNT_ZERO(name) { .sequence = 0, SEQCOUNT_DEP_MAP_INIT(name) } /** - * __read_seqcount_begin - begin a seq-read critical section (without barrier) - * @s: pointer to seqcount_t + * __read_seqcount_begin() - begin a seq-read critical section (without barrier) + * @s: Pointer to &typedef seqcount_t or any of the seqcount_locktype_t variants * Returns: count to be passed to read_seqcount_retry * * __read_seqcount_begin is like read_seqcount_begin, but has no smp_rmb() @ include/linux/seqlock.h:112 @ static inline void seqcount_lockdep_reader_access(const seqcount_t *s) * Use carefully, only in critical code, and comment how the barrier is * provided. */ -static inline unsigned __read_seqcount_begin(const seqcount_t *s) +#define __read_seqcount_begin(s) do___read_seqcount_begin(s) + +static inline unsigned __read_seqcount_t_begin(const seqcount_t *s) { unsigned ret; @ include/linux/seqlock.h:128 @ static inline unsigned __read_seqcount_begin(const seqcount_t *s) } /** - * raw_read_seqcount - Read the raw seqcount - * @s: pointer to seqcount_t + * raw_read_seqcount() - Read the raw seqcount + * @s: Pointer to &typedef seqcount_t or any of the seqcount_locktype_t variants * Returns: count to be passed to read_seqcount_retry * * raw_read_seqcount opens a read critical section of the given - * seqcount without any lockdep checking and without checking or - * masking the LSB. Calling code is responsible for handling that. + * seqcount_t, without any lockdep checks and without checking or + * masking the sequence counter LSB. Calling code is responsible for + * handling that. */ -static inline unsigned raw_read_seqcount(const seqcount_t *s) +#define raw_read_seqcount(s) do_raw_read_seqcount(s) + +static inline unsigned raw_read_seqcount_t(const seqcount_t *s) { unsigned ret = READ_ONCE(s->sequence); smp_rmb(); @ include/linux/seqlock.h:147 @ static inline unsigned raw_read_seqcount(const seqcount_t *s) } /** - * raw_read_seqcount_begin - start seq-read critical section w/o lockdep - * @s: pointer to seqcount_t + * raw_read_seqcount_begin() - start seq-read critical section w/o lockdep + * @s: Pointer to &typedef seqcount_t or any of the seqcount_locktype_t variants * Returns: count to be passed to read_seqcount_retry * * raw_read_seqcount_begin opens a read critical section of the given - * seqcount, but without any lockdep checking. Validity of the critical - * section is tested by checking read_seqcount_retry function. + * seqcount_t, but without any lockdep checking. Validity of the read + * section must be checked with read_seqcount_retry(). */ -static inline unsigned raw_read_seqcount_begin(const seqcount_t *s) +#define raw_read_seqcount_begin(s) do_raw_read_seqcount_begin(s) + +static inline unsigned raw_read_seqcount_t_begin(const seqcount_t *s) { - unsigned ret = __read_seqcount_begin(s); + unsigned ret = __read_seqcount_t_begin(s); smp_rmb(); return ret; } /** - * read_seqcount_begin - begin a seq-read critical section - * @s: pointer to seqcount_t + * read_seqcount_begin() - begin a seq-read critical section + * @s: pointer to &typedef seqcount_t or any of the seqcount_locktype_t variants * Returns: count to be passed to read_seqcount_retry * - * read_seqcount_begin opens a read critical section of the given seqcount. - * Validity of the critical section is tested by checking read_seqcount_retry - * function. + * read_seqcount_begin opens a read critical section of the given + * seqcount_t. Validity of the read section must be checked with + * read_seqcount_retry(). */ -static inline unsigned read_seqcount_begin(const seqcount_t *s) +#define read_seqcount_begin(s) do_read_seqcount_begin(s) + +static inline unsigned read_seqcount_t_begin(const seqcount_t *s) { seqcount_lockdep_reader_access(s); - return raw_read_seqcount_begin(s); + return raw_read_seqcount_t_begin(s); } /** - * raw_seqcount_begin - begin a seq-read critical section - * @s: pointer to seqcount_t + * raw_seqcount_begin() - begin a seq-read critical section + * @s: pointer to &typedef seqcount_t or any of the seqcount_locktype_t variants * Returns: count to be passed to read_seqcount_retry * - * raw_seqcount_begin opens a read critical section of the given seqcount. + * raw_seqcount_begin opens a read critical section of the given seqcount_t. * Validity of the critical section is tested by checking read_seqcount_retry * function. * @ include/linux/seqlock.h:195 @ static inline unsigned read_seqcount_begin(const seqcount_t *s) * read_seqcount_retry() instead of stabilizing at the beginning of the * critical section. */ -static inline unsigned raw_seqcount_begin(const seqcount_t *s) +#define raw_seqcount_begin(s) do_raw_seqcount_begin(s) + +static inline unsigned raw_seqcount_t_begin(const seqcount_t *s) { unsigned ret = READ_ONCE(s->sequence); smp_rmb(); @ include/linux/seqlock.h:205 @ static inline unsigned raw_seqcount_begin(const seqcount_t *s) } /** - * __read_seqcount_retry - end a seq-read critical section (without barrier) - * @s: pointer to seqcount_t + * __read_seqcount_retry() - end a seq-read critical section (without barrier) + * @s: pointer to &typedef seqcount_t or any of the seqcount_locktype_t variants * @start: count, from read_seqcount_begin * Returns: 1 if retry is required, else 0 * @ include/linux/seqlock.h:218 @ static inline unsigned raw_seqcount_begin(const seqcount_t *s) * Use carefully, only in critical code, and comment how the barrier is * provided. */ -static inline int __read_seqcount_retry(const seqcount_t *s, unsigned start) +#define __read_seqcount_retry(s, start) do___read_seqcount_retry(s, start) + +static inline int __read_seqcount_t_retry(const seqcount_t *s, unsigned start) { return unlikely(s->sequence != start); } /** - * read_seqcount_retry - end a seq-read critical section - * @s: pointer to seqcount_t + * read_seqcount_retry() - end a seq-read critical section + * @s: pointer to &typedef seqcount_t or any of the seqcount_locktype_t variants * @start: count, from read_seqcount_begin * Returns: 1 if retry is required, else 0 * - * read_seqcount_retry closes a read critical section of the given seqcount. + * read_seqcount_retry closes a read critical section of given seqcount_t. * If the critical section was invalid, it must be ignored (and typically * retried). */ -static inline int read_seqcount_retry(const seqcount_t *s, unsigned start) +#define read_seqcount_retry(s, start) do_read_seqcount_retry(s, start) + +static inline int read_seqcount_t_retry(const seqcount_t *s, unsigned start) { smp_rmb(); - return __read_seqcount_retry(s, start); + return __read_seqcount_t_retry(s, start); } +#define raw_write_seqcount_begin(s) do_raw_write_seqcount_begin(s) - -static inline void raw_write_seqcount_begin(seqcount_t *s) +static inline void raw_write_seqcount_t_begin(seqcount_t *s) { s->sequence++; smp_wmb(); } -static inline void raw_write_seqcount_end(seqcount_t *s) +#define raw_write_seqcount_end(s) do_raw_write_seqcount_end(s) + +static inline void raw_write_seqcount_t_end(seqcount_t *s) { smp_wmb(); s->sequence++; } /** - * raw_write_seqcount_barrier - do a seq write barrier - * @s: pointer to seqcount_t + * raw_write_seqcount_barrier() - do a seq write barrier + * @s: Pointer to &typedef seqcount_t or any of the seqcount_locktype_t variants * * This can be used to provide an ordering guarantee instead of the * usual consistency guarantee. It is one wmb cheaper, because we can - * collapse the two back-to-back wmb()s. + * collapse the two back-to-back wmb()s:: * * seqcount_t seq; * bool X = true, Y = false; @ include/linux/seqlock.h:293 @ static inline void raw_write_seqcount_end(seqcount_t *s) * X = false; * } */ -static inline void raw_write_seqcount_barrier(seqcount_t *s) +#define raw_write_seqcount_barrier(s) do_raw_write_seqcount_barrier(s) + +static inline void raw_write_seqcount_t_barrier(seqcount_t *s) { s->sequence++; smp_wmb(); s->sequence++; } -static inline int raw_read_seqcount_latch(seqcount_t *s) +/** + * raw_read_seqcount_latch() - pick even or odd seqcount latch data copy + * @s: pointer to &typedef seqcount_t or any of the seqcount_locktype_t variants + * + * Use seqcount latching to switch between two storage places with + * sequence protection to allow interruptible, preemptible, writer + * sections. + * + * Check raw_write_seqcount_latch() for more details and a full reader + * and writer usage example. + * + * Return: sequence counter. Use the lowest bit as index for picking + * which data copy to read. Full counter must then be checked with + * read_seqcount_retry(). + */ +#define raw_read_seqcount_latch(s) do_raw_read_seqcount_latch(s) + +static inline int raw_read_seqcount_t_latch(seqcount_t *s) { /* Pairs with the first smp_wmb() in raw_write_seqcount_latch() */ int seq = READ_ONCE(s->sequence); /* ^^^ */ @ include/linux/seqlock.h:327 @ static inline int raw_read_seqcount_latch(seqcount_t *s) } /** - * raw_write_seqcount_latch - redirect readers to even/odd copy - * @s: pointer to seqcount_t + * raw_write_seqcount_latch() - redirect readers to even/odd copy + * @s: pointer to &typedef seqcount_t or any of the seqcount_locktype_t variants * * The latch technique is a multiversion concurrency control method that allows * queries during non-atomic modifications. If you can guarantee queries never @ include/linux/seqlock.h:344 @ static inline int raw_read_seqcount_latch(seqcount_t *s) * Very simply put: we first modify one copy and then the other. This ensures * there is always one copy in a stable state, ready to give us an answer. * - * The basic form is a data structure like: + * The basic form is a data structure like:: * - * struct latch_struct { - * seqcount_t seq; - * struct data_struct data[2]; - * }; + * struct latch_struct { + * seqcount_t seq; + * struct data_struct data[2]; + * }; * * Where a modification, which is assumed to be externally serialized, does the - * following: + * following:: * - * void latch_modify(struct latch_struct *latch, ...) - * { - * smp_wmb(); <- Ensure that the last data[1] update is visible - * latch->seq++; - * smp_wmb(); <- Ensure that the seqcount update is visible + * void latch_modify(struct latch_struct *latch, ...) + * { + * smp_wmb(); // Ensure that the last data[1] update is visible + * latch->seq++; + * smp_wmb(); // Ensure that the seqcount update is visible * - * modify(latch->data[0], ...); + * modify(latch->data[0], ...); * - * smp_wmb(); <- Ensure that the data[0] update is visible - * latch->seq++; - * smp_wmb(); <- Ensure that the seqcount update is visible + * smp_wmb(); // Ensure that the data[0] update is visible + * latch->seq++; + * smp_wmb(); // Ensure that the seqcount update is visible * - * modify(latch->data[1], ...); - * } + * modify(latch->data[1], ...); + * } * - * The query will have a form like: + * The query will have a form like:: * - * struct entry *latch_query(struct latch_struct *latch, ...) - * { - * struct entry *entry; - * unsigned seq, idx; + * struct entry *latch_query(struct latch_struct *latch, ...) + * { + * struct entry *entry; + * unsigned seq, idx; * - * do { - * seq = raw_read_seqcount_latch(&latch->seq); + * do { + * seq = raw_read_seqcount_latch(&latch->seq); * - * idx = seq & 0x01; - * entry = data_query(latch->data[idx], ...); + * idx = seq & 0x01; + * entry = data_query(latch->data[idx], ...); * - * smp_rmb(); - * } while (seq != latch->seq); + * // read_seqcount_retry() includes necessary smp_rmb() + * } while (read_seqcount_retry(&latch->seq, seq); * - * return entry; - * } + * return entry; + * } * * So during the modification, queries are first redirected to data[1]. Then we * modify data[0]. When that is complete, we redirect queries back to data[0] * and we can modify data[1]. * - * NOTE: The non-requirement for atomic modifications does _NOT_ include - * the publishing of new entries in the case where data is a dynamic - * data structure. + * NOTE: * - * An iteration might start in data[0] and get suspended long enough - * to miss an entire modification sequence, once it resumes it might - * observe the new entry. + * The non-requirement for atomic modifications does _NOT_ include + * the publishing of new entries in the case where data is a dynamic + * data structure. * - * NOTE: When data is a dynamic data structure; one should use regular RCU - * patterns to manage the lifetimes of the objects within. + * An iteration might start in data[0] and get suspended long enough + * to miss an entire modification sequence, once it resumes it might + * observe the new entry. + * + * NOTE: + * + * When data is a dynamic data structure; one should use regular RCU + * patterns to manage the lifetimes of the objects within. */ -static inline void raw_write_seqcount_latch(seqcount_t *s) +#define raw_write_seqcount_latch(s) do_raw_write_seqcount_latch(s) + +static inline void raw_write_seqcount_t_latch(seqcount_t *s) { smp_wmb(); /* prior stores before incrementing "sequence" */ s->sequence++; smp_wmb(); /* increment "sequence" before following stores */ } -/* - * Sequence counter only version assumes that callers are using their - * own mutexing. - */ -static inline void write_seqcount_begin_nested(seqcount_t *s, int subclass) +#define write_seqcount_begin_nested(s, subclass) \ + do_write_seqcount_begin_nested(s, subclass) + +static inline void write_seqcount_t_begin_nested(seqcount_t *s, int subclass) { - raw_write_seqcount_begin(s); + raw_write_seqcount_t_begin(s); seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_); } -static inline void write_seqcount_begin(seqcount_t *s) -{ - write_seqcount_begin_nested(s, 0); -} +/** + * write_seqcount_begin() - start a seqcount write-side critical section + * @s: Pointer to &typedef seqcount_t + * + * write_seqcount_begin opens a write-side critical section of the given + * seqcount. Seqcount write-side critical sections must be externally + * serialized and non-preemptible. + */ +#define write_seqcount_begin(s) do_write_seqcount_begin(s) -static inline void write_seqcount_end(seqcount_t *s) +static inline void write_seqcount_t_begin(seqcount_t *s) { - seqcount_release(&s->dep_map, _RET_IP_); - raw_write_seqcount_end(s); + write_seqcount_t_begin_nested(s, 0); } /** - * write_seqcount_invalidate - invalidate in-progress read-side seq operations - * @s: pointer to seqcount_t + * write_seqcount_end() - end a seqcount write-side critical section + * @s: Pointer to &typedef seqcount_t + * + * The write section must've been opened with write_seqcount_begin(). + */ +#define write_seqcount_end(s) do_write_seqcount_end(s) + +static inline void write_seqcount_t_end(seqcount_t *s) +{ + seqcount_release(&s->dep_map, _RET_IP_); + raw_write_seqcount_t_end(s); +} + +/** + * write_seqcount_invalidate() - invalidate in-progress read-side seq operations + * @s: Pointer to &typedef seqcount_t or any of the seqcount_locktype_t variants * * After write_seqcount_invalidate, no read-side seq operations will complete * successfully and see data older than this. */ -static inline void write_seqcount_invalidate(seqcount_t *s) +#define write_seqcount_invalidate(s) do_write_seqcount_invalidate(s) + +static inline void write_seqcount_t_invalidate(seqcount_t *s) { smp_wmb(); s->sequence+=2; } +/* + * Sequence counters with associated locks (seqcount_LOCKTYPE_t) + * + * A sequence counter which associates the lock used for writer + * serialization at initialization time. This enables lockdep to validate + * that the write side critical section is properly serialized. + * + * For associated locks which do not implicitly disable preemption, + * preemption protection is enforced in the write side function. + * + * See Documentation/locking/seqlock.rst + */ + +#if defined(CONFIG_LOCKDEP) || defined(CONFIG_PREEMPT_RT) +#define SEQCOUNT_ASSOC_LOCK +#endif + +/** + * typedef seqcount_spinlock_t - sequence count with spinlock associated + * @seqcount: The real sequence counter + * @lock: Pointer to the associated spinlock + * + * A plain sequence counter with external writer synchronization by a + * spinlock. The spinlock is associated to the sequence count in the + * static initializer or init function. This enables lockdep to validate + * that the write side critical section is properly serialized. + */ +typedef struct seqcount_spinlock { + seqcount_t seqcount; +#ifdef SEQCOUNT_ASSOC_LOCK + spinlock_t *lock; +#endif +} seqcount_spinlock_t; + +#ifdef SEQCOUNT_ASSOC_LOCK + +#define SEQCOUNT_LOCKTYPE_ZERO(seq_name, assoc_lock) { \ + .seqcount = SEQCNT_ZERO(seq_name.seqcount), \ + .lock = (assoc_lock), \ +} + +/* Define as macro due to static lockdep key @ seqcount_init() */ +#define seqcount_locktype_init(s, assoc_lock) \ +do { \ + seqcount_init(&(s)->seqcount); \ + (s)->lock = (assoc_lock); \ +} while (0) + +#else /* !SEQCOUNT_ASSOC_LOCK */ + +#define SEQCOUNT_LOCKTYPE_ZERO(seq_name, assoc_lock) { \ + .seqcount = SEQCNT_ZERO(seq_name.seqcount), \ +} + +#define seqcount_locktype_init(s, assoc_lock) \ +do { \ + seqcount_init(&(s)->seqcount); \ +} while (0) + +#endif /* SEQCOUNT_ASSOC_LOCK */ + +/** + * SEQCNT_SPINLOCK_ZERO - static initializer for seqcount_spinlock_t + * @name: Name of the &typedef seqcount_spinlock_t instance + * @lock: Pointer to the associated spinlock + */ +#define SEQCNT_SPINLOCK_ZERO(name, lock) \ + SEQCOUNT_LOCKTYPE_ZERO(name, lock) + +/** + * seqcount_spinlock_init - runtime initializer for seqcount_spinlock_t + * @s: Pointer to the &typedef seqcount_spinlock_t instance + * @lock: Pointer to the associated spinlock + */ +#define seqcount_spinlock_init(s, lock) \ + seqcount_locktype_init(s, lock) + +/** + * typedef seqcount_raw_spinlock_t - sequence count with raw spinlock associated + * @seqcount: The real sequence counter + * @lock: Pointer to the associated raw spinlock + * + * A plain sequence counter with external writer synchronization by a + * raw spinlock. The raw spinlock is associated to the sequence count in + * the static initializer or init function. This enables lockdep to + * validate that the write side critical section is properly serialized. + */ +typedef struct seqcount_raw_spinlock { + seqcount_t seqcount; +#ifdef SEQCOUNT_ASSOC_LOCK + raw_spinlock_t *lock; +#endif +} seqcount_raw_spinlock_t; + +/** + * SEQCNT_RAW_SPINLOCK_ZERO - static initializer for seqcount_raw_spinlock_t + * @name: Name of the &typedef seqcount_raw_spinlock_t instance + * @lock: Pointer to the associated raw_spinlock + */ +#define SEQCNT_RAW_SPINLOCK_ZERO(name, lock) \ + SEQCOUNT_LOCKTYPE_ZERO(name, lock) + +/** + * seqcount_raw_spinlock_init - runtime initializer for seqcount_raw_spinlock_t + * @s: Pointer to the &typedef seqcount_raw_spinlock_t instance + * @lock: Pointer to the associated raw_spinlock + */ +#define seqcount_raw_spinlock_init(s, lock) \ + seqcount_locktype_init(s, lock) + +/** + * typedef seqcount_rwlock_t - sequence count with rwlock associated + * @seqcount: The real sequence counter + * @lock: Pointer to the associated rwlock + * + * A plain sequence counter with external writer synchronization by a + * rwlock. The rwlock is associated to the sequence count in the static + * initializer or init function. This enables lockdep to validate that + * the write side critical section is properly serialized. + */ +typedef struct seqcount_rwlock { + seqcount_t seqcount; +#ifdef SEQCOUNT_ASSOC_LOCK + rwlock_t *lock; +#endif +} seqcount_rwlock_t; + +/** + * SEQCNT_RWLOCK_ZERO - static initializer for seqcount_rwlock_t + * @name: Name of the &typedef seqcount_rwlock_t instance + * @lock: Pointer to the associated rwlock + */ +#define SEQCNT_RWLOCK_ZERO(name, lock) \ + SEQCOUNT_LOCKTYPE_ZERO(name, lock) + +/** + * seqcount_rwlock_init - runtime initializer for seqcount_rwlock_t + * @s: Pointer to the &typedef seqcount_rwlock_t instance + * @lock: Pointer to the associated rwlock + */ +#define seqcount_rwlock_init(s, lock) \ + seqcount_locktype_init(s, lock) + +/** + * typedef seqcount_mutex_t - sequence count with mutex associated + * @seqcount: The real sequence counter + * @lock: Pointer to the associated mutex + * + * A plain sequence counter with external writer synchronization by a + * mutex. The mutex is associated to the sequence counter in the static + * initializer or init function. This enables lockdep to validate that + * the write side critical section is properly serialized. + * + * The write side API functions write_seqcount_begin()/end() automatically + * disable and enable preemption when used with seqcount_mutex_t. + */ +typedef struct seqcount_mutex { + seqcount_t seqcount; +#ifdef SEQCOUNT_ASSOC_LOCK + struct mutex *lock; +#endif +} seqcount_mutex_t; + +/** + * SEQCNT_MUTEX_ZERO - static initializer for seqcount_mutex_t + * @name: Name of the &typedef seqcount_mutex_t instance + * @lock: Pointer to the associated mutex + */ +#define SEQCNT_MUTEX_ZERO(name, lock) \ + SEQCOUNT_LOCKTYPE_ZERO(name, lock) + +/** + * seqcount_mutex_init - runtime initializer for seqcount_mutex_t + * @s: Pointer to the &typedef seqcount_mutex_t instance + * @lock: Pointer to the associated mutex + */ +#define seqcount_mutex_init(s, lock) \ + seqcount_locktype_init(s, lock) + +/** + * typedef seqcount_ww_mutex_t - sequence count with ww_mutex associated + * @seqcount: The real sequence counter + * @lock: Pointer to the associated ww_mutex + * + * A plain sequence counter with external writer synchronization by a + * ww_mutex. The ww_mutex is associated to the sequence counter in the static + * initializer or init function. This enables lockdep to validate that + * the write side critical section is properly serialized. + * + * The write side API functions write_seqcount_begin()/end() automatically + * disable and enable preemption when used with seqcount_ww_mutex_t. + */ +typedef struct seqcount_ww_mutex { + seqcount_t seqcount; +#ifdef SEQCOUNT_ASSOC_LOCK + struct ww_mutex *lock; +#endif +} seqcount_ww_mutex_t; + +/** + * SEQCNT_WW_MUTEX_ZERO - static initializer for seqcount_ww_mutex_t + * @name: Name of the &typedef seqcount_ww_mutex_t instance + * @lock: Pointer to the associated ww_mutex + */ +#define SEQCNT_WW_MUTEX_ZERO(name, lock) \ + SEQCOUNT_LOCKTYPE_ZERO(name, lock) + +/** + * seqcount_ww_mutex_init - runtime initializer for seqcount_ww_mutex_t + * @s: Pointer to the &typedef seqcount_ww_mutex_t instance + * @lock: Pointer to the associated ww_mutex + */ +#define seqcount_ww_mutex_init(s, lock) \ + seqcount_locktype_init(s, lock) + +#include <linux/seqlock_types_internal.h> + +/* + * Sequential locks (seqlock_t) + * + * Sequence counters with an embedded spinlock for writer serialization + * and non-preemptibility. + * + * For more info, see: + * - Comments on top of seqcount_t + * - Documentation/locking/seqlock.rst + */ typedef struct { struct seqcount seqcount; spinlock_t lock; } seqlock_t; -/* - * These macros triggered gcc-3.x compile-time problems. We think these are - * OK now. Be cautious. - */ #define __SEQLOCK_UNLOCKED(lockname) \ { \ .seqcount = SEQCNT_ZERO(lockname), \ .lock = __SPIN_LOCK_UNLOCKED(lockname) \ } -#define seqlock_init(x) \ +/** + * seqlock_init() - dynamic initializer for seqlock_t + * @sl: Pointer to the &typedef seqlock_t instance + */ +#define seqlock_init(sl) \ do { \ - seqcount_init(&(x)->seqcount); \ - spin_lock_init(&(x)->lock); \ + seqcount_init(&(sl)->seqcount); \ + spin_lock_init(&(sl)->lock); \ } while (0) -#define DEFINE_SEQLOCK(x) \ - seqlock_t x = __SEQLOCK_UNLOCKED(x) +/** + * DEFINE_SEQLOCK() - Define a statically-allocated seqlock_t + * @sl: Name of the &typedef seqlock_t instance + */ +#define DEFINE_SEQLOCK(sl) \ + seqlock_t sl = __SEQLOCK_UNLOCKED(sl) + +/** + * read_seqbegin() - start a seqlock_t read-side critical section + * @sl: Pointer to &typedef seqlock_t + * + * read_seqbegin opens a read side critical section of the given + * seqlock_t. Validity of the critical section is tested by checking + * read_seqretry(). + * + * Return: count to be passed to read_seqretry() + */ /* - * Read side functions for starting and finalizing a read side section. + * For PREEMPT_RT, preemption cannot be disabled upon entering the write + * side critical section. With disabled preemption: + * + * - The writer cannot be preempted by a task with higher priority + * + * - The writer cannot acquire a spinlock_t since it's a sleeping + * lock. This would invalidate the existing, and non-PREEMPT_RT + * valid, code pattern of acquiring a spinlock_t inside the seqcount + * write side critical section. + * + * To remain preemptible, while avoiding a livelock caused by the reader + * preempting the writer, use a different technique: + * + * - If the sequence counter is even upon entering a read side + * section, then no writer is in progress, and the reader did not + * preempt any write side sections. It can continue. + * + * - If the counter is odd, a writer is in progress and the reader may + * have preempted a write side section. Let the reader acquire the + * lock used for seqcount writer serialization, which is already + * held by the writer. + * + * The higher-priority reader will block on the lock, and the + * lower-priority preempted writer will make progress until it + * finishes its write serialization lock critical section. + * + * Once the reader has the writer serialization lock acquired, the + * writer is finished and the counter is even. Drop the writer + * serialization lock and re-read the sequence counter. + * + * This technique must be implemented for all PREEMPT_RT sleeping locks. */ +#ifdef CONFIG_PREEMPT_RT + static inline unsigned read_seqbegin(const seqlock_t *sl) { - return read_seqcount_begin(&sl->seqcount); + unsigned seq; + + seqcount_lockdep_reader_access(&sl->seqcount); + + do { + seq = READ_ONCE(sl->seqcount.sequence); + if (unlikely(seq & 1)) { + seqlock_t *msl = (seqlock_t *)sl; + spin_lock(&msl->lock); + spin_unlock(&msl->lock); + } + } while (unlikely(seq & 1)); + + smp_rmb(); + return seq; } +#else /* !CONFIG_PREEMPT_RT */ + +static inline unsigned read_seqbegin(const seqlock_t *sl) +{ + return read_seqcount_t_begin(&sl->seqcount); +} + +#endif + +/** + * read_seqretry() - end and validate a seqlock_t read side section + * @sl: Pointer to &typedef seqlock_t + * @start: count, from read_seqbegin() + * + * read_seqretry closes the given seqlock_t read side critical section, + * and checks its validity. If the read section was invalid, it must be + * ignored and retried. + * + * Return: 1 if a retry is required, 0 otherwise + */ static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start) { - return read_seqcount_retry(&sl->seqcount, start); + return read_seqcount_t_retry(&sl->seqcount, start); } -/* - * Lock out other writers and update the count. - * Acts like a normal spin_lock/unlock. - * Don't need preempt_disable() because that is in the spin_lock already. +/** + * write_seqlock() - start a seqlock_t write side critical section + * @sl: Pointer to &typedef seqlock_t + * + * write_seqlock opens a write side critical section of the given + * seqlock_t. It also acquires the spinlock_t embedded inside the + * sequential lock. All the seqlock_t write side critical sections are + * thus automatically serialized and non-preemptible. + * + * Use the ``_irqsave`` and ``_bh`` variants instead if the read side + * can be invoked from a hardirq or softirq context. + * + * The opened write side section must be closed with write_sequnlock(). */ static inline void write_seqlock(seqlock_t *sl) { spin_lock(&sl->lock); - write_seqcount_begin(&sl->seqcount); + write_seqcount_t_begin(&sl->seqcount); } +/** + * write_sequnlock() - end a seqlock_t write side critical section + * @sl: Pointer to &typedef seqlock_t + * + * write_sequnlock closes the (serialized and non-preemptible) write + * side critical section of given seqlock_t. + */ static inline void write_sequnlock(seqlock_t *sl) { - write_seqcount_end(&sl->seqcount); + write_seqcount_t_end(&sl->seqcount); spin_unlock(&sl->lock); } +/** + * write_seqlock_bh() - start a softirqs-disabled seqlock_t write section + * @sl: Pointer to &typedef seqlock_t + * + * ``_bh`` variant of write_seqlock(). Use only if the read side section + * can be invoked from a softirq context. + * + * The opened write section must be closed with write_sequnlock_bh(). + */ static inline void write_seqlock_bh(seqlock_t *sl) { spin_lock_bh(&sl->lock); - write_seqcount_begin(&sl->seqcount); + write_seqcount_t_begin(&sl->seqcount); } +/** + * write_sequnlock_bh() - end a softirqs-disabled seqlock_t write section + * @sl: Pointer to &typedef seqlock_t + * + * write_sequnlock_bh closes the serialized, non-preemptible, + * softirqs-disabled, seqlock_t write side critical section opened with + * write_seqlock_bh(). + */ static inline void write_sequnlock_bh(seqlock_t *sl) { - write_seqcount_end(&sl->seqcount); + write_seqcount_t_end(&sl->seqcount); spin_unlock_bh(&sl->lock); } +/** + * write_seqlock_irq() - start a non-interruptible seqlock_t write side section + * @sl: Pointer to &typedef seqlock_t + * + * This is the ``_irq`` variant of write_seqlock(). Use only if the read + * section of given seqlock_t can be invoked from a hardirq context. + */ static inline void write_seqlock_irq(seqlock_t *sl) { spin_lock_irq(&sl->lock); - write_seqcount_begin(&sl->seqcount); + write_seqcount_t_begin(&sl->seqcount); } +/** + * write_sequnlock_irq() - end a non-interruptible seqlock_t write side section + * @sl: Pointer to &typedef seqlock_t + * + * ``_irq`` variant of write_sequnlock(). The write side section of + * given seqlock_t must've been opened with write_seqlock_irq(). + */ static inline void write_sequnlock_irq(seqlock_t *sl) { - write_seqcount_end(&sl->seqcount); + write_seqcount_t_end(&sl->seqcount); spin_unlock_irq(&sl->lock); } @ include/linux/seqlock.h:907 @ static inline unsigned long __write_seqlock_irqsave(seqlock_t *sl) unsigned long flags; spin_lock_irqsave(&sl->lock, flags); - write_seqcount_begin(&sl->seqcount); + write_seqcount_t_begin(&sl->seqcount); + return flags; } +/** + * write_seqlock_irqsave() - start a non-interruptible seqlock_t write section + * @lock: Pointer to &typedef seqlock_t + * @flags: Stack-allocated storage for saving caller's local interrupt + * state, to be passed to write_sequnlock_irqrestore(). + * + * ``_irqsave`` variant of write_seqlock(). Use if the read section of + * given seqlock_t can be invoked from a hardirq context. + * + * The opened write section must be closed with write_sequnlock_irqrestore(). + */ #define write_seqlock_irqsave(lock, flags) \ do { flags = __write_seqlock_irqsave(lock); } while (0) +/** + * write_sequnlock_irqrestore() - end non-interruptible seqlock_t write section + * @sl: Pointer to &typedef seqlock_t + * @flags: Caller's saved interrupt state, from write_seqlock_irqsave() + * + * ``_irqrestore`` variant of write_sequnlock(). The write section of + * given seqlock_t must've been opened with write_seqlock_irqsave(). + */ static inline void write_sequnlock_irqrestore(seqlock_t *sl, unsigned long flags) { - write_seqcount_end(&sl->seqcount); + write_seqcount_t_end(&sl->seqcount); spin_unlock_irqrestore(&sl->lock, flags); } -/* - * A locking reader exclusively locks out other writers and locking readers, - * but doesn't update the sequence number. Acts like a normal spin_lock/unlock. - * Don't need preempt_disable() because that is in the spin_lock already. +/** + * read_seqlock_excl() - begin a seqlock_t locking reader critical section + * @sl: Pointer to &typedef seqlock_t + * + * read_seqlock_excl opens a locking reader critical section for the + * given seqlock_t. A locking reader exclusively locks out other writers + * and other *locking* readers, but doesn't update the sequence number. + * + * Locking readers act like a normal spin_lock()/spin_unlock(). + * + * The opened read side section must be closed with read_sequnlock_excl(). */ static inline void read_seqlock_excl(seqlock_t *sl) { spin_lock(&sl->lock); } +/** + * read_sequnlock_excl() - end a seqlock_t locking reader critical section + * @sl: Pointer to &typedef seqlock_t + * + * read_sequnlock_excl closes the locking reader critical section opened + * with read_seqlock_excl(). + */ static inline void read_sequnlock_excl(seqlock_t *sl) { spin_unlock(&sl->lock); } /** - * read_seqbegin_or_lock - begin a sequence number check or locking block - * @lock: sequence lock - * @seq : sequence number to be checked + * read_seqbegin_or_lock() - begin a seqlock_t lockless or locking reader + * @lock: Pointer to &typedef seqlock_t + * @seq : Marker and return parameter. If the passed value is even, the + * reader will become a *lockless* seqlock_t sequence counter reader as + * in read_seqbegin(). If the passed value is odd, the reader will + * become a fully locking reader, as in read_seqlock_excl(). In the + * first call to read_seqbegin_or_lock(), the caller **must** initialize + * and pass an even value to @seq so a lockless read is optimistically + * tried first. * - * First try it once optimistically without taking the lock. If that fails, - * take the lock. The sequence number is also used as a marker for deciding - * whether to be a reader (even) or writer (odd). - * N.B. seq must be initialized to an even number to begin with. + * read_seqbegin_or_lock is an API designed to optimistically try a + * normal lockless seqlock_t read section first, as in read_seqbegin(). + * If an odd counter is found, the normal lockless read trial has + * failed, and the next reader iteration transforms to a full seqlock_t + * locking reader as in read_seqlock_excl(). + * + * This is typically used to avoid lockless seqlock_t readers starvation + * (too much retry loops) in the case of a sharp spike in write + * activity. + * + * The opened read section must be closed with done_seqretry(). Check + * Documentation/locking/seqlock.rst for template example code. + * + * Return: The encountered sequence counter value, returned through the + * @seq parameter, which is overloaded as a return parameter. The + * returned value must be checked with need_seqretry(). If the read + * section must be retried, the returned value must also be passed to + * the @seq parameter of the next read_seqbegin_or_lock() iteration. */ static inline void read_seqbegin_or_lock(seqlock_t *lock, int *seq) { @ include/linux/seqlock.h:1008 @ static inline void read_seqbegin_or_lock(seqlock_t *lock, int *seq) read_seqlock_excl(lock); } +/** + * need_seqretry() - validate seqlock_t "locking or lockless" reader section + * @lock: Pointer to &typedef seqlock_t + * @seq: count, from read_seqbegin_or_lock() + * + * need_seqretry checks if the seqlock_t read-side critical section + * started with read_seqbegin_or_lock() is valid. If it was not, the + * caller must retry the read-side section. + * + * Return: 1 if a retry is required, 0 otherwise + */ static inline int need_seqretry(seqlock_t *lock, int seq) { return !(seq & 1) && read_seqretry(lock, seq); } +/** + * done_seqretry() - end seqlock_t "locking or lockless" reader section + * @lock: Pointer to &typedef seqlock_t + * @seq: count, from read_seqbegin_or_lock() + * + * done_seqretry finishes the seqlock_t read side critical section + * started by read_seqbegin_or_lock(). The read section must've been + * already validated with need_seqretry(). + */ static inline void done_seqretry(seqlock_t *lock, int seq) { if (seq & 1) read_sequnlock_excl(lock); } +/** + * read_seqlock_excl_bh() - start a locking reader seqlock_t section + * with softirqs disabled + * @sl: Pointer to &typedef seqlock_t + * + * ``_bh`` variant of read_seqlock_excl(). Use this variant if the + * seqlock_t write side section, *or other read sections*, can be + * invoked from a softirq context + * + * The opened section must be closed with read_sequnlock_excl_bh(). + */ static inline void read_seqlock_excl_bh(seqlock_t *sl) { spin_lock_bh(&sl->lock); } +/** + * read_sequnlock_excl_bh() - stop a seqlock_t softirq-disabled locking + * reader section + * @sl: Pointer to &typedef seqlock_t + * + * ``_bh`` variant of read_sequnlock_excl(). The closed section must've + * been opened with read_seqlock_excl_bh(). + */ static inline void read_sequnlock_excl_bh(seqlock_t *sl) { spin_unlock_bh(&sl->lock); } +/** + * read_seqlock_excl_irq() - start a non-interruptible seqlock_t locking + * reader section + * @sl: Pointer to &typedef seqlock_t + * + * ``_irq`` variant of read_seqlock_excl(). Use this only if the + * seqlock_t write side critical section, *or other read side sections*, + * can be invoked from a hardirq context. + * + * The opened read section must be closed with read_sequnlock_excl_irq(). + */ static inline void read_seqlock_excl_irq(seqlock_t *sl) { spin_lock_irq(&sl->lock); } +/** + * read_sequnlock_excl_irq() - end an interrupts-disabled seqlock_t + * locking reader section + * @sl: Pointer to &typedef seqlock_t + * + * ``_irq`` variant of read_sequnlock_excl(). The closed section must've + * been opened with read_seqlock_excl_irq(). + */ static inline void read_sequnlock_excl_irq(seqlock_t *sl) { spin_unlock_irq(&sl->lock); @ include/linux/seqlock.h:1105 @ static inline unsigned long __read_seqlock_excl_irqsave(seqlock_t *sl) return flags; } +/** + * read_seqlock_excl_irqsave() - start a non-interruptible seqlock_t + * locking reader section + * @lock: Pointer to &typedef seqlock_t + * @flags: Stack-allocated storage for saving caller's local interrupt + * state, to be passed to read_sequnlock_excl_irqrestore(). + * + * ``_irqsave`` variant of read_seqlock_excl(). Use this only if the + * seqlock_t write side critical section, *or other read side sections*, + * can be invoked from a hardirq context. + * + * Opened section must be closed with read_sequnlock_excl_irqrestore(). + */ #define read_seqlock_excl_irqsave(lock, flags) \ do { flags = __read_seqlock_excl_irqsave(lock); } while (0) +/** + * read_sequnlock_excl_irqrestore() - end non-interruptible seqlock_t + * locking reader section + * @sl: Pointer to &typedef seqlock_t + * @flags: Caller's saved interrupt state, from + * read_seqlock_excl_irqsave() + * + * ``_irqrestore`` variant of read_sequnlock_excl(). The closed section + * must've been opened with read_seqlock_excl_irqsave(). + */ static inline void read_sequnlock_excl_irqrestore(seqlock_t *sl, unsigned long flags) { spin_unlock_irqrestore(&sl->lock, flags); } +/** + * read_seqbegin_or_lock_irqsave() - begin a seqlock_t lockless reader, or + * a non-interruptible locking reader + * @lock: Pointer to &typedef seqlock_t + * @seq: Marker and return parameter. Check read_seqbegin_or_lock(). + * + * This is the ``_irqsave`` variant of read_seqbegin_or_lock(). Use if + * the seqlock_t write side critical section, *or other read side sections*, + * can be invoked from hardirq context. + * + * The validity of the read section must be checked with need_seqretry(). + * The opened section must be closed with done_seqretry_irqrestore(). + * + * Return: + * + * 1. The saved local interrupts state in case of a locking reader, to be + * passed to done_seqretry_irqrestore(). + * + * 2. The encountered sequence counter value, returned through @seq which + * is overloaded as a return parameter. Check read_seqbegin_or_lock(). + */ static inline unsigned long read_seqbegin_or_lock_irqsave(seqlock_t *lock, int *seq) { @ include/linux/seqlock.h:1171 @ read_seqbegin_or_lock_irqsave(seqlock_t *lock, int *seq) return flags; } +/** + * done_seqretry_irqrestore() - end a seqlock_t lockless reader, or a + * non-interruptible locking reader section + * @lock: Pointer to &typedef seqlock_t + * @seq: Count, from read_seqbegin_or_lock_irqsave() + * @flags: Caller's saved local interrupt state in case of a locking + * reader, also from read_seqbegin_or_lock_irqsave() + * + * This is the ``_irqrestore`` variant of done_seqretry(). The read + * section must've been opened with read_seqbegin_or_lock_irqsave(), and + * validated with need_seqretry(). + */ static inline void done_seqretry_irqrestore(seqlock_t *lock, int seq, unsigned long flags) { @ include/linux/seqlock_types_internal.h:4 @ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __LINUX_SEQLOCK_TYPES_INTERNAL_H +#define __LINUX_SEQLOCK_TYPES_INTERNAL_H + +/* + * Sequence counters with associated locks + * + * Copyright (C) 2020 Linutronix GmbH + */ + +#ifndef __LINUX_SEQLOCK_H +#error This is an INTERNAL header; it must only be included by seqlock.h +#endif + +#include <linux/mutex.h> +#include <linux/spinlock.h> +#include <linux/ww_mutex.h> + +/* + * @s: pointer to seqcount_t or any of the seqcount_locktype_t variants + */ +#define __to_seqcount_t(s) \ +({ \ + seqcount_t *seq; \ + \ + if (__same_type(*(s), seqcount_t)) \ + seq = (seqcount_t *)(s); \ + else if (__same_type(*(s), seqcount_spinlock_t)) \ + seq = &((seqcount_spinlock_t *)(s))->seqcount; \ + else if (__same_type(*(s), seqcount_raw_spinlock_t)) \ + seq = &((seqcount_raw_spinlock_t *)(s))->seqcount; \ + else if (__same_type(*(s), seqcount_rwlock_t)) \ + seq = &((seqcount_rwlock_t *)(s))->seqcount; \ + else if (__same_type(*(s), seqcount_mutex_t)) \ + seq = &((seqcount_mutex_t *)(s))->seqcount; \ + else if (__same_type(*(s), seqcount_ww_mutex_t)) \ + seq = &((seqcount_ww_mutex_t *)(s))->seqcount; \ + else \ + BUILD_BUG_ON_MSG(1, "Unknown seqcount type"); \ + \ + seq; \ +}) + +/* + * seqcount_LOCKTYPE_t -- write APIs + * + * For associated lock types which do not implicitly disable preemption, + * enforce preemption protection in the write side functions. + * + * Never use lockdep for the raw write variants. + */ + +#ifdef CONFIG_PREEMPT_RT + +/* + * Do not disable preemption for PREEMPT_RT. Check comment on top of + * seqlock.h read_seqbegin() for rationale. + */ +#define __enforce_preemption_protection(s) (false) + +#else + +#define __associated_lock_is_preemptible(s) \ +({ \ + bool ret; \ + \ + if (__same_type(*(s), seqcount_t) || \ + __same_type(*(s), seqcount_spinlock_t) || \ + __same_type(*(s), seqcount_raw_spinlock_t) || \ + __same_type(*(s), seqcount_rwlock_t)) { \ + ret = false; \ + } else if (__same_type(*(s), seqcount_mutex_t) || \ + __same_type(*(s), seqcount_ww_mutex_t)) { \ + ret = true; \ + } else \ + BUILD_BUG_ON_MSG(1, "Unknown seqcount type"); \ + \ + ret; \ +}) + +#define __enforce_preemption_protection(s) \ + __associated_lock_is_preemptible(s) + +#endif /* CONFIG_PREEMPT_RT */ + +#ifdef CONFIG_LOCKDEP + +#define __assert_associated_lock_held(s) \ +do { \ + if (__same_type(*(s), seqcount_t)) \ + break; \ + \ + if (__same_type(*(s), seqcount_spinlock_t)) \ + lockdep_assert_held(((seqcount_spinlock_t *)(s))->lock);\ + else if (__same_type(*(s), seqcount_raw_spinlock_t)) \ + lockdep_assert_held(((seqcount_raw_spinlock_t *)(s))->lock); \ + else if (__same_type(*(s), seqcount_rwlock_t)) \ + lockdep_assert_held_write(((seqcount_rwlock_t *)(s))->lock); \ + else if (__same_type(*(s), seqcount_mutex_t)) \ + lockdep_assert_held(((seqcount_mutex_t *)(s))->lock); \ + else if (__same_type(*(s), seqcount_ww_mutex_t)) \ + lockdep_assert_held(&((seqcount_ww_mutex_t *)(s))->lock->base); \ + else \ + BUILD_BUG_ON_MSG(1, "Unknown seqcount type"); \ +} while (0) + +#else + +#define __assert_associated_lock_held(s) \ +do { \ + (void) __to_seqcount_t(s); \ +} while (0) + +#endif /* CONFIG_LOCKDEP */ + +#define do_raw_write_seqcount_begin(s) \ +do { \ + if (__enforce_preemption_protection(s)) \ + preempt_disable(); \ + \ + raw_write_seqcount_t_begin(__to_seqcount_t(s)); \ +} while (0) + +#define do_raw_write_seqcount_end(s) \ +do { \ + raw_write_seqcount_t_end(__to_seqcount_t(s)); \ + \ + if (__enforce_preemption_protection(s)) \ + preempt_enable(); \ +} while (0) + +#define do_write_seqcount_begin_nested(s, subclass) \ +do { \ + __assert_associated_lock_held(s); \ + \ + if (__enforce_preemption_protection(s)) \ + preempt_disable(); \ + \ + write_seqcount_t_begin_nested(__to_seqcount_t(s), subclass); \ +} while (0) + +#define do_write_seqcount_begin(s) \ +do { \ + __assert_associated_lock_held(s); \ + \ + if (__enforce_preemption_protection(s)) \ + preempt_disable(); \ + \ + write_seqcount_t_begin(__to_seqcount_t(s)); \ +} while (0) + +#define do_write_seqcount_end(s) \ +do { \ + write_seqcount_t_end(__to_seqcount_t(s)); \ + \ + if (__enforce_preemption_protection(s)) \ + preempt_enable(); \ +} while (0) + +#define do_write_seqcount_invalidate(s) \ + write_seqcount_t_invalidate(__to_seqcount_t(s)) + +#define do_raw_write_seqcount_barrier(s) \ + raw_write_seqcount_t_barrier(__to_seqcount_t(s)) + +/* + * Latch sequence counters write side critical sections don't need to + * run with preemption disabled. Check @raw_write_seqcount_latch(). + */ +#define do_raw_write_seqcount_latch(s) \ + raw_write_seqcount_t_latch(__to_seqcount_t(s)) + +/* + * seqcount_LOCKTYPE_t -- read APIs + */ + +#ifdef CONFIG_PREEMPT_RT + +/* + * Check comment on top of read_seqbegin() for rationale. + * + * @s: pointer to seqcount_t or any of the seqcount_locktype_t variants + */ +#define __rt_lock_unlock_associated_sleeping_lock(s) \ +do { \ + if (__same_type(*(s), seqcount_t) || \ + __same_type(*(s), seqcount_raw_spinlock_t)) { \ + break; /* NOP */ \ + } \ + \ + if (__same_type(*(s), seqcount_spinlock_t)) { \ + spin_lock(((seqcount_spinlock_t *) s)->lock); \ + spin_unlock(((seqcount_spinlock_t *) s)->lock); \ + } else if (__same_type(*(s), seqcount_rwlock_t)) { \ + read_lock(((seqcount_rwlock_t *) s)->lock); \ + read_unlock(((seqcount_rwlock_t *) s)->lock); \ + } else if (__same_type(*(s), seqcount_mutex_t)) { \ + mutex_lock(((seqcount_mutex_t *) s)->lock); \ + mutex_unlock(((seqcount_mutex_t *) s)->lock); \ + } else if (__same_type(*(s), seqcount_ww_mutex_t)) { \ + ww_mutex_lock(((seqcount_ww_mutex_t *) s)->lock, NULL); \ + ww_mutex_unlock(((seqcount_ww_mutex_t *) s)->lock); \ + } else \ + BUILD_BUG_ON_MSG(1, "Unknown seqcount type"); \ +} while (0) + +/* + * @s: pointer to seqcount_t or any of the seqcount_locktype_t variants + * + * After the lock-unlock operation, re-read the sequence counter since + * the writer made progress. + * + * Do not lock-unlock the seqcount associated sleeping lock again if the + * second counter read value is odd. If the first counter read was odd + * because the reader preempted the write-side critical section, the + * second odd value read must've been the result of a writer running on + * a parallel core instead. + */ +#define __raw_read_seqcount(s) \ +({ \ + unsigned seq = READ_ONCE(__to_seqcount_t(s)->sequence); \ + \ + if (unlikely(seq & 1)) \ + __rt_lock_unlock_associated_sleeping_lock(s); \ + \ + /* no read barrier, no counter stabilization, no lockdep */ \ + READ_ONCE(__to_seqcount_t(s)->sequence); \ +}) + +#define do___read_seqcount_begin(s) \ +({ \ + unsigned seq; \ + \ + do { \ + seq = __raw_read_seqcount(s); \ + cpu_relax(); \ + } while (unlikely(seq & 1)); \ + \ + /* no read barrier, with stabilized counter, no lockdep */ \ + seq; \ +}) + +#define do_raw_read_seqcount(s) \ +({ \ + unsigned seq = __raw_read_seqcount(s); \ + \ + smp_rmb(); \ + \ + /* with read barrier, no counter stabilization, no lockdep */ \ + seq; \ +}) + +#define do_raw_seqcount_begin(s) \ +({ \ + /* with read barrier, no counter stabilization, no lockdep */ \ + (do_raw_read_seqcount(s) & ~1); \ +}) + +#define do_raw_read_seqcount_begin(s) \ +({ \ + unsigned seq = do___read_seqcount_begin(s); \ + \ + smp_rmb(); \ + \ + /* with read barrier, with stabilized counter, no lockdep */ \ + seq; \ +}) + +#define do_read_seqcount_begin(s) \ +({ \ + seqcount_lockdep_reader_access(__to_seqcount_t(s)); \ + \ + /* with read barrier, stabilized counter, and lockdep */ \ + do_raw_read_seqcount_begin(s); \ +}) + +#else /* !CONFIG_PREEMPT_RT */ + +#define do___read_seqcount_begin(s) \ + __read_seqcount_t_begin(__to_seqcount_t(s)) + +#define do_raw_read_seqcount(s) \ + raw_read_seqcount_t(__to_seqcount_t(s)) + +#define do_raw_seqcount_begin(s) \ + raw_seqcount_t_begin(__to_seqcount_t(s)) + +#define do_raw_read_seqcount_begin(s) \ + raw_read_seqcount_t_begin(__to_seqcount_t(s)) + +#define do_read_seqcount_begin(s) \ + read_seqcount_t_begin(__to_seqcount_t(s)) + +#endif /* CONFIG_PREEMPT_RT */ + +/* + * Latch sequence counters allows interruptible, preemptible, writer + * sections. There is no need for a special PREEMPT_RT implementation. + */ +#define do_raw_read_seqcount_latch(s) \ + raw_read_seqcount_t_latch(__to_seqcount_t(s)) + +#define do___read_seqcount_retry(s, start) \ + __read_seqcount_t_retry(__to_seqcount_t(s), start) + +#define do_read_seqcount_retry(s, start) \ + read_seqcount_t_retry(__to_seqcount_t(s), start) + +#endif /* __LINUX_SEQLOCK_TYPES_INTERNAL_H */ @ include/linux/serial_8250.h:10 @ #ifndef _LINUX_SERIAL_8250_H #define _LINUX_SERIAL_8250_H +#include <linux/atomic.h> #include <linux/serial_core.h> #include <linux/serial_reg.h> #include <linux/platform_device.h> @ include/linux/serial_8250.h:128 @ struct uart_8250_port { #define MSR_SAVE_FLAGS UART_MSR_ANY_DELTA unsigned char msr_saved_flags; + atomic_t console_printing; + struct uart_8250_dma *dma; const struct uart_8250_ops *ops; @ include/linux/serial_8250.h:181 @ void serial8250_init_port(struct uart_8250_port *up); void serial8250_set_defaults(struct uart_8250_port *up); void serial8250_console_write(struct uart_8250_port *up, const char *s, unsigned int count); +void serial8250_console_write_atomic(struct uart_8250_port *up, const char *s, + unsigned int count); int serial8250_console_setup(struct uart_port *port, char *options, bool probe); extern void serial8250_set_isa_configurator(void (*v) @ include/linux/signal.h:258 @ static inline void init_sigpending(struct sigpending *sig) } extern void flush_sigqueue(struct sigpending *queue); +extern void flush_task_sigqueue(struct task_struct *tsk); /* Test if 'sig' is valid signal. Use this instead of testing _NSIG directly */ static inline int valid_signal(unsigned long sig) @ include/linux/skbuff.h:296 @ struct sk_buff_head { __u32 qlen; spinlock_t lock; + raw_spinlock_t raw_lock; }; struct sk_buff; @ include/linux/skbuff.h:1899 @ static inline void skb_queue_head_init(struct sk_buff_head *list) __skb_queue_head_init(list); } +static inline void skb_queue_head_init_raw(struct sk_buff_head *list) +{ + raw_spin_lock_init(&list->raw_lock); + __skb_queue_head_init(list); +} + static inline void skb_queue_head_init_class(struct sk_buff_head *list, struct lock_class_key *class) { @ include/linux/smp.h:224 @ static inline int get_boot_cpu_id(void) #define get_cpu() ({ preempt_disable(); __smp_processor_id(); }) #define put_cpu() preempt_enable() +#define get_cpu_light() ({ migrate_disable(); __smp_processor_id(); }) +#define put_cpu_light() migrate_enable() + /* * Callback to arch code if there's nosmp or maxcpus=0 on the * boot command line: @ include/linux/spinlock.h:310 @ static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock) }) /* Include rwlock functions */ -#include <linux/rwlock.h> +#ifdef CONFIG_PREEMPT_RT +# include <linux/rwlock_rt.h> +#else +# include <linux/rwlock.h> +#endif /* * Pull the _spin_*()/_read_*()/_write_*() functions/declarations: @ include/linux/spinlock.h:325 @ static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock) # include <linux/spinlock_api_up.h> #endif +#ifdef CONFIG_PREEMPT_RT +# include <linux/spinlock_rt.h> +#else /* PREEMPT_RT */ + /* * Map the spin_lock functions to the raw variants for PREEMPT_RT=n */ @ include/linux/spinlock.h:449 @ static __always_inline int spin_is_contended(spinlock_t *lock) #define assert_spin_locked(lock) assert_raw_spin_locked(&(lock)->rlock) +#endif /* !PREEMPT_RT */ + /* * Pull the atomic_t declaration: * (asm-mips/atomic.h needs above definitions) @ include/linux/spinlock_api_smp.h:190 @ static inline int __raw_spin_trylock_bh(raw_spinlock_t *lock) return 0; } -#include <linux/rwlock_api_smp.h> +#ifndef CONFIG_PREEMPT_RT +# include <linux/rwlock_api_smp.h> +#endif #endif /* __LINUX_SPINLOCK_API_SMP_H */ @ include/linux/spinlock_rt.h:4 @ +#ifndef __LINUX_SPINLOCK_RT_H +#define __LINUX_SPINLOCK_RT_H + +#ifndef __LINUX_SPINLOCK_H +#error Do not include directly. Use spinlock.h +#endif + +#include <linux/bug.h> + +extern void +__rt_spin_lock_init(spinlock_t *lock, const char *name, struct lock_class_key *key); + +#define spin_lock_init(slock) \ +do { \ + static struct lock_class_key __key; \ + \ + rt_mutex_init(&(slock)->lock); \ + __rt_spin_lock_init(slock, #slock, &__key); \ +} while (0) + +extern void __lockfunc rt_spin_lock(spinlock_t *lock); +extern unsigned long __lockfunc rt_spin_lock_trace_flags(spinlock_t *lock); +extern void __lockfunc rt_spin_lock_nested(spinlock_t *lock, int subclass); +extern void __lockfunc rt_spin_unlock(spinlock_t *lock); +extern void __lockfunc rt_spin_lock_unlock(spinlock_t *lock); +extern int __lockfunc rt_spin_trylock_irqsave(spinlock_t *lock, unsigned long *flags); +extern int __lockfunc rt_spin_trylock_bh(spinlock_t *lock); +extern int __lockfunc rt_spin_trylock(spinlock_t *lock); +extern int atomic_dec_and_spin_lock(atomic_t *atomic, spinlock_t *lock); + +/* + * lockdep-less calls, for derived types like rwlock: + * (for trylock they can use rt_mutex_trylock() directly. + * Migrate disable handling must be done at the call site. + */ +extern void __lockfunc __rt_spin_lock(struct rt_mutex *lock); +extern void __lockfunc __rt_spin_trylock(struct rt_mutex *lock); +extern void __lockfunc __rt_spin_unlock(struct rt_mutex *lock); + +#define spin_lock(lock) rt_spin_lock(lock) + +#define spin_lock_bh(lock) \ + do { \ + local_bh_disable(); \ + rt_spin_lock(lock); \ + } while (0) + +#define spin_lock_irq(lock) spin_lock(lock) + +#define spin_do_trylock(lock) __cond_lock(lock, rt_spin_trylock(lock)) + +#define spin_trylock(lock) \ +({ \ + int __locked; \ + __locked = spin_do_trylock(lock); \ + __locked; \ +}) + +#ifdef CONFIG_LOCKDEP +# define spin_lock_nested(lock, subclass) \ + do { \ + rt_spin_lock_nested(lock, subclass); \ + } while (0) + +#define spin_lock_bh_nested(lock, subclass) \ + do { \ + local_bh_disable(); \ + rt_spin_lock_nested(lock, subclass); \ + } while (0) + +# define spin_lock_irqsave_nested(lock, flags, subclass) \ + do { \ + typecheck(unsigned long, flags); \ + flags = 0; \ + rt_spin_lock_nested(lock, subclass); \ + } while (0) +#else +# define spin_lock_nested(lock, subclass) spin_lock(lock) +# define spin_lock_bh_nested(lock, subclass) spin_lock_bh(lock) + +# define spin_lock_irqsave_nested(lock, flags, subclass) \ + do { \ + typecheck(unsigned long, flags); \ + flags = 0; \ + spin_lock(lock); \ + } while (0) +#endif + +#define spin_lock_irqsave(lock, flags) \ + do { \ + typecheck(unsigned long, flags); \ + flags = 0; \ + spin_lock(lock); \ + } while (0) + +static inline unsigned long spin_lock_trace_flags(spinlock_t *lock) +{ + unsigned long flags = 0; +#ifdef CONFIG_TRACE_IRQFLAGS + flags = rt_spin_lock_trace_flags(lock); +#else + spin_lock(lock); /* lock_local */ +#endif + return flags; +} + +/* FIXME: we need rt_spin_lock_nest_lock */ +#define spin_lock_nest_lock(lock, nest_lock) spin_lock_nested(lock, 0) + +#define spin_unlock(lock) rt_spin_unlock(lock) + +#define spin_unlock_bh(lock) \ + do { \ + rt_spin_unlock(lock); \ + local_bh_enable(); \ + } while (0) + +#define spin_unlock_irq(lock) spin_unlock(lock) + +#define spin_unlock_irqrestore(lock, flags) \ + do { \ + typecheck(unsigned long, flags); \ + (void) flags; \ + spin_unlock(lock); \ + } while (0) + +#define spin_trylock_bh(lock) __cond_lock(lock, rt_spin_trylock_bh(lock)) +#define spin_trylock_irq(lock) spin_trylock(lock) + +#define spin_trylock_irqsave(lock, flags) \ + rt_spin_trylock_irqsave(lock, &(flags)) + +#ifdef CONFIG_GENERIC_LOCKBREAK +# define spin_is_contended(lock) ((lock)->break_lock) +#else +# define spin_is_contended(lock) (((void)(lock), 0)) +#endif + +static inline int spin_can_lock(spinlock_t *lock) +{ + return !rt_mutex_is_locked(&lock->lock); +} + +static inline int spin_is_locked(spinlock_t *lock) +{ + return rt_mutex_is_locked(&lock->lock); +} + +static inline void assert_spin_locked(spinlock_t *lock) +{ + BUG_ON(!spin_is_locked(lock)); +} + +#endif @ include/linux/spinlock_types.h:12 @ * Released under the General Public License (GPL). */ -#if defined(CONFIG_SMP) -# include <asm/spinlock_types.h> +#include <linux/spinlock_types_raw.h> + +#ifndef CONFIG_PREEMPT_RT +# include <linux/spinlock_types_nort.h> +# include <linux/rwlock_types.h> #else -# include <linux/spinlock_types_up.h> +# include <linux/rtmutex.h> +# include <linux/spinlock_types_rt.h> +# include <linux/rwlock_types_rt.h> #endif -#include <linux/lockdep.h> - -typedef struct raw_spinlock { - arch_spinlock_t raw_lock; -#ifdef CONFIG_DEBUG_SPINLOCK - unsigned int magic, owner_cpu; - void *owner; -#endif -#ifdef CONFIG_DEBUG_LOCK_ALLOC - struct lockdep_map dep_map; -#endif -} raw_spinlock_t; - -#define SPINLOCK_MAGIC 0xdead4ead - -#define SPINLOCK_OWNER_INIT ((void *)-1L) - -#ifdef CONFIG_DEBUG_LOCK_ALLOC -# define SPIN_DEP_MAP_INIT(lockname) .dep_map = { .name = #lockname } -#else -# define SPIN_DEP_MAP_INIT(lockname) -#endif - -#ifdef CONFIG_DEBUG_SPINLOCK -# define SPIN_DEBUG_INIT(lockname) \ - .magic = SPINLOCK_MAGIC, \ - .owner_cpu = -1, \ - .owner = SPINLOCK_OWNER_INIT, -#else -# define SPIN_DEBUG_INIT(lockname) -#endif - -#define __RAW_SPIN_LOCK_INITIALIZER(lockname) \ - { \ - .raw_lock = __ARCH_SPIN_LOCK_UNLOCKED, \ - SPIN_DEBUG_INIT(lockname) \ - SPIN_DEP_MAP_INIT(lockname) } - -#define __RAW_SPIN_LOCK_UNLOCKED(lockname) \ - (raw_spinlock_t) __RAW_SPIN_LOCK_INITIALIZER(lockname) - -#define DEFINE_RAW_SPINLOCK(x) raw_spinlock_t x = __RAW_SPIN_LOCK_UNLOCKED(x) - -typedef struct spinlock { - union { - struct raw_spinlock rlock; - -#ifdef CONFIG_DEBUG_LOCK_ALLOC -# define LOCK_PADSIZE (offsetof(struct raw_spinlock, dep_map)) - struct { - u8 __padding[LOCK_PADSIZE]; - struct lockdep_map dep_map; - }; -#endif - }; -} spinlock_t; - -#define __SPIN_LOCK_INITIALIZER(lockname) \ - { { .rlock = __RAW_SPIN_LOCK_INITIALIZER(lockname) } } - -#define __SPIN_LOCK_UNLOCKED(lockname) \ - (spinlock_t ) __SPIN_LOCK_INITIALIZER(lockname) - -#define DEFINE_SPINLOCK(x) spinlock_t x = __SPIN_LOCK_UNLOCKED(x) - -#include <linux/rwlock_types.h> - #endif /* __LINUX_SPINLOCK_TYPES_H */ @ include/linux/spinlock_types_nort.h:4 @ +#ifndef __LINUX_SPINLOCK_TYPES_NORT_H +#define __LINUX_SPINLOCK_TYPES_NORT_H + +#ifndef __LINUX_SPINLOCK_TYPES_H +#error "Do not include directly. Include spinlock_types.h instead" +#endif + +/* + * The non RT version maps spinlocks to raw_spinlocks + */ +typedef struct spinlock { + union { + struct raw_spinlock rlock; + +#ifdef CONFIG_DEBUG_LOCK_ALLOC +# define LOCK_PADSIZE (offsetof(struct raw_spinlock, dep_map)) + struct { + u8 __padding[LOCK_PADSIZE]; + struct lockdep_map dep_map; + }; +#endif + }; +} spinlock_t; + +#define __SPIN_LOCK_INITIALIZER(lockname) \ + { { .rlock = __RAW_SPIN_LOCK_INITIALIZER(lockname) } } + +#define __SPIN_LOCK_UNLOCKED(lockname) \ + (spinlock_t ) __SPIN_LOCK_INITIALIZER(lockname) + +#define DEFINE_SPINLOCK(x) spinlock_t x = __SPIN_LOCK_UNLOCKED(x) + +#endif @ include/linux/spinlock_types_raw.h:4 @ +#ifndef __LINUX_SPINLOCK_TYPES_RAW_H +#define __LINUX_SPINLOCK_TYPES_RAW_H + +#include <linux/types.h> + +#if defined(CONFIG_SMP) +# include <asm/spinlock_types.h> +#else +# include <linux/spinlock_types_up.h> +#endif + +#include <linux/lockdep.h> + +typedef struct raw_spinlock { + arch_spinlock_t raw_lock; +#ifdef CONFIG_DEBUG_SPINLOCK + unsigned int magic, owner_cpu; + void *owner; +#endif +#ifdef CONFIG_DEBUG_LOCK_ALLOC + struct lockdep_map dep_map; +#endif +} raw_spinlock_t; + +#define SPINLOCK_MAGIC 0xdead4ead + +#define SPINLOCK_OWNER_INIT ((void *)-1L) + +#ifdef CONFIG_DEBUG_LOCK_ALLOC +# define SPIN_DEP_MAP_INIT(lockname) .dep_map = { .name = #lockname } +#else +# define SPIN_DEP_MAP_INIT(lockname) +#endif + +#ifdef CONFIG_DEBUG_SPINLOCK +# define SPIN_DEBUG_INIT(lockname) \ + .magic = SPINLOCK_MAGIC, \ + .owner_cpu = -1, \ + .owner = SPINLOCK_OWNER_INIT, +#else +# define SPIN_DEBUG_INIT(lockname) +#endif + +#define __RAW_SPIN_LOCK_INITIALIZER(lockname) \ + { \ + .raw_lock = __ARCH_SPIN_LOCK_UNLOCKED, \ + SPIN_DEBUG_INIT(lockname) \ + SPIN_DEP_MAP_INIT(lockname) } + +#define __RAW_SPIN_LOCK_UNLOCKED(lockname) \ + (raw_spinlock_t) __RAW_SPIN_LOCK_INITIALIZER(lockname) + +#define DEFINE_RAW_SPINLOCK(x) raw_spinlock_t x = __RAW_SPIN_LOCK_UNLOCKED(x) + +#endif @ include/linux/spinlock_types_rt.h:4 @ +#ifndef __LINUX_SPINLOCK_TYPES_RT_H +#define __LINUX_SPINLOCK_TYPES_RT_H + +#ifndef __LINUX_SPINLOCK_TYPES_H +#error "Do not include directly. Include spinlock_types.h instead" +#endif + +#include <linux/cache.h> + +/* + * PREEMPT_RT: spinlocks - an RT mutex plus lock-break field: + */ +typedef struct spinlock { + struct rt_mutex lock; + unsigned int break_lock; +#ifdef CONFIG_DEBUG_LOCK_ALLOC + struct lockdep_map dep_map; +#endif +} spinlock_t; + +#ifdef CONFIG_DEBUG_RT_MUTEXES +# define __RT_SPIN_INITIALIZER(name) \ + { \ + .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \ + .save_state = 1, \ + .file = __FILE__, \ + .line = __LINE__ , \ + } +#else +# define __RT_SPIN_INITIALIZER(name) \ + { \ + .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \ + .save_state = 1, \ + } +#endif + +/* +.wait_list = PLIST_HEAD_INIT_RAW((name).lock.wait_list, (name).lock.wait_lock) +*/ + +#define __SPIN_LOCK_UNLOCKED(name) \ + { .lock = __RT_SPIN_INITIALIZER(name.lock), \ + SPIN_DEP_MAP_INIT(name) } + +#define DEFINE_SPINLOCK(name) \ + spinlock_t name = __SPIN_LOCK_UNLOCKED(name) + +#endif @ include/linux/spinlock_types_up.h:3 @ #ifndef __LINUX_SPINLOCK_TYPES_UP_H #define __LINUX_SPINLOCK_TYPES_UP_H -#ifndef __LINUX_SPINLOCK_TYPES_H -# error "please don't include this file directly" -#endif - /* * include/linux/spinlock_types_up.h - spinlock type definitions for UP * @ include/linux/stop_machine.h:29 @ struct cpu_stop_work { cpu_stop_fn_t fn; void *arg; struct cpu_stop_done *done; + /* Did not run due to disabled stopper; for nowait debug checks */ + bool disabled; }; int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg); @ include/linux/swait.h:163 @ static inline bool swq_has_sleeper(struct swait_queue_head *wq) extern void swake_up_one(struct swait_queue_head *q); extern void swake_up_all(struct swait_queue_head *q); extern void swake_up_locked(struct swait_queue_head *q); +extern void swake_up_all_locked(struct swait_queue_head *q); +extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait); extern void prepare_to_swait_exclusive(struct swait_queue_head *q, struct swait_queue *wait, int state); extern long prepare_to_swait_event(struct swait_queue_head *q, struct swait_queue *wait, int state); @ include/linux/swap.h:15 @ #include <linux/fs.h> #include <linux/atomic.h> #include <linux/page-flags.h> +#include <linux/locallock.h> #include <asm/page.h> struct notifier_block; @ include/linux/swap.h:332 @ extern unsigned long nr_free_pagecache_pages(void); /* linux/mm/swap.c */ +DECLARE_LOCAL_IRQ_LOCK(swapvec_lock); extern void lru_cache_add(struct page *); extern void lru_cache_add_anon(struct page *page); extern void lru_cache_add_file(struct page *page); @ include/linux/thread_info.h:100 @ static inline int test_ti_thread_flag(struct thread_info *ti, int flag) #define test_thread_flag(flag) \ test_ti_thread_flag(current_thread_info(), flag) -#define tif_need_resched() test_thread_flag(TIF_NEED_RESCHED) +#ifdef CONFIG_PREEMPT_LAZY +#define tif_need_resched() (test_thread_flag(TIF_NEED_RESCHED) || \ + test_thread_flag(TIF_NEED_RESCHED_LAZY)) +#define tif_need_resched_now() (test_thread_flag(TIF_NEED_RESCHED)) +#define tif_need_resched_lazy() test_thread_flag(TIF_NEED_RESCHED_LAZY)) + +#else +#define tif_need_resched() test_thread_flag(TIF_NEED_RESCHED) +#define tif_need_resched_now() test_thread_flag(TIF_NEED_RESCHED) +#define tif_need_resched_lazy() 0 +#endif #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES static inline int arch_within_stack_frames(const void * const stack, @ include/linux/trace_events.h:70 @ struct trace_entry { unsigned char flags; unsigned char preempt_count; int pid; + unsigned char migrate_disable; + unsigned char preempt_lazy_count; }; #define TRACE_EVENT_TYPE_MAX \ @ include/linux/uaccess.h:185 @ static __always_inline void pagefault_disabled_dec(void) */ static inline void pagefault_disable(void) { + migrate_disable(); pagefault_disabled_inc(); /* * make sure to have issued the store before a pagefault @ include/linux/uaccess.h:202 @ static inline void pagefault_enable(void) */ barrier(); pagefault_disabled_dec(); + migrate_enable(); } /* @ include/linux/vmstat.h:63 @ DECLARE_PER_CPU(struct vm_event_state, vm_event_states); */ static inline void __count_vm_event(enum vm_event_item item) { + preempt_disable_rt(); raw_cpu_inc(vm_event_states.event[item]); + preempt_enable_rt(); } static inline void count_vm_event(enum vm_event_item item) @ include/linux/vmstat.h:75 @ static inline void count_vm_event(enum vm_event_item item) static inline void __count_vm_events(enum vm_event_item item, long delta) { + preempt_disable_rt(); raw_cpu_add(vm_event_states.event[item], delta); + preempt_enable_rt(); } static inline void count_vm_events(enum vm_event_item item, long delta) @ include/linux/wait.h:13 @ #include <asm/current.h> #include <uapi/linux/wait.h> +#include <linux/atomic.h> typedef struct wait_queue_entry wait_queue_entry_t; @ include/linux/wait.h:24 @ int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int #define WQ_FLAG_EXCLUSIVE 0x01 #define WQ_FLAG_WOKEN 0x02 #define WQ_FLAG_BOOKMARK 0x04 +#define WQ_FLAG_CUSTOM 0x08 /* * A single wait-queue entry structure: @ include/net/gen_stats.h:9 @ #include <linux/socket.h> #include <linux/rtnetlink.h> #include <linux/pkt_sched.h> +#include <net/net_seq_lock.h> /* Note: this used to be in include/uapi/linux/gen_stats.h */ struct gnet_stats_basic_packed { @ include/net/gen_stats.h:46 @ int gnet_stats_start_copy_compat(struct sk_buff *skb, int type, spinlock_t *lock, struct gnet_dump *d, int padattr); -int gnet_stats_copy_basic(const seqcount_t *running, +int gnet_stats_copy_basic(net_seqlock_t *running, struct gnet_dump *d, struct gnet_stats_basic_cpu __percpu *cpu, struct gnet_stats_basic_packed *b); -void __gnet_stats_copy_basic(const seqcount_t *running, +void __gnet_stats_copy_basic(net_seqlock_t *running, struct gnet_stats_basic_packed *bstats, struct gnet_stats_basic_cpu __percpu *cpu, struct gnet_stats_basic_packed *b); -int gnet_stats_copy_basic_hw(const seqcount_t *running, +int gnet_stats_copy_basic_hw(net_seqlock_t *running, struct gnet_dump *d, struct gnet_stats_basic_cpu __percpu *cpu, struct gnet_stats_basic_packed *b); @ include/net/gen_stats.h:74 @ int gen_new_estimator(struct gnet_stats_basic_packed *bstats, struct gnet_stats_basic_cpu __percpu *cpu_bstats, struct net_rate_estimator __rcu **rate_est, spinlock_t *lock, - seqcount_t *running, struct nlattr *opt); + net_seqlock_t *running, struct nlattr *opt); void gen_kill_estimator(struct net_rate_estimator __rcu **ptr); int gen_replace_estimator(struct gnet_stats_basic_packed *bstats, struct gnet_stats_basic_cpu __percpu *cpu_bstats, struct net_rate_estimator __rcu **ptr, spinlock_t *lock, - seqcount_t *running, struct nlattr *opt); + net_seqlock_t *running, struct nlattr *opt); bool gen_estimator_active(struct net_rate_estimator __rcu **ptr); bool gen_estimator_read(struct net_rate_estimator __rcu **ptr, struct gnet_stats_rate_est64 *sample); @ include/net/net_seq_lock.h:4 @ +#ifndef __NET_NET_SEQ_LOCK_H__ +#define __NET_NET_SEQ_LOCK_H__ + +#ifdef CONFIG_PREEMPT_RT +# define net_seqlock_t seqlock_t +# define net_seq_begin(__r) read_seqbegin(__r) +# define net_seq_retry(__r, __s) read_seqretry(__r, __s) + +static inline int try_write_seqlock(seqlock_t *sl) +{ + if (spin_trylock(&sl->lock)) { + write_seqcount_begin(&sl->seqcount); + return 1; + } + return 0; +} + +#else +# define net_seqlock_t seqcount_t +# define net_seq_begin(__r) read_seqcount_begin(__r) +# define net_seq_retry(__r, __s) read_seqcount_retry(__r, __s) +#endif + +#endif @ include/net/netfilter/nf_conntrack.h:289 @ int nf_conntrack_hash_resize(unsigned int hashsize); extern struct hlist_nulls_head *nf_conntrack_hash; extern unsigned int nf_conntrack_htable_size; -extern seqcount_t nf_conntrack_generation; +extern seqcount_spinlock_t nf_conntrack_generation; extern unsigned int nf_conntrack_max; /* must be called with rcu read lock held */ @ include/net/sch_generic.h:13 @ #include <linux/percpu.h> #include <linux/dynamic_queue_limits.h> #include <linux/list.h> +#include <net/net_seq_lock.h> #include <linux/refcount.h> #include <linux/workqueue.h> #include <linux/mutex.h> @ include/net/sch_generic.h:104 @ struct Qdisc { struct sk_buff_head gso_skb ____cacheline_aligned_in_smp; struct qdisc_skb_head q; struct gnet_stats_basic_packed bstats; - seqcount_t running; + net_seqlock_t running; struct gnet_stats_queue qstats; unsigned long state; struct Qdisc *next_sched; @ include/net/sch_generic.h:142 @ static inline bool qdisc_is_running(struct Qdisc *qdisc) { if (qdisc->flags & TCQ_F_NOLOCK) return spin_is_locked(&qdisc->seqlock); +#ifdef CONFIG_PREEMPT_RT + return spin_is_locked(&qdisc->running.lock) ? true : false; +#else return (raw_read_seqcount(&qdisc->running) & 1) ? true : false; +#endif } static inline bool qdisc_is_percpu_stats(const struct Qdisc *q) @ include/net/sch_generic.h:170 @ static inline bool qdisc_run_begin(struct Qdisc *qdisc) } else if (qdisc_is_running(qdisc)) { return false; } +#ifdef CONFIG_PREEMPT_RT + if (try_write_seqlock(&qdisc->running)) + return true; + return false; +#else /* Variant of write_seqcount_begin() telling lockdep a trylock * was attempted. */ raw_write_seqcount_begin(&qdisc->running); seqcount_acquire(&qdisc->running.dep_map, 0, 1, _RET_IP_); return true; +#endif } static inline void qdisc_run_end(struct Qdisc *qdisc) { +#ifdef CONFIG_PREEMPT_RT + write_sequnlock(&qdisc->running); +#else write_seqcount_end(&qdisc->running); +#endif if (qdisc->flags & TCQ_F_NOLOCK) spin_unlock(&qdisc->seqlock); } @ include/net/sch_generic.h:560 @ static inline spinlock_t *qdisc_root_sleeping_lock(const struct Qdisc *qdisc) return qdisc_lock(root); } -static inline seqcount_t *qdisc_root_sleeping_running(const struct Qdisc *qdisc) +static inline net_seqlock_t *qdisc_root_sleeping_running(const struct Qdisc *qdisc) { struct Qdisc *root = qdisc_root_sleeping(qdisc); @ init/Kconfig:894 @ config CFS_BANDWIDTH config RT_GROUP_SCHED bool "Group scheduling for SCHED_RR/FIFO" depends on CGROUP_SCHED + depends on !PREEMPT_RT default n help This feature lets you explicitly allocate real CPU bandwidth @ init/Kconfig:1785 @ choice config SLAB bool "SLAB" + depends on !PREEMPT_RT select HAVE_HARDENED_USERCOPY_ALLOCATOR help The regular slab allocator that is established and known to work @ init/Kconfig:1806 @ config SLUB config SLOB depends on EXPERT bool "SLOB (Simple Allocator)" + depends on !PREEMPT_RT help SLOB replaces the stock allocator with a drastically simpler allocator. SLOB is generally more space efficient but @ init/Kconfig:1872 @ config SHUFFLE_PAGE_ALLOCATOR config SLUB_CPU_PARTIAL default y - depends on SLUB && SMP + depends on SLUB && SMP && !PREEMPT_RT bool "SLUB per cpu partial cache" help Per cpu partial caches accelerate objects allocation and freeing @ init/init_task.c:76 @ struct task_struct init_task .cpus_ptr = &init_task.cpus_mask, .cpus_mask = CPU_MASK_ALL, .nr_cpus_allowed= NR_CPUS, +#if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT) && \ + defined(CONFIG_SCHED_DEBUG) + .pinned_on_cpu = -1, +#endif .mm = NULL, .active_mm = &init_mm, .restart_block = { @ init/init_task.c:148 @ struct task_struct init_task .rcu_tasks_idle_cpu = -1, #endif #ifdef CONFIG_CPUSETS - .mems_allowed_seq = SEQCNT_ZERO(init_task.mems_allowed_seq), + .mems_allowed_seq = SEQCNT_SPINLOCK_ZERO(init_task.mems_allowed_seq, + &init_task.alloc_lock), #endif #ifdef CONFIG_RT_MUTEXES .pi_waiters = RB_ROOT_CACHED, @ kernel/Kconfig.preempt:4 @ # SPDX-License-Identifier: GPL-2.0-only +config HAVE_PREEMPT_LAZY + bool + +config PREEMPT_LAZY + def_bool y if HAVE_PREEMPT_LAZY && PREEMPT_RT + choice prompt "Preemption Model" default PREEMPT_NONE @ kernel/bpf/hashtab.c:30 @ .map_delete_batch = \ generic_map_delete_batch +/* + * The bucket lock has two protection scopes: + * + * 1) Serializing concurrent operations from BPF programs on differrent + * CPUs + * + * 2) Serializing concurrent operations from BPF programs and sys_bpf() + * + * BPF programs can execute in any context including perf, kprobes and + * tracing. As there are almost no limits where perf, kprobes and tracing + * can be invoked from the lock operations need to be protected against + * deadlocks. Deadlocks can be caused by recursion and by an invocation in + * the lock held section when functions which acquire this lock are invoked + * from sys_bpf(). BPF recursion is prevented by incrementing the per CPU + * variable bpf_prog_active, which prevents BPF programs attached to perf + * events, kprobes and tracing to be invoked before the prior invocation + * from one of these contexts completed. sys_bpf() uses the same mechanism + * by pinning the task to the current CPU and incrementing the recursion + * protection accross the map operation. + * + * This has subtle implications on PREEMPT_RT. PREEMPT_RT forbids certain + * operations like memory allocations (even with GFP_ATOMIC) from atomic + * contexts. This is required because even with GFP_ATOMIC the memory + * allocator calls into code pathes which acquire locks with long held lock + * sections. To ensure the deterministic behaviour these locks are regular + * spinlocks, which are converted to 'sleepable' spinlocks on RT. The only + * true atomic contexts on an RT kernel are the low level hardware + * handling, scheduling, low level interrupt handling, NMIs etc. None of + * these contexts should ever do memory allocations. + * + * As regular device interrupt handlers and soft interrupts are forced into + * thread context, the existing code which does + * spin_lock*(); alloc(GPF_ATOMIC); spin_unlock*(); + * just works. + * + * In theory the BPF locks could be converted to regular spinlocks as well, + * but the bucket locks and percpu_freelist locks can be taken from + * arbitrary contexts (perf, kprobes, tracepoints) which are required to be + * atomic contexts even on RT. These mechanisms require preallocated maps, + * so there is no need to invoke memory allocations within the lock held + * sections. + * + * BPF maps which need dynamic allocation are only used from (forced) + * thread context on RT and can therefore use regular spinlocks which in + * turn allows to invoke memory allocations from the lock held section. + * + * On a non RT kernel this distinction is neither possible nor required. + * spinlock maps to raw_spinlock and the extra code is optimized out by the + * compiler. + */ struct bucket { struct hlist_nulls_head head; - raw_spinlock_t lock; + union { + raw_spinlock_t raw_lock; + spinlock_t lock; + }; }; struct bpf_htab { @ kernel/bpf/hashtab.c:124 @ struct htab_elem { char key[0] __aligned(8); }; +static inline bool htab_is_prealloc(const struct bpf_htab *htab) +{ + return !(htab->map.map_flags & BPF_F_NO_PREALLOC); +} + +static inline bool htab_use_raw_lock(const struct bpf_htab *htab) +{ + return (!IS_ENABLED(CONFIG_PREEMPT_RT) || htab_is_prealloc(htab)); +} + +static void htab_init_buckets(struct bpf_htab *htab) +{ + unsigned i; + + for (i = 0; i < htab->n_buckets; i++) { + INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i); + if (htab_use_raw_lock(htab)) + raw_spin_lock_init(&htab->buckets[i].raw_lock); + else + spin_lock_init(&htab->buckets[i].lock); + } +} + +static inline unsigned long htab_lock_bucket(const struct bpf_htab *htab, + struct bucket *b) +{ + unsigned long flags; + + if (htab_use_raw_lock(htab)) + raw_spin_lock_irqsave(&b->raw_lock, flags); + else + spin_lock_irqsave(&b->lock, flags); + return flags; +} + +static inline void htab_unlock_bucket(const struct bpf_htab *htab, + struct bucket *b, + unsigned long flags) +{ + if (htab_use_raw_lock(htab)) + raw_spin_unlock_irqrestore(&b->raw_lock, flags); + else + spin_unlock_irqrestore(&b->lock, flags); +} + static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node); static bool htab_is_lru(const struct bpf_htab *htab) @ kernel/bpf/hashtab.c:183 @ static bool htab_is_percpu(const struct bpf_htab *htab) htab->map.map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH; } -static bool htab_is_prealloc(const struct bpf_htab *htab) -{ - return !(htab->map.map_flags & BPF_F_NO_PREALLOC); -} - static inline void htab_elem_set_ptr(struct htab_elem *l, u32 key_size, void __percpu *pptr) { @ kernel/bpf/hashtab.c:424 @ static struct bpf_map *htab_map_alloc(union bpf_attr *attr) bool percpu_lru = (attr->map_flags & BPF_F_NO_COMMON_LRU); bool prealloc = !(attr->map_flags & BPF_F_NO_PREALLOC); struct bpf_htab *htab; - int err, i; u64 cost; + int err; htab = kzalloc(sizeof(*htab), GFP_USER); if (!htab) @ kernel/bpf/hashtab.c:487 @ static struct bpf_map *htab_map_alloc(union bpf_attr *attr) else htab->hashrnd = get_random_int(); - for (i = 0; i < htab->n_buckets; i++) { - INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i); - raw_spin_lock_init(&htab->buckets[i].lock); - } + htab_init_buckets(htab); if (prealloc) { err = prealloc_init(htab); @ kernel/bpf/hashtab.c:695 @ static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node) b = __select_bucket(htab, tgt_l->hash); head = &b->head; - raw_spin_lock_irqsave(&b->lock, flags); + flags = htab_lock_bucket(htab, b); hlist_nulls_for_each_entry_rcu(l, n, head, hash_node) if (l == tgt_l) { @ kernel/bpf/hashtab.c:703 @ static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node) break; } - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); return l == tgt_l; } @ kernel/bpf/hashtab.c:779 @ static void htab_elem_free_rcu(struct rcu_head *head) struct htab_elem *l = container_of(head, struct htab_elem, rcu); struct bpf_htab *htab = l->htab; - /* must increment bpf_prog_active to avoid kprobe+bpf triggering while - * we're calling kfree, otherwise deadlock is possible if kprobes - * are placed somewhere inside of slub - */ - preempt_disable(); - __this_cpu_inc(bpf_prog_active); htab_elem_free(htab, l); - __this_cpu_dec(bpf_prog_active); - preempt_enable(); } static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l) @ kernel/bpf/hashtab.c:969 @ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value, */ } - /* bpf_map_update_elem() can be called in_irq() */ - raw_spin_lock_irqsave(&b->lock, flags); + flags = htab_lock_bucket(htab, b); l_old = lookup_elem_raw(head, hash, key, key_size); @ kernel/bpf/hashtab.c:1010 @ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value, } ret = 0; err: - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); return ret; } @ kernel/bpf/hashtab.c:1048 @ static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value, return -ENOMEM; memcpy(l_new->key + round_up(map->key_size, 8), value, map->value_size); - /* bpf_map_update_elem() can be called in_irq() */ - raw_spin_lock_irqsave(&b->lock, flags); + flags = htab_lock_bucket(htab, b); l_old = lookup_elem_raw(head, hash, key, key_size); @ kernel/bpf/hashtab.c:1067 @ static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value, ret = 0; err: - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); if (ret) bpf_lru_push_free(&htab->lru, &l_new->lru_node); @ kernel/bpf/hashtab.c:1102 @ static int __htab_percpu_map_update_elem(struct bpf_map *map, void *key, b = __select_bucket(htab, hash); head = &b->head; - /* bpf_map_update_elem() can be called in_irq() */ - raw_spin_lock_irqsave(&b->lock, flags); + flags = htab_lock_bucket(htab, b); l_old = lookup_elem_raw(head, hash, key, key_size); @ kernel/bpf/hashtab.c:1125 @ static int __htab_percpu_map_update_elem(struct bpf_map *map, void *key, } ret = 0; err: - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); return ret; } @ kernel/bpf/hashtab.c:1165 @ static int __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key, return -ENOMEM; } - /* bpf_map_update_elem() can be called in_irq() */ - raw_spin_lock_irqsave(&b->lock, flags); + flags = htab_lock_bucket(htab, b); l_old = lookup_elem_raw(head, hash, key, key_size); @ kernel/bpf/hashtab.c:1187 @ static int __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key, } ret = 0; err: - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); if (l_new) bpf_lru_push_free(&htab->lru, &l_new->lru_node); return ret; @ kernel/bpf/hashtab.c:1225 @ static int htab_map_delete_elem(struct bpf_map *map, void *key) b = __select_bucket(htab, hash); head = &b->head; - raw_spin_lock_irqsave(&b->lock, flags); + flags = htab_lock_bucket(htab, b); l = lookup_elem_raw(head, hash, key, key_size); @ kernel/bpf/hashtab.c:1235 @ static int htab_map_delete_elem(struct bpf_map *map, void *key) ret = 0; } - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); return ret; } @ kernel/bpf/hashtab.c:1257 @ static int htab_lru_map_delete_elem(struct bpf_map *map, void *key) b = __select_bucket(htab, hash); head = &b->head; - raw_spin_lock_irqsave(&b->lock, flags); + flags = htab_lock_bucket(htab, b); l = lookup_elem_raw(head, hash, key, key_size); @ kernel/bpf/hashtab.c:1266 @ static int htab_lru_map_delete_elem(struct bpf_map *map, void *key) ret = 0; } - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); if (l) bpf_lru_push_free(&htab->lru, &l->lru_node); return ret; @ kernel/bpf/hashtab.c:1406 @ __htab_map_lookup_and_delete_batch(struct bpf_map *map, } again: - preempt_disable(); - this_cpu_inc(bpf_prog_active); + bpf_disable_instrumentation(); rcu_read_lock(); again_nocopy: dst_key = keys; @ kernel/bpf/hashtab.c:1415 @ __htab_map_lookup_and_delete_batch(struct bpf_map *map, head = &b->head; /* do not grab the lock unless need it (bucket_cnt > 0). */ if (locked) - raw_spin_lock_irqsave(&b->lock, flags); + flags = htab_lock_bucket(htab, b); bucket_cnt = 0; hlist_nulls_for_each_entry_rcu(l, n, head, hash_node) @ kernel/bpf/hashtab.c:1432 @ __htab_map_lookup_and_delete_batch(struct bpf_map *map, /* Note that since bucket_cnt > 0 here, it is implicit * that the locked was grabbed, so release it. */ - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); rcu_read_unlock(); - this_cpu_dec(bpf_prog_active); - preempt_enable(); + bpf_enable_instrumentation(); goto after_loop; } @ kernel/bpf/hashtab.c:1443 @ __htab_map_lookup_and_delete_batch(struct bpf_map *map, /* Note that since bucket_cnt > 0 here, it is implicit * that the locked was grabbed, so release it. */ - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); rcu_read_unlock(); - this_cpu_dec(bpf_prog_active); - preempt_enable(); + bpf_enable_instrumentation(); kvfree(keys); kvfree(values); goto alloc; @ kernel/bpf/hashtab.c:1496 @ __htab_map_lookup_and_delete_batch(struct bpf_map *map, dst_val += value_size; } - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); locked = false; while (node_to_free) { @ kernel/bpf/hashtab.c:1515 @ __htab_map_lookup_and_delete_batch(struct bpf_map *map, } rcu_read_unlock(); - this_cpu_dec(bpf_prog_active); - preempt_enable(); + bpf_enable_instrumentation(); if (bucket_cnt && (copy_to_user(ukeys + total * key_size, keys, key_size * bucket_cnt) || copy_to_user(uvalues + total * value_size, values, @ kernel/bpf/lpm_trie.c:37 @ struct lpm_trie { size_t n_entries; size_t max_prefixlen; size_t data_size; - raw_spinlock_t lock; + spinlock_t lock; }; /* This trie implements a longest prefix match algorithm that can be used to @ kernel/bpf/lpm_trie.c:318 @ static int trie_update_elem(struct bpf_map *map, if (key->prefixlen > trie->max_prefixlen) return -EINVAL; - raw_spin_lock_irqsave(&trie->lock, irq_flags); + spin_lock_irqsave(&trie->lock, irq_flags); /* Allocate and fill a new node */ @ kernel/bpf/lpm_trie.c:425 @ static int trie_update_elem(struct bpf_map *map, kfree(im_node); } - raw_spin_unlock_irqrestore(&trie->lock, irq_flags); + spin_unlock_irqrestore(&trie->lock, irq_flags); return ret; } @ kernel/bpf/lpm_trie.c:445 @ static int trie_delete_elem(struct bpf_map *map, void *_key) if (key->prefixlen > trie->max_prefixlen) return -EINVAL; - raw_spin_lock_irqsave(&trie->lock, irq_flags); + spin_lock_irqsave(&trie->lock, irq_flags); /* Walk the tree looking for an exact key/length match and keeping * track of the path we traverse. We will need to know the node @ kernel/bpf/lpm_trie.c:521 @ static int trie_delete_elem(struct bpf_map *map, void *_key) kfree_rcu(node, rcu); out: - raw_spin_unlock_irqrestore(&trie->lock, irq_flags); + spin_unlock_irqrestore(&trie->lock, irq_flags); return ret; } @ kernel/bpf/lpm_trie.c:578 @ static struct bpf_map *trie_alloc(union bpf_attr *attr) if (ret) goto out_err; - raw_spin_lock_init(&trie->lock); + spin_lock_init(&trie->lock); return &trie->map; out_err: @ kernel/bpf/percpu_freelist.c:28 @ void pcpu_freelist_destroy(struct pcpu_freelist *s) free_percpu(s->freelist); } +static inline void pcpu_freelist_push_node(struct pcpu_freelist_head *head, + struct pcpu_freelist_node *node) +{ + node->next = head->first; + head->first = node; +} + static inline void ___pcpu_freelist_push(struct pcpu_freelist_head *head, struct pcpu_freelist_node *node) { raw_spin_lock(&head->lock); - node->next = head->first; - head->first = node; + pcpu_freelist_push_node(head, node); raw_spin_unlock(&head->lock); } @ kernel/bpf/percpu_freelist.c:65 @ void pcpu_freelist_populate(struct pcpu_freelist *s, void *buf, u32 elem_size, u32 nr_elems) { struct pcpu_freelist_head *head; - unsigned long flags; int i, cpu, pcpu_entries; pcpu_entries = nr_elems / num_possible_cpus() + 1; i = 0; - /* disable irq to workaround lockdep false positive - * in bpf usage pcpu_freelist_populate() will never race - * with pcpu_freelist_push() - */ - local_irq_save(flags); for_each_possible_cpu(cpu) { again: head = per_cpu_ptr(s->freelist, cpu); - ___pcpu_freelist_push(head, buf); + /* No locking required as this is not visible yet. */ + pcpu_freelist_push_node(head, buf); i++; buf += elem_size; if (i == nr_elems) @ kernel/bpf/percpu_freelist.c:82 @ void pcpu_freelist_populate(struct pcpu_freelist *s, void *buf, u32 elem_size, if (i % pcpu_entries) goto again; } - local_irq_restore(flags); } struct pcpu_freelist_node *__pcpu_freelist_pop(struct pcpu_freelist *s) @ kernel/bpf/stackmap.c:43 @ static void do_up_read(struct irq_work *entry) { struct stack_map_irq_work *work; + if (WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT))) + return; + work = container_of(entry, struct stack_map_irq_work, irq_work); up_read_non_owner(work->sem); work->sem = NULL; @ kernel/bpf/stackmap.c:294 @ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs, struct stack_map_irq_work *work = NULL; if (irqs_disabled()) { - work = this_cpu_ptr(&up_read_work); - if (atomic_read(&work->irq_work.flags) & IRQ_WORK_BUSY) - /* cannot queue more up_read, fallback */ + if (!IS_ENABLED(CONFIG_PREEMPT_RT)) { + work = this_cpu_ptr(&up_read_work); + if (atomic_read(&work->irq_work.flags) & IRQ_WORK_BUSY) { + /* cannot queue more up_read, fallback */ + irq_work_busy = true; + } + } else { + /* + * PREEMPT_RT does not allow to trylock mmap sem in + * interrupt disabled context. Force the fallback code. + */ irq_work_busy = true; + } } /* @ kernel/bpf/syscall.c:174 @ static int bpf_map_update_value(struct bpf_map *map, struct fd f, void *key, flags); } - /* must increment bpf_prog_active to avoid kprobe+bpf triggering from - * inside bpf map update or delete otherwise deadlocks are possible - */ - preempt_disable(); - __this_cpu_inc(bpf_prog_active); + bpf_disable_instrumentation(); if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH || map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) { err = bpf_percpu_hash_update(map, key, value, flags); @ kernel/bpf/syscall.c:205 @ static int bpf_map_update_value(struct bpf_map *map, struct fd f, void *key, err = map->ops->map_update_elem(map, key, value, flags); rcu_read_unlock(); } - __this_cpu_dec(bpf_prog_active); - preempt_enable(); + bpf_enable_instrumentation(); maybe_wait_bpf_programs(map); return err; @ kernel/bpf/syscall.c:220 @ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value, if (bpf_map_is_dev_bound(map)) return bpf_map_offload_lookup_elem(map, key, value); - preempt_disable(); - this_cpu_inc(bpf_prog_active); + bpf_disable_instrumentation(); if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH || map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) { err = bpf_percpu_hash_copy(map, key, value); @ kernel/bpf/syscall.c:265 @ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value, rcu_read_unlock(); } - this_cpu_dec(bpf_prog_active); - preempt_enable(); + bpf_enable_instrumentation(); maybe_wait_bpf_programs(map); return err; @ kernel/bpf/syscall.c:1143 @ static int map_delete_elem(union bpf_attr *attr) goto out; } - preempt_disable(); - __this_cpu_inc(bpf_prog_active); + bpf_disable_instrumentation(); rcu_read_lock(); err = map->ops->map_delete_elem(map, key); rcu_read_unlock(); - __this_cpu_dec(bpf_prog_active); - preempt_enable(); + bpf_enable_instrumentation(); maybe_wait_bpf_programs(map); out: kfree(key); @ kernel/bpf/syscall.c:1259 @ int generic_map_delete_batch(struct bpf_map *map, break; } - preempt_disable(); - __this_cpu_inc(bpf_prog_active); + bpf_disable_instrumentation(); rcu_read_lock(); err = map->ops->map_delete_elem(map, key); rcu_read_unlock(); - __this_cpu_dec(bpf_prog_active); - preempt_enable(); + bpf_enable_instrumentation(); maybe_wait_bpf_programs(map); if (err) break; @ kernel/bpf/trampoline.c:370 @ void bpf_trampoline_put(struct bpf_trampoline *tr) mutex_unlock(&trampoline_mutex); } -/* The logic is similar to BPF_PROG_RUN, but with explicit rcu and preempt that - * are needed for trampoline. The macro is split into +/* The logic is similar to BPF_PROG_RUN, but with an explicit + * rcu_read_lock() and migrate_disable() which are required + * for the trampoline. The macro is split into * call _bpf_prog_enter * call prog->bpf_func * call __bpf_prog_exit @ kernel/bpf/trampoline.c:382 @ u64 notrace __bpf_prog_enter(void) u64 start = 0; rcu_read_lock(); - preempt_disable(); + migrate_disable(); if (static_branch_unlikely(&bpf_stats_enabled_key)) start = sched_clock(); return start; @ kernel/bpf/trampoline.c:405 @ void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start) stats->nsecs += sched_clock() - start; u64_stats_update_end(&stats->syncp); } - preempt_enable(); + migrate_enable(); rcu_read_unlock(); } @ kernel/bpf/verifier.c:8210 @ static bool is_tracing_prog_type(enum bpf_prog_type type) } } +static bool is_preallocated_map(struct bpf_map *map) +{ + if (!check_map_prealloc(map)) + return false; + if (map->inner_map_meta && !check_map_prealloc(map->inner_map_meta)) + return false; + return true; +} + static int check_map_prog_compatibility(struct bpf_verifier_env *env, struct bpf_map *map, struct bpf_prog *prog) { - /* Make sure that BPF_PROG_TYPE_PERF_EVENT programs only use - * preallocated hash maps, since doing memory allocation - * in overflow_handler can crash depending on where nmi got - * triggered. + /* + * Validate that trace type programs use preallocated hash maps. + * + * For programs attached to PERF events this is mandatory as the + * perf NMI can hit any arbitrary code sequence. + * + * All other trace types using preallocated hash maps are unsafe as + * well because tracepoint or kprobes can be inside locked regions + * of the memory allocator or at a place where a recursion into the + * memory allocator would see inconsistent state. + * + * On RT enabled kernels run-time allocation of all trace type + * programs is strictly prohibited due to lock type constraints. On + * !RT kernels it is allowed for backwards compatibility reasons for + * now, but warnings are emitted so developers are made aware of + * the unsafety and can fix their programs before this is enforced. */ - if (prog->type == BPF_PROG_TYPE_PERF_EVENT) { - if (!check_map_prealloc(map)) { + if (is_tracing_prog_type(prog->type) && !is_preallocated_map(map)) { + if (prog->type == BPF_PROG_TYPE_PERF_EVENT) { verbose(env, "perf_event programs can only use preallocated hash map\n"); return -EINVAL; } - if (map->inner_map_meta && - !check_map_prealloc(map->inner_map_meta)) { - verbose(env, "perf_event programs can only use preallocated inner hash map\n"); + if (IS_ENABLED(CONFIG_PREEMPT_RT)) { + verbose(env, "trace type programs can only use preallocated hash map\n"); return -EINVAL; } + WARN_ONCE(1, "trace type BPF program uses run-time allocation\n"); + verbose(env, "trace type programs with run-time allocated hash maps are unsafe. Switch to preallocated hash maps.\n"); } if ((is_tracing_prog_type(prog->type) || @ kernel/cgroup/cpuset.c:348 @ void cpuset_read_unlock(void) percpu_up_read(&cpuset_rwsem); } -static DEFINE_SPINLOCK(callback_lock); +static DEFINE_RAW_SPINLOCK(callback_lock); static struct workqueue_struct *cpuset_migrate_mm_wq; @ kernel/cgroup/cpuset.c:1256 @ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, * Newly added CPUs will be removed from effective_cpus and * newly deleted ones will be added back to effective_cpus. */ - spin_lock_irq(&callback_lock); + raw_spin_lock_irq(&callback_lock); if (adding) { cpumask_or(parent->subparts_cpus, parent->subparts_cpus, tmp->addmask); @ kernel/cgroup/cpuset.c:1275 @ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, } parent->nr_subparts_cpus = cpumask_weight(parent->subparts_cpus); - spin_unlock_irq(&callback_lock); + raw_spin_unlock_irq(&callback_lock); return cmd == partcmd_update; } @ kernel/cgroup/cpuset.c:1380 @ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp) continue; rcu_read_unlock(); - spin_lock_irq(&callback_lock); + raw_spin_lock_irq(&callback_lock); cpumask_copy(cp->effective_cpus, tmp->new_cpus); if (cp->nr_subparts_cpus && @ kernel/cgroup/cpuset.c:1411 @ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp) = cpumask_weight(cp->subparts_cpus); } } - spin_unlock_irq(&callback_lock); + raw_spin_unlock_irq(&callback_lock); WARN_ON(!is_in_v2_mode() && !cpumask_equal(cp->cpus_allowed, cp->effective_cpus)); @ kernel/cgroup/cpuset.c:1529 @ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs, return -EINVAL; } - spin_lock_irq(&callback_lock); + raw_spin_lock_irq(&callback_lock); cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed); /* @ kernel/cgroup/cpuset.c:1540 @ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs, cs->cpus_allowed); cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus); } - spin_unlock_irq(&callback_lock); + raw_spin_unlock_irq(&callback_lock); update_cpumasks_hier(cs, &tmp); @ kernel/cgroup/cpuset.c:1734 @ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems) continue; rcu_read_unlock(); - spin_lock_irq(&callback_lock); + raw_spin_lock_irq(&callback_lock); cp->effective_mems = *new_mems; - spin_unlock_irq(&callback_lock); + raw_spin_unlock_irq(&callback_lock); WARN_ON(!is_in_v2_mode() && !nodes_equal(cp->mems_allowed, cp->effective_mems)); @ kernel/cgroup/cpuset.c:1804 @ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs, if (retval < 0) goto done; - spin_lock_irq(&callback_lock); + raw_spin_lock_irq(&callback_lock); cs->mems_allowed = trialcs->mems_allowed; - spin_unlock_irq(&callback_lock); + raw_spin_unlock_irq(&callback_lock); /* use trialcs->mems_allowed as a temp variable */ update_nodemasks_hier(cs, &trialcs->mems_allowed); @ kernel/cgroup/cpuset.c:1897 @ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs, spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs)) || (is_spread_page(cs) != is_spread_page(trialcs))); - spin_lock_irq(&callback_lock); + raw_spin_lock_irq(&callback_lock); cs->flags = trialcs->flags; - spin_unlock_irq(&callback_lock); + raw_spin_unlock_irq(&callback_lock); if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed) rebuild_sched_domains_locked(); @ kernel/cgroup/cpuset.c:2408 @ static int cpuset_common_seq_show(struct seq_file *sf, void *v) cpuset_filetype_t type = seq_cft(sf)->private; int ret = 0; - spin_lock_irq(&callback_lock); + raw_spin_lock_irq(&callback_lock); switch (type) { case FILE_CPULIST: @ kernel/cgroup/cpuset.c:2430 @ static int cpuset_common_seq_show(struct seq_file *sf, void *v) ret = -EINVAL; } - spin_unlock_irq(&callback_lock); + raw_spin_unlock_irq(&callback_lock); return ret; } @ kernel/cgroup/cpuset.c:2743 @ static int cpuset_css_online(struct cgroup_subsys_state *css) cpuset_inc(); - spin_lock_irq(&callback_lock); + raw_spin_lock_irq(&callback_lock); if (is_in_v2_mode()) { cpumask_copy(cs->effective_cpus, parent->effective_cpus); cs->effective_mems = parent->effective_mems; cs->use_parent_ecpus = true; parent->child_ecpus_count++; } - spin_unlock_irq(&callback_lock); + raw_spin_unlock_irq(&callback_lock); if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags)) goto out_unlock; @ kernel/cgroup/cpuset.c:2777 @ static int cpuset_css_online(struct cgroup_subsys_state *css) } rcu_read_unlock(); - spin_lock_irq(&callback_lock); + raw_spin_lock_irq(&callback_lock); cs->mems_allowed = parent->mems_allowed; cs->effective_mems = parent->mems_allowed; cpumask_copy(cs->cpus_allowed, parent->cpus_allowed); cpumask_copy(cs->effective_cpus, parent->cpus_allowed); - spin_unlock_irq(&callback_lock); + raw_spin_unlock_irq(&callback_lock); out_unlock: percpu_up_write(&cpuset_rwsem); put_online_cpus(); @ kernel/cgroup/cpuset.c:2838 @ static void cpuset_css_free(struct cgroup_subsys_state *css) static void cpuset_bind(struct cgroup_subsys_state *root_css) { percpu_down_write(&cpuset_rwsem); - spin_lock_irq(&callback_lock); + raw_spin_lock_irq(&callback_lock); if (is_in_v2_mode()) { cpumask_copy(top_cpuset.cpus_allowed, cpu_possible_mask); @ kernel/cgroup/cpuset.c:2849 @ static void cpuset_bind(struct cgroup_subsys_state *root_css) top_cpuset.mems_allowed = top_cpuset.effective_mems; } - spin_unlock_irq(&callback_lock); + raw_spin_unlock_irq(&callback_lock); percpu_up_write(&cpuset_rwsem); } @ kernel/cgroup/cpuset.c:2946 @ hotplug_update_tasks_legacy(struct cpuset *cs, { bool is_empty; - spin_lock_irq(&callback_lock); + raw_spin_lock_irq(&callback_lock); cpumask_copy(cs->cpus_allowed, new_cpus); cpumask_copy(cs->effective_cpus, new_cpus); cs->mems_allowed = *new_mems; cs->effective_mems = *new_mems; - spin_unlock_irq(&callback_lock); + raw_spin_unlock_irq(&callback_lock); /* * Don't call update_tasks_cpumask() if the cpuset becomes empty, @ kernel/cgroup/cpuset.c:2988 @ hotplug_update_tasks(struct cpuset *cs, if (nodes_empty(*new_mems)) *new_mems = parent_cs(cs)->effective_mems; - spin_lock_irq(&callback_lock); + raw_spin_lock_irq(&callback_lock); cpumask_copy(cs->effective_cpus, new_cpus); cs->effective_mems = *new_mems; - spin_unlock_irq(&callback_lock); + raw_spin_unlock_irq(&callback_lock); if (cpus_updated) update_tasks_cpumask(cs); @ kernel/cgroup/cpuset.c:3146 @ static void cpuset_hotplug_workfn(struct work_struct *work) /* synchronize cpus_allowed to cpu_active_mask */ if (cpus_updated) { - spin_lock_irq(&callback_lock); + raw_spin_lock_irq(&callback_lock); if (!on_dfl) cpumask_copy(top_cpuset.cpus_allowed, &new_cpus); /* @ kernel/cgroup/cpuset.c:3166 @ static void cpuset_hotplug_workfn(struct work_struct *work) } } cpumask_copy(top_cpuset.effective_cpus, &new_cpus); - spin_unlock_irq(&callback_lock); + raw_spin_unlock_irq(&callback_lock); /* we don't mess with cpumasks of tasks in top_cpuset */ } /* synchronize mems_allowed to N_MEMORY */ if (mems_updated) { - spin_lock_irq(&callback_lock); + raw_spin_lock_irq(&callback_lock); if (!on_dfl) top_cpuset.mems_allowed = new_mems; top_cpuset.effective_mems = new_mems; - spin_unlock_irq(&callback_lock); + raw_spin_unlock_irq(&callback_lock); update_tasks_nodemask(&top_cpuset); } @ kernel/cgroup/cpuset.c:3277 @ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask) { unsigned long flags; - spin_lock_irqsave(&callback_lock, flags); + raw_spin_lock_irqsave(&callback_lock, flags); rcu_read_lock(); guarantee_online_cpus(task_cs(tsk), pmask); rcu_read_unlock(); - spin_unlock_irqrestore(&callback_lock, flags); + raw_spin_unlock_irqrestore(&callback_lock, flags); } /** @ kernel/cgroup/cpuset.c:3342 @ nodemask_t cpuset_mems_allowed(struct task_struct *tsk) nodemask_t mask; unsigned long flags; - spin_lock_irqsave(&callback_lock, flags); + raw_spin_lock_irqsave(&callback_lock, flags); rcu_read_lock(); guarantee_online_mems(task_cs(tsk), &mask); rcu_read_unlock(); - spin_unlock_irqrestore(&callback_lock, flags); + raw_spin_unlock_irqrestore(&callback_lock, flags); return mask; } @ kernel/cgroup/cpuset.c:3438 @ bool __cpuset_node_allowed(int node, gfp_t gfp_mask) return true; /* Not hardwall and node outside mems_allowed: scan up cpusets */ - spin_lock_irqsave(&callback_lock, flags); + raw_spin_lock_irqsave(&callback_lock, flags); rcu_read_lock(); cs = nearest_hardwall_ancestor(task_cs(current)); allowed = node_isset(node, cs->mems_allowed); rcu_read_unlock(); - spin_unlock_irqrestore(&callback_lock, flags); + raw_spin_unlock_irqrestore(&callback_lock, flags); return allowed; } @ kernel/cgroup/rstat.c:153 @ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep) raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu); struct cgroup *pos = NULL; + unsigned long flags; - raw_spin_lock(cpu_lock); + raw_spin_lock_irqsave(cpu_lock, flags); while ((pos = cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) { struct cgroup_subsys_state *css; @ kernel/cgroup/rstat.c:167 @ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep) css->ss->css_rstat_flush(css, cpu); rcu_read_unlock(); } - raw_spin_unlock(cpu_lock); + raw_spin_unlock_irqrestore(cpu_lock, flags); /* if @may_sleep, play nice and yield if necessary */ if (may_sleep && (need_resched() || @ kernel/cpu.c:334 @ void lockdep_assert_cpus_held(void) static void lockdep_acquire_cpus_lock(void) { - rwsem_acquire(&cpu_hotplug_lock.rw_sem.dep_map, 0, 0, _THIS_IP_); + rwsem_acquire(&cpu_hotplug_lock.dep_map, 0, 0, _THIS_IP_); } static void lockdep_release_cpus_lock(void) { - rwsem_release(&cpu_hotplug_lock.rw_sem.dep_map, _THIS_IP_); + rwsem_release(&cpu_hotplug_lock.dep_map, _THIS_IP_); } /* @ kernel/cpu.c:851 @ static int take_cpu_down(void *_param) int err, cpu = smp_processor_id(); int ret; +#ifdef CONFIG_PREEMPT_RT + /* + * If any tasks disabled migration before we got here, + * go back and sleep again. + */ + if (cpu_nr_pinned(cpu)) + return -EAGAIN; +#endif + /* Ensure this CPU doesn't handle any more interrupts. */ err = __cpu_disable(); if (err < 0) @ kernel/cpu.c:889 @ static int take_cpu_down(void *_param) return 0; } +#ifdef CONFIG_PREEMPT_RT +struct task_struct *takedown_cpu_task; +#endif + static int takedown_cpu(unsigned int cpu) { struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); @ kernel/cpu.c:907 @ static int takedown_cpu(unsigned int cpu) */ irq_lock_sparse(); +#ifdef CONFIG_PREEMPT_RT + WARN_ON_ONCE(takedown_cpu_task); + takedown_cpu_task = current; + +again: + /* + * If a task pins this CPU after we pass this check, take_cpu_down + * will return -EAGAIN. + */ + for (;;) { + int nr_pinned; + + set_current_state(TASK_UNINTERRUPTIBLE); + nr_pinned = cpu_nr_pinned(cpu); + if (nr_pinned == 0) + break; + schedule(); + } + set_current_state(TASK_RUNNING); +#endif + /* * So now all preempt/rcu users must observe !cpu_active(). */ err = stop_machine_cpuslocked(take_cpu_down, NULL, cpumask_of(cpu)); +#ifdef CONFIG_PREEMPT_RT + if (err == -EAGAIN) + goto again; +#endif if (err) { +#ifdef CONFIG_PREEMPT_RT + takedown_cpu_task = NULL; +#endif /* CPU refused to die */ irq_unlock_sparse(); /* Unpark the hotplug thread so we can rollback there */ @ kernel/cpu.c:958 @ static int takedown_cpu(unsigned int cpu) wait_for_ap_thread(st, false); BUG_ON(st->state != CPUHP_AP_IDLE_DEAD); +#ifdef CONFIG_PREEMPT_RT + takedown_cpu_task = NULL; +#endif /* Interrupts are moved away from the dying cpu, reenable alloc/free */ irq_unlock_sparse(); @ kernel/events/core.c:9210 @ static void bpf_overflow_handler(struct perf_event *event, int ret = 0; ctx.regs = perf_arch_bpf_user_pt_regs(regs); - preempt_disable(); if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) goto out; rcu_read_lock(); @ kernel/events/core.c:9217 @ static void bpf_overflow_handler(struct perf_event *event, rcu_read_unlock(); out: __this_cpu_dec(bpf_prog_active); - preempt_enable(); if (!ret) return; @ kernel/exit.c:164 @ static void __exit_signal(struct task_struct *tsk) * Do this under ->siglock, we can race with another thread * doing sigqueue_free() if we have SIGQUEUE_PREALLOC signals. */ - flush_sigqueue(&tsk->pending); + flush_task_sigqueue(tsk); tsk->sighand = NULL; spin_unlock(&sighand->siglock); @ kernel/exit.c:261 @ void rcuwait_wake_up(struct rcuwait *w) wake_up_process(task); rcu_read_unlock(); } +EXPORT_SYMBOL_GPL(rcuwait_wake_up); /* * Determine if a process group is "orphaned", according to the POSIX @ kernel/fork.c:45 @ #include <linux/mmu_notifier.h> #include <linux/fs.h> #include <linux/mm.h> +#include <linux/kprobes.h> #include <linux/vmacache.h> #include <linux/nsproxy.h> #include <linux/capability.h> @ kernel/fork.c:296 @ static inline void free_thread_stack(struct task_struct *tsk) return; } - vfree_atomic(tsk->stack); + vfree(tsk->stack); return; } #endif @ kernel/fork.c:703 @ void __mmdrop(struct mm_struct *mm) } EXPORT_SYMBOL_GPL(__mmdrop); +#ifdef CONFIG_PREEMPT_RT +/* + * RCU callback for delayed mm drop. Not strictly rcu, but we don't + * want another facility to make this work. + */ +void __mmdrop_delayed(struct rcu_head *rhp) +{ + struct mm_struct *mm = container_of(rhp, struct mm_struct, delayed_drop); + + __mmdrop(mm); +} +#endif + static void mmdrop_async_fn(struct work_struct *work) { struct mm_struct *mm; @ kernel/fork.c:757 @ void __put_task_struct(struct task_struct *tsk) WARN_ON(refcount_read(&tsk->usage)); WARN_ON(tsk == current); + /* + * Remove function-return probe instances associated with this + * task and put them back on the free list. + */ + kprobe_flush_task(tsk); + + /* Task is done with its stack. */ + put_task_stack(tsk); + cgroup_free(tsk); task_numa_free(tsk, true); security_task_free(tsk); @ kernel/fork.c:956 @ static struct task_struct *dup_task_struct(struct task_struct *orig, int node) tsk->splice_pipe = NULL; tsk->task_frag.page = NULL; tsk->wake_q.next = NULL; + tsk->wake_q_sleeper.next = NULL; account_kernel_stack(tsk, 1); @ kernel/fork.c:2000 @ static __latent_entropy struct task_struct *copy_process( spin_lock_init(&p->alloc_lock); init_sigpending(&p->pending); + p->sigqueue_cache = NULL; p->utime = p->stime = p->gtime = 0; #ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME @ kernel/fork.c:2043 @ static __latent_entropy struct task_struct *copy_process( #ifdef CONFIG_CPUSETS p->cpuset_mem_spread_rotor = NUMA_NO_NODE; p->cpuset_slab_spread_rotor = NUMA_NO_NODE; - seqcount_init(&p->mems_allowed_seq); + seqcount_spinlock_init(&p->mems_allowed_seq, &p->alloc_lock); #endif #ifdef CONFIG_TRACE_IRQFLAGS p->irq_events = 0; @ kernel/futex.c:965 @ static void exit_pi_state_list(struct task_struct *curr) if (head->next != next) { /* retain curr->pi_lock for the loop invariant */ raw_spin_unlock(&pi_state->pi_mutex.wait_lock); + raw_spin_unlock_irq(&curr->pi_lock); spin_unlock(&hb->lock); + raw_spin_lock_irq(&curr->pi_lock); put_pi_state(pi_state); continue; } @ kernel/futex.c:1577 @ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_ struct task_struct *new_owner; bool postunlock = false; DEFINE_WAKE_Q(wake_q); + DEFINE_WAKE_Q(wake_sleeper_q); int ret = 0; new_owner = rt_mutex_next_owner(&pi_state->pi_mutex); @ kernel/futex.c:1637 @ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_ pi_state->owner = new_owner; raw_spin_unlock(&new_owner->pi_lock); - postunlock = __rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q); - + postunlock = __rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q, + &wake_sleeper_q); out_unlock: raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); if (postunlock) - rt_mutex_postunlock(&wake_q); + rt_mutex_postunlock(&wake_q, &wake_sleeper_q); return ret; } @ kernel/futex.c:2267 @ static int futex_requeue(u32 __user *uaddr1, unsigned int flags, requeue_pi_wake_futex(this, &key2, hb2); drop_count++; continue; + } else if (ret == -EAGAIN) { + /* + * Waiter was woken by timeout or + * signal and has set pi_blocked_on to + * PI_WAKEUP_INPROGRESS before we + * tried to enqueue it on the rtmutex. + */ + this->pi_state = NULL; + put_pi_state(pi_state); + continue; } else if (ret) { /* * rt_mutex_start_proxy_lock() detected a @ kernel/futex.c:2985 @ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, goto no_block; } - rt_mutex_init_waiter(&rt_waiter); + rt_mutex_init_waiter(&rt_waiter, false); /* * On PREEMPT_RT_FULL, when hb->lock becomes an rt_mutex, we must not @ kernel/futex.c:3001 @ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, * before __rt_mutex_start_proxy_lock() is done. */ raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock); + /* + * the migrate_disable() here disables migration in the in_atomic() fast + * path which is enabled again in the following spin_unlock(). We have + * one migrate_disable() pending in the slow-path which is reversed + * after the raw_spin_unlock_irq() where we leave the atomic context. + */ + migrate_disable(); + spin_unlock(q.lock_ptr); /* * __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter @ kernel/futex.c:3017 @ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, */ ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current); raw_spin_unlock_irq(&q.pi_state->pi_mutex.wait_lock); + migrate_enable(); if (ret) { if (ret == 1) @ kernel/futex.c:3166 @ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags) * rt_waiter. Also see the WARN in wake_futex_pi(). */ raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock); + /* + * Magic trickery for now to make the RT migrate disable + * logic happy. The following spin_unlock() happens with + * interrupts disabled so the internal migrate_enable() + * won't undo the migrate_disable() which was issued when + * locking hb->lock. + */ + migrate_disable(); spin_unlock(&hb->lock); /* drops pi_state->pi_mutex.wait_lock */ ret = wake_futex_pi(uaddr, uval, pi_state); + migrate_enable(); put_pi_state(pi_state); @ kernel/futex.c:3350 @ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, struct hrtimer_sleeper timeout, *to; struct futex_pi_state *pi_state = NULL; struct rt_mutex_waiter rt_waiter; - struct futex_hash_bucket *hb; + struct futex_hash_bucket *hb, *hb2; union futex_key key2 = FUTEX_KEY_INIT; struct futex_q q = futex_q_init; int res, ret; @ kernel/futex.c:3371 @ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, * The waiter is allocated on our stack, manipulated by the requeue * code while we sleep on uaddr. */ - rt_mutex_init_waiter(&rt_waiter); + rt_mutex_init_waiter(&rt_waiter, false); ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, FUTEX_WRITE); if (unlikely(ret != 0)) @ kernel/futex.c:3402 @ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, /* Queue the futex_q, drop the hb lock, wait for wakeup. */ futex_wait_queue_me(hb, &q, to); - spin_lock(&hb->lock); - ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to); - spin_unlock(&hb->lock); - if (ret) - goto out_put_keys; + /* + * On RT we must avoid races with requeue and trying to block + * on two mutexes (hb->lock and uaddr2's rtmutex) by + * serializing access to pi_blocked_on with pi_lock. + */ + raw_spin_lock_irq(¤t->pi_lock); + if (current->pi_blocked_on) { + /* + * We have been requeued or are in the process of + * being requeued. + */ + raw_spin_unlock_irq(¤t->pi_lock); + } else { + /* + * Setting pi_blocked_on to PI_WAKEUP_INPROGRESS + * prevents a concurrent requeue from moving us to the + * uaddr2 rtmutex. After that we can safely acquire + * (and possibly block on) hb->lock. + */ + current->pi_blocked_on = PI_WAKEUP_INPROGRESS; + raw_spin_unlock_irq(¤t->pi_lock); + + spin_lock(&hb->lock); + + /* + * Clean up pi_blocked_on. We might leak it otherwise + * when we succeeded with the hb->lock in the fast + * path. + */ + raw_spin_lock_irq(¤t->pi_lock); + current->pi_blocked_on = NULL; + raw_spin_unlock_irq(¤t->pi_lock); + + ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to); + spin_unlock(&hb->lock); + if (ret) + goto out_put_keys; + } /* - * In order for us to be here, we know our q.key == key2, and since - * we took the hb->lock above, we also know that futex_requeue() has - * completed and we no longer have to concern ourselves with a wakeup - * race with the atomic proxy lock acquisition by the requeue code. The - * futex_requeue dropped our key1 reference and incremented our key2 - * reference count. + * In order to be here, we have either been requeued, are in + * the process of being requeued, or requeue successfully + * acquired uaddr2 on our behalf. If pi_blocked_on was + * non-null above, we may be racing with a requeue. Do not + * rely on q->lock_ptr to be hb2->lock until after blocking on + * hb->lock or hb2->lock. The futex_requeue dropped our key1 + * reference and incremented our key2 reference count. */ + hb2 = hash_futex(&key2); /* Check if the requeue code acquired the second futex for us. */ if (!q.rt_waiter) { @ kernel/futex.c:3459 @ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, * did a lock-steal - fix up the PI-state in that case. */ if (q.pi_state && (q.pi_state->owner != current)) { - spin_lock(q.lock_ptr); + spin_lock(&hb2->lock); + BUG_ON(&hb2->lock != q.lock_ptr); ret = fixup_pi_state_owner(uaddr2, &q, current); if (ret && rt_mutex_owner(&q.pi_state->pi_mutex) == current) { pi_state = q.pi_state; @ kernel/futex.c:3471 @ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, * the requeue_pi() code acquired for us. */ put_pi_state(q.pi_state); - spin_unlock(q.lock_ptr); + spin_unlock(&hb2->lock); } } else { struct rt_mutex *pi_mutex; @ kernel/futex.c:3485 @ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, pi_mutex = &q.pi_state->pi_mutex; ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter); - spin_lock(q.lock_ptr); + spin_lock(&hb2->lock); + BUG_ON(&hb2->lock != q.lock_ptr); if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter)) ret = 0; @ kernel/irq/handle.c:188 @ irqreturn_t handle_irq_event_percpu(struct irq_desc *desc) { irqreturn_t retval; unsigned int flags = 0; + struct pt_regs *regs = get_irq_regs(); + u64 ip = regs ? instruction_pointer(regs) : 0; retval = __handle_irq_event_percpu(desc, &flags); - add_interrupt_randomness(desc->irq_data.irq, flags); +#ifdef CONFIG_PREEMPT_RT + desc->random_ip = ip; +#else + add_interrupt_randomness(desc->irq_data.irq, flags, ip); +#endif if (!noirqdebug) note_interrupt(desc, retval); @ kernel/irq/manage.c:1141 @ static int irq_thread(void *data) if (action_ret == IRQ_WAKE_THREAD) irq_wake_secondary(desc, action); +#ifdef CONFIG_PREEMPT_RT + migrate_disable(); + add_interrupt_randomness(action->irq, 0, + desc->random_ip ^ (unsigned long) action); + migrate_enable(); +#endif wake_threads_waitq(desc); } @ kernel/irq/manage.c:2729 @ EXPORT_SYMBOL_GPL(irq_get_irqchip_state); * This call sets the internal irqchip state of an interrupt, * depending on the value of @which. * - * This function should be called with preemption disabled if the + * This function should be called with migration disabled if the * interrupt controller has per-cpu registers. */ int irq_set_irqchip_state(unsigned int irq, enum irqchip_irq_state which, @ kernel/irq/spurious.c:446 @ MODULE_PARM_DESC(noirqdebug, "Disable irq lockup detection when true"); static int __init irqfixup_setup(char *str) { +#ifdef CONFIG_PREEMPT_RT + pr_warn("irqfixup boot option not supported w/ CONFIG_PREEMPT_RT\n"); + return 1; +#endif irqfixup = 1; printk(KERN_WARNING "Misrouted IRQ fixup support enabled.\n"); printk(KERN_WARNING "This may impact system performance.\n"); @ kernel/irq/spurious.c:462 @ module_param(irqfixup, int, 0644); static int __init irqpoll_setup(char *str) { +#ifdef CONFIG_PREEMPT_RT + pr_warn("irqpoll boot option not supported w/ CONFIG_PREEMPT_RT\n"); + return 1; +#endif irqfixup = 2; printk(KERN_WARNING "Misrouted IRQ fixup and polling support " "enabled\n"); @ kernel/irq_work.c:21 @ #include <linux/cpu.h> #include <linux/notifier.h> #include <linux/smp.h> +#include <linux/interrupt.h> #include <asm/processor.h> @ kernel/irq_work.c:56 @ void __weak arch_irq_work_raise(void) /* Enqueue on current CPU, work must already be claimed and preempt disabled */ static void __irq_work_queue_local(struct irq_work *work) { + struct llist_head *list; + bool lazy_work, realtime = IS_ENABLED(CONFIG_PREEMPT_RT); + + lazy_work = atomic_read(&work->flags) & IRQ_WORK_LAZY; + /* If the work is "lazy", handle it from next tick if any */ - if (atomic_read(&work->flags) & IRQ_WORK_LAZY) { - if (llist_add(&work->llnode, this_cpu_ptr(&lazy_list)) && - tick_nohz_tick_stopped()) - arch_irq_work_raise(); - } else { - if (llist_add(&work->llnode, this_cpu_ptr(&raised_list))) + if (lazy_work || (realtime && !(atomic_read(&work->flags) & IRQ_WORK_HARD_IRQ))) + list = this_cpu_ptr(&lazy_list); + else + list = this_cpu_ptr(&raised_list); + + if (llist_add(&work->llnode, list)) { + if (!lazy_work || tick_nohz_tick_stopped()) arch_irq_work_raise(); } } @ kernel/irq_work.c:110 @ bool irq_work_queue_on(struct irq_work *work, int cpu) preempt_disable(); if (cpu != smp_processor_id()) { + struct llist_head *list; + /* Arch remote IPI send/receive backend aren't NMI safe */ WARN_ON_ONCE(in_nmi()); - if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) + if (IS_ENABLED(CONFIG_PREEMPT_RT) && !(atomic_read(&work->flags) & IRQ_WORK_HARD_IRQ)) + list = &per_cpu(lazy_list, cpu); + else + list = &per_cpu(raised_list, cpu); + + if (llist_add(&work->llnode, list)) arch_send_call_function_single_ipi(cpu); } else { __irq_work_queue_local(work); @ kernel/irq_work.c:138 @ bool irq_work_needs_cpu(void) raised = this_cpu_ptr(&raised_list); lazy = this_cpu_ptr(&lazy_list); - if (llist_empty(raised) || arch_irq_work_has_interrupt()) - if (llist_empty(lazy)) - return false; + if (llist_empty(raised) && llist_empty(lazy)) + return false; /* All work should have been flushed before going offline */ WARN_ON_ONCE(cpu_is_offline(smp_processor_id())); @ kernel/irq_work.c:152 @ static void irq_work_run_list(struct llist_head *list) struct irq_work *work, *tmp; struct llist_node *llnode; +#ifndef CONFIG_PREEMPT_RT + /* + * nort: On RT IRQ-work may run in SOFTIRQ context. + */ BUG_ON(!irqs_disabled()); - +#endif if (llist_empty(list)) return; @ kernel/irq_work.c:190 @ static void irq_work_run_list(struct llist_head *list) void irq_work_run(void) { irq_work_run_list(this_cpu_ptr(&raised_list)); - irq_work_run_list(this_cpu_ptr(&lazy_list)); + if (IS_ENABLED(CONFIG_PREEMPT_RT)) { + /* + * NOTE: we raise softirq via IPI for safety, + * and execute in irq_work_tick() to move the + * overhead from hard to soft irq context. + */ + if (!llist_empty(this_cpu_ptr(&lazy_list))) + raise_softirq(TIMER_SOFTIRQ); + } else + irq_work_run_list(this_cpu_ptr(&lazy_list)); } EXPORT_SYMBOL_GPL(irq_work_run); @ kernel/irq_work.c:209 @ void irq_work_tick(void) if (!llist_empty(raised) && !arch_irq_work_has_interrupt()) irq_work_run_list(raised); + + if (!IS_ENABLED(CONFIG_PREEMPT_RT)) + irq_work_run_list(this_cpu_ptr(&lazy_list)); +} + +#if defined(CONFIG_IRQ_WORK) && defined(CONFIG_PREEMPT_RT) +void irq_work_tick_soft(void) +{ irq_work_run_list(this_cpu_ptr(&lazy_list)); } +#endif /* * Synchronize against the irq_work @entry, ensures the entry is not @ kernel/kexec_core.c:981 @ void crash_kexec(struct pt_regs *regs) old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu); if (old_cpu == PANIC_CPU_INVALID) { /* This is the 1st CPU which comes here, so go ahead. */ - printk_safe_flush_on_panic(); __crash_kexec(regs); /* @ kernel/ksysfs.c:141 @ KERNEL_ATTR_RO(vmcoreinfo); #endif /* CONFIG_CRASH_CORE */ +#if defined(CONFIG_PREEMPT_RT) +static ssize_t realtime_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sprintf(buf, "%d\n", 1); +} +KERNEL_ATTR_RO(realtime); +#endif + /* whether file capabilities are enabled */ static ssize_t fscaps_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) @ kernel/ksysfs.c:240 @ static struct attribute * kernel_attrs[] = { #ifndef CONFIG_TINY_RCU &rcu_expedited_attr.attr, &rcu_normal_attr.attr, +#endif +#ifdef CONFIG_PREEMPT_RT + &realtime_attr.attr, #endif NULL }; @ kernel/locking/Makefile:6 @ # and is generally not a function of system call inputs. KCOV_INSTRUMENT := n -obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o +obj-y += semaphore.o rwsem.o percpu-rwsem.o ifdef CONFIG_FUNCTION_TRACER CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE) @ kernel/locking/Makefile:15 @ CFLAGS_REMOVE_mutex-debug.o = $(CC_FLAGS_FTRACE) CFLAGS_REMOVE_rtmutex-debug.o = $(CC_FLAGS_FTRACE) endif -obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o obj-$(CONFIG_LOCKDEP) += lockdep.o ifeq ($(CONFIG_PROC_FS),y) obj-$(CONFIG_LOCKDEP) += lockdep_proc.o endif obj-$(CONFIG_SMP) += spinlock.o -obj-$(CONFIG_LOCK_SPIN_ON_OWNER) += osq_lock.o obj-$(CONFIG_PROVE_LOCKING) += spinlock.o obj-$(CONFIG_QUEUED_SPINLOCKS) += qspinlock.o obj-$(CONFIG_RT_MUTEXES) += rtmutex.o obj-$(CONFIG_DEBUG_RT_MUTEXES) += rtmutex-debug.o obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock.o obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock_debug.o +ifneq ($(CONFIG_PREEMPT_RT),y) +obj-y += mutex.o +obj-$(CONFIG_LOCK_SPIN_ON_OWNER) += osq_lock.o +obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o +endif +obj-$(CONFIG_PREEMPT_RT) += mutex-rt.o rwsem-rt.o rwlock-rt.o obj-$(CONFIG_QUEUED_RWLOCKS) += qrwlock.o obj-$(CONFIG_LOCK_TORTURE_TEST) += locktorture.o obj-$(CONFIG_WW_MUTEX_SELFTEST) += test-ww_mutex.o @ kernel/locking/lockdep.c:4414 @ static void check_flags(unsigned long flags) } } +#ifndef CONFIG_PREEMPT_RT /* * We dont accurately track softirq state in e.g. * hardirq contexts (such as on 4KSTACKS), so only @ kernel/locking/lockdep.c:4429 @ static void check_flags(unsigned long flags) DEBUG_LOCKS_WARN_ON(!current->softirqs_enabled); } } +#endif if (!debug_locks) print_irqtrace_events(current); @ kernel/locking/mutex-rt.c:4 @ +/* + * kernel/rt.c + * + * Real-Time Preemption Support + * + * started by Ingo Molnar: + * + * Copyright (C) 2004-2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com> + * Copyright (C) 2006, Timesys Corp., Thomas Gleixner <tglx@timesys.com> + * + * historic credit for proving that Linux spinlocks can be implemented via + * RT-aware mutexes goes to many people: The Pmutex project (Dirk Grambow + * and others) who prototyped it on 2.4 and did lots of comparative + * research and analysis; TimeSys, for proving that you can implement a + * fully preemptible kernel via the use of IRQ threading and mutexes; + * Bill Huey for persuasively arguing on lkml that the mutex model is the + * right one; and to MontaVista, who ported pmutexes to 2.6. + * + * This code is a from-scratch implementation and is not based on pmutexes, + * but the idea of converting spinlocks to mutexes is used here too. + * + * lock debugging, locking tree, deadlock detection: + * + * Copyright (C) 2004, LynuxWorks, Inc., Igor Manyilov, Bill Huey + * Released under the General Public License (GPL). + * + * Includes portions of the generic R/W semaphore implementation from: + * + * Copyright (c) 2001 David Howells (dhowells@redhat.com). + * - Derived partially from idea by Andrea Arcangeli <andrea@suse.de> + * - Derived also from comments by Linus + * + * Pending ownership of locks and ownership stealing: + * + * Copyright (C) 2005, Kihon Technologies Inc., Steven Rostedt + * + * (also by Steven Rostedt) + * - Converted single pi_lock to individual task locks. + * + * By Esben Nielsen: + * Doing priority inheritance with help of the scheduler. + * + * Copyright (C) 2006, Timesys Corp., Thomas Gleixner <tglx@timesys.com> + * - major rework based on Esben Nielsens initial patch + * - replaced thread_info references by task_struct refs + * - removed task->pending_owner dependency + * - BKL drop/reacquire for semaphore style locks to avoid deadlocks + * in the scheduler return path as discussed with Steven Rostedt + * + * Copyright (C) 2006, Kihon Technologies Inc. + * Steven Rostedt <rostedt@goodmis.org> + * - debugged and patched Thomas Gleixner's rework. + * - added back the cmpxchg to the rework. + * - turned atomic require back on for SMP. + */ + +#include <linux/spinlock.h> +#include <linux/rtmutex.h> +#include <linux/sched.h> +#include <linux/delay.h> +#include <linux/module.h> +#include <linux/kallsyms.h> +#include <linux/syscalls.h> +#include <linux/interrupt.h> +#include <linux/plist.h> +#include <linux/fs.h> +#include <linux/futex.h> +#include <linux/hrtimer.h> + +#include "rtmutex_common.h" + +/* + * struct mutex functions + */ +void __mutex_do_init(struct mutex *mutex, const char *name, + struct lock_class_key *key) +{ +#ifdef CONFIG_DEBUG_LOCK_ALLOC + /* + * Make sure we are not reinitializing a held lock: + */ + debug_check_no_locks_freed((void *)mutex, sizeof(*mutex)); + lockdep_init_map(&mutex->dep_map, name, key, 0); +#endif + mutex->lock.save_state = 0; +} +EXPORT_SYMBOL(__mutex_do_init); + +void __lockfunc _mutex_lock(struct mutex *lock) +{ + mutex_acquire(&lock->dep_map, 0, 0, _RET_IP_); + __rt_mutex_lock_state(&lock->lock, TASK_UNINTERRUPTIBLE); +} +EXPORT_SYMBOL(_mutex_lock); + +void __lockfunc _mutex_lock_io(struct mutex *lock) +{ + int token; + + token = io_schedule_prepare(); + _mutex_lock(lock); + io_schedule_finish(token); +} +EXPORT_SYMBOL_GPL(_mutex_lock_io); + +int __lockfunc _mutex_lock_interruptible(struct mutex *lock) +{ + int ret; + + mutex_acquire(&lock->dep_map, 0, 0, _RET_IP_); + ret = __rt_mutex_lock_state(&lock->lock, TASK_INTERRUPTIBLE); + if (ret) + mutex_release(&lock->dep_map, _RET_IP_); + return ret; +} +EXPORT_SYMBOL(_mutex_lock_interruptible); + +int __lockfunc _mutex_lock_killable(struct mutex *lock) +{ + int ret; + + mutex_acquire(&lock->dep_map, 0, 0, _RET_IP_); + ret = __rt_mutex_lock_state(&lock->lock, TASK_KILLABLE); + if (ret) + mutex_release(&lock->dep_map, _RET_IP_); + return ret; +} +EXPORT_SYMBOL(_mutex_lock_killable); + +#ifdef CONFIG_DEBUG_LOCK_ALLOC +void __lockfunc _mutex_lock_nested(struct mutex *lock, int subclass) +{ + mutex_acquire_nest(&lock->dep_map, subclass, 0, NULL, _RET_IP_); + __rt_mutex_lock_state(&lock->lock, TASK_UNINTERRUPTIBLE); +} +EXPORT_SYMBOL(_mutex_lock_nested); + +void __lockfunc _mutex_lock_io_nested(struct mutex *lock, int subclass) +{ + int token; + + token = io_schedule_prepare(); + + mutex_acquire_nest(&lock->dep_map, subclass, 0, NULL, _RET_IP_); + __rt_mutex_lock_state(&lock->lock, TASK_UNINTERRUPTIBLE); + + io_schedule_finish(token); +} +EXPORT_SYMBOL_GPL(_mutex_lock_io_nested); + +void __lockfunc _mutex_lock_nest_lock(struct mutex *lock, struct lockdep_map *nest) +{ + mutex_acquire_nest(&lock->dep_map, 0, 0, nest, _RET_IP_); + __rt_mutex_lock_state(&lock->lock, TASK_UNINTERRUPTIBLE); +} +EXPORT_SYMBOL(_mutex_lock_nest_lock); + +int __lockfunc _mutex_lock_interruptible_nested(struct mutex *lock, int subclass) +{ + int ret; + + mutex_acquire_nest(&lock->dep_map, subclass, 0, NULL, _RET_IP_); + ret = __rt_mutex_lock_state(&lock->lock, TASK_INTERRUPTIBLE); + if (ret) + mutex_release(&lock->dep_map, _RET_IP_); + return ret; +} +EXPORT_SYMBOL(_mutex_lock_interruptible_nested); + +int __lockfunc _mutex_lock_killable_nested(struct mutex *lock, int subclass) +{ + int ret; + + mutex_acquire(&lock->dep_map, subclass, 0, _RET_IP_); + ret = __rt_mutex_lock_state(&lock->lock, TASK_KILLABLE); + if (ret) + mutex_release(&lock->dep_map, _RET_IP_); + return ret; +} +EXPORT_SYMBOL(_mutex_lock_killable_nested); +#endif + +int __lockfunc _mutex_trylock(struct mutex *lock) +{ + int ret = __rt_mutex_trylock(&lock->lock); + + if (ret) + mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_); + + return ret; +} +EXPORT_SYMBOL(_mutex_trylock); + +void __lockfunc _mutex_unlock(struct mutex *lock) +{ + mutex_release(&lock->dep_map, _RET_IP_); + __rt_mutex_unlock(&lock->lock); +} +EXPORT_SYMBOL(_mutex_unlock); + +/** + * atomic_dec_and_mutex_lock - return holding mutex if we dec to 0 + * @cnt: the atomic which we are to dec + * @lock: the mutex to return holding if we dec to 0 + * + * return true and hold lock if we dec to 0, return false otherwise + */ +int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock) +{ + /* dec if we can't possibly hit 0 */ + if (atomic_add_unless(cnt, -1, 1)) + return 0; + /* we might hit 0, so take the lock */ + mutex_lock(lock); + if (!atomic_dec_and_test(cnt)) { + /* when we actually did the dec, we didn't hit 0 */ + mutex_unlock(lock); + return 0; + } + /* we hit 0, and we hold the lock */ + return 1; +} +EXPORT_SYMBOL(atomic_dec_and_mutex_lock); @ kernel/locking/percpu-rwsem.c:4 @ // SPDX-License-Identifier: GPL-2.0-only #include <linux/atomic.h> -#include <linux/rwsem.h> #include <linux/percpu.h> +#include <linux/wait.h> #include <linux/lockdep.h> #include <linux/percpu-rwsem.h> #include <linux/rcupdate.h> #include <linux/sched.h> +#include <linux/sched/task.h> #include <linux/errno.h> -#include "rwsem.h" - int __percpu_init_rwsem(struct percpu_rw_semaphore *sem, - const char *name, struct lock_class_key *rwsem_key) + const char *name, struct lock_class_key *key) { sem->read_count = alloc_percpu(int); if (unlikely(!sem->read_count)) return -ENOMEM; - /* ->rw_sem represents the whole percpu_rw_semaphore for lockdep */ rcu_sync_init(&sem->rss); - __init_rwsem(&sem->rw_sem, name, rwsem_key); rcuwait_init(&sem->writer); - sem->readers_block = 0; + init_waitqueue_head(&sem->waiters); + atomic_set(&sem->block, 0); +#ifdef CONFIG_DEBUG_LOCK_ALLOC + debug_check_no_locks_freed((void *)sem, sizeof(*sem)); + lockdep_init_map(&sem->dep_map, name, key, 0); +#endif return 0; } EXPORT_SYMBOL_GPL(__percpu_init_rwsem); @ kernel/locking/percpu-rwsem.c:46 @ void percpu_free_rwsem(struct percpu_rw_semaphore *sem) } EXPORT_SYMBOL_GPL(percpu_free_rwsem); -int __percpu_down_read(struct percpu_rw_semaphore *sem, int try) +static bool __percpu_down_read_trylock(struct percpu_rw_semaphore *sem) { + __this_cpu_inc(*sem->read_count); + /* * Due to having preemption disabled the decrement happens on * the same CPU as the increment, avoiding the * increment-on-one-CPU-and-decrement-on-another problem. * - * If the reader misses the writer's assignment of readers_block, then - * the writer is guaranteed to see the reader's increment. + * If the reader misses the writer's assignment of sem->block, then the + * writer is guaranteed to see the reader's increment. * * Conversely, any readers that increment their sem->read_count after - * the writer looks are guaranteed to see the readers_block value, - * which in turn means that they are guaranteed to immediately - * decrement their sem->read_count, so that it doesn't matter that the - * writer missed them. + * the writer looks are guaranteed to see the sem->block value, which + * in turn means that they are guaranteed to immediately decrement + * their sem->read_count, so that it doesn't matter that the writer + * missed them. */ smp_mb(); /* A matches D */ /* - * If !readers_block the critical section starts here, matched by the + * If !sem->block the critical section starts here, matched by the * release in percpu_up_write(). */ - if (likely(!smp_load_acquire(&sem->readers_block))) - return 1; + if (likely(!atomic_read_acquire(&sem->block))) + return true; - /* - * Per the above comment; we still have preemption disabled and - * will thus decrement on the same CPU as we incremented. - */ - __percpu_up_read(sem); - - if (try) - return 0; - - /* - * We either call schedule() in the wait, or we'll fall through - * and reschedule on the preempt_enable() in percpu_down_read(). - */ - preempt_enable_no_resched(); - - /* - * Avoid lockdep for the down/up_read() we already have them. - */ - __down_read(&sem->rw_sem); - this_cpu_inc(*sem->read_count); - __up_read(&sem->rw_sem); - - preempt_disable(); - return 1; -} -EXPORT_SYMBOL_GPL(__percpu_down_read); - -void __percpu_up_read(struct percpu_rw_semaphore *sem) -{ - smp_mb(); /* B matches C */ - /* - * In other words, if they see our decrement (presumably to aggregate - * zero, as that is the only time it matters) they will also see our - * critical section. - */ __this_cpu_dec(*sem->read_count); - /* Prod writer to recheck readers_active */ + /* Prod writer to re-evaluate readers_active_check() */ rcuwait_wake_up(&sem->writer); + + return false; } -EXPORT_SYMBOL_GPL(__percpu_up_read); + +static inline bool __percpu_down_write_trylock(struct percpu_rw_semaphore *sem) +{ + if (atomic_read(&sem->block)) + return false; + + return atomic_xchg(&sem->block, 1) == 0; +} + +static bool __percpu_rwsem_trylock(struct percpu_rw_semaphore *sem, bool reader) +{ + if (reader) { + bool ret; + + preempt_disable(); + ret = __percpu_down_read_trylock(sem); + preempt_enable(); + + return ret; + } + return __percpu_down_write_trylock(sem); +} + +/* + * The return value of wait_queue_entry::func means: + * + * <0 - error, wakeup is terminated and the error is returned + * 0 - no wakeup, a next waiter is tried + * >0 - woken, if EXCLUSIVE, counted towards @nr_exclusive. + * + * We use EXCLUSIVE for both readers and writers to preserve FIFO order, + * and play games with the return value to allow waking multiple readers. + * + * Specifically, we wake readers until we've woken a single writer, or until a + * trylock fails. + */ +static int percpu_rwsem_wake_function(struct wait_queue_entry *wq_entry, + unsigned int mode, int wake_flags, + void *key) +{ + struct task_struct *p = get_task_struct(wq_entry->private); + bool reader = wq_entry->flags & WQ_FLAG_CUSTOM; + struct percpu_rw_semaphore *sem = key; + + /* concurrent against percpu_down_write(), can get stolen */ + if (!__percpu_rwsem_trylock(sem, reader)) + return 1; + + list_del_init(&wq_entry->entry); + smp_store_release(&wq_entry->private, NULL); + + wake_up_process(p); + put_task_struct(p); + + return !reader; /* wake (readers until) 1 writer */ +} + +static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader) +{ + DEFINE_WAIT_FUNC(wq_entry, percpu_rwsem_wake_function); + bool wait; + + spin_lock_irq(&sem->waiters.lock); + /* + * Serialize against the wakeup in percpu_up_write(), if we fail + * the trylock, the wakeup must see us on the list. + */ + wait = !__percpu_rwsem_trylock(sem, reader); + if (wait) { + wq_entry.flags |= WQ_FLAG_EXCLUSIVE | reader * WQ_FLAG_CUSTOM; + __add_wait_queue_entry_tail(&sem->waiters, &wq_entry); + } + spin_unlock_irq(&sem->waiters.lock); + + while (wait) { + set_current_state(TASK_UNINTERRUPTIBLE); + if (!smp_load_acquire(&wq_entry.private)) + break; + schedule(); + } + __set_current_state(TASK_RUNNING); +} + +bool __percpu_down_read(struct percpu_rw_semaphore *sem, bool try) +{ + if (__percpu_down_read_trylock(sem)) + return true; + + if (try) + return false; + + preempt_enable(); + percpu_rwsem_wait(sem, /* .reader = */ true); + preempt_disable(); + + return true; +} +EXPORT_SYMBOL_GPL(__percpu_down_read); #define per_cpu_sum(var) \ ({ \ @ kernel/locking/percpu-rwsem.c:195 @ EXPORT_SYMBOL_GPL(__percpu_up_read); * zero. If this sum is zero, then it is stable due to the fact that if any * newly arriving readers increment a given counter, they will immediately * decrement that same counter. + * + * Assumes sem->block is set. */ static bool readers_active_check(struct percpu_rw_semaphore *sem) { @ kernel/locking/percpu-rwsem.c:215 @ static bool readers_active_check(struct percpu_rw_semaphore *sem) void percpu_down_write(struct percpu_rw_semaphore *sem) { + might_sleep(); + rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_); + /* Notify readers to take the slow path. */ rcu_sync_enter(&sem->rss); - down_write(&sem->rw_sem); + /* + * Try set sem->block; this provides writer-writer exclusion. + * Having sem->block set makes new readers block. + */ + if (!__percpu_down_write_trylock(sem)) + percpu_rwsem_wait(sem, /* .reader = */ false); + + /* smp_mb() implied by __percpu_down_write_trylock() on success -- D matches A */ /* - * Notify new readers to block; up until now, and thus throughout the - * longish rcu_sync_enter() above, new readers could still come in. - */ - WRITE_ONCE(sem->readers_block, 1); - - smp_mb(); /* D matches A */ - - /* - * If they don't see our writer of readers_block, then we are - * guaranteed to see their sem->read_count increment, and therefore - * will wait for them. + * If they don't see our store of sem->block, then we are guaranteed to + * see their sem->read_count increment, and therefore will wait for + * them. */ - /* Wait for all now active readers to complete. */ + /* Wait for all active readers to complete. */ rcuwait_wait_event(&sem->writer, readers_active_check(sem)); } EXPORT_SYMBOL_GPL(percpu_down_write); void percpu_up_write(struct percpu_rw_semaphore *sem) { + rwsem_release(&sem->dep_map, _RET_IP_); + /* * Signal the writer is done, no fast path yet. * @ kernel/locking/percpu-rwsem.c:255 @ void percpu_up_write(struct percpu_rw_semaphore *sem) * Therefore we force it through the slow path which guarantees an * acquire and thereby guarantees the critical section's consistency. */ - smp_store_release(&sem->readers_block, 0); + atomic_set_release(&sem->block, 0); /* - * Release the write lock, this will allow readers back in the game. + * Prod any pending reader/writer to make progress. */ - up_write(&sem->rw_sem); + __wake_up(&sem->waiters, TASK_NORMAL, 1, sem); /* * Once this completes (at least one RCU-sched grace period hence) the @ kernel/locking/rtmutex.c:11 @ * Copyright (C) 2005-2006 Timesys Corp., Thomas Gleixner <tglx@timesys.com> * Copyright (C) 2005 Kihon Technologies Inc., Steven Rostedt * Copyright (C) 2006 Esben Nielsen + * Adaptive Spinlocks: + * Copyright (C) 2008 Novell, Inc., Gregory Haskins, Sven Dietrich, + * and Peter Morreale, + * Adaptive Spinlocks simplification: + * Copyright (C) 2008 Red Hat, Inc., Steven Rostedt <srostedt@redhat.com> * * See Documentation/locking/rt-mutex-design.rst for details. */ @ kernel/locking/rtmutex.c:27 @ #include <linux/sched/wake_q.h> #include <linux/sched/debug.h> #include <linux/timer.h> +#include <linux/ww_mutex.h> +#include <linux/blkdev.h> #include "rtmutex_common.h" @ kernel/locking/rtmutex.c:146 @ static void fixup_rt_mutex_waiters(struct rt_mutex *lock) WRITE_ONCE(*p, owner & ~RT_MUTEX_HAS_WAITERS); } +static int rt_mutex_real_waiter(struct rt_mutex_waiter *waiter) +{ + return waiter && waiter != PI_WAKEUP_INPROGRESS && + waiter != PI_REQUEUE_INPROGRESS; +} + /* * We can speed up the acquire/release, if there's no debugging state to be * set up. @ kernel/locking/rtmutex.c:245 @ static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock, * Only use with rt_mutex_waiter_{less,equal}() */ #define task_to_waiter(p) \ - &(struct rt_mutex_waiter){ .prio = (p)->prio, .deadline = (p)->dl.deadline } + &(struct rt_mutex_waiter){ .prio = (p)->prio, .deadline = (p)->dl.deadline, .task = (p) } static inline int rt_mutex_waiter_less(struct rt_mutex_waiter *left, @ kernel/locking/rtmutex.c:285 @ rt_mutex_waiter_equal(struct rt_mutex_waiter *left, return 1; } +#define STEAL_NORMAL 0 +#define STEAL_LATERAL 1 + +static inline int +rt_mutex_steal(struct rt_mutex *lock, struct rt_mutex_waiter *waiter, int mode) +{ + struct rt_mutex_waiter *top_waiter = rt_mutex_top_waiter(lock); + + if (waiter == top_waiter || rt_mutex_waiter_less(waiter, top_waiter)) + return 1; + + /* + * Note that RT tasks are excluded from lateral-steals + * to prevent the introduction of an unbounded latency. + */ + if (mode == STEAL_NORMAL || rt_task(waiter->task)) + return 0; + + return rt_mutex_waiter_equal(waiter, top_waiter); +} + static void rt_mutex_enqueue(struct rt_mutex *lock, struct rt_mutex_waiter *waiter) { @ kernel/locking/rtmutex.c:410 @ static bool rt_mutex_cond_detect_deadlock(struct rt_mutex_waiter *waiter, return debug_rt_mutex_detect_deadlock(waiter, chwalk); } +static void rt_mutex_wake_waiter(struct rt_mutex_waiter *waiter) +{ + if (waiter->savestate) + wake_up_lock_sleeper(waiter->task); + else + wake_up_process(waiter->task); +} + /* * Max number of times we'll walk the boosting chain: */ @ kernel/locking/rtmutex.c:425 @ int max_lock_depth = 1024; static inline struct rt_mutex *task_blocked_on_lock(struct task_struct *p) { - return p->pi_blocked_on ? p->pi_blocked_on->lock : NULL; + return rt_mutex_real_waiter(p->pi_blocked_on) ? + p->pi_blocked_on->lock : NULL; } /* @ kernel/locking/rtmutex.c:562 @ static int rt_mutex_adjust_prio_chain(struct task_struct *task, * reached or the state of the chain has changed while we * dropped the locks. */ - if (!waiter) + if (!rt_mutex_real_waiter(waiter)) goto out_unlock_pi; /* @ kernel/locking/rtmutex.c:742 @ static int rt_mutex_adjust_prio_chain(struct task_struct *task, * follow here. This is the end of the chain we are walking. */ if (!rt_mutex_owner(lock)) { + struct rt_mutex_waiter *lock_top_waiter; + /* * If the requeue [7] above changed the top waiter, * then we need to wake the new top waiter up to try * to get the lock. */ - if (prerequeue_top_waiter != rt_mutex_top_waiter(lock)) - wake_up_process(rt_mutex_top_waiter(lock)->task); + lock_top_waiter = rt_mutex_top_waiter(lock); + if (prerequeue_top_waiter != lock_top_waiter) + rt_mutex_wake_waiter(lock_top_waiter); raw_spin_unlock_irq(&lock->wait_lock); return 0; } @ kernel/locking/rtmutex.c:852 @ static int rt_mutex_adjust_prio_chain(struct task_struct *task, * @task: The task which wants to acquire the lock * @waiter: The waiter that is queued to the lock's wait tree if the * callsite called task_blocked_on_lock(), otherwise NULL + * @mode: Lock steal mode (STEAL_NORMAL, STEAL_LATERAL) */ -static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task, - struct rt_mutex_waiter *waiter) +static int __try_to_take_rt_mutex(struct rt_mutex *lock, + struct task_struct *task, + struct rt_mutex_waiter *waiter, int mode) { lockdep_assert_held(&lock->wait_lock); @ kernel/locking/rtmutex.c:892 @ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task, */ if (waiter) { /* - * If waiter is not the highest priority waiter of - * @lock, give up. + * If waiter is not the highest priority waiter of @lock, + * or its peer when lateral steal is allowed, give up. */ - if (waiter != rt_mutex_top_waiter(lock)) + if (!rt_mutex_steal(lock, waiter, mode)) return 0; - /* * We can acquire the lock. Remove the waiter from the * lock waiters tree. @ kernel/locking/rtmutex.c:914 @ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task, */ if (rt_mutex_has_waiters(lock)) { /* - * If @task->prio is greater than or equal to - * the top waiter priority (kernel view), - * @task lost. + * If @task->prio is greater than the top waiter + * priority (kernel view), or equal to it when a + * lateral steal is forbidden, @task lost. */ - if (!rt_mutex_waiter_less(task_to_waiter(task), - rt_mutex_top_waiter(lock))) + if (!rt_mutex_steal(lock, task_to_waiter(task), mode)) return 0; - /* * The current top waiter stays enqueued. We * don't have to change anything in the lock @ kernel/locking/rtmutex.c:966 @ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task, return 1; } +#ifdef CONFIG_PREEMPT_RT +/* + * preemptible spin_lock functions: + */ +static inline void rt_spin_lock_fastlock(struct rt_mutex *lock, + void (*slowfn)(struct rt_mutex *lock)) +{ + might_sleep_no_state_check(); + + if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) + return; + else + slowfn(lock); +} + +static inline void rt_spin_lock_fastunlock(struct rt_mutex *lock, + void (*slowfn)(struct rt_mutex *lock)) +{ + if (likely(rt_mutex_cmpxchg_release(lock, current, NULL))) + return; + else + slowfn(lock); +} +#ifdef CONFIG_SMP +/* + * Note that owner is a speculative pointer and dereferencing relies + * on rcu_read_lock() and the check against the lock owner. + */ +static int adaptive_wait(struct rt_mutex *lock, + struct task_struct *owner) +{ + int res = 0; + + rcu_read_lock(); + for (;;) { + if (owner != rt_mutex_owner(lock)) + break; + /* + * Ensure that owner->on_cpu is dereferenced _after_ + * checking the above to be valid. + */ + barrier(); + if (!owner->on_cpu) { + res = 1; + break; + } + cpu_relax(); + } + rcu_read_unlock(); + return res; +} +#else +static int adaptive_wait(struct rt_mutex *lock, + struct task_struct *orig_owner) +{ + return 1; +} +#endif + +static int task_blocks_on_rt_mutex(struct rt_mutex *lock, + struct rt_mutex_waiter *waiter, + struct task_struct *task, + enum rtmutex_chainwalk chwalk); +/* + * Slow path lock function spin_lock style: this variant is very + * careful not to miss any non-lock wakeups. + * + * We store the current state under p->pi_lock in p->saved_state and + * the try_to_wake_up() code handles this accordingly. + */ +void __sched rt_spin_lock_slowlock_locked(struct rt_mutex *lock, + struct rt_mutex_waiter *waiter, + unsigned long flags) +{ + struct task_struct *lock_owner, *self = current; + struct rt_mutex_waiter *top_waiter; + int ret; + + if (__try_to_take_rt_mutex(lock, self, NULL, STEAL_LATERAL)) + return; + + BUG_ON(rt_mutex_owner(lock) == self); + + /* + * We save whatever state the task is in and we'll restore it + * after acquiring the lock taking real wakeups into account + * as well. We are serialized via pi_lock against wakeups. See + * try_to_wake_up(). + */ + raw_spin_lock(&self->pi_lock); + self->saved_state = self->state; + __set_current_state_no_track(TASK_UNINTERRUPTIBLE); + raw_spin_unlock(&self->pi_lock); + + ret = task_blocks_on_rt_mutex(lock, waiter, self, RT_MUTEX_MIN_CHAINWALK); + BUG_ON(ret); + + for (;;) { + /* Try to acquire the lock again. */ + if (__try_to_take_rt_mutex(lock, self, waiter, STEAL_LATERAL)) + break; + + top_waiter = rt_mutex_top_waiter(lock); + lock_owner = rt_mutex_owner(lock); + + raw_spin_unlock_irqrestore(&lock->wait_lock, flags); + + debug_rt_mutex_print_deadlock(waiter); + + if (top_waiter != waiter || adaptive_wait(lock, lock_owner)) + schedule(); + + raw_spin_lock_irqsave(&lock->wait_lock, flags); + + raw_spin_lock(&self->pi_lock); + __set_current_state_no_track(TASK_UNINTERRUPTIBLE); + raw_spin_unlock(&self->pi_lock); + } + + /* + * Restore the task state to current->saved_state. We set it + * to the original state above and the try_to_wake_up() code + * has possibly updated it when a real (non-rtmutex) wakeup + * happened while we were blocked. Clear saved_state so + * try_to_wakeup() does not get confused. + */ + raw_spin_lock(&self->pi_lock); + __set_current_state_no_track(self->saved_state); + self->saved_state = TASK_RUNNING; + raw_spin_unlock(&self->pi_lock); + + /* + * try_to_take_rt_mutex() sets the waiter bit + * unconditionally. We might have to fix that up: + */ + fixup_rt_mutex_waiters(lock); + + BUG_ON(rt_mutex_has_waiters(lock) && waiter == rt_mutex_top_waiter(lock)); + BUG_ON(!RB_EMPTY_NODE(&waiter->tree_entry)); +} + +static void noinline __sched rt_spin_lock_slowlock(struct rt_mutex *lock) +{ + struct rt_mutex_waiter waiter; + unsigned long flags; + + rt_mutex_init_waiter(&waiter, true); + + raw_spin_lock_irqsave(&lock->wait_lock, flags); + rt_spin_lock_slowlock_locked(lock, &waiter, flags); + raw_spin_unlock_irqrestore(&lock->wait_lock, flags); + debug_rt_mutex_free_waiter(&waiter); +} + +static bool __sched __rt_mutex_unlock_common(struct rt_mutex *lock, + struct wake_q_head *wake_q, + struct wake_q_head *wq_sleeper); +/* + * Slow path to release a rt_mutex spin_lock style + */ +void __sched rt_spin_lock_slowunlock(struct rt_mutex *lock) +{ + unsigned long flags; + DEFINE_WAKE_Q(wake_q); + DEFINE_WAKE_Q(wake_sleeper_q); + bool postunlock; + + raw_spin_lock_irqsave(&lock->wait_lock, flags); + postunlock = __rt_mutex_unlock_common(lock, &wake_q, &wake_sleeper_q); + raw_spin_unlock_irqrestore(&lock->wait_lock, flags); + + if (postunlock) + rt_mutex_postunlock(&wake_q, &wake_sleeper_q); +} + +void __lockfunc rt_spin_lock(spinlock_t *lock) +{ + sleeping_lock_inc(); + rcu_read_lock(); + migrate_disable(); + spin_acquire(&lock->dep_map, 0, 0, _RET_IP_); + rt_spin_lock_fastlock(&lock->lock, rt_spin_lock_slowlock); +} +EXPORT_SYMBOL(rt_spin_lock); + +void __lockfunc __rt_spin_lock(struct rt_mutex *lock) +{ + rt_spin_lock_fastlock(lock, rt_spin_lock_slowlock); +} + +#ifdef CONFIG_DEBUG_LOCK_ALLOC +void __lockfunc rt_spin_lock_nested(spinlock_t *lock, int subclass) +{ + sleeping_lock_inc(); + rcu_read_lock(); + migrate_disable(); + spin_acquire(&lock->dep_map, subclass, 0, _RET_IP_); + rt_spin_lock_fastlock(&lock->lock, rt_spin_lock_slowlock); +} +EXPORT_SYMBOL(rt_spin_lock_nested); +#endif + +void __lockfunc rt_spin_unlock(spinlock_t *lock) +{ + /* NOTE: we always pass in '1' for nested, for simplicity */ + spin_release(&lock->dep_map, _RET_IP_); + rt_spin_lock_fastunlock(&lock->lock, rt_spin_lock_slowunlock); + migrate_enable(); + rcu_read_unlock(); + sleeping_lock_dec(); +} +EXPORT_SYMBOL(rt_spin_unlock); + +void __lockfunc __rt_spin_unlock(struct rt_mutex *lock) +{ + rt_spin_lock_fastunlock(lock, rt_spin_lock_slowunlock); +} +EXPORT_SYMBOL(__rt_spin_unlock); + +/* + * Wait for the lock to get unlocked: instead of polling for an unlock + * (like raw spinlocks do), we lock and unlock, to force the kernel to + * schedule if there's contention: + */ +void __lockfunc rt_spin_lock_unlock(spinlock_t *lock) +{ + spin_lock(lock); + spin_unlock(lock); +} +EXPORT_SYMBOL(rt_spin_lock_unlock); + +int __lockfunc rt_spin_trylock(spinlock_t *lock) +{ + int ret; + + sleeping_lock_inc(); + migrate_disable(); + ret = __rt_mutex_trylock(&lock->lock); + if (ret) { + spin_acquire(&lock->dep_map, 0, 1, _RET_IP_); + rcu_read_lock(); + } else { + migrate_enable(); + sleeping_lock_dec(); + } + return ret; +} +EXPORT_SYMBOL(rt_spin_trylock); + +int __lockfunc rt_spin_trylock_bh(spinlock_t *lock) +{ + int ret; + + local_bh_disable(); + ret = __rt_mutex_trylock(&lock->lock); + if (ret) { + sleeping_lock_inc(); + rcu_read_lock(); + migrate_disable(); + spin_acquire(&lock->dep_map, 0, 1, _RET_IP_); + } else + local_bh_enable(); + return ret; +} +EXPORT_SYMBOL(rt_spin_trylock_bh); + +int __lockfunc rt_spin_trylock_irqsave(spinlock_t *lock, unsigned long *flags) +{ + int ret; + + *flags = 0; + ret = __rt_mutex_trylock(&lock->lock); + if (ret) { + sleeping_lock_inc(); + rcu_read_lock(); + migrate_disable(); + spin_acquire(&lock->dep_map, 0, 1, _RET_IP_); + } + return ret; +} +EXPORT_SYMBOL(rt_spin_trylock_irqsave); + +void +__rt_spin_lock_init(spinlock_t *lock, const char *name, struct lock_class_key *key) +{ +#ifdef CONFIG_DEBUG_LOCK_ALLOC + /* + * Make sure we are not reinitializing a held lock: + */ + debug_check_no_locks_freed((void *)lock, sizeof(*lock)); + lockdep_init_map(&lock->dep_map, name, key, 0); +#endif +} +EXPORT_SYMBOL(__rt_spin_lock_init); + +#endif /* PREEMPT_RT */ + +#ifdef CONFIG_PREEMPT_RT + static inline int __sched +__mutex_lock_check_stamp(struct rt_mutex *lock, struct ww_acquire_ctx *ctx) +{ + struct ww_mutex *ww = container_of(lock, struct ww_mutex, base.lock); + struct ww_acquire_ctx *hold_ctx = READ_ONCE(ww->ctx); + + if (!hold_ctx) + return 0; + + if (unlikely(ctx == hold_ctx)) + return -EALREADY; + + if (ctx->stamp - hold_ctx->stamp <= LONG_MAX && + (ctx->stamp != hold_ctx->stamp || ctx > hold_ctx)) { +#ifdef CONFIG_DEBUG_MUTEXES + DEBUG_LOCKS_WARN_ON(ctx->contending_lock); + ctx->contending_lock = ww; +#endif + return -EDEADLK; + } + + return 0; +} +#else + static inline int __sched +__mutex_lock_check_stamp(struct rt_mutex *lock, struct ww_acquire_ctx *ctx) +{ + BUG(); + return 0; +} + +#endif + +static inline int +try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task, + struct rt_mutex_waiter *waiter) +{ + return __try_to_take_rt_mutex(lock, task, waiter, STEAL_NORMAL); +} + /* * Task blocks on lock. * @ kernel/locking/rtmutex.c:1336 @ static int task_blocks_on_rt_mutex(struct rt_mutex *lock, return -EDEADLK; raw_spin_lock(&task->pi_lock); + /* + * In the case of futex requeue PI, this will be a proxy + * lock. The task will wake unaware that it is enqueueed on + * this lock. Avoid blocking on two locks and corrupting + * pi_blocked_on via the PI_WAKEUP_INPROGRESS + * flag. futex_wait_requeue_pi() sets this when it wakes up + * before requeue (due to a signal or timeout). Do not enqueue + * the task if PI_WAKEUP_INPROGRESS is set. + */ + if (task != current && task->pi_blocked_on == PI_WAKEUP_INPROGRESS) { + raw_spin_unlock(&task->pi_lock); + return -EAGAIN; + } + + BUG_ON(rt_mutex_real_waiter(task->pi_blocked_on)); + waiter->task = task; waiter->lock = lock; waiter->prio = task->prio; @ kernel/locking/rtmutex.c:1375 @ static int task_blocks_on_rt_mutex(struct rt_mutex *lock, rt_mutex_enqueue_pi(owner, waiter); rt_mutex_adjust_prio(owner); - if (owner->pi_blocked_on) + if (rt_mutex_real_waiter(owner->pi_blocked_on)) chain_walk = 1; } else if (rt_mutex_cond_detect_deadlock(waiter, chwalk)) { chain_walk = 1; @ kernel/locking/rtmutex.c:1417 @ static int task_blocks_on_rt_mutex(struct rt_mutex *lock, * Called with lock->wait_lock held and interrupts disabled. */ static void mark_wakeup_next_waiter(struct wake_q_head *wake_q, + struct wake_q_head *wake_sleeper_q, struct rt_mutex *lock) { struct rt_mutex_waiter *waiter; @ kernel/locking/rtmutex.c:1457 @ static void mark_wakeup_next_waiter(struct wake_q_head *wake_q, * Pairs with preempt_enable() in rt_mutex_postunlock(); */ preempt_disable(); - wake_q_add(wake_q, waiter->task); + if (waiter->savestate) + wake_q_add_sleeper(wake_sleeper_q, waiter->task); + else + wake_q_add(wake_q, waiter->task); raw_spin_unlock(¤t->pi_lock); } @ kernel/locking/rtmutex.c:1475 @ static void remove_waiter(struct rt_mutex *lock, { bool is_top_waiter = (waiter == rt_mutex_top_waiter(lock)); struct task_struct *owner = rt_mutex_owner(lock); - struct rt_mutex *next_lock; + struct rt_mutex *next_lock = NULL; lockdep_assert_held(&lock->wait_lock); @ kernel/locking/rtmutex.c:1501 @ static void remove_waiter(struct rt_mutex *lock, rt_mutex_adjust_prio(owner); /* Store the lock on which owner is blocked or NULL */ - next_lock = task_blocked_on_lock(owner); + if (rt_mutex_real_waiter(owner->pi_blocked_on)) + next_lock = task_blocked_on_lock(owner); raw_spin_unlock(&owner->pi_lock); @ kernel/locking/rtmutex.c:1538 @ void rt_mutex_adjust_pi(struct task_struct *task) raw_spin_lock_irqsave(&task->pi_lock, flags); waiter = task->pi_blocked_on; - if (!waiter || rt_mutex_waiter_equal(waiter, task_to_waiter(task))) { + if (!rt_mutex_real_waiter(waiter) || + rt_mutex_waiter_equal(waiter, task_to_waiter(task))) { raw_spin_unlock_irqrestore(&task->pi_lock, flags); return; } next_lock = waiter->lock; - raw_spin_unlock_irqrestore(&task->pi_lock, flags); /* gets dropped in rt_mutex_adjust_prio_chain()! */ get_task_struct(task); + raw_spin_unlock_irqrestore(&task->pi_lock, flags); rt_mutex_adjust_prio_chain(task, RT_MUTEX_MIN_CHAINWALK, NULL, next_lock, NULL, task); } -void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter) +void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter, bool savestate) { debug_rt_mutex_init_waiter(waiter); RB_CLEAR_NODE(&waiter->pi_tree_entry); RB_CLEAR_NODE(&waiter->tree_entry); waiter->task = NULL; + waiter->savestate = savestate; } /** @ kernel/locking/rtmutex.c:1575 @ void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter) static int __sched __rt_mutex_slowlock(struct rt_mutex *lock, int state, struct hrtimer_sleeper *timeout, - struct rt_mutex_waiter *waiter) + struct rt_mutex_waiter *waiter, + struct ww_acquire_ctx *ww_ctx) { int ret = 0; @ kernel/locking/rtmutex.c:1585 @ __rt_mutex_slowlock(struct rt_mutex *lock, int state, if (try_to_take_rt_mutex(lock, current, waiter)) break; - /* - * TASK_INTERRUPTIBLE checks for signals and - * timeout. Ignored otherwise. - */ - if (likely(state == TASK_INTERRUPTIBLE)) { - /* Signal pending? */ - if (signal_pending(current)) - ret = -EINTR; - if (timeout && !timeout->task) - ret = -ETIMEDOUT; + if (timeout && !timeout->task) { + ret = -ETIMEDOUT; + break; + } + if (signal_pending_state(state, current)) { + ret = -EINTR; + break; + } + + if (ww_ctx && ww_ctx->acquired > 0) { + ret = __mutex_lock_check_stamp(lock, ww_ctx); if (ret) break; } @ kernel/locking/rtmutex.c:1634 @ static void rt_mutex_handle_deadlock(int res, int detect_deadlock, } } +static __always_inline void ww_mutex_lock_acquired(struct ww_mutex *ww, + struct ww_acquire_ctx *ww_ctx) +{ +#ifdef CONFIG_DEBUG_MUTEXES + /* + * If this WARN_ON triggers, you used ww_mutex_lock to acquire, + * but released with a normal mutex_unlock in this call. + * + * This should never happen, always use ww_mutex_unlock. + */ + DEBUG_LOCKS_WARN_ON(ww->ctx); + + /* + * Not quite done after calling ww_acquire_done() ? + */ + DEBUG_LOCKS_WARN_ON(ww_ctx->done_acquire); + + if (ww_ctx->contending_lock) { + /* + * After -EDEADLK you tried to + * acquire a different ww_mutex? Bad! + */ + DEBUG_LOCKS_WARN_ON(ww_ctx->contending_lock != ww); + + /* + * You called ww_mutex_lock after receiving -EDEADLK, + * but 'forgot' to unlock everything else first? + */ + DEBUG_LOCKS_WARN_ON(ww_ctx->acquired > 0); + ww_ctx->contending_lock = NULL; + } + + /* + * Naughty, using a different class will lead to undefined behavior! + */ + DEBUG_LOCKS_WARN_ON(ww_ctx->ww_class != ww->ww_class); +#endif + ww_ctx->acquired++; +} + +#ifdef CONFIG_PREEMPT_RT +static void ww_mutex_account_lock(struct rt_mutex *lock, + struct ww_acquire_ctx *ww_ctx) +{ + struct ww_mutex *ww = container_of(lock, struct ww_mutex, base.lock); + struct rt_mutex_waiter *waiter, *n; + + /* + * This branch gets optimized out for the common case, + * and is only important for ww_mutex_lock. + */ + ww_mutex_lock_acquired(ww, ww_ctx); + ww->ctx = ww_ctx; + + /* + * Give any possible sleeping processes the chance to wake up, + * so they can recheck if they have to back off. + */ + rbtree_postorder_for_each_entry_safe(waiter, n, &lock->waiters.rb_root, + tree_entry) { + /* XXX debug rt mutex waiter wakeup */ + + BUG_ON(waiter->lock != lock); + rt_mutex_wake_waiter(waiter); + } +} + +#else + +static void ww_mutex_account_lock(struct rt_mutex *lock, + struct ww_acquire_ctx *ww_ctx) +{ + BUG(); +} +#endif + +int __sched rt_mutex_slowlock_locked(struct rt_mutex *lock, int state, + struct hrtimer_sleeper *timeout, + enum rtmutex_chainwalk chwalk, + struct ww_acquire_ctx *ww_ctx, + struct rt_mutex_waiter *waiter) +{ + int ret; + +#ifdef CONFIG_PREEMPT_RT + if (ww_ctx) { + struct ww_mutex *ww; + + ww = container_of(lock, struct ww_mutex, base.lock); + if (unlikely(ww_ctx == READ_ONCE(ww->ctx))) + return -EALREADY; + } +#endif + + /* Try to acquire the lock again: */ + if (try_to_take_rt_mutex(lock, current, NULL)) { + if (ww_ctx) + ww_mutex_account_lock(lock, ww_ctx); + return 0; + } + + set_current_state(state); + + /* Setup the timer, when timeout != NULL */ + if (unlikely(timeout)) + hrtimer_start_expires(&timeout->timer, HRTIMER_MODE_ABS); + + ret = task_blocks_on_rt_mutex(lock, waiter, current, chwalk); + + if (likely(!ret)) { + /* sleep on the mutex */ + ret = __rt_mutex_slowlock(lock, state, timeout, waiter, + ww_ctx); + } else if (ww_ctx) { + /* ww_mutex received EDEADLK, let it become EALREADY */ + ret = __mutex_lock_check_stamp(lock, ww_ctx); + BUG_ON(!ret); + } + + if (unlikely(ret)) { + __set_current_state(TASK_RUNNING); + remove_waiter(lock, waiter); + /* ww_mutex wants to report EDEADLK/EALREADY, let it */ + if (!ww_ctx) + rt_mutex_handle_deadlock(ret, chwalk, waiter); + } else if (ww_ctx) { + ww_mutex_account_lock(lock, ww_ctx); + } + + /* + * try_to_take_rt_mutex() sets the waiter bit + * unconditionally. We might have to fix that up. + */ + fixup_rt_mutex_waiters(lock); + return ret; +} + /* * Slow path lock function: */ static int __sched rt_mutex_slowlock(struct rt_mutex *lock, int state, struct hrtimer_sleeper *timeout, - enum rtmutex_chainwalk chwalk) + enum rtmutex_chainwalk chwalk, + struct ww_acquire_ctx *ww_ctx) { struct rt_mutex_waiter waiter; unsigned long flags; int ret = 0; - rt_mutex_init_waiter(&waiter); + rt_mutex_init_waiter(&waiter, false); /* * Technically we could use raw_spin_[un]lock_irq() here, but this can @ kernel/locking/rtmutex.c:1796 @ rt_mutex_slowlock(struct rt_mutex *lock, int state, */ raw_spin_lock_irqsave(&lock->wait_lock, flags); - /* Try to acquire the lock again: */ - if (try_to_take_rt_mutex(lock, current, NULL)) { - raw_spin_unlock_irqrestore(&lock->wait_lock, flags); - return 0; - } - - set_current_state(state); - - /* Setup the timer, when timeout != NULL */ - if (unlikely(timeout)) - hrtimer_start_expires(&timeout->timer, HRTIMER_MODE_ABS); - - ret = task_blocks_on_rt_mutex(lock, &waiter, current, chwalk); - - if (likely(!ret)) - /* sleep on the mutex */ - ret = __rt_mutex_slowlock(lock, state, timeout, &waiter); - - if (unlikely(ret)) { - __set_current_state(TASK_RUNNING); - remove_waiter(lock, &waiter); - rt_mutex_handle_deadlock(ret, chwalk, &waiter); - } - - /* - * try_to_take_rt_mutex() sets the waiter bit - * unconditionally. We might have to fix that up. - */ - fixup_rt_mutex_waiters(lock); + ret = rt_mutex_slowlock_locked(lock, state, timeout, chwalk, ww_ctx, + &waiter); raw_spin_unlock_irqrestore(&lock->wait_lock, flags); @ kernel/locking/rtmutex.c:1858 @ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock) * Return whether the current task needs to call rt_mutex_postunlock(). */ static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock, - struct wake_q_head *wake_q) + struct wake_q_head *wake_q, + struct wake_q_head *wake_sleeper_q) { unsigned long flags; @ kernel/locking/rtmutex.c:1913 @ static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock, * * Queue the next waiter for wakeup once we release the wait_lock. */ - mark_wakeup_next_waiter(wake_q, lock); + mark_wakeup_next_waiter(wake_q, wake_sleeper_q, lock); raw_spin_unlock_irqrestore(&lock->wait_lock, flags); return true; /* call rt_mutex_postunlock() */ @ kernel/locking/rtmutex.c:1927 @ static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock, */ static inline int rt_mutex_fastlock(struct rt_mutex *lock, int state, + struct ww_acquire_ctx *ww_ctx, int (*slowfn)(struct rt_mutex *lock, int state, struct hrtimer_sleeper *timeout, - enum rtmutex_chainwalk chwalk)) + enum rtmutex_chainwalk chwalk, + struct ww_acquire_ctx *ww_ctx)) { if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) return 0; - return slowfn(lock, state, NULL, RT_MUTEX_MIN_CHAINWALK); + /* + * If rt_mutex blocks, the function sched_submit_work will not call + * blk_schedule_flush_plug (because tsk_is_pi_blocked would be true). + * We must call blk_schedule_flush_plug here, if we don't call it, + * a deadlock in I/O may happen. + */ + if (unlikely(blk_needs_flush_plug(current))) + blk_schedule_flush_plug(current); + + return slowfn(lock, state, NULL, RT_MUTEX_MIN_CHAINWALK, ww_ctx); } static inline int rt_mutex_timed_fastlock(struct rt_mutex *lock, int state, struct hrtimer_sleeper *timeout, enum rtmutex_chainwalk chwalk, + struct ww_acquire_ctx *ww_ctx, int (*slowfn)(struct rt_mutex *lock, int state, struct hrtimer_sleeper *timeout, - enum rtmutex_chainwalk chwalk)) + enum rtmutex_chainwalk chwalk, + struct ww_acquire_ctx *ww_ctx)) { if (chwalk == RT_MUTEX_MIN_CHAINWALK && likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) return 0; - return slowfn(lock, state, timeout, chwalk); + if (unlikely(blk_needs_flush_plug(current))) + blk_schedule_flush_plug(current); + + return slowfn(lock, state, timeout, chwalk, ww_ctx); } static inline int @ kernel/locking/rtmutex.c:1981 @ rt_mutex_fasttrylock(struct rt_mutex *lock, /* * Performs the wakeup of the the top-waiter and re-enables preemption. */ -void rt_mutex_postunlock(struct wake_q_head *wake_q) +void rt_mutex_postunlock(struct wake_q_head *wake_q, + struct wake_q_head *wake_sleeper_q) { wake_up_q(wake_q); + wake_up_q_sleeper(wake_sleeper_q); /* Pairs with preempt_disable() in rt_mutex_slowunlock() */ preempt_enable(); @ kernel/locking/rtmutex.c:1994 @ void rt_mutex_postunlock(struct wake_q_head *wake_q) static inline void rt_mutex_fastunlock(struct rt_mutex *lock, bool (*slowfn)(struct rt_mutex *lock, - struct wake_q_head *wqh)) + struct wake_q_head *wqh, + struct wake_q_head *wq_sleeper)) { DEFINE_WAKE_Q(wake_q); + DEFINE_WAKE_Q(wake_sleeper_q); if (likely(rt_mutex_cmpxchg_release(lock, current, NULL))) return; - if (slowfn(lock, &wake_q)) - rt_mutex_postunlock(&wake_q); + if (slowfn(lock, &wake_q, &wake_sleeper_q)) + rt_mutex_postunlock(&wake_q, &wake_sleeper_q); +} + +int __sched __rt_mutex_lock_state(struct rt_mutex *lock, int state) +{ + might_sleep(); + return rt_mutex_fastlock(lock, state, NULL, rt_mutex_slowlock); +} + +/** + * rt_mutex_lock_state - lock a rt_mutex with a given state + * + * @lock: The rt_mutex to be locked + * @state: The state to set when blocking on the rt_mutex + */ +static inline int __sched rt_mutex_lock_state(struct rt_mutex *lock, + unsigned int subclass, int state) +{ + int ret; + + mutex_acquire(&lock->dep_map, subclass, 0, _RET_IP_); + ret = __rt_mutex_lock_state(lock, state); + if (ret) + mutex_release(&lock->dep_map, _RET_IP_); + return ret; } static inline void __rt_mutex_lock(struct rt_mutex *lock, unsigned int subclass) { - might_sleep(); - - mutex_acquire(&lock->dep_map, subclass, 0, _RET_IP_); - rt_mutex_fastlock(lock, TASK_UNINTERRUPTIBLE, rt_mutex_slowlock); + rt_mutex_lock_state(lock, subclass, TASK_UNINTERRUPTIBLE); } #ifdef CONFIG_DEBUG_LOCK_ALLOC @ kernel/locking/rtmutex.c:2074 @ EXPORT_SYMBOL_GPL(rt_mutex_lock); */ int __sched rt_mutex_lock_interruptible(struct rt_mutex *lock) { - int ret; - - might_sleep(); - - mutex_acquire(&lock->dep_map, 0, 0, _RET_IP_); - ret = rt_mutex_fastlock(lock, TASK_INTERRUPTIBLE, rt_mutex_slowlock); - if (ret) - mutex_release(&lock->dep_map, _RET_IP_); - - return ret; + return rt_mutex_lock_state(lock, 0, TASK_INTERRUPTIBLE); } EXPORT_SYMBOL_GPL(rt_mutex_lock_interruptible); @ kernel/locking/rtmutex.c:2091 @ int __sched __rt_mutex_futex_trylock(struct rt_mutex *lock) return __rt_mutex_slowtrylock(lock); } +/** + * rt_mutex_lock_killable - lock a rt_mutex killable + * + * @lock: the rt_mutex to be locked + * @detect_deadlock: deadlock detection on/off + * + * Returns: + * 0 on success + * -EINTR when interrupted by a signal + */ +int __sched rt_mutex_lock_killable(struct rt_mutex *lock) +{ + return rt_mutex_lock_state(lock, 0, TASK_KILLABLE); +} +EXPORT_SYMBOL_GPL(rt_mutex_lock_killable); + /** * rt_mutex_timed_lock - lock a rt_mutex interruptible * the timeout structure is provided @ kernel/locking/rtmutex.c:2130 @ rt_mutex_timed_lock(struct rt_mutex *lock, struct hrtimer_sleeper *timeout) mutex_acquire(&lock->dep_map, 0, 0, _RET_IP_); ret = rt_mutex_timed_fastlock(lock, TASK_INTERRUPTIBLE, timeout, RT_MUTEX_MIN_CHAINWALK, + NULL, rt_mutex_slowlock); if (ret) mutex_release(&lock->dep_map, _RET_IP_); @ kernel/locking/rtmutex.c:2139 @ rt_mutex_timed_lock(struct rt_mutex *lock, struct hrtimer_sleeper *timeout) } EXPORT_SYMBOL_GPL(rt_mutex_timed_lock); +int __sched __rt_mutex_trylock(struct rt_mutex *lock) +{ +#ifdef CONFIG_PREEMPT_RT + if (WARN_ON_ONCE(in_irq() || in_nmi())) +#else + if (WARN_ON_ONCE(in_irq() || in_nmi() || in_serving_softirq())) +#endif + return 0; + + return rt_mutex_fasttrylock(lock, rt_mutex_slowtrylock); +} + /** * rt_mutex_trylock - try to lock a rt_mutex * @ kernel/locking/rtmutex.c:2166 @ int __sched rt_mutex_trylock(struct rt_mutex *lock) { int ret; - if (WARN_ON_ONCE(in_irq() || in_nmi() || in_serving_softirq())) - return 0; - - ret = rt_mutex_fasttrylock(lock, rt_mutex_slowtrylock); + ret = __rt_mutex_trylock(lock); if (ret) mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_); @ kernel/locking/rtmutex.c:2174 @ int __sched rt_mutex_trylock(struct rt_mutex *lock) } EXPORT_SYMBOL_GPL(rt_mutex_trylock); +void __sched __rt_mutex_unlock(struct rt_mutex *lock) +{ + rt_mutex_fastunlock(lock, rt_mutex_slowunlock); +} + /** * rt_mutex_unlock - unlock a rt_mutex * @ kernel/locking/rtmutex.c:2187 @ EXPORT_SYMBOL_GPL(rt_mutex_trylock); void __sched rt_mutex_unlock(struct rt_mutex *lock) { mutex_release(&lock->dep_map, _RET_IP_); - rt_mutex_fastunlock(lock, rt_mutex_slowunlock); + __rt_mutex_unlock(lock); } EXPORT_SYMBOL_GPL(rt_mutex_unlock); -/** - * Futex variant, that since futex variants do not use the fast-path, can be - * simple and will not need to retry. - */ -bool __sched __rt_mutex_futex_unlock(struct rt_mutex *lock, - struct wake_q_head *wake_q) +static bool __sched __rt_mutex_unlock_common(struct rt_mutex *lock, + struct wake_q_head *wake_q, + struct wake_q_head *wq_sleeper) { lockdep_assert_held(&lock->wait_lock); @ kernel/locking/rtmutex.c:2210 @ bool __sched __rt_mutex_futex_unlock(struct rt_mutex *lock, * avoid inversion prior to the wakeup. preempt_disable() * therein pairs with rt_mutex_postunlock(). */ - mark_wakeup_next_waiter(wake_q, lock); + mark_wakeup_next_waiter(wake_q, wq_sleeper, lock); return true; /* call postunlock() */ } +/** + * Futex variant, that since futex variants do not use the fast-path, can be + * simple and will not need to retry. + */ +bool __sched __rt_mutex_futex_unlock(struct rt_mutex *lock, + struct wake_q_head *wake_q, + struct wake_q_head *wq_sleeper) +{ + return __rt_mutex_unlock_common(lock, wake_q, wq_sleeper); +} + void __sched rt_mutex_futex_unlock(struct rt_mutex *lock) { DEFINE_WAKE_Q(wake_q); + DEFINE_WAKE_Q(wake_sleeper_q); unsigned long flags; bool postunlock; raw_spin_lock_irqsave(&lock->wait_lock, flags); - postunlock = __rt_mutex_futex_unlock(lock, &wake_q); + postunlock = __rt_mutex_futex_unlock(lock, &wake_q, &wake_sleeper_q); raw_spin_unlock_irqrestore(&lock->wait_lock, flags); if (postunlock) - rt_mutex_postunlock(&wake_q); + rt_mutex_postunlock(&wake_q, &wake_sleeper_q); } /** @ kernel/locking/rtmutex.c:2277 @ void __rt_mutex_init(struct rt_mutex *lock, const char *name, if (name && key) debug_rt_mutex_init(lock, name, key); } -EXPORT_SYMBOL_GPL(__rt_mutex_init); +EXPORT_SYMBOL(__rt_mutex_init); /** * rt_mutex_init_proxy_locked - initialize and lock a rt_mutex on behalf of a @ kernel/locking/rtmutex.c:2297 @ void rt_mutex_init_proxy_locked(struct rt_mutex *lock, struct task_struct *proxy_owner) { __rt_mutex_init(lock, NULL, NULL); +#ifdef CONFIG_DEBUG_SPINLOCK + /* + * get another key class for the wait_lock. LOCK_PI and UNLOCK_PI is + * holding the ->wait_lock of the proxy_lock while unlocking a sleeping + * lock. + */ + raw_spin_lock_init(&lock->wait_lock); +#endif debug_rt_mutex_proxy_lock(lock, proxy_owner); rt_mutex_set_owner(lock, proxy_owner); } @ kernel/locking/rtmutex.c:2328 @ void rt_mutex_proxy_unlock(struct rt_mutex *lock, rt_mutex_set_owner(lock, NULL); } +static void fixup_rt_mutex_blocked(struct rt_mutex *lock) +{ + struct task_struct *tsk = current; + /* + * RT has a problem here when the wait got interrupted by a timeout + * or a signal. task->pi_blocked_on is still set. The task must + * acquire the hash bucket lock when returning from this function. + * + * If the hash bucket lock is contended then the + * BUG_ON(rt_mutex_real_waiter(task->pi_blocked_on)) in + * task_blocks_on_rt_mutex() will trigger. This can be avoided by + * clearing task->pi_blocked_on which removes the task from the + * boosting chain of the rtmutex. That's correct because the task + * is not longer blocked on it. + */ + raw_spin_lock(&tsk->pi_lock); + tsk->pi_blocked_on = NULL; + raw_spin_unlock(&tsk->pi_lock); +} + /** * __rt_mutex_start_proxy_lock() - Start lock acquisition for another task * @lock: the rt_mutex to take @ kernel/locking/rtmutex.c:2378 @ int __rt_mutex_start_proxy_lock(struct rt_mutex *lock, if (try_to_take_rt_mutex(lock, task, NULL)) return 1; +#ifdef CONFIG_PREEMPT_RT + /* + * In PREEMPT_RT there's an added race. + * If the task, that we are about to requeue, times out, + * it can set the PI_WAKEUP_INPROGRESS. This tells the requeue + * to skip this task. But right after the task sets + * its pi_blocked_on to PI_WAKEUP_INPROGRESS it can then + * block on the spin_lock(&hb->lock), which in RT is an rtmutex. + * This will replace the PI_WAKEUP_INPROGRESS with the actual + * lock that it blocks on. We *must not* place this task + * on this proxy lock in that case. + * + * To prevent this race, we first take the task's pi_lock + * and check if it has updated its pi_blocked_on. If it has, + * we assume that it woke up and we return -EAGAIN. + * Otherwise, we set the task's pi_blocked_on to + * PI_REQUEUE_INPROGRESS, so that if the task is waking up + * it will know that we are in the process of requeuing it. + */ + raw_spin_lock(&task->pi_lock); + if (task->pi_blocked_on) { + raw_spin_unlock(&task->pi_lock); + return -EAGAIN; + } + task->pi_blocked_on = PI_REQUEUE_INPROGRESS; + raw_spin_unlock(&task->pi_lock); +#endif + /* We enforce deadlock detection for futexes */ ret = task_blocks_on_rt_mutex(lock, waiter, task, RT_MUTEX_FULL_CHAINWALK); @ kernel/locking/rtmutex.c:2420 @ int __rt_mutex_start_proxy_lock(struct rt_mutex *lock, ret = 0; } + if (ret) + fixup_rt_mutex_blocked(lock); + debug_rt_mutex_print_deadlock(waiter); return ret; @ kernel/locking/rtmutex.c:2508 @ int rt_mutex_wait_proxy_lock(struct rt_mutex *lock, raw_spin_lock_irq(&lock->wait_lock); /* sleep on the mutex */ set_current_state(TASK_INTERRUPTIBLE); - ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter); + ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter, NULL); /* * try_to_take_rt_mutex() sets the waiter bit unconditionally. We might * have to fix that up. */ fixup_rt_mutex_waiters(lock); + if (ret) + fixup_rt_mutex_blocked(lock); + raw_spin_unlock_irq(&lock->wait_lock); return ret; @ kernel/locking/rtmutex.c:2578 @ bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock, return cleanup; } + +static inline int +ww_mutex_deadlock_injection(struct ww_mutex *lock, struct ww_acquire_ctx *ctx) +{ +#ifdef CONFIG_DEBUG_WW_MUTEX_SLOWPATH + unsigned tmp; + + if (ctx->deadlock_inject_countdown-- == 0) { + tmp = ctx->deadlock_inject_interval; + if (tmp > UINT_MAX/4) + tmp = UINT_MAX; + else + tmp = tmp*2 + tmp + tmp/2; + + ctx->deadlock_inject_interval = tmp; + ctx->deadlock_inject_countdown = tmp; + ctx->contending_lock = lock; + + ww_mutex_unlock(lock); + + return -EDEADLK; + } +#endif + + return 0; +} + +#ifdef CONFIG_PREEMPT_RT +int __sched +ww_mutex_lock_interruptible(struct ww_mutex *lock, struct ww_acquire_ctx *ctx) +{ + int ret; + + might_sleep(); + + mutex_acquire_nest(&lock->base.dep_map, 0, 0, + ctx ? &ctx->dep_map : NULL, _RET_IP_); + ret = rt_mutex_slowlock(&lock->base.lock, TASK_INTERRUPTIBLE, NULL, 0, + ctx); + if (ret) + mutex_release(&lock->base.dep_map, _RET_IP_); + else if (!ret && ctx && ctx->acquired > 1) + return ww_mutex_deadlock_injection(lock, ctx); + + return ret; +} +EXPORT_SYMBOL_GPL(ww_mutex_lock_interruptible); + +int __sched +ww_mutex_lock(struct ww_mutex *lock, struct ww_acquire_ctx *ctx) +{ + int ret; + + might_sleep(); + + mutex_acquire_nest(&lock->base.dep_map, 0, 0, + ctx ? &ctx->dep_map : NULL, _RET_IP_); + ret = rt_mutex_slowlock(&lock->base.lock, TASK_UNINTERRUPTIBLE, NULL, 0, + ctx); + if (ret) + mutex_release(&lock->base.dep_map, _RET_IP_); + else if (!ret && ctx && ctx->acquired > 1) + return ww_mutex_deadlock_injection(lock, ctx); + + return ret; +} +EXPORT_SYMBOL_GPL(ww_mutex_lock); + +void __sched ww_mutex_unlock(struct ww_mutex *lock) +{ + /* + * The unlocking fastpath is the 0->1 transition from 'locked' + * into 'unlocked' state: + */ + if (lock->ctx) { +#ifdef CONFIG_DEBUG_MUTEXES + DEBUG_LOCKS_WARN_ON(!lock->ctx->acquired); +#endif + if (lock->ctx->acquired > 0) + lock->ctx->acquired--; + lock->ctx = NULL; + } + + mutex_release(&lock->base.dep_map, _RET_IP_); + __rt_mutex_unlock(&lock->base.lock); +} +EXPORT_SYMBOL(ww_mutex_unlock); + +int __rt_mutex_owner_current(struct rt_mutex *lock) +{ + return rt_mutex_owner(lock) == current; +} +EXPORT_SYMBOL(__rt_mutex_owner_current); +#endif @ kernel/locking/rtmutex_common.h:18 @ #include <linux/rtmutex.h> #include <linux/sched/wake_q.h> +#include <linux/sched/debug.h> /* * This is the control structure for tasks blocked on a rt_mutex, @ kernel/locking/rtmutex_common.h:33 @ struct rt_mutex_waiter { struct rb_node pi_tree_entry; struct task_struct *task; struct rt_mutex *lock; + bool savestate; #ifdef CONFIG_DEBUG_RT_MUTEXES unsigned long ip; struct pid *deadlock_task_pid; @ kernel/locking/rtmutex_common.h:135 @ enum rtmutex_chainwalk { /* * PI-futex support (proxy locking functions, etc.): */ +#define PI_WAKEUP_INPROGRESS ((struct rt_mutex_waiter *) 1) +#define PI_REQUEUE_INPROGRESS ((struct rt_mutex_waiter *) 2) + extern struct task_struct *rt_mutex_next_owner(struct rt_mutex *lock); extern void rt_mutex_init_proxy_locked(struct rt_mutex *lock, struct task_struct *proxy_owner); extern void rt_mutex_proxy_unlock(struct rt_mutex *lock, struct task_struct *proxy_owner); -extern void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter); +extern void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter, bool savetate); extern int __rt_mutex_start_proxy_lock(struct rt_mutex *lock, struct rt_mutex_waiter *waiter, struct task_struct *task); @ kernel/locking/rtmutex_common.h:161 @ extern int __rt_mutex_futex_trylock(struct rt_mutex *l); extern void rt_mutex_futex_unlock(struct rt_mutex *lock); extern bool __rt_mutex_futex_unlock(struct rt_mutex *lock, - struct wake_q_head *wqh); + struct wake_q_head *wqh, + struct wake_q_head *wq_sleeper); -extern void rt_mutex_postunlock(struct wake_q_head *wake_q); +extern void rt_mutex_postunlock(struct wake_q_head *wake_q, + struct wake_q_head *wake_sleeper_q); + +/* RW semaphore special interface */ +struct ww_acquire_ctx; + +extern int __rt_mutex_lock_state(struct rt_mutex *lock, int state); +extern int __rt_mutex_trylock(struct rt_mutex *lock); +extern void __rt_mutex_unlock(struct rt_mutex *lock); +int __sched rt_mutex_slowlock_locked(struct rt_mutex *lock, int state, + struct hrtimer_sleeper *timeout, + enum rtmutex_chainwalk chwalk, + struct ww_acquire_ctx *ww_ctx, + struct rt_mutex_waiter *waiter); +void __sched rt_spin_lock_slowlock_locked(struct rt_mutex *lock, + struct rt_mutex_waiter *waiter, + unsigned long flags); +void __sched rt_spin_lock_slowunlock(struct rt_mutex *lock); #ifdef CONFIG_DEBUG_RT_MUTEXES # include "rtmutex-debug.h" @ kernel/locking/rwlock-rt.c:4 @ +/* + */ +#include <linux/sched/debug.h> +#include <linux/export.h> + +#include "rtmutex_common.h" +#include <linux/rwlock_types_rt.h> + +/* + * RT-specific reader/writer locks + * + * write_lock() + * 1) Lock lock->rtmutex + * 2) Remove the reader BIAS to force readers into the slow path + * 3) Wait until all readers have left the critical region + * 4) Mark it write locked + * + * write_unlock() + * 1) Remove the write locked marker + * 2) Set the reader BIAS so readers can use the fast path again + * 3) Unlock lock->rtmutex to release blocked readers + * + * read_lock() + * 1) Try fast path acquisition (reader BIAS is set) + * 2) Take lock->rtmutex.wait_lock which protects the writelocked flag + * 3) If !writelocked, acquire it for read + * 4) If writelocked, block on lock->rtmutex + * 5) unlock lock->rtmutex, goto 1) + * + * read_unlock() + * 1) Try fast path release (reader count != 1) + * 2) Wake the writer waiting in write_lock()#3 + * + * read_lock()#3 has the consequence, that rw locks on RT are not writer + * fair, but writers, which should be avoided in RT tasks (think tasklist + * lock), are subject to the rtmutex priority/DL inheritance mechanism. + * + * It's possible to make the rw locks writer fair by keeping a list of + * active readers. A blocked writer would force all newly incoming readers + * to block on the rtmutex, but the rtmutex would have to be proxy locked + * for one reader after the other. We can't use multi-reader inheritance + * because there is no way to support that with + * SCHED_DEADLINE. Implementing the one by one reader boosting/handover + * mechanism is a major surgery for a very dubious value. + * + * The risk of writer starvation is there, but the pathological use cases + * which trigger it are not necessarily the typical RT workloads. + */ + +void __rwlock_biased_rt_init(struct rt_rw_lock *lock, const char *name, + struct lock_class_key *key) +{ +#ifdef CONFIG_DEBUG_LOCK_ALLOC + /* + * Make sure we are not reinitializing a held semaphore: + */ + debug_check_no_locks_freed((void *)lock, sizeof(*lock)); + lockdep_init_map(&lock->dep_map, name, key, 0); +#endif + atomic_set(&lock->readers, READER_BIAS); + rt_mutex_init(&lock->rtmutex); + lock->rtmutex.save_state = 1; +} + +int __read_rt_trylock(struct rt_rw_lock *lock) +{ + int r, old; + + /* + * Increment reader count, if lock->readers < 0, i.e. READER_BIAS is + * set. + */ + for (r = atomic_read(&lock->readers); r < 0;) { + old = atomic_cmpxchg(&lock->readers, r, r + 1); + if (likely(old == r)) + return 1; + r = old; + } + return 0; +} + +void __sched __read_rt_lock(struct rt_rw_lock *lock) +{ + struct rt_mutex *m = &lock->rtmutex; + struct rt_mutex_waiter waiter; + unsigned long flags; + + if (__read_rt_trylock(lock)) + return; + + raw_spin_lock_irqsave(&m->wait_lock, flags); + /* + * Allow readers as long as the writer has not completely + * acquired the semaphore for write. + */ + if (atomic_read(&lock->readers) != WRITER_BIAS) { + atomic_inc(&lock->readers); + raw_spin_unlock_irqrestore(&m->wait_lock, flags); + return; + } + + /* + * Call into the slow lock path with the rtmutex->wait_lock + * held, so this can't result in the following race: + * + * Reader1 Reader2 Writer + * read_lock() + * write_lock() + * rtmutex_lock(m) + * swait() + * read_lock() + * unlock(m->wait_lock) + * read_unlock() + * swake() + * lock(m->wait_lock) + * lock->writelocked=true + * unlock(m->wait_lock) + * + * write_unlock() + * lock->writelocked=false + * rtmutex_unlock(m) + * read_lock() + * write_lock() + * rtmutex_lock(m) + * swait() + * rtmutex_lock(m) + * + * That would put Reader1 behind the writer waiting on + * Reader2 to call read_unlock() which might be unbound. + */ + rt_mutex_init_waiter(&waiter, true); + rt_spin_lock_slowlock_locked(m, &waiter, flags); + /* + * The slowlock() above is guaranteed to return with the rtmutex is + * now held, so there can't be a writer active. Increment the reader + * count and immediately drop the rtmutex again. + */ + atomic_inc(&lock->readers); + raw_spin_unlock_irqrestore(&m->wait_lock, flags); + rt_spin_lock_slowunlock(m); + + debug_rt_mutex_free_waiter(&waiter); +} + +void __read_rt_unlock(struct rt_rw_lock *lock) +{ + struct rt_mutex *m = &lock->rtmutex; + struct task_struct *tsk; + + /* + * sem->readers can only hit 0 when a writer is waiting for the + * active readers to leave the critical region. + */ + if (!atomic_dec_and_test(&lock->readers)) + return; + + raw_spin_lock_irq(&m->wait_lock); + /* + * Wake the writer, i.e. the rtmutex owner. It might release the + * rtmutex concurrently in the fast path, but to clean up the rw + * lock it needs to acquire m->wait_lock. The worst case which can + * happen is a spurious wakeup. + */ + tsk = rt_mutex_owner(m); + if (tsk) + wake_up_process(tsk); + + raw_spin_unlock_irq(&m->wait_lock); +} + +static void __write_unlock_common(struct rt_rw_lock *lock, int bias, + unsigned long flags) +{ + struct rt_mutex *m = &lock->rtmutex; + + atomic_add(READER_BIAS - bias, &lock->readers); + raw_spin_unlock_irqrestore(&m->wait_lock, flags); + rt_spin_lock_slowunlock(m); +} + +void __sched __write_rt_lock(struct rt_rw_lock *lock) +{ + struct rt_mutex *m = &lock->rtmutex; + struct task_struct *self = current; + unsigned long flags; + + /* Take the rtmutex as a first step */ + __rt_spin_lock(m); + + /* Force readers into slow path */ + atomic_sub(READER_BIAS, &lock->readers); + + raw_spin_lock_irqsave(&m->wait_lock, flags); + + raw_spin_lock(&self->pi_lock); + self->saved_state = self->state; + __set_current_state_no_track(TASK_UNINTERRUPTIBLE); + raw_spin_unlock(&self->pi_lock); + + for (;;) { + /* Have all readers left the critical region? */ + if (!atomic_read(&lock->readers)) { + atomic_set(&lock->readers, WRITER_BIAS); + raw_spin_lock(&self->pi_lock); + __set_current_state_no_track(self->saved_state); + self->saved_state = TASK_RUNNING; + raw_spin_unlock(&self->pi_lock); + raw_spin_unlock_irqrestore(&m->wait_lock, flags); + return; + } + + raw_spin_unlock_irqrestore(&m->wait_lock, flags); + + if (atomic_read(&lock->readers) != 0) + schedule(); + + raw_spin_lock_irqsave(&m->wait_lock, flags); + + raw_spin_lock(&self->pi_lock); + __set_current_state_no_track(TASK_UNINTERRUPTIBLE); + raw_spin_unlock(&self->pi_lock); + } +} + +int __write_rt_trylock(struct rt_rw_lock *lock) +{ + struct rt_mutex *m = &lock->rtmutex; + unsigned long flags; + + if (!__rt_mutex_trylock(m)) + return 0; + + atomic_sub(READER_BIAS, &lock->readers); + + raw_spin_lock_irqsave(&m->wait_lock, flags); + if (!atomic_read(&lock->readers)) { + atomic_set(&lock->readers, WRITER_BIAS); + raw_spin_unlock_irqrestore(&m->wait_lock, flags); + return 1; + } + __write_unlock_common(lock, 0, flags); + return 0; +} + +void __write_rt_unlock(struct rt_rw_lock *lock) +{ + struct rt_mutex *m = &lock->rtmutex; + unsigned long flags; + + raw_spin_lock_irqsave(&m->wait_lock, flags); + __write_unlock_common(lock, WRITER_BIAS, flags); +} + +/* Map the reader biased implementation */ +static inline int do_read_rt_trylock(rwlock_t *rwlock) +{ + return __read_rt_trylock(rwlock); +} + +static inline int do_write_rt_trylock(rwlock_t *rwlock) +{ + return __write_rt_trylock(rwlock); +} + +static inline void do_read_rt_lock(rwlock_t *rwlock) +{ + __read_rt_lock(rwlock); +} + +static inline void do_write_rt_lock(rwlock_t *rwlock) +{ + __write_rt_lock(rwlock); +} + +static inline void do_read_rt_unlock(rwlock_t *rwlock) +{ + __read_rt_unlock(rwlock); +} + +static inline void do_write_rt_unlock(rwlock_t *rwlock) +{ + __write_rt_unlock(rwlock); +} + +static inline void do_rwlock_rt_init(rwlock_t *rwlock, const char *name, + struct lock_class_key *key) +{ + __rwlock_biased_rt_init(rwlock, name, key); +} + +int __lockfunc rt_read_can_lock(rwlock_t *rwlock) +{ + return atomic_read(&rwlock->readers) < 0; +} + +int __lockfunc rt_write_can_lock(rwlock_t *rwlock) +{ + return atomic_read(&rwlock->readers) == READER_BIAS; +} + +/* + * The common functions which get wrapped into the rwlock API. + */ +int __lockfunc rt_read_trylock(rwlock_t *rwlock) +{ + int ret; + + sleeping_lock_inc(); + migrate_disable(); + ret = do_read_rt_trylock(rwlock); + if (ret) { + rwlock_acquire_read(&rwlock->dep_map, 0, 1, _RET_IP_); + rcu_read_lock(); + } else { + migrate_enable(); + sleeping_lock_dec(); + } + return ret; +} +EXPORT_SYMBOL(rt_read_trylock); + +int __lockfunc rt_write_trylock(rwlock_t *rwlock) +{ + int ret; + + sleeping_lock_inc(); + migrate_disable(); + ret = do_write_rt_trylock(rwlock); + if (ret) { + rwlock_acquire(&rwlock->dep_map, 0, 1, _RET_IP_); + rcu_read_lock(); + } else { + migrate_enable(); + sleeping_lock_dec(); + } + return ret; +} +EXPORT_SYMBOL(rt_write_trylock); + +void __lockfunc rt_read_lock(rwlock_t *rwlock) +{ + sleeping_lock_inc(); + rcu_read_lock(); + migrate_disable(); + rwlock_acquire_read(&rwlock->dep_map, 0, 0, _RET_IP_); + do_read_rt_lock(rwlock); +} +EXPORT_SYMBOL(rt_read_lock); + +void __lockfunc rt_write_lock(rwlock_t *rwlock) +{ + sleeping_lock_inc(); + rcu_read_lock(); + migrate_disable(); + rwlock_acquire(&rwlock->dep_map, 0, 0, _RET_IP_); + do_write_rt_lock(rwlock); +} +EXPORT_SYMBOL(rt_write_lock); + +void __lockfunc rt_read_unlock(rwlock_t *rwlock) +{ + rwlock_release(&rwlock->dep_map, _RET_IP_); + do_read_rt_unlock(rwlock); + migrate_enable(); + rcu_read_unlock(); + sleeping_lock_dec(); +} +EXPORT_SYMBOL(rt_read_unlock); + +void __lockfunc rt_write_unlock(rwlock_t *rwlock) +{ + rwlock_release(&rwlock->dep_map, _RET_IP_); + do_write_rt_unlock(rwlock); + migrate_enable(); + rcu_read_unlock(); + sleeping_lock_dec(); +} +EXPORT_SYMBOL(rt_write_unlock); + +void __rt_rwlock_init(rwlock_t *rwlock, char *name, struct lock_class_key *key) +{ + do_rwlock_rt_init(rwlock, name, key); +} +EXPORT_SYMBOL(__rt_rwlock_init); @ kernel/locking/rwsem-rt.c:4 @ +/* + */ +#include <linux/blkdev.h> +#include <linux/rwsem.h> +#include <linux/sched/debug.h> +#include <linux/sched/signal.h> +#include <linux/export.h> + +#include "rtmutex_common.h" + +/* + * RT-specific reader/writer semaphores + * + * down_write() + * 1) Lock sem->rtmutex + * 2) Remove the reader BIAS to force readers into the slow path + * 3) Wait until all readers have left the critical region + * 4) Mark it write locked + * + * up_write() + * 1) Remove the write locked marker + * 2) Set the reader BIAS so readers can use the fast path again + * 3) Unlock sem->rtmutex to release blocked readers + * + * down_read() + * 1) Try fast path acquisition (reader BIAS is set) + * 2) Take sem->rtmutex.wait_lock which protects the writelocked flag + * 3) If !writelocked, acquire it for read + * 4) If writelocked, block on sem->rtmutex + * 5) unlock sem->rtmutex, goto 1) + * + * up_read() + * 1) Try fast path release (reader count != 1) + * 2) Wake the writer waiting in down_write()#3 + * + * down_read()#3 has the consequence, that rw semaphores on RT are not writer + * fair, but writers, which should be avoided in RT tasks (think mmap_sem), + * are subject to the rtmutex priority/DL inheritance mechanism. + * + * It's possible to make the rw semaphores writer fair by keeping a list of + * active readers. A blocked writer would force all newly incoming readers to + * block on the rtmutex, but the rtmutex would have to be proxy locked for one + * reader after the other. We can't use multi-reader inheritance because there + * is no way to support that with SCHED_DEADLINE. Implementing the one by one + * reader boosting/handover mechanism is a major surgery for a very dubious + * value. + * + * The risk of writer starvation is there, but the pathological use cases + * which trigger it are not necessarily the typical RT workloads. + */ + +void __rwsem_init(struct rw_semaphore *sem, const char *name, + struct lock_class_key *key) +{ +#ifdef CONFIG_DEBUG_LOCK_ALLOC + /* + * Make sure we are not reinitializing a held semaphore: + */ + debug_check_no_locks_freed((void *)sem, sizeof(*sem)); + lockdep_init_map(&sem->dep_map, name, key, 0); +#endif + atomic_set(&sem->readers, READER_BIAS); +} +EXPORT_SYMBOL(__rwsem_init); + +int __down_read_trylock(struct rw_semaphore *sem) +{ + int r, old; + + /* + * Increment reader count, if sem->readers < 0, i.e. READER_BIAS is + * set. + */ + for (r = atomic_read(&sem->readers); r < 0;) { + old = atomic_cmpxchg(&sem->readers, r, r + 1); + if (likely(old == r)) + return 1; + r = old; + } + return 0; +} + +static int __sched __down_read_common(struct rw_semaphore *sem, int state) +{ + struct rt_mutex *m = &sem->rtmutex; + struct rt_mutex_waiter waiter; + int ret; + + if (__down_read_trylock(sem)) + return 0; + /* + * If rt_mutex blocks, the function sched_submit_work will not call + * blk_schedule_flush_plug (because tsk_is_pi_blocked would be true). + * We must call blk_schedule_flush_plug here, if we don't call it, + * a deadlock in I/O may happen. + */ + if (unlikely(blk_needs_flush_plug(current))) + blk_schedule_flush_plug(current); + + might_sleep(); + raw_spin_lock_irq(&m->wait_lock); + /* + * Allow readers as long as the writer has not completely + * acquired the semaphore for write. + */ + if (atomic_read(&sem->readers) != WRITER_BIAS) { + atomic_inc(&sem->readers); + raw_spin_unlock_irq(&m->wait_lock); + return 0; + } + + /* + * Call into the slow lock path with the rtmutex->wait_lock + * held, so this can't result in the following race: + * + * Reader1 Reader2 Writer + * down_read() + * down_write() + * rtmutex_lock(m) + * swait() + * down_read() + * unlock(m->wait_lock) + * up_read() + * swake() + * lock(m->wait_lock) + * sem->writelocked=true + * unlock(m->wait_lock) + * + * up_write() + * sem->writelocked=false + * rtmutex_unlock(m) + * down_read() + * down_write() + * rtmutex_lock(m) + * swait() + * rtmutex_lock(m) + * + * That would put Reader1 behind the writer waiting on + * Reader2 to call up_read() which might be unbound. + */ + rt_mutex_init_waiter(&waiter, false); + ret = rt_mutex_slowlock_locked(m, state, NULL, RT_MUTEX_MIN_CHAINWALK, + NULL, &waiter); + /* + * The slowlock() above is guaranteed to return with the rtmutex (for + * ret = 0) is now held, so there can't be a writer active. Increment + * the reader count and immediately drop the rtmutex again. + * For ret != 0 we don't hold the rtmutex and need unlock the wait_lock. + * We don't own the lock then. + */ + if (!ret) + atomic_inc(&sem->readers); + raw_spin_unlock_irq(&m->wait_lock); + if (!ret) + __rt_mutex_unlock(m); + + debug_rt_mutex_free_waiter(&waiter); + return ret; +} + +void __down_read(struct rw_semaphore *sem) +{ + int ret; + + ret = __down_read_common(sem, TASK_UNINTERRUPTIBLE); + WARN_ON_ONCE(ret); +} + +int __down_read_killable(struct rw_semaphore *sem) +{ + int ret; + + ret = __down_read_common(sem, TASK_KILLABLE); + if (likely(!ret)) + return ret; + WARN_ONCE(ret != -EINTR, "Unexpected state: %d\n", ret); + return -EINTR; +} + +void __up_read(struct rw_semaphore *sem) +{ + struct rt_mutex *m = &sem->rtmutex; + struct task_struct *tsk; + + /* + * sem->readers can only hit 0 when a writer is waiting for the + * active readers to leave the critical region. + */ + if (!atomic_dec_and_test(&sem->readers)) + return; + + might_sleep(); + raw_spin_lock_irq(&m->wait_lock); + /* + * Wake the writer, i.e. the rtmutex owner. It might release the + * rtmutex concurrently in the fast path (due to a signal), but to + * clean up the rwsem it needs to acquire m->wait_lock. The worst + * case which can happen is a spurious wakeup. + */ + tsk = rt_mutex_owner(m); + if (tsk) + wake_up_process(tsk); + + raw_spin_unlock_irq(&m->wait_lock); +} + +static void __up_write_unlock(struct rw_semaphore *sem, int bias, + unsigned long flags) +{ + struct rt_mutex *m = &sem->rtmutex; + + atomic_add(READER_BIAS - bias, &sem->readers); + raw_spin_unlock_irqrestore(&m->wait_lock, flags); + __rt_mutex_unlock(m); +} + +static int __sched __down_write_common(struct rw_semaphore *sem, int state) +{ + struct rt_mutex *m = &sem->rtmutex; + unsigned long flags; + + /* Take the rtmutex as a first step */ + if (__rt_mutex_lock_state(m, state)) + return -EINTR; + + /* Force readers into slow path */ + atomic_sub(READER_BIAS, &sem->readers); + might_sleep(); + + set_current_state(state); + for (;;) { + raw_spin_lock_irqsave(&m->wait_lock, flags); + /* Have all readers left the critical region? */ + if (!atomic_read(&sem->readers)) { + atomic_set(&sem->readers, WRITER_BIAS); + __set_current_state(TASK_RUNNING); + raw_spin_unlock_irqrestore(&m->wait_lock, flags); + return 0; + } + + if (signal_pending_state(state, current)) { + __set_current_state(TASK_RUNNING); + __up_write_unlock(sem, 0, flags); + return -EINTR; + } + raw_spin_unlock_irqrestore(&m->wait_lock, flags); + + if (atomic_read(&sem->readers) != 0) { + schedule(); + set_current_state(state); + } + } +} + +void __sched __down_write(struct rw_semaphore *sem) +{ + __down_write_common(sem, TASK_UNINTERRUPTIBLE); +} + +int __sched __down_write_killable(struct rw_semaphore *sem) +{ + return __down_write_common(sem, TASK_KILLABLE); +} + +int __down_write_trylock(struct rw_semaphore *sem) +{ + struct rt_mutex *m = &sem->rtmutex; + unsigned long flags; + + if (!__rt_mutex_trylock(m)) + return 0; + + atomic_sub(READER_BIAS, &sem->readers); + + raw_spin_lock_irqsave(&m->wait_lock, flags); + if (!atomic_read(&sem->readers)) { + atomic_set(&sem->readers, WRITER_BIAS); + raw_spin_unlock_irqrestore(&m->wait_lock, flags); + return 1; + } + __up_write_unlock(sem, 0, flags); + return 0; +} + +void __up_write(struct rw_semaphore *sem) +{ + struct rt_mutex *m = &sem->rtmutex; + unsigned long flags; + + raw_spin_lock_irqsave(&m->wait_lock, flags); + __up_write_unlock(sem, WRITER_BIAS, flags); +} + +void __downgrade_write(struct rw_semaphore *sem) +{ + struct rt_mutex *m = &sem->rtmutex; + unsigned long flags; + + raw_spin_lock_irqsave(&m->wait_lock, flags); + /* Release it and account current as reader */ + __up_write_unlock(sem, WRITER_BIAS - 1, flags); +} @ kernel/locking/rwsem.c:31 @ #include <linux/rwsem.h> #include <linux/atomic.h> -#include "rwsem.h" +#ifndef CONFIG_PREEMPT_RT #include "lock_events.h" /* @ kernel/locking/rwsem.c:663 @ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem, unsigned long flags; bool ret = true; - BUILD_BUG_ON(!(RWSEM_OWNER_UNKNOWN & RWSEM_NONSPINNABLE)); - if (need_resched()) { lockevent_inc(rwsem_opt_fail); return false; @ kernel/locking/rwsem.c:1336 @ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem) return sem; } + /* * lock for reading */ -inline void __down_read(struct rw_semaphore *sem) +static inline void __down_read(struct rw_semaphore *sem) { if (!rwsem_read_trylock(sem)) { rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE); @ kernel/locking/rwsem.c:1428 @ static inline int __down_write_trylock(struct rw_semaphore *sem) /* * unlock after reading */ -inline void __up_read(struct rw_semaphore *sem) +static inline void __up_read(struct rw_semaphore *sem) { long tmp; @ kernel/locking/rwsem.c:1487 @ static inline void __downgrade_write(struct rw_semaphore *sem) if (tmp & RWSEM_FLAG_WAITERS) rwsem_downgrade_wake(sem); } +#endif /* * lock for reading @ kernel/locking/rwsem.c:1623 @ void down_read_non_owner(struct rw_semaphore *sem) { might_sleep(); __down_read(sem); +#ifndef CONFIG_PREEMPT_RT __rwsem_set_reader_owned(sem, NULL); +#endif } EXPORT_SYMBOL(down_read_non_owner); @ kernel/locking/rwsem.c:1654 @ EXPORT_SYMBOL(down_write_killable_nested); void up_read_non_owner(struct rw_semaphore *sem) { +#ifndef CONFIG_PREEMPT_RT DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem); +#endif __up_read(sem); } EXPORT_SYMBOL(up_read_non_owner); @ kernel/locking/rwsem.h:1 @ -/* SPDX-License-Identifier: GPL-2.0 */ - -#ifndef __INTERNAL_RWSEM_H -#define __INTERNAL_RWSEM_H -#include <linux/rwsem.h> - -extern void __down_read(struct rw_semaphore *sem); -extern void __up_read(struct rw_semaphore *sem); - -#endif /* __INTERNAL_RWSEM_H */ @ kernel/locking/spinlock.c:127 @ void __lockfunc __raw_##op##_lock_bh(locktype##_t *lock) \ * __[spin|read|write]_lock_bh() */ BUILD_LOCK_OPS(spin, raw_spinlock); + +#ifndef CONFIG_PREEMPT_RT BUILD_LOCK_OPS(read, rwlock); BUILD_LOCK_OPS(write, rwlock); +#endif #endif @ kernel/locking/spinlock.c:215 @ void __lockfunc _raw_spin_unlock_bh(raw_spinlock_t *lock) EXPORT_SYMBOL(_raw_spin_unlock_bh); #endif +#ifndef CONFIG_PREEMPT_RT + #ifndef CONFIG_INLINE_READ_TRYLOCK int __lockfunc _raw_read_trylock(rwlock_t *lock) { @ kernel/locking/spinlock.c:361 @ void __lockfunc _raw_write_unlock_bh(rwlock_t *lock) EXPORT_SYMBOL(_raw_write_unlock_bh); #endif +#endif /* !PREEMPT_RT */ + #ifdef CONFIG_DEBUG_LOCK_ALLOC void __lockfunc _raw_spin_lock_nested(raw_spinlock_t *lock, int subclass) @ kernel/locking/spinlock_debug.c:34 @ void __raw_spin_lock_init(raw_spinlock_t *lock, const char *name, EXPORT_SYMBOL(__raw_spin_lock_init); +#ifndef CONFIG_PREEMPT_RT void __rwlock_init(rwlock_t *lock, const char *name, struct lock_class_key *key) { @ kernel/locking/spinlock_debug.c:52 @ void __rwlock_init(rwlock_t *lock, const char *name, } EXPORT_SYMBOL(__rwlock_init); +#endif static void spin_dump(raw_spinlock_t *lock, const char *msg) { @ kernel/locking/spinlock_debug.c:144 @ void do_raw_spin_unlock(raw_spinlock_t *lock) arch_spin_unlock(&lock->raw_lock); } +#ifndef CONFIG_PREEMPT_RT static void rwlock_bug(rwlock_t *lock, const char *msg) { if (!debug_locks_off()) @ kernel/locking/spinlock_debug.c:234 @ void do_raw_write_unlock(rwlock_t *lock) debug_write_unlock(lock); arch_write_unlock(&lock->raw_lock); } + +#endif @ kernel/panic.c:240 @ void panic(const char *fmt, ...) * Bypass the panic_cpu check and call __crash_kexec directly. */ if (!_crash_kexec_post_notifiers) { - printk_safe_flush_on_panic(); __crash_kexec(NULL); /* @ kernel/panic.c:263 @ void panic(const char *fmt, ...) */ atomic_notifier_call_chain(&panic_notifier_list, 0, buf); - /* Call flush even twice. It tries harder with a single online CPU */ - printk_safe_flush_on_panic(); kmsg_dump(KMSG_DUMP_PANIC); /* @ kernel/panic.c:524 @ static u64 oops_id; static int init_oops_id(void) { +#ifndef CONFIG_PREEMPT_RT if (!oops_id) get_random_bytes(&oops_id, sizeof(oops_id)); else +#endif oops_id++; return 0; @ kernel/printk/Makefile:1 @ # SPDX-License-Identifier: GPL-2.0-only obj-y = printk.o -obj-$(CONFIG_PRINTK) += printk_safe.o obj-$(CONFIG_A11Y_BRAILLE_CONSOLE) += braille.o @ kernel/printk/internal.h:1 @ -/* SPDX-License-Identifier: GPL-2.0-or-later */ -/* - * internal.h - printk internal definitions - */ -#include <linux/percpu.h> - -#ifdef CONFIG_PRINTK - -#define PRINTK_SAFE_CONTEXT_MASK 0x3fffffff -#define PRINTK_NMI_DIRECT_CONTEXT_MASK 0x40000000 -#define PRINTK_NMI_CONTEXT_MASK 0x80000000 - -extern raw_spinlock_t logbuf_lock; - -__printf(5, 0) -int vprintk_store(int facility, int level, - const char *dict, size_t dictlen, - const char *fmt, va_list args); - -__printf(1, 0) int vprintk_default(const char *fmt, va_list args); -__printf(1, 0) int vprintk_deferred(const char *fmt, va_list args); -__printf(1, 0) int vprintk_func(const char *fmt, va_list args); -void __printk_safe_enter(void); -void __printk_safe_exit(void); - -void printk_safe_init(void); -bool printk_percpu_data_ready(void); - -#define printk_safe_enter_irqsave(flags) \ - do { \ - local_irq_save(flags); \ - __printk_safe_enter(); \ - } while (0) - -#define printk_safe_exit_irqrestore(flags) \ - do { \ - __printk_safe_exit(); \ - local_irq_restore(flags); \ - } while (0) - -#define printk_safe_enter_irq() \ - do { \ - local_irq_disable(); \ - __printk_safe_enter(); \ - } while (0) - -#define printk_safe_exit_irq() \ - do { \ - __printk_safe_exit(); \ - local_irq_enable(); \ - } while (0) - -void defer_console_output(void); - -#else - -__printf(1, 0) int vprintk_func(const char *fmt, va_list args) { return 0; } - -/* - * In !PRINTK builds we still export logbuf_lock spin_lock, console_sem - * semaphore and some of console functions (console_unlock()/etc.), so - * printk-safe must preserve the existing local IRQ guarantees. - */ -#define printk_safe_enter_irqsave(flags) local_irq_save(flags) -#define printk_safe_exit_irqrestore(flags) local_irq_restore(flags) - -#define printk_safe_enter_irq() local_irq_disable() -#define printk_safe_exit_irq() local_irq_enable() - -static inline void printk_safe_init(void) { } -static inline bool printk_percpu_data_ready(void) { return false; } -#endif /* CONFIG_PRINTK */ @ kernel/printk/printk.c:48 @ #include <linux/irq_work.h> #include <linux/ctype.h> #include <linux/uio.h> +#include <linux/kthread.h> +#include <linux/clocksource.h> +#include <linux/printk_ringbuffer.h> #include <linux/sched/clock.h> #include <linux/sched/debug.h> #include <linux/sched/task_stack.h> @ kernel/printk/printk.c:64 @ #include "console_cmdline.h" #include "braille.h" -#include "internal.h" -int console_printk[4] = { +int console_printk[5] = { CONSOLE_LOGLEVEL_DEFAULT, /* console_loglevel */ MESSAGE_LOGLEVEL_DEFAULT, /* default_message_loglevel */ CONSOLE_LOGLEVEL_MIN, /* minimum_console_loglevel */ CONSOLE_LOGLEVEL_DEFAULT, /* default_console_loglevel */ + CONSOLE_LOGLEVEL_EMERGENCY, /* emergency_console_loglevel */ }; EXPORT_SYMBOL_GPL(console_printk); @ kernel/printk/printk.c:231 @ static int nr_ext_console_drivers; static int __down_trylock_console_sem(unsigned long ip) { - int lock_failed; - unsigned long flags; - - /* - * Here and in __up_console_sem() we need to be in safe mode, - * because spindump/WARN/etc from under console ->lock will - * deadlock in printk()->down_trylock_console_sem() otherwise. - */ - printk_safe_enter_irqsave(flags); - lock_failed = down_trylock(&console_sem); - printk_safe_exit_irqrestore(flags); - - if (lock_failed) + if (down_trylock(&console_sem)) return 1; mutex_acquire(&console_lock_dep_map, 0, 1, ip); return 0; @ kernel/printk/printk.c:240 @ static int __down_trylock_console_sem(unsigned long ip) static void __up_console_sem(unsigned long ip) { - unsigned long flags; - mutex_release(&console_lock_dep_map, ip); - printk_safe_enter_irqsave(flags); up(&console_sem); - printk_safe_exit_irqrestore(flags); } #define up_console_sem() __up_console_sem(_RET_IP_) @ kernel/printk/printk.c:256 @ static void __up_console_sem(unsigned long ip) */ static int console_locked, console_suspended; -/* - * If exclusive_console is non-NULL then only this console is to be printed to. - */ -static struct console *exclusive_console; - /* * Array of consoles built from command line options (console=) */ @ kernel/printk/printk.c:352 @ enum log_flags { struct printk_log { u64 ts_nsec; /* timestamp in nanoseconds */ + u16 cpu; /* cpu that generated record */ u16 len; /* length of entire record */ u16 text_len; /* length of text buffer */ u16 dict_len; /* length of dictionary buffer */ @ kernel/printk/printk.c:368 @ __packed __aligned(4) #endif ; -/* - * The logbuf_lock protects kmsg buffer, indices, counters. This can be taken - * within the scheduler's rq lock. It must be released before calling - * console_unlock() or anything else that might wake up a process. - */ -DEFINE_RAW_SPINLOCK(logbuf_lock); - -/* - * Helper macros to lock/unlock logbuf_lock and switch between - * printk-safe/unsafe modes. - */ -#define logbuf_lock_irq() \ - do { \ - printk_safe_enter_irq(); \ - raw_spin_lock(&logbuf_lock); \ - } while (0) - -#define logbuf_unlock_irq() \ - do { \ - raw_spin_unlock(&logbuf_lock); \ - printk_safe_exit_irq(); \ - } while (0) - -#define logbuf_lock_irqsave(flags) \ - do { \ - printk_safe_enter_irqsave(flags); \ - raw_spin_lock(&logbuf_lock); \ - } while (0) - -#define logbuf_unlock_irqrestore(flags) \ - do { \ - raw_spin_unlock(&logbuf_lock); \ - printk_safe_exit_irqrestore(flags); \ - } while (0) +DECLARE_STATIC_PRINTKRB_CPULOCK(printk_cpulock); #ifdef CONFIG_PRINTK -DECLARE_WAIT_QUEUE_HEAD(log_wait); -/* the next printk record to read by syslog(READ) or /proc/kmsg */ +/* record buffer */ +DECLARE_STATIC_PRINTKRB(printk_rb, CONFIG_LOG_BUF_SHIFT, &printk_cpulock); + +static DEFINE_MUTEX(syslog_lock); +DECLARE_STATIC_PRINTKRB_ITER(syslog_iter, &printk_rb); + +/* the last printk record to read by syslog(READ) or /proc/kmsg */ static u64 syslog_seq; -static u32 syslog_idx; static size_t syslog_partial; static bool syslog_time; -/* index and sequence number of the first record stored in the buffer */ -static u64 log_first_seq; -static u32 log_first_idx; - -/* index and sequence number of the next record to store in the buffer */ -static u64 log_next_seq; -static u32 log_next_idx; - -/* the next printk record to write to the console */ -static u64 console_seq; -static u32 console_idx; -static u64 exclusive_console_stop_seq; - /* the next printk record to read after the last 'clear' command */ static u64 clear_seq; -static u32 clear_idx; #ifdef CONFIG_PRINTK_CALLER #define PREFIX_MAX 48 @ kernel/printk/printk.c:395 @ static u32 clear_idx; #define LOG_LEVEL(v) ((v) & 0x07) #define LOG_FACILITY(v) ((v) >> 3 & 0xff) -/* record buffer */ -#define LOG_ALIGN __alignof__(struct printk_log) -#define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT) -#define LOG_BUF_LEN_MAX (u32)(1 << 31) -static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN); -static char *log_buf = __log_buf; -static u32 log_buf_len = __LOG_BUF_LEN; - -/* - * We cannot access per-CPU data (e.g. per-CPU flush irq_work) before - * per_cpu_areas are initialised. This variable is set to true when - * it's safe to access per-CPU data. - */ -static bool __printk_percpu_data_ready __read_mostly; - -bool printk_percpu_data_ready(void) -{ - return __printk_percpu_data_ready; -} - /* Return log buffer address */ char *log_buf_addr_get(void) { - return log_buf; + return printk_rb.buffer; } /* Return log buffer size */ u32 log_buf_len_get(void) { - return log_buf_len; + return (1 << printk_rb.size_bits); } /* human readable text of the record */ @ kernel/printk/printk.c:419 @ static char *log_dict(const struct printk_log *msg) return (char *)msg + sizeof(struct printk_log) + msg->text_len; } -/* get record by index; idx must point to valid msg */ -static struct printk_log *log_from_idx(u32 idx) -{ - struct printk_log *msg = (struct printk_log *)(log_buf + idx); - - /* - * A length == 0 record is the end of buffer marker. Wrap around and - * read the message at the start of the buffer. - */ - if (!msg->len) - return (struct printk_log *)log_buf; - return msg; -} - -/* get next record; idx must point to valid msg */ -static u32 log_next(u32 idx) -{ - struct printk_log *msg = (struct printk_log *)(log_buf + idx); - - /* length == 0 indicates the end of the buffer; wrap */ - /* - * A length == 0 record is the end of buffer marker. Wrap around and - * read the message at the start of the buffer as *this* one, and - * return the one after that. - */ - if (!msg->len) { - msg = (struct printk_log *)log_buf; - return msg->len; - } - return idx + msg->len; -} - -/* - * Check whether there is enough free space for the given message. - * - * The same values of first_idx and next_idx mean that the buffer - * is either empty or full. - * - * If the buffer is empty, we must respect the position of the indexes. - * They cannot be reset to the beginning of the buffer. - */ -static int logbuf_has_space(u32 msg_size, bool empty) -{ - u32 free; - - if (log_next_idx > log_first_idx || empty) - free = max(log_buf_len - log_next_idx, log_first_idx); - else - free = log_first_idx - log_next_idx; - - /* - * We need space also for an empty header that signalizes wrapping - * of the buffer. - */ - return free >= msg_size + sizeof(struct printk_log); -} - -static int log_make_free_space(u32 msg_size) -{ - while (log_first_seq < log_next_seq && - !logbuf_has_space(msg_size, false)) { - /* drop old messages until we have enough contiguous space */ - log_first_idx = log_next(log_first_idx); - log_first_seq++; - } - - if (clear_seq < log_first_seq) { - clear_seq = log_first_seq; - clear_idx = log_first_idx; - } - - /* sequence numbers are equal, so the log buffer is empty */ - if (logbuf_has_space(msg_size, log_first_seq == log_next_seq)) - return 0; - - return -ENOMEM; -} - -/* compute the message size including the padding bytes */ -static u32 msg_used_size(u16 text_len, u16 dict_len, u32 *pad_len) -{ - u32 size; - - size = sizeof(struct printk_log) + text_len + dict_len; - *pad_len = (-size) & (LOG_ALIGN - 1); - size += *pad_len; - - return size; -} - -/* - * Define how much of the log buffer we could take at maximum. The value - * must be greater than two. Note that only half of the buffer is available - * when the index points to the middle. - */ -#define MAX_LOG_TAKE_PART 4 -static const char trunc_msg[] = "<truncated>"; - -static u32 truncate_msg(u16 *text_len, u16 *trunc_msg_len, - u16 *dict_len, u32 *pad_len) -{ - /* - * The message should not take the whole buffer. Otherwise, it might - * get removed too soon. - */ - u32 max_text_len = log_buf_len / MAX_LOG_TAKE_PART; - if (*text_len > max_text_len) - *text_len = max_text_len; - /* enable the warning message */ - *trunc_msg_len = strlen(trunc_msg); - /* disable the "dict" completely */ - *dict_len = 0; - /* compute the size again, count also the warning message */ - return msg_used_size(*text_len + *trunc_msg_len, 0, pad_len); -} +static void printk_emergency(char *buffer, int level, u64 ts_nsec, u16 cpu, + char *text, u16 text_len); /* insert record into the buffer, discard old ones, update heads */ static int log_store(u32 caller_id, int facility, int level, - enum log_flags flags, u64 ts_nsec, + enum log_flags flags, u64 ts_nsec, u16 cpu, const char *dict, u16 dict_len, const char *text, u16 text_len) { struct printk_log *msg; - u32 size, pad_len; - u16 trunc_msg_len = 0; + struct prb_handle h; + char *rbuf; + u32 size; - /* number of '\0' padding bytes to next message */ - size = msg_used_size(text_len, dict_len, &pad_len); + size = sizeof(*msg) + text_len + dict_len; - if (log_make_free_space(size)) { - /* truncate the message if it is too long for empty buffer */ - size = truncate_msg(&text_len, &trunc_msg_len, - &dict_len, &pad_len); - /* survive when the log buffer is too small for trunc_msg */ - if (log_make_free_space(size)) - return 0; - } - - if (log_next_idx + size + sizeof(struct printk_log) > log_buf_len) { + rbuf = prb_reserve(&h, &printk_rb, size); + if (!rbuf) { /* - * This message + an additional empty header does not fit - * at the end of the buffer. Add an empty header with len == 0 - * to signify a wrap around. + * An emergency message would have been printed, but + * it cannot be stored in the log. */ - memset(log_buf + log_next_idx, 0, sizeof(struct printk_log)); - log_next_idx = 0; + prb_inc_lost(&printk_rb); + return 0; } /* fill message */ - msg = (struct printk_log *)(log_buf + log_next_idx); + msg = (struct printk_log *)rbuf; memcpy(log_text(msg), text, text_len); msg->text_len = text_len; - if (trunc_msg_len) { - memcpy(log_text(msg) + text_len, trunc_msg, trunc_msg_len); - msg->text_len += trunc_msg_len; - } memcpy(log_dict(msg), dict, dict_len); msg->dict_len = dict_len; msg->facility = facility; msg->level = level & 7; msg->flags = flags & 0x1f; - if (ts_nsec > 0) - msg->ts_nsec = ts_nsec; - else - msg->ts_nsec = local_clock(); + msg->ts_nsec = ts_nsec; #ifdef CONFIG_PRINTK_CALLER msg->caller_id = caller_id; #endif - memset(log_dict(msg) + dict_len, 0, pad_len); + msg->cpu = cpu; msg->len = size; /* insert message */ - log_next_idx += msg->len; - log_next_seq++; + prb_commit(&h); return msg->text_len; } @ kernel/printk/printk.c:532 @ static ssize_t msg_print_ext_header(char *buf, size_t size, do_div(ts_usec, 1000); - return scnprintf(buf, size, "%u,%llu,%llu,%c%s;", + return scnprintf(buf, size, "%u,%llu,%llu,%c%s,%hu;", (msg->facility << 3) | msg->level, seq, ts_usec, - msg->flags & LOG_CONT ? 'c' : '-', caller); + msg->flags & LOG_CONT ? 'c' : '-', caller, msg->cpu); } static ssize_t msg_print_ext_body(char *buf, size_t size, @ kernel/printk/printk.c:585 @ static ssize_t msg_print_ext_body(char *buf, size_t size, return p - buf; } +#define PRINTK_SPRINT_MAX (LOG_LINE_MAX + PREFIX_MAX) +#define PRINTK_RECORD_MAX (sizeof(struct printk_log) + \ + CONSOLE_EXT_LOG_MAX + PRINTK_SPRINT_MAX) + /* /dev/kmsg - userspace message inject/listen interface */ struct devkmsg_user { u64 seq; - u32 idx; + struct prb_iterator iter; struct ratelimit_state rs; struct mutex lock; char buf[CONSOLE_EXT_LOG_MAX]; + char msgbuf[PRINTK_RECORD_MAX]; }; static __printf(3, 4) __cold @ kernel/printk/printk.c:679 @ static ssize_t devkmsg_read(struct file *file, char __user *buf, size_t count, loff_t *ppos) { struct devkmsg_user *user = file->private_data; + struct prb_iterator backup_iter; struct printk_log *msg; - size_t len; ssize_t ret; + size_t len; + u64 seq; if (!user) return -EBADF; @ kernel/printk/printk.c:692 @ static ssize_t devkmsg_read(struct file *file, char __user *buf, if (ret) return ret; - logbuf_lock_irq(); - while (user->seq == log_next_seq) { - if (file->f_flags & O_NONBLOCK) { - ret = -EAGAIN; - logbuf_unlock_irq(); - goto out; - } + /* make a backup copy in case there is a problem */ + prb_iter_copy(&backup_iter, &user->iter); - logbuf_unlock_irq(); - ret = wait_event_interruptible(log_wait, - user->seq != log_next_seq); - if (ret) - goto out; - logbuf_lock_irq(); + if (file->f_flags & O_NONBLOCK) { + ret = prb_iter_next(&user->iter, &user->msgbuf[0], + sizeof(user->msgbuf), &seq); + } else { + ret = prb_iter_wait_next(&user->iter, &user->msgbuf[0], + sizeof(user->msgbuf), &seq); } - - if (user->seq < log_first_seq) { - /* our last seen message is gone, return error and reset */ - user->idx = log_first_idx; - user->seq = log_first_seq; + if (ret == 0) { + /* end of list */ + ret = -EAGAIN; + goto out; + } else if (ret == -EINVAL) { + /* iterator invalid, return error and reset */ ret = -EPIPE; - logbuf_unlock_irq(); + prb_iter_init(&user->iter, &printk_rb, &user->seq); + goto out; + } else if (ret < 0) { + /* interrupted by signal */ goto out; } - msg = log_from_idx(user->idx); + user->seq++; + if (user->seq < seq) { + ret = -EPIPE; + goto restore_out; + } + + msg = (struct printk_log *)&user->msgbuf[0]; len = msg_print_ext_header(user->buf, sizeof(user->buf), msg, user->seq); len += msg_print_ext_body(user->buf + len, sizeof(user->buf) - len, log_dict(msg), msg->dict_len, log_text(msg), msg->text_len); - user->idx = log_next(user->idx); - user->seq++; - logbuf_unlock_irq(); - if (len > count) { ret = -EINVAL; - goto out; + goto restore_out; } if (copy_to_user(buf, user->buf, len)) { ret = -EFAULT; - goto out; + goto restore_out; } + ret = len; + goto out; +restore_out: + /* + * There was an error, but this message should not be + * lost because of it. Restore the backup and setup + * seq so that it will work with the next read. + */ + prb_iter_copy(&user->iter, &backup_iter); + user->seq = seq - 1; out: mutex_unlock(&user->lock); return ret; @ kernel/printk/printk.c:757 @ static ssize_t devkmsg_read(struct file *file, char __user *buf, static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence) { struct devkmsg_user *user = file->private_data; - loff_t ret = 0; + loff_t ret; + u64 seq; if (!user) return -EBADF; if (offset) return -ESPIPE; - logbuf_lock_irq(); + ret = mutex_lock_interruptible(&user->lock); + if (ret) + return ret; + switch (whence) { case SEEK_SET: /* the first record */ - user->idx = log_first_idx; - user->seq = log_first_seq; + prb_iter_init(&user->iter, &printk_rb, &user->seq); break; case SEEK_DATA: /* @ kernel/printk/printk.c:780 @ static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence) * like issued by 'dmesg -c'. Reading /dev/kmsg itself * changes no global state, and does not clear anything. */ - user->idx = clear_idx; - user->seq = clear_seq; + for (;;) { + prb_iter_init(&user->iter, &printk_rb, &seq); + ret = prb_iter_seek(&user->iter, clear_seq); + if (ret > 0) { + /* seeked to clear seq */ + user->seq = clear_seq; + break; + } else if (ret == 0) { + /* + * The end of the list was hit without + * ever seeing the clear seq. Just + * seek to the beginning of the list. + */ + prb_iter_init(&user->iter, &printk_rb, + &user->seq); + break; + } + /* iterator invalid, start over */ + + /* reset clear_seq if it is no longer available */ + if (seq > clear_seq) + clear_seq = 0; + } + ret = 0; break; case SEEK_END: /* after the last record */ - user->idx = log_next_idx; - user->seq = log_next_seq; + for (;;) { + ret = prb_iter_next(&user->iter, NULL, 0, &user->seq); + if (ret == 0) + break; + else if (ret > 0) + continue; + /* iterator invalid, start over */ + prb_iter_init(&user->iter, &printk_rb, &user->seq); + } + ret = 0; break; default: ret = -EINVAL; } - logbuf_unlock_irq(); + + mutex_unlock(&user->lock); return ret; } +struct wait_queue_head *printk_wait_queue(void) +{ + /* FIXME: using prb internals! */ + return printk_rb.wq; +} + static __poll_t devkmsg_poll(struct file *file, poll_table *wait) { struct devkmsg_user *user = file->private_data; + struct prb_iterator iter; __poll_t ret = 0; + int rbret; + u64 seq; if (!user) return EPOLLERR|EPOLLNVAL; - poll_wait(file, &log_wait, wait); + poll_wait(file, printk_wait_queue(), wait); - logbuf_lock_irq(); - if (user->seq < log_next_seq) { - /* return error when data has vanished underneath us */ - if (user->seq < log_first_seq) - ret = EPOLLIN|EPOLLRDNORM|EPOLLERR|EPOLLPRI; - else - ret = EPOLLIN|EPOLLRDNORM; - } - logbuf_unlock_irq(); + mutex_lock(&user->lock); + + /* use copy so no actual iteration takes place */ + prb_iter_copy(&iter, &user->iter); + + rbret = prb_iter_next(&iter, &user->msgbuf[0], + sizeof(user->msgbuf), &seq); + if (rbret == 0) + goto out; + + ret = EPOLLIN|EPOLLRDNORM; + + if (rbret < 0 || (seq - user->seq) != 1) + ret |= EPOLLERR|EPOLLPRI; +out: + mutex_unlock(&user->lock); return ret; } @ kernel/printk/printk.c:890 @ static int devkmsg_open(struct inode *inode, struct file *file) mutex_init(&user->lock); - logbuf_lock_irq(); - user->idx = log_first_idx; - user->seq = log_first_seq; - logbuf_unlock_irq(); + prb_iter_init(&user->iter, &printk_rb, &user->seq); file->private_data = user; return 0; @ kernel/printk/printk.c:930 @ const struct file_operations kmsg_fops = { */ void log_buf_vmcoreinfo_setup(void) { - VMCOREINFO_SYMBOL(log_buf); - VMCOREINFO_SYMBOL(log_buf_len); - VMCOREINFO_SYMBOL(log_first_idx); - VMCOREINFO_SYMBOL(clear_idx); - VMCOREINFO_SYMBOL(log_next_idx); /* * Export struct printk_log size and field offsets. User space tools can * parse it and detect any changes to structure down the line. @ kernel/printk/printk.c:945 @ void log_buf_vmcoreinfo_setup(void) } #endif +/* FIXME: no support for buffer resizing */ +#if 0 /* requested log_buf_len from kernel cmdline */ static unsigned long __initdata new_log_buf_len; @ kernel/printk/printk.c:1012 @ static void __init log_buf_add_cpu(void) #else /* !CONFIG_SMP */ static inline void log_buf_add_cpu(void) {} #endif /* CONFIG_SMP */ - -static void __init set_percpu_data_ready(void) -{ - printk_safe_init(); - /* Make sure we set this flag only after printk_safe() init is done */ - barrier(); - __printk_percpu_data_ready = true; -} +#endif /* 0 */ void __init setup_log_buf(int early) { +/* FIXME: no support for buffer resizing */ +#if 0 unsigned long flags; char *new_log_buf; unsigned int free; - /* - * Some archs call setup_log_buf() multiple times - first is very - * early, e.g. from setup_arch(), and second - when percpu_areas - * are initialised. - */ - if (!early) - set_percpu_data_ready(); - if (log_buf != __log_buf) return; @ kernel/printk/printk.c:1049 @ void __init setup_log_buf(int early) pr_info("log_buf_len: %u bytes\n", log_buf_len); pr_info("early log buf free: %u(%u%%)\n", free, (free * 100) / __LOG_BUF_LEN); +#endif } static bool __read_mostly ignore_loglevel; @ kernel/printk/printk.c:1130 @ static inline void boot_delay_msec(int level) static bool printk_time = IS_ENABLED(CONFIG_PRINTK_TIME); module_param_named(time, printk_time, bool, S_IRUGO | S_IWUSR); +static size_t print_cpu(u16 cpu, char *buf) +{ + return sprintf(buf, "%03hu: ", cpu); +} + static size_t print_syslog(unsigned int level, char *buf) { return sprintf(buf, "<%u>", level); @ kernel/printk/printk.c:1178 @ static size_t print_prefix(const struct printk_log *msg, bool syslog, buf[len++] = ' '; buf[len] = '\0'; } + len += print_cpu(msg->cpu, buf + len); return len; } @ kernel/printk/printk.c:1224 @ static size_t msg_print_text(const struct printk_log *msg, bool syslog, return len; } -static int syslog_print(char __user *buf, int size) +static int syslog_print(char __user *buf, int size, char *text, + char *msgbuf, int *locked) { - char *text; + struct prb_iterator iter; struct printk_log *msg; int len = 0; - - text = kmalloc(LOG_LINE_MAX + PREFIX_MAX, GFP_KERNEL); - if (!text) - return -ENOMEM; + u64 seq; + int ret; while (size > 0) { size_t n; size_t skip; - logbuf_lock_irq(); - if (syslog_seq < log_first_seq) { - /* messages are gone, move to first one */ - syslog_seq = log_first_seq; - syslog_idx = log_first_idx; - syslog_partial = 0; - } - if (syslog_seq == log_next_seq) { - logbuf_unlock_irq(); + for (;;) { + prb_iter_copy(&iter, &syslog_iter); + ret = prb_iter_next(&iter, msgbuf, + PRINTK_RECORD_MAX, &seq); + if (ret < 0) { + /* messages are gone, move to first one */ + prb_iter_init(&syslog_iter, &printk_rb, + &syslog_seq); + syslog_partial = 0; + continue; + } break; } + if (ret == 0) + break; + + /* + * If messages have been missed, the partial tracker + * is no longer valid and must be reset. + */ + if (syslog_seq > 0 && seq - 1 != syslog_seq) { + syslog_seq = seq - 1; + syslog_partial = 0; + } /* * To keep reading/counting partial line consistent, @ kernel/printk/printk.c:1269 @ static int syslog_print(char __user *buf, int size) if (!syslog_partial) syslog_time = printk_time; + msg = (struct printk_log *)msgbuf; + skip = syslog_partial; - msg = log_from_idx(syslog_idx); n = msg_print_text(msg, true, syslog_time, text, - LOG_LINE_MAX + PREFIX_MAX); + PRINTK_SPRINT_MAX); if (n - syslog_partial <= size) { /* message fits into buffer, move forward */ - syslog_idx = log_next(syslog_idx); - syslog_seq++; + prb_iter_next(&syslog_iter, NULL, 0, &syslog_seq); n -= syslog_partial; syslog_partial = 0; - } else if (!len){ + } else if (!len) { /* partial read(), remember position */ n = size; syslog_partial += n; } else n = 0; - logbuf_unlock_irq(); if (!n) break; + mutex_unlock(&syslog_lock); if (copy_to_user(buf, text + skip, n)) { if (!len) len = -EFAULT; + *locked = 0; break; } + ret = mutex_lock_interruptible(&syslog_lock); len += n; size -= n; buf += n; - } - kfree(text); - return len; -} - -static int syslog_print_all(char __user *buf, int size, bool clear) -{ - char *text; - int len = 0; - u64 next_seq; - u64 seq; - u32 idx; - bool time; - - text = kmalloc(LOG_LINE_MAX + PREFIX_MAX, GFP_KERNEL); - if (!text) - return -ENOMEM; - - time = printk_time; - logbuf_lock_irq(); - /* - * Find first record that fits, including all following records, - * into the user-provided buffer for this dump. - */ - seq = clear_seq; - idx = clear_idx; - while (seq < log_next_seq) { - struct printk_log *msg = log_from_idx(idx); - - len += msg_print_text(msg, true, time, NULL, 0); - idx = log_next(idx); - seq++; - } - - /* move first record forward until length fits into the buffer */ - seq = clear_seq; - idx = clear_idx; - while (len > size && seq < log_next_seq) { - struct printk_log *msg = log_from_idx(idx); - - len -= msg_print_text(msg, true, time, NULL, 0); - idx = log_next(idx); - seq++; - } - - /* last message fitting into this dump */ - next_seq = log_next_seq; - - len = 0; - while (len >= 0 && seq < next_seq) { - struct printk_log *msg = log_from_idx(idx); - int textlen = msg_print_text(msg, true, time, text, - LOG_LINE_MAX + PREFIX_MAX); - - idx = log_next(idx); - seq++; - - logbuf_unlock_irq(); - if (copy_to_user(buf + len, text, textlen)) - len = -EFAULT; - else - len += textlen; - logbuf_lock_irq(); - - if (seq < log_first_seq) { - /* messages are gone, move to next one */ - seq = log_first_seq; - idx = log_first_idx; + if (ret) { + if (!len) + len = ret; + *locked = 0; + break; } } - if (clear) { - clear_seq = log_next_seq; - clear_idx = log_next_idx; - } - logbuf_unlock_irq(); + return len; +} + +static int count_remaining(struct prb_iterator *iter, u64 until_seq, + char *msgbuf, int size, bool records, bool time) +{ + struct prb_iterator local_iter; + struct printk_log *msg; + int len = 0; + u64 seq; + int ret; + + prb_iter_copy(&local_iter, iter); + for (;;) { + ret = prb_iter_next(&local_iter, msgbuf, size, &seq); + if (ret == 0) { + break; + } else if (ret < 0) { + /* the iter is invalid, restart from head */ + prb_iter_init(&local_iter, &printk_rb, NULL); + len = 0; + continue; + } + + if (until_seq && seq >= until_seq) + break; + + if (records) { + len++; + } else { + msg = (struct printk_log *)msgbuf; + len += msg_print_text(msg, true, time, NULL, 0); + } + } - kfree(text); return len; } static void syslog_clear(void) { - logbuf_lock_irq(); - clear_seq = log_next_seq; - clear_idx = log_next_idx; - logbuf_unlock_irq(); + struct prb_iterator iter; + int ret; + + prb_iter_init(&iter, &printk_rb, &clear_seq); + for (;;) { + ret = prb_iter_next(&iter, NULL, 0, &clear_seq); + if (ret == 0) + break; + else if (ret < 0) + prb_iter_init(&iter, &printk_rb, &clear_seq); + } +} + +static int syslog_print_all(char __user *buf, int size, bool clear) +{ + struct prb_iterator iter; + struct printk_log *msg; + char *msgbuf = NULL; + char *text = NULL; + int textlen; + u64 seq = 0; + int len = 0; + bool time; + int ret; + + text = kmalloc(PRINTK_SPRINT_MAX, GFP_KERNEL); + if (!text) + return -ENOMEM; + msgbuf = kmalloc(PRINTK_RECORD_MAX, GFP_KERNEL); + if (!msgbuf) { + kfree(text); + return -ENOMEM; + } + + time = printk_time; + + /* + * Setup iter to last event before clear. Clear may + * be lost, but keep going with a best effort. + */ + prb_iter_init(&iter, &printk_rb, NULL); + prb_iter_seek(&iter, clear_seq); + + /* count the total bytes after clear */ + len = count_remaining(&iter, 0, msgbuf, PRINTK_RECORD_MAX, + false, time); + + /* move iter forward until length fits into the buffer */ + while (len > size) { + ret = prb_iter_next(&iter, msgbuf, + PRINTK_RECORD_MAX, &seq); + if (ret == 0) { + break; + } else if (ret < 0) { + /* + * The iter is now invalid so clear will + * also be invalid. Restart from the head. + */ + prb_iter_init(&iter, &printk_rb, NULL); + len = count_remaining(&iter, 0, msgbuf, + PRINTK_RECORD_MAX, false, time); + continue; + } + + msg = (struct printk_log *)msgbuf; + len -= msg_print_text(msg, true, time, NULL, 0); + + if (clear) + clear_seq = seq; + } + + /* copy messages to buffer */ + len = 0; + while (len >= 0 && len < size) { + if (clear) + clear_seq = seq; + + ret = prb_iter_next(&iter, msgbuf, + PRINTK_RECORD_MAX, &seq); + if (ret == 0) { + break; + } else if (ret < 0) { + /* + * The iter is now invalid. Make a best + * effort to grab the rest of the log + * from the new head. + */ + prb_iter_init(&iter, &printk_rb, NULL); + continue; + } + + msg = (struct printk_log *)msgbuf; + textlen = msg_print_text(msg, true, time, text, + PRINTK_SPRINT_MAX); + if (textlen < 0) { + len = textlen; + break; + } + + if (len + textlen > size) + break; + + if (copy_to_user(buf + len, text, textlen)) + len = -EFAULT; + else + len += textlen; + } + + if (clear && !seq) + syslog_clear(); + + if (text) + kfree(text); + if (msgbuf) + kfree(msgbuf); + return len; } int do_syslog(int type, char __user *buf, int len, int source) { bool clear = false; static int saved_console_loglevel = LOGLEVEL_DEFAULT; + struct prb_iterator iter; + char *msgbuf = NULL; + char *text = NULL; + int locked; int error; + int ret; error = check_syslog_permissions(type, source); if (error) @ kernel/printk/printk.c:1495 @ int do_syslog(int type, char __user *buf, int len, int source) return 0; if (!access_ok(buf, len)) return -EFAULT; - error = wait_event_interruptible(log_wait, - syslog_seq != log_next_seq); + + text = kmalloc(PRINTK_SPRINT_MAX, GFP_KERNEL); + msgbuf = kmalloc(PRINTK_RECORD_MAX, GFP_KERNEL); + if (!text || !msgbuf) { + error = -ENOMEM; + goto out; + } + + error = mutex_lock_interruptible(&syslog_lock); if (error) - return error; - error = syslog_print(buf, len); + goto out; + + /* + * Wait until a first message is available. Use a copy + * because no iteration should occur for syslog now. + */ + for (;;) { + prb_iter_copy(&iter, &syslog_iter); + + mutex_unlock(&syslog_lock); + ret = prb_iter_wait_next(&iter, NULL, 0, NULL); + if (ret == -ERESTARTSYS) { + error = ret; + goto out; + } + error = mutex_lock_interruptible(&syslog_lock); + if (error) + goto out; + + if (ret == -EINVAL) { + prb_iter_init(&syslog_iter, &printk_rb, + &syslog_seq); + syslog_partial = 0; + continue; + } + break; + } + + /* print as much as will fit in the user buffer */ + locked = 1; + error = syslog_print(buf, len, text, msgbuf, &locked); + if (locked) + mutex_unlock(&syslog_lock); break; /* Read/clear last kernel messages */ case SYSLOG_ACTION_READ_CLEAR: @ kernel/printk/printk.c:1582 @ int do_syslog(int type, char __user *buf, int len, int source) break; /* Number of chars in the log buffer */ case SYSLOG_ACTION_SIZE_UNREAD: - logbuf_lock_irq(); - if (syslog_seq < log_first_seq) { - /* messages are gone, move to first one */ - syslog_seq = log_first_seq; - syslog_idx = log_first_idx; - syslog_partial = 0; - } + msgbuf = kmalloc(PRINTK_RECORD_MAX, GFP_KERNEL); + if (!msgbuf) + return -ENOMEM; + + error = mutex_lock_interruptible(&syslog_lock); + if (error) + goto out; + if (source == SYSLOG_FROM_PROC) { /* * Short-cut for poll(/"proc/kmsg") which simply checks * for pending data, not the size; return the count of * records, not the length. */ - error = log_next_seq - syslog_seq; + error = count_remaining(&syslog_iter, 0, msgbuf, + PRINTK_RECORD_MAX, true, + printk_time); } else { - u64 seq = syslog_seq; - u32 idx = syslog_idx; - bool time = syslog_partial ? syslog_time : printk_time; - - while (seq < log_next_seq) { - struct printk_log *msg = log_from_idx(idx); - - error += msg_print_text(msg, true, time, NULL, - 0); - time = printk_time; - idx = log_next(idx); - seq++; - } + error = count_remaining(&syslog_iter, 0, msgbuf, + PRINTK_RECORD_MAX, false, + printk_time); error -= syslog_partial; } - logbuf_unlock_irq(); + + mutex_unlock(&syslog_lock); break; /* Size of the log buffer */ case SYSLOG_ACTION_SIZE_BUFFER: - error = log_buf_len; + error = prb_buffer_size(&printk_rb); break; default: error = -EINVAL; break; } - +out: + if (msgbuf) + kfree(msgbuf); + if (text) + kfree(text); return error; } @ kernel/printk/printk.c:1629 @ SYSCALL_DEFINE3(syslog, int, type, char __user *, buf, int, len) return do_syslog(type, buf, len, SYSLOG_FROM_READER); } -/* - * Special console_lock variants that help to reduce the risk of soft-lockups. - * They allow to pass console_lock to another printk() call using a busy wait. - */ +int printk_delay_msec __read_mostly; -#ifdef CONFIG_LOCKDEP -static struct lockdep_map console_owner_dep_map = { - .name = "console_owner" -}; -#endif - -static DEFINE_RAW_SPINLOCK(console_owner_lock); -static struct task_struct *console_owner; -static bool console_waiter; - -/** - * console_lock_spinning_enable - mark beginning of code where another - * thread might safely busy wait - * - * This basically converts console_lock into a spinlock. This marks - * the section where the console_lock owner can not sleep, because - * there may be a waiter spinning (like a spinlock). Also it must be - * ready to hand over the lock at the end of the section. - */ -static void console_lock_spinning_enable(void) +static inline void printk_delay(int level) { - raw_spin_lock(&console_owner_lock); - console_owner = current; - raw_spin_unlock(&console_owner_lock); + boot_delay_msec(level); + if (unlikely(printk_delay_msec)) { + int m = printk_delay_msec; - /* The waiter may spin on us after setting console_owner */ - spin_acquire(&console_owner_dep_map, 0, 0, _THIS_IP_); + while (m--) { + mdelay(1); + touch_nmi_watchdog(); + } + } } -/** - * console_lock_spinning_disable_and_check - mark end of code where another - * thread was able to busy wait and check if there is a waiter - * - * This is called at the end of the section where spinning is allowed. - * It has two functions. First, it is a signal that it is no longer - * safe to start busy waiting for the lock. Second, it checks if - * there is a busy waiter and passes the lock rights to her. - * - * Important: Callers lose the lock if there was a busy waiter. - * They must not touch items synchronized by console_lock - * in this case. - * - * Return: 1 if the lock rights were passed, 0 otherwise. - */ -static int console_lock_spinning_disable_and_check(void) +static void print_console_dropped(struct console *con, u64 count) { - int waiter; + char text[64]; + int len; - raw_spin_lock(&console_owner_lock); - waiter = READ_ONCE(console_waiter); - console_owner = NULL; - raw_spin_unlock(&console_owner_lock); - - if (!waiter) { - spin_release(&console_owner_dep_map, _THIS_IP_); - return 0; - } - - /* The waiter is now free to continue */ - WRITE_ONCE(console_waiter, false); - - spin_release(&console_owner_dep_map, _THIS_IP_); - - /* - * Hand off console_lock to waiter. The waiter will perform - * the up(). After this, the waiter is the console_lock owner. - */ - mutex_release(&console_lock_dep_map, _THIS_IP_); - return 1; + len = sprintf(text, "** %llu printk message%s dropped **\n", + count, count > 1 ? "s" : ""); + con->write(con, text, len); } -/** - * console_trylock_spinning - try to get console_lock by busy waiting - * - * This allows to busy wait for the console_lock when the current - * owner is running in specially marked sections. It means that - * the current owner is running and cannot reschedule until it - * is ready to lose the lock. - * - * Return: 1 if we got the lock, 0 othrewise - */ -static int console_trylock_spinning(void) +static void format_text(struct printk_log *msg, u64 seq, + char *ext_text, size_t *ext_len, + char *text, size_t *len, bool time) { - struct task_struct *owner = NULL; - bool waiter; - bool spin = false; - unsigned long flags; - - if (console_trylock()) - return 1; - - printk_safe_enter_irqsave(flags); - - raw_spin_lock(&console_owner_lock); - owner = READ_ONCE(console_owner); - waiter = READ_ONCE(console_waiter); - if (!waiter && owner && owner != current) { - WRITE_ONCE(console_waiter, true); - spin = true; - } - raw_spin_unlock(&console_owner_lock); - - /* - * If there is an active printk() writing to the - * consoles, instead of having it write our data too, - * see if we can offload that load from the active - * printer, and do some printing ourselves. - * Go into a spin only if there isn't already a waiter - * spinning, and there is an active printer, and - * that active printer isn't us (recursive printk?). - */ - if (!spin) { - printk_safe_exit_irqrestore(flags); - return 0; + if (suppress_message_printing(msg->level)) { + /* + * Skip record that has level above the console + * loglevel and update each console's local seq. + */ + *len = 0; + *ext_len = 0; + return; } - /* We spin waiting for the owner to release us */ - spin_acquire(&console_owner_dep_map, 0, 0, _THIS_IP_); - /* Owner will clear console_waiter on hand off */ - while (READ_ONCE(console_waiter)) - cpu_relax(); - spin_release(&console_owner_dep_map, _THIS_IP_); + *len = msg_print_text(msg, console_msg_format & MSG_FORMAT_SYSLOG, + time, text, PRINTK_SPRINT_MAX); + if (nr_ext_console_drivers) { + *ext_len = msg_print_ext_header(ext_text, CONSOLE_EXT_LOG_MAX, + msg, seq); + *ext_len += msg_print_ext_body(ext_text + *ext_len, + CONSOLE_EXT_LOG_MAX - *ext_len, + log_dict(msg), msg->dict_len, + log_text(msg), msg->text_len); + } else { + *ext_len = 0; + } +} - printk_safe_exit_irqrestore(flags); - /* - * The owner passed the console lock to us. - * Since we did not spin on console lock, annotate - * this as a trylock. Otherwise lockdep will - * complain. - */ - mutex_acquire(&console_lock_dep_map, 0, 1, _THIS_IP_); +static void printk_write_history(struct console *con, u64 master_seq) +{ + struct prb_iterator iter; + bool time = printk_time; + static char *ext_text; + static char *text; + static char *buf; + u64 seq; - return 1; + ext_text = kmalloc(CONSOLE_EXT_LOG_MAX, GFP_KERNEL); + text = kmalloc(PRINTK_SPRINT_MAX, GFP_KERNEL); + buf = kmalloc(PRINTK_RECORD_MAX, GFP_KERNEL); + if (!ext_text || !text || !buf) + return; + + if (!(con->flags & CON_ENABLED)) + goto out; + + if (!con->write) + goto out; + + if (!cpu_online(raw_smp_processor_id()) && + !(con->flags & CON_ANYTIME)) + goto out; + + prb_iter_init(&iter, &printk_rb, NULL); + + for (;;) { + struct printk_log *msg; + size_t ext_len; + size_t len; + int ret; + + ret = prb_iter_next(&iter, buf, PRINTK_RECORD_MAX, &seq); + if (ret == 0) { + break; + } else if (ret < 0) { + prb_iter_init(&iter, &printk_rb, NULL); + continue; + } + + if (seq > master_seq) + break; + + con->printk_seq++; + if (con->printk_seq < seq) { + print_console_dropped(con, seq - con->printk_seq); + con->printk_seq = seq; + } + + msg = (struct printk_log *)buf; + format_text(msg, master_seq, ext_text, &ext_len, text, + &len, time); + + if (len == 0 && ext_len == 0) + continue; + + if (con->flags & CON_EXTENDED) + con->write(con, ext_text, ext_len); + else + con->write(con, text, len); + + printk_delay(msg->level); + } +out: + con->wrote_history = 1; + kfree(ext_text); + kfree(text); + kfree(buf); } /* @ kernel/printk/printk.c:1758 @ static int console_trylock_spinning(void) * log_buf[start] to log_buf[end - 1]. * The console_lock must be held. */ -static void call_console_drivers(const char *ext_text, size_t ext_len, - const char *text, size_t len) +static void call_console_drivers(u64 seq, const char *ext_text, size_t ext_len, + const char *text, size_t len, int level, + int facility) { struct console *con; @ kernel/printk/printk.c:1770 @ static void call_console_drivers(const char *ext_text, size_t ext_len, return; for_each_console(con) { - if (exclusive_console && con != exclusive_console) - continue; if (!(con->flags & CON_ENABLED)) continue; + if (!con->wrote_history) { + if (con->flags & CON_PRINTBUFFER) { + printk_write_history(con, seq); + continue; + } + con->wrote_history = 1; + con->printk_seq = seq - 1; + } + if (con->flags & CON_BOOT && facility == 0) { + /* skip boot messages, already printed */ + if (con->printk_seq < seq) + con->printk_seq = seq; + continue; + } if (!con->write) continue; - if (!cpu_online(smp_processor_id()) && + if (!cpu_online(raw_smp_processor_id()) && !(con->flags & CON_ANYTIME)) continue; + if (con->printk_seq >= seq) + continue; + + con->printk_seq++; + if (con->printk_seq < seq) { + print_console_dropped(con, seq - con->printk_seq); + con->printk_seq = seq; + } + + /* for supressed messages, only seq is updated */ + if (len == 0 && ext_len == 0) + continue; + if (con->flags & CON_EXTENDED) con->write(con, ext_text, ext_len); else @ kernel/printk/printk.c:1811 @ static void call_console_drivers(const char *ext_text, size_t ext_len, } } -int printk_delay_msec __read_mostly; - -static inline void printk_delay(void) -{ - if (unlikely(printk_delay_msec)) { - int m = printk_delay_msec; - - while (m--) { - mdelay(1); - touch_nmi_watchdog(); - } - } -} - static inline u32 printk_caller_id(void) { return in_task() ? task_pid_nr(current) : @ kernel/printk/printk.c:1827 @ static struct cont { char buf[LOG_LINE_MAX]; size_t len; /* length == 0 means unused buffer */ u32 caller_id; /* printk_caller_id() of first print */ + int cpu_owner; /* cpu of first print */ u64 ts_nsec; /* time of first print */ u8 level; /* log level of first message */ u8 facility; /* log facility of first message */ enum log_flags flags; /* prefix, newline flags */ -} cont; +} cont[2]; -static void cont_flush(void) +static void cont_flush(int ctx) { - if (cont.len == 0) + struct cont *c = &cont[ctx]; + + if (c->len == 0) return; - log_store(cont.caller_id, cont.facility, cont.level, cont.flags, - cont.ts_nsec, NULL, 0, cont.buf, cont.len); - cont.len = 0; + log_store(c->caller_id, c->facility, c->level, c->flags, + c->ts_nsec, c->cpu_owner, NULL, 0, c->buf, c->len); + c->len = 0; } -static bool cont_add(u32 caller_id, int facility, int level, +static void cont_add(int ctx, int cpu, u32 caller_id, int facility, int level, enum log_flags flags, const char *text, size_t len) { + struct cont *c = &cont[ctx]; + + if (cpu != c->cpu_owner || !(flags & LOG_CONT)) + cont_flush(ctx); + /* If the line gets too long, split it up in separate records. */ - if (cont.len + len > sizeof(cont.buf)) { - cont_flush(); - return false; + while (c->len + len > sizeof(c->buf)) + cont_flush(ctx); + + if (!c->len) { + c->facility = facility; + c->level = level; + c->caller_id = caller_id; + c->ts_nsec = local_clock(); + c->flags = flags; + c->cpu_owner = cpu; } - if (!cont.len) { - cont.facility = facility; - cont.level = level; - cont.caller_id = caller_id; - cont.ts_nsec = local_clock(); - cont.flags = flags; - } - - memcpy(cont.buf + cont.len, text, len); - cont.len += len; + memcpy(c->buf + c->len, text, len); + c->len += len; // The original flags come from the first line, // but later continuations can add a newline. if (flags & LOG_NEWLINE) { - cont.flags |= LOG_NEWLINE; - cont_flush(); + c->flags |= LOG_NEWLINE; + cont_flush(ctx); } - - return true; } -static size_t log_output(int facility, int level, enum log_flags lflags, const char *dict, size_t dictlen, char *text, size_t text_len) +/* ring buffer used as memory allocator for temporary sprint buffers */ +DECLARE_STATIC_PRINTKRB(sprint_rb, + ilog2(PRINTK_RECORD_MAX + sizeof(struct prb_entry) + + sizeof(long)) + 2, &printk_cpulock); + +asmlinkage int vprintk_emit(int facility, int level, + const char *dict, size_t dictlen, + const char *fmt, va_list args) { const u32 caller_id = printk_caller_id(); - - /* - * If an earlier line was buffered, and we're a continuation - * write from the same context, try to add it to the buffer. - */ - if (cont.len) { - if (cont.caller_id == caller_id && (lflags & LOG_CONT)) { - if (cont_add(caller_id, facility, level, lflags, text, text_len)) - return text_len; - } - /* Otherwise, make sure it's flushed */ - cont_flush(); - } - - /* Skip empty continuation lines that couldn't be added - they just flush */ - if (!text_len && (lflags & LOG_CONT)) - return 0; - - /* If it doesn't end in a newline, try to buffer the current line */ - if (!(lflags & LOG_NEWLINE)) { - if (cont_add(caller_id, facility, level, lflags, text, text_len)) - return text_len; - } - - /* Store it in the record log */ - return log_store(caller_id, facility, level, lflags, 0, - dict, dictlen, text, text_len); -} - -/* Must be called under logbuf_lock. */ -int vprintk_store(int facility, int level, - const char *dict, size_t dictlen, - const char *fmt, va_list args) -{ - static char textbuf[LOG_LINE_MAX]; - char *text = textbuf; - size_t text_len; + int ctx = !!in_nmi(); enum log_flags lflags = 0; + int printed_len = 0; + struct prb_handle h; + size_t text_len; + u64 ts_nsec; + char *text; + char *rbuf; + int cpu; + + ts_nsec = local_clock(); + + rbuf = prb_reserve(&h, &sprint_rb, PRINTK_SPRINT_MAX); + if (!rbuf) { + prb_inc_lost(&printk_rb); + return printed_len; + } + + cpu = raw_smp_processor_id(); /* - * The printf needs to come first; we need the syslog - * prefix which might be passed-in as a parameter. + * If this turns out to be an emergency message, there + * may need to be a prefix added. Leave room for it. */ - text_len = vscnprintf(text, sizeof(textbuf), fmt, args); + text = rbuf + PREFIX_MAX; + text_len = vscnprintf(text, PRINTK_SPRINT_MAX - PREFIX_MAX, fmt, args); - /* mark and strip a trailing newline */ + /* strip and flag a trailing newline */ if (text_len && text[text_len-1] == '\n') { text_len--; lflags |= LOG_NEWLINE; @ kernel/printk/printk.c:1946 @ int vprintk_store(int facility, int level, if (dict) lflags |= LOG_NEWLINE; - return log_output(facility, level, lflags, - dict, dictlen, text, text_len); -} - -asmlinkage int vprintk_emit(int facility, int level, - const char *dict, size_t dictlen, - const char *fmt, va_list args) -{ - int printed_len; - bool in_sched = false, pending_output; - unsigned long flags; - u64 curr_log_seq; - - /* Suppress unimportant messages after panic happens */ - if (unlikely(suppress_printk)) - return 0; - - if (level == LOGLEVEL_SCHED) { - level = LOGLEVEL_DEFAULT; - in_sched = true; + /* + * NOTE: + * - rbuf points to beginning of allocated buffer + * - text points to beginning of text + * - there is room before text for prefix + */ + if (facility == 0) { + /* only the kernel can create emergency messages */ + printk_emergency(rbuf, level & 7, ts_nsec, cpu, text, text_len); } - boot_delay_msec(level); - printk_delay(); - - /* This stops the holder of console_sem just where we want him */ - logbuf_lock_irqsave(flags); - curr_log_seq = log_next_seq; - printed_len = vprintk_store(facility, level, dict, dictlen, fmt, args); - pending_output = (curr_log_seq != log_next_seq); - logbuf_unlock_irqrestore(flags); - - /* If called from the scheduler, we can not call up(). */ - if (!in_sched && pending_output) { - /* - * Disable preemption to avoid being preempted while holding - * console_sem which would prevent anyone from printing to - * console - */ - preempt_disable(); - /* - * Try to acquire and then immediately release the console - * semaphore. The release will print out buffers and wake up - * /dev/kmsg and syslog() users. - */ - if (console_trylock_spinning()) - console_unlock(); - preempt_enable(); + if ((lflags & LOG_CONT) || !(lflags & LOG_NEWLINE)) { + cont_add(ctx, cpu, caller_id, facility, level, lflags, text, text_len); + printed_len = text_len; + } else { + if (cpu == cont[ctx].cpu_owner) + cont_flush(ctx); + printed_len = log_store(caller_id, facility, level, lflags, ts_nsec, cpu, + dict, dictlen, text, text_len); } - if (pending_output) - wake_up_klogd(); + prb_commit(&h); return printed_len; } EXPORT_SYMBOL(vprintk_emit); +static __printf(1, 0) int vprintk_func(const char *fmt, va_list args) +{ + return vprintk_emit(0, LOGLEVEL_DEFAULT, NULL, 0, fmt, args); +} + asmlinkage int vprintk(const char *fmt, va_list args) { return vprintk_func(fmt, args); @ kernel/printk/printk.c:2033 @ asmlinkage __visible int printk(const char *fmt, ...) return r; } EXPORT_SYMBOL(printk); - -#else /* CONFIG_PRINTK */ - -#define LOG_LINE_MAX 0 -#define PREFIX_MAX 0 -#define printk_time false - -static u64 syslog_seq; -static u32 syslog_idx; -static u64 console_seq; -static u32 console_idx; -static u64 exclusive_console_stop_seq; -static u64 log_first_seq; -static u32 log_first_idx; -static u64 log_next_seq; -static char *log_text(const struct printk_log *msg) { return NULL; } -static char *log_dict(const struct printk_log *msg) { return NULL; } -static struct printk_log *log_from_idx(u32 idx) { return NULL; } -static u32 log_next(u32 idx) { return 0; } -static ssize_t msg_print_ext_header(char *buf, size_t size, - struct printk_log *msg, - u64 seq) { return 0; } -static ssize_t msg_print_ext_body(char *buf, size_t size, - char *dict, size_t dict_len, - char *text, size_t text_len) { return 0; } -static void console_lock_spinning_enable(void) { } -static int console_lock_spinning_disable_and_check(void) { return 0; } -static void call_console_drivers(const char *ext_text, size_t ext_len, - const char *text, size_t len) {} -static size_t msg_print_text(const struct printk_log *msg, bool syslog, - bool time, char *buf, size_t size) { return 0; } -static bool suppress_message_printing(int level) { return false; } - #endif /* CONFIG_PRINTK */ #ifdef CONFIG_EARLY_PRINTK @ kernel/printk/printk.c:2263 @ int is_console_locked(void) } EXPORT_SYMBOL(is_console_locked); -/* - * Check if we have any console that is capable of printing while cpu is - * booting or shutting down. Requires console_sem. - */ -static int have_callable_console(void) -{ - struct console *con; - - for_each_console(con) - if ((con->flags & CON_ENABLED) && - (con->flags & CON_ANYTIME)) - return 1; - - return 0; -} - -/* - * Can we actually use the console at this time on this cpu? - * - * Console drivers may assume that per-cpu resources have been allocated. So - * unless they're explicitly marked as being able to cope (CON_ANYTIME) don't - * call them until this CPU is officially up. - */ -static inline int can_use_console(void) -{ - return cpu_online(raw_smp_processor_id()) || have_callable_console(); -} - /** * console_unlock - unlock the console system * * Releases the console_lock which the caller holds on the console system * and the console driver list. * - * While the console_lock was held, console output may have been buffered - * by printk(). If this is the case, console_unlock(); emits - * the output prior to releasing the lock. - * - * If there is output waiting, we wake /dev/kmsg and syslog() users. - * * console_unlock(); may be called from any context. */ void console_unlock(void) { - static char ext_text[CONSOLE_EXT_LOG_MAX]; - static char text[LOG_LINE_MAX + PREFIX_MAX]; - unsigned long flags; - bool do_cond_resched, retry; - if (console_suspended) { up_console_sem(); return; } - /* - * Console drivers are called with interrupts disabled, so - * @console_may_schedule should be cleared before; however, we may - * end up dumping a lot of lines, for example, if called from - * console registration path, and should invoke cond_resched() - * between lines if allowable. Not doing so can cause a very long - * scheduling stall on a slow console leading to RCU stall and - * softlockup warnings which exacerbate the issue with more - * messages practically incapacitating the system. - * - * console_trylock() is not able to detect the preemptive - * context reliably. Therefore the value must be stored before - * and cleared after the the "again" goto label. - */ - do_cond_resched = console_may_schedule; -again: - console_may_schedule = 0; - - /* - * We released the console_sem lock, so we need to recheck if - * cpu is online and (if not) is there at least one CON_ANYTIME - * console. - */ - if (!can_use_console()) { - console_locked = 0; - up_console_sem(); - return; - } - - for (;;) { - struct printk_log *msg; - size_t ext_len = 0; - size_t len; - - printk_safe_enter_irqsave(flags); - raw_spin_lock(&logbuf_lock); - if (console_seq < log_first_seq) { - len = sprintf(text, - "** %llu printk messages dropped **\n", - log_first_seq - console_seq); - - /* messages are gone, move to first one */ - console_seq = log_first_seq; - console_idx = log_first_idx; - } else { - len = 0; - } -skip: - if (console_seq == log_next_seq) - break; - - msg = log_from_idx(console_idx); - if (suppress_message_printing(msg->level)) { - /* - * Skip record we have buffered and already printed - * directly to the console when we received it, and - * record that has level above the console loglevel. - */ - console_idx = log_next(console_idx); - console_seq++; - goto skip; - } - - /* Output to all consoles once old messages replayed. */ - if (unlikely(exclusive_console && - console_seq >= exclusive_console_stop_seq)) { - exclusive_console = NULL; - } - - len += msg_print_text(msg, - console_msg_format & MSG_FORMAT_SYSLOG, - printk_time, text + len, sizeof(text) - len); - if (nr_ext_console_drivers) { - ext_len = msg_print_ext_header(ext_text, - sizeof(ext_text), - msg, console_seq); - ext_len += msg_print_ext_body(ext_text + ext_len, - sizeof(ext_text) - ext_len, - log_dict(msg), msg->dict_len, - log_text(msg), msg->text_len); - } - console_idx = log_next(console_idx); - console_seq++; - raw_spin_unlock(&logbuf_lock); - - /* - * While actively printing out messages, if another printk() - * were to occur on another CPU, it may wait for this one to - * finish. This task can not be preempted if there is a - * waiter waiting to take over. - */ - console_lock_spinning_enable(); - - stop_critical_timings(); /* don't trace print latency */ - call_console_drivers(ext_text, ext_len, text, len); - start_critical_timings(); - - if (console_lock_spinning_disable_and_check()) { - printk_safe_exit_irqrestore(flags); - return; - } - - printk_safe_exit_irqrestore(flags); - - if (do_cond_resched) - cond_resched(); - } - console_locked = 0; - - raw_spin_unlock(&logbuf_lock); - up_console_sem(); - - /* - * Someone could have filled up the buffer again, so re-check if there's - * something to flush. In case we cannot trylock the console_sem again, - * there's a new owner and the console_unlock() from them will do the - * flush, no worries. - */ - raw_spin_lock(&logbuf_lock); - retry = console_seq != log_next_seq; - raw_spin_unlock(&logbuf_lock); - printk_safe_exit_irqrestore(flags); - - if (retry && console_trylock()) - goto again; } EXPORT_SYMBOL(console_unlock); @ kernel/printk/printk.c:2330 @ void console_unblank(void) void console_flush_on_panic(enum con_flush_mode mode) { /* - * If someone else is holding the console lock, trylock will fail - * and may_schedule may be set. Ignore and proceed to unlock so - * that messages are flushed out. As this can be called from any - * context and we don't want to get preempted while flushing, - * ensure may_schedule is cleared. + * FIXME: This is currently a NOP. Emergency messages will have been + * printed, but what about if write_atomic is not available on the + * console? What if the printk kthread is still alive? */ - console_trylock(); - console_may_schedule = 0; - - if (mode == CONSOLE_REPLAY_ALL) { - unsigned long flags; - - logbuf_lock_irqsave(flags); - console_seq = log_first_seq; - console_idx = log_first_idx; - logbuf_unlock_irqrestore(flags); - } - console_unlock(); } /* @ kernel/printk/printk.c:2411 @ early_param("keep_bootcon", keep_bootcon_setup); void register_console(struct console *newcon) { int i; - unsigned long flags; struct console *bcon = NULL; struct console_cmdline *c; static bool has_preferred; @ kernel/printk/printk.c:2526 @ void register_console(struct console *newcon) if (newcon->flags & CON_EXTENDED) nr_ext_console_drivers++; - if (newcon->flags & CON_PRINTBUFFER) { - /* - * console_unlock(); will print out the buffered messages - * for us. - */ - logbuf_lock_irqsave(flags); - /* - * We're about to replay the log buffer. Only do this to the - * just-registered console to avoid excessive message spam to - * the already-registered consoles. - * - * Set exclusive_console with disabled interrupts to reduce - * race window with eventual console_flush_on_panic() that - * ignores console_lock. - */ - exclusive_console = newcon; - exclusive_console_stop_seq = console_seq; - console_seq = syslog_seq; - console_idx = syslog_idx; - logbuf_unlock_irqrestore(flags); - } console_unlock(); console_sysfs_notify(); @ kernel/printk/printk.c:2535 @ void register_console(struct console *newcon) * boot consoles, real consoles, etc - this is to ensure that end * users know there might be something in the kernel's log buffer that * went to the bootconsole (that they do not see on the real console) + * + * This message is also important because it will trigger the + * printk kthread to begin dumping the log buffer to the newly + * registered console. */ pr_info("%sconsole [%s%d] enabled\n", (newcon->flags & CON_BOOT) ? "boot" : "" , @ kernel/printk/printk.c:2682 @ static int __init printk_late_init(void) late_initcall(printk_late_init); #if defined CONFIG_PRINTK -/* - * Delayed printk version, for scheduler-internal messages: - */ -#define PRINTK_PENDING_WAKEUP 0x01 -#define PRINTK_PENDING_OUTPUT 0x02 - -static DEFINE_PER_CPU(int, printk_pending); - -static void wake_up_klogd_work_func(struct irq_work *irq_work) +static int printk_kthread_func(void *data) { - int pending = __this_cpu_xchg(printk_pending, 0); + struct prb_iterator iter; + struct printk_log *msg; + size_t ext_len; + char *ext_text; + u64 master_seq; + size_t len; + char *text; + char *buf; + int ret; - if (pending & PRINTK_PENDING_OUTPUT) { - /* If trylock fails, someone else is doing the printing */ - if (console_trylock()) - console_unlock(); + ext_text = kmalloc(CONSOLE_EXT_LOG_MAX, GFP_KERNEL); + text = kmalloc(PRINTK_SPRINT_MAX, GFP_KERNEL); + buf = kmalloc(PRINTK_RECORD_MAX, GFP_KERNEL); + if (!ext_text || !text || !buf) + return -1; + + prb_iter_init(&iter, &printk_rb, NULL); + + /* the printk kthread never exits */ + for (;;) { + ret = prb_iter_wait_next(&iter, buf, + PRINTK_RECORD_MAX, &master_seq); + if (ret == -ERESTARTSYS) { + continue; + } else if (ret < 0) { + /* iterator invalid, start over */ + prb_iter_init(&iter, &printk_rb, NULL); + continue; + } + + msg = (struct printk_log *)buf; + format_text(msg, master_seq, ext_text, &ext_len, text, + &len, printk_time); + + console_lock(); + console_may_schedule = 0; + call_console_drivers(master_seq, ext_text, ext_len, text, len, + msg->level, msg->facility); + if (len > 0 || ext_len > 0) + printk_delay(msg->level); + console_unlock(); } - if (pending & PRINTK_PENDING_WAKEUP) - wake_up_interruptible(&log_wait); + kfree(ext_text); + kfree(text); + kfree(buf); + + return 0; } -static DEFINE_PER_CPU(struct irq_work, wake_up_klogd_work) = { - .func = wake_up_klogd_work_func, - .flags = ATOMIC_INIT(IRQ_WORK_LAZY), -}; - -void wake_up_klogd(void) +static int __init init_printk_kthread(void) { - if (!printk_percpu_data_ready()) - return; + struct task_struct *thread; - preempt_disable(); - if (waitqueue_active(&log_wait)) { - this_cpu_or(printk_pending, PRINTK_PENDING_WAKEUP); - irq_work_queue(this_cpu_ptr(&wake_up_klogd_work)); + thread = kthread_run(printk_kthread_func, NULL, "printk"); + if (IS_ERR(thread)) { + pr_err("printk: unable to create printing thread\n"); + return PTR_ERR(thread); } - preempt_enable(); + + return 0; } +late_initcall(init_printk_kthread); -void defer_console_output(void) +static int vprintk_deferred(const char *fmt, va_list args) { - if (!printk_percpu_data_ready()) - return; - - preempt_disable(); - __this_cpu_or(printk_pending, PRINTK_PENDING_OUTPUT); - irq_work_queue(this_cpu_ptr(&wake_up_klogd_work)); - preempt_enable(); -} - -int vprintk_deferred(const char *fmt, va_list args) -{ - int r; - - r = vprintk_emit(0, LOGLEVEL_SCHED, NULL, 0, fmt, args); - defer_console_output(); - - return r; + return vprintk_emit(0, LOGLEVEL_DEFAULT, NULL, 0, fmt, args); } int printk_deferred(const char *fmt, ...) @ kernel/printk/printk.c:2872 @ module_param_named(always_kmsg_dump, always_kmsg_dump, bool, S_IRUGO | S_IWUSR); */ void kmsg_dump(enum kmsg_dump_reason reason) { + struct kmsg_dumper dumper_local; struct kmsg_dumper *dumper; - unsigned long flags; if ((reason > KMSG_DUMP_OOPS) && !always_kmsg_dump) return; @ kernel/printk/printk.c:2883 @ void kmsg_dump(enum kmsg_dump_reason reason) if (dumper->max_reason && reason > dumper->max_reason) continue; - /* initialize iterator with data about the stored records */ - dumper->active = true; + /* + * use a local copy to avoid modifying the + * iterator used by any other cpus/contexts + */ + memcpy(&dumper_local, dumper, sizeof(dumper_local)); - logbuf_lock_irqsave(flags); - dumper->cur_seq = clear_seq; - dumper->cur_idx = clear_idx; - dumper->next_seq = log_next_seq; - dumper->next_idx = log_next_idx; - logbuf_unlock_irqrestore(flags); + /* initialize iterator with data about the stored records */ + dumper_local.active = true; + kmsg_dump_rewind(&dumper_local); /* invoke dumper which will iterate over records */ - dumper->dump(dumper, reason); - - /* reset iterator */ - dumper->active = false; + dumper_local.dump(&dumper_local, reason); } rcu_read_unlock(); } @ kernel/printk/printk.c:2921 @ void kmsg_dump(enum kmsg_dump_reason reason) bool kmsg_dump_get_line_nolock(struct kmsg_dumper *dumper, bool syslog, char *line, size_t size, size_t *len) { + struct prb_iterator iter; struct printk_log *msg; - size_t l = 0; - bool ret = false; + struct prb_handle h; + bool cont = false; + char *msgbuf; + char *rbuf; + size_t l; + u64 seq; + int ret; if (!dumper->active) - goto out; + return cont; - if (dumper->cur_seq < log_first_seq) { - /* messages are gone, move to first available one */ - dumper->cur_seq = log_first_seq; - dumper->cur_idx = log_first_idx; + rbuf = prb_reserve(&h, &sprint_rb, PRINTK_RECORD_MAX); + if (!rbuf) + return cont; + msgbuf = rbuf; +retry: + for (;;) { + prb_iter_init(&iter, &printk_rb, &seq); + + if (dumper->line_seq == seq) { + /* already where we want to be */ + break; + } else if (dumper->line_seq < seq) { + /* messages are gone, move to first available one */ + dumper->line_seq = seq; + break; + } + + ret = prb_iter_seek(&iter, dumper->line_seq); + if (ret > 0) { + /* seeked to line_seq */ + break; + } else if (ret == 0) { + /* + * The end of the list was hit without ever seeing + * line_seq. Reset it to the beginning of the list. + */ + prb_iter_init(&iter, &printk_rb, &dumper->line_seq); + break; + } + /* iterator invalid, start over */ } - /* last entry */ - if (dumper->cur_seq >= log_next_seq) + ret = prb_iter_next(&iter, msgbuf, PRINTK_RECORD_MAX, + &dumper->line_seq); + if (ret == 0) goto out; + else if (ret < 0) + goto retry; - msg = log_from_idx(dumper->cur_idx); + msg = (struct printk_log *)msgbuf; l = msg_print_text(msg, syslog, printk_time, line, size); - dumper->cur_idx = log_next(dumper->cur_idx); - dumper->cur_seq++; - ret = true; -out: if (len) *len = l; - return ret; + cont = true; +out: + prb_commit(&h); + return cont; } /** @ kernel/printk/printk.c:3004 @ bool kmsg_dump_get_line_nolock(struct kmsg_dumper *dumper, bool syslog, bool kmsg_dump_get_line(struct kmsg_dumper *dumper, bool syslog, char *line, size_t size, size_t *len) { - unsigned long flags; bool ret; - logbuf_lock_irqsave(flags); ret = kmsg_dump_get_line_nolock(dumper, syslog, line, size, len); - logbuf_unlock_irqrestore(flags); return ret; } @ kernel/printk/printk.c:3034 @ EXPORT_SYMBOL_GPL(kmsg_dump_get_line); bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog, char *buf, size_t size, size_t *len) { - unsigned long flags; - u64 seq; - u32 idx; - u64 next_seq; - u32 next_idx; - size_t l = 0; - bool ret = false; + struct prb_iterator iter; bool time = printk_time; + struct printk_log *msg; + u64 new_end_seq = 0; + struct prb_handle h; + bool cont = false; + char *msgbuf; + u64 end_seq; + int textlen; + u64 seq = 0; + char *rbuf; + int l = 0; + int ret; if (!dumper->active) + return cont; + + rbuf = prb_reserve(&h, &sprint_rb, PRINTK_RECORD_MAX); + if (!rbuf) + return cont; + msgbuf = rbuf; + + prb_iter_init(&iter, &printk_rb, NULL); + + /* + * seek to the start record, which is set/modified + * by kmsg_dump_get_line_nolock() + */ + ret = prb_iter_seek(&iter, dumper->line_seq); + if (ret <= 0) + prb_iter_init(&iter, &printk_rb, &seq); + + /* work with a local end seq to have a constant value */ + end_seq = dumper->buffer_end_seq; + if (!end_seq) { + /* initialize end seq to "infinity" */ + end_seq = -1; + dumper->buffer_end_seq = end_seq; + } +retry: + if (seq >= end_seq) goto out; - logbuf_lock_irqsave(flags); - if (dumper->cur_seq < log_first_seq) { - /* messages are gone, move to first available one */ - dumper->cur_seq = log_first_seq; - dumper->cur_idx = log_first_idx; + /* count the total bytes after seq */ + textlen = count_remaining(&iter, end_seq, msgbuf, + PRINTK_RECORD_MAX, 0, time); + + /* move iter forward until length fits into the buffer */ + while (textlen > size) { + ret = prb_iter_next(&iter, msgbuf, PRINTK_RECORD_MAX, &seq); + if (ret == 0) { + break; + } else if (ret < 0 || seq >= end_seq) { + prb_iter_init(&iter, &printk_rb, &seq); + goto retry; + } + + msg = (struct printk_log *)msgbuf; + textlen -= msg_print_text(msg, true, time, NULL, 0); } - /* last entry */ - if (dumper->cur_seq >= dumper->next_seq) { - logbuf_unlock_irqrestore(flags); - goto out; + /* save end seq for the next interation */ + new_end_seq = seq + 1; + + /* copy messages to buffer */ + while (l < size) { + ret = prb_iter_next(&iter, msgbuf, PRINTK_RECORD_MAX, &seq); + if (ret == 0) { + break; + } else if (ret < 0) { + /* + * iterator (and thus also the start position) + * invalid, start over from beginning of list + */ + prb_iter_init(&iter, &printk_rb, NULL); + continue; + } + + if (seq >= end_seq) + break; + + msg = (struct printk_log *)msgbuf; + textlen = msg_print_text(msg, syslog, time, buf + l, size - l); + if (textlen > 0) + l += textlen; + cont = true; } - /* calculate length of entire buffer */ - seq = dumper->cur_seq; - idx = dumper->cur_idx; - while (seq < dumper->next_seq) { - struct printk_log *msg = log_from_idx(idx); - - l += msg_print_text(msg, true, time, NULL, 0); - idx = log_next(idx); - seq++; - } - - /* move first record forward until length fits into the buffer */ - seq = dumper->cur_seq; - idx = dumper->cur_idx; - while (l >= size && seq < dumper->next_seq) { - struct printk_log *msg = log_from_idx(idx); - - l -= msg_print_text(msg, true, time, NULL, 0); - idx = log_next(idx); - seq++; - } - - /* last message in next interation */ - next_seq = seq; - next_idx = idx; - - l = 0; - while (seq < dumper->next_seq) { - struct printk_log *msg = log_from_idx(idx); - - l += msg_print_text(msg, syslog, time, buf + l, size - l); - idx = log_next(idx); - seq++; - } - - dumper->next_seq = next_seq; - dumper->next_idx = next_idx; - ret = true; - logbuf_unlock_irqrestore(flags); -out: - if (len) + if (cont && len) *len = l; - return ret; +out: + prb_commit(&h); + if (new_end_seq) + dumper->buffer_end_seq = new_end_seq; + return cont; } EXPORT_SYMBOL_GPL(kmsg_dump_get_buffer); @ kernel/printk/printk.c:3144 @ EXPORT_SYMBOL_GPL(kmsg_dump_get_buffer); */ void kmsg_dump_rewind_nolock(struct kmsg_dumper *dumper) { - dumper->cur_seq = clear_seq; - dumper->cur_idx = clear_idx; - dumper->next_seq = log_next_seq; - dumper->next_idx = log_next_idx; + dumper->line_seq = 0; + dumper->buffer_end_seq = 0; } /** @ kernel/printk/printk.c:3158 @ void kmsg_dump_rewind_nolock(struct kmsg_dumper *dumper) */ void kmsg_dump_rewind(struct kmsg_dumper *dumper) { - unsigned long flags; - - logbuf_lock_irqsave(flags); kmsg_dump_rewind_nolock(dumper); - logbuf_unlock_irqrestore(flags); } EXPORT_SYMBOL_GPL(kmsg_dump_rewind); +static bool console_can_emergency(int level) +{ + struct console *con; + + for_each_console(con) { + if (!(con->flags & CON_ENABLED)) + continue; + if (con->write_atomic && oops_in_progress) + return true; + if (con->write && (con->flags & CON_BOOT)) + return true; + } + return false; +} + +static void call_emergency_console_drivers(int level, const char *text, + size_t text_len) +{ + struct console *con; + + for_each_console(con) { + if (!(con->flags & CON_ENABLED)) + continue; + if (con->write_atomic && oops_in_progress) { + con->write_atomic(con, text, text_len); + continue; + } + if (con->write && (con->flags & CON_BOOT)) { + con->write(con, text, text_len); + continue; + } + } +} + +static void printk_emergency(char *buffer, int level, u64 ts_nsec, u16 cpu, + char *text, u16 text_len) +{ + struct printk_log msg; + size_t prefix_len; + + if (!console_can_emergency(level)) + return; + + msg.level = level; + msg.ts_nsec = ts_nsec; + msg.cpu = cpu; + msg.facility = 0; + + /* "text" must have PREFIX_MAX preceding bytes available */ + + prefix_len = print_prefix(&msg, + console_msg_format & MSG_FORMAT_SYSLOG, + printk_time, buffer); + /* move the prefix forward to the beginning of the message text */ + text -= prefix_len; + memmove(text, buffer, prefix_len); + text_len += prefix_len; + + text[text_len++] = '\n'; + + call_emergency_console_drivers(level, text, text_len); + + touch_softlockup_watchdog_sync(); + clocksource_touch_watchdog(); + rcu_cpu_stall_reset(); + touch_nmi_watchdog(); + + printk_delay(level); +} #endif + +void console_atomic_lock(unsigned int *flags) +{ + prb_lock(&printk_cpulock, flags); +} +EXPORT_SYMBOL(console_atomic_lock); + +void console_atomic_unlock(unsigned int flags) +{ + prb_unlock(&printk_cpulock, flags); +} +EXPORT_SYMBOL(console_atomic_unlock); @ kernel/printk/printk_safe.c:1 @ -// SPDX-License-Identifier: GPL-2.0-or-later -/* - * printk_safe.c - Safe printk for printk-deadlock-prone contexts - */ - -#include <linux/preempt.h> -#include <linux/spinlock.h> -#include <linux/debug_locks.h> -#include <linux/smp.h> -#include <linux/cpumask.h> -#include <linux/irq_work.h> -#include <linux/printk.h> - -#include "internal.h" - -/* - * printk() could not take logbuf_lock in NMI context. Instead, - * it uses an alternative implementation that temporary stores - * the strings into a per-CPU buffer. The content of the buffer - * is later flushed into the main ring buffer via IRQ work. - * - * The alternative implementation is chosen transparently - * by examinig current printk() context mask stored in @printk_context - * per-CPU variable. - * - * The implementation allows to flush the strings also from another CPU. - * There are situations when we want to make sure that all buffers - * were handled or when IRQs are blocked. - */ - -#define SAFE_LOG_BUF_LEN ((1 << CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT) - \ - sizeof(atomic_t) - \ - sizeof(atomic_t) - \ - sizeof(struct irq_work)) - -struct printk_safe_seq_buf { - atomic_t len; /* length of written data */ - atomic_t message_lost; - struct irq_work work; /* IRQ work that flushes the buffer */ - unsigned char buffer[SAFE_LOG_BUF_LEN]; -}; - -static DEFINE_PER_CPU(struct printk_safe_seq_buf, safe_print_seq); -static DEFINE_PER_CPU(int, printk_context); - -#ifdef CONFIG_PRINTK_NMI -static DEFINE_PER_CPU(struct printk_safe_seq_buf, nmi_print_seq); -#endif - -/* Get flushed in a more safe context. */ -static void queue_flush_work(struct printk_safe_seq_buf *s) -{ - if (printk_percpu_data_ready()) - irq_work_queue(&s->work); -} - -/* - * Add a message to per-CPU context-dependent buffer. NMI and printk-safe - * have dedicated buffers, because otherwise printk-safe preempted by - * NMI-printk would have overwritten the NMI messages. - * - * The messages are flushed from irq work (or from panic()), possibly, - * from other CPU, concurrently with printk_safe_log_store(). Should this - * happen, printk_safe_log_store() will notice the buffer->len mismatch - * and repeat the write. - */ -static __printf(2, 0) int printk_safe_log_store(struct printk_safe_seq_buf *s, - const char *fmt, va_list args) -{ - int add; - size_t len; - va_list ap; - -again: - len = atomic_read(&s->len); - - /* The trailing '\0' is not counted into len. */ - if (len >= sizeof(s->buffer) - 1) { - atomic_inc(&s->message_lost); - queue_flush_work(s); - return 0; - } - - /* - * Make sure that all old data have been read before the buffer - * was reset. This is not needed when we just append data. - */ - if (!len) - smp_rmb(); - - va_copy(ap, args); - add = vscnprintf(s->buffer + len, sizeof(s->buffer) - len, fmt, ap); - va_end(ap); - if (!add) - return 0; - - /* - * Do it once again if the buffer has been flushed in the meantime. - * Note that atomic_cmpxchg() is an implicit memory barrier that - * makes sure that the data were written before updating s->len. - */ - if (atomic_cmpxchg(&s->len, len, len + add) != len) - goto again; - - queue_flush_work(s); - return add; -} - -static inline void printk_safe_flush_line(const char *text, int len) -{ - /* - * Avoid any console drivers calls from here, because we may be - * in NMI or printk_safe context (when in panic). The messages - * must go only into the ring buffer at this stage. Consoles will - * get explicitly called later when a crashdump is not generated. - */ - printk_deferred("%.*s", len, text); -} - -/* printk part of the temporary buffer line by line */ -static int printk_safe_flush_buffer(const char *start, size_t len) -{ - const char *c, *end; - bool header; - - c = start; - end = start + len; - header = true; - - /* Print line by line. */ - while (c < end) { - if (*c == '\n') { - printk_safe_flush_line(start, c - start + 1); - start = ++c; - header = true; - continue; - } - - /* Handle continuous lines or missing new line. */ - if ((c + 1 < end) && printk_get_level(c)) { - if (header) { - c = printk_skip_level(c); - continue; - } - - printk_safe_flush_line(start, c - start); - start = c++; - header = true; - continue; - } - - header = false; - c++; - } - - /* Check if there was a partial line. Ignore pure header. */ - if (start < end && !header) { - static const char newline[] = KERN_CONT "\n"; - - printk_safe_flush_line(start, end - start); - printk_safe_flush_line(newline, strlen(newline)); - } - - return len; -} - -static void report_message_lost(struct printk_safe_seq_buf *s) -{ - int lost = atomic_xchg(&s->message_lost, 0); - - if (lost) - printk_deferred("Lost %d message(s)!\n", lost); -} - -/* - * Flush data from the associated per-CPU buffer. The function - * can be called either via IRQ work or independently. - */ -static void __printk_safe_flush(struct irq_work *work) -{ - static raw_spinlock_t read_lock = - __RAW_SPIN_LOCK_INITIALIZER(read_lock); - struct printk_safe_seq_buf *s = - container_of(work, struct printk_safe_seq_buf, work); - unsigned long flags; - size_t len; - int i; - - /* - * The lock has two functions. First, one reader has to flush all - * available message to make the lockless synchronization with - * writers easier. Second, we do not want to mix messages from - * different CPUs. This is especially important when printing - * a backtrace. - */ - raw_spin_lock_irqsave(&read_lock, flags); - - i = 0; -more: - len = atomic_read(&s->len); - - /* - * This is just a paranoid check that nobody has manipulated - * the buffer an unexpected way. If we printed something then - * @len must only increase. Also it should never overflow the - * buffer size. - */ - if ((i && i >= len) || len > sizeof(s->buffer)) { - const char *msg = "printk_safe_flush: internal error\n"; - - printk_safe_flush_line(msg, strlen(msg)); - len = 0; - } - - if (!len) - goto out; /* Someone else has already flushed the buffer. */ - - /* Make sure that data has been written up to the @len */ - smp_rmb(); - i += printk_safe_flush_buffer(s->buffer + i, len - i); - - /* - * Check that nothing has got added in the meantime and truncate - * the buffer. Note that atomic_cmpxchg() is an implicit memory - * barrier that makes sure that the data were copied before - * updating s->len. - */ - if (atomic_cmpxchg(&s->len, len, 0) != len) - goto more; - -out: - report_message_lost(s); - raw_spin_unlock_irqrestore(&read_lock, flags); -} - -/** - * printk_safe_flush - flush all per-cpu nmi buffers. - * - * The buffers are flushed automatically via IRQ work. This function - * is useful only when someone wants to be sure that all buffers have - * been flushed at some point. - */ -void printk_safe_flush(void) -{ - int cpu; - - for_each_possible_cpu(cpu) { -#ifdef CONFIG_PRINTK_NMI - __printk_safe_flush(&per_cpu(nmi_print_seq, cpu).work); -#endif - __printk_safe_flush(&per_cpu(safe_print_seq, cpu).work); - } -} - -/** - * printk_safe_flush_on_panic - flush all per-cpu nmi buffers when the system - * goes down. - * - * Similar to printk_safe_flush() but it can be called even in NMI context when - * the system goes down. It does the best effort to get NMI messages into - * the main ring buffer. - * - * Note that it could try harder when there is only one CPU online. - */ -void printk_safe_flush_on_panic(void) -{ - /* - * Make sure that we could access the main ring buffer. - * Do not risk a double release when more CPUs are up. - */ - if (raw_spin_is_locked(&logbuf_lock)) { - if (num_online_cpus() > 1) - return; - - debug_locks_off(); - raw_spin_lock_init(&logbuf_lock); - } - - printk_safe_flush(); -} - -#ifdef CONFIG_PRINTK_NMI -/* - * Safe printk() for NMI context. It uses a per-CPU buffer to - * store the message. NMIs are not nested, so there is always only - * one writer running. But the buffer might get flushed from another - * CPU, so we need to be careful. - */ -static __printf(1, 0) int vprintk_nmi(const char *fmt, va_list args) -{ - struct printk_safe_seq_buf *s = this_cpu_ptr(&nmi_print_seq); - - return printk_safe_log_store(s, fmt, args); -} - -void notrace printk_nmi_enter(void) -{ - this_cpu_or(printk_context, PRINTK_NMI_CONTEXT_MASK); -} - -void notrace printk_nmi_exit(void) -{ - this_cpu_and(printk_context, ~PRINTK_NMI_CONTEXT_MASK); -} - -/* - * Marks a code that might produce many messages in NMI context - * and the risk of losing them is more critical than eventual - * reordering. - * - * It has effect only when called in NMI context. Then printk() - * will try to store the messages into the main logbuf directly - * and use the per-CPU buffers only as a fallback when the lock - * is not available. - */ -void printk_nmi_direct_enter(void) -{ - if (this_cpu_read(printk_context) & PRINTK_NMI_CONTEXT_MASK) - this_cpu_or(printk_context, PRINTK_NMI_DIRECT_CONTEXT_MASK); -} - -void printk_nmi_direct_exit(void) -{ - this_cpu_and(printk_context, ~PRINTK_NMI_DIRECT_CONTEXT_MASK); -} - -#else - -static __printf(1, 0) int vprintk_nmi(const char *fmt, va_list args) -{ - return 0; -} - -#endif /* CONFIG_PRINTK_NMI */ - -/* - * Lock-less printk(), to avoid deadlocks should the printk() recurse - * into itself. It uses a per-CPU buffer to store the message, just like - * NMI. - */ -static __printf(1, 0) int vprintk_safe(const char *fmt, va_list args) -{ - struct printk_safe_seq_buf *s = this_cpu_ptr(&safe_print_seq); - - return printk_safe_log_store(s, fmt, args); -} - -/* Can be preempted by NMI. */ -void __printk_safe_enter(void) -{ - this_cpu_inc(printk_context); -} - -/* Can be preempted by NMI. */ -void __printk_safe_exit(void) -{ - this_cpu_dec(printk_context); -} - -__printf(1, 0) int vprintk_func(const char *fmt, va_list args) -{ - /* - * Try to use the main logbuf even in NMI. But avoid calling console - * drivers that might have their own locks. - */ - if ((this_cpu_read(printk_context) & PRINTK_NMI_DIRECT_CONTEXT_MASK) && - raw_spin_trylock(&logbuf_lock)) { - int len; - - len = vprintk_store(0, LOGLEVEL_DEFAULT, NULL, 0, fmt, args); - raw_spin_unlock(&logbuf_lock); - defer_console_output(); - return len; - } - - /* Use extra buffer in NMI when logbuf_lock is taken or in safe mode. */ - if (this_cpu_read(printk_context) & PRINTK_NMI_CONTEXT_MASK) - return vprintk_nmi(fmt, args); - - /* Use extra buffer to prevent a recursion deadlock in safe mode. */ - if (this_cpu_read(printk_context) & PRINTK_SAFE_CONTEXT_MASK) - return vprintk_safe(fmt, args); - - /* No obstacles. */ - return vprintk_default(fmt, args); -} - -void __init printk_safe_init(void) -{ - int cpu; - - for_each_possible_cpu(cpu) { - struct printk_safe_seq_buf *s; - - s = &per_cpu(safe_print_seq, cpu); - init_irq_work(&s->work, __printk_safe_flush); - -#ifdef CONFIG_PRINTK_NMI - s = &per_cpu(nmi_print_seq, cpu); - init_irq_work(&s->work, __printk_safe_flush); -#endif - } - - /* Flush pending messages that did not have scheduled IRQ works. */ - printk_safe_flush(); -} @ kernel/ptrace.c:183 @ static bool ptrace_freeze_traced(struct task_struct *task) spin_lock_irq(&task->sighand->siglock); if (task_is_traced(task) && !__fatal_signal_pending(task)) { - task->state = __TASK_TRACED; + unsigned long flags; + + raw_spin_lock_irqsave(&task->pi_lock, flags); + if (task->state & __TASK_TRACED) + task->state = __TASK_TRACED; + else + task->saved_state = __TASK_TRACED; + raw_spin_unlock_irqrestore(&task->pi_lock, flags); ret = true; } spin_unlock_irq(&task->sighand->siglock); @ kernel/rcu/Kconfig:165 @ config RCU_FAST_NO_HZ config RCU_BOOST bool "Enable RCU priority boosting" - depends on RT_MUTEXES && PREEMPT_RCU && RCU_EXPERT - default n + depends on (RT_MUTEXES && PREEMPT_RCU && RCU_EXPERT) || PREEMPT_RT + default y if PREEMPT_RT help This option boosts the priority of preempted RCU readers that block the current preemptible RCU grace period for too long. @ kernel/rcu/rcutorture.c:64 @ MODULE_AUTHOR("Paul E. McKenney <paulmck@linux.ibm.com> and Josh Triplett <josh@ #define RCUTORTURE_RDR_RBH 0x08 /* ... rcu_read_lock_bh(). */ #define RCUTORTURE_RDR_SCHED 0x10 /* ... rcu_read_lock_sched(). */ #define RCUTORTURE_RDR_RCU 0x20 /* ... entering another RCU reader. */ -#define RCUTORTURE_RDR_NBITS 6 /* Number of bits defined above. */ +#define RCUTORTURE_RDR_ATOM_BH 0x40 /* ... disabling bh while atomic */ +#define RCUTORTURE_RDR_ATOM_RBH 0x80 /* ... RBH while atomic */ +#define RCUTORTURE_RDR_NBITS 8 /* Number of bits defined above. */ #define RCUTORTURE_MAX_EXTEND \ (RCUTORTURE_RDR_BH | RCUTORTURE_RDR_IRQ | RCUTORTURE_RDR_PREEMPT | \ - RCUTORTURE_RDR_RBH | RCUTORTURE_RDR_SCHED) + RCUTORTURE_RDR_RBH | RCUTORTURE_RDR_SCHED | \ + RCUTORTURE_RDR_ATOM_BH | RCUTORTURE_RDR_ATOM_RBH) #define RCUTORTURE_RDR_MAX_LOOPS 0x7 /* Maximum reader extensions. */ /* Must be power of two minus one. */ #define RCUTORTURE_RDR_MAX_SEGS (RCUTORTURE_RDR_MAX_LOOPS + 3) @ kernel/rcu/rcutorture.c:1159 @ static void rcutorture_one_extend(int *readstate, int newstate, WARN_ON_ONCE((idxold >> RCUTORTURE_RDR_SHIFT) > 1); rtrsp->rt_readstate = newstate; - /* First, put new protection in place to avoid critical-section gap. */ + /* + * First, put new protection in place to avoid critical-section gap. + * Disable preemption around the ATOM disables to ensure that + * in_atomic() is true. + */ if (statesnew & RCUTORTURE_RDR_BH) local_bh_disable(); + if (statesnew & RCUTORTURE_RDR_RBH) + rcu_read_lock_bh(); if (statesnew & RCUTORTURE_RDR_IRQ) local_irq_disable(); if (statesnew & RCUTORTURE_RDR_PREEMPT) preempt_disable(); - if (statesnew & RCUTORTURE_RDR_RBH) - rcu_read_lock_bh(); if (statesnew & RCUTORTURE_RDR_SCHED) rcu_read_lock_sched(); + preempt_disable(); + if (statesnew & RCUTORTURE_RDR_ATOM_BH) + local_bh_disable(); + if (statesnew & RCUTORTURE_RDR_ATOM_RBH) + rcu_read_lock_bh(); + preempt_enable(); if (statesnew & RCUTORTURE_RDR_RCU) idxnew = cur_ops->readlock() << RCUTORTURE_RDR_SHIFT; - /* Next, remove old protection, irq first due to bh conflict. */ + /* + * Next, remove old protection, in decreasing order of strength + * to avoid unlock paths that aren't safe in the stronger + * context. Disable preemption around the ATOM enables in + * case the context was only atomic due to IRQ disabling. + */ + preempt_disable(); if (statesold & RCUTORTURE_RDR_IRQ) local_irq_enable(); - if (statesold & RCUTORTURE_RDR_BH) + if (statesold & RCUTORTURE_RDR_ATOM_BH) local_bh_enable(); + if (statesold & RCUTORTURE_RDR_ATOM_RBH) + rcu_read_unlock_bh(); + preempt_enable(); if (statesold & RCUTORTURE_RDR_PREEMPT) preempt_enable(); - if (statesold & RCUTORTURE_RDR_RBH) - rcu_read_unlock_bh(); if (statesold & RCUTORTURE_RDR_SCHED) rcu_read_unlock_sched(); + if (statesold & RCUTORTURE_RDR_BH) + local_bh_enable(); + if (statesold & RCUTORTURE_RDR_RBH) + rcu_read_unlock_bh(); if (statesold & RCUTORTURE_RDR_RCU) cur_ops->readunlock(idxold >> RCUTORTURE_RDR_SHIFT); @ kernel/rcu/rcutorture.c:1240 @ rcutorture_extend_mask(int oldmask, struct torture_random_state *trsp) int mask = rcutorture_extend_mask_max(); unsigned long randmask1 = torture_random(trsp) >> 8; unsigned long randmask2 = randmask1 >> 3; + unsigned long preempts = RCUTORTURE_RDR_PREEMPT | RCUTORTURE_RDR_SCHED; + unsigned long preempts_irq = preempts | RCUTORTURE_RDR_IRQ; + unsigned long nonatomic_bhs = RCUTORTURE_RDR_BH | RCUTORTURE_RDR_RBH; + unsigned long atomic_bhs = RCUTORTURE_RDR_ATOM_BH | + RCUTORTURE_RDR_ATOM_RBH; + unsigned long tmp; WARN_ON_ONCE(mask >> RCUTORTURE_RDR_SHIFT); /* Mostly only one bit (need preemption!), sometimes lots of bits. */ @ kernel/rcu/rcutorture.c:1253 @ rcutorture_extend_mask(int oldmask, struct torture_random_state *trsp) mask = mask & randmask2; else mask = mask & (1 << (randmask2 % RCUTORTURE_RDR_NBITS)); - /* Can't enable bh w/irq disabled. */ - if ((mask & RCUTORTURE_RDR_IRQ) && - ((!(mask & RCUTORTURE_RDR_BH) && (oldmask & RCUTORTURE_RDR_BH)) || - (!(mask & RCUTORTURE_RDR_RBH) && (oldmask & RCUTORTURE_RDR_RBH)))) - mask |= RCUTORTURE_RDR_BH | RCUTORTURE_RDR_RBH; + + /* + * Can't enable bh w/irq disabled. + */ + tmp = atomic_bhs | nonatomic_bhs; + if (mask & RCUTORTURE_RDR_IRQ) + mask |= oldmask & tmp; + + /* + * Ideally these sequences would be detected in debug builds + * (regardless of RT), but until then don't stop testing + * them on non-RT. + */ + if (IS_ENABLED(CONFIG_PREEMPT_RT)) { + /* + * Can't release the outermost rcu lock in an irq disabled + * section without preemption also being disabled, if irqs + * had ever been enabled during this RCU critical section + * (could leak a special flag and delay reporting the qs). + */ + if ((oldmask & RCUTORTURE_RDR_RCU) && + (mask & RCUTORTURE_RDR_IRQ) && + !(mask & preempts)) + mask |= RCUTORTURE_RDR_RCU; + + /* Can't modify atomic bh in non-atomic context */ + if ((oldmask & atomic_bhs) && (mask & atomic_bhs) && + !(mask & preempts_irq)) { + mask |= oldmask & preempts_irq; + if (mask & RCUTORTURE_RDR_IRQ) + mask |= oldmask & tmp; + } + if ((mask & atomic_bhs) && !(mask & preempts_irq)) + mask |= RCUTORTURE_RDR_PREEMPT; + + /* Can't modify non-atomic bh in atomic context */ + tmp = nonatomic_bhs; + if (oldmask & preempts_irq) + mask &= ~tmp; + if ((oldmask | mask) & preempts_irq) + mask |= oldmask & tmp; + } + return mask ?: RCUTORTURE_RDR_RCU; } @ kernel/rcu/srcutree.c:28 @ #include <linux/delay.h> #include <linux/module.h> #include <linux/srcu.h> +#include <linux/locallock.h> #include "rcu.h" #include "rcu_segcblist.h" @ kernel/rcu/srcutree.c:739 @ static void srcu_flip(struct srcu_struct *ssp) smp_mb(); /* D */ /* Pairs with C. */ } +static DEFINE_LOCAL_IRQ_LOCK(sp_llock); /* * If SRCU is likely idle, return true, otherwise return false. * @ kernel/rcu/srcutree.c:770 @ static bool srcu_might_be_idle(struct srcu_struct *ssp) unsigned long tlast; /* If the local srcu_data structure has callbacks, not idle. */ - local_irq_save(flags); + local_lock_irqsave(sp_llock, flags); sdp = this_cpu_ptr(ssp->sda); if (rcu_segcblist_pend_cbs(&sdp->srcu_cblist)) { - local_irq_restore(flags); + local_unlock_irqrestore(sp_llock, flags); return false; /* Callbacks already present, so not idle. */ } - local_irq_restore(flags); + local_unlock_irqrestore(sp_llock, flags); /* * No local callbacks, so probabalistically probe global state. @ kernel/rcu/srcutree.c:856 @ static void __call_srcu(struct srcu_struct *ssp, struct rcu_head *rhp, } rhp->func = func; idx = srcu_read_lock(ssp); - local_irq_save(flags); + local_lock_irqsave(sp_llock, flags); sdp = this_cpu_ptr(ssp->sda); spin_lock_rcu_node(sdp); rcu_segcblist_enqueue(&sdp->srcu_cblist, rhp); @ kernel/rcu/srcutree.c:872 @ static void __call_srcu(struct srcu_struct *ssp, struct rcu_head *rhp, sdp->srcu_gp_seq_needed_exp = s; needexp = true; } - spin_unlock_irqrestore_rcu_node(sdp, flags); + spin_unlock_rcu_node(sdp); + local_unlock_irqrestore(sp_llock, flags); if (needgp) srcu_funnel_gp_start(ssp, sdp, s, do_norm); else if (needexp) @ kernel/rcu/tree.c:103 @ static struct rcu_state rcu_state = { static bool dump_tree; module_param(dump_tree, bool, 0444); /* By default, use RCU_SOFTIRQ instead of rcuc kthreads. */ -static bool use_softirq = 1; +static bool use_softirq = !IS_ENABLED(CONFIG_PREEMPT_RT); +#ifndef CONFIG_PREEMPT_RT module_param(use_softirq, bool, 0444); +#endif /* Control rcu_node-tree auto-balancing at boot time. */ static bool rcu_fanout_exact; module_param(rcu_fanout_exact, bool, 0444); @ kernel/rcu/tree.c:1118 @ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp) !rdp->rcu_iw_pending && rdp->rcu_iw_gp_seq != rnp->gp_seq && (rnp->ffmask & rdp->grpmask)) { init_irq_work(&rdp->rcu_iw, rcu_iw_handler); + atomic_or(IRQ_WORK_HARD_IRQ, &rdp->rcu_iw.flags); rdp->rcu_iw_pending = true; rdp->rcu_iw_gp_seq = rnp->gp_seq; irq_work_queue_on(&rdp->rcu_iw, rdp->cpu); @ kernel/rcu/tree.c:2725 @ struct kfree_rcu_cpu_work { struct kfree_rcu_cpu { struct rcu_head *head; struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES]; - spinlock_t lock; + raw_spinlock_t lock; struct delayed_work monitor_work; bool monitor_todo; bool initialized; @ kernel/rcu/tree.c:2747 @ static void kfree_rcu_work(struct work_struct *work) krwp = container_of(to_rcu_work(work), struct kfree_rcu_cpu_work, rcu_work); krcp = krwp->krcp; - spin_lock_irqsave(&krcp->lock, flags); + raw_spin_lock_irqsave(&krcp->lock, flags); head = krwp->head_free; krwp->head_free = NULL; - spin_unlock_irqrestore(&krcp->lock, flags); + raw_spin_unlock_irqrestore(&krcp->lock, flags); // List "head" is now private, so traverse locklessly. for (; head; head = next) { @ kernel/rcu/tree.c:2809 @ static inline void kfree_rcu_drain_unlock(struct kfree_rcu_cpu *krcp, krcp->monitor_todo = false; if (queue_kfree_rcu_work(krcp)) { // Success! Our job is done here. - spin_unlock_irqrestore(&krcp->lock, flags); + raw_spin_unlock_irqrestore(&krcp->lock, flags); return; } // Previous RCU batch still in progress, try again later. krcp->monitor_todo = true; schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES); - spin_unlock_irqrestore(&krcp->lock, flags); + raw_spin_unlock_irqrestore(&krcp->lock, flags); } /* @ kernel/rcu/tree.c:2829 @ static void kfree_rcu_monitor(struct work_struct *work) struct kfree_rcu_cpu *krcp = container_of(work, struct kfree_rcu_cpu, monitor_work.work); - spin_lock_irqsave(&krcp->lock, flags); + raw_spin_lock_irqsave(&krcp->lock, flags); if (krcp->monitor_todo) kfree_rcu_drain_unlock(krcp, flags); else - spin_unlock_irqrestore(&krcp->lock, flags); + raw_spin_unlock_irqrestore(&krcp->lock, flags); } /* @ kernel/rcu/tree.c:2858 @ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) local_irq_save(flags); // For safely calling this_cpu_ptr(). krcp = this_cpu_ptr(&krc); if (krcp->initialized) - spin_lock(&krcp->lock); + raw_spin_lock(&krcp->lock); // Queue the object but don't yet schedule the batch. if (debug_rcu_head_queue(head)) { @ kernel/rcu/tree.c:2880 @ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) unlock_return: if (krcp->initialized) - spin_unlock(&krcp->lock); + raw_spin_unlock(&krcp->lock); local_irq_restore(flags); } EXPORT_SYMBOL_GPL(kfree_call_rcu); @ kernel/rcu/tree.c:2893 @ void __init kfree_rcu_scheduler_running(void) for_each_online_cpu(cpu) { struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu); - spin_lock_irqsave(&krcp->lock, flags); + raw_spin_lock_irqsave(&krcp->lock, flags); if (!krcp->head || krcp->monitor_todo) { - spin_unlock_irqrestore(&krcp->lock, flags); + raw_spin_unlock_irqrestore(&krcp->lock, flags); continue; } krcp->monitor_todo = true; schedule_delayed_work_on(cpu, &krcp->monitor_work, KFREE_DRAIN_JIFFIES); - spin_unlock_irqrestore(&krcp->lock, flags); + raw_spin_unlock_irqrestore(&krcp->lock, flags); } } @ kernel/rcu/tree.c:3786 @ static void __init kfree_rcu_batch_init(void) for_each_possible_cpu(cpu) { struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu); - spin_lock_init(&krcp->lock); + raw_spin_lock_init(&krcp->lock); for (i = 0; i < KFREE_N_BATCHES; i++) krcp->krw_arr[i].krcp = krcp; INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor); @ kernel/rcu/tree_plugin.h:290 @ void rcu_note_context_switch(bool preempt) struct task_struct *t = current; struct rcu_data *rdp = this_cpu_ptr(&rcu_data); struct rcu_node *rnp; + int sleeping_l = 0; trace_rcu_utilization(TPS("Start context switch")); lockdep_assert_irqs_disabled(); - WARN_ON_ONCE(!preempt && rcu_preempt_depth() > 0); +#if defined(CONFIG_PREEMPT_RT) + sleeping_l = t->sleeping_lock; +#endif + WARN_ON_ONCE(!preempt && rcu_preempt_depth() > 0 && !sleeping_l); if (rcu_preempt_depth() > 0 && !t->rcu_read_unlock_special.b.blocked) { @ kernel/rcu/update.c:57 @ #ifndef CONFIG_TINY_RCU module_param(rcu_expedited, int, 0); module_param(rcu_normal, int, 0); -static int rcu_normal_after_boot; +static int rcu_normal_after_boot = IS_ENABLED(CONFIG_PREEMPT_RT); +#ifndef CONFIG_PREEMPT_RT module_param(rcu_normal_after_boot, int, 0); +#endif #endif /* #ifndef CONFIG_TINY_RCU */ #ifdef CONFIG_DEBUG_LOCK_ALLOC @ kernel/sched/completion.c:32 @ void complete(struct completion *x) { unsigned long flags; - spin_lock_irqsave(&x->wait.lock, flags); + raw_spin_lock_irqsave(&x->wait.lock, flags); if (x->done != UINT_MAX) x->done++; - __wake_up_locked(&x->wait, TASK_NORMAL, 1); - spin_unlock_irqrestore(&x->wait.lock, flags); + swake_up_locked(&x->wait); + raw_spin_unlock_irqrestore(&x->wait.lock, flags); } EXPORT_SYMBOL(complete); @ kernel/sched/completion.c:61 @ void complete_all(struct completion *x) { unsigned long flags; - spin_lock_irqsave(&x->wait.lock, flags); + raw_spin_lock_irqsave(&x->wait.lock, flags); x->done = UINT_MAX; - __wake_up_locked(&x->wait, TASK_NORMAL, 0); - spin_unlock_irqrestore(&x->wait.lock, flags); + swake_up_all_locked(&x->wait); + raw_spin_unlock_irqrestore(&x->wait.lock, flags); } EXPORT_SYMBOL(complete_all); @ kernel/sched/completion.c:73 @ do_wait_for_common(struct completion *x, long (*action)(long), long timeout, int state) { if (!x->done) { - DECLARE_WAITQUEUE(wait, current); + DECLARE_SWAITQUEUE(wait); - __add_wait_queue_entry_tail_exclusive(&x->wait, &wait); do { if (signal_pending_state(state, current)) { timeout = -ERESTARTSYS; break; } + __prepare_to_swait(&x->wait, &wait); __set_current_state(state); - spin_unlock_irq(&x->wait.lock); + raw_spin_unlock_irq(&x->wait.lock); timeout = action(timeout); - spin_lock_irq(&x->wait.lock); + raw_spin_lock_irq(&x->wait.lock); } while (!x->done && timeout); - __remove_wait_queue(&x->wait, &wait); + __finish_swait(&x->wait, &wait); if (!x->done) return timeout; } @ kernel/sched/completion.c:103 @ __wait_for_common(struct completion *x, complete_acquire(x); - spin_lock_irq(&x->wait.lock); + raw_spin_lock_irq(&x->wait.lock); timeout = do_wait_for_common(x, action, timeout, state); - spin_unlock_irq(&x->wait.lock); + raw_spin_unlock_irq(&x->wait.lock); complete_release(x); @ kernel/sched/completion.c:294 @ bool try_wait_for_completion(struct completion *x) if (!READ_ONCE(x->done)) return false; - spin_lock_irqsave(&x->wait.lock, flags); + raw_spin_lock_irqsave(&x->wait.lock, flags); if (!x->done) ret = false; else if (x->done != UINT_MAX) x->done--; - spin_unlock_irqrestore(&x->wait.lock, flags); + raw_spin_unlock_irqrestore(&x->wait.lock, flags); return ret; } EXPORT_SYMBOL(try_wait_for_completion); @ kernel/sched/completion.c:325 @ bool completion_done(struct completion *x) * otherwise we can end up freeing the completion before complete() * is done referencing it. */ - spin_lock_irqsave(&x->wait.lock, flags); - spin_unlock_irqrestore(&x->wait.lock, flags); + raw_spin_lock_irqsave(&x->wait.lock, flags); + raw_spin_unlock_irqrestore(&x->wait.lock, flags); return true; } EXPORT_SYMBOL(completion_done); @ kernel/sched/core.c:60 @ const_debug unsigned int sysctl_sched_features = * Number of tasks to iterate in a single balance run. * Limited because this is done with IRQs disabled. */ +#ifdef CONFIG_PREEMPT_RT +const_debug unsigned int sysctl_sched_nr_migrate = 8; +#else const_debug unsigned int sysctl_sched_nr_migrate = 32; +#endif /* * period over which we measure -rt task CPU usage in us. @ kernel/sched/core.c:418 @ static bool set_nr_if_polling(struct task_struct *p) #endif #endif -static bool __wake_q_add(struct wake_q_head *head, struct task_struct *task) +static bool __wake_q_add(struct wake_q_head *head, struct task_struct *task, + bool sleeper) { - struct wake_q_node *node = &task->wake_q; + struct wake_q_node *node; + + if (sleeper) + node = &task->wake_q_sleeper; + else + node = &task->wake_q; /* * Atomically grab the task, if ->wake_q is !nil already it means @ kernel/sched/core.c:462 @ static bool __wake_q_add(struct wake_q_head *head, struct task_struct *task) */ void wake_q_add(struct wake_q_head *head, struct task_struct *task) { - if (__wake_q_add(head, task)) + if (__wake_q_add(head, task, false)) + get_task_struct(task); +} + +void wake_q_add_sleeper(struct wake_q_head *head, struct task_struct *task) +{ + if (__wake_q_add(head, task, true)) get_task_struct(task); } @ kernel/sched/core.c:491 @ void wake_q_add(struct wake_q_head *head, struct task_struct *task) */ void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task) { - if (!__wake_q_add(head, task)) + if (!__wake_q_add(head, task, false)) put_task_struct(task); } -void wake_up_q(struct wake_q_head *head) +void __wake_up_q(struct wake_q_head *head, bool sleeper) { struct wake_q_node *node = head->first; while (node != WAKE_Q_TAIL) { struct task_struct *task; - task = container_of(node, struct task_struct, wake_q); + if (sleeper) + task = container_of(node, struct task_struct, wake_q_sleeper); + else + task = container_of(node, struct task_struct, wake_q); + BUG_ON(!task); /* Task can safely be re-inserted now: */ node = node->next; - task->wake_q.next = NULL; + if (sleeper) + task->wake_q_sleeper.next = NULL; + else + task->wake_q.next = NULL; /* * wake_up_process() executes a full barrier, which pairs with * the queueing in wake_q_add() so as not to miss wakeups. */ - wake_up_process(task); + if (sleeper) + wake_up_lock_sleeper(task); + else + wake_up_process(task); + put_task_struct(task); } } @ kernel/sched/core.c:559 @ void resched_curr(struct rq *rq) trace_sched_wake_idle_without_ipi(cpu); } +#ifdef CONFIG_PREEMPT_LAZY + +static int tsk_is_polling(struct task_struct *p) +{ +#ifdef TIF_POLLING_NRFLAG + return test_tsk_thread_flag(p, TIF_POLLING_NRFLAG); +#else + return 0; +#endif +} + +void resched_curr_lazy(struct rq *rq) +{ + struct task_struct *curr = rq->curr; + int cpu; + + if (!sched_feat(PREEMPT_LAZY)) { + resched_curr(rq); + return; + } + + lockdep_assert_held(&rq->lock); + + if (test_tsk_need_resched(curr)) + return; + + if (test_tsk_need_resched_lazy(curr)) + return; + + set_tsk_need_resched_lazy(curr); + + cpu = cpu_of(rq); + if (cpu == smp_processor_id()) + return; + + /* NEED_RESCHED_LAZY must be visible before we test polling */ + smp_mb(); + if (!tsk_is_polling(curr)) + smp_send_reschedule(cpu); +} +#endif + void resched_cpu(int cpu) { struct rq *rq = cpu_rq(cpu); @ kernel/sched/core.c:1523 @ static inline bool is_cpu_allowed(struct task_struct *p, int cpu) if (!cpumask_test_cpu(cpu, p->cpus_ptr)) return false; - if (is_per_cpu_kthread(p)) + if (is_per_cpu_kthread(p) || __migrate_disabled(p)) return cpu_online(cpu); return cpu_active(cpu); @ kernel/sched/core.c:1572 @ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf, struct migration_arg { struct task_struct *task; int dest_cpu; + bool done; }; /* @ kernel/sched/core.c:1608 @ static int migration_cpu_stop(void *data) struct task_struct *p = arg->task; struct rq *rq = this_rq(); struct rq_flags rf; + int dest_cpu = arg->dest_cpu; + + /* We don't look at arg after this point. */ + smp_mb(); + arg->done = true; /* * The original target CPU might have gone down and we might @ kernel/sched/core.c:1635 @ static int migration_cpu_stop(void *data) */ if (task_rq(p) == rq) { if (task_on_rq_queued(p)) - rq = __migrate_task(rq, &rf, p, arg->dest_cpu); + rq = __migrate_task(rq, &rf, p, dest_cpu); else - p->wake_cpu = arg->dest_cpu; + p->wake_cpu = dest_cpu; } rq_unlock(rq, &rf); raw_spin_unlock(&p->pi_lock); @ kernel/sched/core.c:1653 @ static int migration_cpu_stop(void *data) void set_cpus_allowed_common(struct task_struct *p, const struct cpumask *new_mask) { cpumask_copy(&p->cpus_mask, new_mask); - p->nr_cpus_allowed = cpumask_weight(new_mask); + if (p->cpus_ptr == &p->cpus_mask) + p->nr_cpus_allowed = cpumask_weight(new_mask); } +#if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT) +int __migrate_disabled(struct task_struct *p) +{ + return p->migrate_disable; +} +EXPORT_SYMBOL_GPL(__migrate_disabled); +#endif + void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask) { struct rq *rq = task_rq(p); @ kernel/sched/core.c:1731 @ static int __set_cpus_allowed_ptr(struct task_struct *p, goto out; } - if (cpumask_equal(p->cpus_ptr, new_mask)) + if (cpumask_equal(&p->cpus_mask, new_mask)) goto out; dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask); @ kernel/sched/core.c:1753 @ static int __set_cpus_allowed_ptr(struct task_struct *p, } /* Can the task run on the task's current CPU? If so, we're done */ - if (cpumask_test_cpu(task_cpu(p), new_mask)) + if (cpumask_test_cpu(task_cpu(p), new_mask) || + p->cpus_ptr != &p->cpus_mask) goto out; if (task_running(rq, p) || p->state == TASK_WAKING) { @ kernel/sched/core.c:1951 @ int migrate_swap(struct task_struct *cur, struct task_struct *p, } #endif /* CONFIG_NUMA_BALANCING */ +static bool check_task_state(struct task_struct *p, long match_state) +{ + bool match = false; + + raw_spin_lock_irq(&p->pi_lock); + if (p->state == match_state || p->saved_state == match_state) + match = true; + raw_spin_unlock_irq(&p->pi_lock); + + return match; +} + /* * wait_task_inactive - wait for a thread to unschedule. * @ kernel/sched/core.c:2007 @ unsigned long wait_task_inactive(struct task_struct *p, long match_state) * is actually now running somewhere else! */ while (task_running(rq, p)) { - if (match_state && unlikely(p->state != match_state)) + if (match_state && !check_task_state(p, match_state)) return 0; cpu_relax(); } @ kernel/sched/core.c:2022 @ unsigned long wait_task_inactive(struct task_struct *p, long match_state) running = task_running(rq, p); queued = task_on_rq_queued(p); ncsw = 0; - if (!match_state || p->state == match_state) + if (!match_state || p->state == match_state || + p->saved_state == match_state) ncsw = p->nvcsw | LONG_MIN; /* sets MSB */ task_rq_unlock(rq, p, &rf); @ kernel/sched/core.c:2611 @ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) int cpu, success = 0; preempt_disable(); + +#ifndef CONFIG_PREEMPT_RT if (p == current) { /* * We're waking current, this means 'p->on_rq' and 'task_cpu(p) @ kernel/sched/core.c:2635 @ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) trace_sched_wakeup(p); goto out; } - +#endif /* * If we are going to wake up a thread waiting for CONDITION we * need to ensure that CONDITION=1 done by the caller can not be @ kernel/sched/core.c:2644 @ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) */ raw_spin_lock_irqsave(&p->pi_lock, flags); smp_mb__after_spinlock(); - if (!(p->state & state)) - goto unlock; + if (!(p->state & state)) { + /* + * The task might be running due to a spinlock sleeper + * wakeup. Check the saved state and set it to running + * if the wakeup condition is true. + */ + if (!(wake_flags & WF_LOCK_SLEEPER)) { + if (p->saved_state & state) { + p->saved_state = TASK_RUNNING; + success = 1; + } + } + raw_spin_unlock_irqrestore(&p->pi_lock, flags); + goto out_nostat; + } + /* + * If this is a regular wakeup, then we can unconditionally + * clear the saved state of a "lock sleeper". + */ + if (!(wake_flags & WF_LOCK_SLEEPER)) + p->saved_state = TASK_RUNNING; trace_sched_waking(p); @ kernel/sched/core.c:2756 @ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) ttwu_queue(p, cpu, wake_flags); unlock: raw_spin_unlock_irqrestore(&p->pi_lock, flags); +#ifndef CONFIG_PREEMPT_RT out: +#endif if (success) ttwu_stat(p, cpu, wake_flags); +out_nostat: preempt_enable(); return success; @ kernel/sched/core.c:2784 @ int wake_up_process(struct task_struct *p) } EXPORT_SYMBOL(wake_up_process); +/** + * wake_up_lock_sleeper - Wake up a specific process blocked on a "sleeping lock" + * @p: The process to be woken up. + * + * Same as wake_up_process() above, but wake_flags=WF_LOCK_SLEEPER to indicate + * the nature of the wakeup. + */ +int wake_up_lock_sleeper(struct task_struct *p) +{ + return try_to_wake_up(p, TASK_UNINTERRUPTIBLE, WF_LOCK_SLEEPER); +} + int wake_up_state(struct task_struct *p, unsigned int state) { return try_to_wake_up(p, state, 0); @ kernel/sched/core.c:3038 @ int sched_fork(unsigned long clone_flags, struct task_struct *p) p->on_cpu = 0; #endif init_task_preempt_count(p); +#ifdef CONFIG_HAVE_PREEMPT_LAZY + task_thread_info(p)->preempt_lazy_count = 0; +#endif #ifdef CONFIG_SMP plist_node_init(&p->pushable_tasks, MAX_PRIO); RB_CLEAR_NODE(&p->pushable_dl_tasks); @ kernel/sched/core.c:3368 @ static struct rq *finish_task_switch(struct task_struct *prev) * provided by mmdrop(), * - a sync_core for SYNC_CORE. */ + /* + * We use mmdrop_delayed() here so we don't have to do the + * full __mmdrop() when we are the last user. + */ if (mm) { membarrier_mm_sync_core_before_usermode(mm); - mmdrop(mm); + mmdrop_delayed(mm); } if (unlikely(prev_state == TASK_DEAD)) { if (prev->sched_class->task_dead) prev->sched_class->task_dead(prev); - /* - * Remove function-return probe instances associated with this - * task and put them back on the free list. - */ - kprobe_flush_task(prev); - - /* Task is done with its stack. */ - put_task_stack(prev); - put_task_struct_rcu_user(prev); } @ kernel/sched/core.c:4087 @ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) BUG(); } +static void migrate_disabled_sched(struct task_struct *p); + /* * __schedule() is the main scheduler function. * @ kernel/sched/core.c:4159 @ static void __sched notrace __schedule(bool preempt) rq_lock(rq, &rf); smp_mb__after_spinlock(); + if (__migrate_disabled(prev)) + migrate_disabled_sched(prev); + /* Promote REQ to ACT */ rq->clock_update_flags <<= 1; update_rq_clock(rq); @ kernel/sched/core.c:4183 @ static void __sched notrace __schedule(bool preempt) next = pick_next_task(rq, prev, &rf); clear_tsk_need_resched(prev); + clear_tsk_need_resched_lazy(prev); clear_preempt_need_resched(); if (likely(prev != next)) { @ kernel/sched/core.c:4378 @ static void __sched notrace preempt_schedule_common(void) } while (need_resched()); } +#ifdef CONFIG_PREEMPT_LAZY +/* + * If TIF_NEED_RESCHED is then we allow to be scheduled away since this is + * set by a RT task. Oterwise we try to avoid beeing scheduled out as long as + * preempt_lazy_count counter >0. + */ +static __always_inline int preemptible_lazy(void) +{ + if (test_thread_flag(TIF_NEED_RESCHED)) + return 1; + if (current_thread_info()->preempt_lazy_count) + return 0; + return 1; +} + +#else + +static inline int preemptible_lazy(void) +{ + return 1; +} + +#endif + #ifdef CONFIG_PREEMPTION /* * This is the entry point to schedule() from in-kernel preemption @ kernel/sched/core.c:4415 @ asmlinkage __visible void __sched notrace preempt_schedule(void) */ if (likely(!preemptible())) return; - + if (!preemptible_lazy()) + return; preempt_schedule_common(); } NOKPROBE_SYMBOL(preempt_schedule); @ kernel/sched/core.c:4443 @ asmlinkage __visible void __sched notrace preempt_schedule_notrace(void) if (likely(!preemptible())) return; + if (!preemptible_lazy()) + return; + do { /* * Because the function tracer can trace preempt_count_sub() @ kernel/sched/core.c:6235 @ void init_idle(struct task_struct *idle, int cpu) /* Set the preempt count _outside_ the spinlocks! */ init_idle_preempt_count(idle, cpu); - +#ifdef CONFIG_HAVE_PREEMPT_LAZY + task_thread_info(idle)->preempt_lazy_count = 0; +#endif /* * The idle tasks have their own, simple scheduling class: */ @ kernel/sched/core.c:6342 @ void sched_setnuma(struct task_struct *p, int nid) #endif /* CONFIG_NUMA_BALANCING */ #ifdef CONFIG_HOTPLUG_CPU +static DEFINE_PER_CPU(struct mm_struct *, idle_last_mm); + /* * Ensure that the idle task is using init_mm right before its CPU goes * offline. @ kernel/sched/core.c:6359 @ void idle_task_exit(void) current->active_mm = &init_mm; finish_arch_post_lock_switch(); } - mmdrop(mm); + /* + * Defer the cleanup to an alive cpu. On RT we can neither + * call mmdrop() nor mmdrop_delayed() from here. + */ + per_cpu(idle_last_mm, smp_processor_id()) = mm; } /* @ kernel/sched/core.c:6441 @ static void migrate_tasks(struct rq *dead_rq, struct rq_flags *rf) break; next = __pick_migrate_task(rq); + WARN_ON_ONCE(__migrate_disabled(next)); /* * Rules for changing task_struct::cpus_mask are holding @ kernel/sched/core.c:6670 @ int sched_cpu_dying(unsigned int cpu) update_max_interval(); nohz_balance_exit_idle(rq); hrtick_clear(rq); + if (per_cpu(idle_last_mm, cpu)) { + mmdrop_delayed(per_cpu(idle_last_mm, cpu)); + per_cpu(idle_last_mm, cpu) = NULL; + } return 0; } #endif @ kernel/sched/core.c:6905 @ void __init sched_init(void) #ifdef CONFIG_DEBUG_ATOMIC_SLEEP static inline int preempt_count_equals(int preempt_offset) { - int nested = preempt_count() + rcu_preempt_depth(); + int nested = preempt_count() + sched_rcu_preempt_depth(); return (nested == preempt_offset); } @ kernel/sched/core.c:8144 @ const u32 sched_prio_to_wmult[40] = { }; #undef CREATE_TRACE_POINTS + +#if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT) + +static inline void +update_nr_migratory(struct task_struct *p, long delta) +{ + if (unlikely((p->sched_class == &rt_sched_class || + p->sched_class == &dl_sched_class) && + p->nr_cpus_allowed > 1)) { + if (p->sched_class == &rt_sched_class) + task_rq(p)->rt.rt_nr_migratory += delta; + else + task_rq(p)->dl.dl_nr_migratory += delta; + } +} + +static inline void +migrate_disable_update_cpus_allowed(struct task_struct *p) +{ + p->cpus_ptr = cpumask_of(smp_processor_id()); + update_nr_migratory(p, -1); + p->nr_cpus_allowed = 1; +} + +static inline void +migrate_enable_update_cpus_allowed(struct task_struct *p) +{ + struct rq *rq; + struct rq_flags rf; + + rq = task_rq_lock(p, &rf); + p->cpus_ptr = &p->cpus_mask; + p->nr_cpus_allowed = cpumask_weight(&p->cpus_mask); + update_nr_migratory(p, 1); + task_rq_unlock(rq, p, &rf); +} + +void migrate_disable(void) +{ + preempt_disable(); + + if (++current->migrate_disable == 1) { + this_rq()->nr_pinned++; + preempt_lazy_disable(); +#ifdef CONFIG_SCHED_DEBUG + WARN_ON_ONCE(current->pinned_on_cpu >= 0); + current->pinned_on_cpu = smp_processor_id(); +#endif + } + + preempt_enable(); +} +EXPORT_SYMBOL(migrate_disable); + +static void migrate_disabled_sched(struct task_struct *p) +{ + if (p->migrate_disable_scheduled) + return; + + migrate_disable_update_cpus_allowed(p); + p->migrate_disable_scheduled = 1; +} + +static DEFINE_PER_CPU(struct cpu_stop_work, migrate_work); +static DEFINE_PER_CPU(struct migration_arg, migrate_arg); + +void migrate_enable(void) +{ + struct task_struct *p = current; + struct rq *rq = this_rq(); + int cpu = task_cpu(p); + + WARN_ON_ONCE(p->migrate_disable <= 0); + if (p->migrate_disable > 1) { + p->migrate_disable--; + return; + } + + preempt_disable(); + +#ifdef CONFIG_SCHED_DEBUG + WARN_ON_ONCE(current->pinned_on_cpu != cpu); + current->pinned_on_cpu = -1; +#endif + + WARN_ON_ONCE(rq->nr_pinned < 1); + + p->migrate_disable = 0; + rq->nr_pinned--; +#ifdef CONFIG_HOTPLUG_CPU + if (rq->nr_pinned == 0 && unlikely(!cpu_active(cpu)) && + takedown_cpu_task) + wake_up_process(takedown_cpu_task); +#endif + + if (!p->migrate_disable_scheduled) + goto out; + + p->migrate_disable_scheduled = 0; + + migrate_enable_update_cpus_allowed(p); + + WARN_ON(smp_processor_id() != cpu); + if (!is_cpu_allowed(p, cpu)) { + struct migration_arg __percpu *arg; + struct cpu_stop_work __percpu *work; + struct rq_flags rf; + + work = this_cpu_ptr(&migrate_work); + arg = this_cpu_ptr(&migrate_arg); + WARN_ON_ONCE(!arg->done && !work->disabled && work->arg); + + arg->task = p; + arg->done = false; + + rq = task_rq_lock(p, &rf); + update_rq_clock(rq); + arg->dest_cpu = select_fallback_rq(cpu, p); + task_rq_unlock(rq, p, &rf); + + stop_one_cpu_nowait(task_cpu(p), migration_cpu_stop, + arg, work); + } + +out: + preempt_lazy_enable(); + preempt_enable(); +} +EXPORT_SYMBOL(migrate_enable); + +int cpu_nr_pinned(int cpu) +{ + struct rq *rq = cpu_rq(cpu); + + return rq->nr_pinned; +} + +#elif !defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT) +static void migrate_disabled_sched(struct task_struct *p) +{ +} + +void migrate_disable(void) +{ +#ifdef CONFIG_SCHED_DEBUG + current->migrate_disable++; +#endif + barrier(); +} +EXPORT_SYMBOL(migrate_disable); + +void migrate_enable(void) +{ +#ifdef CONFIG_SCHED_DEBUG + struct task_struct *p = current; + + WARN_ON_ONCE(p->migrate_disable <= 0); + p->migrate_disable--; +#endif + barrier(); +} +EXPORT_SYMBOL(migrate_enable); + +#else +static void migrate_disabled_sched(struct task_struct *p) +{ +} + +#endif @ kernel/sched/debug.c:968 @ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns, P(dl.runtime); P(dl.deadline); } +#if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT) + P(migrate_disable); +#endif + P(nr_cpus_allowed); #undef PN_SCHEDSTAT #undef PN #undef __PN @ kernel/sched/fair.c:4163 @ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr) ideal_runtime = sched_slice(cfs_rq, curr); delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime; if (delta_exec > ideal_runtime) { - resched_curr(rq_of(cfs_rq)); + resched_curr_lazy(rq_of(cfs_rq)); /* * The current task ran long enough, ensure it doesn't get * re-elected due to buddy favours. @ kernel/sched/fair.c:4187 @ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr) return; if (delta > ideal_runtime) - resched_curr(rq_of(cfs_rq)); + resched_curr_lazy(rq_of(cfs_rq)); } static void @ kernel/sched/fair.c:4330 @ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued) * validating it and just reschedule. */ if (queued) { - resched_curr(rq_of(cfs_rq)); + resched_curr_lazy(rq_of(cfs_rq)); return; } /* @ kernel/sched/fair.c:4455 @ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) * hierarchy can be throttled */ if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr)) - resched_curr(rq_of(cfs_rq)); + resched_curr_lazy(rq_of(cfs_rq)); } static __always_inline @ kernel/sched/fair.c:5179 @ static void hrtick_start_fair(struct rq *rq, struct task_struct *p) if (delta < 0) { if (rq->curr == p) - resched_curr(rq); + resched_curr_lazy(rq); return; } hrtick_start(rq, delta); @ kernel/sched/fair.c:6689 @ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_ return; preempt: - resched_curr(rq); + resched_curr_lazy(rq); /* * Only set the backward buddy when the current task is still * on the rq. This can happen when a wakeup gets interleaved @ kernel/sched/fair.c:10388 @ static void task_fork_fair(struct task_struct *p) * 'current' within the tree based on its new key value. */ swap(curr->vruntime, se->vruntime); - resched_curr(rq); + resched_curr_lazy(rq); } se->vruntime -= cfs_rq->min_vruntime; @ kernel/sched/fair.c:10415 @ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio) */ if (rq->curr == p) { if (p->prio > oldprio) - resched_curr(rq); + resched_curr_lazy(rq); } else check_preempt_curr(rq, p, 0); } @ kernel/sched/features.h:48 @ SCHED_FEAT(DOUBLE_TICK, false) */ SCHED_FEAT(NONTASK_CAPACITY, true) +#ifdef CONFIG_PREEMPT_RT +SCHED_FEAT(TTWU_QUEUE, false) +# ifdef CONFIG_PREEMPT_LAZY +SCHED_FEAT(PREEMPT_LAZY, true) +# endif +#else + /* * Queue remote wakeups on the target CPU and process them * using the scheduler IPI. Reduces rq->lock contention/bounces. */ SCHED_FEAT(TTWU_QUEUE, true) +#endif /* * When doing wakeups, attempt to limit superfluous scans of the LLC domain. @ kernel/sched/sched.h:1008 @ struct rq { /* Must be inspected within a rcu lock section */ struct cpuidle_state *idle_state; #endif + +#if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP) + int nr_pinned; +#endif }; #ifdef CONFIG_FAIR_GROUP_SCHED @ kernel/sched/sched.h:1657 @ static inline int task_on_rq_migrating(struct task_struct *p) #define WF_SYNC 0x01 /* Waker goes to sleep after wakeup */ #define WF_FORK 0x02 /* Child wakeup after fork */ #define WF_MIGRATED 0x4 /* Internal use, task got migrated */ +#define WF_LOCK_SLEEPER 0x08 /* wakeup spinlock "sleeper" */ /* * To aid in avoiding the subversion of "niceness" due to uneven distribution @ kernel/sched/sched.h:1876 @ extern void reweight_task(struct task_struct *p, int prio); extern void resched_curr(struct rq *rq); extern void resched_cpu(int cpu); +#ifdef CONFIG_PREEMPT_LAZY +extern void resched_curr_lazy(struct rq *rq); +#else +static inline void resched_curr_lazy(struct rq *rq) +{ + resched_curr(rq); +} +#endif + extern struct rt_bandwidth def_rt_bandwidth; extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime); @ kernel/sched/swait.c:35 @ void swake_up_locked(struct swait_queue_head *q) } EXPORT_SYMBOL(swake_up_locked); +void swake_up_all_locked(struct swait_queue_head *q) +{ + struct swait_queue *curr; + + while (!list_empty(&q->task_list)) { + + curr = list_first_entry(&q->task_list, typeof(*curr), + task_list); + wake_up_process(curr->task); + list_del_init(&curr->task_list); + } +} +EXPORT_SYMBOL(swake_up_all_locked); + void swake_up_one(struct swait_queue_head *q) { unsigned long flags; @ kernel/sched/swait.c:68 @ void swake_up_all(struct swait_queue_head *q) struct swait_queue *curr; LIST_HEAD(tmp); + WARN_ON(irqs_disabled()); raw_spin_lock_irq(&q->lock); list_splice_init(&q->task_list, &tmp); while (!list_empty(&tmp)) { @ kernel/sched/swait.c:87 @ void swake_up_all(struct swait_queue_head *q) } EXPORT_SYMBOL(swake_up_all); -static void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait) +void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait) { wait->task = current; if (list_empty(&wait->task_list)) @ kernel/sched/topology.c:505 @ static int init_rootdomain(struct root_domain *rd) rd->rto_cpu = -1; raw_spin_lock_init(&rd->rto_lock); init_irq_work(&rd->rto_push_work, rto_push_irq_work_func); + atomic_or(IRQ_WORK_HARD_IRQ, &rd->rto_push_work.flags); #endif init_dl_bw(&rd->dl_bw); @ kernel/seccomp.c:271 @ static u32 seccomp_run_filters(const struct seccomp_data *sd, * All filters in the list are evaluated and the lowest BPF return * value always takes priority (ignoring the DATA). */ - preempt_disable(); for (; f; f = f->prev) { - u32 cur_ret = BPF_PROG_RUN(f->prog, sd); + u32 cur_ret = bpf_prog_run_pin_on_cpu(f->prog, sd); if (ACTION_ONLY(cur_ret) < ACTION_ONLY(ret)) { ret = cur_ret; *match = f; } } - preempt_enable(); return ret; } #endif /* CONFIG_SECCOMP_FILTER */ @ kernel/signal.c:23 @ #include <linux/sched/task.h> #include <linux/sched/task_stack.h> #include <linux/sched/cputime.h> +#include <linux/sched/rt.h> #include <linux/file.h> #include <linux/fs.h> #include <linux/proc_fs.h> @ kernel/signal.c:407 @ void task_join_group_stop(struct task_struct *task) } } +static inline struct sigqueue *get_task_cache(struct task_struct *t) +{ + struct sigqueue *q = t->sigqueue_cache; + + if (cmpxchg(&t->sigqueue_cache, q, NULL) != q) + return NULL; + return q; +} + +static inline int put_task_cache(struct task_struct *t, struct sigqueue *q) +{ + if (cmpxchg(&t->sigqueue_cache, NULL, q) == NULL) + return 0; + return 1; +} + /* * allocate a new signal queue record * - this may be called without locks if and only if t == current, otherwise an * appropriate lock must be held to stop the target task from exiting */ static struct sigqueue * -__sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimit) +__sigqueue_do_alloc(int sig, struct task_struct *t, gfp_t flags, + int override_rlimit, int fromslab) { struct sigqueue *q = NULL; struct user_struct *user; @ kernel/signal.c:452 @ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimi rcu_read_unlock(); if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) { - q = kmem_cache_alloc(sigqueue_cachep, flags); + if (!fromslab) + q = get_task_cache(t); + if (!q) + q = kmem_cache_alloc(sigqueue_cachep, flags); } else { print_dropped_signal(sig); } @ kernel/signal.c:472 @ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimi return q; } +static struct sigqueue * +__sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, + int override_rlimit) +{ + return __sigqueue_do_alloc(sig, t, flags, override_rlimit, 0); +} + static void __sigqueue_free(struct sigqueue *q) { if (q->flags & SIGQUEUE_PREALLOC) @ kernel/signal.c:488 @ static void __sigqueue_free(struct sigqueue *q) kmem_cache_free(sigqueue_cachep, q); } +static void sigqueue_free_current(struct sigqueue *q) +{ + struct user_struct *up; + + if (q->flags & SIGQUEUE_PREALLOC) + return; + + up = q->user; + if (rt_prio(current->normal_prio) && !put_task_cache(current, q)) { + if (atomic_dec_and_test(&up->sigpending)) + free_uid(up); + } else + __sigqueue_free(q); +} + void flush_sigqueue(struct sigpending *queue) { struct sigqueue *q; @ kernel/signal.c:515 @ void flush_sigqueue(struct sigpending *queue) } } +/* + * Called from __exit_signal. Flush tsk->pending and + * tsk->sigqueue_cache + */ +void flush_task_sigqueue(struct task_struct *tsk) +{ + struct sigqueue *q; + + flush_sigqueue(&tsk->pending); + + q = get_task_cache(tsk); + if (q) + kmem_cache_free(sigqueue_cachep, q); +} + /* * Flush all pending signals for this kthread. */ @ kernel/signal.c:654 @ static void collect_signal(int sig, struct sigpending *list, kernel_siginfo_t *i (info->si_code == SI_TIMER) && (info->si_sys_private); - __sigqueue_free(first); + sigqueue_free_current(first); } else { /* * Ok, it wasn't in the queue. This must be @ kernel/signal.c:691 @ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, kernel_siginfo_t *in bool resched_timer = false; int signr; + WARN_ON_ONCE(tsk != current); + /* We only dequeue private signals from ourselves, we don't let * signalfd steal them */ @ kernel/signal.c:1376 @ force_sig_info_to_task(struct kernel_siginfo *info, struct task_struct *t) struct k_sigaction *action; int sig = info->si_signo; + /* + * On some archs, PREEMPT_RT has to delay sending a signal from a trap + * since it can not enable preemption, and the signal code's spin_locks + * turn into mutexes. Instead, it must set TIF_NOTIFY_RESUME which will + * send the signal on exit of the trap. + */ +#ifdef ARCH_RT_DELAYS_SIGNAL_SEND + if (in_atomic()) { + struct task_struct *t = current; + + if (WARN_ON_ONCE(t->forced_info.si_signo)) + return 0; + + if (is_si_special(info)) { + WARN_ON_ONCE(info != SEND_SIG_PRIV); + t->forced_info.si_signo = info->si_signo; + t->forced_info.si_errno = 0; + t->forced_info.si_code = SI_KERNEL; + t->forced_info.si_pid = 0; + t->forced_info.si_uid = 0; + } else { + t->forced_info = *info; + } + + set_tsk_thread_flag(t, TIF_NOTIFY_RESUME); + return 0; + } +#endif spin_lock_irqsave(&t->sighand->siglock, flags); action = &t->sighand->action[sig-1]; ignored = action->sa.sa_handler == SIG_IGN; @ kernel/signal.c:1901 @ EXPORT_SYMBOL(kill_pid); */ struct sigqueue *sigqueue_alloc(void) { - struct sigqueue *q = __sigqueue_alloc(-1, current, GFP_KERNEL, 0); + /* Preallocated sigqueue objects always from the slabcache ! */ + struct sigqueue *q = __sigqueue_do_alloc(-1, current, GFP_KERNEL, 0, 1); if (q) q->flags |= SIGQUEUE_PREALLOC; @ kernel/signal.c:2298 @ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t if (gstop_done && ptrace_reparented(current)) do_notify_parent_cldstop(current, false, why); - /* - * Don't want to allow preemption here, because - * sys_ptrace() needs this task to be inactive. - * - * XXX: implement read_unlock_no_resched(). - */ - preempt_disable(); read_unlock(&tasklist_lock); cgroup_enter_frozen(); - preempt_enable_no_resched(); freezable_schedule(); cgroup_leave_frozen(true); } else { @ kernel/softirq.c:28 @ #include <linux/smpboot.h> #include <linux/tick.h> #include <linux/irq.h> +#ifdef CONFIG_PREEMPT_RT +#include <linux/locallock.h> +#endif #define CREATE_TRACE_POINTS #include <trace/events/irq.h> @ kernel/softirq.c:108 @ static bool ksoftirqd_running(unsigned long pending) * softirq and whether we just have bh disabled. */ +#ifdef CONFIG_PREEMPT_RT +static DEFINE_LOCAL_IRQ_LOCK(bh_lock); +static DEFINE_PER_CPU(long, softirq_counter); + +void __local_bh_disable_ip(unsigned long ip, unsigned int cnt) +{ + unsigned long __maybe_unused flags; + long soft_cnt; + + WARN_ON_ONCE(in_irq()); + if (!in_atomic()) { + local_lock(bh_lock); + rcu_read_lock(); + } + soft_cnt = this_cpu_inc_return(softirq_counter); + WARN_ON_ONCE(soft_cnt == 0); + current->softirq_count += SOFTIRQ_DISABLE_OFFSET; + +#ifdef CONFIG_TRACE_IRQFLAGS + local_irq_save(flags); + if (soft_cnt == 1) + trace_softirqs_off(ip); + local_irq_restore(flags); +#endif +} +EXPORT_SYMBOL(__local_bh_disable_ip); + +static void local_bh_disable_rt(void) +{ + local_bh_disable(); +} + +void _local_bh_enable(void) +{ + unsigned long __maybe_unused flags; + long soft_cnt; + + soft_cnt = this_cpu_dec_return(softirq_counter); + WARN_ON_ONCE(soft_cnt < 0); + +#ifdef CONFIG_TRACE_IRQFLAGS + local_irq_save(flags); + if (soft_cnt == 0) + trace_softirqs_on(_RET_IP_); + local_irq_restore(flags); +#endif + + current->softirq_count -= SOFTIRQ_DISABLE_OFFSET; + if (!in_atomic()) { + rcu_read_unlock(); + local_unlock(bh_lock); + } +} + +void _local_bh_enable_rt(void) +{ + _local_bh_enable(); +} + +void __local_bh_enable_ip(unsigned long ip, unsigned int cnt) +{ + u32 pending; + long count; + + WARN_ON_ONCE(in_irq()); + lockdep_assert_irqs_enabled(); + + local_irq_disable(); + count = this_cpu_read(softirq_counter); + + if (unlikely(count == 1)) { + pending = local_softirq_pending(); + if (pending && !ksoftirqd_running(pending)) { + if (!in_atomic()) + __do_softirq(); + else + wakeup_softirqd(); + } + trace_softirqs_on(ip); + } + count = this_cpu_dec_return(softirq_counter); + WARN_ON_ONCE(count < 0); + local_irq_enable(); + + if (!in_atomic()) { + rcu_read_unlock(); + local_unlock(bh_lock); + } + + current->softirq_count -= SOFTIRQ_DISABLE_OFFSET; + preempt_check_resched(); +} +EXPORT_SYMBOL(__local_bh_enable_ip); + +#else +static void local_bh_disable_rt(void) { } +static void _local_bh_enable_rt(void) { } + /* * This one is for softirq.c-internal use, * where hardirqs are disabled legitimately: @ kernel/softirq.c:300 @ void __local_bh_enable_ip(unsigned long ip, unsigned int cnt) preempt_check_resched(); } EXPORT_SYMBOL(__local_bh_enable_ip); +#endif /* * We restart softirq processing for at most MAX_SOFTIRQ_RESTART times, @ kernel/softirq.c:371 @ asmlinkage __visible void __softirq_entry __do_softirq(void) pending = local_softirq_pending(); account_irq_enter_time(current); +#ifdef CONFIG_PREEMPT_RT + current->softirq_count |= SOFTIRQ_OFFSET; +#else __local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET); +#endif in_hardirq = lockdep_softirq_start(); restart: @ kernel/softirq.c:409 @ asmlinkage __visible void __softirq_entry __do_softirq(void) h++; pending >>= softirq_bit; } - +#ifndef CONFIG_PREEMPT_RT if (__this_cpu_read(ksoftirqd) == current) rcu_softirq_qs(); +#endif local_irq_disable(); pending = local_softirq_pending(); @ kernel/softirq.c:426 @ asmlinkage __visible void __softirq_entry __do_softirq(void) lockdep_softirq_end(in_hardirq); account_irq_exit_time(current); +#ifdef CONFIG_PREEMPT_RT + current->softirq_count &= ~SOFTIRQ_OFFSET; +#else __local_bh_enable(SOFTIRQ_OFFSET); +#endif WARN_ON_ONCE(in_interrupt()); current_restore_flags(old_flags, PF_MEMALLOC); } +#ifndef CONFIG_PREEMPT_RT asmlinkage __visible void do_softirq(void) { __u32 pending; @ kernel/softirq.c:453 @ asmlinkage __visible void do_softirq(void) local_irq_restore(flags); } +#endif /* * Enter an interrupt context. @ kernel/softirq.c:474 @ void irq_enter(void) __irq_enter(); } +#ifdef CONFIG_PREEMPT_RT + +static inline void invoke_softirq(void) +{ + if (this_cpu_read(softirq_counter) == 0) + wakeup_softirqd(); +} + +#else + static inline void invoke_softirq(void) { if (ksoftirqd_running(local_softirq_pending())) @ kernel/softirq.c:509 @ static inline void invoke_softirq(void) wakeup_softirqd(); } } +#endif static inline void tick_irq_exit(void) { @ kernel/softirq.c:547 @ void irq_exit(void) /* * This function must run with irqs disabled! */ +#ifdef CONFIG_PREEMPT_RT +void raise_softirq_irqoff(unsigned int nr) +{ + __raise_softirq_irqoff(nr); + + /* + * If we're in an hard interrupt we let irq return code deal + * with the wakeup of ksoftirqd. + */ + if (in_irq()) + return; + /* + * If were are not in BH-disabled section then we have to wake + * ksoftirqd. + */ + if (this_cpu_read(softirq_counter) == 0) + wakeup_softirqd(); +} + +#else + inline void raise_softirq_irqoff(unsigned int nr) { __raise_softirq_irqoff(nr); @ kernel/softirq.c:585 @ inline void raise_softirq_irqoff(unsigned int nr) wakeup_softirqd(); } +#endif + void raise_softirq(unsigned int nr) { unsigned long flags; @ kernel/softirq.c:714 @ void tasklet_kill(struct tasklet_struct *t) while (test_and_set_bit(TASKLET_STATE_SCHED, &t->state)) { do { - yield(); + local_bh_disable(); + local_bh_enable(); } while (test_bit(TASKLET_STATE_SCHED, &t->state)); } tasklet_unlock_wait(t); @ kernel/softirq.c:745 @ static int ksoftirqd_should_run(unsigned int cpu) static void run_ksoftirqd(unsigned int cpu) { + local_bh_disable_rt(); local_irq_disable(); if (local_softirq_pending()) { /* @ kernel/softirq.c:754 @ static void run_ksoftirqd(unsigned int cpu) */ __do_softirq(); local_irq_enable(); + _local_bh_enable_rt(); cond_resched(); return; } local_irq_enable(); + _local_bh_enable_rt(); } #ifdef CONFIG_HOTPLUG_CPU @ kernel/softirq.c:833 @ static struct smp_hotplug_thread softirq_threads = { static __init int spawn_ksoftirqd(void) { +#ifdef CONFIG_PREEMPT_RT + int cpu; + + for_each_possible_cpu(cpu) + lockdep_set_novalidate_class(per_cpu_ptr(&bh_lock.lock, cpu)); +#endif + cpuhp_setup_state_nocalls(CPUHP_SOFTIRQ_DEAD, "softirq:dead", NULL, takeover_tasklets); BUG_ON(smpboot_register_percpu_thread(&softirq_threads)); @ kernel/softirq.c:848 @ static __init int spawn_ksoftirqd(void) } early_initcall(spawn_ksoftirqd); +#ifdef CONFIG_PREEMPT_RT + +/* + * On preempt-rt a softirq running context might be blocked on a + * lock. There might be no other runnable task on this CPU because the + * lock owner runs on some other CPU. So we have to go into idle with + * the pending bit set. Therefor we need to check this otherwise we + * warn about false positives which confuses users and defeats the + * whole purpose of this test. + * + * This code is called with interrupts disabled. + */ +void softirq_check_pending_idle(void) +{ + struct task_struct *tsk = __this_cpu_read(ksoftirqd); + static int rate_limit; + bool okay = false; + u32 warnpending; + + if (rate_limit >= 10) + return; + + warnpending = local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK; + if (!warnpending) + return; + + if (!tsk) + return; + /* + * If ksoftirqd is blocked on a lock then we may go idle with pending + * softirq. + */ + raw_spin_lock(&tsk->pi_lock); + if (tsk->pi_blocked_on || tsk->state == TASK_RUNNING || + (tsk->state == TASK_UNINTERRUPTIBLE && tsk->sleeping_lock)) { + okay = true; + } + raw_spin_unlock(&tsk->pi_lock); + if (okay) + return; + /* + * The softirq lock is held in non-atomic context and the owner is + * blocking on a lock. It will schedule softirqs once the counter goes + * back to zero. + */ + if (this_cpu_read(softirq_counter) > 0) + return; + + printk(KERN_ERR "NOHZ: local_softirq_pending %02x\n", + warnpending); + rate_limit++; +} + +#else + +void softirq_check_pending_idle(void) +{ + static int ratelimit; + + if (ratelimit < 10 && + (local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK)) { + pr_warn("NOHZ: local_softirq_pending %02x\n", + (unsigned int) local_softirq_pending()); + ratelimit++; + } +} + +#endif + /* * [ These __weak aliases are kept in a separate compilation unit, so that * GCC does not inline them incorrectly. ] @ kernel/stop_machine.c:89 @ static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work) enabled = stopper->enabled; if (enabled) __cpu_stop_queue_work(stopper, work, &wakeq); - else if (work->done) - cpu_stop_signal_done(work->done); + else { + work->disabled = true; + if (work->done) + cpu_stop_signal_done(work->done); + } raw_spin_unlock_irqrestore(&stopper->lock, flags); wake_up_q(&wakeq); @ kernel/sysctl.c:215 @ static int proc_do_cad_pid(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); static int proc_taint(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); +#ifdef CONFIG_COMPACTION +static int proc_dointvec_minmax_warn_RT_change(struct ctl_table *table, + int write, void __user *buffer, + size_t *lenp, loff_t *ppos); +#endif #endif #ifdef CONFIG_PRINTK @ kernel/sysctl.c:1492 @ static struct ctl_table vm_table[] = { .data = &sysctl_compact_unevictable_allowed, .maxlen = sizeof(int), .mode = 0644, - .proc_handler = proc_dointvec, + .proc_handler = proc_dointvec_minmax_warn_RT_change, .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_ONE, }, @ kernel/sysctl.c:2580 @ int proc_dointvec(struct ctl_table *table, int write, return do_proc_dointvec(table, write, buffer, lenp, ppos, NULL, NULL); } +#ifdef CONFIG_COMPACTION +static int proc_dointvec_minmax_warn_RT_change(struct ctl_table *table, + int write, void __user *buffer, + size_t *lenp, loff_t *ppos) +{ + int ret, old; + + if (!IS_ENABLED(CONFIG_PREEMPT_RT) || !write) + return proc_dointvec_minmax(table, write, buffer, lenp, ppos); + + old = *(int *)table->data; + ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos); + if (ret) + return ret; + if (old != *(int *)table->data) + pr_warn_once("sysctl attribute %s changed by %s[%d]\n", + table->procname, current->comm, + task_pid_nr(current)); + return ret; +} +#endif + /** * proc_douintvec - read a vector of unsigned integers * @table: the sysctl table @ kernel/time/hrtimer.c:138 @ static const int hrtimer_clock_to_base_table[MAX_CLOCKS] = { * timer->base->cpu_base */ static struct hrtimer_cpu_base migration_cpu_base = { - .clock_base = { { .cpu_base = &migration_cpu_base, }, }, + .clock_base = { { + .cpu_base = &migration_cpu_base, + .seq = SEQCNT_RAW_SPINLOCK_ZERO(migration_cpu_base.seq, + &migration_cpu_base.lock), + }, }, }; #define migration_base migration_cpu_base.clock_base[0] @ kernel/time/hrtimer.c:1826 @ static void __hrtimer_init_sleeper(struct hrtimer_sleeper *sl, * expiry. */ if (IS_ENABLED(CONFIG_PREEMPT_RT)) { - if (task_is_realtime(current) && !(mode & HRTIMER_MODE_SOFT)) + if ((task_is_realtime(current) && !(mode & HRTIMER_MODE_SOFT)) || system_state != SYSTEM_RUNNING) mode |= HRTIMER_MODE_HARD; } @ kernel/time/hrtimer.c:1991 @ SYSCALL_DEFINE2(nanosleep_time32, struct old_timespec32 __user *, rqtp, } #endif +#ifdef CONFIG_PREEMPT_RT +/* + * Sleep for 1 ms in hope whoever holds what we want will let it go. + */ +void cpu_chill(void) +{ + unsigned int freeze_flag = current->flags & PF_NOFREEZE; + struct task_struct *self = current; + ktime_t chill_time; + + raw_spin_lock_irq(&self->pi_lock); + self->saved_state = self->state; + __set_current_state_no_track(TASK_UNINTERRUPTIBLE); + raw_spin_unlock_irq(&self->pi_lock); + + chill_time = ktime_set(0, NSEC_PER_MSEC); + + current->flags |= PF_NOFREEZE; + sleeping_lock_inc(); + schedule_hrtimeout(&chill_time, HRTIMER_MODE_REL_HARD); + sleeping_lock_dec(); + if (!freeze_flag) + current->flags &= ~PF_NOFREEZE; + + raw_spin_lock_irq(&self->pi_lock); + __set_current_state_no_track(self->saved_state); + self->saved_state = TASK_RUNNING; + raw_spin_unlock_irq(&self->pi_lock); +} +EXPORT_SYMBOL(cpu_chill); +#endif + /* * Functions related to boot-time initialization: */ @ kernel/time/hrtimer.c:2032 @ int hrtimers_prepare_cpu(unsigned int cpu) int i; for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) { - cpu_base->clock_base[i].cpu_base = cpu_base; - timerqueue_init_head(&cpu_base->clock_base[i].active); + struct hrtimer_clock_base *clock_b = &cpu_base->clock_base[i]; + + clock_b->cpu_base = cpu_base; + seqcount_raw_spinlock_init(&clock_b->seq, &cpu_base->lock); + timerqueue_init_head(&clock_b->active); } cpu_base->cpu = cpu; @ kernel/time/jiffies.c:61 @ static struct clocksource clocksource_jiffies = { .max_cycles = 10, }; -__cacheline_aligned_in_smp DEFINE_SEQLOCK(jiffies_lock); +__cacheline_aligned_in_smp DEFINE_RAW_SPINLOCK(jiffies_lock); +__cacheline_aligned_in_smp seqcount_t jiffies_seq; #if (BITS_PER_LONG < 64) u64 get_jiffies_64(void) @ kernel/time/jiffies.c:71 @ u64 get_jiffies_64(void) u64 ret; do { - seq = read_seqbegin(&jiffies_lock); + seq = read_seqcount_begin(&jiffies_seq); ret = jiffies_64; - } while (read_seqretry(&jiffies_lock, seq)); + } while (read_seqcount_retry(&jiffies_seq, seq)); return ret; } EXPORT_SYMBOL(get_jiffies_64); @ kernel/time/posix-cpu-timers.c:6 @ * Implement CPU time clocks for the POSIX clock interface. */ +#include <uapi/linux/sched/types.h> #include <linux/sched/signal.h> #include <linux/sched/cputime.h> +#include <linux/sched/rt.h> #include <linux/posix-timers.h> #include <linux/errno.h> #include <linux/math64.h> @ kernel/time/posix-cpu-timers.c:20 @ #include <linux/workqueue.h> #include <linux/compat.h> #include <linux/sched/deadline.h> +#include <linux/smpboot.h> #include "posix-timers.h" @ kernel/time/posix-cpu-timers.c:33 @ void posix_cputimers_group_init(struct posix_cputimers *pct, u64 cpu_limit) pct->bases[CPUCLOCK_PROF].nextevt = cpu_limit * NSEC_PER_SEC; pct->timers_active = true; } +#ifdef CONFIG_PREEMPT_RT + pct->posix_timer_list = NULL; +#endif } /* @ kernel/time/posix-cpu-timers.c:449 @ static int posix_cpu_timer_del(struct k_itimer *timer) return ret; } +static DEFINE_PER_CPU(spinlock_t, cpu_timer_expiry_lock) = __SPIN_LOCK_UNLOCKED(cpu_timer_expiry_lock); + +static void posix_cpu_wait_running(struct k_itimer *timer) +{ + int cpu = timer->it.cpu.firing_cpu; + + if (cpu >= 0) { + spinlock_t *expiry_lock = per_cpu_ptr(&cpu_timer_expiry_lock, cpu); + + spin_lock_irq(expiry_lock); + spin_unlock_irq(expiry_lock); + } +} + static void cleanup_timerqueue(struct timerqueue_head *head) { struct timerqueue_node *node; @ kernel/time/posix-cpu-timers.c:800 @ static u64 collect_timerqueue(struct timerqueue_head *head, return expires; ctmr->firing = 1; + ctmr->firing_cpu = smp_processor_id(); cpu_timer_dequeue(ctmr); list_add_tail(&ctmr->elist, firing); } @ kernel/time/posix-cpu-timers.c:828 @ static inline void check_dl_overrun(struct task_struct *tsk) } } -static bool check_rlimit(u64 time, u64 limit, int signo, bool rt, bool hard) +static bool check_rlimit(struct task_struct *tsk, u64 time, u64 limit, + int signo, bool rt, bool hard) { if (time < limit) return false; @ kernel/time/posix-cpu-timers.c:837 @ static bool check_rlimit(u64 time, u64 limit, int signo, bool rt, bool hard) if (print_fatal_signals) { pr_info("%s Watchdog Timeout (%s): %s[%d]\n", rt ? "RT" : "CPU", hard ? "hard" : "soft", - current->comm, task_pid_nr(current)); + tsk->comm, task_pid_nr(tsk)); } - __group_send_sig_info(signo, SEND_SIG_PRIV, current); + __group_send_sig_info(signo, SEND_SIG_PRIV, tsk); return true; } @ kernel/time/posix-cpu-timers.c:875 @ static void check_thread_timers(struct task_struct *tsk, /* At the hard limit, send SIGKILL. No further action. */ if (hard != RLIM_INFINITY && - check_rlimit(rttime, hard, SIGKILL, true, true)) + check_rlimit(tsk, rttime, hard, SIGKILL, true, true)) return; /* At the soft limit, send a SIGXCPU every second */ - if (check_rlimit(rttime, soft, SIGXCPU, true, false)) { + if (check_rlimit(tsk, rttime, soft, SIGXCPU, true, false)) { soft += USEC_PER_SEC; tsk->signal->rlim[RLIMIT_RTTIME].rlim_cur = soft; } @ kernel/time/posix-cpu-timers.c:974 @ static void check_process_timers(struct task_struct *tsk, /* At the hard limit, send SIGKILL. No further action. */ if (hard != RLIM_INFINITY && - check_rlimit(ptime, hardns, SIGKILL, false, true)) + check_rlimit(tsk, ptime, hardns, SIGKILL, false, true)) return; /* At the soft limit, send a SIGXCPU every second */ - if (check_rlimit(ptime, softns, SIGXCPU, false, false)) { + if (check_rlimit(tsk, ptime, softns, SIGXCPU, false, false)) { sig->rlim[RLIMIT_CPU].rlim_cur = soft + 1; softns += NSEC_PER_SEC; } @ kernel/time/posix-cpu-timers.c:1135 @ static inline bool fastpath_timer_check(struct task_struct *tsk) * already updated our counts. We need to check if any timers fire now. * Interrupts are disabled. */ -void run_posix_cpu_timers(void) +static void __run_posix_cpu_timers(struct task_struct *tsk) { - struct task_struct *tsk = current; struct k_itimer *timer, *next; unsigned long flags; + spinlock_t *expiry_lock; LIST_HEAD(firing); - lockdep_assert_irqs_disabled(); - /* * The fast path checks that there are no expired thread or thread * group timers. If that's so, just return. @ kernel/time/posix-cpu-timers.c:1149 @ void run_posix_cpu_timers(void) if (!fastpath_timer_check(tsk)) return; - if (!lock_task_sighand(tsk, &flags)) + expiry_lock = this_cpu_ptr(&cpu_timer_expiry_lock); + spin_lock(expiry_lock); + + if (!lock_task_sighand(tsk, &flags)) { + spin_unlock(expiry_lock); return; + } /* * Here we take off tsk->signal->cpu_timers[N] and * tsk->cpu_timers[N] all the timers that are firing, and @ kernel/time/posix-cpu-timers.c:1188 @ void run_posix_cpu_timers(void) list_del_init(&timer->it.cpu.elist); cpu_firing = timer->it.cpu.firing; timer->it.cpu.firing = 0; + timer->it.cpu.firing_cpu = -1; /* * The firing flag is -1 if we collided with a reset * of the timer, which already reported this @ kernel/time/posix-cpu-timers.c:1198 @ void run_posix_cpu_timers(void) cpu_timer_fire(timer); spin_unlock(&timer->it_lock); } + spin_unlock(expiry_lock); } +#ifdef CONFIG_PREEMPT_RT +#include <linux/kthread.h> +#include <linux/cpu.h> +DEFINE_PER_CPU(struct task_struct *, posix_timer_task); +DEFINE_PER_CPU(struct task_struct *, posix_timer_tasklist); +DEFINE_PER_CPU(bool, posix_timer_th_active); + +static void posix_cpu_kthread_fn(unsigned int cpu) +{ + struct task_struct *tsk = NULL; + struct task_struct *next = NULL; + + BUG_ON(per_cpu(posix_timer_task, cpu) != current); + + /* grab task list */ + raw_local_irq_disable(); + tsk = per_cpu(posix_timer_tasklist, cpu); + per_cpu(posix_timer_tasklist, cpu) = NULL; + raw_local_irq_enable(); + + /* its possible the list is empty, just return */ + if (!tsk) + return; + + /* Process task list */ + while (1) { + /* save next */ + next = tsk->posix_cputimers.posix_timer_list; + + /* run the task timers, clear its ptr and + * unreference it + */ + __run_posix_cpu_timers(tsk); + tsk->posix_cputimers.posix_timer_list = NULL; + put_task_struct(tsk); + + /* check if this is the last on the list */ + if (next == tsk) + break; + tsk = next; + } +} + +static inline int __fastpath_timer_check(struct task_struct *tsk) +{ + /* tsk == current, ensure it is safe to use ->signal/sighand */ + if (unlikely(tsk->exit_state)) + return 0; + + if (!expiry_cache_is_inactive(&tsk->posix_cputimers)) + return 1; + + if (!expiry_cache_is_inactive(&tsk->signal->posix_cputimers)) + return 1; + + return 0; +} + +void run_posix_cpu_timers(void) +{ + unsigned int cpu = smp_processor_id(); + struct task_struct *tsk = current; + struct task_struct *tasklist; + + BUG_ON(!irqs_disabled()); + + if (per_cpu(posix_timer_th_active, cpu) != true) + return; + + /* get per-cpu references */ + tasklist = per_cpu(posix_timer_tasklist, cpu); + + /* check to see if we're already queued */ + if (!tsk->posix_cputimers.posix_timer_list && __fastpath_timer_check(tsk)) { + get_task_struct(tsk); + if (tasklist) { + tsk->posix_cputimers.posix_timer_list = tasklist; + } else { + /* + * The list is terminated by a self-pointing + * task_struct + */ + tsk->posix_cputimers.posix_timer_list = tsk; + } + per_cpu(posix_timer_tasklist, cpu) = tsk; + + wake_up_process(per_cpu(posix_timer_task, cpu)); + } +} + +static int posix_cpu_kthread_should_run(unsigned int cpu) +{ + return __this_cpu_read(posix_timer_tasklist) != NULL; +} + +static void posix_cpu_kthread_park(unsigned int cpu) +{ + this_cpu_write(posix_timer_th_active, false); +} + +static void posix_cpu_kthread_unpark(unsigned int cpu) +{ + this_cpu_write(posix_timer_th_active, true); +} + +static void posix_cpu_kthread_setup(unsigned int cpu) +{ + struct sched_param sp; + + sp.sched_priority = MAX_RT_PRIO - 1; + sched_setscheduler_nocheck(current, SCHED_FIFO, &sp); + posix_cpu_kthread_unpark(cpu); +} + +static struct smp_hotplug_thread posix_cpu_thread = { + .store = &posix_timer_task, + .thread_should_run = posix_cpu_kthread_should_run, + .thread_fn = posix_cpu_kthread_fn, + .thread_comm = "posixcputmr/%u", + .setup = posix_cpu_kthread_setup, + .park = posix_cpu_kthread_park, + .unpark = posix_cpu_kthread_unpark, +}; + +static int __init posix_cpu_thread_init(void) +{ + /* Start one for boot CPU. */ + unsigned long cpu; + int ret; + + /* init the per-cpu posix_timer_tasklets */ + for_each_possible_cpu(cpu) + per_cpu(posix_timer_tasklist, cpu) = NULL; + + ret = smpboot_register_percpu_thread(&posix_cpu_thread); + WARN_ON(ret); + + return 0; +} +early_initcall(posix_cpu_thread_init); + +#else /* CONFIG_PREEMPT_RT */ +void run_posix_cpu_timers(void) +{ + lockdep_assert_irqs_disabled(); + __run_posix_cpu_timers(current); +} +#endif /* CONFIG_PREEMPT_RT */ + /* * Set one of the process-wide special case CPU timers or RLIMIT_CPU. * The tsk->sighand->siglock must be held by the caller. @ kernel/time/posix-cpu-timers.c:1461 @ static int do_cpu_nanosleep(const clockid_t which_clock, int flags, spin_unlock_irq(&timer.it_lock); while (error == TIMER_RETRY) { + + posix_cpu_wait_running(&timer); /* * We need to handle case when timer was or is in the * middle of firing. In other cases we already freed @ kernel/time/posix-cpu-timers.c:1581 @ const struct k_clock clock_posix_cpu = { .timer_del = posix_cpu_timer_del, .timer_get = posix_cpu_timer_get, .timer_rearm = posix_cpu_timer_rearm, + .timer_wait_running = posix_cpu_wait_running, }; const struct k_clock clock_process = { @ kernel/time/tick-common.c:87 @ int tick_is_oneshot_available(void) static void tick_periodic(int cpu) { if (tick_do_timer_cpu == cpu) { - write_seqlock(&jiffies_lock); + raw_spin_lock(&jiffies_lock); + write_seqcount_begin(&jiffies_seq); /* Keep track of the next tick event */ tick_next_period = ktime_add(tick_next_period, tick_period); do_timer(1); - write_sequnlock(&jiffies_lock); + write_seqcount_end(&jiffies_seq); + raw_spin_unlock(&jiffies_lock); update_wall_time(); } @ kernel/time/tick-common.c:167 @ void tick_setup_periodic(struct clock_event_device *dev, int broadcast) ktime_t next; do { - seq = read_seqbegin(&jiffies_lock); + seq = read_seqcount_begin(&jiffies_seq); next = tick_next_period; - } while (read_seqretry(&jiffies_lock, seq)); + } while (read_seqcount_retry(&jiffies_seq, seq)); clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT); @ kernel/time/tick-sched.c:68 @ static void tick_do_update_jiffies64(ktime_t now) return; /* Reevaluate with jiffies_lock held */ - write_seqlock(&jiffies_lock); + raw_spin_lock(&jiffies_lock); + write_seqcount_begin(&jiffies_seq); delta = ktime_sub(now, last_jiffies_update); if (delta >= tick_period) { @ kernel/time/tick-sched.c:95 @ static void tick_do_update_jiffies64(ktime_t now) /* Keep the tick_next_period variable up to date */ tick_next_period = ktime_add(last_jiffies_update, tick_period); } else { - write_sequnlock(&jiffies_lock); + write_seqcount_end(&jiffies_seq); + raw_spin_unlock(&jiffies_lock); return; } - write_sequnlock(&jiffies_lock); + write_seqcount_end(&jiffies_seq); + raw_spin_unlock(&jiffies_lock); update_wall_time(); } @ kernel/time/tick-sched.c:111 @ static ktime_t tick_init_jiffy_update(void) { ktime_t period; - write_seqlock(&jiffies_lock); + raw_spin_lock(&jiffies_lock); + write_seqcount_begin(&jiffies_seq); /* Did we start the jiffies update yet ? */ if (last_jiffies_update == 0) last_jiffies_update = tick_next_period; period = last_jiffies_update; - write_sequnlock(&jiffies_lock); + write_seqcount_end(&jiffies_seq); + raw_spin_unlock(&jiffies_lock); return period; } @ kernel/time/tick-sched.c:248 @ static void nohz_full_kick_func(struct irq_work *work) static DEFINE_PER_CPU(struct irq_work, nohz_full_kick_work) = { .func = nohz_full_kick_func, + .flags = ATOMIC_INIT(IRQ_WORK_HARD_IRQ), }; /* @ kernel/time/tick-sched.c:685 @ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu) /* Read jiffies and the time when jiffies were updated last */ do { - seq = read_seqbegin(&jiffies_lock); + seq = read_seqcount_begin(&jiffies_seq); basemono = last_jiffies_update; basejiff = jiffies; - } while (read_seqretry(&jiffies_lock, seq)); + } while (read_seqcount_retry(&jiffies_seq, seq)); ts->last_jiffies = basejiff; ts->timer_expires_base = basemono; @ kernel/time/tick-sched.c:918 @ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts) return false; if (unlikely(local_softirq_pending())) { - static int ratelimit; - - if (ratelimit < 10 && - (local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK)) { - pr_warn("NOHZ: local_softirq_pending %02x\n", - (unsigned int) local_softirq_pending()); - ratelimit++; - } + softirq_check_pending_idle(); return false; } @ kernel/time/timekeeping.c:42 @ enum timekeeping_adv_mode { TK_ADV_FREQ }; +static DEFINE_RAW_SPINLOCK(timekeeper_lock); + /* * The most important data for readout fits into a single 64 byte * cache line. */ static struct { - seqcount_t seq; + seqcount_raw_spinlock_t seq; struct timekeeper timekeeper; } tk_core ____cacheline_aligned = { - .seq = SEQCNT_ZERO(tk_core.seq), + .seq = SEQCNT_RAW_SPINLOCK_ZERO(tk_core.seq, &timekeeper_lock), }; -static DEFINE_RAW_SPINLOCK(timekeeper_lock); static struct timekeeper shadow_timekeeper; /** @ kernel/time/timekeeping.c:67 @ static struct timekeeper shadow_timekeeper; * See @update_fast_timekeeper() below. */ struct tk_fast { - seqcount_t seq; + seqcount_raw_spinlock_t seq; struct tk_read_base base[2]; }; @ kernel/time/timekeeping.c:84 @ static struct clocksource dummy_clock = { }; static struct tk_fast tk_fast_mono ____cacheline_aligned = { + .seq = SEQCNT_RAW_SPINLOCK_ZERO(tk_fast_mono.seq, &timekeeper_lock), .base[0] = { .clock = &dummy_clock, }, .base[1] = { .clock = &dummy_clock, }, }; static struct tk_fast tk_fast_raw ____cacheline_aligned = { + .seq = SEQCNT_RAW_SPINLOCK_ZERO(tk_fast_raw.seq, &timekeeper_lock), .base[0] = { .clock = &dummy_clock, }, .base[1] = { .clock = &dummy_clock, }, }; @ kernel/time/timekeeping.c:163 @ static inline void tk_update_sleep_time(struct timekeeper *tk, ktime_t delta) * tk_clock_read - atomic clocksource read() helper * * This helper is necessary to use in the read paths because, while the - * seqlock ensures we don't return a bad value while structures are updated, + * seqcount ensures we don't return a bad value while structures are updated, * it doesn't protect from potential crashes. There is the possibility that * the tkr's clocksource may change between the read reference, and the * clock reference passed to the read function. This can cause crashes if @ kernel/time/timekeeping.c:228 @ static inline u64 timekeeping_get_delta(const struct tk_read_base *tkr) unsigned int seq; /* - * Since we're called holding a seqlock, the data may shift + * Since we're called holding a seqcount, the data may shift * under us while we're doing the calculation. This can cause * false positives, since we'd note a problem but throw the - * results away. So nest another seqlock here to atomically + * results away. So nest another seqcount here to atomically * grab the points we are checking with. */ do { @ kernel/time/timekeeping.c:492 @ EXPORT_SYMBOL_GPL(ktime_get_raw_fast_ns); * * To keep it NMI safe since we're accessing from tracing, we're not using a * separate timekeeper with updates to monotonic clock and boot offset - * protected with seqlocks. This has the following minor side effects: + * protected with seqcounts. This has the following minor side effects: * * (1) Its possible that a timestamp be taken after the boot offset is updated * but before the timekeeper is updated. If this happens, the new boot offset @ kernel/time/timekeeping.c:2403 @ EXPORT_SYMBOL(hardpps); */ void xtime_update(unsigned long ticks) { - write_seqlock(&jiffies_lock); + raw_spin_lock(&jiffies_lock); + write_seqcount_begin(&jiffies_seq); do_timer(ticks); - write_sequnlock(&jiffies_lock); + write_seqcount_end(&jiffies_seq); + raw_spin_unlock(&jiffies_lock); update_wall_time(); } @ kernel/time/timekeeping.h:28 @ static inline void sched_clock_resume(void) { } extern void do_timer(unsigned long ticks); extern void update_wall_time(void); -extern seqlock_t jiffies_lock; +extern raw_spinlock_t jiffies_lock; +extern seqcount_t jiffies_seq; #define CS_NAME_LEN 32 @ kernel/time/timer.c:1786 @ static __latent_entropy void run_timer_softirq(struct softirq_action *h) { struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]); + irq_work_tick_soft(); + __run_timers(base); if (IS_ENABLED(CONFIG_NO_HZ_COMMON)) __run_timers(this_cpu_ptr(&timer_bases[BASE_DEF])); @ kernel/trace/bpf_trace.c:86 @ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx) if (in_nmi()) /* not supported yet */ return 1; - preempt_disable(); + cant_sleep(); if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) { /* @ kernel/trace/bpf_trace.c:118 @ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx) out: __this_cpu_dec(bpf_prog_active); - preempt_enable(); return ret; } -EXPORT_SYMBOL_GPL(trace_call_bpf); #ifdef CONFIG_BPF_KPROBE_OVERRIDE BPF_CALL_2(bpf_override_return, struct pt_regs *, regs, unsigned long, rc) @ kernel/trace/bpf_trace.c:1508 @ void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp) static __always_inline void __bpf_trace_run(struct bpf_prog *prog, u64 *args) { + cant_sleep(); rcu_read_lock(); - preempt_disable(); (void) BPF_PROG_RUN(prog, args); - preempt_enable(); rcu_read_unlock(); } @ kernel/trace/trace.c:2440 @ tracing_generic_entry_update(struct trace_entry *entry, unsigned short type, struct task_struct *tsk = current; entry->preempt_count = pc & 0xff; + entry->preempt_lazy_count = preempt_lazy_count(); entry->pid = (tsk) ? tsk->pid : 0; entry->type = type; entry->flags = @ kernel/trace/trace.c:2452 @ tracing_generic_entry_update(struct trace_entry *entry, unsigned short type, ((pc & NMI_MASK ) ? TRACE_FLAG_NMI : 0) | ((pc & HARDIRQ_MASK) ? TRACE_FLAG_HARDIRQ : 0) | ((pc & SOFTIRQ_OFFSET) ? TRACE_FLAG_SOFTIRQ : 0) | - (tif_need_resched() ? TRACE_FLAG_NEED_RESCHED : 0) | + (tif_need_resched_now() ? TRACE_FLAG_NEED_RESCHED : 0) | + (need_resched_lazy() ? TRACE_FLAG_NEED_RESCHED_LAZY : 0) | (test_preempt_need_resched() ? TRACE_FLAG_PREEMPT_RESCHED : 0); + + entry->migrate_disable = (tsk) ? __migrate_disabled(tsk) & 0xFF : 0; } EXPORT_SYMBOL_GPL(tracing_generic_entry_update); @ kernel/trace/trace.c:3700 @ static void print_lat_help_header(struct seq_file *m) seq_puts(m, "# _------=> CPU# \n" "# / _-----=> irqs-off \n" "# | / _----=> need-resched \n" - "# || / _---=> hardirq/softirq \n" - "# ||| / _--=> preempt-depth \n" - "# |||| / delay \n" - "# cmd pid ||||| time | caller \n" - "# \\ / ||||| \\ | / \n"); + "# || / _----=> need-resched_lazy\n" + "# ||| / _---=> hardirq/softirq \n" + "# |||| / _--=> preempt-depth \n" + "# ||||| / _-=> migrate-disable \n" + "# |||||| / delay \n" + "# cmd pid ||||||| time | caller \n" + "# \\ / |||||| \\ | / \n"); } static void print_event_info(struct array_buffer *buf, struct seq_file *m) @ kernel/trace/trace.c:3742 @ static void print_func_help_header_irq(struct array_buffer *buf, struct seq_file seq_printf(m, "# %.*s _-----=> irqs-off\n", prec, space); seq_printf(m, "# %.*s / _----=> need-resched\n", prec, space); - seq_printf(m, "# %.*s| / _---=> hardirq/softirq\n", prec, space); - seq_printf(m, "# %.*s|| / _--=> preempt-depth\n", prec, space); - seq_printf(m, "# %.*s||| / delay\n", prec, space); - seq_printf(m, "# TASK-PID %.*sCPU# |||| TIMESTAMP FUNCTION\n", prec, " TGID "); - seq_printf(m, "# | | %.*s | |||| | |\n", prec, " | "); + seq_printf(m, "# %.*s| / _----=> need-resched\n", prec, space); + seq_printf(m, "# %.*s|| / _---=> hardirq/softirq\n", prec, space); + seq_printf(m, "# %.*s||| / _--=> preempt-depth\n", prec, space); + seq_printf(m, "# %.*s||||/ delay\n", prec, space); + seq_printf(m, "# TASK-PID %.*sCPU# ||||| TIMESTAMP FUNCTION\n", prec, " TGID "); + seq_printf(m, "# | | %.*s | ||||| | |\n", prec, " | "); } void @ kernel/trace/trace.c:9165 @ void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) tracing_off(); local_irq_save(flags); - printk_nmi_direct_enter(); /* Simulate the iterator */ trace_init_global_iter(&iter); @ kernel/trace/trace.c:9241 @ void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) atomic_dec(&per_cpu_ptr(iter.array_buffer->data, cpu)->disabled); } atomic_dec(&dump_running); - printk_nmi_direct_exit(); local_irq_restore(flags); } EXPORT_SYMBOL_GPL(ftrace_dump); @ kernel/trace/trace.h:143 @ struct kretprobe_trace_entry_head { * NEED_RESCHED - reschedule is requested * HARDIRQ - inside an interrupt handler * SOFTIRQ - inside a softirq handler + * NEED_RESCHED_LAZY - lazy reschedule is requested */ enum trace_flag_type { TRACE_FLAG_IRQS_OFF = 0x01, @ kernel/trace/trace.h:153 @ enum trace_flag_type { TRACE_FLAG_SOFTIRQ = 0x10, TRACE_FLAG_PREEMPT_RESCHED = 0x20, TRACE_FLAG_NMI = 0x40, + TRACE_FLAG_NEED_RESCHED_LAZY = 0x80, }; #define TRACE_BUF_SIZE 1024 @ kernel/trace/trace_events.c:185 @ static int trace_define_common_fields(void) __common_field(unsigned char, flags); __common_field(unsigned char, preempt_count); __common_field(int, pid); + __common_field(unsigned char, migrate_disable); + __common_field(unsigned char, preempt_lazy_count); return ret; } @ kernel/trace/trace_output.c:444 @ int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry) { char hardsoft_irq; char need_resched; + char need_resched_lazy; char irqs_off; int hardirq; int softirq; @ kernel/trace/trace_output.c:475 @ int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry) break; } + need_resched_lazy = + (entry->flags & TRACE_FLAG_NEED_RESCHED_LAZY) ? 'L' : '.'; + hardsoft_irq = (nmi && hardirq) ? 'Z' : nmi ? 'z' : @ kernel/trace/trace_output.c:486 @ int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry) softirq ? 's' : '.' ; - trace_seq_printf(s, "%c%c%c", - irqs_off, need_resched, hardsoft_irq); + trace_seq_printf(s, "%c%c%c%c", + irqs_off, need_resched, need_resched_lazy, + hardsoft_irq); if (entry->preempt_count) trace_seq_printf(s, "%x", entry->preempt_count); else trace_seq_putc(s, '.'); + if (entry->preempt_lazy_count) + trace_seq_printf(s, "%x", entry->preempt_lazy_count); + else + trace_seq_putc(s, '.'); + + if (entry->migrate_disable) + trace_seq_printf(s, "%x", entry->migrate_disable); + else + trace_seq_putc(s, '.'); + return !trace_seq_has_overflowed(s); } @ kernel/trace/trace_uprobe.c:1336 @ static void __uprobe_perf_func(struct trace_uprobe *tu, int size, esize; int rctx; - if (bpf_prog_array_valid(call) && !trace_call_bpf(call, regs)) - return; + if (bpf_prog_array_valid(call)) { + u32 ret; + + preempt_disable(); + ret = trace_call_bpf(call, regs); + preempt_enable(); + if (!ret) + return; + } esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu)); @ kernel/workqueue.c:148 @ enum { /* struct worker is defined in workqueue_internal.h */ struct worker_pool { - spinlock_t lock; /* the pool lock */ + raw_spinlock_t lock; /* the pool lock */ int cpu; /* I: the associated cpu */ int node; /* I: the associated node ID */ int id; /* I: pool ID */ @ kernel/workqueue.c:303 @ static struct workqueue_attrs *wq_update_unbound_numa_attrs_buf; static DEFINE_MUTEX(wq_pool_mutex); /* protects pools and workqueues list */ static DEFINE_MUTEX(wq_pool_attach_mutex); /* protects worker attach/detach */ -static DEFINE_SPINLOCK(wq_mayday_lock); /* protects wq->maydays list */ -static DECLARE_WAIT_QUEUE_HEAD(wq_manager_wait); /* wait for manager to go away */ +static DEFINE_RAW_SPINLOCK(wq_mayday_lock); /* protects wq->maydays list */ +/* wait for manager to go away */ +static struct rcuwait manager_wait = __RCUWAIT_INITIALIZER(manager_wait); static LIST_HEAD(workqueues); /* PR: list of all workqueues */ static bool workqueue_freezing; /* PL: have wqs started freezing? */ @ kernel/workqueue.c:830 @ static struct worker *first_idle_worker(struct worker_pool *pool) * Wake up the first idle worker of @pool. * * CONTEXT: - * spin_lock_irq(pool->lock). + * raw_spin_lock_irq(pool->lock). */ static void wake_up_worker(struct worker_pool *pool) { @ kernel/workqueue.c:883 @ void wq_worker_sleeping(struct task_struct *task) return; worker->sleeping = 1; - spin_lock_irq(&pool->lock); + raw_spin_lock_irq(&pool->lock); /* * The counterpart of the following dec_and_test, implied mb, @ kernel/workqueue.c:902 @ void wq_worker_sleeping(struct task_struct *task) if (next) wake_up_process(next->task); } - spin_unlock_irq(&pool->lock); + raw_spin_unlock_irq(&pool->lock); } /** @ kernel/workqueue.c:913 @ void wq_worker_sleeping(struct task_struct *task) * the scheduler to get a worker's last known identity. * * CONTEXT: - * spin_lock_irq(rq->lock) + * raw_spin_lock_irq(rq->lock) * * This function is called during schedule() when a kworker is going * to sleep. It's used by psi to identify aggregation workers during @ kernel/workqueue.c:944 @ work_func_t wq_worker_last_func(struct task_struct *task) * Set @flags in @worker->flags and adjust nr_running accordingly. * * CONTEXT: - * spin_lock_irq(pool->lock) + * raw_spin_lock_irq(pool->lock) */ static inline void worker_set_flags(struct worker *worker, unsigned int flags) { @ kernel/workqueue.c:969 @ static inline void worker_set_flags(struct worker *worker, unsigned int flags) * Clear @flags in @worker->flags and adjust nr_running accordingly. * * CONTEXT: - * spin_lock_irq(pool->lock) + * raw_spin_lock_irq(pool->lock) */ static inline void worker_clr_flags(struct worker *worker, unsigned int flags) { @ kernel/workqueue.c:1017 @ static inline void worker_clr_flags(struct worker *worker, unsigned int flags) * actually occurs, it should be easy to locate the culprit work function. * * CONTEXT: - * spin_lock_irq(pool->lock). + * raw_spin_lock_irq(pool->lock). * * Return: * Pointer to worker which is executing @work if found, %NULL @ kernel/workqueue.c:1052 @ static struct worker *find_worker_executing_work(struct worker_pool *pool, * nested inside outer list_for_each_entry_safe(). * * CONTEXT: - * spin_lock_irq(pool->lock). + * raw_spin_lock_irq(pool->lock). */ static void move_linked_works(struct work_struct *work, struct list_head *head, struct work_struct **nextp) @ kernel/workqueue.c:1130 @ static void put_pwq_unlocked(struct pool_workqueue *pwq) * As both pwqs and pools are RCU protected, the * following lock operations are safe. */ - spin_lock_irq(&pwq->pool->lock); + raw_spin_lock_irq(&pwq->pool->lock); put_pwq(pwq); - spin_unlock_irq(&pwq->pool->lock); + raw_spin_unlock_irq(&pwq->pool->lock); } } @ kernel/workqueue.c:1165 @ static void pwq_activate_first_delayed(struct pool_workqueue *pwq) * decrement nr_in_flight of its pwq and handle workqueue flushing. * * CONTEXT: - * spin_lock_irq(pool->lock). + * raw_spin_lock_irq(pool->lock). */ static void pwq_dec_nr_in_flight(struct pool_workqueue *pwq, int color) { @ kernel/workqueue.c:1264 @ static int try_to_grab_pending(struct work_struct *work, bool is_dwork, if (!pool) goto fail; - spin_lock(&pool->lock); + raw_spin_lock(&pool->lock); /* * work->data is guaranteed to point to pwq only while the work * item is queued on pwq->wq, and both updating work->data to point @ kernel/workqueue.c:1293 @ static int try_to_grab_pending(struct work_struct *work, bool is_dwork, /* work->data points to pwq iff queued, point to pool */ set_work_pool_and_keep_pending(work, pool->id); - spin_unlock(&pool->lock); + raw_spin_unlock(&pool->lock); rcu_read_unlock(); return 1; } - spin_unlock(&pool->lock); + raw_spin_unlock(&pool->lock); fail: rcu_read_unlock(); local_irq_restore(*flags); @ kernel/workqueue.c:1318 @ static int try_to_grab_pending(struct work_struct *work, bool is_dwork, * work_struct flags. * * CONTEXT: - * spin_lock_irq(pool->lock). + * raw_spin_lock_irq(pool->lock). */ static void insert_work(struct pool_workqueue *pwq, struct work_struct *work, struct list_head *head, unsigned int extra_flags) @ kernel/workqueue.c:1435 @ static void __queue_work(int cpu, struct workqueue_struct *wq, if (last_pool && last_pool != pwq->pool) { struct worker *worker; - spin_lock(&last_pool->lock); + raw_spin_lock(&last_pool->lock); worker = find_worker_executing_work(last_pool, work); @ kernel/workqueue.c:1443 @ static void __queue_work(int cpu, struct workqueue_struct *wq, pwq = worker->current_pwq; } else { /* meh... not running there, queue here */ - spin_unlock(&last_pool->lock); - spin_lock(&pwq->pool->lock); + raw_spin_unlock(&last_pool->lock); + raw_spin_lock(&pwq->pool->lock); } } else { - spin_lock(&pwq->pool->lock); + raw_spin_lock(&pwq->pool->lock); } /* @ kernel/workqueue.c:1460 @ static void __queue_work(int cpu, struct workqueue_struct *wq, */ if (unlikely(!pwq->refcnt)) { if (wq->flags & WQ_UNBOUND) { - spin_unlock(&pwq->pool->lock); + raw_spin_unlock(&pwq->pool->lock); cpu_relax(); goto retry; } @ kernel/workqueue.c:1492 @ static void __queue_work(int cpu, struct workqueue_struct *wq, insert_work(pwq, work, worklist, work_flags); out: - spin_unlock(&pwq->pool->lock); + raw_spin_unlock(&pwq->pool->lock); rcu_read_unlock(); } @ kernel/workqueue.c:1761 @ EXPORT_SYMBOL(queue_rcu_work); * necessary. * * LOCKING: - * spin_lock_irq(pool->lock). + * raw_spin_lock_irq(pool->lock). */ static void worker_enter_idle(struct worker *worker) { @ kernel/workqueue.c:1801 @ static void worker_enter_idle(struct worker *worker) * @worker is leaving idle state. Update stats. * * LOCKING: - * spin_lock_irq(pool->lock). + * raw_spin_lock_irq(pool->lock). */ static void worker_leave_idle(struct worker *worker) { @ kernel/workqueue.c:1939 @ static struct worker *create_worker(struct worker_pool *pool) worker_attach_to_pool(worker, pool); /* start the newly created worker */ - spin_lock_irq(&pool->lock); + raw_spin_lock_irq(&pool->lock); worker->pool->nr_workers++; worker_enter_idle(worker); wake_up_process(worker->task); - spin_unlock_irq(&pool->lock); + raw_spin_unlock_irq(&pool->lock); return worker; @ kernel/workqueue.c:1962 @ static struct worker *create_worker(struct worker_pool *pool) * be idle. * * CONTEXT: - * spin_lock_irq(pool->lock). + * raw_spin_lock_irq(pool->lock). */ static void destroy_worker(struct worker *worker) { @ kernel/workqueue.c:1988 @ static void idle_worker_timeout(struct timer_list *t) { struct worker_pool *pool = from_timer(pool, t, idle_timer); - spin_lock_irq(&pool->lock); + raw_spin_lock_irq(&pool->lock); while (too_many_workers(pool)) { struct worker *worker; @ kernel/workqueue.c:2006 @ static void idle_worker_timeout(struct timer_list *t) destroy_worker(worker); } - spin_unlock_irq(&pool->lock); + raw_spin_unlock_irq(&pool->lock); } static void send_mayday(struct work_struct *work) @ kernel/workqueue.c:2037 @ static void pool_mayday_timeout(struct timer_list *t) struct worker_pool *pool = from_timer(pool, t, mayday_timer); struct work_struct *work; - spin_lock_irq(&pool->lock); - spin_lock(&wq_mayday_lock); /* for wq->maydays */ + raw_spin_lock_irq(&pool->lock); + raw_spin_lock(&wq_mayday_lock); /* for wq->maydays */ if (need_to_create_worker(pool)) { /* @ kernel/workqueue.c:2051 @ static void pool_mayday_timeout(struct timer_list *t) send_mayday(work); } - spin_unlock(&wq_mayday_lock); - spin_unlock_irq(&pool->lock); + raw_spin_unlock(&wq_mayday_lock); + raw_spin_unlock_irq(&pool->lock); mod_timer(&pool->mayday_timer, jiffies + MAYDAY_INTERVAL); } @ kernel/workqueue.c:2071 @ static void pool_mayday_timeout(struct timer_list *t) * may_start_working() %true. * * LOCKING: - * spin_lock_irq(pool->lock) which may be released and regrabbed + * raw_spin_lock_irq(pool->lock) which may be released and regrabbed * multiple times. Does GFP_KERNEL allocations. Called only from * manager. */ @ kernel/workqueue.c:2080 @ __releases(&pool->lock) __acquires(&pool->lock) { restart: - spin_unlock_irq(&pool->lock); + raw_spin_unlock_irq(&pool->lock); /* if we don't make progress in MAYDAY_INITIAL_TIMEOUT, call for help */ mod_timer(&pool->mayday_timer, jiffies + MAYDAY_INITIAL_TIMEOUT); @ kernel/workqueue.c:2096 @ __acquires(&pool->lock) } del_timer_sync(&pool->mayday_timer); - spin_lock_irq(&pool->lock); + raw_spin_lock_irq(&pool->lock); /* * This is necessary even after a new worker was just successfully * created as @pool->lock was dropped and the new worker might have @ kernel/workqueue.c:2119 @ __acquires(&pool->lock) * and may_start_working() is true. * * CONTEXT: - * spin_lock_irq(pool->lock) which may be released and regrabbed + * raw_spin_lock_irq(pool->lock) which may be released and regrabbed * multiple times. Does GFP_KERNEL allocations. * * Return: @ kernel/workqueue.c:2142 @ static bool manage_workers(struct worker *worker) pool->manager = NULL; pool->flags &= ~POOL_MANAGER_ACTIVE; - wake_up(&wq_manager_wait); + rcuwait_wake_up(&manager_wait); return true; } @ kernel/workqueue.c:2158 @ static bool manage_workers(struct worker *worker) * call this function to process a work. * * CONTEXT: - * spin_lock_irq(pool->lock) which is released and regrabbed. + * raw_spin_lock_irq(pool->lock) which is released and regrabbed. */ static void process_one_work(struct worker *worker, struct work_struct *work) __releases(&pool->lock) @ kernel/workqueue.c:2240 @ __acquires(&pool->lock) */ set_work_pool_and_clear_pending(work, pool->id); - spin_unlock_irq(&pool->lock); + raw_spin_unlock_irq(&pool->lock); lock_map_acquire(&pwq->wq->lockdep_map); lock_map_acquire(&lockdep_map); @ kernel/workqueue.c:2295 @ __acquires(&pool->lock) */ cond_resched(); - spin_lock_irq(&pool->lock); + raw_spin_lock_irq(&pool->lock); /* clear cpu intensive status */ if (unlikely(cpu_intensive)) @ kernel/workqueue.c:2321 @ __acquires(&pool->lock) * fetches a work from the top and executes it. * * CONTEXT: - * spin_lock_irq(pool->lock) which may be released and regrabbed + * raw_spin_lock_irq(pool->lock) which may be released and regrabbed * multiple times. */ static void process_scheduled_works(struct worker *worker) @ kernel/workqueue.c:2363 @ static int worker_thread(void *__worker) /* tell the scheduler that this is a workqueue worker */ set_pf_worker(true); woke_up: - spin_lock_irq(&pool->lock); + raw_spin_lock_irq(&pool->lock); /* am I supposed to die? */ if (unlikely(worker->flags & WORKER_DIE)) { - spin_unlock_irq(&pool->lock); + raw_spin_unlock_irq(&pool->lock); WARN_ON_ONCE(!list_empty(&worker->entry)); set_pf_worker(false); @ kernel/workqueue.c:2433 @ static int worker_thread(void *__worker) */ worker_enter_idle(worker); __set_current_state(TASK_IDLE); - spin_unlock_irq(&pool->lock); + raw_spin_unlock_irq(&pool->lock); schedule(); goto woke_up; } @ kernel/workqueue.c:2487 @ static int rescuer_thread(void *__rescuer) should_stop = kthread_should_stop(); /* see whether any pwq is asking for help */ - spin_lock_irq(&wq_mayday_lock); + raw_spin_lock_irq(&wq_mayday_lock); while (!list_empty(&wq->maydays)) { struct pool_workqueue *pwq = list_first_entry(&wq->maydays, @ kernel/workqueue.c:2499 @ static int rescuer_thread(void *__rescuer) __set_current_state(TASK_RUNNING); list_del_init(&pwq->mayday_node); - spin_unlock_irq(&wq_mayday_lock); + raw_spin_unlock_irq(&wq_mayday_lock); worker_attach_to_pool(rescuer, pool); - spin_lock_irq(&pool->lock); + raw_spin_lock_irq(&pool->lock); /* * Slurp in all works issued via this workqueue and @ kernel/workqueue.c:2532 @ static int rescuer_thread(void *__rescuer) * incur MAYDAY_INTERVAL delay inbetween. */ if (need_to_create_worker(pool)) { - spin_lock(&wq_mayday_lock); + raw_spin_lock(&wq_mayday_lock); /* * Queue iff we aren't racing destruction * and somebody else hasn't queued it already. @ kernel/workqueue.c:2541 @ static int rescuer_thread(void *__rescuer) get_pwq(pwq); list_add_tail(&pwq->mayday_node, &wq->maydays); } - spin_unlock(&wq_mayday_lock); + raw_spin_unlock(&wq_mayday_lock); } } @ kernel/workqueue.c:2559 @ static int rescuer_thread(void *__rescuer) if (need_more_worker(pool)) wake_up_worker(pool); - spin_unlock_irq(&pool->lock); + raw_spin_unlock_irq(&pool->lock); worker_detach_from_pool(rescuer); - spin_lock_irq(&wq_mayday_lock); + raw_spin_lock_irq(&wq_mayday_lock); } - spin_unlock_irq(&wq_mayday_lock); + raw_spin_unlock_irq(&wq_mayday_lock); if (should_stop) { __set_current_state(TASK_RUNNING); @ kernel/workqueue.c:2646 @ static void wq_barrier_func(struct work_struct *work) * underneath us, so we can't reliably determine pwq from @target. * * CONTEXT: - * spin_lock_irq(pool->lock). + * raw_spin_lock_irq(pool->lock). */ static void insert_wq_barrier(struct pool_workqueue *pwq, struct wq_barrier *barr, @ kernel/workqueue.c:2733 @ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq, for_each_pwq(pwq, wq) { struct worker_pool *pool = pwq->pool; - spin_lock_irq(&pool->lock); + raw_spin_lock_irq(&pool->lock); if (flush_color >= 0) { WARN_ON_ONCE(pwq->flush_color != -1); @ kernel/workqueue.c:2750 @ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq, pwq->work_color = work_color; } - spin_unlock_irq(&pool->lock); + raw_spin_unlock_irq(&pool->lock); } if (flush_color >= 0 && atomic_dec_and_test(&wq->nr_pwqs_to_flush)) @ kernel/workqueue.c:2950 @ void drain_workqueue(struct workqueue_struct *wq) for_each_pwq(pwq, wq) { bool drained; - spin_lock_irq(&pwq->pool->lock); + raw_spin_lock_irq(&pwq->pool->lock); drained = !pwq->nr_active && list_empty(&pwq->delayed_works); - spin_unlock_irq(&pwq->pool->lock); + raw_spin_unlock_irq(&pwq->pool->lock); if (drained) continue; @ kernel/workqueue.c:2988 @ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr, return false; } - spin_lock_irq(&pool->lock); + raw_spin_lock_irq(&pool->lock); /* see the comment in try_to_grab_pending() with the same code */ pwq = get_work_pwq(work); if (pwq) { @ kernel/workqueue.c:3004 @ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr, check_flush_dependency(pwq->wq, work); insert_wq_barrier(pwq, barr, work, worker); - spin_unlock_irq(&pool->lock); + raw_spin_unlock_irq(&pool->lock); /* * Force a lock recursion deadlock when using flush_work() inside a @ kernel/workqueue.c:3023 @ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr, rcu_read_unlock(); return true; already_gone: - spin_unlock_irq(&pool->lock); + raw_spin_unlock_irq(&pool->lock); rcu_read_unlock(); return false; } @ kernel/workqueue.c:3416 @ static bool wqattrs_equal(const struct workqueue_attrs *a, */ static int init_worker_pool(struct worker_pool *pool) { - spin_lock_init(&pool->lock); + raw_spin_lock_init(&pool->lock); pool->id = -1; pool->cpu = -1; pool->node = NUMA_NO_NODE; @ kernel/workqueue.c:3506 @ static void rcu_free_pool(struct rcu_head *rcu) kfree(pool); } +/* This returns with the lock held on success (pool manager is inactive). */ +static bool wq_manager_inactive(struct worker_pool *pool) +{ + raw_spin_lock_irq(&pool->lock); + + if (pool->flags & POOL_MANAGER_ACTIVE) { + raw_spin_unlock_irq(&pool->lock); + return false; + } + return true; +} + /** * put_unbound_pool - put a worker_pool * @pool: worker_pool to put @ kernel/workqueue.c:3553 @ static void put_unbound_pool(struct worker_pool *pool) * Become the manager and destroy all workers. This prevents * @pool's workers from blocking on attach_mutex. We're the last * manager and @pool gets freed with the flag set. + * Because of how wq_manager_inactive() works, we will hold the + * spinlock after a successful wait. */ - spin_lock_irq(&pool->lock); - wait_event_lock_irq(wq_manager_wait, - !(pool->flags & POOL_MANAGER_ACTIVE), pool->lock); + rcuwait_wait_event(&manager_wait, wq_manager_inactive(pool)); pool->flags |= POOL_MANAGER_ACTIVE; while ((worker = first_idle_worker(pool))) destroy_worker(worker); WARN_ON(pool->nr_workers || pool->nr_idle); - spin_unlock_irq(&pool->lock); + raw_spin_unlock_irq(&pool->lock); mutex_lock(&wq_pool_attach_mutex); if (!list_empty(&pool->workers)) @ kernel/workqueue.c:3718 @ static void pwq_adjust_max_active(struct pool_workqueue *pwq) return; /* this function can be called during early boot w/ irq disabled */ - spin_lock_irqsave(&pwq->pool->lock, flags); + raw_spin_lock_irqsave(&pwq->pool->lock, flags); /* * During [un]freezing, the caller is responsible for ensuring that @ kernel/workqueue.c:3741 @ static void pwq_adjust_max_active(struct pool_workqueue *pwq) pwq->max_active = 0; } - spin_unlock_irqrestore(&pwq->pool->lock, flags); + raw_spin_unlock_irqrestore(&pwq->pool->lock, flags); } /* initialize newly alloced @pwq which is associated with @wq and @pool */ @ kernel/workqueue.c:4143 @ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu, use_dfl_pwq: mutex_lock(&wq->mutex); - spin_lock_irq(&wq->dfl_pwq->pool->lock); + raw_spin_lock_irq(&wq->dfl_pwq->pool->lock); get_pwq(wq->dfl_pwq); - spin_unlock_irq(&wq->dfl_pwq->pool->lock); + raw_spin_unlock_irq(&wq->dfl_pwq->pool->lock); old_pwq = numa_pwq_tbl_install(wq, node, wq->dfl_pwq); out_unlock: mutex_unlock(&wq->mutex); @ kernel/workqueue.c:4374 @ void destroy_workqueue(struct workqueue_struct *wq) struct worker *rescuer = wq->rescuer; /* this prevents new queueing */ - spin_lock_irq(&wq_mayday_lock); + raw_spin_lock_irq(&wq_mayday_lock); wq->rescuer = NULL; - spin_unlock_irq(&wq_mayday_lock); + raw_spin_unlock_irq(&wq_mayday_lock); /* rescuer will empty maydays list before exiting */ kthread_stop(rescuer->task); @ kernel/workqueue.c:4390 @ void destroy_workqueue(struct workqueue_struct *wq) mutex_lock(&wq_pool_mutex); mutex_lock(&wq->mutex); for_each_pwq(pwq, wq) { - spin_lock_irq(&pwq->pool->lock); + raw_spin_lock_irq(&pwq->pool->lock); if (WARN_ON(pwq_busy(pwq))) { pr_warn("%s: %s has the following busy pwq\n", __func__, wq->name); show_pwq(pwq); - spin_unlock_irq(&pwq->pool->lock); + raw_spin_unlock_irq(&pwq->pool->lock); mutex_unlock(&wq->mutex); mutex_unlock(&wq_pool_mutex); show_workqueue_state(); return; } - spin_unlock_irq(&pwq->pool->lock); + raw_spin_unlock_irq(&pwq->pool->lock); } mutex_unlock(&wq->mutex); mutex_unlock(&wq_pool_mutex); @ kernel/workqueue.c:4572 @ unsigned int work_busy(struct work_struct *work) rcu_read_lock(); pool = get_work_pool(work); if (pool) { - spin_lock_irqsave(&pool->lock, flags); + raw_spin_lock_irqsave(&pool->lock, flags); if (find_worker_executing_work(pool, work)) ret |= WORK_BUSY_RUNNING; - spin_unlock_irqrestore(&pool->lock, flags); + raw_spin_unlock_irqrestore(&pool->lock, flags); } rcu_read_unlock(); @ kernel/workqueue.c:4782 @ void show_workqueue_state(void) pr_info("workqueue %s: flags=0x%x\n", wq->name, wq->flags); for_each_pwq(pwq, wq) { - spin_lock_irqsave(&pwq->pool->lock, flags); + raw_spin_lock_irqsave(&pwq->pool->lock, flags); if (pwq->nr_active || !list_empty(&pwq->delayed_works)) show_pwq(pwq); - spin_unlock_irqrestore(&pwq->pool->lock, flags); + raw_spin_unlock_irqrestore(&pwq->pool->lock, flags); /* * We could be printing a lot from atomic context, e.g. * sysrq-t -> show_workqueue_state(). Avoid triggering @ kernel/workqueue.c:4799 @ void show_workqueue_state(void) struct worker *worker; bool first = true; - spin_lock_irqsave(&pool->lock, flags); + raw_spin_lock_irqsave(&pool->lock, flags); if (pool->nr_workers == pool->nr_idle) goto next_pool; @ kernel/workqueue.c:4818 @ void show_workqueue_state(void) } pr_cont("\n"); next_pool: - spin_unlock_irqrestore(&pool->lock, flags); + raw_spin_unlock_irqrestore(&pool->lock, flags); /* * We could be printing a lot from atomic context, e.g. * sysrq-t -> show_workqueue_state(). Avoid triggering @ kernel/workqueue.c:4848 @ void wq_worker_comm(char *buf, size_t size, struct task_struct *task) struct worker_pool *pool = worker->pool; if (pool) { - spin_lock_irq(&pool->lock); + raw_spin_lock_irq(&pool->lock); /* * ->desc tracks information (wq name or * set_worker_desc()) for the latest execution. If @ kernel/workqueue.c:4862 @ void wq_worker_comm(char *buf, size_t size, struct task_struct *task) scnprintf(buf + off, size - off, "-%s", worker->desc); } - spin_unlock_irq(&pool->lock); + raw_spin_unlock_irq(&pool->lock); } } @ kernel/workqueue.c:4893 @ static void unbind_workers(int cpu) for_each_cpu_worker_pool(pool, cpu) { mutex_lock(&wq_pool_attach_mutex); - spin_lock_irq(&pool->lock); + raw_spin_lock_irq(&pool->lock); /* * We've blocked all attach/detach operations. Make all workers @ kernel/workqueue.c:4907 @ static void unbind_workers(int cpu) pool->flags |= POOL_DISASSOCIATED; - spin_unlock_irq(&pool->lock); + raw_spin_unlock_irq(&pool->lock); mutex_unlock(&wq_pool_attach_mutex); /* @ kernel/workqueue.c:4933 @ static void unbind_workers(int cpu) * worker blocking could lead to lengthy stalls. Kick off * unbound chain execution of currently pending work items. */ - spin_lock_irq(&pool->lock); + raw_spin_lock_irq(&pool->lock); wake_up_worker(pool); - spin_unlock_irq(&pool->lock); + raw_spin_unlock_irq(&pool->lock); } } @ kernel/workqueue.c:4962 @ static void rebind_workers(struct worker_pool *pool) WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask) < 0); - spin_lock_irq(&pool->lock); + raw_spin_lock_irq(&pool->lock); pool->flags &= ~POOL_DISASSOCIATED; @ kernel/workqueue.c:5001 @ static void rebind_workers(struct worker_pool *pool) WRITE_ONCE(worker->flags, worker_flags); } - spin_unlock_irq(&pool->lock); + raw_spin_unlock_irq(&pool->lock); } /** @ lib/Kconfig.debug:64 @ config CONSOLE_LOGLEVEL_QUIET will be used as the loglevel. IOW passing "quiet" will be the equivalent of passing "loglevel=<CONSOLE_LOGLEVEL_QUIET>" +config CONSOLE_LOGLEVEL_EMERGENCY + int "Emergency console loglevel (1-15)" + range 1 15 + default "5" + help + The loglevel to determine if a console message is an emergency + message. + + If supported by the console driver, emergency messages will be + flushed to the console immediately. This can cause significant system + latencies so the value should be set such that only significant + messages are classified as emergency messages. + + Setting a default here is equivalent to passing in + emergency_loglevel=<x> in the kernel bootargs. emergency_loglevel=<x> + continues to override whatever value is specified here as well. + config MESSAGE_LOGLEVEL_DEFAULT int "Default message log level (1-7)" range 1 7 @ lib/Kconfig.debug:1225 @ config DEBUG_ATOMIC_SLEEP config DEBUG_LOCKING_API_SELFTESTS bool "Locking API boot-time self-tests" - depends on DEBUG_KERNEL + depends on DEBUG_KERNEL && !PREEMPT_RT help Say Y here if you want the kernel to run a short self-test during bootup. The self-test checks whether common types of locking bugs @ lib/Makefile:30 @ endif lib-y := ctype.o string.o vsprintf.o cmdline.o \ rbtree.o radix-tree.o timerqueue.o xarray.o \ - idr.o extable.o sha1.o irq_regs.o argv_split.o \ + idr.o extable.o sha1.o irq_regs.o argv_split.o printk_ringbuffer.o \ flex_proportions.o ratelimit.o show_mem.o \ is_single_threaded.o plist.o decompress.o kobject_uevent.o \ earlycpio.o seq_buf.o siphash.o dec_and_lock.o \ @ lib/bust_spinlocks.c:29 @ void bust_spinlocks(int yes) unblank_screen(); #endif console_unblank(); - if (--oops_in_progress == 0) - wake_up_klogd(); + --oops_in_progress; } } @ lib/debugobjects.c:540 @ __debug_object_init(void *addr, struct debug_obj_descr *descr, int onstack) struct debug_obj *obj; unsigned long flags; - fill_pool(); +#ifdef CONFIG_PREEMPT_RT + if (preempt_count() == 0 && !irqs_disabled()) +#endif + fill_pool(); db = get_bucket((unsigned long) addr); @ lib/irq_poll.c:40 @ void irq_poll_sched(struct irq_poll *iop) list_add_tail(&iop->list, this_cpu_ptr(&blk_cpu_iopoll)); raise_softirq_irqoff(IRQ_POLL_SOFTIRQ); local_irq_restore(flags); + preempt_check_resched_rt(); } EXPORT_SYMBOL(irq_poll_sched); @ lib/irq_poll.c:76 @ void irq_poll_complete(struct irq_poll *iop) local_irq_save(flags); __irq_poll_complete(iop); local_irq_restore(flags); + preempt_check_resched_rt(); } EXPORT_SYMBOL(irq_poll_complete); @ lib/irq_poll.c:101 @ static void __latent_entropy irq_poll_softirq(struct softirq_action *h) } local_irq_enable(); + preempt_check_resched_rt(); /* Even though interrupts have been re-enabled, this * access is safe because interrupts can only add new @ lib/irq_poll.c:139 @ static void __latent_entropy irq_poll_softirq(struct softirq_action *h) __raise_softirq_irqoff(IRQ_POLL_SOFTIRQ); local_irq_enable(); + preempt_check_resched_rt(); } /** @ lib/irq_poll.c:203 @ static int irq_poll_cpu_dead(unsigned int cpu) this_cpu_ptr(&blk_cpu_iopoll)); __raise_softirq_irqoff(IRQ_POLL_SOFTIRQ); local_irq_enable(); + preempt_check_resched_rt(); return 0; } @ lib/locking-selftest.c:745 @ GENERATE_TESTCASE(init_held_rtmutex); #include "locking-selftest-spin-hardirq.h" GENERATE_PERMUTATIONS_2_EVENTS(irqsafe1_hard_spin) +#ifndef CONFIG_PREEMPT_RT + #include "locking-selftest-rlock-hardirq.h" GENERATE_PERMUTATIONS_2_EVENTS(irqsafe1_hard_rlock) @ lib/locking-selftest.c:762 @ GENERATE_PERMUTATIONS_2_EVENTS(irqsafe1_soft_rlock) #include "locking-selftest-wlock-softirq.h" GENERATE_PERMUTATIONS_2_EVENTS(irqsafe1_soft_wlock) +#endif + #undef E1 #undef E2 +#ifndef CONFIG_PREEMPT_RT /* * Enabling hardirqs with a softirq-safe lock held: */ @ lib/locking-selftest.c:800 @ GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2A_rlock) #undef E1 #undef E2 +#endif + /* * Enabling irqs with an irq-safe lock held: */ @ lib/locking-selftest.c:825 @ GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2A_rlock) #include "locking-selftest-spin-hardirq.h" GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2B_hard_spin) +#ifndef CONFIG_PREEMPT_RT + #include "locking-selftest-rlock-hardirq.h" GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2B_hard_rlock) @ lib/locking-selftest.c:842 @ GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2B_soft_rlock) #include "locking-selftest-wlock-softirq.h" GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2B_soft_wlock) +#endif + #undef E1 #undef E2 @ lib/locking-selftest.c:875 @ GENERATE_PERMUTATIONS_2_EVENTS(irqsafe2B_soft_wlock) #include "locking-selftest-spin-hardirq.h" GENERATE_PERMUTATIONS_3_EVENTS(irqsafe3_hard_spin) +#ifndef CONFIG_PREEMPT_RT + #include "locking-selftest-rlock-hardirq.h" GENERATE_PERMUTATIONS_3_EVENTS(irqsafe3_hard_rlock) @ lib/locking-selftest.c:892 @ GENERATE_PERMUTATIONS_3_EVENTS(irqsafe3_soft_rlock) #include "locking-selftest-wlock-softirq.h" GENERATE_PERMUTATIONS_3_EVENTS(irqsafe3_soft_wlock) +#endif + #undef E1 #undef E2 #undef E3 @ lib/locking-selftest.c:927 @ GENERATE_PERMUTATIONS_3_EVENTS(irqsafe3_soft_wlock) #include "locking-selftest-spin-hardirq.h" GENERATE_PERMUTATIONS_3_EVENTS(irqsafe4_hard_spin) +#ifndef CONFIG_PREEMPT_RT + #include "locking-selftest-rlock-hardirq.h" GENERATE_PERMUTATIONS_3_EVENTS(irqsafe4_hard_rlock) @ lib/locking-selftest.c:944 @ GENERATE_PERMUTATIONS_3_EVENTS(irqsafe4_soft_rlock) #include "locking-selftest-wlock-softirq.h" GENERATE_PERMUTATIONS_3_EVENTS(irqsafe4_soft_wlock) +#endif + #undef E1 #undef E2 #undef E3 +#ifndef CONFIG_PREEMPT_RT + /* * read-lock / write-lock irq inversion. * @ lib/locking-selftest.c:1014 @ GENERATE_PERMUTATIONS_3_EVENTS(irq_inversion_soft_wlock) #undef E2 #undef E3 +#endif + +#ifndef CONFIG_PREEMPT_RT + /* * read-lock / write-lock recursion that is actually safe. */ @ lib/locking-selftest.c:1056 @ GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion_soft) #undef E2 #undef E3 +#endif + /* * read-lock / write-lock recursion that is unsafe. */ @ lib/locking-selftest.c:2088 @ void locking_selftest(void) printk(" --------------------------------------------------------------------------\n"); +#ifndef CONFIG_PREEMPT_RT /* * irq-context testcases: */ @ lib/locking-selftest.c:2101 @ void locking_selftest(void) DO_TESTCASE_6x2("irq read-recursion", irq_read_recursion); // DO_TESTCASE_6x2B("irq read-recursion #2", irq_read_recursion2); +#else + /* On -rt, we only do hardirq context test for raw spinlock */ + DO_TESTCASE_1B("hard-irqs-on + irq-safe-A", irqsafe1_hard_spin, 12); + DO_TESTCASE_1B("hard-irqs-on + irq-safe-A", irqsafe1_hard_spin, 21); + + DO_TESTCASE_1B("hard-safe-A + irqs-on", irqsafe2B_hard_spin, 12); + DO_TESTCASE_1B("hard-safe-A + irqs-on", irqsafe2B_hard_spin, 21); + + DO_TESTCASE_1B("hard-safe-A + unsafe-B #1", irqsafe3_hard_spin, 123); + DO_TESTCASE_1B("hard-safe-A + unsafe-B #1", irqsafe3_hard_spin, 132); + DO_TESTCASE_1B("hard-safe-A + unsafe-B #1", irqsafe3_hard_spin, 213); + DO_TESTCASE_1B("hard-safe-A + unsafe-B #1", irqsafe3_hard_spin, 231); + DO_TESTCASE_1B("hard-safe-A + unsafe-B #1", irqsafe3_hard_spin, 312); + DO_TESTCASE_1B("hard-safe-A + unsafe-B #1", irqsafe3_hard_spin, 321); + + DO_TESTCASE_1B("hard-safe-A + unsafe-B #2", irqsafe4_hard_spin, 123); + DO_TESTCASE_1B("hard-safe-A + unsafe-B #2", irqsafe4_hard_spin, 132); + DO_TESTCASE_1B("hard-safe-A + unsafe-B #2", irqsafe4_hard_spin, 213); + DO_TESTCASE_1B("hard-safe-A + unsafe-B #2", irqsafe4_hard_spin, 231); + DO_TESTCASE_1B("hard-safe-A + unsafe-B #2", irqsafe4_hard_spin, 312); + DO_TESTCASE_1B("hard-safe-A + unsafe-B #2", irqsafe4_hard_spin, 321); +#endif ww_tests(); @ lib/nmi_backtrace.c:78 @ void nmi_trigger_cpumask_backtrace(const cpumask_t *mask, touch_softlockup_watchdog(); } - /* - * Force flush any remote buffers that might be stuck in IRQ context - * and therefore could not run their irq_work. - */ - printk_safe_flush(); - clear_bit_unlock(0, &backtrace_flag); put_cpu(); } @ lib/printk_ringbuffer.c:4 @ +// SPDX-License-Identifier: GPL-2.0 +#include <linux/sched.h> +#include <linux/smp.h> +#include <linux/string.h> +#include <linux/errno.h> +#include <linux/printk_ringbuffer.h> + +#define PRB_SIZE(rb) (1 << rb->size_bits) +#define PRB_SIZE_BITMASK(rb) (PRB_SIZE(rb) - 1) +#define PRB_INDEX(rb, lpos) (lpos & PRB_SIZE_BITMASK(rb)) +#define PRB_WRAPS(rb, lpos) (lpos >> rb->size_bits) +#define PRB_WRAP_LPOS(rb, lpos, xtra) \ + ((PRB_WRAPS(rb, lpos) + xtra) << rb->size_bits) +#define PRB_DATA_SIZE(e) (e->size - sizeof(struct prb_entry)) +#define PRB_DATA_ALIGN sizeof(long) + +static bool __prb_trylock(struct prb_cpulock *cpu_lock, + unsigned int *cpu_store) +{ + unsigned long *flags; + unsigned int cpu; + + cpu = get_cpu(); + + *cpu_store = atomic_read(&cpu_lock->owner); + /* memory barrier to ensure the current lock owner is visible */ + smp_rmb(); + if (*cpu_store == -1) { + flags = per_cpu_ptr(cpu_lock->irqflags, cpu); + local_irq_save(*flags); + if (atomic_try_cmpxchg_acquire(&cpu_lock->owner, + cpu_store, cpu)) { + return true; + } + local_irq_restore(*flags); + } else if (*cpu_store == cpu) { + return true; + } + + put_cpu(); + return false; +} + +/* + * prb_lock: Perform a processor-reentrant spin lock. + * @cpu_lock: A pointer to the lock object. + * @cpu_store: A "flags" pointer to store lock status information. + * + * If no processor has the lock, the calling processor takes the lock and + * becomes the owner. If the calling processor is already the owner of the + * lock, this function succeeds immediately. If lock is locked by another + * processor, this function spins until the calling processor becomes the + * owner. + * + * It is safe to call this function from any context and state. + */ +void prb_lock(struct prb_cpulock *cpu_lock, unsigned int *cpu_store) +{ + for (;;) { + if (__prb_trylock(cpu_lock, cpu_store)) + break; + cpu_relax(); + } +} + +/* + * prb_unlock: Perform a processor-reentrant spin unlock. + * @cpu_lock: A pointer to the lock object. + * @cpu_store: A "flags" object storing lock status information. + * + * Release the lock. The calling processor must be the owner of the lock. + * + * It is safe to call this function from any context and state. + */ +void prb_unlock(struct prb_cpulock *cpu_lock, unsigned int cpu_store) +{ + unsigned long *flags; + unsigned int cpu; + + cpu = atomic_read(&cpu_lock->owner); + atomic_set_release(&cpu_lock->owner, cpu_store); + + if (cpu_store == -1) { + flags = per_cpu_ptr(cpu_lock->irqflags, cpu); + local_irq_restore(*flags); + } + + put_cpu(); +} + +static struct prb_entry *to_entry(struct printk_ringbuffer *rb, + unsigned long lpos) +{ + char *buffer = rb->buffer; + buffer += PRB_INDEX(rb, lpos); + return (struct prb_entry *)buffer; +} + +static int calc_next(struct printk_ringbuffer *rb, unsigned long tail, + unsigned long lpos, int size, unsigned long *calced_next) +{ + unsigned long next_lpos; + int ret = 0; +again: + next_lpos = lpos + size; + if (next_lpos - tail > PRB_SIZE(rb)) + return -1; + + if (PRB_WRAPS(rb, lpos) != PRB_WRAPS(rb, next_lpos)) { + lpos = PRB_WRAP_LPOS(rb, next_lpos, 0); + ret |= 1; + goto again; + } + + *calced_next = next_lpos; + return ret; +} + +static bool push_tail(struct printk_ringbuffer *rb, unsigned long tail) +{ + unsigned long new_tail; + struct prb_entry *e; + unsigned long head; + + if (tail != atomic_long_read(&rb->tail)) + return true; + + e = to_entry(rb, tail); + if (e->size != -1) + new_tail = tail + e->size; + else + new_tail = PRB_WRAP_LPOS(rb, tail, 1); + + /* make sure the new tail does not overtake the head */ + head = atomic_long_read(&rb->head); + if (head - new_tail > PRB_SIZE(rb)) + return false; + + atomic_long_cmpxchg(&rb->tail, tail, new_tail); + return true; +} + +/* + * prb_commit: Commit a reserved entry to the ring buffer. + * @h: An entry handle referencing the data entry to commit. + * + * Commit data that has been reserved using prb_reserve(). Once the data + * block has been committed, it can be invalidated at any time. If a writer + * is interested in using the data after committing, the writer should make + * its own copy first or use the prb_iter_ reader functions to access the + * data in the ring buffer. + * + * It is safe to call this function from any context and state. + */ +void prb_commit(struct prb_handle *h) +{ + struct printk_ringbuffer *rb = h->rb; + bool changed = false; + struct prb_entry *e; + unsigned long head; + unsigned long res; + + for (;;) { + if (atomic_read(&rb->ctx) != 1) { + /* the interrupted context will fixup head */ + atomic_dec(&rb->ctx); + break; + } + /* assign sequence numbers before moving head */ + head = atomic_long_read(&rb->head); + res = atomic_long_read(&rb->reserve); + while (head != res) { + e = to_entry(rb, head); + if (e->size == -1) { + head = PRB_WRAP_LPOS(rb, head, 1); + continue; + } + while (atomic_long_read(&rb->lost)) { + atomic_long_dec(&rb->lost); + rb->seq++; + } + e->seq = ++rb->seq; + head += e->size; + changed = true; + } + atomic_long_set_release(&rb->head, res); + + atomic_dec(&rb->ctx); + + if (atomic_long_read(&rb->reserve) == res) + break; + atomic_inc(&rb->ctx); + } + + prb_unlock(rb->cpulock, h->cpu); + + if (changed) { + atomic_long_inc(&rb->wq_counter); + if (wq_has_sleeper(rb->wq)) { +#ifdef CONFIG_IRQ_WORK + irq_work_queue(rb->wq_work); +#else + if (!in_nmi()) + wake_up_interruptible_all(rb->wq); +#endif + } + } +} + +/* + * prb_reserve: Reserve an entry within a ring buffer. + * @h: An entry handle to be setup and reference an entry. + * @rb: A ring buffer to reserve data within. + * @size: The number of bytes to reserve. + * + * Reserve an entry of at least @size bytes to be used by the caller. If + * successful, the data region of the entry belongs to the caller and cannot + * be invalidated by any other task/context. For this reason, the caller + * should call prb_commit() as quickly as possible in order to avoid preventing + * other tasks/contexts from reserving data in the case that the ring buffer + * has wrapped. + * + * It is safe to call this function from any context and state. + * + * Returns a pointer to the reserved entry (and @h is setup to reference that + * entry) or NULL if it was not possible to reserve data. + */ +char *prb_reserve(struct prb_handle *h, struct printk_ringbuffer *rb, + unsigned int size) +{ + unsigned long tail, res1, res2; + int ret; + + if (size == 0) + return NULL; + size += sizeof(struct prb_entry); + size += PRB_DATA_ALIGN - 1; + size &= ~(PRB_DATA_ALIGN - 1); + if (size >= PRB_SIZE(rb)) + return NULL; + + h->rb = rb; + prb_lock(rb->cpulock, &h->cpu); + + atomic_inc(&rb->ctx); + + do { + for (;;) { + tail = atomic_long_read(&rb->tail); + res1 = atomic_long_read(&rb->reserve); + ret = calc_next(rb, tail, res1, size, &res2); + if (ret >= 0) + break; + if (!push_tail(rb, tail)) { + prb_commit(h); + return NULL; + } + } + } while (!atomic_long_try_cmpxchg_acquire(&rb->reserve, &res1, res2)); + + h->entry = to_entry(rb, res1); + + if (ret) { + /* handle wrap */ + h->entry->size = -1; + h->entry = to_entry(rb, PRB_WRAP_LPOS(rb, res2, 0)); + } + + h->entry->size = size; + + return &h->entry->data[0]; +} + +/* + * prb_iter_copy: Copy an iterator. + * @dest: The iterator to copy to. + * @src: The iterator to copy from. + * + * Make a deep copy of an iterator. This is particularly useful for making + * backup copies of an iterator in case a form of rewinding it needed. + * + * It is safe to call this function from any context and state. But + * note that this function is not atomic. Callers should not make copies + * to/from iterators that can be accessed by other tasks/contexts. + */ +void prb_iter_copy(struct prb_iterator *dest, struct prb_iterator *src) +{ + memcpy(dest, src, sizeof(*dest)); +} + +/* + * prb_iter_init: Initialize an iterator for a ring buffer. + * @iter: The iterator to initialize. + * @rb: A ring buffer to that @iter should iterate. + * @seq: The sequence number of the position preceding the first record. + * May be NULL. + * + * Initialize an iterator to be used with a specified ring buffer. If @seq + * is non-NULL, it will be set such that prb_iter_next() will provide a + * sequence value of "@seq + 1" if no records were missed. + * + * It is safe to call this function from any context and state. + */ +void prb_iter_init(struct prb_iterator *iter, struct printk_ringbuffer *rb, + u64 *seq) +{ + memset(iter, 0, sizeof(*iter)); + iter->rb = rb; + iter->lpos = PRB_INIT; + + if (!seq) + return; + + for (;;) { + struct prb_iterator tmp_iter; + int ret; + + prb_iter_copy(&tmp_iter, iter); + + ret = prb_iter_next(&tmp_iter, NULL, 0, seq); + if (ret < 0) + continue; + + if (ret == 0) + *seq = 0; + else + (*seq)--; + break; + } +} + +static bool is_valid(struct printk_ringbuffer *rb, unsigned long lpos) +{ + unsigned long head, tail; + + tail = atomic_long_read(&rb->tail); + head = atomic_long_read(&rb->head); + head -= tail; + lpos -= tail; + + if (lpos >= head) + return false; + return true; +} + +/* + * prb_iter_data: Retrieve the record data at the current position. + * @iter: Iterator tracking the current position. + * @buf: A buffer to store the data of the record. May be NULL. + * @size: The size of @buf. (Ignored if @buf is NULL.) + * @seq: The sequence number of the record. May be NULL. + * + * If @iter is at a record, provide the data and/or sequence number of that + * record (if specified by the caller). + * + * It is safe to call this function from any context and state. + * + * Returns >=0 if the current record contains valid data (returns 0 if @buf + * is NULL or returns the size of the data block if @buf is non-NULL) or + * -EINVAL if @iter is now invalid. + */ +int prb_iter_data(struct prb_iterator *iter, char *buf, int size, u64 *seq) +{ + struct printk_ringbuffer *rb = iter->rb; + unsigned long lpos = iter->lpos; + unsigned int datsize = 0; + struct prb_entry *e; + + if (buf || seq) { + e = to_entry(rb, lpos); + if (!is_valid(rb, lpos)) + return -EINVAL; + /* memory barrier to ensure valid lpos */ + smp_rmb(); + if (buf) { + datsize = PRB_DATA_SIZE(e); + /* memory barrier to ensure load of datsize */ + smp_rmb(); + if (!is_valid(rb, lpos)) + return -EINVAL; + if (PRB_INDEX(rb, lpos) + datsize > + PRB_SIZE(rb) - PRB_DATA_ALIGN) { + return -EINVAL; + } + if (size > datsize) + size = datsize; + memcpy(buf, &e->data[0], size); + } + if (seq) + *seq = e->seq; + /* memory barrier to ensure loads of entry data */ + smp_rmb(); + } + + if (!is_valid(rb, lpos)) + return -EINVAL; + + return datsize; +} + +/* + * prb_iter_next: Advance to the next record. + * @iter: Iterator tracking the current position. + * @buf: A buffer to store the data of the next record. May be NULL. + * @size: The size of @buf. (Ignored if @buf is NULL.) + * @seq: The sequence number of the next record. May be NULL. + * + * If a next record is available, @iter is advanced and (if specified) + * the data and/or sequence number of that record are provided. + * + * It is safe to call this function from any context and state. + * + * Returns 1 if @iter was advanced, 0 if @iter is at the end of the list, or + * -EINVAL if @iter is now invalid. + */ +int prb_iter_next(struct prb_iterator *iter, char *buf, int size, u64 *seq) +{ + struct printk_ringbuffer *rb = iter->rb; + unsigned long next_lpos; + struct prb_entry *e; + unsigned int esize; + + if (iter->lpos == PRB_INIT) { + next_lpos = atomic_long_read(&rb->tail); + } else { + if (!is_valid(rb, iter->lpos)) + return -EINVAL; + /* memory barrier to ensure valid lpos */ + smp_rmb(); + e = to_entry(rb, iter->lpos); + esize = e->size; + /* memory barrier to ensure load of size */ + smp_rmb(); + if (!is_valid(rb, iter->lpos)) + return -EINVAL; + next_lpos = iter->lpos + esize; + } + if (next_lpos == atomic_long_read(&rb->head)) + return 0; + if (!is_valid(rb, next_lpos)) + return -EINVAL; + /* memory barrier to ensure valid lpos */ + smp_rmb(); + + iter->lpos = next_lpos; + e = to_entry(rb, iter->lpos); + esize = e->size; + /* memory barrier to ensure load of size */ + smp_rmb(); + if (!is_valid(rb, iter->lpos)) + return -EINVAL; + if (esize == -1) + iter->lpos = PRB_WRAP_LPOS(rb, iter->lpos, 1); + + if (prb_iter_data(iter, buf, size, seq) < 0) + return -EINVAL; + + return 1; +} + +/* + * prb_iter_wait_next: Advance to the next record, blocking if none available. + * @iter: Iterator tracking the current position. + * @buf: A buffer to store the data of the next record. May be NULL. + * @size: The size of @buf. (Ignored if @buf is NULL.) + * @seq: The sequence number of the next record. May be NULL. + * + * If a next record is already available, this function works like + * prb_iter_next(). Otherwise block interruptible until a next record is + * available. + * + * When a next record is available, @iter is advanced and (if specified) + * the data and/or sequence number of that record are provided. + * + * This function might sleep. + * + * Returns 1 if @iter was advanced, -EINVAL if @iter is now invalid, or + * -ERESTARTSYS if interrupted by a signal. + */ +int prb_iter_wait_next(struct prb_iterator *iter, char *buf, int size, u64 *seq) +{ + unsigned long last_seen; + int ret; + + for (;;) { + last_seen = atomic_long_read(&iter->rb->wq_counter); + + ret = prb_iter_next(iter, buf, size, seq); + if (ret != 0) + break; + + ret = wait_event_interruptible(*iter->rb->wq, + last_seen != atomic_long_read(&iter->rb->wq_counter)); + if (ret < 0) + break; + } + + return ret; +} + +/* + * prb_iter_seek: Seek forward to a specific record. + * @iter: Iterator to advance. + * @seq: Record number to advance to. + * + * Advance @iter such that a following call to prb_iter_data() will provide + * the contents of the specified record. If a record is specified that does + * not yet exist, advance @iter to the end of the record list. + * + * Note that iterators cannot be rewound. So if a record is requested that + * exists but is previous to @iter in position, @iter is considered invalid. + * + * It is safe to call this function from any context and state. + * + * Returns 1 on succces, 0 if specified record does not yet exist (@iter is + * now at the end of the list), or -EINVAL if @iter is now invalid. + */ +int prb_iter_seek(struct prb_iterator *iter, u64 seq) +{ + u64 cur_seq; + int ret; + + /* first check if the iterator is already at the wanted seq */ + if (seq == 0) { + if (iter->lpos == PRB_INIT) + return 1; + else + return -EINVAL; + } + if (iter->lpos != PRB_INIT) { + if (prb_iter_data(iter, NULL, 0, &cur_seq) >= 0) { + if (cur_seq == seq) + return 1; + if (cur_seq > seq) + return -EINVAL; + } + } + + /* iterate to find the wanted seq */ + for (;;) { + ret = prb_iter_next(iter, NULL, 0, &cur_seq); + if (ret <= 0) + break; + + if (cur_seq == seq) + break; + + if (cur_seq > seq) { + ret = -EINVAL; + break; + } + } + + return ret; +} + +/* + * prb_buffer_size: Get the size of the ring buffer. + * @rb: The ring buffer to get the size of. + * + * Return the number of bytes used for the ring buffer entry storage area. + * Note that this area stores both entry header and entry data. Therefore + * this represents an upper bound to the amount of data that can be stored + * in the ring buffer. + * + * It is safe to call this function from any context and state. + * + * Returns the size in bytes of the entry storage area. + */ +int prb_buffer_size(struct printk_ringbuffer *rb) +{ + return PRB_SIZE(rb); +} + +/* + * prb_inc_lost: Increment the seq counter to signal a lost record. + * @rb: The ring buffer to increment the seq of. + * + * Increment the seq counter so that a seq number is intentially missing + * for the readers. This allows readers to identify that a record is + * missing. A writer will typically use this function if prb_reserve() + * fails. + * + * It is safe to call this function from any context and state. + */ +void prb_inc_lost(struct printk_ringbuffer *rb) +{ + atomic_long_inc(&rb->lost); +} @ lib/radix-tree.c:29 @ #include <linux/slab.h> #include <linux/string.h> #include <linux/xarray.h> - +#include <linux/locallock.h> /* * Radix tree node cache. @ lib/radix-tree.c:75 @ struct radix_tree_preload { struct radix_tree_node *nodes; }; static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, }; +static DEFINE_LOCAL_IRQ_LOCK(radix_tree_preloads_lock); static inline struct radix_tree_node *entry_to_node(void *ptr) { @ lib/radix-tree.c:273 @ radix_tree_node_alloc(gfp_t gfp_mask, struct radix_tree_node *parent, * succeed in getting a node here (and never reach * kmem_cache_alloc) */ - rtp = this_cpu_ptr(&radix_tree_preloads); + rtp = &get_locked_var(radix_tree_preloads_lock, radix_tree_preloads); if (rtp->nr) { ret = rtp->nodes; rtp->nodes = ret->parent; rtp->nr--; } + put_locked_var(radix_tree_preloads_lock, radix_tree_preloads); /* * Update the allocation stack trace as this is more useful * for debugging. @ lib/radix-tree.c:345 @ static __must_check int __radix_tree_preload(gfp_t gfp_mask, unsigned nr) */ gfp_mask &= ~__GFP_ACCOUNT; - preempt_disable(); + local_lock(radix_tree_preloads_lock); rtp = this_cpu_ptr(&radix_tree_preloads); while (rtp->nr < nr) { - preempt_enable(); + local_unlock(radix_tree_preloads_lock); node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask); if (node == NULL) goto out; - preempt_disable(); + local_lock(radix_tree_preloads_lock); rtp = this_cpu_ptr(&radix_tree_preloads); if (rtp->nr < nr) { node->parent = rtp->nodes; @ lib/radix-tree.c:394 @ int radix_tree_maybe_preload(gfp_t gfp_mask) if (gfpflags_allow_blocking(gfp_mask)) return __radix_tree_preload(gfp_mask, RADIX_TREE_PRELOAD_SIZE); /* Preloading doesn't help anything with this gfp mask, skip it */ - preempt_disable(); + local_lock(radix_tree_preloads_lock); return 0; } EXPORT_SYMBOL(radix_tree_maybe_preload); +void radix_tree_preload_end(void) +{ + local_unlock(radix_tree_preloads_lock); +} +EXPORT_SYMBOL(radix_tree_preload_end); + static unsigned radix_tree_load_root(const struct radix_tree_root *root, struct radix_tree_node **nodep, unsigned long *maxindex) { @ lib/radix-tree.c:1489 @ EXPORT_SYMBOL(radix_tree_tagged); void idr_preload(gfp_t gfp_mask) { if (__radix_tree_preload(gfp_mask, IDR_PRELOAD_SIZE)) - preempt_disable(); + local_lock(radix_tree_preloads_lock); } EXPORT_SYMBOL(idr_preload); +void idr_preload_end(void) +{ + local_unlock(radix_tree_preloads_lock); +} +EXPORT_SYMBOL(idr_preload_end); + void __rcu **idr_get_free(struct radix_tree_root *root, struct radix_tree_iter *iter, gfp_t gfp, unsigned long max) @ lib/scatterlist.c:814 @ void sg_miter_stop(struct sg_mapping_iter *miter) flush_kernel_dcache_page(miter->page); if (miter->__flags & SG_MITER_ATOMIC) { - WARN_ON_ONCE(preemptible()); + WARN_ON_ONCE(!pagefault_disabled()); kunmap_atomic(miter->addr); } else kunmap(miter->page); @ lib/smp_processor_id.c:26 @ unsigned int check_preemption_disabled(const char *what1, const char *what2) * Kernel threads bound to a single CPU can safely use * smp_processor_id(): */ +#if defined(CONFIG_PREEMPT_RT) && (defined(CONFIG_SMP) || defined(CONFIG_SCHED_DEBUG)) + if (current->migrate_disable) + goto out; +#endif + if (current->nr_cpus_allowed == 1) goto out; @ lib/test_bpf.c:6663 @ static int __run_one(const struct bpf_prog *fp, const void *data, u64 start, finish; int ret = 0, i; - preempt_disable(); + migrate_disable(); start = ktime_get_ns(); for (i = 0; i < runs; i++) ret = BPF_PROG_RUN(fp, data); finish = ktime_get_ns(); - preempt_enable(); + migrate_enable(); *duration = finish - start; do_div(*duration, runs); @ localversion-rt:1 @ +-rt11 @ mm/Kconfig:373 @ config NOMMU_INITIAL_TRIM_EXCESS config TRANSPARENT_HUGEPAGE bool "Transparent Hugepage Support" - depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE + depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !PREEMPT_RT select COMPACTION select XARRAY_MULTI help @ mm/compaction.c:1593 @ typedef enum { * Allow userspace to control policy on scanning the unevictable LRU for * compactable pages. */ +#ifdef CONFIG_PREEMPT_RT +int sysctl_compact_unevictable_allowed __read_mostly = 0; +#else int sysctl_compact_unevictable_allowed __read_mostly = 1; +#endif static inline void update_fast_start_pfn(struct compact_control *cc, unsigned long pfn) @ mm/compaction.c:2247 @ compact_zone(struct compact_control *cc, struct capture_control *capc) block_start_pfn(cc->migrate_pfn, cc->order); if (last_migrated_pfn < current_block_start) { - cpu = get_cpu(); + cpu = get_cpu_light(); + local_lock_irq(swapvec_lock); lru_add_drain_cpu(cpu); + local_unlock_irq(swapvec_lock); drain_local_pages(cc->zone); - put_cpu(); + put_cpu_light(); /* No more flushing until we migrate again */ last_migrated_pfn = 0; } @ mm/highmem.c:34 @ #include <asm/tlbflush.h> #include <linux/vmalloc.h> +#ifndef CONFIG_PREEMPT_RT #if defined(CONFIG_HIGHMEM) || defined(CONFIG_X86_32) DEFINE_PER_CPU(int, __kmap_atomic_idx); +EXPORT_PER_CPU_SYMBOL(__kmap_atomic_idx); +#endif #endif /* @ mm/highmem.c:114 @ static inline wait_queue_head_t *get_pkmap_wait_queue_head(unsigned int color) atomic_long_t _totalhigh_pages __read_mostly; EXPORT_SYMBOL(_totalhigh_pages); -EXPORT_PER_CPU_SYMBOL(__kmap_atomic_idx); - unsigned int nr_free_highpages (void) { struct zone *zone; @ mm/memcontrol.c:66 @ #include <net/sock.h> #include <net/ip.h> #include "slab.h" +#include <linux/locallock.h> #include <linux/uaccess.h> @ mm/memcontrol.c:96 @ int do_swap_account __read_mostly; static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq); #endif +static DEFINE_LOCAL_IRQ_LOCK(event_lock); + /* Whether legacy memory+swap accounting is active */ static bool do_memsw_account(void) { @ mm/memcontrol.c:2167 @ static void drain_all_stock(struct mem_cgroup *root_memcg) * as well as workers from this path always operate on the local * per-cpu data. CPU up doesn't touch memcg_stock at all. */ - curcpu = get_cpu(); + curcpu = get_cpu_light(); for_each_online_cpu(cpu) { struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu); struct mem_cgroup *memcg; @ mm/memcontrol.c:2188 @ static void drain_all_stock(struct mem_cgroup *root_memcg) schedule_work_on(cpu, &stock->work); } } - put_cpu(); + put_cpu_light(); mutex_unlock(&percpu_charge_mutex); } @ mm/memcontrol.c:5434 @ static int mem_cgroup_move_account(struct page *page, ret = 0; - local_irq_disable(); + local_lock_irq(event_lock); mem_cgroup_charge_statistics(to, page, compound, nr_pages); memcg_check_events(to, page); mem_cgroup_charge_statistics(from, page, compound, -nr_pages); memcg_check_events(from, page); - local_irq_enable(); + local_unlock_irq(event_lock); out_unlock: unlock_page(page); out: @ mm/memcontrol.c:6503 @ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg, commit_charge(page, memcg, lrucare); - local_irq_disable(); + local_lock_irq(event_lock); mem_cgroup_charge_statistics(memcg, page, compound, nr_pages); memcg_check_events(memcg, page); - local_irq_enable(); + local_unlock_irq(event_lock); if (do_memsw_account() && PageSwapCache(page)) { swp_entry_t entry = { .val = page_private(page) }; @ mm/memcontrol.c:6575 @ static void uncharge_batch(const struct uncharge_gather *ug) memcg_oom_recover(ug->memcg); } - local_irq_save(flags); + local_lock_irqsave(event_lock, flags); __mod_memcg_state(ug->memcg, MEMCG_RSS, -ug->nr_anon); __mod_memcg_state(ug->memcg, MEMCG_CACHE, -ug->nr_file); __mod_memcg_state(ug->memcg, MEMCG_RSS_HUGE, -ug->nr_huge); @ mm/memcontrol.c:6583 @ static void uncharge_batch(const struct uncharge_gather *ug) __count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout); __this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, nr_pages); memcg_check_events(ug->memcg, ug->dummy_page); - local_irq_restore(flags); + local_unlock_irqrestore(event_lock, flags); if (!mem_cgroup_is_root(ug->memcg)) css_put_many(&ug->memcg->css, nr_pages); @ mm/memcontrol.c:6744 @ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage) commit_charge(newpage, memcg, false); - local_irq_save(flags); + local_lock_irqsave(event_lock, flags); mem_cgroup_charge_statistics(memcg, newpage, PageTransHuge(newpage), nr_pages); memcg_check_events(memcg, newpage); - local_irq_restore(flags); + local_unlock_irqrestore(event_lock, flags); } DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key); @ mm/memcontrol.c:6930 @ void mem_cgroup_swapout(struct page *page, swp_entry_t entry) struct mem_cgroup *memcg, *swap_memcg; unsigned int nr_entries; unsigned short oldid; + unsigned long flags; VM_BUG_ON_PAGE(PageLRU(page), page); VM_BUG_ON_PAGE(page_count(page), page); @ mm/memcontrol.c:6976 @ void mem_cgroup_swapout(struct page *page, swp_entry_t entry) * important here to have the interrupts disabled because it is the * only synchronisation we have for updating the per-CPU variables. */ + local_lock_irqsave(event_lock, flags); +#ifndef CONFIG_PREEMPT_RT VM_BUG_ON(!irqs_disabled()); +#endif mem_cgroup_charge_statistics(memcg, page, PageTransHuge(page), -nr_entries); memcg_check_events(memcg, page); + local_unlock_irqrestore(event_lock, flags); if (!mem_cgroup_is_root(memcg)) css_put_many(&memcg->css, nr_entries); @ mm/page_alloc.c:64 @ #include <linux/hugetlb.h> #include <linux/sched/rt.h> #include <linux/sched/mm.h> +#include <linux/locallock.h> #include <linux/page_owner.h> #include <linux/kthread.h> #include <linux/memcontrol.h> @ mm/page_alloc.c:361 @ EXPORT_SYMBOL(nr_node_ids); EXPORT_SYMBOL(nr_online_nodes); #endif +static DEFINE_LOCAL_IRQ_LOCK(pa_lock); + int page_group_by_mobility_disabled __read_mostly; #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT @ mm/page_alloc.c:1242 @ static inline void prefetch_buddy(struct page *page) } /* - * Frees a number of pages from the PCP lists + * Frees a number of pages which have been collected from the pcp lists. * Assumes all pages on list are in same zone, and of same order. * count is the number of pages to free. * @ mm/page_alloc.c:1252 @ static inline void prefetch_buddy(struct page *page) * And clear the zone's pages_scanned counter, to hold off the "all pages are * pinned" detection logic. */ -static void free_pcppages_bulk(struct zone *zone, int count, - struct per_cpu_pages *pcp) +static void free_pcppages_bulk(struct zone *zone, struct list_head *head, + bool zone_retry) +{ + bool isolated_pageblocks; + struct page *page, *tmp; + unsigned long flags; + + spin_lock_irqsave(&zone->lock, flags); + isolated_pageblocks = has_isolate_pageblock(zone); + + /* + * Use safe version since after __free_one_page(), + * page->lru.next will not point to original list. + */ + list_for_each_entry_safe(page, tmp, head, lru) { + int mt = get_pcppage_migratetype(page); + + if (page_zone(page) != zone) { + /* + * free_unref_page_list() sorts pages by zone. If we end + * up with pages from a different NUMA nodes belonging + * to the same ZONE index then we need to redo with the + * correct ZONE pointer. Skip the page for now, redo it + * on the next iteration. + */ + WARN_ON_ONCE(zone_retry == false); + if (zone_retry) + continue; + } + + /* MIGRATE_ISOLATE page should not go to pcplists */ + VM_BUG_ON_PAGE(is_migrate_isolate(mt), page); + /* Pageblock could have been isolated meanwhile */ + if (unlikely(isolated_pageblocks)) + mt = get_pageblock_migratetype(page); + + list_del(&page->lru); + __free_one_page(page, page_to_pfn(page), zone, 0, mt); + trace_mm_page_pcpu_drain(page, 0, mt); + } + spin_unlock_irqrestore(&zone->lock, flags); +} + +static void isolate_pcp_pages(int count, struct per_cpu_pages *pcp, + struct list_head *dst) + { int migratetype = 0; int batch_free = 0; int prefetch_nr = 0; - bool isolated_pageblocks; - struct page *page, *tmp; - LIST_HEAD(head); + struct page *page; while (count) { struct list_head *list; @ mm/page_alloc.c:1334 @ static void free_pcppages_bulk(struct zone *zone, int count, if (bulkfree_pcp_prepare(page)) continue; - list_add_tail(&page->lru, &head); + list_add_tail(&page->lru, dst); /* * We are going to put the page back to the global @ mm/page_alloc.c:1349 @ static void free_pcppages_bulk(struct zone *zone, int count, prefetch_buddy(page); } while (--count && --batch_free && !list_empty(list)); } - - spin_lock(&zone->lock); - isolated_pageblocks = has_isolate_pageblock(zone); - - /* - * Use safe version since after __free_one_page(), - * page->lru.next will not point to original list. - */ - list_for_each_entry_safe(page, tmp, &head, lru) { - int mt = get_pcppage_migratetype(page); - /* MIGRATE_ISOLATE page should not go to pcplists */ - VM_BUG_ON_PAGE(is_migrate_isolate(mt), page); - /* Pageblock could have been isolated meanwhile */ - if (unlikely(isolated_pageblocks)) - mt = get_pageblock_migratetype(page); - - __free_one_page(page, page_to_pfn(page), zone, 0, mt); - trace_mm_page_pcpu_drain(page, 0, mt); - } - spin_unlock(&zone->lock); } static void free_one_page(struct zone *zone, @ mm/page_alloc.c:1449 @ static void __free_pages_ok(struct page *page, unsigned int order) return; migratetype = get_pfnblock_migratetype(page, pfn); - local_irq_save(flags); + local_lock_irqsave(pa_lock, flags); __count_vm_events(PGFREE, 1 << order); free_one_page(page_zone(page), page, pfn, order, migratetype); - local_irq_restore(flags); + local_unlock_irqrestore(pa_lock, flags); } void __free_pages_core(struct page *page, unsigned int order) @ mm/page_alloc.c:2825 @ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp) { unsigned long flags; int to_drain, batch; + LIST_HEAD(dst); - local_irq_save(flags); + local_lock_irqsave(pa_lock, flags); batch = READ_ONCE(pcp->batch); to_drain = min(pcp->count, batch); if (to_drain > 0) - free_pcppages_bulk(zone, to_drain, pcp); - local_irq_restore(flags); + isolate_pcp_pages(to_drain, pcp, &dst); + + local_unlock_irqrestore(pa_lock, flags); + + if (to_drain > 0) + free_pcppages_bulk(zone, &dst, false); } #endif @ mm/page_alloc.c:2852 @ static void drain_pages_zone(unsigned int cpu, struct zone *zone) unsigned long flags; struct per_cpu_pageset *pset; struct per_cpu_pages *pcp; + LIST_HEAD(dst); + int count; - local_irq_save(flags); + local_lock_irqsave(pa_lock, flags); pset = per_cpu_ptr(zone->pageset, cpu); pcp = &pset->pcp; - if (pcp->count) - free_pcppages_bulk(zone, pcp->count, pcp); - local_irq_restore(flags); + count = pcp->count; + if (count) + isolate_pcp_pages(count, pcp, &dst); + + local_unlock_irqrestore(pa_lock, flags); + + if (count) + free_pcppages_bulk(zone, &dst, false); } /* @ mm/page_alloc.c:2914 @ static void drain_local_pages_wq(struct work_struct *work) * cpu which is allright but we also have to make sure to not move to * a different one. */ - preempt_disable(); + migrate_disable(); drain_local_pages(drain->zone); - preempt_enable(); + migrate_enable(); } /* @ mm/page_alloc.c:3065 @ static bool free_unref_page_prepare(struct page *page, unsigned long pfn) return true; } -static void free_unref_page_commit(struct page *page, unsigned long pfn) +static void free_unref_page_commit(struct page *page, unsigned long pfn, + struct list_head *dst) { struct zone *zone = page_zone(page); struct per_cpu_pages *pcp; @ mm/page_alloc.c:3095 @ static void free_unref_page_commit(struct page *page, unsigned long pfn) pcp->count++; if (pcp->count >= pcp->high) { unsigned long batch = READ_ONCE(pcp->batch); - free_pcppages_bulk(zone, batch, pcp); + + isolate_pcp_pages(batch, pcp, dst); } } @ mm/page_alloc.c:3107 @ void free_unref_page(struct page *page) { unsigned long flags; unsigned long pfn = page_to_pfn(page); + struct zone *zone = page_zone(page); + LIST_HEAD(dst); if (!free_unref_page_prepare(page, pfn)) return; - local_irq_save(flags); - free_unref_page_commit(page, pfn); - local_irq_restore(flags); + local_lock_irqsave(pa_lock, flags); + free_unref_page_commit(page, pfn, &dst); + local_unlock_irqrestore(pa_lock, flags); + if (!list_empty(&dst)) + free_pcppages_bulk(zone, &dst, false); } /* @ mm/page_alloc.c:3128 @ void free_unref_page_list(struct list_head *list) struct page *page, *next; unsigned long flags, pfn; int batch_count = 0; + struct list_head dsts[__MAX_NR_ZONES]; + int i; + + for (i = 0; i < __MAX_NR_ZONES; i++) + INIT_LIST_HEAD(&dsts[i]); /* Prepare pages for freeing */ list_for_each_entry_safe(page, next, list, lru) { @ mm/page_alloc.c:3142 @ void free_unref_page_list(struct list_head *list) set_page_private(page, pfn); } - local_irq_save(flags); + local_lock_irqsave(pa_lock, flags); list_for_each_entry_safe(page, next, list, lru) { unsigned long pfn = page_private(page); + enum zone_type type; set_page_private(page, 0); trace_mm_page_free_batched(page); - free_unref_page_commit(page, pfn); + type = page_zonenum(page); + free_unref_page_commit(page, pfn, &dsts[type]); /* * Guard against excessive IRQ disabled times when we get * a large list of pages to free. */ if (++batch_count == SWAP_CLUSTER_MAX) { - local_irq_restore(flags); + local_unlock_irqrestore(pa_lock, flags); batch_count = 0; - local_irq_save(flags); + local_lock_irqsave(pa_lock, flags); } } - local_irq_restore(flags); + local_unlock_irqrestore(pa_lock, flags); + + for (i = 0; i < __MAX_NR_ZONES; ) { + struct page *page; + struct zone *zone; + + if (list_empty(&dsts[i])) { + i++; + continue; + } + + page = list_first_entry(&dsts[i], struct page, lru); + zone = page_zone(page); + + free_pcppages_bulk(zone, &dsts[i], true); + } } /* @ mm/page_alloc.c:3312 @ static struct page *rmqueue_pcplist(struct zone *preferred_zone, struct page *page; unsigned long flags; - local_irq_save(flags); + local_lock_irqsave(pa_lock, flags); pcp = &this_cpu_ptr(zone->pageset)->pcp; list = &pcp->lists[migratetype]; page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list); @ mm/page_alloc.c:3320 @ static struct page *rmqueue_pcplist(struct zone *preferred_zone, __count_zid_vm_events(PGALLOC, page_zonenum(page), 1); zone_statistics(preferred_zone, zone); } - local_irq_restore(flags); + local_unlock_irqrestore(pa_lock, flags); return page; } @ mm/page_alloc.c:3347 @ struct page *rmqueue(struct zone *preferred_zone, * allocate greater than order-1 page units with __GFP_NOFAIL. */ WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); - spin_lock_irqsave(&zone->lock, flags); + local_spin_lock_irqsave(pa_lock, &zone->lock, flags); do { page = NULL; @ mm/page_alloc.c:3367 @ struct page *rmqueue(struct zone *preferred_zone, __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order); zone_statistics(preferred_zone, zone); - local_irq_restore(flags); + local_unlock_irqrestore(pa_lock, flags); out: /* Separate test+clear to avoid unnecessary atomics */ @ mm/page_alloc.c:3380 @ struct page *rmqueue(struct zone *preferred_zone, return page; failed: - local_irq_restore(flags); + local_unlock_irqrestore(pa_lock, flags); return NULL; } @ mm/page_alloc.c:8728 @ void zone_pcp_reset(struct zone *zone) struct per_cpu_pageset *pset; /* avoid races with drain_pages() */ - local_irq_save(flags); + local_lock_irqsave(pa_lock, flags); if (zone->pageset != &boot_pageset) { for_each_online_cpu(cpu) { pset = per_cpu_ptr(zone->pageset, cpu); @ mm/page_alloc.c:8737 @ void zone_pcp_reset(struct zone *zone) free_percpu(zone->pageset); zone->pageset = &boot_pageset; } - local_irq_restore(flags); + local_unlock_irqrestore(pa_lock, flags); } #ifdef CONFIG_MEMORY_HOTREMOVE @ mm/slab.c:236 @ static void kmem_cache_node_init(struct kmem_cache_node *parent) parent->shared = NULL; parent->alien = NULL; parent->colour_next = 0; - spin_lock_init(&parent->list_lock); + raw_spin_lock_init(&parent->list_lock); parent->free_objects = 0; parent->free_touched = 0; } @ mm/slab.c:561 @ static noinline void cache_free_pfmemalloc(struct kmem_cache *cachep, page_node = page_to_nid(page); n = get_node(cachep, page_node); - spin_lock(&n->list_lock); + raw_spin_lock(&n->list_lock); free_block(cachep, &objp, 1, page_node, &list); - spin_unlock(&n->list_lock); + raw_spin_unlock(&n->list_lock); slabs_destroy(cachep, &list); } @ mm/slab.c:691 @ static void __drain_alien_cache(struct kmem_cache *cachep, struct kmem_cache_node *n = get_node(cachep, node); if (ac->avail) { - spin_lock(&n->list_lock); + raw_spin_lock(&n->list_lock); /* * Stuff objects into the remote nodes shared array first. * That way we could avoid the overhead of putting the objects @ mm/slab.c:702 @ static void __drain_alien_cache(struct kmem_cache *cachep, free_block(cachep, ac->entry, ac->avail, node, list); ac->avail = 0; - spin_unlock(&n->list_lock); + raw_spin_unlock(&n->list_lock); } } @ mm/slab.c:775 @ static int __cache_free_alien(struct kmem_cache *cachep, void *objp, slabs_destroy(cachep, &list); } else { n = get_node(cachep, page_node); - spin_lock(&n->list_lock); + raw_spin_lock(&n->list_lock); free_block(cachep, &objp, 1, page_node, &list); - spin_unlock(&n->list_lock); + raw_spin_unlock(&n->list_lock); slabs_destroy(cachep, &list); } return 1; @ mm/slab.c:818 @ static int init_cache_node(struct kmem_cache *cachep, int node, gfp_t gfp) */ n = get_node(cachep, node); if (n) { - spin_lock_irq(&n->list_lock); + raw_spin_lock_irq(&n->list_lock); n->free_limit = (1 + nr_cpus_node(node)) * cachep->batchcount + cachep->num; - spin_unlock_irq(&n->list_lock); + raw_spin_unlock_irq(&n->list_lock); return 0; } @ mm/slab.c:900 @ static int setup_kmem_cache_node(struct kmem_cache *cachep, goto fail; n = get_node(cachep, node); - spin_lock_irq(&n->list_lock); + raw_spin_lock_irq(&n->list_lock); if (n->shared && force_change) { free_block(cachep, n->shared->entry, n->shared->avail, node, &list); @ mm/slab.c:918 @ static int setup_kmem_cache_node(struct kmem_cache *cachep, new_alien = NULL; } - spin_unlock_irq(&n->list_lock); + raw_spin_unlock_irq(&n->list_lock); slabs_destroy(cachep, &list); /* @ mm/slab.c:957 @ static void cpuup_canceled(long cpu) if (!n) continue; - spin_lock_irq(&n->list_lock); + raw_spin_lock_irq(&n->list_lock); /* Free limit for this kmem_cache_node */ n->free_limit -= cachep->batchcount; @ mm/slab.c:968 @ static void cpuup_canceled(long cpu) nc->avail = 0; if (!cpumask_empty(mask)) { - spin_unlock_irq(&n->list_lock); + raw_spin_unlock_irq(&n->list_lock); goto free_slab; } @ mm/slab.c:982 @ static void cpuup_canceled(long cpu) alien = n->alien; n->alien = NULL; - spin_unlock_irq(&n->list_lock); + raw_spin_unlock_irq(&n->list_lock); kfree(shared); if (alien) { @ mm/slab.c:1166 @ static void __init init_list(struct kmem_cache *cachep, struct kmem_cache_node * /* * Do not assume that spinlocks can be initialized via memcpy: */ - spin_lock_init(&ptr->list_lock); + raw_spin_lock_init(&ptr->list_lock); MAKE_ALL_LISTS(cachep, ptr, nodeid); cachep->node[nodeid] = ptr; @ mm/slab.c:1338 @ slab_out_of_memory(struct kmem_cache *cachep, gfp_t gfpflags, int nodeid) for_each_kmem_cache_node(cachep, node, n) { unsigned long total_slabs, free_slabs, free_objs; - spin_lock_irqsave(&n->list_lock, flags); + raw_spin_lock_irqsave(&n->list_lock, flags); total_slabs = n->total_slabs; free_slabs = n->free_slabs; free_objs = n->free_objects; - spin_unlock_irqrestore(&n->list_lock, flags); + raw_spin_unlock_irqrestore(&n->list_lock, flags); pr_warn(" node %d: slabs: %ld/%ld, objs: %ld/%ld\n", node, total_slabs - free_slabs, total_slabs, @ mm/slab.c:2100 @ static void check_spinlock_acquired(struct kmem_cache *cachep) { #ifdef CONFIG_SMP check_irq_off(); - assert_spin_locked(&get_node(cachep, numa_mem_id())->list_lock); + assert_raw_spin_locked(&get_node(cachep, numa_mem_id())->list_lock); #endif } @ mm/slab.c:2108 @ static void check_spinlock_acquired_node(struct kmem_cache *cachep, int node) { #ifdef CONFIG_SMP check_irq_off(); - assert_spin_locked(&get_node(cachep, node)->list_lock); + assert_raw_spin_locked(&get_node(cachep, node)->list_lock); #endif } @ mm/slab.c:2148 @ static void do_drain(void *arg) check_irq_off(); ac = cpu_cache_get(cachep); n = get_node(cachep, node); - spin_lock(&n->list_lock); + raw_spin_lock(&n->list_lock); free_block(cachep, ac->entry, ac->avail, node, &list); - spin_unlock(&n->list_lock); + raw_spin_unlock(&n->list_lock); slabs_destroy(cachep, &list); ac->avail = 0; } @ mm/slab.c:2168 @ static void drain_cpu_caches(struct kmem_cache *cachep) drain_alien_cache(cachep, n->alien); for_each_kmem_cache_node(cachep, node, n) { - spin_lock_irq(&n->list_lock); + raw_spin_lock_irq(&n->list_lock); drain_array_locked(cachep, n->shared, node, true, &list); - spin_unlock_irq(&n->list_lock); + raw_spin_unlock_irq(&n->list_lock); slabs_destroy(cachep, &list); } @ mm/slab.c:2192 @ static int drain_freelist(struct kmem_cache *cache, nr_freed = 0; while (nr_freed < tofree && !list_empty(&n->slabs_free)) { - spin_lock_irq(&n->list_lock); + raw_spin_lock_irq(&n->list_lock); p = n->slabs_free.prev; if (p == &n->slabs_free) { - spin_unlock_irq(&n->list_lock); + raw_spin_unlock_irq(&n->list_lock); goto out; } @ mm/slab.c:2208 @ static int drain_freelist(struct kmem_cache *cache, * to the cache. */ n->free_objects -= cache->num; - spin_unlock_irq(&n->list_lock); + raw_spin_unlock_irq(&n->list_lock); slab_destroy(cache, page); nr_freed++; } @ mm/slab.c:2661 @ static void cache_grow_end(struct kmem_cache *cachep, struct page *page) INIT_LIST_HEAD(&page->slab_list); n = get_node(cachep, page_to_nid(page)); - spin_lock(&n->list_lock); + raw_spin_lock(&n->list_lock); n->total_slabs++; if (!page->active) { list_add_tail(&page->slab_list, &n->slabs_free); @ mm/slab.c:2671 @ static void cache_grow_end(struct kmem_cache *cachep, struct page *page) STATS_INC_GROWN(cachep); n->free_objects += cachep->num - page->active; - spin_unlock(&n->list_lock); + raw_spin_unlock(&n->list_lock); fixup_objfreelist_debug(cachep, &list); } @ mm/slab.c:2837 @ static struct page *get_first_slab(struct kmem_cache_node *n, bool pfmemalloc) { struct page *page; - assert_spin_locked(&n->list_lock); + assert_raw_spin_locked(&n->list_lock); page = list_first_entry_or_null(&n->slabs_partial, struct page, slab_list); if (!page) { @ mm/slab.c:2864 @ static noinline void *cache_alloc_pfmemalloc(struct kmem_cache *cachep, if (!gfp_pfmemalloc_allowed(flags)) return NULL; - spin_lock(&n->list_lock); + raw_spin_lock(&n->list_lock); page = get_first_slab(n, true); if (!page) { - spin_unlock(&n->list_lock); + raw_spin_unlock(&n->list_lock); return NULL; } @ mm/slab.c:2876 @ static noinline void *cache_alloc_pfmemalloc(struct kmem_cache *cachep, fixup_slab_list(cachep, n, page, &list); - spin_unlock(&n->list_lock); + raw_spin_unlock(&n->list_lock); fixup_objfreelist_debug(cachep, &list); return obj; @ mm/slab.c:2935 @ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags) if (!n->free_objects && (!shared || !shared->avail)) goto direct_grow; - spin_lock(&n->list_lock); + raw_spin_lock(&n->list_lock); shared = READ_ONCE(n->shared); /* See if we can refill from the shared array */ @ mm/slab.c:2959 @ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags) must_grow: n->free_objects -= ac->avail; alloc_done: - spin_unlock(&n->list_lock); + raw_spin_unlock(&n->list_lock); fixup_objfreelist_debug(cachep, &list); direct_grow: @ mm/slab.c:3184 @ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, BUG_ON(!n); check_irq_off(); - spin_lock(&n->list_lock); + raw_spin_lock(&n->list_lock); page = get_first_slab(n, false); if (!page) goto must_grow; @ mm/slab.c:3202 @ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, fixup_slab_list(cachep, n, page, &list); - spin_unlock(&n->list_lock); + raw_spin_unlock(&n->list_lock); fixup_objfreelist_debug(cachep, &list); return obj; must_grow: - spin_unlock(&n->list_lock); + raw_spin_unlock(&n->list_lock); page = cache_grow_begin(cachep, gfp_exact_node(flags), nodeid); if (page) { /* This slab isn't counted yet so don't update free_objects */ @ mm/slab.c:3383 @ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac) check_irq_off(); n = get_node(cachep, node); - spin_lock(&n->list_lock); + raw_spin_lock(&n->list_lock); if (n->shared) { struct array_cache *shared_array = n->shared; int max = shared_array->limit - shared_array->avail; @ mm/slab.c:3412 @ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac) STATS_SET_FREEABLE(cachep, i); } #endif - spin_unlock(&n->list_lock); + raw_spin_unlock(&n->list_lock); slabs_destroy(cachep, &list); ac->avail -= batchcount; memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void *)*ac->avail); @ mm/slab.c:3834 @ static int __do_tune_cpucache(struct kmem_cache *cachep, int limit, node = cpu_to_mem(cpu); n = get_node(cachep, node); - spin_lock_irq(&n->list_lock); + raw_spin_lock_irq(&n->list_lock); free_block(cachep, ac->entry, ac->avail, node, &list); - spin_unlock_irq(&n->list_lock); + raw_spin_unlock_irq(&n->list_lock); slabs_destroy(cachep, &list); } free_percpu(prev); @ mm/slab.c:3961 @ static void drain_array(struct kmem_cache *cachep, struct kmem_cache_node *n, return; } - spin_lock_irq(&n->list_lock); + raw_spin_lock_irq(&n->list_lock); drain_array_locked(cachep, ac, node, false, &list); - spin_unlock_irq(&n->list_lock); + raw_spin_unlock_irq(&n->list_lock); slabs_destroy(cachep, &list); } @ mm/slab.c:4047 @ void get_slabinfo(struct kmem_cache *cachep, struct slabinfo *sinfo) for_each_kmem_cache_node(cachep, node, n) { check_irq_on(); - spin_lock_irq(&n->list_lock); + raw_spin_lock_irq(&n->list_lock); total_slabs += n->total_slabs; free_slabs += n->free_slabs; @ mm/slab.c:4056 @ void get_slabinfo(struct kmem_cache *cachep, struct slabinfo *sinfo) if (n->shared) shared_avail += n->shared->avail; - spin_unlock_irq(&n->list_lock); + raw_spin_unlock_irq(&n->list_lock); } num_objs = total_slabs * cachep->num; active_slabs = total_slabs - free_slabs; @ mm/slab.h:599 @ static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags, * The slab lists for all objects. */ struct kmem_cache_node { - spinlock_t list_lock; + raw_spinlock_t list_lock; #ifdef CONFIG_SLAB struct list_head slabs_partial; /* partial list first, better asm code */ @ mm/slub.c:1199 @ static noinline int free_debug_processing( unsigned long uninitialized_var(flags); int ret = 0; - spin_lock_irqsave(&n->list_lock, flags); + raw_spin_lock_irqsave(&n->list_lock, flags); slab_lock(page); if (s->flags & SLAB_CONSISTENCY_CHECKS) { @ mm/slub.c:1234 @ static noinline int free_debug_processing( bulk_cnt, cnt); slab_unlock(page); - spin_unlock_irqrestore(&n->list_lock, flags); + raw_spin_unlock_irqrestore(&n->list_lock, flags); if (!ret) slab_fix(s, "Object at 0x%p not freed", object); return ret; @ mm/slub.c:1404 @ static inline void dec_slabs_node(struct kmem_cache *s, int node, #endif /* CONFIG_SLUB_DEBUG */ +struct slub_free_list { + raw_spinlock_t lock; + struct list_head list; +}; +static DEFINE_PER_CPU(struct slub_free_list, slub_free_list); + /* * Hooks for other subsystems that check memory allocations. In a typical * production configuration these hooks all should produce no code at all. @ mm/slub.c:1650 @ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node) void *start, *p, *next; int idx; bool shuffle; + bool enableirqs = false; flags &= gfp_allowed_mask; if (gfpflags_allow_blocking(flags)) + enableirqs = true; + +#ifdef CONFIG_PREEMPT_RT + if (system_state > SYSTEM_BOOTING && system_state < SYSTEM_SUSPEND) + enableirqs = true; +#endif + if (enableirqs) local_irq_enable(); flags |= s->allocflags; @ mm/slub.c:1720 @ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node) page->frozen = 1; out: - if (gfpflags_allow_blocking(flags)) + if (enableirqs) local_irq_disable(); if (!page) return NULL; @ mm/slub.c:1768 @ static void __free_slab(struct kmem_cache *s, struct page *page) __free_pages(page, order); } +static void free_delayed(struct list_head *h) +{ + while (!list_empty(h)) { + struct page *page = list_first_entry(h, struct page, lru); + + list_del(&page->lru); + __free_slab(page->slab_cache, page); + } +} + static void rcu_free_slab(struct rcu_head *h) { struct page *page = container_of(h, struct page, rcu_head); @ mm/slub.c:1789 @ static void free_slab(struct kmem_cache *s, struct page *page) { if (unlikely(s->flags & SLAB_TYPESAFE_BY_RCU)) { call_rcu(&page->rcu_head, rcu_free_slab); + } else if (irqs_disabled()) { + struct slub_free_list *f = this_cpu_ptr(&slub_free_list); + + raw_spin_lock(&f->lock); + list_add(&page->lru, &f->list); + raw_spin_unlock(&f->lock); } else __free_slab(s, page); } @ mm/slub.c:1902 @ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n, if (!n || !n->nr_partial) return NULL; - spin_lock(&n->list_lock); + raw_spin_lock(&n->list_lock); list_for_each_entry_safe(page, page2, &n->partial, slab_list) { void *t; @ mm/slub.c:1927 @ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n, break; } - spin_unlock(&n->list_lock); + raw_spin_unlock(&n->list_lock); return object; } @ mm/slub.c:2173 @ static void deactivate_slab(struct kmem_cache *s, struct page *page, * that acquire_slab() will see a slab page that * is frozen */ - spin_lock(&n->list_lock); + raw_spin_lock(&n->list_lock); } } else { m = M_FULL; @ mm/slub.c:2184 @ static void deactivate_slab(struct kmem_cache *s, struct page *page, * slabs from diagnostic functions will not see * any frozen slabs. */ - spin_lock(&n->list_lock); + raw_spin_lock(&n->list_lock); } } @ mm/slub.c:2208 @ static void deactivate_slab(struct kmem_cache *s, struct page *page, goto redo; if (lock) - spin_unlock(&n->list_lock); + raw_spin_unlock(&n->list_lock); if (m == M_PARTIAL) stat(s, tail); @ mm/slub.c:2247 @ static void unfreeze_partials(struct kmem_cache *s, n2 = get_node(s, page_to_nid(page)); if (n != n2) { if (n) - spin_unlock(&n->list_lock); + raw_spin_unlock(&n->list_lock); n = n2; - spin_lock(&n->list_lock); + raw_spin_lock(&n->list_lock); } do { @ mm/slub.c:2279 @ static void unfreeze_partials(struct kmem_cache *s, } if (n) - spin_unlock(&n->list_lock); + raw_spin_unlock(&n->list_lock); while (discard_page) { page = discard_page; @ mm/slub.c:2316 @ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain) pobjects = oldpage->pobjects; pages = oldpage->pages; if (drain && pobjects > s->cpu_partial) { + struct slub_free_list *f; unsigned long flags; + LIST_HEAD(tofree); /* * partial array is full. Move the existing * set to the per node partial list. */ local_irq_save(flags); unfreeze_partials(s, this_cpu_ptr(s->cpu_slab)); + f = this_cpu_ptr(&slub_free_list); + raw_spin_lock(&f->lock); + list_splice_init(&f->list, &tofree); + raw_spin_unlock(&f->lock); local_irq_restore(flags); + free_delayed(&tofree); oldpage = NULL; pobjects = 0; pages = 0; @ mm/slub.c:2398 @ static bool has_cpu_slab(int cpu, void *info) static void flush_all(struct kmem_cache *s) { + LIST_HEAD(tofree); + int cpu; + on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1); + for_each_online_cpu(cpu) { + struct slub_free_list *f; + + f = &per_cpu(slub_free_list, cpu); + raw_spin_lock_irq(&f->lock); + list_splice_init(&f->list, &tofree); + raw_spin_unlock_irq(&f->lock); + free_delayed(&tofree); + } } /* @ mm/slub.c:2465 @ static unsigned long count_partial(struct kmem_cache_node *n, unsigned long x = 0; struct page *page; - spin_lock_irqsave(&n->list_lock, flags); + raw_spin_lock_irqsave(&n->list_lock, flags); list_for_each_entry(page, &n->partial, slab_list) x += get_count(page); - spin_unlock_irqrestore(&n->list_lock, flags); + raw_spin_unlock_irqrestore(&n->list_lock, flags); return x; } #endif /* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */ @ mm/slub.c:2607 @ static inline void *get_freelist(struct kmem_cache *s, struct page *page) * already disabled (which is the case for bulk allocation). */ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, - unsigned long addr, struct kmem_cache_cpu *c) + unsigned long addr, struct kmem_cache_cpu *c, + struct list_head *to_free) { + struct slub_free_list *f; void *freelist; struct page *page; @ mm/slub.c:2676 @ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, VM_BUG_ON(!c->page->frozen); c->freelist = get_freepointer(s, freelist); c->tid = next_tid(c->tid); + +out: + f = this_cpu_ptr(&slub_free_list); + raw_spin_lock(&f->lock); + list_splice_init(&f->list, to_free); + raw_spin_unlock(&f->lock); + return freelist; new_slab: @ mm/slub.c:2698 @ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, if (unlikely(!freelist)) { slab_out_of_memory(s, gfpflags, node); - return NULL; + goto out; } page = c->page; @ mm/slub.c:2711 @ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, goto new_slab; /* Slab failed checks. Next slab needed */ deactivate_slab(s, page, get_freepointer(s, freelist), c); - return freelist; + goto out; } /* @ mm/slub.c:2723 @ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, { void *p; unsigned long flags; + LIST_HEAD(tofree); local_irq_save(flags); #ifdef CONFIG_PREEMPTION @ mm/slub.c:2735 @ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, c = this_cpu_ptr(s->cpu_slab); #endif - p = ___slab_alloc(s, gfpflags, node, addr, c); + p = ___slab_alloc(s, gfpflags, node, addr, c, &tofree); local_irq_restore(flags); + free_delayed(&tofree); return p; } @ mm/slub.c:2770 @ static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct page *page; unsigned long tid; + if (IS_ENABLED(CONFIG_PREEMPT_RT) && IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP)) + WARN_ON_ONCE(!preemptible() && + (system_state > SYSTEM_BOOTING && system_state < SYSTEM_SUSPEND)); + s = slab_pre_alloc_hook(s, gfpflags); if (!s) return NULL; @ mm/slub.c:2940 @ static void __slab_free(struct kmem_cache *s, struct page *page, do { if (unlikely(n)) { - spin_unlock_irqrestore(&n->list_lock, flags); + raw_spin_unlock_irqrestore(&n->list_lock, flags); n = NULL; } prior = page->freelist; @ mm/slub.c:2972 @ static void __slab_free(struct kmem_cache *s, struct page *page, * Otherwise the list_lock will synchronize with * other processors updating the list of slabs. */ - spin_lock_irqsave(&n->list_lock, flags); + raw_spin_lock_irqsave(&n->list_lock, flags); } } @ mm/slub.c:3013 @ static void __slab_free(struct kmem_cache *s, struct page *page, add_partial(n, page, DEACTIVATE_TO_TAIL); stat(s, FREE_ADD_PARTIAL); } - spin_unlock_irqrestore(&n->list_lock, flags); + raw_spin_unlock_irqrestore(&n->list_lock, flags); return; slab_empty: @ mm/slub.c:3028 @ static void __slab_free(struct kmem_cache *s, struct page *page, remove_full(s, n, page); } - spin_unlock_irqrestore(&n->list_lock, flags); + raw_spin_unlock_irqrestore(&n->list_lock, flags); stat(s, FREE_SLAB); discard_slab(s, page); } @ mm/slub.c:3233 @ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, void **p) { struct kmem_cache_cpu *c; + LIST_HEAD(to_free); int i; + if (IS_ENABLED(CONFIG_PREEMPT_RT) && IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP)) + WARN_ON_ONCE(!preemptible() && + (system_state > SYSTEM_BOOTING && system_state < SYSTEM_SUSPEND)); + /* memcg and kmem_cache debug support */ s = slab_pre_alloc_hook(s, flags); if (unlikely(!s)) @ mm/slub.c:3270 @ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, * of re-populating per CPU c->freelist */ p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE, - _RET_IP_, c); + _RET_IP_, c, &to_free); if (unlikely(!p[i])) goto error; @ mm/slub.c:3285 @ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, } c->tid = next_tid(c->tid); local_irq_enable(); + free_delayed(&to_free); /* Clear memory outside IRQ disabled fastpath loop */ if (unlikely(slab_want_init_on_alloc(flags, s))) { @ mm/slub.c:3300 @ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, return i; error: local_irq_enable(); + free_delayed(&to_free); slab_post_alloc_hook(s, flags, i, p); __kmem_cache_free_bulk(s, i, p); return 0; @ mm/slub.c:3436 @ static void init_kmem_cache_node(struct kmem_cache_node *n) { n->nr_partial = 0; - spin_lock_init(&n->list_lock); + raw_spin_lock_init(&n->list_lock); INIT_LIST_HEAD(&n->partial); #ifdef CONFIG_SLUB_DEBUG atomic_long_set(&n->nr_slabs, 0); @ mm/slub.c:3785 @ static void list_slab_objects(struct kmem_cache *s, struct page *page, const char *text) { #ifdef CONFIG_SLUB_DEBUG +#ifdef CONFIG_PREEMPT_RT + /* XXX move out of irq-off section */ + slab_err(s, page, text, s->name); +#else + void *addr = page_address(page); void *p; unsigned long *map; @ mm/slub.c:3809 @ static void list_slab_objects(struct kmem_cache *s, struct page *page, slab_unlock(page); #endif +#endif } /* @ mm/slub.c:3823 @ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n) struct page *page, *h; BUG_ON(irqs_disabled()); - spin_lock_irq(&n->list_lock); + raw_spin_lock_irq(&n->list_lock); list_for_each_entry_safe(page, h, &n->partial, slab_list) { if (!page->inuse) { remove_partial(n, page); @ mm/slub.c:3833 @ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n) "Objects remaining in %s on __kmem_cache_shutdown()"); } } - spin_unlock_irq(&n->list_lock); + raw_spin_unlock_irq(&n->list_lock); list_for_each_entry_safe(page, h, &discard, slab_list) discard_slab(s, page); @ mm/slub.c:4105 @ int __kmem_cache_shrink(struct kmem_cache *s) for (i = 0; i < SHRINK_PROMOTE_MAX; i++) INIT_LIST_HEAD(promote + i); - spin_lock_irqsave(&n->list_lock, flags); + raw_spin_lock_irqsave(&n->list_lock, flags); /* * Build lists of slabs to discard or promote. @ mm/slub.c:4136 @ int __kmem_cache_shrink(struct kmem_cache *s) for (i = SHRINK_PROMOTE_MAX - 1; i >= 0; i--) list_splice(promote + i, &n->partial); - spin_unlock_irqrestore(&n->list_lock, flags); + raw_spin_unlock_irqrestore(&n->list_lock, flags); /* Release empty slabs */ list_for_each_entry_safe(page, t, &discard, slab_list) @ mm/slub.c:4343 @ void __init kmem_cache_init(void) { static __initdata struct kmem_cache boot_kmem_cache, boot_kmem_cache_node; + int cpu; + + for_each_possible_cpu(cpu) { + raw_spin_lock_init(&per_cpu(slub_free_list, cpu).lock); + INIT_LIST_HEAD(&per_cpu(slub_free_list, cpu).list); + } if (debug_guardpage_minorder()) slub_max_order = 0; @ mm/slub.c:4540 @ static int validate_slab_node(struct kmem_cache *s, struct page *page; unsigned long flags; - spin_lock_irqsave(&n->list_lock, flags); + raw_spin_lock_irqsave(&n->list_lock, flags); list_for_each_entry(page, &n->partial, slab_list) { validate_slab(s, page); @ mm/slub.c:4562 @ static int validate_slab_node(struct kmem_cache *s, s->name, count, atomic_long_read(&n->nr_slabs)); out: - spin_unlock_irqrestore(&n->list_lock, flags); + raw_spin_unlock_irqrestore(&n->list_lock, flags); return count; } @ mm/slub.c:4741 @ static int list_locations(struct kmem_cache *s, char *buf, if (!atomic_long_read(&n->nr_slabs)) continue; - spin_lock_irqsave(&n->list_lock, flags); + raw_spin_lock_irqsave(&n->list_lock, flags); list_for_each_entry(page, &n->partial, slab_list) process_slab(&t, s, page, alloc); list_for_each_entry(page, &n->full, slab_list) process_slab(&t, s, page, alloc); - spin_unlock_irqrestore(&n->list_lock, flags); + raw_spin_unlock_irqrestore(&n->list_lock, flags); } for (i = 0; i < t.count; i++) { @ mm/swap.c:36 @ #include <linux/memcontrol.h> #include <linux/gfp.h> #include <linux/uio.h> +#include <linux/locallock.h> #include <linux/hugetlb.h> #include <linux/page_idle.h> @ mm/swap.c:56 @ static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs); #ifdef CONFIG_SMP static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs); #endif +static DEFINE_LOCAL_IRQ_LOCK(rotate_lock); +DEFINE_LOCAL_IRQ_LOCK(swapvec_lock); /* * This path almost never happens for VM activity - pages are normally @ mm/swap.c:260 @ void rotate_reclaimable_page(struct page *page) unsigned long flags; get_page(page); - local_irq_save(flags); + local_lock_irqsave(rotate_lock, flags); pvec = this_cpu_ptr(&lru_rotate_pvecs); if (!pagevec_add(pvec, page) || PageCompound(page)) pagevec_move_tail(pvec); - local_irq_restore(flags); + local_unlock_irqrestore(rotate_lock, flags); } } @ mm/swap.c:314 @ void activate_page(struct page *page) { page = compound_head(page); if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { - struct pagevec *pvec = &get_cpu_var(activate_page_pvecs); + struct pagevec *pvec = &get_locked_var(swapvec_lock, + activate_page_pvecs); get_page(page); if (!pagevec_add(pvec, page) || PageCompound(page)) pagevec_lru_move_fn(pvec, __activate_page, NULL); - put_cpu_var(activate_page_pvecs); + put_locked_var(swapvec_lock, activate_page_pvecs); } } @ mm/swap.c:342 @ void activate_page(struct page *page) static void __lru_cache_activate_page(struct page *page) { - struct pagevec *pvec = &get_cpu_var(lru_add_pvec); + struct pagevec *pvec = &get_locked_var(swapvec_lock, lru_add_pvec); int i; /* @ mm/swap.c:364 @ static void __lru_cache_activate_page(struct page *page) } } - put_cpu_var(lru_add_pvec); + put_locked_var(swapvec_lock, lru_add_pvec); } /* @ mm/swap.c:411 @ EXPORT_SYMBOL(mark_page_accessed); static void __lru_cache_add(struct page *page) { - struct pagevec *pvec = &get_cpu_var(lru_add_pvec); + struct pagevec *pvec = &get_locked_var(swapvec_lock, lru_add_pvec); get_page(page); if (!pagevec_add(pvec, page) || PageCompound(page)) __pagevec_lru_add(pvec); - put_cpu_var(lru_add_pvec); + put_locked_var(swapvec_lock, lru_add_pvec); } /** @ mm/swap.c:610 @ void lru_add_drain_cpu(int cpu) unsigned long flags; /* No harm done if a racing interrupt already did this */ - local_irq_save(flags); + local_lock_irqsave(rotate_lock, flags); pagevec_move_tail(pvec); - local_irq_restore(flags); + local_unlock_irqrestore(rotate_lock, flags); } pvec = &per_cpu(lru_deactivate_file_pvecs, cpu); @ mm/swap.c:648 @ void deactivate_file_page(struct page *page) return; if (likely(get_page_unless_zero(page))) { - struct pagevec *pvec = &get_cpu_var(lru_deactivate_file_pvecs); + struct pagevec *pvec = &get_locked_var(swapvec_lock, + lru_deactivate_file_pvecs); if (!pagevec_add(pvec, page) || PageCompound(page)) pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL); - put_cpu_var(lru_deactivate_file_pvecs); + put_locked_var(swapvec_lock, lru_deactivate_file_pvecs); } } @ mm/swap.c:688 @ void mark_page_lazyfree(struct page *page) { if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) && !PageSwapCache(page) && !PageUnevictable(page)) { - struct pagevec *pvec = &get_cpu_var(lru_lazyfree_pvecs); + struct pagevec *pvec = &get_locked_var(swapvec_lock, + lru_lazyfree_pvecs); get_page(page); if (!pagevec_add(pvec, page) || PageCompound(page)) pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL); - put_cpu_var(lru_lazyfree_pvecs); + put_locked_var(swapvec_lock, lru_lazyfree_pvecs); } } void lru_add_drain(void) { - lru_add_drain_cpu(get_cpu()); - put_cpu(); + lru_add_drain_cpu(local_lock_cpu(swapvec_lock)); + local_unlock_cpu(swapvec_lock); } #ifdef CONFIG_SMP @ mm/vmalloc.c:1505 @ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask) struct vmap_block *vb; struct vmap_area *va; unsigned long vb_idx; - int node, err; + int node, err, cpu; void *vaddr; node = numa_node_id(); @ mm/vmalloc.c:1548 @ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask) BUG_ON(err); radix_tree_preload_end(); - vbq = &get_cpu_var(vmap_block_queue); + cpu = get_cpu_light(); + vbq = this_cpu_ptr(&vmap_block_queue); spin_lock(&vbq->lock); list_add_tail_rcu(&vb->free_list, &vbq->free); spin_unlock(&vbq->lock); - put_cpu_var(vmap_block_queue); + put_cpu_light(); return vaddr; } @ mm/vmalloc.c:1622 @ static void *vb_alloc(unsigned long size, gfp_t gfp_mask) struct vmap_block *vb; void *vaddr = NULL; unsigned int order; + int cpu; BUG_ON(offset_in_page(size)); BUG_ON(size > PAGE_SIZE*VMAP_MAX_ALLOC); @ mm/vmalloc.c:1637 @ static void *vb_alloc(unsigned long size, gfp_t gfp_mask) order = get_order(size); rcu_read_lock(); - vbq = &get_cpu_var(vmap_block_queue); + cpu = get_cpu_light(); + vbq = this_cpu_ptr(&vmap_block_queue); list_for_each_entry_rcu(vb, &vbq->free, free_list) { unsigned long pages_off; @ mm/vmalloc.c:1661 @ static void *vb_alloc(unsigned long size, gfp_t gfp_mask) break; } - put_cpu_var(vmap_block_queue); + put_cpu_light(); rcu_read_unlock(); /* Allocate new block if nothing was found */ @ mm/vmstat.c:324 @ void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item, long x; long t; + preempt_disable_rt(); x = delta + __this_cpu_read(*p); t = __this_cpu_read(pcp->stat_threshold); @ mm/vmstat.c:334 @ void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item, x = 0; } __this_cpu_write(*p, x); + preempt_enable_rt(); } EXPORT_SYMBOL(__mod_zone_page_state); @ mm/vmstat.c:346 @ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item, long x; long t; + preempt_disable_rt(); x = delta + __this_cpu_read(*p); t = __this_cpu_read(pcp->stat_threshold); @ mm/vmstat.c:356 @ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item, x = 0; } __this_cpu_write(*p, x); + preempt_enable_rt(); } EXPORT_SYMBOL(__mod_node_page_state); @ mm/vmstat.c:389 @ void __inc_zone_state(struct zone *zone, enum zone_stat_item item) s8 __percpu *p = pcp->vm_stat_diff + item; s8 v, t; + preempt_disable_rt(); v = __this_cpu_inc_return(*p); t = __this_cpu_read(pcp->stat_threshold); if (unlikely(v > t)) { @ mm/vmstat.c:398 @ void __inc_zone_state(struct zone *zone, enum zone_stat_item item) zone_page_state_add(v + overstep, zone, item); __this_cpu_write(*p, -overstep); } + preempt_enable_rt(); } void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item) @ mm/vmstat.c:407 @ void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item) s8 __percpu *p = pcp->vm_node_stat_diff + item; s8 v, t; + preempt_disable_rt(); v = __this_cpu_inc_return(*p); t = __this_cpu_read(pcp->stat_threshold); if (unlikely(v > t)) { @ mm/vmstat.c:416 @ void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item) node_page_state_add(v + overstep, pgdat, item); __this_cpu_write(*p, -overstep); } + preempt_enable_rt(); } void __inc_zone_page_state(struct page *page, enum zone_stat_item item) @ mm/vmstat.c:437 @ void __dec_zone_state(struct zone *zone, enum zone_stat_item item) s8 __percpu *p = pcp->vm_stat_diff + item; s8 v, t; + preempt_disable_rt(); v = __this_cpu_dec_return(*p); t = __this_cpu_read(pcp->stat_threshold); if (unlikely(v < - t)) { @ mm/vmstat.c:446 @ void __dec_zone_state(struct zone *zone, enum zone_stat_item item) zone_page_state_add(v - overstep, zone, item); __this_cpu_write(*p, overstep); } + preempt_enable_rt(); } void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item) @ mm/vmstat.c:455 @ void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item) s8 __percpu *p = pcp->vm_node_stat_diff + item; s8 v, t; + preempt_disable_rt(); v = __this_cpu_dec_return(*p); t = __this_cpu_read(pcp->stat_threshold); if (unlikely(v < - t)) { @ mm/vmstat.c:464 @ void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item) node_page_state_add(v - overstep, pgdat, item); __this_cpu_write(*p, overstep); } + preempt_enable_rt(); } void __dec_zone_page_state(struct page *page, enum zone_stat_item item) @ mm/workingset.c:407 @ static struct list_lru shadow_nodes; void workingset_update_node(struct xa_node *node) { + struct address_space *mapping; + /* * Track non-empty nodes that contain only shadow entries; * unlink those that contain pages or are being freed. @ mm/workingset.c:417 @ void workingset_update_node(struct xa_node *node) * already where they should be. The list_empty() test is safe * as node->private_list is protected by the i_pages lock. */ - VM_WARN_ON_ONCE(!irqs_disabled()); /* For __inc_lruvec_page_state */ + mapping = container_of(node->array, struct address_space, i_pages); + lockdep_assert_held(&mapping->i_pages.xa_lock); if (node->count && node->count == node->nr_values) { if (list_empty(&node->private_list)) { @ mm/zsmalloc.c:60 @ #include <linux/wait.h> #include <linux/pagemap.h> #include <linux/fs.h> +#include <linux/locallock.h> #define ZSPAGE_MAGIC 0x58 @ mm/zsmalloc.c:78 @ */ #define ZS_MAX_ZSPAGE_ORDER 2 #define ZS_MAX_PAGES_PER_ZSPAGE (_AC(1, UL) << ZS_MAX_ZSPAGE_ORDER) - #define ZS_HANDLE_SIZE (sizeof(unsigned long)) +#ifdef CONFIG_PREEMPT_RT + +struct zsmalloc_handle { + unsigned long addr; + struct mutex lock; +}; + +#define ZS_HANDLE_ALLOC_SIZE (sizeof(struct zsmalloc_handle)) + +#else + +#define ZS_HANDLE_ALLOC_SIZE (sizeof(unsigned long)) +#endif + /* * Object location (<PFN>, <obj_idx>) is encoded as * as single (unsigned long) handle value. @ mm/zsmalloc.c:343 @ static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage) {} static int create_cache(struct zs_pool *pool) { - pool->handle_cachep = kmem_cache_create("zs_handle", ZS_HANDLE_SIZE, + pool->handle_cachep = kmem_cache_create("zs_handle", ZS_HANDLE_ALLOC_SIZE, 0, 0, NULL); if (!pool->handle_cachep) return 1; @ mm/zsmalloc.c:367 @ static void destroy_cache(struct zs_pool *pool) static unsigned long cache_alloc_handle(struct zs_pool *pool, gfp_t gfp) { - return (unsigned long)kmem_cache_alloc(pool->handle_cachep, - gfp & ~(__GFP_HIGHMEM|__GFP_MOVABLE)); + void *p; + + p = kmem_cache_alloc(pool->handle_cachep, + gfp & ~(__GFP_HIGHMEM|__GFP_MOVABLE)); +#ifdef CONFIG_PREEMPT_RT + if (p) { + struct zsmalloc_handle *zh = p; + + mutex_init(&zh->lock); + } +#endif + return (unsigned long)p; } +#ifdef CONFIG_PREEMPT_RT +static struct zsmalloc_handle *zs_get_pure_handle(unsigned long handle) +{ + return (void *)(handle &~((1 << OBJ_TAG_BITS) - 1)); +} +#endif + static void cache_free_handle(struct zs_pool *pool, unsigned long handle) { kmem_cache_free(pool->handle_cachep, (void *)handle); @ mm/zsmalloc.c:406 @ static void cache_free_zspage(struct zs_pool *pool, struct zspage *zspage) static void record_obj(unsigned long handle, unsigned long obj) { +#ifdef CONFIG_PREEMPT_RT + struct zsmalloc_handle *zh = zs_get_pure_handle(handle); + + WRITE_ONCE(zh->addr, obj); +#else /* * lsb of @obj represents handle lock while other bits * represent object value the handle is pointing so * updating shouldn't do store tearing. */ WRITE_ONCE(*(unsigned long *)handle, obj); +#endif } /* zpool driver */ @ mm/zsmalloc.c:500 @ MODULE_ALIAS("zpool-zsmalloc"); /* per-cpu VM mapping areas for zspage accesses that cross page boundaries */ static DEFINE_PER_CPU(struct mapping_area, zs_map_area); +static DEFINE_LOCAL_IRQ_LOCK(zs_map_area_lock); static bool is_zspage_isolated(struct zspage *zspage) { @ mm/zsmalloc.c:910 @ static unsigned long location_to_obj(struct page *page, unsigned int obj_idx) static unsigned long handle_to_obj(unsigned long handle) { +#ifdef CONFIG_PREEMPT_RT + struct zsmalloc_handle *zh = zs_get_pure_handle(handle); + + return zh->addr; +#else return *(unsigned long *)handle; +#endif } static unsigned long obj_to_head(struct page *page, void *obj) @ mm/zsmalloc.c:930 @ static unsigned long obj_to_head(struct page *page, void *obj) static inline int testpin_tag(unsigned long handle) { +#ifdef CONFIG_PREEMPT_RT + struct zsmalloc_handle *zh = zs_get_pure_handle(handle); + + return mutex_is_locked(&zh->lock); +#else return bit_spin_is_locked(HANDLE_PIN_BIT, (unsigned long *)handle); +#endif } static inline int trypin_tag(unsigned long handle) { +#ifdef CONFIG_PREEMPT_RT + struct zsmalloc_handle *zh = zs_get_pure_handle(handle); + + return mutex_trylock(&zh->lock); +#else return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle); +#endif } static void pin_tag(unsigned long handle) { +#ifdef CONFIG_PREEMPT_RT + struct zsmalloc_handle *zh = zs_get_pure_handle(handle); + + return mutex_lock(&zh->lock); +#else bit_spin_lock(HANDLE_PIN_BIT, (unsigned long *)handle); +#endif } static void unpin_tag(unsigned long handle) { +#ifdef CONFIG_PREEMPT_RT + struct zsmalloc_handle *zh = zs_get_pure_handle(handle); + + return mutex_unlock(&zh->lock); +#else bit_spin_unlock(HANDLE_PIN_BIT, (unsigned long *)handle); +#endif } static void reset_page(struct page *page) @ mm/zsmalloc.c:1395 @ void *zs_map_object(struct zs_pool *pool, unsigned long handle, class = pool->size_class[class_idx]; off = (class->size * obj_idx) & ~PAGE_MASK; - area = &get_cpu_var(zs_map_area); + area = &get_locked_var(zs_map_area_lock, zs_map_area); area->vm_mm = mm; if (off + class->size <= PAGE_SIZE) { /* this object is contained entirely within a page */ @ mm/zsmalloc.c:1449 @ void zs_unmap_object(struct zs_pool *pool, unsigned long handle) __zs_unmap_object(area, pages, off, class->size); } - put_cpu_var(zs_map_area); + put_locked_var(zs_map_area_lock, zs_map_area); migrate_read_unlock(zspage); unpin_tag(handle); @ mm/zswap.c:21 @ #include <linux/highmem.h> #include <linux/slab.h> #include <linux/spinlock.h> +#include <linux/locallock.h> #include <linux/types.h> #include <linux/atomic.h> #include <linux/frontswap.h> @ mm/zswap.c:394 @ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root, * per-cpu code **********************************/ static DEFINE_PER_CPU(u8 *, zswap_dstmem); +/* Used for zswap_dstmem and tfm */ +static DEFINE_LOCAL_IRQ_LOCK(zswap_cpu_lock); static int zswap_dstmem_prepare(unsigned int cpu) { @ mm/zswap.c:925 @ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle) dlen = PAGE_SIZE; src = (u8 *)zhdr + sizeof(struct zswap_header); dst = kmap_atomic(page); - tfm = *get_cpu_ptr(entry->pool->tfm); + local_lock(zswap_cpu_lock); + tfm = *this_cpu_ptr(entry->pool->tfm); ret = crypto_comp_decompress(tfm, src, entry->length, dst, &dlen); - put_cpu_ptr(entry->pool->tfm); + local_unlock(zswap_cpu_lock); kunmap_atomic(dst); BUG_ON(ret); BUG_ON(dlen != PAGE_SIZE); @ mm/zswap.c:1081 @ static int zswap_frontswap_store(unsigned type, pgoff_t offset, } /* compress */ - dst = get_cpu_var(zswap_dstmem); - tfm = *get_cpu_ptr(entry->pool->tfm); + local_lock(zswap_cpu_lock); + dst = *this_cpu_ptr(&zswap_dstmem); + tfm = *this_cpu_ptr(entry->pool->tfm); src = kmap_atomic(page); ret = crypto_comp_compress(tfm, src, PAGE_SIZE, dst, &dlen); kunmap_atomic(src); - put_cpu_ptr(entry->pool->tfm); if (ret) { ret = -EINVAL; goto put_dstmem; @ mm/zswap.c:1110 @ static int zswap_frontswap_store(unsigned type, pgoff_t offset, memcpy(buf, &zhdr, hlen); memcpy(buf + hlen, dst, dlen); zpool_unmap_handle(entry->pool->zpool, handle); - put_cpu_var(zswap_dstmem); + local_unlock(zswap_cpu_lock); /* populate entry */ entry->offset = offset; @ mm/zswap.c:1138 @ static int zswap_frontswap_store(unsigned type, pgoff_t offset, return 0; put_dstmem: - put_cpu_var(zswap_dstmem); + local_unlock(zswap_cpu_lock); zswap_pool_put(entry->pool); freepage: zswap_entry_cache_free(entry); @ mm/zswap.c:1183 @ static int zswap_frontswap_load(unsigned type, pgoff_t offset, if (zpool_evictable(entry->pool->zpool)) src += sizeof(struct zswap_header); dst = kmap_atomic(page); - tfm = *get_cpu_ptr(entry->pool->tfm); + local_lock(zswap_cpu_lock); + tfm = *this_cpu_ptr(entry->pool->tfm); ret = crypto_comp_decompress(tfm, src, entry->length, dst, &dlen); - put_cpu_ptr(entry->pool->tfm); + local_unlock(zswap_cpu_lock); kunmap_atomic(dst); zpool_unmap_handle(entry->pool->zpool, entry->handle); BUG_ON(ret); @ net/Kconfig:285 @ config CGROUP_NET_CLASSID config NET_RX_BUSY_POLL bool - default y + default y if !PREEMPT_RT config BQL bool @ net/bluetooth/rfcomm/sock.c:67 @ static void rfcomm_sk_data_ready(struct rfcomm_dlc *d, struct sk_buff *skb) static void rfcomm_sk_state_change(struct rfcomm_dlc *d, int err) { struct sock *sk = d->owner, *parent; - unsigned long flags; if (!sk) return; BT_DBG("dlc %p state %ld err %d", d, d->state, err); - local_irq_save(flags); - bh_lock_sock(sk); + spin_lock_bh(&sk->sk_lock.slock); if (err) sk->sk_err = err; @ net/bluetooth/rfcomm/sock.c:94 @ static void rfcomm_sk_state_change(struct rfcomm_dlc *d, int err) sk->sk_state_change(sk); } - bh_unlock_sock(sk); - local_irq_restore(flags); + spin_unlock_bh(&sk->sk_lock.slock); if (parent && sock_flag(sk, SOCK_ZAPPED)) { /* We have to drop DLC lock here, otherwise @ net/bpf/test_run.c:40 @ static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat, repeat = 1; rcu_read_lock(); - preempt_disable(); + migrate_disable(); time_start = ktime_get_ns(); for (i = 0; i < repeat; i++) { bpf_cgroup_storage_set(storage); @ net/bpf/test_run.c:57 @ static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat, if (need_resched()) { time_spent += ktime_get_ns() - time_start; - preempt_enable(); + migrate_enable(); rcu_read_unlock(); cond_resched(); rcu_read_lock(); - preempt_disable(); + migrate_disable(); time_start = ktime_get_ns(); } } time_spent += ktime_get_ns() - time_start; - preempt_enable(); + migrate_enable(); rcu_read_unlock(); do_div(time_spent, repeat); @ net/core/dev.c:82 @ #include <linux/sched.h> #include <linux/sched/mm.h> #include <linux/mutex.h> +#include <linux/rwsem.h> #include <linux/string.h> #include <linux/mm.h> #include <linux/socket.h> @ net/core/dev.c:198 @ static DEFINE_SPINLOCK(napi_hash_lock); static unsigned int napi_gen_id = NR_CPUS; static DEFINE_READ_MOSTLY_HASHTABLE(napi_hash, 8); -static seqcount_t devnet_rename_seq; +static DECLARE_RWSEM(devnet_rename_sem); static inline void dev_base_seq_inc(struct net *net) { @ net/core/dev.c:221 @ static inline struct hlist_head *dev_index_hash(struct net *net, int ifindex) static inline void rps_lock(struct softnet_data *sd) { #ifdef CONFIG_RPS - spin_lock(&sd->input_pkt_queue.lock); + raw_spin_lock(&sd->input_pkt_queue.raw_lock); #endif } static inline void rps_unlock(struct softnet_data *sd) { #ifdef CONFIG_RPS - spin_unlock(&sd->input_pkt_queue.lock); + raw_spin_unlock(&sd->input_pkt_queue.raw_lock); #endif } @ net/core/dev.c:934 @ EXPORT_SYMBOL(dev_get_by_napi_id); * @net: network namespace * @name: a pointer to the buffer where the name will be stored. * @ifindex: the ifindex of the interface to get the name from. - * - * The use of raw_seqcount_begin() and cond_resched() before - * retrying is required as we want to give the writers a chance - * to complete when CONFIG_PREEMPTION is not set. */ int netdev_get_name(struct net *net, char *name, int ifindex) { struct net_device *dev; - unsigned int seq; + int ret; -retry: - seq = raw_seqcount_begin(&devnet_rename_seq); + down_read(&devnet_rename_sem); rcu_read_lock(); + dev = dev_get_by_index_rcu(net, ifindex); if (!dev) { - rcu_read_unlock(); - return -ENODEV; + ret = -ENODEV; + goto out; } strcpy(name, dev->name); - rcu_read_unlock(); - if (read_seqcount_retry(&devnet_rename_seq, seq)) { - cond_resched(); - goto retry; - } - return 0; + ret = 0; +out: + rcu_read_unlock(); + up_read(&devnet_rename_sem); + return ret; } /** @ net/core/dev.c:1227 @ int dev_change_name(struct net_device *dev, const char *newname) likely(!(dev->priv_flags & IFF_LIVE_RENAME_OK))) return -EBUSY; - write_seqcount_begin(&devnet_rename_seq); + down_write(&devnet_rename_sem); if (strncmp(newname, dev->name, IFNAMSIZ) == 0) { - write_seqcount_end(&devnet_rename_seq); + up_write(&devnet_rename_sem); return 0; } @ net/core/dev.c:1238 @ int dev_change_name(struct net_device *dev, const char *newname) err = dev_get_valid_name(net, dev, newname); if (err < 0) { - write_seqcount_end(&devnet_rename_seq); + up_write(&devnet_rename_sem); return err; } @ net/core/dev.c:1253 @ int dev_change_name(struct net_device *dev, const char *newname) if (ret) { memcpy(dev->name, oldname, IFNAMSIZ); dev->name_assign_type = old_assign_type; - write_seqcount_end(&devnet_rename_seq); + up_write(&devnet_rename_sem); return ret; } - write_seqcount_end(&devnet_rename_seq); + up_write(&devnet_rename_sem); netdev_adjacent_rename_links(dev, oldname); @ net/core/dev.c:1278 @ int dev_change_name(struct net_device *dev, const char *newname) /* err >= 0 after dev_alloc_name() or stores the first errno */ if (err >= 0) { err = ret; - write_seqcount_begin(&devnet_rename_seq); + down_write(&devnet_rename_sem); memcpy(dev->name, oldname, IFNAMSIZ); memcpy(oldname, newname, IFNAMSIZ); dev->name_assign_type = old_assign_type; @ net/core/dev.c:2948 @ static void __netif_reschedule(struct Qdisc *q) sd->output_queue_tailp = &q->next_sched; raise_softirq_irqoff(NET_TX_SOFTIRQ); local_irq_restore(flags); + preempt_check_resched_rt(); } void __netif_schedule(struct Qdisc *q) @ net/core/dev.c:3011 @ void __dev_kfree_skb_irq(struct sk_buff *skb, enum skb_free_reason reason) __this_cpu_write(softnet_data.completion_queue, skb); raise_softirq_irqoff(NET_TX_SOFTIRQ); local_irq_restore(flags); + preempt_check_resched_rt(); } EXPORT_SYMBOL(__dev_kfree_skb_irq); @ net/core/dev.c:3679 @ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q, * This permits qdisc->running owner to get the lock more * often and dequeue packets faster. */ +#ifdef CONFIG_PREEMPT_RT + contended = true; +#else contended = qdisc_is_running(q); +#endif if (unlikely(contended)) spin_lock(&q->busylock); @ net/core/dev.c:4477 @ static int enqueue_to_backlog(struct sk_buff *skb, int cpu, rps_unlock(sd); local_irq_restore(flags); + preempt_check_resched_rt(); atomic_long_inc(&skb->dev->rx_dropped); kfree_skb(skb); @ net/core/dev.c:4692 @ static int netif_rx_internal(struct sk_buff *skb) struct rps_dev_flow voidflow, *rflow = &voidflow; int cpu; - preempt_disable(); + migrate_disable(); rcu_read_lock(); cpu = get_rps_cpu(skb->dev, skb, &rflow); @ net/core/dev.c:4702 @ static int netif_rx_internal(struct sk_buff *skb) ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail); rcu_read_unlock(); - preempt_enable(); + migrate_enable(); } else #endif { unsigned int qtail; - ret = enqueue_to_backlog(skb, get_cpu(), &qtail); - put_cpu(); + ret = enqueue_to_backlog(skb, get_cpu_light(), &qtail); + put_cpu_light(); } return ret; } @ net/core/dev.c:4748 @ int netif_rx_ni(struct sk_buff *skb) trace_netif_rx_ni_entry(skb); - preempt_disable(); + local_bh_disable(); err = netif_rx_internal(skb); - if (local_softirq_pending()) - do_softirq(); - preempt_enable(); + local_bh_enable(); trace_netif_rx_ni_exit(err); return err; @ net/core/dev.c:5510 @ static void flush_backlog(struct work_struct *work) skb_queue_walk_safe(&sd->input_pkt_queue, skb, tmp) { if (skb->dev->reg_state == NETREG_UNREGISTERING) { __skb_unlink(skb, &sd->input_pkt_queue); - kfree_skb(skb); + __skb_queue_tail(&sd->tofree_queue, skb); input_queue_head_incr(sd); } } @ net/core/dev.c:5520 @ static void flush_backlog(struct work_struct *work) skb_queue_walk_safe(&sd->process_queue, skb, tmp) { if (skb->dev->reg_state == NETREG_UNREGISTERING) { __skb_unlink(skb, &sd->process_queue); - kfree_skb(skb); + __skb_queue_tail(&sd->tofree_queue, skb); input_queue_head_incr(sd); } } + if (!skb_queue_empty(&sd->tofree_queue)) + raise_softirq_irqoff(NET_RX_SOFTIRQ); local_bh_enable(); + } static void flush_all_backlogs(void) @ net/core/dev.c:6111 @ static void net_rps_action_and_irq_enable(struct softnet_data *sd) sd->rps_ipi_list = NULL; local_irq_enable(); + preempt_check_resched_rt(); /* Send pending IPI's to kick RPS processing on remote cpus. */ net_rps_send_ipi(remsd); } else #endif local_irq_enable(); + preempt_check_resched_rt(); } static bool sd_has_rps_ipi_waiting(struct softnet_data *sd) @ net/core/dev.c:6148 @ static int process_backlog(struct napi_struct *napi, int quota) while (again) { struct sk_buff *skb; + local_irq_disable(); while ((skb = __skb_dequeue(&sd->process_queue))) { + local_irq_enable(); rcu_read_lock(); __netif_receive_skb(skb); rcu_read_unlock(); @ net/core/dev.c:6158 @ static int process_backlog(struct napi_struct *napi, int quota) if (++work >= quota) return work; + local_irq_disable(); } - local_irq_disable(); rps_lock(sd); if (skb_queue_empty(&sd->input_pkt_queue)) { /* @ net/core/dev.c:6198 @ void __napi_schedule(struct napi_struct *n) local_irq_save(flags); ____napi_schedule(this_cpu_ptr(&softnet_data), n); local_irq_restore(flags); + preempt_check_resched_rt(); } EXPORT_SYMBOL(__napi_schedule); @ net/core/dev.c:6642 @ static __latent_entropy void net_rx_action(struct softirq_action *h) unsigned long time_limit = jiffies + usecs_to_jiffies(netdev_budget_usecs); int budget = netdev_budget; + struct sk_buff_head tofree_q; + struct sk_buff *skb; LIST_HEAD(list); LIST_HEAD(repoll); + __skb_queue_head_init(&tofree_q); + local_irq_disable(); + skb_queue_splice_init(&sd->tofree_queue, &tofree_q); list_splice_init(&sd->poll_list, &list); local_irq_enable(); + while ((skb = __skb_dequeue(&tofree_q))) + kfree_skb(skb); + for (;;) { struct napi_struct *n; @ net/core/dev.c:10194 @ static int dev_cpu_dead(unsigned int oldcpu) raise_softirq_irqoff(NET_TX_SOFTIRQ); local_irq_enable(); + preempt_check_resched_rt(); #ifdef CONFIG_RPS remsd = oldsd->rps_ipi_list; @ net/core/dev.c:10208 @ static int dev_cpu_dead(unsigned int oldcpu) netif_rx_ni(skb); input_queue_head_incr(oldsd); } - while ((skb = skb_dequeue(&oldsd->input_pkt_queue))) { + while ((skb = __skb_dequeue(&oldsd->input_pkt_queue))) { netif_rx_ni(skb); input_queue_head_incr(oldsd); } + while ((skb = __skb_dequeue(&oldsd->tofree_queue))) { + kfree_skb(skb); + } return 0; } @ net/core/dev.c:10527 @ static int __init net_dev_init(void) INIT_WORK(flush, flush_backlog); - skb_queue_head_init(&sd->input_pkt_queue); - skb_queue_head_init(&sd->process_queue); + skb_queue_head_init_raw(&sd->input_pkt_queue); + skb_queue_head_init_raw(&sd->process_queue); + skb_queue_head_init_raw(&sd->tofree_queue); #ifdef CONFIG_XFRM_OFFLOAD skb_queue_head_init(&sd->xfrm_backlog); #endif @ net/core/flow_dissector.c:939 @ bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx, (int)FLOW_DISSECTOR_F_STOP_AT_ENCAP); flow_keys->flags = flags; - preempt_disable(); - result = BPF_PROG_RUN(prog, ctx); - preempt_enable(); + result = bpf_prog_run_pin_on_cpu(prog, ctx); flow_keys->nhoff = clamp_t(u16, flow_keys->nhoff, nhoff, hlen); flow_keys->thoff = clamp_t(u16, flow_keys->thoff, @ net/core/gen_estimator.c:45 @ struct net_rate_estimator { struct gnet_stats_basic_packed *bstats; spinlock_t *stats_lock; - seqcount_t *running; + net_seqlock_t *running; struct gnet_stats_basic_cpu __percpu *cpu_bstats; u8 ewma_log; u8 intvl_log; /* period : (250ms << intvl_log) */ @ net/core/gen_estimator.c:128 @ int gen_new_estimator(struct gnet_stats_basic_packed *bstats, struct gnet_stats_basic_cpu __percpu *cpu_bstats, struct net_rate_estimator __rcu **rate_est, spinlock_t *lock, - seqcount_t *running, + net_seqlock_t *running, struct nlattr *opt) { struct gnet_estimator *parm = nla_data(opt); @ net/core/gen_estimator.c:226 @ int gen_replace_estimator(struct gnet_stats_basic_packed *bstats, struct gnet_stats_basic_cpu __percpu *cpu_bstats, struct net_rate_estimator __rcu **rate_est, spinlock_t *lock, - seqcount_t *running, struct nlattr *opt) + net_seqlock_t *running, struct nlattr *opt) { return gen_new_estimator(bstats, cpu_bstats, rate_est, lock, running, opt); @ net/core/gen_stats.c:140 @ __gnet_stats_copy_basic_cpu(struct gnet_stats_basic_packed *bstats, } void -__gnet_stats_copy_basic(const seqcount_t *running, +__gnet_stats_copy_basic(net_seqlock_t *running, struct gnet_stats_basic_packed *bstats, struct gnet_stats_basic_cpu __percpu *cpu, struct gnet_stats_basic_packed *b) @ net/core/gen_stats.c:153 @ __gnet_stats_copy_basic(const seqcount_t *running, } do { if (running) - seq = read_seqcount_begin(running); + seq = net_seq_begin(running); bstats->bytes = b->bytes; bstats->packets = b->packets; - } while (running && read_seqcount_retry(running, seq)); + } while (running && net_seq_retry(running, seq)); } EXPORT_SYMBOL(__gnet_stats_copy_basic); static int -___gnet_stats_copy_basic(const seqcount_t *running, +___gnet_stats_copy_basic(net_seqlock_t *running, struct gnet_dump *d, struct gnet_stats_basic_cpu __percpu *cpu, struct gnet_stats_basic_packed *b, @ net/core/gen_stats.c:207 @ ___gnet_stats_copy_basic(const seqcount_t *running, * if the room in the socket buffer was not sufficient. */ int -gnet_stats_copy_basic(const seqcount_t *running, +gnet_stats_copy_basic(net_seqlock_t *running, struct gnet_dump *d, struct gnet_stats_basic_cpu __percpu *cpu, struct gnet_stats_basic_packed *b) @ net/core/gen_stats.c:231 @ EXPORT_SYMBOL(gnet_stats_copy_basic); * if the room in the socket buffer was not sufficient. */ int -gnet_stats_copy_basic_hw(const seqcount_t *running, +gnet_stats_copy_basic_hw(net_seqlock_t *running, struct gnet_dump *d, struct gnet_stats_basic_cpu __percpu *cpu, struct gnet_stats_basic_packed *b) @ net/core/skmsg.c:631 @ int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock, struct bpf_prog *prog; int ret; - preempt_disable(); rcu_read_lock(); prog = READ_ONCE(psock->progs.msg_parser); if (unlikely(!prog)) { @ net/core/skmsg.c:640 @ int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock, sk_msg_compute_data_pointers(msg); msg->sk = sk; - ret = BPF_PROG_RUN(prog, msg); + ret = bpf_prog_run_pin_on_cpu(prog, msg); ret = sk_psock_map_verd(ret, msg->sk_redir); psock->apply_bytes = msg->apply_bytes; if (ret == __SK_REDIRECT) { @ net/core/skmsg.c:655 @ int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock, } out: rcu_read_unlock(); - preempt_enable(); return ret; } EXPORT_SYMBOL_GPL(sk_psock_msg_verdict); @ net/core/skmsg.c:666 @ static int sk_psock_bpf_run(struct sk_psock *psock, struct bpf_prog *prog, skb->sk = psock->sk; bpf_compute_data_end_sk_skb(skb); - preempt_disable(); - ret = BPF_PROG_RUN(prog, skb); - preempt_enable(); + ret = bpf_prog_run_pin_on_cpu(prog, skb); /* strparser clones the skb before handing it to a upper layer, * meaning skb_orphan has been called. We NULL sk on the way out * to ensure we don't trigger a BUG_ON() in skb/sk operations @ net/kcm/kcmsock.c:383 @ static int kcm_parse_func_strparser(struct strparser *strp, struct sk_buff *skb) struct bpf_prog *prog = psock->bpf_prog; int res; - preempt_disable(); - res = BPF_PROG_RUN(prog, skb); - preempt_enable(); + res = bpf_prog_run_pin_on_cpu(prog, skb); return res; } @ net/netfilter/nf_conntrack_core.c:181 @ EXPORT_SYMBOL_GPL(nf_conntrack_htable_size); unsigned int nf_conntrack_max __read_mostly; EXPORT_SYMBOL_GPL(nf_conntrack_max); -seqcount_t nf_conntrack_generation __read_mostly; +seqcount_spinlock_t nf_conntrack_generation __read_mostly; static unsigned int nf_conntrack_hash_rnd __read_mostly; static u32 hash_conntrack_raw(const struct nf_conntrack_tuple *tuple, @ net/netfilter/nf_conntrack_core.c:2590 @ int nf_conntrack_init_start(void) /* struct nf_ct_ext uses u8 to store offsets/size */ BUILD_BUG_ON(total_extension_size() > 255u); - seqcount_init(&nf_conntrack_generation); + seqcount_spinlock_init(&nf_conntrack_generation, + &nf_conntrack_locks_all_lock); for (i = 0; i < CONNTRACK_LOCKS; i++) spin_lock_init(&nf_conntrack_locks[i]); @ net/netfilter/nft_set_rbtree.c:21 @ struct nft_rbtree { struct rb_root root; rwlock_t lock; - seqcount_t count; + seqcount_rwlock_t count; struct delayed_work gc_work; }; @ net/netfilter/nft_set_rbtree.c:519 @ static int nft_rbtree_init(const struct nft_set *set, struct nft_rbtree *priv = nft_set_priv(set); rwlock_init(&priv->lock); - seqcount_init(&priv->count); + seqcount_rwlock_init(&priv->count, &priv->lock); priv->root = RB_ROOT; INIT_DEFERRABLE_WORK(&priv->gc_work, nft_rbtree_gc); @ net/packet/af_packet.c:60 @ #include <linux/if_packet.h> #include <linux/wireless.h> #include <linux/kernel.h> +#include <linux/delay.h> #include <linux/kmod.h> #include <linux/slab.h> #include <linux/vmalloc.h> @ net/packet/af_packet.c:665 @ static void prb_retire_rx_blk_timer_expired(struct timer_list *t) if (BLOCK_NUM_PKTS(pbd)) { while (atomic_read(&pkc->blk_fill_in_prog)) { /* Waiting for skb_copy_bits to finish... */ - cpu_relax(); + cpu_chill(); } } @ net/packet/af_packet.c:927 @ static void prb_retire_current_block(struct tpacket_kbdq_core *pkc, if (!(status & TP_STATUS_BLK_TMO)) { while (atomic_read(&pkc->blk_fill_in_prog)) { /* Waiting for skb_copy_bits to finish... */ - cpu_relax(); + cpu_chill(); } } prb_close_block(pkc, pbd, po, status); @ net/sched/sch_api.c:1251 @ static struct Qdisc *qdisc_create(struct net_device *dev, rcu_assign_pointer(sch->stab, stab); } if (tca[TCA_RATE]) { - seqcount_t *running; + net_seqlock_t *running; err = -EOPNOTSUPP; if (sch->flags & TCQ_F_MQROOT) { @ net/sched/sch_generic.c:555 @ struct Qdisc noop_qdisc = { .ops = &noop_qdisc_ops, .q.lock = __SPIN_LOCK_UNLOCKED(noop_qdisc.q.lock), .dev_queue = &noop_netdev_queue, +#ifdef CONFIG_PREEMPT_RT + .running = __SEQLOCK_UNLOCKED(noop_qdisc.running), +#else .running = SEQCNT_ZERO(noop_qdisc.running), +#endif .busylock = __SPIN_LOCK_UNLOCKED(noop_qdisc.busylock), .gso_skb = { .next = (struct sk_buff *)&noop_qdisc.gso_skb, @ net/sched/sch_generic.c:855 @ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue, spin_lock_init(&sch->busylock); /* seqlock has the same scope of busylock, for NOLOCK qdisc */ spin_lock_init(&sch->seqlock); +#ifdef CONFIG_PREEMPT_RT + seqlock_init(&sch->running); +#else seqcount_init(&sch->running); +#endif sch->ops = ops; sch->flags = ops->static_flags; @ net/sched/sch_generic.c:873 @ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue, if (sch != &noop_qdisc) { lockdep_set_class(&sch->busylock, &dev->qdisc_tx_busylock_key); lockdep_set_class(&sch->seqlock, &dev->qdisc_tx_busylock_key); +#ifdef CONFIG_PREEMPT_RT + lockdep_set_class(&sch->running.lock, &dev->qdisc_running_key); +#else lockdep_set_class(&sch->running, &dev->qdisc_running_key); +#endif } return sch; @ net/sunrpc/svc_xprt.c:414 @ void svc_xprt_do_enqueue(struct svc_xprt *xprt) if (test_and_set_bit(XPT_BUSY, &xprt->xpt_flags)) return; - cpu = get_cpu(); + cpu = get_cpu_light(); pool = svc_pool_for_cpu(xprt->xpt_server, cpu); atomic_long_inc(&pool->sp_stats.packets); @ net/sunrpc/svc_xprt.c:438 @ void svc_xprt_do_enqueue(struct svc_xprt *xprt) rqstp = NULL; out_unlock: rcu_read_unlock(); - put_cpu(); + put_cpu_light(); trace_svc_xprt_do_enqueue(xprt, rqstp); } EXPORT_SYMBOL_GPL(svc_xprt_do_enqueue); @ net/xfrm/xfrm_policy.c:125 @ struct xfrm_pol_inexact_bin { /* list containing '*:*' policies */ struct hlist_head hhead; - seqcount_t count; + seqcount_spinlock_t count; /* tree sorted by daddr/prefix */ struct rb_root root_d; @ net/xfrm/xfrm_policy.c:158 @ static struct xfrm_policy_afinfo const __rcu *xfrm_policy_afinfo[AF_INET6 + 1] __read_mostly; static struct kmem_cache *xfrm_dst_cache __ro_after_init; -static __read_mostly seqcount_t xfrm_policy_hash_generation; +static __read_mostly seqcount_mutex_t xfrm_policy_hash_generation; static struct rhashtable xfrm_policy_inexact_table; static const struct rhashtable_params xfrm_pol_inexact_params; @ net/xfrm/xfrm_policy.c:722 @ xfrm_policy_inexact_alloc_bin(const struct xfrm_policy *pol, u8 dir) INIT_HLIST_HEAD(&bin->hhead); bin->root_d = RB_ROOT; bin->root_s = RB_ROOT; - seqcount_init(&bin->count); + seqcount_spinlock_init(&bin->count, &net->xfrm.xfrm_policy_lock); prev = rhashtable_lookup_get_insert_key(&xfrm_policy_inexact_table, &bin->k, &bin->head, @ net/xfrm/xfrm_policy.c:1909 @ static int xfrm_policy_match(const struct xfrm_policy *pol, static struct xfrm_pol_inexact_node * xfrm_policy_lookup_inexact_addr(const struct rb_root *r, - seqcount_t *count, + seqcount_spinlock_t *count, const xfrm_address_t *addr, u16 family) { const struct rb_node *parent; @ net/xfrm/xfrm_policy.c:4157 @ void __init xfrm_init(void) { register_pernet_subsys(&xfrm_net_ops); xfrm_dev_init(); - seqcount_init(&xfrm_policy_hash_generation); + seqcount_mutex_init(&xfrm_policy_hash_generation, &hash_resize_mutex); xfrm_input_init(); #ifdef CONFIG_INET_ESPINTCP @ net/xfrm/xfrm_state.c:47 @ static void xfrm_state_gc_task(struct work_struct *work); */ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024; -static __read_mostly seqcount_t xfrm_state_hash_generation = SEQCNT_ZERO(xfrm_state_hash_generation); +static __read_mostly seqcount_spinlock_t xfrm_state_hash_generation; static struct kmem_cache *xfrm_state_cache __ro_after_init; static DECLARE_WORK(xfrm_state_gc_work, xfrm_state_gc_task); @ net/xfrm/xfrm_state.c:142 @ static void xfrm_hash_resize(struct work_struct *work) return; } + /* XXX - the locking which protects the sequence counter appears + * to be broken here. The sequence counter is global, but the + * spinlock used for the sequence counter write serialization is + * per network namespace... + */ spin_lock_bh(&net->xfrm.xfrm_state_lock); write_seqcount_begin(&xfrm_state_hash_generation); @ net/xfrm/xfrm_state.c:2565 @ int __net_init xfrm_state_init(struct net *net) net->xfrm.state_num = 0; INIT_WORK(&net->xfrm.state_hash_work, xfrm_hash_resize); spin_lock_init(&net->xfrm.xfrm_state_lock); + seqcount_spinlock_init(&xfrm_state_hash_generation, + &net->xfrm.xfrm_state_lock); return 0; out_byspi: @ virt/kvm/arm/arm.c:703 @ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) * involves poking the GIC, which must be done in a * non-preemptible context. */ - preempt_disable(); + migrate_disable(); kvm_pmu_flush_hwstate(vcpu); @ virt/kvm/arm/arm.c:752 @ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) kvm_timer_sync_hwstate(vcpu); kvm_vgic_sync_hwstate(vcpu); local_irq_enable(); - preempt_enable(); + migrate_enable(); continue; } @ virt/kvm/arm/arm.c:828 @ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) /* Exit types that need handling before we can be preempted */ handle_exit_early(vcpu, run, ret); - preempt_enable(); + migrate_enable(); ret = handle_exit(vcpu, run, ret); } @ virt/kvm/eventfd.c:306 @ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args) INIT_LIST_HEAD(&irqfd->list); INIT_WORK(&irqfd->inject, irqfd_inject); INIT_WORK(&irqfd->shutdown, irqfd_shutdown); - seqcount_init(&irqfd->irq_entry_sc); + seqcount_spinlock_init(&irqfd->irq_entry_sc, &kvm->irqfds.lock); f = fdget(args->fd); if (!f.file) {