Subject: Exclude defective CPUs from being used
From: Carsten Emde <C.Emde@osadl.org>
Date: Sun,  3 Feb 2013 14:22:41 +0100

An Intel i7-980X multi-core processor regularly crashed with the following messages:

mce: [Hardware Error]: CPU 8: Machine Check Exception: 4 Bank 2: b200000000000005
mce: [Hardware Error]: TSC 1e3406c01d 
mce: [Hardware Error]: PROCESSOR 0:206c2 TIME 1355421008 SOCKET 0 APIC 5 microcode 14
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: CPU 2: Machine Check Exception: 4 Bank 2: b200000000000005
mce: [Hardware Error]: TSC 1e3406bfc9
mce: [Hardware Error]: PROCESSOR 0:206c2 TIME 1355421008 SOCKET 0 APIC 4 microcode 14
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Processor context corrupt
Kernel panic - not syncing: Fatal Machine check
panic occurred, switching back to text console

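After the kernel patch below was applied and the kernel parameter
  defect_cpus=2,8
was added to the kernel command line, the remaining 5 cores (10 logical
CPUs) have been working properly. Judging by their APIC IDs (4 and 5),
CPUs 2 and 8 are most likely the two hyperthreads of a single core.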
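
To illustrate how a value such as "2,8" or "0-3,5" turns into a CPU mask,
here is a simplified user-space sketch. It is not the kernel's
cpulist_parse() implementation. The function name parse_cpu_list and the
64-CPU limit are assumptions made for this example, and it is stricter than
the kernel about things like trailing separators.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_CPUS 64	/* assumption: the mask fits into one 64-bit word */

/*
 * Parse a list such as "2,8" or "0-3,5" into a bitmask.
 * Returns 0 on success and -EINVAL on malformed input,
 * descending ranges, or CPU numbers >= MAX_CPUS.
 */
static int parse_cpu_list(const char *s, uint64_t *mask)
{
	uint64_t result = 0;

	assert(s != NULL && mask != NULL);

	for (;;) {
		char *end;
		unsigned long lo, hi;

		/* Every element must start with a digit. */
		if (*s < '0' || *s > '9')
			return -EINVAL;
		lo = strtoul(s, &end, 10);
		hi = lo;

		/* Optional "-<cpu number>" makes the element a range. */
		if (*end == '-') {
			s = end + 1;
			if (*s < '0' || *s > '9')
				return -EINVAL;
			hi = strtoul(s, &end, 10);
		}

		/* Ranges must ascend and stay within the mask. */
		if (lo > hi || hi >= MAX_CPUS)
			return -EINVAL;

		for (unsigned long cpu = lo; cpu <= hi; cpu++)
			result |= (uint64_t)1 << cpu;

		if (*end == '\0')
			break;
		if (*end != ',')
			return -EINVAL;
		s = end + 1;
	}

	*mask = result;
	return 0;
}
```

With this sketch, "2,8" sets bits 2 and 8, "0-3,5" produces the mask 0x2f,
and a descending range such as "3-1" is rejected with -EINVAL.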

Signed-off-by: Carsten Emde <C.Emde@osadl.org>

---
 Documentation/kernel-parameters.txt |    9 +++++++++
 kernel/cpu.c                        |   14 ++++++++++++++
 2 files changed, 23 insertions(+)

Index: linux-3.12.24-rt37/Documentation/kernel-parameters.txt
===================================================================
@ linux-3.12.24-rt37/Documentation/kernel-parameters.txt:767 @ bytes respectively. Such letter suffixes
 			Defaults to the default architecture's huge page size
 			if not specified.
 
+	defect_cpus=	[SMP] Exclude defective CPUs from being used
+			Format:
+			<cpu number>,...,<cpu number>
+			or
+			<cpu number>-<cpu number>
+			(must be a positive range in ascending order)
+			or a mixture
+			<cpu number>,...,<cpu number>-<cpu number>
+
 	dhash_entries=	[KNL]
 			Set number of hash buckets for dentry cache.
 
Index: linux-3.12.24-rt37/kernel/cpu.c
===================================================================
--- linux-3.12.24-rt37.orig/kernel/cpu.c
+++ linux-3.12.24-rt37/kernel/cpu.c
@ linux-3.12.24-rt37/kernel/cpu.c:687 @ out:
 EXPORT_SYMBOL(cpu_down);
 #endif /*CONFIG_HOTPLUG_CPU*/
 
+/* Backing storage is static so the mask is valid without defect_cpus= */
+static struct cpumask cpu_defect_bits;
+static struct cpumask *const cpu_defect_map = &cpu_defect_bits;
+static int __init setup_defect_cpus(char *str)
+{
+	return cpulist_parse(str, cpu_defect_map);
+}
+early_param("defect_cpus", setup_defect_cpus);
+
 /* Requires cpu_add_remove_lock to be held */
 static int _cpu_up(unsigned int cpu, int tasks_frozen)
 {
@ linux-3.12.24-rt37/kernel/cpu.c:759 @ int cpu_up(unsigned int cpu)
 	pg_data_t	*pgdat;
 #endif
 
+	if (cpumask_test_cpu(cpu, cpu_defect_map)) {
+		pr_warn("Can't online cpu %u. It's marked defective.\n", cpu);
+		return -ENODEV;
+	}
+
 	if (!cpu_possible(cpu)) {
 		printk(KERN_ERR "can't online cpu %d because it is not "
 			"configured as may-hotadd at boot time\n", cpu);