AgeCommit message (Collapse)AuthorFilesLines
2016-07-07drm/nouveau/acpi: use DSM if bridge does not support D3coldnouveau-acpi-v2.3-with-pci-pm-d3Peter Wu1-0/+3
Even if PR3 support is available on the bridge, it will not be used if the PCI layer considers it unavailable. Ensure that this condition is checked to allow a fallback to the Optimus DSM for device poweroff. Signed-off-by: Peter Wu <>
2016-07-07PCI: export pci_bridge_d3_possiblePeter Wu2-1/+3
Allow the nouveau driver to find out whether the bridge can put itself in the D3cold state or whether it should use a specific DSM method to achieve the same result. Cc: Mika Westerberg <> Signed-off-by: Peter Wu <>
2016-07-07Merge branch 'pci/pm' into nouveau-acpi-v2.3Peter Wu14-9/+277
2016-07-07drm/nouveau/acpi: fix lockup with PCIe runtime PMnouveau-acpi-v2.3Peter Wu1-4/+29
Since "PCI: Add runtime PM support for PCIe ports", the parent PCIe port can be runtime-suspended which disables power resources via ACPI. This is incompatible with DSM, resulting in a GPU device which is still in D3 and locks up the kernel on resume (on a Clevo P651RA, GTX965M). Mirror the behavior of Windows 8 and newer[1] (as observed via an AMLi debugger trace) and stop using the DSM functions for D3cold when power resources are available on the parent PCIe port. pci_d3cold_disable() is not used because on some machines, the old DSM method is broken. On a Lenovo T440p (GT 730M) memory and disk corruption would occur, but that is fixed with this patch[2]. [1]: [2]: v2: simply check directly for _PR3. Added affected machines. Signed-off-by: Peter Wu <>
2016-07-07drm/nouveau/acpi: check for function 0x1B before using itPeter Wu1-5/+13
Do not unconditionally invoke function 0x1B without checking for its availability, it leads to an infinite loop on some firmware. Bugzilla: Fixes: 5addcf0a5f0fad ("nouveau: add runtime PM support (v0.9)") Signed-off-by: Peter Wu <>
2016-07-07drm/nouveau/acpi: return supported DSM functionsPeter Wu1-7/+9
Return the set of supported functions to the caller. No functional changes. Signed-off-by: Peter Wu <>
2016-07-07drm/nouveau/acpi: ensure matching ACPI handle and supported functionsPeter Wu1-32/+26
Ensure that the returned set of supported DSM functions (MUX, Optimus) match the ACPI handle that is set in nouveau_dsm_pci_probe. As there are no machines with a MUX function on just one PCI device and an Optimus on another, there should not be a functional impact. This change however makes this implicit assumption more obvious. Convert int to bool and rename has_dsm to has_mux while at it. Let the caller set nouveau_dsm_priv.dhandle as needed. v2: pass dhandle to the caller. Signed-off-by: Peter Wu <>
2016-06-26Linux 4.7-rc5Linus Torvalds1-1/+1
2016-06-26Merge tag 'scsi-fixes' of ↵Linus Torvalds4-6/+12
git:// Pull SCSI fixes from James Bottomley: "Two straightforward fixes. One is a concurrency issue only affecting SAS connected SATA drives, but which could hang the storage subsystem if it triggers (because the outstanding command count on error never goes back to zero) and the other is a NO_TAG fallout from the switch to hostwide tags which causes the system to crash on module insertion (we've checked carefully and only the 53c700 family of drivers is vulnerable to this issue)" * tag 'scsi-fixes' of git:// 53c700: fix BUG on untagged commands scsi: fix race between simultaneous decrements of ->host_failed
2016-06-25Merge branch 'for-linus-4.7-part2' of ↵Linus Torvalds3-13/+34
git:// Pull btrfs fixes part 2 from Chris Mason: "This has one patch from Omar to bring iterate_shared back to btrfs. We have a tree of work we queue up for directory items and it doesn't lend itself well to shared access. While we're cleaning it up, Omar has changed things to use an exclusive lock when there are delayed items" * 'for-linus-4.7-part2' of git:// Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes
2016-06-25Merge branch 'for-linus-4.7' of ↵Linus Torvalds11-36/+57
git:// Pull btrfs fixes from Chris Mason: "I have a two part pull this time because one of the patches Dave Sterba collected needed to be against v4.7-rc2 or higher (we used rc4). I try to make my for-linus-xx branch testable on top of the last major so we can hand fixes to people on the list more easily, so I've split this pull in two. This first part has some fixes and two performance improvements that we've been testing for some time. Josef's two performance fixes are most notable. The transid tracking patch makes a big improvement on pretty much every workload" * 'for-linus-4.7' of git:// Btrfs: Force stripesize to the value of sectorsize btrfs: fix disk_i_size update bug when fallocate() fails Btrfs: fix error handling in map_private_extent_buffer Btrfs: fix error return code in btrfs_init_test_fs() Btrfs: don't do nocow check unless we have to btrfs: fix deadlock in delayed_ref_async_start Btrfs: track transid for delayed ref flushing
2016-06-25Merge tag 'sound-4.7-rc5' of ↵Linus Torvalds4-12/+19
git:// Pull sound fixes from Takashi Iwai: "Again pretty calm weeks: we've had only a few trivial / stable HD-audio fixes in addition to a possible race fix for snd-dummy driver spotted by syzkaller" * tag 'sound-4.7-rc5' of git:// ALSA: dummy: Fix a use-after-free at closing ALSA: hda / realtek - add two more Thinkpad IDs (5050,5053) for tpt460 fixup ALSA: hda - Fix the headset mic jack detection on Dell machine ALSA: hda/tegra: iomem fixups for sparse warnings ALSA: hdac_regmap - fix the register access for runtime PM
2016-06-25Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds1-0/+12
git:// Pull x86 kprobe fix from Thomas Gleixner: "A single fix clearing the TF bit when a fault is single stepped" * 'perf-urgent-for-linus' of git:// kprobes/x86: Clear TF bit in fault on single-stepping
2016-06-25Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds3-21/+66
git:// Pull scheduler fixes from Thomas Gleixner: "A couple of scheduler fixes: - force watchdog reset while processing sysrq-w - fix a deadlock when enabling trace events in the scheduler - fixes to the throttled next buddy logic - fixes for the average accounting (missing serialization and underflow handling) - allow kernel threads for fallback to online but not active cpus" * 'sched-urgent-for-linus' of git:// sched/core: Allow kthreads to fall back to online && !active cpus sched/fair: Do not announce throttled next buddy in dequeue_task_fair() sched/fair: Initialize throttle_count for new task-groups lazily sched/fair: Fix cfs_rq avg tracking underflow kernel/sysrq, watchdog, sched/core: Reset watchdog on all CPUs while processing sysrq-w sched/debug: Fix deadlock when enabling sched events sched/fair: Fix post_init_entity_util_avg() serialization
2016-06-25Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodesOmar Sandoval3-13/+34
Commit fe742fd4f90f ("Revert "btrfs: switch to ->iterate_shared()"") backed out the conversion to ->iterate_shared() for Btrfs because the delayed inode handling in btrfs_real_readdir() is racy. However, we can still do readdir in parallel if there are no delayed nodes. This is a temporary fix which upgrades the shared inode lock to an exclusive lock only when we have delayed items until we come up with a more complete solution. While we're here, rename the btrfs_{get,put}_delayed_items functions to make it very clear that they're just for readdir. Tested with xfstests and by doing a parallel kernel build: while make tinyconfig && make -j4 && git clean dqfx; do : done along with a bunch of parallel finds in another shell: while true; do for ((i=0; i<4; i++)); do find . >/dev/null & done wait done Signed-off-by: Omar Sandoval <> Signed-off-by: David Sterba <> Signed-off-by: Chris Mason <>
2016-06-25Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds2-6/+46
git:// Pull locking fix from Thomas Gleixner: "A single fix to address a race in the static key logic" * 'locking-urgent-for-linus' of git:// locking/static_key: Fix concurrent static_key_slow_inc()
2016-06-25Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds1-1/+11
git:// Pull irq fix from Thomas Gleixner: "A single fix for the fallout from the conversion of MIPS GIC to irq domains" * 'irq-urgent-for-linus' of git:// irqchip/mips-gic: Fix IRQs in gic_dev_domain
2016-06-25Merge tag 'powerpc-4.7-4' of ↵Linus Torvalds15-51/+137
git:// Pull powerpc fixes from Michael Ellerman: "mm/radix (Aneesh Kumar K.V): - Update to tlb functions ric argument - Flush page walk cache when freeing page table - Update Radix tree size as per ISA 3.0 mm/hash (Aneesh Kumar K.V): - Use the correct PPP mask when updating HPTE - Don't add memory coherence if cache inhibited is set eeh (Gavin Shan): - Fix invalid cached PE primary bus bpf/jit (Naveen N. Rao): - Disable classic BPF JIT on ppc64le .. and fix faults caused by radix patching of SLB miss handler (Michael Ellerman)" * tag 'powerpc-4.7-4' of git:// powerpc/bpf/jit: Disable classic BPF JIT on ppc64le powerpc: Fix faults caused by radix patching of SLB miss handler powerpc/eeh: Fix invalid cached PE primary bus powerpc/mm/radix: Update Radix tree size as per ISA 3.0 powerpc/mm/hash: Don't add memory coherence if cache inhibited is set powerpc/mm/hash: Use the correct PPP mask when updating HPTE powerpc/mm/radix: Flush page walk cache when freeing page table powerpc/mm/radix: Update to tlb functions ric argument
2016-06-25Fix build break in fork.c when THREAD_SIZE < PAGE_SIZEMichael Ellerman1-2/+2
Commit b235beea9e99 ("Clarify naming of thread info/stack allocators") breaks the build on some powerpc configs, where THREAD_SIZE < PAGE_SIZE: kernel/fork.c:235:2: error: implicit declaration of function 'free_thread_stack' kernel/fork.c:355:8: error: assignment from incompatible pointer type stack = alloc_thread_stack_node(tsk, node); ^ Fix it by renaming free_stack() to free_thread_stack(), and updating the return type of alloc_thread_stack_node(). Fixes: b235beea9e99 ("Clarify naming of thread info/stack allocators") Signed-off-by: Michael Ellerman <> Signed-off-by: Linus Torvalds <>
2016-06-24Merge branch 'akpm' (patches from Andrew)Linus Torvalds67-207/+178
Merge misc fixes from Andrew Morton: "Two weeks worth of fixes here" * emailed patches from Andrew Morton <>: (41 commits) init/main.c: fix initcall_blacklisted on ia64, ppc64 and parisc64 autofs: don't get stuck in a loop if vfs_write() returns an error mm/page_owner: avoid null pointer dereference tools/vm/slabinfo: fix spelling mistake: "Ocurrences" -> "Occurrences" fs/nilfs2: fix potential underflow in call to crc32_le oom, suspend: fix oom_reaper vs. oom_killer_disable race ocfs2: disable BUG assertions in reading blocks mm, compaction: abort free scanner if split fails mm: prevent KASAN false positives in kmemleak mm/hugetlb: clear compound_mapcount when freeing gigantic pages mm/swap.c: flush lru pvecs on compound page arrival memcg: css_alloc should return an ERR_PTR value on error memcg: mem_cgroup_migrate() may be called with irq disabled hugetlb: fix nr_pmds accounting with shared page tables Revert "mm: disable fault around on emulated access bit architecture" Revert "mm: make faultaround produce old ptes" mailmap: add Boris Brezillon's email mailmap: add Antoine Tenart's email mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask mm: mempool: kasan: don't poot mempool objects in quarantine ...
2016-06-24Merge tag 'for-linus' of ↵Linus Torvalds30-113/+190
git:// Pull rdma fixes from Doug Ledford: "This is the second batch of queued up rdma patches for this rc cycle. There isn't anything really major in here. It's passed 0day, linux-next, and local testing across a wide variety of hardware. There are still a few known issues to be tracked down, but this should amount to the vast majority of the rdma RC fixes. Round two of 4.7 rc fixes: - A couple minor fixes to the rdma core - Multiple minor fixes to hfi1 - Multiple minor fixes to mlx4/mlx4 - A few minor fixes to i40iw" * tag 'for-linus' of git:// (31 commits) IB/srpt: Reduce QP buffer size i40iw: Enable level-1 PBL for fast memory registration i40iw: Return correct max_fast_reg_page_list_len i40iw: Correct status check on i40iw_get_pble i40iw: Correct CQ arming IB/rdmavt: Correct qp_priv_alloc() return value test IB/hfi1: Don't zero out qp->s_ack_queue in rvt_reset_qp IB/hfi1: Fix deadlock with txreq allocation slow path IB/mlx4: Prevent cross page boundary allocation IB/mlx4: Fix memory leak if QP creation failed IB/mlx4: Verify port number in flow steering create flow IB/mlx4: Fix error flow when sending mads under SRIOV IB/mlx4: Fix the SQ size of an RC QP IB/mlx5: Fix wrong naming of port_rcv_data counter IB/mlx5: Fix post send fence logic IB/uverbs: Initialize ib_qp_init_attr with zeros IB/core: Fix false search of the IB_SA_WELL_KNOWN_GUID IB/core: Fix RoCE v1 multicast join logic issue IB/core: Fix no default GIDs when netdevice reregisters IB/hfi1: Send a pkey change event on driver pkey update ...
2016-06-24Merge branch 'for-linus' of ↵Linus Torvalds1-5/+5
git:// Pull HID fix from Jiri Kosina: "hiddev ioctl() validation fix from Scott Bauer" * 'for-linus' of git:// HID: hiddev: validate num_values for HIDIOCGUSAGES, HIDIOCSUSAGES commands
2016-06-24Merge tag 'hwmon-for-linus-v4.7-rc5' of ↵Linus Torvalds1-5/+20
git:// Pull hwmon fix from Guenter Roeck: "Improve fan type detection for dell-smm to prevent kernel hang" * tag 'hwmon-for-linus-v4.7-rc5' of git:// hwmon: (dell-smm) Cache fan_type() calls and change fan detection
2016-06-24Merge tag 'acpi-4.7-rc5' of ↵Linus Torvalds2-2/+9
git:// Pull ACPI fix from Rafael Wysocki: "Stable-candidate fix for a deadlock in ACPICA introduced during the 4.5 development cycle by a commit attempting to improve the handling of AML code that doesn't belong to any namespace objects in a given definition block (Lv Zheng)" * tag 'acpi-4.7-rc5' of git:// ACPICA: Namespace: Fix deadlock triggered by MLC support in dynamic table loading
2016-06-24Merge tag 'pm-4.7-rc5' of ↵Linus Torvalds3-17/+15
git:// Pull power management fixes from Rafael Wysocki: "Fix for a latent cpufreq driver bug uncovered by a recent ACPICA change and several fixes for the devfreq framework, including one fix for an issue introduced recently. Specifics: - Fix a latent initialization issue in the pcc-cpufreq driver (incorrect initial value of a structure field) that has been uncovered by a recent ACPICA commit (Mike Galbraith). - Add a missing notification in an update_devfreq() error code path forgotten by a recent devfreq commit (Chanwoo Choi). - Fix devfreq device frequency initialization (Lukasz Luba). - Fix an incorrect IS_ERR() check in the devfreq framework discovered by the Smatch checker (Dan Carpenter). - Drop two excessive put_device() calls from the devfreq framework (MyungJoo Ham, Cai Zhiyong). - Fix a possible memory leak in the devfreq framework and drop an unnecessary kfree() invocation from it (MyungJoo Ham)" * tag 'pm-4.7-rc5' of git:// PM / devfreq: Send the DEVFREQ_POSTCHANGE notification when target() is failed cpufreq: pcc-cpufreq: Fix doorbell.access_width PM / devfreq: fix initialization of current frequency in last status PM / devfreq: exynos-nocp: Remove incorrect IS_ERR() check PM / devfreq: remove double put_device PM / devfreq: fix double call put_device PM / devfreq: fix duplicated kfree on devfreq pointer PM / devfreq: devm_kzalloc to have dev pointer more precisely
2016-06-24Merge tag 'for-linus-4.7b-rc4-tag' of ↵Linus Torvalds4-68/+58
git:// Pull xen bug fixes from David Vrabel: - fix x86 PV dom0 crash during early boot on some hardware - fix two pciback bugs affects certain devices - fix potential overflow when clearing page tables in x86 PV * tag 'for-linus-4.7b-rc4-tag' of git:// xen-pciback: return proper values during BAR sizing x86/xen: avoid m2p lookup when setting early page table entries xen/pciback: Fix conf_space read/write overlap check. x86/xen: fix upper bound of pmd loop in xen_cleanhighmap() xen/balloon: Fix declared-but-not-defined warning
2016-06-24Merge tag 'arm64-fixes' of ↵Linus Torvalds6-8/+43
git:// Pull arm64 fixes from Will Deacon: "Here are a few more arm64 fixes, but things do finally appear to be slowing down. The main fix is avoiding hibernation in a previously unanticipated situation where we have CPUs parked in the kernel, but it's all good stuff. - Fix icache/dcache sync for anonymous pages under migration - Correct the ASID limit check - Fix parallel builds of Image and Image.gz - Refuse to hibernate when we have CPUs that we can't offline" * tag 'arm64-fixes' of git:// arm64: hibernate: Don't hibernate on systems with stuck CPUs arm64: smp: Add function to determine if cpus are stuck in the kernel arm64: mm: remove page_mapping check in __sync_icache_dcache arm64: fix boot image dependencies to not generate invalid images arm64: update ASID limit
2016-06-24init/main.c: fix initcall_blacklisted on ia64, ppc64 and parisc64Rasmus Villemoes1-1/+3
When I replaced kasprintf("%pf") with a direct call to sprint_symbol_no_offset I must have broken the initcall blacklisting feature on the arches where dereference_function_descriptor() is non-trivial. Fixes: c8cdd2be213f (init/main.c: simplify initcall_blacklisted()) Link: Signed-off-by: Rasmus Villemoes <> Cc: Yang Shi <> Cc: Prarit Bhargava <> Cc: Petr Mladek <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2016-06-24autofs: don't get stuck in a loop if vfs_write() returns an errorAndrey Vagin1-3/+4
__vfs_write() returns a negative value in a error case. Link: Signed-off-by: Andrey Vagin <> Signed-off-by: Ian Kent <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2016-06-24mm/page_owner: avoid null pointer dereferenceSudip Mukherjee1-2/+4
We have dereferenced page_ext before checking it. Lets check it first and then used it. Fixes: f86e4271978b ("mm: check the return value of lookup_page_ext for all call sites") Link: Signed-off-by: Sudip Mukherjee <> Acked-by: Vlastimil Babka <> Cc: Joonsoo Kim <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2016-06-24tools/vm/slabinfo: fix spelling mistake: "Ocurrences" -> "Occurrences"Colin Ian King1-1/+1
trivial fix to spelling mistake Link: Signed-off-by: Colin Ian King <> Acked-by: Christoph Lameter <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2016-06-24fs/nilfs2: fix potential underflow in call to crc32_leTorsten Hilbrich1-1/+1
The value `bytes' comes from the filesystem which is about to be mounted. We cannot trust that the value is always in the range we expect it to be. Check its value before using it to calculate the length for the crc32_le call. It value must be larger (or equal) sumoff + 4. This fixes a kernel bug when accidentially mounting an image file which had the nilfs2 magic value 0x3434 at the right offset 0x406 by chance. The bytes 0x01 0x00 were stored at 0x408 and were interpreted as a s_bytes value of 1. This caused an underflow when substracting sumoff + 4 (20) in the call to crc32_le. BUG: unable to handle kernel paging request at ffff88021e600000 IP: crc32_le+0x36/0x100 ... Call Trace: nilfs_valid_sb.part.5+0x52/0x60 [nilfs2] nilfs_load_super_block+0x142/0x300 [nilfs2] init_nilfs+0x60/0x390 [nilfs2] nilfs_mount+0x302/0x520 [nilfs2] mount_fs+0x38/0x160 vfs_kern_mount+0x67/0x110 do_mount+0x269/0xe00 SyS_mount+0x9f/0x100 entry_SYSCALL_64_fastpath+0x16/0x71 Link: Signed-off-by: Torsten Hilbrich <> Tested-by: Torsten Hilbrich <> Signed-off-by: Ryusuke Konishi <> Cc: <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2016-06-24oom, suspend: fix oom_reaper vs. oom_killer_disable raceMichal Hocko1-0/+12
Tetsuo has reported the following potential oom_killer_disable vs. oom_reaper race: (1) freeze_processes() starts freezing user space threads. (2) Somebody (maybe a kenrel thread) calls out_of_memory(). (3) The OOM killer calls mark_oom_victim() on a user space thread P1 which is already in __refrigerator(). (4) oom_killer_disable() sets oom_killer_disabled = true. (5) P1 leaves __refrigerator() and enters do_exit(). (6) The OOM reaper calls exit_oom_victim(P1) before P1 can call exit_oom_victim(P1). (7) oom_killer_disable() returns while P1 not yet finished (8) P1 perform IO/interfere with the freezer. This situation is unfortunate. We cannot move oom_killer_disable after all the freezable kernel threads are frozen because the oom victim might depend on some of those kthreads to make a forward progress to exit so we could deadlock. It is also far from trivial to teach the oom_reaper to not call exit_oom_victim() because then we would lose a guarantee of the OOM killer and oom_killer_disable forward progress because exit_mm->mmput might block and never call exit_oom_victim. It seems the easiest way forward is to workaround this race by calling try_to_freeze_tasks again after oom_killer_disable. This will make sure that all the tasks are frozen or it bails out. Fixes: 449d777d7ad6 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper") Link: Signed-off-by: Michal Hocko <> Reported-by: Tetsuo Handa <> Cc: "Rafael J. Wysocki" <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2016-06-24ocfs2: disable BUG assertions in reading blocksGang He2-2/+5
According to some high-load testing, these two BUG assertions were encountered, this led system panic. Actually, there were some discussions about removing these two BUG() assertions, it would not bring any side effect. Then, I did the the following changes, 1) use the existing macro CATCH_BH_JBD_RACES to wrap BUG() in the ocfs2_read_blocks_sync function like before. 2) disable the macro CATCH_BH_JBD_RACES in Makefile by default. Link: Signed-off-by: Gang He <> Cc: Mark Fasheh <> Cc: Joel Becker <> Cc: Junxiao Bi <> Cc: Joseph Qi <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2016-06-24mm, compaction: abort free scanner if split failsDavid Rientjes1-18/+21
If the memory compaction free scanner cannot successfully split a free page (only possible due to per-zone low watermark), terminate the free scanner rather than continuing to scan memory needlessly. If the watermark is insufficient for a free page of order <= cc->order, then terminate the scanner since all future splits will also likely fail. This prevents the compaction freeing scanner from scanning all memory on very large zones (very noticeable for zones > 128GB, for instance) when all splits will likely fail while holding zone->lock. compaction_alloc() iterating a 128GB zone has been benchmarked to take over 400ms on some systems whereas any free page isolated and ready to be split ends up failing in split_free_page() because of the low watermark check and thus the iteration continues. The next time compaction occurs, the freeing scanner will likely start at the end of the zone again since no success was made previously and we get the same lengthy iteration until the zone is brought above the low watermark. All thp page faults can take >400ms in such a state without this fix. Link: Signed-off-by: David Rientjes <> Acked-by: Vlastimil Babka <> Cc: Minchan Kim <> Cc: Joonsoo Kim <> Cc: Mel Gorman <> Cc: Hugh Dickins <> Cc: <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2016-06-24mm: prevent KASAN false positives in kmemleakDmitry Vyukov1-0/+2
When kmemleak dumps contents of leaked objects it reads whole objects regardless of user-requested size. This upsets KASAN. Disable KASAN checks around object dump. Link: Signed-off-by: Dmitry Vyukov <> Acked-by: Catalin Marinas <> Cc: Andrey Ryabinin <> Cc: Alexander Potapenko <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2016-06-24mm/hugetlb: clear compound_mapcount when freeing gigantic pagesGerald Schaefer1-0/+1
While working on s390 support for gigantic hugepages I ran into the following "Bad page state" warning when freeing gigantic pages: BUG: Bad page state in process bash pfn:580001 page:000003d116000040 count:0 mapcount:0 mapping:ffffffff00000000 index:0x0 flags: 0x7fffc0000000000() page dumped because: non-NULL mapping This is because page->compound_mapcount, which is part of a union with page->mapping, is initialized with -1 in prep_compound_gigantic_page(), and not cleared again during destroy_compound_gigantic_page(). Fix this by clearing the compound_mapcount in destroy_compound_gigantic_page() before clearing compound_head. Interestingly enough, the warning will not show up on x86_64, although this should not be architecture specific. Apparently there is an endianness issue, combined with the fact that the union contains both a 64 bit ->mapping pointer and a 32 bit atomic_t ->compound_mapcount as members. The resulting bogus page->mapping on x86_64 therefore contains 00000000ffffffff instead of ffffffff00000000 on s390, which will falsely trigger the PageAnon() check in free_pages_prepare() because page->mapping & PAGE_MAPPING_ANON is true on little-endian architectures like x86_64 in this case (the page is not compound anymore, ->compound_head was already cleared before). As a result, page->mapping will be cleared before doing the checks in free_pages_check(). Not sure if the bogus "PageAnon() returning true" on x86_64 for the first tail page of a gigantic page (at this stage) has other theoretical implications, but they would also be fixed with this patch. Link: Signed-off-by: Gerald Schaefer <> Reviewed-by: Mike Kravetz <> Cc: Luiz Capitulino <> Cc: Naoya Horiguchi <> Cc: Hillf Danton <> Cc: "Kirill A . Shutemov" <> Cc: Dave Hansen <> Cc: Paul Gortmaker <> Cc: "Aneesh Kumar K . V" <> Cc: Martin Schwidefsky <> Cc: Heiko Carstens <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2016-06-24mm/swap.c: flush lru pvecs on compound page arrivalLukasz Odzioba1-6/+5
Currently we can have compound pages held on per cpu pagevecs, which leads to a lot of memory unavailable for reclaim when needed. In the systems with hundreads of processors it can be GBs of memory. On of the way of reproducing the problem is to not call munmap explicitly on all mapped regions (i.e. after receiving SIGTERM). After that some pages (with THP enabled also huge pages) may end up on lru_add_pvec, example below. void main() { #pragma omp parallel { size_t size = 55 * 1000 * 1000; // smaller than MEM/CPUS void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS , -1, 0); if (p != MAP_FAILED) memset(p, 0, size); //munmap(p, size); // uncomment to make the problem go away } } When we run it with THP enabled it will leave significant amount of memory on lru_add_pvec. This memory will be not reclaimed if we hit OOM, so when we run above program in a loop: for i in `seq 100`; do ./a.out; done many processes (95% in my case) will be killed by OOM. The primary point of the LRU add cache is to save the zone lru_lock contention with a hope that more pages will belong to the same zone and so their addition can be batched. The huge page is already a form of batched addition (it will add 512 worth of memory in one go) so skipping the batching seems like a safer option when compared to a potential excess in the caching which can be quite large and much harder to fix because lru_add_drain_all is way to expensive and it is not really clear what would be a good moment to call it. Similarly we can reproduce the problem on lru_deactivate_pvec by adding: madvise(p, size, MADV_FREE); after memset. This patch flushes lru pvecs on compound page arrival making the problem less severe - after applying it kill rate of above example drops to 0%, due to reducing maximum amount of memory held on pvec from 28MB (with THP) to 56kB per CPU. Suggested-by: Michal Hocko <> Link: Signed-off-by: Lukasz Odzioba <> Acked-by: Michal Hocko <> Cc: Kirill Shutemov <> Cc: Andrea Arcangeli <> Cc: Vladimir Davydov <> Cc: Ming Li <> Cc: Minchan Kim <> Cc: <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2016-06-24memcg: css_alloc should return an ERR_PTR value on errorTejun Heo1-1/+1
mem_cgroup_css_alloc() was returning NULL on failure while cgroup core expected it to return an ERR_PTR value leading to the following NULL deref after a css allocation failure. Fix it by return ERR_PTR(-ENOMEM) instead. I'll also update cgroup core so that it can handle NULL returns. mkdir: page allocation failure: order:6, mode:0x240c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO) CPU: 0 PID: 8738 Comm: mkdir Not tainted 4.7.0-rc3+ #123 ... Call Trace: dump_stack+0x68/0xa1 warn_alloc_failed+0xd6/0x130 __alloc_pages_nodemask+0x4c6/0xf20 alloc_pages_current+0x66/0xe0 alloc_kmem_pages+0x14/0x80 kmalloc_order_trace+0x2a/0x1a0 __kmalloc+0x291/0x310 memcg_update_all_caches+0x6c/0x130 mem_cgroup_css_alloc+0x590/0x610 cgroup_apply_control_enable+0x18b/0x370 cgroup_mkdir+0x1de/0x2e0 kernfs_iop_mkdir+0x55/0x80 vfs_mkdir+0xb9/0x150 SyS_mkdir+0x66/0xd0 do_syscall_64+0x53/0x120 entry_SYSCALL64_slow_path+0x25/0x25 ... BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0 IP: init_and_link_css+0x37/0x220 PGD 34b1e067 PUD 3a109067 PMD 0 Oops: 0002 [#1] SMP Modules linked in: CPU: 0 PID: 8738 Comm: mkdir Not tainted 4.7.0-rc3+ #123 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.2-20160422_131301-anatol 04/01/2014 task: ffff88007cbc5200 ti: ffff8800666d4000 task.ti: ffff8800666d4000 RIP: 0010:[<ffffffff810f2ca7>] [<ffffffff810f2ca7>] init_and_link_css+0x37/0x220 RSP: 0018:ffff8800666d7d90 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffffffff810f2499 RSI: 0000000000000000 RDI: 0000000000000008 RBP: ffff8800666d7db8 R08: 0000000000000003 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: ffff88005a5fb400 R13: ffffffff81f0f8a0 R14: ffff88005a5fb400 R15: 0000000000000010 FS: 00007fc944689700(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f3aed0d2b80 CR3: 000000003a1e8000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: cgroup_apply_control_enable+0x1ac/0x370 cgroup_mkdir+0x1de/0x2e0 kernfs_iop_mkdir+0x55/0x80 vfs_mkdir+0xb9/0x150 SyS_mkdir+0x66/0xd0 do_syscall_64+0x53/0x120 entry_SYSCALL64_slow_path+0x25/0x25 Code: 89 f5 48 89 fb 49 89 d4 48 83 ec 08 8b 05 72 3b d8 00 85 c0 0f 85 60 01 00 00 4c 89 e7 e8 72 f7 ff ff 48 8d 7b 08 48 89 d9 31 c0 <48> c7 83 d0 00 00 00 00 00 00 00 48 83 e7 f8 48 29 f9 81 c1 d8 RIP init_and_link_css+0x37/0x220 RSP <ffff8800666d7d90> CR2: 00000000000000d0 ---[ end trace a2d8836ae1e852d1 ]--- Link: Signed-off-by: Tejun Heo <> Reported-by: Johannes Weiner <> Reviewed-by: Vladimir Davydov <> Acked-by: Johannes Weiner <> Acked-by: Michal Hocko <> Cc: <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2016-06-24memcg: mem_cgroup_migrate() may be called with irq disabledTejun Heo1-2/+3
mem_cgroup_migrate() uses local_irq_disable/enable() but can be called with irq disabled from migrate_page_copy(). This ends up enabling irq while holding a irq context lock triggering the following lockdep warning. Fix it by using irq_save/restore instead. ================================= [ INFO: inconsistent lock state ] 4.7.0-rc1+ #52 Tainted: G W --------------------------------- inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. kcompactd0/151 [HC0[0]:SC0[0]:HE1:SE1] takes: (&(&ctx->completion_lock)->rlock){+.?.-.}, at: [<000000000038fd96>] aio_migratepage+0x156/0x1e8 {IN-SOFTIRQ-W} state was registered at: __lock_acquire+0x5b6/0x1930 lock_acquire+0xee/0x270 _raw_spin_lock_irqsave+0x66/0xb0 aio_complete+0x98/0x328 dio_complete+0xe4/0x1e0 blk_update_request+0xd4/0x450 scsi_end_request+0x48/0x1c8 scsi_io_completion+0x272/0x698 blk_done_softirq+0xca/0xe8 __do_softirq+0xc8/0x518 irq_exit+0xee/0x110 do_IRQ+0x6a/0x88 io_int_handler+0x11a/0x25c __mutex_unlock_slowpath+0x144/0x1d8 __mutex_unlock_slowpath+0x140/0x1d8 kernfs_iop_permission+0x64/0x80 __inode_permission+0x9e/0xf0 link_path_walk+0x6e/0x510 path_lookupat+0xc4/0x1a8 filename_lookup+0x9c/0x160 user_path_at_empty+0x5c/0x70 SyS_readlinkat+0x68/0x140 system_call+0xd6/0x270 irq event stamp: 971410 hardirqs last enabled at (971409): migrate_page_move_mapping+0x3ea/0x588 hardirqs last disabled at (971410): _raw_spin_lock_irqsave+0x3c/0xb0 softirqs last enabled at (970526): __do_softirq+0x460/0x518 softirqs last disabled at (970519): irq_exit+0xee/0x110 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&ctx->completion_lock)->rlock); <Interrupt> lock(&(&ctx->completion_lock)->rlock); *** DEADLOCK *** 3 locks held by kcompactd0/151: #0: (&(&mapping->private_lock)->rlock){+.+.-.}, at: aio_migratepage+0x42/0x1e8 #1: (&ctx->ring_lock){+.+.+.}, at: aio_migratepage+0x5a/0x1e8 #2: (&(&ctx->completion_lock)->rlock){+.?.-.}, at: aio_migratepage+0x156/0x1e8 stack backtrace: CPU: 20 PID: 151 Comm: kcompactd0 Tainted: G W 4.7.0-rc1+ #52 Call Trace: show_trace+0xea/0xf0 show_stack+0x72/0xf0 dump_stack+0x9a/0xd8 print_usage_bug.part.27+0x2d4/0x2e8 mark_lock+0x17e/0x758 mark_held_locks+0xa2/0xd0 trace_hardirqs_on_caller+0x140/0x1c0 mem_cgroup_migrate+0x266/0x370 aio_migratepage+0x16a/0x1e8 move_to_new_page+0xb0/0x260 migrate_pages+0x8f4/0x9f0 compact_zone+0x4dc/0xdc8 kcompactd_do_work+0x1aa/0x358 kcompactd+0xba/0x2c8 kthread+0x10a/0x110 kernel_thread_starter+0x6/0xc kernel_thread_starter+0x0/0xc INFO: lockdep is turned off. Link: Link: Fixes: 74485cf2bc85 ("mm: migrate: consolidate mem_cgroup_migrate() calls") Signed-off-by: Tejun Heo <> Reported-by: Christian Borntraeger <> Acked-by: Johannes Weiner <> Acked-by: Michal Hocko <> Reviewed-by: Vladimir Davydov <> Cc: <> [4.5+] Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2016-06-24hugetlb: fix nr_pmds accounting with shared page tablesKirill A. Shutemov1-2/+1
We account HugeTLB's shared page table to all processes who share it. The accounting happens during huge_pmd_share(). If somebody populates pud entry under us, we should decrease pagetable's refcount and decrease nr_pmds of the process. By mistake, I increase nr_pmds again in this case. :-/ It will lead to "BUG: non-zero nr_pmds on freeing mm: 2" on process' exit. Let's fix this by increasing nr_pmds only when we're sure that the page table will be used. Link: Fixes: dc6c9a35b66b ("mm: account pmd page tables to the process") Signed-off-by: Kirill A. Shutemov <> Reported-by: zhongjiang <> Reviewed-by: Mike Kravetz <> Acked-by: Michal Hocko <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2016-06-24Revert "mm: disable fault around on emulated access bit architecture"Kirill A. Shutemov1-8/+0
This reverts commit d0834a6c2c5b0c76cfb806bd7dba6556d8b4edbb. After revert of 5c0a85fad949 ("mm: make faultaround produce old ptes") faultaround doesn't have dependencies on hardware accessed bit, so let's revert this one too. Link: Signed-off-by: Kirill A. Shutemov <> Reported-by: "Huang, Ying" <> Cc: Linus Torvalds <> Cc: Rik van Riel <> Cc: Mel Gorman <> Cc: Michal Hocko <> Cc: Minchan Kim <> Cc: Vinayak Menon <> Cc: Dave Hansen <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2016-06-24Revert "mm: make faultaround produce old ptes"Kirill A. Shutemov3-20/+7
This reverts commit 5c0a85fad949212b3e059692deecdeed74ae7ec7. The commit causes ~6% regression in unixbench. Let's revert it for now and consider other solution for reclaim problem later. Link: Signed-off-by: Kirill A. Shutemov <> Reported-by: "Huang, Ying" <> Cc: Linus Torvalds <> Cc: Rik van Riel <> Cc: Mel Gorman <> Cc: Michal Hocko <> Cc: Minchan Kim <> Cc: Vinayak Menon <> Cc: Dave Hansen <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2016-06-24mailmap: add Boris Brezillon's emailAntoine Tenart1-0/+3
There are different versions of Boris' name and email in the log, and one typo. Add his emails in mailmap to have all of his contributions under the same name/email tuple. Link: Signed-off-by: Antoine Tenart <> Acked-by: Boris Brezillon <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2016-06-24mailmap: add Antoine Tenart's emailAntoine Tenart1-0/+1
I used "Antoine Ténart" at first but then moved to a name without accent as this cause some issues from time to time... Add my email in the mailmap file to have a consistent shortlog output. Link: Signed-off-by: Antoine Tenart <> Cc: Antoine Tenart <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2016-06-24mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim maskMel Gorman1-1/+2
Commit d0164adc89f6 ("mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd") modified __GFP_WAIT to explicitly identify the difference between atomic callers and those that were unwilling to sleep. Later the definition was removed entirely. The GFP_RECLAIM_MASK is the set of flags that affect watermark checking and reclaim behaviour but __GFP_ATOMIC was never added. Without it, atomic users of the slab allocator strip the __GFP_ATOMIC flag and cannot access the page allocator atomic reserves. This patch addresses the problem. The user-visible impact depends on the workload but potentially atomic allocations unnecessarily fail without this path. Link: Signed-off-by: Mel Gorman <> Reported-by: Marcin Wojtas <> Acked-by: Vlastimil Babka <> Acked-by: Michal Hocko <> Cc: <> [4.4+] Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2016-06-24mm: mempool: kasan: don't poot mempool objects in quarantineAndrey Ryabinin3-15/+14
Currently we may put reserved by mempool elements into quarantine via kasan_kfree(). This is totally wrong since quarantine may really free these objects. So when mempool will try to use such element, use-after-free will happen. Or mempool may decide that it no longer need that element and double-free it. So don't put object into quarantine in kasan_kfree(), just poison it. Rename kasan_kfree() to kasan_poison_kfree() to respect that. Also, we shouldn't use kasan_slab_alloc()/kasan_krealloc() in kasan_unpoison_element() because those functions may update allocation stacktrace. This would be wrong for the most of the remove_element call sites. (The only call site where we may want to update alloc stacktrace is in mempool_alloc(). Kmemleak solves this by calling kmemleak_update_trace(), so we could make something like that too. But this is out of scope of this patch). Fixes: 55834c59098d ("mm: kasan: initial memory quarantine implementation") Link: Signed-off-by: Andrey Ryabinin <> Reported-by: Kuthonuzo Luruo <> Acked-by: Alexander Potapenko <> Cc: Dmitriy Vyukov <> Cc: Kostya Serebryany <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2016-06-24MAINTAINERS: update Calgary IOMMUJon Mason1-3/+3
Update the contact info for Muli, clean-up my name, and update the mailing list to the IOMMU mailing list. Link: Signed-off-by: Jon Mason <> Cc: Muli Ben-Yehuda <> Cc: Greg Kroah-Hartman <> Cc: Krzysztof Kozlowski <> Cc: Thomas Gleixner <> Cc: Ingo Molnar <> Cc: "H. Peter Anvin" <> Cc: Bartlomiej Zolnierkiewicz <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2016-06-24jbd2: get rid of superfluous __GFP_REPEATMichal Hocko1-25/+7
jbd2_alloc is explicit about its allocation preferences wrt. the allocation size. Sub page allocations go to the slab allocator and larger are using either the page allocator or vmalloc. This is all good but the logic is unnecessarily complex. 1) as per Ted, the vmalloc fallback is a left-over: : jbd2_alloc is only passed in the bh->b_size, which can't be PAGE_SIZE, so : the code path that calls vmalloc() should never get called. When we : conveted jbd2_alloc() to suppor sub-page size allocations in commit : d2eecb039368, there was an assumption that it could be called with a size : greater than PAGE_SIZE, but that's certaily not true today. Moreover vmalloc allocation might even lead to a deadlock because the callers expect GFP_NOFS context while vmalloc is GFP_KERNEL. 2) __GFP_REPEAT for requests <= PAGE_ALLOC_COSTLY_ORDER is ignored since the flag was introduced. Let's simplify the code flow and use the slab allocator for sub-page requests and the page allocator for others. Even though order > 0 is not currently used as per above leave that option open. Link: Signed-off-by: Michal Hocko <> Reviewed-by: Jan Kara <> Cc: "Theodore Ts'o" <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2016-06-24unicore32: get rid of superfluous __GFP_REPEATMichal Hocko1-1/+1
__GFP_REPEAT has a rather weak semantic but since it has been introduced around 2.6.12 it has been ignored for low order allocations. PGALLOC_GFP uses __GFP_REPEAT but it is only used in pte_alloc_one, pte_alloc_one_kernel which does order-0 request. This means that this flag has never been actually useful here because it has always been used only for PAGE_ALLOC_COSTLY requests. Link: Signed-off-by: Michal Hocko <> Cc: Guan Xuetao <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>