path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2014-10-02Merge branch 'akpm' (fixes from Andrew Morton)Linus Torvalds2-9/+34
Merge fixes from Andrew Morton: "5 fixes" * emailed patches from Andrew Morton <>: mm: page_alloc: fix zone allocation fairness on UP perf: fix perf bug in fork() MAINTAINERS: change git URL for mpc5xxx tree mm: memcontrol: do not iterate uninitialized memcgs ocfs2/dlm: should put mle when goto kill in dlm_assert_master_handler
2014-10-02mm: page_alloc: fix zone allocation fairness on UPJohannes Weiner1-4/+3
The zone allocation batches can easily underflow due to higher-order allocations or spills to remote nodes. On SMP that's fine, because underflows are expected from concurrency and dealt with by returning 0. But on UP, zone_page_state will just return a wrapped unsigned long, which will get past the <= 0 check and then consider the zone eligible until its watermarks are hit. Commit 3a025760fc15 ("mm: page_alloc: spill to remote nodes before waking kswapd") already made the counter-resetting use atomic_long_read() to accomodate underflows from remote spills, but it didn't go all the way with it. Make it clear that these batches are expected to go negative regardless of concurrency, and use atomic_long_read() everywhere. Fixes: 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") Reported-by: Vlastimil Babka <> Reported-by: Leon Romanovsky <> Signed-off-by: Johannes Weiner <> Acked-by: Mel Gorman <> Cc: <> [3.12+] Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-10-02mm: memcontrol: do not iterate uninitialized memcgsJohannes Weiner1-5/+31
The cgroup iterators yield css objects that have not yet gone through css_online(), but they are not complete memcgs at this point and so the memcg iterators should not return them. Commit d8ad30559715 ("mm/memcg: iteration skip memcgs not yet fully initialized") set out to implement exactly this, but it uses CSS_ONLINE, a cgroup-internal flag that does not meet the ordering requirements for memcg, and so the iterator may skip over initialized groups, or return partially initialized memcgs. The cgroup core can not reasonably provide a clear answer on whether the object around the css has been fully initialized, as that depends on controller-specific locking and lifetime rules. Thus, introduce a memcg-specific flag that is set after the memcg has been initialized in css_online(), and read before mem_cgroup_iter() callers access the memcg members. Signed-off-by: Johannes Weiner <> Cc: Tejun Heo <> Acked-by: Michal Hocko <> Cc: Hugh Dickins <> Cc: Peter Zijlstra <> Cc: <> [3.12+] Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-10-02mm: numa: Do not mark PTEs pte_numa when splitting huge pagesMel Gorman1-2/+5
This patch reverts 1ba6e0b50b ("mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte"). If a huge page is being split due a protection change and the tail will be in a PROT_NONE vma then NUMA hinting PTEs are temporarily created in the protected VMA. VM_RW|VM_PROTNONE |-----------------| ^ split here In the specific case above, it should get fixed up by change_pte_range() but there is a window of opportunity for weirdness to happen. Similarly, if a huge page is shrunk and split during a protection update but before pmd_numa is cleared then a pte_numa can be left behind. Instead of adding complexity trying to deal with the case, this patch will not mark PTEs NUMA when splitting a huge page. NUMA hinting faults will not be triggered which is marginal in comparison to the complexity in dealing with the corner cases during THP split. Cc: Signed-off-by: Mel Gorman <> Acked-by: Rik van Riel <> Acked-by: Kirill A. Shutemov <> Signed-off-by: Linus Torvalds <>
2014-10-02mm: migrate: Close race between migration completion and mprotectMel Gorman1-1/+4
A migration entry is marked as write if pte_write was true at the time the entry was created. The VMA protections are not double checked when migration entries are being removed as mprotect marks write-migration-entries as read. It means that potentially we take a spurious fault to mark PTEs write again but it's straight-forward. However, there is a race between write migrations being marked read and migrations finishing. This potentially allows a PTE to be write that should have been read. Close this race by double checking the VMA permissions using maybe_mkwrite when migration completes. [ use maybe_mkwrite] Cc: Signed-off-by: Mel Gorman <> Acked-by: Rik van Riel <> Signed-off-by: Linus Torvalds <>
2014-09-27Merge branch 'for-linus' of ↵Linus Torvalds2-6/+12
git:// Pull vfs fixes from Al Viro: "Assorted fixes + unifying __d_move() and __d_materialise_dentry() + minimal regression fix for d_path() of victims of overwriting rename() ported on top of that" * 'for-linus' of git:// vfs: Don't exchange "short" filenames unconditionally. fold swapping ->d_name.hash into switch_names() fold unlocking the children into dentry_unlock_parents_for_move() kill __d_materialise_dentry() __d_materialise_dentry(): flip the order of arguments __d_move(): fold manipulations with ->d_child/->d_subdirs don't open-code d_rehash() in d_materialise_unique() pull rehashing and unlocking the target dentry into __d_materialise_dentry() ufs: deal with nfsd/iget races fuse: honour max_read and max_write in direct_io mode shmem: fix nlink for rename overwrite directory
2014-09-27Merge branch 'for-3.17-fixes' of ↵Linus Torvalds1-2/+2
git:// Pull cgroup fixes from Tejun Heo: "This is quite late but these need to be backported anyway. This is the fix for a long-standing cpuset bug which existed from 2009. cpuset makes use of PF_SPREAD_{PAGE|SLAB} flags to modify the task's memory allocation behavior according to the settings of the cpuset it belongs to; unfortunately, when those flags have to be changed, cpuset did so directly even whlie the target task is running, which is obviously racy as task->flags may be modified by the task itself at any time. This obscure bug manifested as corrupt PF_USED_MATH flag leading to a weird crash. The bug is fixed by moving the flag to task->atomic_flags. The first two are prepatory ones to help defining atomic_flags accessors and the third one is the actual fix" * 'for-3.17-fixes' of git:// cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flags sched: add macros to define bitops for task atomic flags sched: fix confusing PFA_NO_NEW_PRIVS constant
2014-09-26fuse: honour max_read and max_write in direct_io modeMiklos Szeredi1-5/+9
The third argument of fuse_get_user_pages() "nbytesp" refers to the number of bytes a caller asked to pack into fuse request. This value may be lesser than capacity of fuse request or iov_iter. So fuse_get_user_pages() must ensure that *nbytesp won't grow. Now, when helper iov_iter_get_pages() performs all hard work of extracting pages from iov_iter, it can be done by passing properly calculated "maxsize" to the helper. The other caller of iov_iter_get_pages() (dio_refill_pages()) doesn't need this capability, so pass LONG_MAX as the maxsize argument here. Fixes: c9c37e2e6378 ("fuse: switch to iov_iter_get_pages()") Reported-by: Werner Baumann <> Tested-by: Maxim Patlasov <> Signed-off-by: Miklos Szeredi <> Signed-off-by: Al Viro <>
2014-09-26shmem: fix nlink for rename overwrite directoryMiklos Szeredi1-1/+3
If overwriting an empty directory with rename, then need to drop the extra nlink. Test prog: #include <stdio.h> #include <fcntl.h> #include <err.h> #include <sys/stat.h> int main(void) { const char *test_dir1 = "test-dir1"; const char *test_dir2 = "test-dir2"; int res; int fd; struct stat statbuf; res = mkdir(test_dir1, 0777); if (res == -1) err(1, "mkdir(\"%s\")", test_dir1); res = mkdir(test_dir2, 0777); if (res == -1) err(1, "mkdir(\"%s\")", test_dir2); fd = open(test_dir2, O_RDONLY); if (fd == -1) err(1, "open(\"%s\")", test_dir2); res = rename(test_dir1, test_dir2); if (res == -1) err(1, "rename(\"%s\", \"%s\")", test_dir1, test_dir2); res = fstat(fd, &statbuf); if (res == -1) err(1, "fstat(%i)", fd); if (statbuf.st_nlink != 0) { fprintf(stderr, "nlink is %lu, should be 0\n", statbuf.st_nlink); return 1; } return 0; } Signed-off-by: Miklos Szeredi <> Cc: Signed-off-by: Al Viro <>
2014-09-26mm: softdirty: keep bit when zapping file ptePeter Feiner1-1/+1
This fixes the same bug as b43790eedd31 ("mm: softdirty: don't forget to save file map softdiry bit on unmap") and 9aed8614af5a ("mm/memory.c: don't forget to set softdirty on file mapped fault") where the return value of pte_*mksoft_dirty was being ignored. To be sure that no other pte/pmd "mk" function return values were being ignored, I annotated the functions in arch/x86/include/asm/pgtable.h with __must_check and rebuilt. The userspace effect of this bug is that the softdirty mark might be lost if a file mapped pte get zapped. Signed-off-by: Peter Feiner <> Acked-by: Cyrill Gorcunov <> Cc: Pavel Emelyanov <> Cc: Jamie Liu <> Cc: Hugh Dickins <> Cc: <> [3.12+] Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-09-26mm, slab: initialize object alignment on cache creationDavid Rientjes1-9/+2
Since commit 4590685546a3 ("mm/sl[aou]b: Common alignment code"), the "ralign" automatic variable in __kmem_cache_create() may be used as uninitialized. The proper alignment defaults to BYTES_PER_WORD and can be overridden by SLAB_RED_ZONE or the alignment specified by the caller. This fixes Signed-off-by: David Rientjes <> Reported-by: Andrei Elovikov <> Acked-by: Christoph Lameter <> Cc: Pekka Enberg <> Cc: Joonsoo Kim <> Cc: <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-09-24cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flagsZefan Li1-2/+2
When we change cpuset.memory_spread_{page,slab}, cpuset will flip PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset. This should be done using atomic bitops, but currently we don't, which is broken. Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened when one thread tried to clear PF_USED_MATH while at the same time another thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on the same task. Here's the full report: To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags. v4: - updated mm/slab.c. (Fengguang Wu) - updated Documentation. Cc: Peter Zijlstra <> Cc: Ingo Molnar <> Cc: Miao Xie <> Cc: Kees Cook <> Fixes: 950592f7b991 ("cpusets: update tasks' page/slab spread flags in time") Cc: <> # 2.6.31+ Reported-by: Tetsuo Handa <> Signed-off-by: Zefan Li <> Signed-off-by: Tejun Heo <>
2014-09-22Merge tag 'for-linus' of git:// Torvalds1-0/+2
Pull KVM fixes from Paolo Bonzini: "Two very simple bugfixes, affecting all supported architectures" [ Two? There's three commits in here. Oh well, I guess Paolo didn't count the preparatory symbol export ] * tag 'for-linus' of git:// KVM: correct null pid check in kvm_vcpu_yield_to() KVM: check for !is_zero_pfn() in kvm_is_mmio_pfn() mm: export symbol dependencies of is_zero_pfn()
2014-09-18Fix unbalanced mutex in dma_pool_create().Krzysztof Hałasa1-1/+1
dma_pool_create() needs to unlock the mutex in error case. The bug was introduced in the 3.16 by commit cc6b664aa26d ("mm/dmapool.c: remove redundant NULL check for dev in dma_pool_create()")/ Signed-off-by: Krzysztof Hałasa <> Cc: # v3.16 Signed-off-by: Linus Torvalds <>
2014-09-14mm: export symbol dependencies of is_zero_pfn()Ard Biesheuvel1-0/+2
In order to make the static inline function is_zero_pfn() callable by modules, export its symbol dependencies 'zero_pfn' and (for s390 and mips) 'zero_page_mask'. We need this for KVM, as CONFIG_KVM is a tristate for all supported architectures except ARM and arm64, and testing a pfn whether it refers to the zero page is required to correctly distinguish the zero page from other special RAM ranges that may also have the PG_reserved bit set, but need to be treated as MMIO memory. Signed-off-by: Ard Biesheuvel <> Acked-by: Andrew Morton <> Signed-off-by: Paolo Bonzini <>
2014-09-10mm/mmap.c: use pr_emerg when printing BUG related informationSasha Levin1-8/+8
Make sure we actually see the output of validate_mm() and browse_rb() before triggering a BUG(). pr_info isn't shown by default so the reason for the BUG() isn't obvious. Signed-off-by: Sasha Levin <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-09-10mem-hotplug: let memblock skip the hotpluggable memory regions in ↵Xishi Qiu2-0/+6
__next_mem_range() Let memblock skip the hotpluggable memory regions in __next_mem_range(), it is used to to prevent memblock from allocating hotpluggable memory for the kernel at early time. The code is the same as __next_mem_range_rev(). Clear hotpluggable flag before releasing free pages to the buddy allocator. If we don't clear hotpluggable flag in free_low_memory_core_early(), the memory which marked hotpluggable flag will not free to buddy allocator. Because __next_mem_range() will skip them. free_low_memory_core_early for_each_free_mem_range for_each_mem_range __next_mem_range [ fix warning] Signed-off-by: Xishi Qiu <> Cc: Tejun Heo <> Cc: Tang Chen <> Cc: Zhang Yanfei <> Cc: Wen Congyang <> Cc: "Rafael J. Wysocki" <> Cc: "H. Peter Anvin" <> Cc: Wu Fengguang <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-09-07Merge branch 'for-3.17-fixes' of ↵Linus Torvalds2-6/+18
git:// Pull percpu fixes from Tejun Heo: "One patch to fix a failure path in the alloc path. The bug is dangerous but probably not too likely to actually trigger in the wild given that there hasn't been any report yet. The other two are low impact fixes" * 'for-3.17-fixes' of git:// percpu: free percpu allocation info for uniprocessor system percpu: perform tlb flush after pcpu_map_pages() failure percpu: fix pcpu_alloc_pages() failure path
2014-09-05mm: memcontrol: revert use of root_mem_cgroup res_counterJohannes Weiner1-25/+78
Dave Hansen reports a massive scalability regression in an uncontained page fault benchmark with more than 30 concurrent threads, which he bisected down to 05b843012335 ("mm: memcontrol: use root_mem_cgroup res_counter") and pin-pointed on res_counter spinlock contention. That change relied on the per-cpu charge caches to mostly swallow the res_counter costs, but it's apparent that the caches don't scale yet. Revert memcg back to bypassing res_counters on the root level in order to restore performance for uncontained workloads. Reported-by: Dave Hansen <> Signed-off-by: Johannes Weiner <> Tested-by: Dave Hansen <> Acked-by: Michal Hocko <> Acked-by: Vladimir Davydov <> Signed-off-by: Linus Torvalds <>
2014-08-29x86,mm: fix pte_special versus pte_numaHugh Dickins1-4/+3
Sasha Levin has shown oopses on ffffea0003480048 and ffffea0003480008 at mm/memory.c:1132, running Trinity on different 3.16-rc-next kernels: where zap_pte_range() checks page->mapping to see if PageAnon(page). Those addresses fit struct pages for pfns d2001 and d2000, and in each dump a register or a stack slot showed d2001730 or d2000730: pte flags 0x730 are PCD ACCESSED PROTNONE SPECIAL IOMAP; and Sasha's e820 map has a hole between cfffffff and 100000000, which would need special access. Commit c46a7c817e66 ("x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels") has broken vm_normal_page(): a PROTNONE SPECIAL pte no longer passes the pte_special() test, so zap_pte_range() goes on to try to access a non-existent struct page. Fix this by refining pte_special() (SPECIAL with PRESENT or PROTNONE) to complement pte_numa() (SPECIAL with neither PRESENT nor PROTNONE). A hint that this was a problem was that c46a7c817e66 added pte_numa() test to vm_normal_page(), and moved its is_zero_pfn() test from slow to fast path: This was papering over a pte_special() snag when the zero page was encountered during zap. This patch reverts vm_normal_page() to how it was before, relying on pte_special(). It still appears that this patch may be incomplete: aren't there other places which need to be handling PROTNONE along with PRESENT? For example, pte_mknuma() clears _PAGE_PRESENT and sets _PAGE_NUMA, but on a PROT_NONE area, that would make it pte_special(). This is side-stepped by the fact that NUMA hinting faults skipped PROT_NONE VMAs and there are no grounds where a NUMA hinting fault on a PROT_NONE VMA would be interesting. Fixes: c46a7c817e66 ("x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels") Reported-by: Sasha Levin <> Tested-by: Sasha Levin <> Signed-off-by: Hugh Dickins <> Signed-off-by: Mel Gorman <> Cc: "Kirill A. Shutemov" <> Cc: Peter Zijlstra <> Cc: Rik van Riel <> Cc: Johannes Weiner <> Cc: Cyrill Gorcunov <> Cc: Matthew Wilcox <> Cc: <> [3.16] Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-08-29hugetlb_cgroup: use lockdep_assert_held rather than spin_is_lockedMichal Hocko1-1/+1
spin_lock may be an empty struct for !SMP configurations and so arch_spin_is_locked may return unconditional 0 and trigger the VM_BUG_ON even when the lock is held. Replace spin_is_locked by lockdep_assert_held. We will not BUG anymore but it is questionable whether crashing makes a lot of sense in the uncharge path. Uncharge happens after the last page reference was released so nobody should touch the page and the function doesn't update any shared state except for res counter which uses synchronization of its own. Signed-off-by: Michal Hocko <> Reviewed-by: Aneesh Kumar K.V <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-08-29mm/zpool: use prefixed module loadingKees Cook3-1/+3
To avoid potential format string expansion via module parameters, do not use the zpool type directly in request_module() without a format string. Additionally, to avoid arbitrary modules being loaded via zpool API (e.g. via the zswap_zpool_type module parameter) add a "zpool-" prefix to the requested module, as well as module aliases for the existing zpool types (zbud and zsmalloc). Signed-off-by: Kees Cook <> Cc: Seth Jennings <> Cc: Minchan Kim <> Cc: Nitin Gupta <> Acked-by: Dan Streetman <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-08-29mm: actually clear pmd_numa before invalidatingMatthew Wilcox1-1/+1
Commit 67f87463d3a3 ("mm: clear pmd_numa before invalidating") cleared the NUMA bit in a copy of the PMD entry, but then wrote back the original Signed-off-by: Matthew Wilcox <> Acked-by: Mel Gorman <> Reviewed-by: Rik van Riel <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-08-29memblock, memhotplug: fix wrong type in memblock_find_in_range_node().Tang Chen1-2/+1
In memblock_find_in_range_node(), we defined ret as int. But it should be phys_addr_t because it is used to store the return value from __memblock_find_range_bottom_up(). The bug has not been triggered because when allocating low memory near the kernel end, the "int ret" won't turn out to be negative. When we started to allocate memory on other nodes, and the "int ret" could be minus. Then the kernel will panic. A simple way to reproduce this: comment out the following code in numa_init(), memblock_set_bottom_up(false); and the kernel won't boot. Reported-by: Xishi Qiu <> Signed-off-by: Tang Chen <> Tested-by: Xishi Qiu <> Cc: <> [3.13+] Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-08-16percpu: free percpu allocation info for uniprocessor systemHonggang Li1-0/+2
Currently, only SMP system free the percpu allocation info. Uniprocessor system should free it too. For example, one x86 UML virtual machine with 256MB memory, UML kernel wastes one page memory. Signed-off-by: Honggang Li <> Signed-off-by: Tejun Heo <> Cc:
2014-08-15percpu: perform tlb flush after pcpu_map_pages() failureTejun Heo1-0/+1
If pcpu_map_pages() fails midway, it unmaps the already mapped pages. Currently, it doesn't flush tlb after the partial unmapping. This may be okay in most cases as the established mapping hasn't been used at that point but it can go wrong and when it goes wrong it'd be extremely difficult to track down. Flush tlb after the partial unmapping. Signed-off-by: Tejun Heo <> Cc:
2014-08-15percpu: fix pcpu_alloc_pages() failure pathTejun Heo1-6/+15
When pcpu_alloc_pages() fails midway, pcpu_free_pages() is invoked to free what has already been allocated. The invocation is across the whole requested range and pcpu_free_pages() will try to free all non-NULL pages; unfortunately, this is incorrect as pcpu_get_pages_and_bitmap(), unlike what its comment suggests, doesn't clear the pages array and thus the array may have entries from the previous invocations making the partial failure path free incorrect pages. Fix it by open-coding the partial freeing of the already allocated pages. Signed-off-by: Tejun Heo <> Cc:
2014-08-14Merge branch 'akpm' (fixes from Andrew Morton)Linus Torvalds1-0/+1
Merge leftovers from Andrew Morton: "A few leftovers. I have a bunch of OCFS2 patches which are still out for review and which I might sneak along after -rc1. Partly my fault - I should send my review pokes out earlier" * emailed patches from Andrew Morton <>: mm: fix CROSS_MEMORY_ATTACH help text grammar drivers/mfd/rtsx_usb.c: export device table mm, hugetlb_cgroup: align hugetlb cgroup limit to hugepage size
2014-08-14mm, hugetlb_cgroup: align hugetlb cgroup limit to hugepage sizeDavid Rientjes1-0/+1
Memcg aligns memory.limit_in_bytes to PAGE_SIZE as part of the resource counter since it makes no sense to allow a partial page to be charged. As a result of the hugetlb cgroup using the resource counter, it is also aligned to PAGE_SIZE but makes no sense unless aligned to the size of the hugepage being limited. Align hugetlb cgroup limit to hugepage size. Signed-off-by: David Rientjes <> Acked-by: Michal Hocko <> Cc: "Aneesh Kumar K.V" <> Cc: Tejun Heo <> Cc: Li Zefan <> Cc: Michal Hocko <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-08-11Merge branch 'for-linus' of ↵Linus Torvalds3-12/+39
git:// Pull vfs updates from Al Viro: "Stuff in here: - acct.c fixes and general rework of mnt_pin mechanism. That allows to go for delayed-mntput stuff, which will permit mntput() on deep stack without worrying about stack overflows - fs shutdown will happen on shallow stack. IOW, we can do Eric's umount-on-rmdir series without introducing tons of stack overflows on new mntput() call chains it introduces. - Bruce's d_splice_alias() patches - more Miklos' rename() stuff. - a couple of regression fixes (stable fodder, in the end of branch) and a fix for API idiocy in iov_iter.c. There definitely will be another pile, maybe even two. I'd like to get Eric's series in this time, but even if we miss it, it'll go right in the beginning of for-next in the next cycle - the tricky part of prereqs is in this pile" * 'for-linus' of git:// (40 commits) fix copy_tree() regression __generic_file_write_iter(): fix handling of sync error after DIO switch iov_iter_get_pages() to passing maximal number of pages fs: mark __d_obtain_alias static dcache: d_splice_alias should detect loops exportfs: update Exporting documentation dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTED dcache: remove unused d_find_alias parameter dcache: d_obtain_alias callers don't all want DISCONNECTED dcache: d_splice_alias should ignore DCACHE_DISCONNECTED dcache: d_splice_alias mustn't create directory aliases dcache: close d_move race in d_splice_alias dcache: move d_splice_alias namei: trivial fix to vfs_rename_dir comment VFS: allow ->d_manage() to declare -EISDIR in rcu_walk mode. cifs: support RENAME_NOREPLACE hostfs: support rename flags shmem: support RENAME_EXCHANGE shmem: support RENAME_NOREPLACE btrfs: add RENAME_NOREPLACE ...
2014-08-11__generic_file_write_iter(): fix handling of sync error after DIOAl Viro1-1/+1
If DIO results in short write and sync write fails, we want to bugger off whether the DIO part has written anything or not; the logics on the return will take care of the right return value. Cc: [3.16] Reported-by: Anton Altaparmakov <> Signed-off-by: Al Viro <>
2014-08-08shm: wait for pins to be released when sealingDavid Herrmann1-1/+109
If we set SEAL_WRITE on a file, we must make sure there cannot be any ongoing write-operations on the file. For write() calls, we simply lock the inode mutex, for mmap() we simply verify there're no writable mappings. However, there might be pages pinned by AIO, Direct-IO and similar operations via GUP. We must make sure those do not write to the memfd file after we set SEAL_WRITE. As there is no way to notify GUP users to drop pages or to wait for them to be done, we implement the wait ourself: When setting SEAL_WRITE, we check all pages for their ref-count. If it's bigger than 1, we know there's some user of the page. We then mark the page and wait for up to 150ms for those ref-counts to be dropped. If the ref-counts are not dropped in time, we refuse the seal operation. Signed-off-by: David Herrmann <> Acked-by: Hugh Dickins <> Cc: Michael Kerrisk <> Cc: Ryan Lortie <> Cc: Lennart Poettering <> Cc: Daniel Mack <> Cc: Andy Lutomirski <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-08-08shm: add memfd_create() syscallDavid Herrmann1-0/+73
memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor that you can pass to mmap(). It can support sealing and avoids any connection to user-visible mount-points. Thus, it's not subject to quotas on mounted file-systems, but can be used like malloc()'ed memory, but with a file-descriptor to it. memfd_create() returns the raw shmem file, so calls like ftruncate() can be used to modify the underlying inode. Also calls like fstat() will return proper information and mark the file as regular file. If you want sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not supported (like on all other regular files). Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not subject to a filesystem size limit. It is still properly accounted to memcg limits, though, and to the same overcommit or no-overcommit accounting as all user memory. Signed-off-by: David Herrmann <> Acked-by: Hugh Dickins <> Cc: Michael Kerrisk <> Cc: Ryan Lortie <> Cc: Lennart Poettering <> Cc: Daniel Mack <> Cc: Andy Lutomirski <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-08-08shm: add sealing APIDavid Herrmann1-0/+143
If two processes share a common memory region, they usually want some guarantees to allow safe access. This often includes: - one side cannot overwrite data while the other reads it - one side cannot shrink the buffer while the other accesses it - one side cannot grow the buffer beyond previously set boundaries If there is a trust-relationship between both parties, there is no need for policy enforcement. However, if there's no trust relationship (eg., for general-purpose IPC) sharing memory-regions is highly fragile and often not possible without local copies. Look at the following two use-cases: 1) A graphics client wants to share its rendering-buffer with a graphics-server. The memory-region is allocated by the client for read/write access and a second FD is passed to the server. While scanning out from the memory region, the server has no guarantee that the client doesn't shrink the buffer at any time, requiring rather cumbersome SIGBUS handling. 2) A process wants to perform an RPC on another process. To avoid huge bandwidth consumption, zero-copy is preferred. After a message is assembled in-memory and a FD is passed to the remote side, both sides want to be sure that neither modifies this shared copy, anymore. The source may have put sensible data into the message without a separate copy and the target may want to parse the message inline, to avoid a local copy. While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide ways to achieve most of this, the first one is unproportionally ugly to use in libraries and the latter two are broken/racy or even disabled due to denial of service attacks. This patch introduces the concept of SEALING. If you seal a file, a specific set of operations is blocked on that file forever. Unlike locks, seals can only be set, never removed. Hence, once you verified a specific set of seals is set, you're guaranteed that no-one can perform the blocked operations on this file, anymore. An initial set of SEALS is introduced by this patch: - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced in size. This affects ftruncate() and open(O_TRUNC). - GROW: If SEAL_GROW is set, the file in question cannot be increased in size. This affects ftruncate(), fallocate() and write(). - WRITE: If SEAL_WRITE is set, no write operations (besides resizing) are possible. This affects fallocate(PUNCH_HOLE), mmap() and write(). - SEAL: If SEAL_SEAL is set, no further seals can be added to a file. This basically prevents the F_ADD_SEAL operation on a file and can be set to prevent others from adding further seals that you don't want. The described use-cases can easily use these seals to provide safe use without any trust-relationship: 1) The graphics server can verify that a passed file-descriptor has SEAL_SHRINK set. This allows safe scanout, while the client is allowed to increase buffer size for window-resizing on-the-fly. Concurrent writes are explicitly allowed. 2) For general-purpose IPC, both processes can verify that SEAL_SHRINK, SEAL_GROW and SEAL_WRITE are set. This guarantees that neither process can modify the data while the other side parses it. Furthermore, it guarantees that even with writable FDs passed to the peer, it cannot increase the size to hit memory-limits of the source process (in case the file-storage is accounted to the source). The new API is an extension to fcntl(), adding two new commands: F_GET_SEALS: Return a bitset describing the seals on the file. This can be called on any FD if the underlying file supports sealing. F_ADD_SEALS: Change the seals of a given file. This requires WRITE access to the file and F_SEAL_SEAL may not already be set. Furthermore, the underlying file must support sealing and there may not be any existing shared mapping of that file. Otherwise, EBADF/EPERM is returned. The given seals are _added_ to the existing set of seals on the file. You cannot remove seals again. The fcntl() handler is currently specific to shmem and disabled on all files. A file needs to explicitly support sealing for this interface to work. A separate syscall is added in a follow-up, which creates files that support sealing. There is no intention to support this on other file-systems. Semantics are unclear for non-volatile files and we lack any use-case right now. Therefore, the implementation is specific to shmem. Signed-off-by: David Herrmann <> Acked-by: Hugh Dickins <> Cc: Michael Kerrisk <> Cc: Ryan Lortie <> Cc: Lennart Poettering <> Cc: Daniel Mack <> Cc: Andy Lutomirski <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-08-08mm: allow drivers to prevent new writable mappingsDavid Herrmann2-6/+25
This patch (of 6): The i_mmap_writable field counts existing writable mappings of an address_space. To allow drivers to prevent new writable mappings, make this counter signed and prevent new writable mappings if it is negative. This is modelled after i_writecount and DENYWRITE. This will be required by the shmem-sealing infrastructure to prevent any new writable mappings after the WRITE seal has been set. In case there exists a writable mapping, this operation will fail with EBUSY. Note that we rely on the fact that iff you already own a writable mapping, you can increase the counter without using the helpers. This is the same that we do for i_writecount. Signed-off-by: David Herrmann <> Acked-by: Hugh Dickins <> Cc: Michael Kerrisk <> Cc: Ryan Lortie <> Cc: Lennart Poettering <> Cc: Daniel Mack <> Cc: Andy Lutomirski <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-08-08arm64,ia64,ppc,s390,sh,tile,um,x86,mm: remove default gate areaAndy Lutomirski2-43/+0
The core mm code will provide a default gate area based on FIXADDR_USER_START and FIXADDR_USER_END if !defined(__HAVE_ARCH_GATE_AREA) && defined(AT_SYSINFO_EHDR). This default is only useful for ia64. arm64, ppc, s390, sh, tile, 64-bit UML, and x86_32 have their own code just to disable it. arm, 32-bit UML, and x86_64 have gate areas, but they have their own implementations. This gets rid of the default and moves the code into ia64. This should save some code on architectures without a gate area: it's now possible to inline the gate_area functions in the default case. Signed-off-by: Andy Lutomirski <> Acked-by: Nathan Lynch <> Acked-by: H. Peter Anvin <> Acked-by: Benjamin Herrenschmidt <> [in principle] Acked-by: Richard Weinberger <> [for um] Acked-by: Will Deacon <> [for arm64] Cc: Catalin Marinas <> Cc: Will Deacon <> Cc: Tony Luck <> Cc: Fenghua Yu <> Cc: Benjamin Herrenschmidt <> Cc: Paul Mackerras <> Cc: Martin Schwidefsky <> Cc: Heiko Carstens <> Cc: Chris Metcalf <> Cc: Jeff Dike <> Cc: Richard Weinberger <> Cc: Thomas Gleixner <> Cc: Ingo Molnar <> Cc: "H. Peter Anvin" <> Cc: Nathan Lynch <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-08-08mm/zswap.c: add __init to zswap_entry_cache_destroy()Fabian Frederick1-2/+2
zswap_entry_cache_destroy() is only called by __init init_zswap(). This patch also fixes function name zswap_entry_cache_ s/destory/destroy Signed-off-by: Fabian Frederick <> Acked-by: Seth Jennings <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-08-08mm: memcontrol: avoid charge statistics churn during page migrationJohannes Weiner1-25/+10
Charge migration currently disables IRQs twice to update the charge statistics for the old page and then again for the new page. But migration is a seamless transition of a charge from one physical page to another one of the same size, so this should be a non-event from an accounting point of view. Leave the statistics alone. Signed-off-by: Johannes Weiner <> Acked-by: Michal Hocko <> Cc: Hugh Dickins <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-08-08mm: memcontrol: use page lists for uncharge batchingJohannes Weiner3-109/+115
Pages are now uncharged at release time, and all sources of batched uncharges operate on lists of pages. Directly use those lists, and get rid of the per-task batching state. This also batches statistics accounting, in addition to the res counter charges, to reduce IRQ-disabling and re-enabling. Signed-off-by: Johannes Weiner <> Acked-by: Michal Hocko <> Cc: Hugh Dickins <> Cc: Tejun Heo <> Cc: Vladimir Davydov <> Cc: Naoya Horiguchi <> Cc: Vladimir Davydov <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-08-08mm: memcontrol: rewrite uncharge APIJohannes Weiner12-570/+355
The memcg uncharging code that is involved towards the end of a page's lifetime - truncation, reclaim, swapout, migration - is impressively complicated and fragile. Because anonymous and file pages were always charged before they had their page->mapping established, uncharges had to happen when the page type could still be known from the context; as in unmap for anonymous, page cache removal for file and shmem pages, and swap cache truncation for swap pages. However, these operations happen well before the page is actually freed, and so a lot of synchronization is necessary: - Charging, uncharging, page migration, and charge migration all need to take a per-page bit spinlock as they could race with uncharging. - Swap cache truncation happens during both swap-in and swap-out, and possibly repeatedly before the page is actually freed. This means that the memcg swapout code is called from many contexts that make no sense and it has to figure out the direction from page state to make sure memory and memory+swap are always correctly charged. - On page migration, the old page might be unmapped but then reused, so memcg code has to prevent untimely uncharging in that case. Because this code - which should be a simple charge transfer - is so special-cased, it is not reusable for replace_page_cache(). But now that charged pages always have a page->mapping, introduce mem_cgroup_uncharge(), which is called after the final put_page(), when we know for sure that nobody is looking at the page anymore. For page migration, introduce mem_cgroup_migrate(), which is called after the migration is successful and the new page is fully rmapped. Because the old page is no longer uncharged after migration, prevent double charges by decoupling the page's memcg association (PCG_USED and pc->mem_cgroup) from the page holding an actual charge. The new bits PCG_MEM and PCG_MEMSW represent the respective charges and are transferred to the new page during migration. mem_cgroup_migrate() is suitable for replace_page_cache() as well, which gets rid of mem_cgroup_replace_page_cache(). However, care needs to be taken because both the source and the target page can already be charged and on the LRU when fuse is splicing: grab the page lock on the charge moving side to prevent changing pc->mem_cgroup of a page under migration. Also, the lruvecs of both pages change as we uncharge the old and charge the new during migration, and putback may race with us, so grab the lru lock and isolate the pages iff on LRU to prevent races and ensure the pages are on the right lruvec afterward. Swap accounting is massively simplified: because the page is no longer uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry before the final put_page() in page reclaim. Finally, page_cgroup changes are now protected by whatever protection the page itself offers: anonymous pages are charged under the page table lock, whereas page cache insertions, swapin, and migration hold the page lock. Uncharging happens under full exclusion with no outstanding references. Charging and uncharging also ensure that the page is off-LRU, which serializes against charge migration. Remove the very costly page_cgroup lock and set pc->flags non-atomically. [ mem_cgroup_charge_statistics needs preempt_disable] [ fix flags definition] Signed-off-by: Johannes Weiner <> Cc: Hugh Dickins <> Cc: Tejun Heo <> Cc: Vladimir Davydov <> Tested-by: Jet Chen <> Acked-by: Michal Hocko <> Tested-by: Felipe Balbi <> Signed-off-by: Vladimir Davydov <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-08-08mm: memcontrol: rewrite charge APIJohannes Weiner8-322/+308
These patches rework memcg charge lifetime to integrate more naturally with the lifetime of user pages. This drastically simplifies the code and reduces charging and uncharging overhead. The most expensive part of charging and uncharging is the page_cgroup bit spinlock, which is removed entirely after this series. Here are the top-10 profile entries of a stress test that reads a 128G sparse file on a freshly booted box, without even a dedicated cgroup (i.e. executing in the root memcg). Before: 15.36% cat [kernel.kallsyms] [k] copy_user_generic_string 13.31% cat [kernel.kallsyms] [k] memset 11.48% cat [kernel.kallsyms] [k] do_mpage_readpage 4.23% cat [kernel.kallsyms] [k] get_page_from_freelist 2.38% cat [kernel.kallsyms] [k] put_page 2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge 2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common 1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list 1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup 1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn After: 15.67% cat [kernel.kallsyms] [k] copy_user_generic_string 13.48% cat [kernel.kallsyms] [k] memset 11.42% cat [kernel.kallsyms] [k] do_mpage_readpage 3.98% cat [kernel.kallsyms] [k] get_page_from_freelist 2.46% cat [kernel.kallsyms] [k] put_page 2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list 1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup 1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn 1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk 1.30% cat [kernel.kallsyms] [k] kfree As you can see, the memcg footprint has shrunk quite a bit. text data bss dec hex filename 37970 9892 400 48262 bc86 mm/memcontrol.o.old 35239 9892 400 45531 b1db mm/memcontrol.o This patch (of 4): The memcg charge API charges pages before they are rmapped - i.e. have an actual "type" - and so every callsite needs its own set of charge and uncharge functions to know what type is being operated on. Worse, uncharge has to happen from a context that is still type-specific, rather than at the end of the page's lifetime with exclusive access, and so requires a lot of synchronization. Rewrite the charge API to provide a generic set of try_charge(), commit_charge() and cancel_charge() transaction operations, much like what's currently done for swap-in: mem_cgroup_try_charge() attempts to reserve a charge, reclaiming pages from the memcg if necessary. mem_cgroup_commit_charge() commits the page to the charge once it has a valid page->mapping and PageAnon() reliably tells the type. mem_cgroup_cancel_charge() aborts the transaction. This reduces the charge API and enables subsequent patches to drastically simplify uncharging. As pages need to be committed after rmap is established but before they are added to the LRU, page_add_new_anon_rmap() must stop doing LRU additions again. Revive lru_cache_add_active_or_unevictable(). [ fix shmem_unuse] [ Add comments on the private use of -EAGAIN] Signed-off-by: Johannes Weiner <> Acked-by: Michal Hocko <> Cc: Tejun Heo <> Cc: Vladimir Davydov <> Signed-off-by: Hugh Dickins <> Cc: Naoya Horiguchi <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-08-08vm_is_stack: use for_each_thread() rather then buggy while_each_thread()Oleg Nesterov1-6/+3
Aleksei hit the soft lockup during reading /proc/PID/smaps. David investigated the problem and suggested the right fix. while_each_thread() is racy and should die, this patch updates vm_is_stack(). Signed-off-by: Oleg Nesterov <> Reported-by: Aleksei Besogonov <> Tested-by: Aleksei Besogonov <> Suggested-by: David Rientjes <> Cc: <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-08-08Revert "slab: remove BAD_ALIEN_MAGIC"Joonsoo Kim1-1/+3
This reverts commit a640616822b2 ("slab: remove BAD_ALIEN_MAGIC"). commit a640616822b2 ("slab: remove BAD_ALIEN_MAGIC") assumes that the system with !CONFIG_NUMA has only one memory node. But, it turns out to be false by the report from Geert. His system, m68k, has many memory nodes and is configured in !CONFIG_NUMA. So it couldn't boot with above change. Here goes his failure report. With latest mainline, I'm getting a crash during bootup on m68k/ARAnyM: enable_cpucache failed for radix_tree_node, error 12. kernel BUG at /scratch/geert/linux/linux-m68k/mm/slab.c:1522! *** TRAP #7 *** FORMAT=0 Current process id is 0 BAD KERNEL TRAP: 00000000 Modules linked in: PC: [<0039c92c>] kmem_cache_init_late+0x70/0x8c SR: 2200 SP: 00345f90 a2: 0034c2e8 d0: 0000003d d1: 00000000 d2: 00000000 d3: 003ac942 d4: 00000000 d5: 00000000 a0: 0034f686 a1: 0034f682 Process swapper (pid: 0, task=0034c2e8) Frame format=0 Stack from 00345fc4: 002f69ef 002ff7e5 000005f2 000360fa 0017d806 003921d4 00000000 00000000 00000000 00000000 00000000 00000000 003ac942 00000000 003912d6 Call Trace: [<000360fa>] parse_args+0x0/0x2ca [<0017d806>] strlen+0x0/0x1a [<003921d4>] start_kernel+0x23c/0x428 [<003912d6>] _sinittext+0x2d6/0x95e Code: f7e5 4879 002f 69ef 61ff ffca 462a 4e47 <4879> 0035 4b1c 61ff fff0 0cc4 7005 23c0 0037 fd20 588f 265f 285f 4e75 48e7 301c Disabling lock debugging due to kernel taint Kernel panic - not syncing: Attempted to kill the idle task! Although there is a alternative way to fix this issue such as disabling use of alien cache on !CONFIG_NUMA, but, reverting issued commit is better to me in this time. Signed-off-by: Joonsoo Kim <> Reported-by: Geert Uytterhoeven <> Cc: Christoph Lameter <> Cc: Pekka Enberg <> Cc: David Rientjes <> Cc: Vladimir Davydov <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-08-07switch iov_iter_get_pages() to passing maximal number of pagesAl Viro1-9/+8
... instead of maximal size. Signed-off-by: Al Viro <>
2014-08-07shmem: support RENAME_EXCHANGEMiklos Szeredi1-1/+26
This is really simple in tmpfs since the VFS already takes care of shuffling the dentries. Just adjust nlink on parent directories and touch c & mtimes. Signed-off-by: Miklos Szeredi <> Acked-by: Hugh Dickins <> Signed-off-by: Christoph Hellwig <> Signed-off-by: Al Viro <>
2014-08-07shmem: support RENAME_NOREPLACEMiklos Szeredi1-2/+5
Implement ->rename2 instead of ->rename. Signed-off-by: Miklos Szeredi <> Acked-by: Hugh Dickins <> Signed-off-by: Christoph Hellwig <> Signed-off-by: Al Viro <>
2014-08-06mm/zpool: update zswap to use zpoolDan Streetman2-31/+46
Change zswap to use the zpool api instead of directly using zbud. Add a boot-time param to allow selecting which zpool implementation to use, with zbud as the default. Signed-off-by: Dan Streetman <> Tested-by: Seth Jennings <> Cc: Weijie Yang <> Cc: Minchan Kim <> Cc: Nitin Gupta <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-08-06mm/zpool: zbud/zsmalloc implement zpoolDan Streetman2-0/+179
Update zbud and zsmalloc to implement the zpool api. [ make functions static] Signed-off-by: Dan Streetman <> Tested-by: Seth Jennings <> Cc: Minchan Kim <> Cc: Nitin Gupta <> Cc: Weijie Yang <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-08-06mm/zpool: implement common zpool api to zbud/zsmallocDan Streetman4-18/+389
Add zpool api. zpool provides an interface for memory storage, typically of compressed memory. Users can select what backend to use; currently the only implementations are zbud, a low density implementation with up to two compressed pages per storage page, and zsmalloc, a higher density implementation with multiple compressed pages per storage page. Signed-off-by: Dan Streetman <> Tested-by: Seth Jennings <> Cc: Minchan Kim <> Cc: Nitin Gupta <> Cc: Weijie Yang <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-08-06mm/zbud: change zbud_alloc size type to size_tDan Streetman1-2/+2
Change the type of the zbud_alloc() size param from unsigned int to size_t. Technically, this should not make any difference, as the zbud implementation already restricts the size to well within either type's limits; but as zsmalloc (and kmalloc) use size_t, and zpool will use size_t, this brings the size parameter type in line with zsmalloc/zpool. Signed-off-by: Dan Streetman <> Acked-by: Seth Jennings <> Tested-by: Seth Jennings <> Cc: Weijie Yang <> Cc: Minchan Kim <> Cc: Nitin Gupta <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>