path: root/security/commoncap.c
AgeCommit message (Collapse)AuthorFilesLines
2016-05-17Merge branch 'for-linus' of ↵Linus Torvalds1-3/+3
git:// Pull parallel filesystem directory handling update from Al Viro. This is the main parallel directory work by Al that makes the vfs layer able to do lookup and readdir in parallel within a single directory. That's a big change, since this used to be all protected by the directory inode mutex. The inode mutex is replaced by an rwsem, and serialization of lookups of a single name is done by a "in-progress" dentry marker. The series begins with xattr cleanups, and then ends with switching filesystems over to actually doing the readdir in parallel (switching to the "iterate_shared()" that only takes the read lock). A more detailed explanation of the process from Al Viro: "The xattr work starts with some acl fixes, then switches ->getxattr to passing inode and dentry separately. This is the point where the things start to get tricky - that got merged into the very beginning of the -rc3-based #work.lookups, to allow untangling the security_d_instantiate() mess. The xattr work itself proceeds to switch a lot of filesystems to generic_...xattr(); no complications there. After that initial xattr work, the series then does the following: - untangle security_d_instantiate() - convert a bunch of open-coded lookup_one_len_unlocked() to calls of that thing; one such place (in overlayfs) actually yields a trivial conflict with overlayfs fixes later in the cycle - overlayfs ended up switching to a variant of lookup_one_len_unlocked() sans the permission checks. I would've dropped that commit (it gets overridden on merge from #ovl-fixes in #for-next; proper resolution is to use the variant in mainline fs/overlayfs/super.c), but I didn't want to rebase the damn thing - it was fairly late in the cycle... - some filesystems had managed to depend on lookup/lookup exclusion for *fs-internal* data structures in a way that would break if we relaxed the VFS exclusion. Fixing hadn't been hard, fortunately. - core of that series - parallel lookup machinery, replacing ->i_mutex with rwsem, making lookup_slow() take it only shared. At that point lookups happen in parallel; lookups on the same name wait for the in-progress one to be done with that dentry. Surprisingly little code, at that - almost all of it is in fs/dcache.c, with fs/namei.c changes limited to lookup_slow() - making it use the new primitive and actually switching to locking shared. - parallel readdir stuff - first of all, we provide the exclusion on per-struct file basis, same as we do for read() vs lseek() for regular files. That takes care of most of the needed exclusion in readdir/readdir; however, these guys are trickier than lookups, so I went for switching them one-by-one. To do that, a new method '->iterate_shared()' is added and filesystems are switched to it as they are either confirmed to be OK with shared lock on directory or fixed to be OK with that. I hope to kill the original method come next cycle (almost all in-tree filesystems are switched already), but it's still not quite finished. - several filesystems get switched to parallel readdir. The interesting part here is dealing with dcache preseeding by readdir; that needs minor adjustment to be safe with directory locked only shared. Most of the filesystems doing that got switched to in those commits. Important exception: NFS. Turns out that NFS folks, with their, er, insistence on VFS getting the fuck out of the way of the Smart Filesystem Code That Knows How And What To Lock(tm) have grown the locking of their own. They had their own homegrown rwsem, with lookup/readdir/atomic_open being *writers* (sillyunlink is the reader there). Of course, with VFS getting the fuck out of the way, as requested, the actual smarts of the smart filesystem code etc. had become exposed... - do_last/lookup_open/atomic_open cleanups. As the result, open() without O_CREAT locks the directory only shared. Including the ->atomic_open() case. Backmerge from #for-linus in the middle of that - atomic_open() fix got brought in. - then comes NFS switch to saner (VFS-based ;-) locking, killing the homegrown "lookup and readdir are writers" kinda-sorta rwsem. All exclusion for sillyunlink/lookup is done by the parallel lookups mechanism. Exclusion between sillyunlink and rmdir is a real rwsem now - rmdir being the writer. Result: NFS lookups/readdirs/O_CREAT-less opens happen in parallel now. - the rest of the series consists of switching a lot of filesystems to parallel readdir; in a lot of cases ->llseek() gets simplified as well. One backmerge in there (again, #for-linus - rockridge fix)" * 'for-linus' of git:// (74 commits) ext4: switch to ->iterate_shared() hfs: switch to ->iterate_shared() hfsplus: switch to ->iterate_shared() hostfs: switch to ->iterate_shared() hpfs: switch to ->iterate_shared() hpfs: handle allocation failures in hpfs_add_pos() gfs2: switch to ->iterate_shared() f2fs: switch to ->iterate_shared() afs: switch to ->iterate_shared() befs: switch to ->iterate_shared() befs: constify stuff a bit isofs: switch to ->iterate_shared() get_acorn_filename(): deobfuscate a bit btrfs: switch to ->iterate_shared() logfs: no need to lock directory in lseek switch ecryptfs to ->iterate_shared 9p: switch to ->iterate_shared() fat: switch to ->iterate_shared() romfs, squashfs: switch to ->iterate_shared() more trivial ->iterate_shared conversions ...
2016-04-22security: Introduce security_settime64()Baolin Wang1-1/+1
security_settime() uses a timespec, which is not year 2038 safe on 32bit systems. Thus this patch introduces the security_settime64() function with timespec64 type. We also convert the cap_settime() helper function to use the 64bit types. This patch then moves security_settime() to the header file as an inline helper function so that existing users can be iteratively converted. None of the existing hooks is using the timespec argument and therefor the patch is not making any functional changes. Cc: Serge Hallyn <>, Cc: James Morris <>, Cc: "Serge E. Hallyn" <>, Cc: Paul Moore <> Cc: Stephen Smalley <> Cc: Kees Cook <> Cc: Prarit Bhargava <> Cc: Richard Cochran <> Cc: Thomas Gleixner <> Cc: Ingo Molnar <> Reviewed-by: James Morris <> Signed-off-by: Baolin Wang <> [jstultz: Reworded commit message] Signed-off-by: John Stultz <>
2016-04-11->getxattr(): pass dentry and inode as separate argumentsAl Viro1-3/+3
Signed-off-by: Al Viro <>
2016-01-20ptrace: use fsuid, fsgid, effective creds for fs access checksJann Horn1-1/+6
By checking the effective credentials instead of the real UID / permitted capabilities, ensure that the calling process actually intended to use its credentials. To ensure that all ptrace checks use the correct caller credentials (e.g. in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS flag), use two new flags and require one of them to be set. The problem was that when a privileged task had temporarily dropped its privileges, e.g. by calling setreuid(0, user_uid), with the intent to perform following syscalls with the credentials of a user, it still passed ptrace access checks that the user would not be able to pass. While an attacker should not be able to convince the privileged task to perform a ptrace() syscall, this is a problem because the ptrace access check is reused for things in procfs. In particular, the following somewhat interesting procfs entries only rely on ptrace access checks: /proc/$pid/stat - uses the check for determining whether pointers should be visible, useful for bypassing ASLR /proc/$pid/maps - also useful for bypassing ASLR /proc/$pid/cwd - useful for gaining access to restricted directories that contain files with lax permissions, e.g. in this scenario: lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar drwx------ root root /root drwxr-xr-x root root /root/foobar -rw-r--r-- root root /root/foobar/secret Therefore, on a system where a root-owned mode 6755 binary changes its effective credentials as described and then dumps a user-specified file, this could be used by an attacker to reveal the memory layout of root's processes or reveal the contents of files he is not allowed to access (through /proc/$pid/cwd). [ fix warning] Signed-off-by: Jann Horn <> Acked-by: Kees Cook <> Cc: Casey Schaufler <> Cc: Oleg Nesterov <> Cc: Ingo Molnar <> Cc: James Morris <> Cc: "Serge E. Hallyn" <> Cc: Andy Shevchenko <> Cc: Andy Lutomirski <> Cc: Al Viro <> Cc: "Eric W. Biederman" <> Cc: Willy Tarreau <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2015-09-04capabilities: add a securebit to disable PR_CAP_AMBIENT_RAISEAndy Lutomirski1-1/+2
Per Andrew Morgan's request, add a securebit to allow admins to disable PR_CAP_AMBIENT_RAISE. This securebit will prevent processes from adding capabilities to their ambient set. For simplicity, this disables PR_CAP_AMBIENT_RAISE entirely rather than just disabling setting previously cleared bits. Signed-off-by: Andy Lutomirski <> Acked-by: Andrew G. Morgan <> Acked-by: Serge Hallyn <> Cc: Kees Cook <> Cc: Christoph Lameter <> Cc: Serge Hallyn <> Cc: Jonathan Corbet <> Cc: Aaron Jones <> Cc: Ted Ts'o <> Cc: Andrew G. Morgan <> Cc: Mimi Zohar <> Cc: Austin S Hemmelgarn <> Cc: Markku Savela <> Cc: Jarkko Sakkinen <> Cc: Michael Kerrisk <> Cc: James Morris <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2015-09-04capabilities: ambient capabilitiesAndy Lutomirski1-10/+92
Credit where credit is due: this idea comes from Christoph Lameter with a lot of valuable input from Serge Hallyn. This patch is heavily based on Christoph's patch. ===== The status quo ===== On Linux, there are a number of capabilities defined by the kernel. To perform various privileged tasks, processes can wield capabilities that they hold. Each task has four capability masks: effective (pE), permitted (pP), inheritable (pI), and a bounding set (X). When the kernel checks for a capability, it checks pE. The other capability masks serve to modify what capabilities can be in pE. Any task can remove capabilities from pE, pP, or pI at any time. If a task has a capability in pP, it can add that capability to pE and/or pI. If a task has CAP_SETPCAP, then it can add any capability to pI, and it can remove capabilities from X. Tasks are not the only things that can have capabilities; files can also have capabilities. A file can have no capabilty information at all [1]. If a file has capability information, then it has a permitted mask (fP) and an inheritable mask (fI) as well as a single effective bit (fE) [2]. File capabilities modify the capabilities of tasks that execve(2) them. A task that successfully calls execve has its capabilities modified for the file ultimately being excecuted (i.e. the binary itself if that binary is ELF or for the interpreter if the binary is a script.) [3] In the capability evolution rules, for each mask Z, pZ represents the old value and pZ' represents the new value. The rules are: pP' = (X & fP) | (pI & fI) pI' = pI pE' = (fE ? pP' : 0) X is unchanged For setuid binaries, fP, fI, and fE are modified by a moderately complicated set of rules that emulate POSIX behavior. Similarly, if euid == 0 or ruid == 0, then fP, fI, and fE are modified differently (primary, fP and fI usually end up being the full set). For nonroot users executing binaries with neither setuid nor file caps, fI and fP are empty and fE is false. As an extra complication, if you execute a process as nonroot and fE is set, then the "secure exec" rules are in effect: AT_SECURE gets set, LD_PRELOAD doesn't work, etc. This is rather messy. We've learned that making any changes is dangerous, though: if a new kernel version allows an unprivileged program to change its security state in a way that persists cross execution of a setuid program or a program with file caps, this persistent state is surprisingly likely to allow setuid or file-capped programs to be exploited for privilege escalation. ===== The problem ===== Capability inheritance is basically useless. If you aren't root and you execute an ordinary binary, fI is zero, so your capabilities have no effect whatsoever on pP'. This means that you can't usefully execute a helper process or a shell command with elevated capabilities if you aren't root. On current kernels, you can sort of work around this by setting fI to the full set for most or all non-setuid executable files. This causes pP' = pI for nonroot, and inheritance works. No one does this because it's a PITA and it isn't even supported on most filesystems. If you try this, you'll discover that every nonroot program ends up with secure exec rules, breaking many things. This is a problem that has bitten many people who have tried to use capabilities for anything useful. ===== The proposed change ===== This patch adds a fifth capability mask called the ambient mask (pA). pA does what most people expect pI to do. pA obeys the invariant that no bit can ever be set in pA if it is not set in both pP and pI. Dropping a bit from pP or pI drops that bit from pA. This ensures that existing programs that try to drop capabilities still do so, with a complication. Because capability inheritance is so broken, setting KEEPCAPS, using setresuid to switch to nonroot uids, and then calling execve effectively drops capabilities. Therefore, setresuid from root to nonroot conditionally clears pA unless SECBIT_NO_SETUID_FIXUP is set. Processes that don't like this can re-add bits to pA afterwards. The capability evolution rules are changed: pA' = (file caps or setuid or setgid ? 0 : pA) pP' = (X & fP) | (pI & fI) | pA' pI' = pI pE' = (fE ? pP' : pA') X is unchanged If you are nonroot but you have a capability, you can add it to pA. If you do so, your children get that capability in pA, pP, and pE. For example, you can set pA = CAP_NET_BIND_SERVICE, and your children can automatically bind low-numbered ports. Hallelujah! Unprivileged users can create user namespaces, map themselves to a nonzero uid, and create both privileged (relative to their namespace) and unprivileged process trees. This is currently more or less impossible. Hallelujah! You cannot use pA to try to subvert a setuid, setgid, or file-capped program: if you execute any such program, pA gets cleared and the resulting evolution rules are unchanged by this patch. Users with nonzero pA are unlikely to unintentionally leak that capability. If they run programs that try to drop privileges, dropping privileges will still work. It's worth noting that the degree of paranoia in this patch could possibly be reduced without causing serious problems. Specifically, if we allowed pA to persist across executing non-pA-aware setuid binaries and across setresuid, then, naively, the only capabilities that could leak as a result would be the capabilities in pA, and any attacker *already* has those capabilities. This would make me nervous, though -- setuid binaries that tried to privilege-separate might fail to do so, and putting CAP_DAC_READ_SEARCH or CAP_DAC_OVERRIDE into pA could have unexpected side effects. (Whether these unexpected side effects would be exploitable is an open question.) I've therefore taken the more paranoid route. We can revisit this later. An alternative would be to require PR_SET_NO_NEW_PRIVS before setting ambient capabilities. I think that this would be annoying and would make granting otherwise unprivileged users minor ambient capabilities (CAP_NET_BIND_SERVICE or CAP_NET_RAW for example) much less useful than it is with this patch. ===== Footnotes ===== [1] Files that are missing the "security.capability" xattr or that have unrecognized values for that xattr end up with has_cap set to false. The code that does that appears to be complicated for no good reason. [2] The libcap capability mask parsers and formatters are dangerously misleading and the documentation is flat-out wrong. fE is *not* a mask; it's a single bit. This has probably confused every single person who has tried to use file capabilities. [3] Linux very confusingly processes both the script and the interpreter if applicable, for reasons that elude me. The results from thinking about a script's file capabilities and/or setuid bits are mostly discarded. Preliminary userspace code is here, but it needs updating: Here is a test program that can be used to verify the functionality (from Christoph): /* * Test program for the ambient capabilities. This program spawns a shell * that allows running processes with a defined set of capabilities. * * (C) 2015 Christoph Lameter <> * Released under: GPL v3 or later. * * * Compile using: * * gcc -o ambient_test ambient_test.o -lcap-ng * * This program must have the following capabilities to run properly: * Permissions for CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE * * A command to equip the binary with the right caps is: * * setcap cap_net_raw,cap_net_admin,cap_sys_nice+p ambient_test * * * To get a shell with additional caps that can be inherited by other processes: * * ./ambient_test /bin/bash * * * Verifying that it works: * * From the bash spawed by ambient_test run * * cat /proc/$$/status * * and have a look at the capabilities. */ #include <stdlib.h> #include <stdio.h> #include <errno.h> #include <cap-ng.h> #include <sys/prctl.h> #include <linux/capability.h> /* * Definitions from the kernel header files. These are going to be removed * when the /usr/include files have these defined. */ #define PR_CAP_AMBIENT 47 #define PR_CAP_AMBIENT_IS_SET 1 #define PR_CAP_AMBIENT_RAISE 2 #define PR_CAP_AMBIENT_LOWER 3 #define PR_CAP_AMBIENT_CLEAR_ALL 4 static void set_ambient_cap(int cap) { int rc; capng_get_caps_process(); rc = capng_update(CAPNG_ADD, CAPNG_INHERITABLE, cap); if (rc) { printf("Cannot add inheritable cap\n"); exit(2); } capng_apply(CAPNG_SELECT_CAPS); /* Note the two 0s at the end. Kernel checks for these */ if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0)) { perror("Cannot set cap"); exit(1); } } int main(int argc, char **argv) { int rc; set_ambient_cap(CAP_NET_RAW); set_ambient_cap(CAP_NET_ADMIN); set_ambient_cap(CAP_SYS_NICE); printf("Ambient_test forking shell\n"); if (execv(argv[1], argv + 1)) perror("Cannot exec"); return 0; } Signed-off-by: Christoph Lameter <> # Original author Signed-off-by: Andy Lutomirski <> Acked-by: Serge E. Hallyn <> Acked-by: Kees Cook <> Cc: Jonathan Corbet <> Cc: Aaron Jones <> Cc: Ted Ts'o <> Cc: Andrew G. Morgan <> Cc: Mimi Zohar <> Cc: Austin S Hemmelgarn <> Cc: Markku Savela <> Cc: Jarkko Sakkinen <> Cc: Michael Kerrisk <> Cc: James Morris <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2015-05-12LSM: Switch to lists of hooksCasey Schaufler1-8/+33
Instead of using a vector of security operations with explicit, special case stacking of the capability and yama hooks use lists of hooks with capability and yama hooks included as appropriate. The security_operations structure is no longer required. Instead, there is a union of the function pointers that allows all the hooks lists to use a common mechanism for list management while retaining typing. Each module supplies an array describing the hooks it provides instead of a sparsely populated security_operations structure. The description includes the element that gets put on the hook list, avoiding the issues surrounding individual element allocation. The method for registering security modules is changed to reflect the information available. The method for removing a module, currently only used by SELinux, has also changed. It should be generic now, however if there are potential race conditions based on ordering of hook removal that needs to be addressed by the calling module. The security hooks are called from the lists and the first failure is returned. Signed-off-by: Casey Schaufler <> Acked-by: John Johansen <> Acked-by: Kees Cook <> Acked-by: Paul Moore <> Acked-by: Stephen Smalley <> Acked-by: Tetsuo Handa <> Signed-off-by: James Morris <>
2015-04-15VFS: security/: d_backing_inode() annotationsDavid Howells1-3/+3
most of the ->d_inode uses there refer to the same inode IO would go to, i.e. d_backing_inode() Signed-off-by: David Howells <> Signed-off-by: Al Viro <>
2015-01-25file->f_path.dentry is pinned down for as long as the file is open...Al Viro1-5/+1
Signed-off-by: Al Viro <>
2014-11-19kill f_dentry usesAl Viro1-1/+1
Signed-off-by: Al Viro <>
2014-07-24CAPABILITIES: remove undefined caps from all processesEric Paris1-0/+3
This is effectively a revert of 7b9a7ec565505699f503b4fcf61500dceb36e744 plus fixing it a different way... We found, when trying to run an application from an application which had dropped privs that the kernel does security checks on undefined capability bits. This was ESPECIALLY difficult to debug as those undefined bits are hidden from /proc/$PID/status. Consider a root application which drops all capabilities from ALL 4 capability sets. We assume, since the application is going to set eff/perm/inh from an array that it will clear not only the defined caps less than CAP_LAST_CAP, but also the higher 28ish bits which are undefined future capabilities. The BSET gets cleared differently. Instead it is cleared one bit at a time. The problem here is that in security/commoncap.c::cap_task_prctl() we actually check the validity of a capability being read. So any task which attempts to 'read all things set in bset' followed by 'unset all things set in bset' will not even attempt to unset the undefined bits higher than CAP_LAST_CAP. So the 'parent' will look something like: CapInh: 0000000000000000 CapPrm: 0000000000000000 CapEff: 0000000000000000 CapBnd: ffffffc000000000 All of this 'should' be fine. Given that these are undefined bits that aren't supposed to have anything to do with permissions. But they do... So lets now consider a task which cleared the eff/perm/inh completely and cleared all of the valid caps in the bset (but not the invalid caps it couldn't read out of the kernel). We know that this is exactly what the libcap-ng library does and what the go capabilities library does. They both leave you in that above situation if you try to clear all of you capapabilities from all 4 sets. If that root task calls execve() the child task will pick up all caps not blocked by the bset. The bset however does not block bits higher than CAP_LAST_CAP. So now the child task has bits in eff which are not in the parent. These are 'meaningless' undefined bits, but still bits which the parent doesn't have. The problem is now in cred_cap_issubset() (or any operation which does a subset test) as the child, while a subset for valid cap bits, is not a subset for invalid cap bits! So now we set durring commit creds that the child is not dumpable. Given it is 'more priv' than its parent. It also means the parent cannot ptrace the child and other stupidity. The solution here: 1) stop hiding capability bits in status This makes debugging easier! 2) stop giving any task undefined capability bits. it's simple, it you don't put those invalid bits in CAP_FULL_SET you won't get them in init and you won't get them in any other task either. This fixes the cap_issubset() tests and resulting fallout (which made the init task in a docker container untraceable among other things) 3) mask out undefined bits when sys_capset() is called as it might use ~0, ~0 to denote 'all capabilities' for backward/forward compatibility. This lets 'capsh --caps="all=eip" -- -c /bin/bash' run. 4) mask out undefined bit when we read a file capability off of disk as again likely all bits are set in the xattr for forward/backward compatibility. This lets 'setcap all+pe /bin/bash; /bin/bash' run Signed-off-by: Eric Paris <> Reviewed-by: Kees Cook <> Cc: Andrew Vagin <> Cc: Andrew G. Morgan <> Cc: Serge E. Hallyn <> Cc: Kees Cook <> Cc: Steve Grubb <> Cc: Dan Walsh <> Cc: Signed-off-by: James Morris <>
2014-07-24commoncap: don't alloc the credential unless needed in cap_task_prctlTetsuo Handa1-42/+30
In function cap_task_prctl(), we would allocate a credential unconditionally and then check if we support the requested function. If not we would release this credential with abort_creds() by using RCU method. But on some archs such as powerpc, the sys_prctl is heavily used to get/set the floating point exception mode. So the unnecessary allocating/releasing of credential not only introduce runtime overhead but also do cause OOM due to the RCU implementation. This patch removes abort_creds() from cap_task_prctl() by calling prepare_creds() only when we need to modify it. Reported-by: Kevin Hao <> Signed-off-by: Tetsuo Handa <> Reviewed-by: Paul Moore <> Acked-by: Serge E. Hallyn <> Reviewed-by: Kees Cook <> Signed-off-by: James Morris <>
2013-08-30capabilities: allow nice if we are privilegedSerge Hallyn1-4/+4
We allow task A to change B's nice level if it has a supserset of B's privileges, or of it has CAP_SYS_NICE. Also allow it if A has CAP_SYS_NICE with respect to B - meaning it is root in the same namespace, or it created B's namespace. Signed-off-by: Serge Hallyn <> Reviewed-by: "Eric W. Biederman" <> Signed-off-by: Eric W. Biederman <>
2013-08-30userns: Allow PR_CAPBSET_DROP in a user namespace.Eric W. Biederman1-1/+1
As the capabilites and capability bounding set are per user namespace properties it is safe to allow changing them with just CAP_SETPCAP permission in the user namespace. Acked-by: Serge Hallyn <> Tested-by: Richard Weinberger <> Signed-off-by: "Eric W. Biederman" <>
2013-02-26kill f_vfsmntAl Viro1-1/+1
very few users left... Signed-off-by: Al Viro <>
2012-12-14Fix cap_capable to only allow owners in the parent user namespace to have caps.Eric W. Biederman1-8/+17
Andy Lutomirski pointed out that the current behavior of allowing the owner of a user namespace to have all caps when that owner is not in a parent user namespace is wrong. Add a test to ensure the owner of a user namespace is in the parent of the user namespace to fix this bug. Thankfully this bug did not apply to the initial user namespace, keeping the mischief that can be caused by this bug quite small. This is bug was introduced in v3.5 by commit 783291e6900 "Simplify the user_namespace by making userns->creator a kuid." But did not matter until the permisions required to create a user namespace were relaxed allowing a user namespace to be created inside of a user namespace. The bug made it possible for the owner of a user namespace to be present in a child user namespace. Since the owner of a user nameapce is granted all capabilities it became possible for users in a grandchild user namespace to have all privilges over their parent user namspace. Reorder the checks in cap_capable. This should make the common case faster and make it clear that nothing magic happens in the initial user namespace. The reordering is safe because cred->user_ns can only be in targ_ns or targ_ns->parent but not both. Add a comment a the top of the loop to make the logic of the code clear. Add a distinct variable ns that changes as we walk up the user namespace hierarchy to make it clear which variable is changing. Acked-by: Serge Hallyn <> Signed-off-by: "Eric W. Biederman" <>
2012-05-31split ->file_mmap() into ->mmap_addr()/->mmap_file()Al Viro1-18/+3
... i.e. file-dependent and address-dependent checks. Signed-off-by: Al Viro <>
2012-05-31split cap_mmap_addr() out of cap_file_mmap()Al Viro1-9/+23
... switch callers. Signed-off-by: Al Viro <>
2012-05-23Merge branch 'for-linus' of ↵Linus Torvalds1-25/+36
git:// Pull user namespace enhancements from Eric Biederman: "This is a course correction for the user namespace, so that we can reach an inexpensive, maintainable, and reasonably complete implementation. Highlights: - Config guards make it impossible to enable the user namespace and code that has not been converted to be user namespace safe. - Use of the new kuid_t type ensures the if you somehow get past the config guards the kernel will encounter type errors if you enable user namespaces and attempt to compile in code whose permission checks have not been updated to be user namespace safe. - All uids from child user namespaces are mapped into the initial user namespace before they are processed. Removing the need to add an additional check to see if the user namespace of the compared uids remains the same. - With the user namespaces compiled out the performance is as good or better than it is today. - For most operations absolutely nothing changes performance or operationally with the user namespace enabled. - The worst case performance I could come up with was timing 1 billion cache cold stat operations with the user namespace code enabled. This went from 156s to 164s on my laptop (or 156ns to 164ns per stat operation). - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value. Most uid/gid setting system calls treat these value specially anyway so attempting to use -1 as a uid would likely cause entertaining failures in userspace. - If setuid is called with a uid that can not be mapped setuid fails. I have looked at sendmail, login, ssh and every other program I could think of that would call setuid and they all check for and handle the case where setuid fails. - If stat or a similar system call is called from a context in which we can not map a uid we lie and return overflowuid. The LFS experience suggests not lying and returning an error code might be better, but the historical precedent with uids is different and I can not think of anything that would break by lying about a uid we can't map. - Capabilities are localized to the current user namespace making it safe to give the initial user in a user namespace all capabilities. My git tree covers all of the modifications needed to convert the core kernel and enough changes to make a system bootable to runlevel 1." Fix up trivial conflicts due to nearby independent changes in fs/stat.c * 'for-linus' of git:// (46 commits) userns: Silence silly gcc warning. cred: use correct cred accessor with regards to rcu read lock userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq userns: Convert cgroup permission checks to use uid_eq userns: Convert tmpfs to use kuid and kgid where appropriate userns: Convert sysfs to use kgid/kuid where appropriate userns: Convert sysctl permission checks to use kuid and kgids. userns: Convert proc to use kuid/kgid where appropriate userns: Convert ext4 to user kuid/kgid where appropriate userns: Convert ext3 to use kuid/kgid where appropriate userns: Convert ext2 to use kuid/kgid where appropriate. userns: Convert devpts to use kuid/kgid where appropriate userns: Convert binary formats to use kuid/kgid where appropriate userns: Add negative depends on entries to avoid building code that is userns unsafe userns: signal remove unnecessary map_cred_ns userns: Teach inode_capable to understand inodes whose uids map to other namespaces. userns: Fail exec for suid and sgid binaries with ids outside our user namespace. userns: Convert stat to return values mapped from kuids and kgids userns: Convert user specfied uids and gids in chown into kuids and kgid userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs ...
2012-05-04Merge tag 'v3.4-rc5' into nextJames Morris1-0/+6
Linux 3.4-rc5 Merge to pull in prerequisite change for Smack: 86812bb0de1a3758dc6c7aa01a763158a7c0638a Requested by Casey.
2012-05-03userns: Convert capabilities related permsion checksEric W. Biederman1-15/+26
- Use uid_eq when comparing kuids Use gid_eq when comparing kgids - Use make_kuid(user_ns, 0) to talk about the user_namespace root uid Acked-by: Serge Hallyn <> Signed-off-by: Eric W. Biederman <>
2012-05-03userns: Store uid and gid values in struct cred with kuid_t and kgid_t typesEric W. Biederman1-2/+1
cred.h and a few trivial users of struct cred are changed. The rest of the users of struct cred are left for other patches as there are too many changes to make in one go and leave the change reviewable. If the user namespace is disabled and CONFIG_UIDGID_STRICT_TYPE_CHECKS are disabled the code will contiue to compile and behave correctly. Acked-by: Serge Hallyn <> Signed-off-by: Eric W. Biederman <>
2012-04-26userns: Simplify the user_namespace by making userns->creator a kuid.Eric W. Biederman1-2/+3
- Transform userns->creator from a user_struct reference to a simple kuid_t, kgid_t pair. In cap_capable this allows the check to see if we are the creator of a namespace to become the classic suser style euid permission check. This allows us to remove the need for a struct cred in the mapping functions and still be able to dispaly the user namespace creators uid and gid as 0. - Remove the now unnecessary delayed_work in free_user_ns. All that is left for free_user_ns to do is to call kmem_cache_free and put_user_ns. Those functions can be called in any context so call them directly from free_user_ns removing the need for delayed work. Acked-by: Serge Hallyn <> Signed-off-by: Eric W. Biederman <>
2012-04-19security: fix compile error in commoncap.cJonghwan Choi1-0/+1
Add missing "personality.h" security/commoncap.c: In function 'cap_bprm_set_creds': security/commoncap.c:510: error: 'PER_CLEAR_ON_SETID' undeclared (first use in this function) security/commoncap.c:510: error: (Each undeclared identifier is reported only once security/commoncap.c:510: error: for each function it appears in.) Signed-off-by: Jonghwan Choi <> Acked-by: Serge Hallyn <> Signed-off-by: James Morris <>
2012-04-18fcaps: clear the same personality flags as suid when fcaps are usedEric Paris1-0/+5
If a process increases permissions using fcaps all of the dangerous personality flags which are cleared for suid apps should also be cleared. Thus programs given priviledge with fcaps will continue to have address space randomization enabled even if the parent tried to disable it to make it easier to attack. Signed-off-by: Eric Paris <> Reviewed-by: Serge Hallyn <> Signed-off-by: James Morris <>
2012-04-14Add PR_{GET,SET}_NO_NEW_PRIVS to prevent execve from granting privsAndy Lutomirski1-2/+5
With this change, calling prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) disables privilege granting operations at execve-time. For example, a process will not be able to execute a setuid binary to change their uid or gid if this bit is set. The same is true for file capabilities. Additionally, LSM_UNSAFE_NO_NEW_PRIVS is defined to ensure that LSMs respect the requested behavior. To determine if the NO_NEW_PRIVS bit is set, a task may call prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0); It returns 1 if set and 0 if it is not set. If any of the arguments are non-zero, it will return -1 and set errno to -EINVAL. (PR_SET_NO_NEW_PRIVS behaves similarly.) This functionality is desired for the proposed seccomp filter patch series. By using PR_SET_NO_NEW_PRIVS, it allows a task to modify the system call behavior for itself and its child tasks without being able to impact the behavior of a more privileged task. Another potential use is making certain privileged operations unprivileged. For example, chroot may be considered "safe" if it cannot affect privileged tasks. Note, this patch causes execve to fail when PR_SET_NO_NEW_PRIVS is set and AppArmor is in use. It is fixed in a subsequent patch. Signed-off-by: Andy Lutomirski <> Signed-off-by: Will Drewry <> Acked-by: Eric Paris <> Acked-by: Kees Cook <> v18: updated change desc v17: using new define values as per 3.4 Signed-off-by: James Morris <>
2012-04-07userns: Add an explicit reference to the parent user namespaceEric W. Biederman1-1/+1
I am about to remove the struct user_namespace reference from struct user_struct. So keep an explicit track of the parent user namespace. Take advantage of this new reference and replace instances of user_ns->creator->user_ns with user_ns->parent. Acked-by: Serge Hallyn <> Signed-off-by: Eric W. Biederman <>
2012-04-07userns: Use cred->user_ns instead of cred->user->user_nsEric W. Biederman1-7/+7
Optimize performance and prepare for the removal of the user_ns reference from user_struct. Remove the slow long walk through cred->user->user_ns and instead go straight to cred->user_ns. Acked-by: Serge Hallyn <> Signed-off-by: Eric W. Biederman <>
2012-02-14security: trim security.hAl Viro1-0/+1
Trim security.h Signed-off-by: Al Viro <> Signed-off-by: James Morris <>
2012-01-14Merge branch 'for-linus' of git:// Torvalds1-17/+7
* 'for-linus' of git:// capabilities: remove __cap_full_set definition security: remove the security_netlink_recv hook as it is equivalent to capable() ptrace: do not audit capability check when outputing /proc/pid/stat capabilities: remove task_ns_* functions capabitlies: ns_capable can use the cap helpers rather than lsm call capabilities: style only - move capable below ns_capable capabilites: introduce new has_ns_capabilities_noaudit capabilities: call has_ns_capability from has_capability capabilities: remove all _real_ interfaces capabilities: introduce security_capable_noaudit capabilities: reverse arguments to security_capable capabilities: remove the task from capable LSM hook entirely selinux: sparse fix: fix several warnings in the security server cod selinux: sparse fix: fix warnings in netlink code selinux: sparse fix: eliminate warnings for selinuxfs selinux: sparse fix: declare selinux_disable() in security.h selinux: sparse fix: move selinux_complete_init selinux: sparse fix: make selinux_secmark_refcount static SELinux: Fix RCU deref check warning in sel_netport_insert() Manually fix up a semantic mis-merge wrt security_netlink_recv(): - the interface was removed in commit fd7784615248 ("security: remove the security_netlink_recv hook as it is equivalent to capable()") - a new user of it appeared in commit a38f7907b926 ("crypto: Add userspace configuration API") causing no automatic merge conflict, but Eric Paris pointed out the issue.
2012-01-05security: remove the security_netlink_recv hook as it is equivalent to capable()Eric Paris1-8/+0
Once upon a time netlink was not sync and we had to get the effective capabilities from the skb that was being received. Today we instead get the capabilities from the current task. This has rendered the entire purpose of the hook moot as it is now functionally equivalent to the capable() call. Signed-off-by: Eric Paris <>
2012-01-05capabilities: remove the task from capable LSM hook entirelyEric Paris1-9/+7
The capabilities framework is based around credentials, not necessarily the current task. Yet we still passed the current task down into LSMs from the security_capable() LSM hook as if it was a meaningful portion of the security decision. This patch removes the 'generic' passing of current and instead forces individual LSMs to use current explicitly if they think it is appropriate. In our case those LSMs are SELinux and AppArmor. I believe the AppArmor use of current is incorrect, but that is wholely unrelated to this patch. This patch does not change what AppArmor does, it just makes it clear in the AppArmor code that it is doing it. The SELinux code still uses current in it's audit message, which may also be wrong and needs further investigation. Again this is NOT a change, it may have always been wrong, this patch just makes it clear what is happening. Signed-off-by: Eric Paris <>
2011-08-16capabilities: initialize has_capSerge Hallyn1-1/+1
Initialize has_cap in cap_bprm_set_creds() Reported-by: Andrew G. Morgan <> Signed-off-by: Serge Hallyn <> Signed-off-by: James Morris <>
2011-08-12capabilities: do not grant full privs for setuid w/ file caps + no effective ↵Zhi Li1-6/+10
caps A task (when !SECURE_NOROOT) which executes a setuid-root binary will obtain root privileges while executing that binary. If the binary also has effective capabilities set, then only those capabilities will be granted. The rationale is that the same binary can carry both setuid-root and the minimal file capability set, so that on a filesystem not supporting file caps the binary can still be executed with privilege, while on a filesystem supporting file caps it will run with minimal privilege. This special case currently does NOT happen if there are file capabilities but no effective capabilities. Since capability-aware programs can very well start with empty pE but populated pP and move those caps to pE when needed. In other words, if the file has file capabilities but NOT effective capabilities, then we should do the same thing as if there were file capabilities, and not grant full root privileges. This patchset does that. (Changelog by Serge Hallyn). Signed-off-by: Zhi Li <> Acked-by: Serge Hallyn <> Signed-off-by: James Morris <>
2011-04-04capabilities: do not special case exec of initEric Paris1-9/+4
When the global init task is exec'd we have special case logic to make sure the pE is not reduced. There is no reason for this. If init wants to drop it's pE is should be allowed to do so. Remove this special logic. Signed-off-by: Eric Paris <> Acked-by: Serge Hallyn <> Acked-by: David Howells <> Acked-by: Andrew G. Morgan <> Signed-off-by: James Morris <>
2011-03-23userns: allow ptrace from non-init user namespacesSerge E. Hallyn1-8/+32
ptrace is allowed to tasks in the same user namespace according to the usual rules (i.e. the same rules as for two tasks in the init user namespace). ptrace is also allowed to a user namespace to which the current task the has CAP_SYS_PTRACE capability. Changelog: Dec 31: Address feedback by Eric: . Correct ptrace uid check . Rename may_ptrace_ns to ptrace_capable . Also fix the cap_ptrace checks. Jan 1: Use const cred struct Jan 11: use task_ns_capable() in place of ptrace_capable(). Feb 23: same_or_ancestore_user_ns() was not an appropriate check to constrain cap_issubset. Rather, cap_issubset() only is meaningful when both capsets are in the same user_ns. Signed-off-by: Serge E. Hallyn <> Cc: "Eric W. Biederman" <> Acked-by: Daniel Lezcano <> Acked-by: David Howells <> Cc: James Morris <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2011-03-23userns: security: make capabilities relative to the user namespaceSerge E. Hallyn1-7/+31
- Introduce ns_capable to test for a capability in a non-default user namespace. - Teach cap_capable to handle capabilities in a non-default user namespace. The motivation is to get to the unprivileged creation of new namespaces. It looks like this gets us 90% of the way there, with only potential uid confusion issues left. I still need to handle getting all caps after creation but otherwise I think I have a good starter patch that achieves all of your goals. Changelog: 11/05/2010: [serge] add apparmor 12/14/2010: [serge] fix capabilities to created user namespaces Without this, if user serge creates a user_ns, he won't have capabilities to the user_ns he created. THis is because we were first checking whether his effective caps had the caps he needed and returning -EPERM if not, and THEN checking whether he was the creator. Reverse those checks. 12/16/2010: [serge] security_real_capable needs ns argument in !security case 01/11/2011: [serge] add task_ns_capable helper 01/11/2011: [serge] add nsown_capable() helper per Bastian Blank suggestion 02/16/2011: [serge] fix a logic bug: the root user is always creator of init_user_ns, but should not always have capabilities to it! Fix the check in cap_capable(). 02/21/2011: Add the required user_ns parameter to security_capable, fixing a compile failure. 02/23/2011: Convert some macros to functions as per akpm comments. Some couldn't be converted because we can't easily forward-declare them (they are inline if !SECURITY, extern if SECURITY). Add a current_user_ns function so we can use it in capability.h without #including cred.h. Move all forward declarations together to the top of the #ifdef __KERNEL__ section, and use kernel-doc format. 02/23/2011: Per dhowells, clean up comment in cap_capable(). 02/23/2011: Per akpm, remove unreachable 'return -EPERM' in cap_capable. (Original written and signed off by Eric; latest, modified version acked by him) [ fix build] [ export current_user_ns() for ecryptfs] [ remove unneeded extra argument in selinux's task_has_capability] Signed-off-by: Eric W. Biederman <> Signed-off-by: Serge E. Hallyn <> Acked-by: "Eric W. Biederman" <> Acked-by: Daniel Lezcano <> Acked-by: David Howells <> Cc: James Morris <> Signed-off-by: Serge E. Hallyn <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2011-03-16Merge git:// Torvalds1-2/+1
* git:// (1480 commits) bonding: enable netpoll without checking link status xfrm: Refcount destination entry on xfrm_lookup net: introduce rx_handler results and logic around that bonding: get rid of IFF_SLAVE_INACTIVE netdev->priv_flag bonding: wrap slave state work net: get rid of multiple bond-related netdevice->priv_flags bonding: register slave pointer for rx_handler be2net: Bump up the version number be2net: Copyright notice change. Update to Emulex instead of ServerEngines e1000e: fix kconfig for crc32 dependency netfilter ebtables: fix xt_AUDIT to work with ebtables xen network backend driver bonding: Improve syslog message at device creation time bonding: Call netif_carrier_off after register_netdevice bonding: Incorrect TX queue offset net_sched: fix ip_tos2prio xfrm: fix __xfrm_route_forward() be2net: Fix UDP packet detected status in RX compl Phonet: fix aligned-mode pipe socket buffer header reserve netxen: support for GbE port settings ... Fix up conflicts in drivers/staging/brcm80211/brcmsmac/wl_mac80211.c with the staging updates.
2011-03-03netlink: kill eff_cap from struct netlink_skb_parmsPatrick McHardy1-2/+1
Netlink message processing in the kernel is synchronous these days, capabilities can be checked directly in security_netlink_recv() from the current process. Signed-off-by: Patrick McHardy <> Reviewed-by: James Morris <> [chrisw: update to include pohmelfs and uvesafb] Signed-off-by: Chris Wright <> Signed-off-by: David S. Miller <>
2011-02-02time: Correct the *settime* parametersRichard Cochran1-1/+1
Both settimeofday() and clock_settime() promise with a 'const' attribute not to alter the arguments passed in. This patch adds the missing 'const' attribute into the various kernel functions implementing these calls. Signed-off-by: Richard Cochran <> Acked-by: John Stultz <> LKML-Reference: <> Signed-off-by: Thomas Gleixner <>
2010-11-15capabilities/syslog: open code cap_syslog logic to fix build failureEric Paris1-21/+0
The addition of CONFIG_SECURITY_DMESG_RESTRICT resulted in a build failure when CONFIG_PRINTK=n. This is because the capabilities code which used the new option was built even though the variable in question didn't exist. The patch here fixes this by moving the capabilities checks out of the LSM and into the caller. All (known) LSMs should have been calling the capabilities hook already so it actually makes the code organization better to eliminate the hook altogether. Signed-off-by: Eric Paris <> Acked-by: James Morris <> Signed-off-by: Linus Torvalds <>
2010-11-12Restrict unprivileged access to kernel syslogDan Rosenberg1-0/+2
The kernel syslog contains debugging information that is often useful during exploitation of other vulnerabilities, such as kernel heap addresses. Rather than futilely attempt to sanitize hundreds (or thousands) of printk statements and simultaneously cripple useful debugging functionality, it is far simpler to create an option that prevents unprivileged users from reading the syslog. This patch, loosely based on grsecurity's GRKERNSEC_DMESG, creates the dmesg_restrict sysctl. When set to "0", the default, no restrictions are enforced. When set to "1", only users with CAP_SYS_ADMIN can read the kernel syslog via dmesg(8) or other mechanisms. [ explain the config option in kernel.txt] Signed-off-by: Dan Rosenberg <> Acked-by: Ingo Molnar <> Acked-by: Eugene Teo <> Acked-by: Kees Cook <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2010-10-21security: remove unused parameter from security_task_setscheduler()KOSAKI Motohiro1-4/+1
All security modules shouldn't change sched_param parameter of security_task_setscheduler(). This is not only meaningless, but also make a harmful result if caller pass a static variable. This patch remove policy and sched_param parameter from security_task_setscheduler() becuase none of security module is using it. Cc: James Morris <> Signed-off-by: KOSAKI Motohiro <> Signed-off-by: James Morris <>
2010-08-17Make do_execve() take a const filename pointerDavid Howells1-1/+1
Make do_execve() take a const filename pointer so that kernel_execve() compiles correctly on ARM: arch/arm/kernel/sys_arm.c:88: warning: passing argument 1 of 'do_execve' discards qualifiers from pointer target type This also requires the argv and envp arguments to be consted twice, once for the pointer array and once for the strings the array points to. This is because do_execve() passes a pointer to the filename (now const) to copy_strings_kernel(). A simpler alternative would be to cast the filename pointer in do_execve() when it's passed to copy_strings_kernel(). do_execve() may not change any of the strings it is passed as part of the argv or envp lists as they are some of them in .rodata, so marking these strings as const should be fine. Further kernel_execve() and sys_execve() need to be changed to match. This has been test built on x86_64, frv, arm and mips. Signed-off-by: David Howells <> Tested-by: Ralf Baechle <> Acked-by: Russell King <> Signed-off-by: Linus Torvalds <>
2010-04-23security: whitespace coding style fixesJustin P. Mattock1-2/+2
Whitespace coding style fixes. Signed-off-by: Justin P. Mattock <> Signed-off-by: James Morris <>
2010-04-20Security: Fix the comment of cap_file_mmap()wzt.wzt@gmail.com1-1/+1
In the comment of cap_file_mmap(), replace mmap_min_addr to be dac_mmap_min_addr. Signed-off-by: Zhitong Wang <> Acked-by: Eric Paris <> Signed-off-by: James Morris <>
2010-02-05syslog: clean up needless commentKees Cook1-1/+0
Drop my typoed comment as it is both unhelpful and redundant. Signed-off-by: Kees Cook <> Acked-by: Serge Hallyn <> Signed-off-by: James Morris <>
2010-02-04syslog: use defined constants instead of raw numbersKees Cook1-2/+3
Right now the syslog "type" action are just raw numbers which makes the source difficult to follow. This patch replaces the raw numbers with defined constants for some level of sanity. Signed-off-by: Kees Cook <> Acked-by: John Johansen <> Acked-by: Serge Hallyn <> Signed-off-by: James Morris <>
2010-02-04syslog: distinguish between /proc/kmsg and syscallsKees Cook1-1/+6
This allows the LSM to distinguish between syslog functions originating from /proc/kmsg access and direct syscalls. By default, the commoncaps will now no longer require CAP_SYS_ADMIN to read an opened /proc/kmsg file descriptor. For example the kernel syslog reader can now drop privileges after opening /proc/kmsg, instead of staying privileged with CAP_SYS_ADMIN. MAC systems that implement security_syslog have unchanged behavior. Signed-off-by: Kees Cook <> Acked-by: Serge Hallyn <> Acked-by: John Johansen <> Signed-off-by: James Morris <>
2009-11-24remove CONFIG_SECURITY_FILE_CAPABILITIES compile optionSerge E. Hallyn1-70/+2
As far as I know, all distros currently ship kernels with default CONFIG_SECURITY_FILE_CAPABILITIES=y. Since having the option on leaves a 'no_file_caps' option to boot without file capabilities, the main reason to keep the option is that turning it off saves you (on my s390x partition) 5k. In particular, vmlinux sizes came to: without patch fscaps=n: 53598392 without patch fscaps=y: 53603406 with this patch applied: 53603342 with the security-next tree. Against this we must weigh the fact that there is no simple way for userspace to figure out whether file capabilities are supported, while things like per-process securebits, capability bounding sets, and adding bits to pI if CAP_SETPCAP is in pE are not supported with SECURITY_FILE_CAPABILITIES=n, leaving a bit of a problem for applications wanting to know whether they can use them and/or why something failed. It also adds another subtly different set of semantics which we must maintain at the risk of severe security regressions. So this patch removes the SECURITY_FILE_CAPABILITIES compile option. It drops the kernel size by about 50k over the stock SECURITY_FILE_CAPABILITIES=y kernel, by removing the cap_limit_ptraced_target() function. Changelog: Nov 20: remove cap_limit_ptraced_target() as it's logic was ifndef'ed. Signed-off-by: Serge E. Hallyn <> Acked-by: Andrew G. Morgan" <> Signed-off-by: James Morris <>