Skip to content

Commit 3bf3ac1

Browse files
braunerjosefbacik
authored andcommitted
fs: don't block i_writecount during exec
Back in 2021 we already discussed removing deny_write_access() for executables. Back then I was hesistant because I thought that this might cause issues in userspace. But even back then I had started taking some notes on what could potentially depend on this and I didn't come up with a lot so I've changed my mind and I would like to try this. Here are some of the notes that I took: (1) The deny_write_access() mechanism is causing really pointless issues such as [1]. If a thread in a thread-group opens a file writable, then writes some stuff, then closing the file descriptor and then calling execve() they can fail the execve() with ETXTBUSY because another thread in the thread-group could have concurrently called fork(). Multi-threaded libraries such as go suffer from this. (2) There are userspace attacks that rely on overwriting the binary of a running process. These attacks are _mitigated_ but _not at all prevented_ from ocurring by the deny_write_access() mechanism. I'll go over some details. The clearest example of such attacks was the attack against runC in CVE-2019-5736 (cf. [3]). An attack could compromise the runC host binary from inside a _privileged_ runC container. The malicious binary could then be used to take over the host. (It is crucial to note that this attack is _not_ possible with unprivileged containers. IOW, the setup here is already insecure.) The attack can be made when attaching to a running container or when starting a container running a specially crafted image. For example, when runC attaches to a container the attacker can trick it into executing itself. This could be done by replacing the target binary inside the container with a custom binary pointing back at the runC binary itself. As an example, if the target binary was /bin/bash, this could be replaced with an executable script specifying the interpreter path #!/proc/self/exe. As such when /bin/bash is executed inside the container, instead the target of /proc/self/exe will be executed. That magic link will point to the runc binary on the host. The attacker can then proceed to write to the target of /proc/self/exe to try and overwrite the runC binary on the host. However, this will not succeed because of deny_write_access(). Now, one might think that this would prevent the attack but it doesn't. To overcome this, the attacker has multiple ways: * Open a file descriptor to /proc/self/exe using the O_PATH flag and then proceed to reopen the binary as O_WRONLY through /proc/self/fd/<nr> and try to write to it in a busy loop from a separate process. Ultimately it will succeed when the runC binary exits. After this the runC binary is compromised and can be used to attack other containers or the host itself. * Use a malicious shared library annotating a function in there with the constructor attribute making the malicious function run as an initializor. The malicious library will then open /proc/self/exe for creating a new entry under /proc/self/fd/<nr>. It'll then call exec to a) force runC to exit and b) hand the file descriptor off to a program that then reopens /proc/self/fd/<nr> for writing (which is now possible because runC has exited) and overwriting that binary. To sum up: the deny_write_access() mechanism doesn't prevent such attacks in insecure setups. It just makes them minimally harder. That's all. The only way back then to prevent this is to create a temporary copy of the calling binary itself when it starts or attaches to containers. So what I did back then for LXC (and Aleksa for runC) was to create an anonymous, in-memory file using the memfd_create() system call and to copy itself into the temporary in-memory file, which is then sealed to prevent further modifications. This sealed, in-memory file copy is then executed instead of the original on-disk binary. Any compromising write operations from a privileged container to the host binary will then write to the temporary in-memory binary and not to the host binary on-disk, preserving the integrity of the host binary. Also as the temporary, in-memory binary is sealed, writes to this will also fail. The point is that deny_write_access() is uselss to prevent these attacks. (3) Denying write access to an inode because it's currently used in an exec path could easily be done on an LSM level. It might need an additional hook but that should be about it. (4) The MAP_DENYWRITE flag for mmap() has been deprecated a long time ago so while we do protect the main executable the bigger portion of the things you'd think need protecting such as the shared libraries aren't. IOW, we let anyone happily overwrite shared libraries. (5) We removed all remaining uses of VM_DENYWRITE in [2]. That means: (5.1) We removed the legacy uselib() protection for preventing overwriting of shared libraries. Nobody cared in 3 years. (5.2) We allow write access to the elf interpreter after exec completed treating it on a par with shared libraries. Yes, someone in userspace could potentially be relying on this. It's not completely out of the realm of possibility but let's find out if that's actually the case and not guess. Link: golang/go#22315 [1] Link: 49624ef ("Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux") [2] Link: https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736 [3] Link: https://lwn.net/Articles/866493 Link: golang/go#22220 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/buildid.go#L724 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/exec.go#L1493 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/script/cmds.go#L457 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/test/test.go#L1557 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/os/exec/lp_linux_test.go#L61 Link: buildkite/agent#2736 Link: rust-lang/rust#114554 Link: https://bugs.openjdk.org/browse/JDK-8068370 Link: dotnet/runtime#58964 Signed-off-by: Christian Brauner <brauner@kernel.org>
1 parent e0cce98 commit 3bf3ac1

File tree

5 files changed

+9
-46
lines changed

5 files changed

+9
-46
lines changed

fs/binfmt_elf.c

-2
Original file line numberDiff line numberDiff line change
@@ -1216,7 +1216,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
12161216
}
12171217
reloc_func_desc = interp_load_addr;
12181218

1219-
allow_write_access(interpreter);
12201219
fput(interpreter);
12211220

12221221
kfree(interp_elf_ex);
@@ -1308,7 +1307,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
13081307
kfree(interp_elf_ex);
13091308
kfree(interp_elf_phdata);
13101309
out_free_file:
1311-
allow_write_access(interpreter);
13121310
if (interpreter)
13131311
fput(interpreter);
13141312
out_free_ph:

fs/binfmt_elf_fdpic.c

+1-4
Original file line numberDiff line numberDiff line change
@@ -394,7 +394,6 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
394394
goto error;
395395
}
396396

397-
allow_write_access(interpreter);
398397
fput(interpreter);
399398
interpreter = NULL;
400399
}
@@ -466,10 +465,8 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
466465
retval = 0;
467466

468467
error:
469-
if (interpreter) {
470-
allow_write_access(interpreter);
468+
if (interpreter)
471469
fput(interpreter);
472-
}
473470
kfree(interpreter_name);
474471
kfree(exec_params.phdrs);
475472
kfree(exec_params.loadmap);

fs/binfmt_misc.c

+2-5
Original file line numberDiff line numberDiff line change
@@ -247,13 +247,10 @@ static int load_misc_binary(struct linux_binprm *bprm)
247247
if (retval < 0)
248248
goto ret;
249249

250-
if (fmt->flags & MISC_FMT_OPEN_FILE) {
250+
if (fmt->flags & MISC_FMT_OPEN_FILE)
251251
interp_file = file_clone_open(fmt->interp_file);
252-
if (!IS_ERR(interp_file))
253-
deny_write_access(interp_file);
254-
} else {
252+
else
255253
interp_file = open_exec(fmt->interpreter);
256-
}
257254
retval = PTR_ERR(interp_file);
258255
if (IS_ERR(interp_file))
259256
goto ret;

fs/exec.c

+3-12
Original file line numberDiff line numberDiff line change
@@ -951,11 +951,6 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
951951
if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode) ||
952952
path_noexec(&file->f_path)))
953953
goto exit;
954-
955-
err = deny_write_access(file);
956-
if (err)
957-
goto exit;
958-
959954
out:
960955
return file;
961956

@@ -971,8 +966,7 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
971966
*
972967
* Returns ERR_PTR on failure or allocated struct file on success.
973968
*
974-
* As this is a wrapper for the internal do_open_execat(), callers
975-
* must call allow_write_access() before fput() on release. Also see
969+
* As this is a wrapper for the internal do_open_execat(). Also see
976970
* do_close_execat().
977971
*/
978972
struct file *open_exec(const char *name)
@@ -1524,10 +1518,8 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
15241518
/* Matches do_open_execat() */
15251519
static void do_close_execat(struct file *file)
15261520
{
1527-
if (!file)
1528-
return;
1529-
allow_write_access(file);
1530-
fput(file);
1521+
if (file)
1522+
fput(file);
15311523
}
15321524

15331525
static void free_bprm(struct linux_binprm *bprm)
@@ -1846,7 +1838,6 @@ static int exec_binprm(struct linux_binprm *bprm)
18461838
bprm->file = bprm->interpreter;
18471839
bprm->interpreter = NULL;
18481840

1849-
allow_write_access(exec);
18501841
if (unlikely(bprm->have_execfd)) {
18511842
if (bprm->executable) {
18521843
fput(exec);

kernel/fork.c

+3-23
Original file line numberDiff line numberDiff line change
@@ -616,12 +616,6 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
616616

617617
exe_file = get_mm_exe_file(oldmm);
618618
RCU_INIT_POINTER(mm->exe_file, exe_file);
619-
/*
620-
* We depend on the oldmm having properly denied write access to the
621-
* exe_file already.
622-
*/
623-
if (exe_file && deny_write_access(exe_file))
624-
pr_warn_once("deny_write_access() failed in %s\n", __func__);
625619
}
626620

627621
#ifdef CONFIG_MMU
@@ -1412,20 +1406,11 @@ int set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
14121406
*/
14131407
old_exe_file = rcu_dereference_raw(mm->exe_file);
14141408

1415-
if (new_exe_file) {
1416-
/*
1417-
* We expect the caller (i.e., sys_execve) to already denied
1418-
* write access, so this is unlikely to fail.
1419-
*/
1420-
if (unlikely(deny_write_access(new_exe_file)))
1421-
return -EACCES;
1409+
if (new_exe_file)
14221410
get_file(new_exe_file);
1423-
}
14241411
rcu_assign_pointer(mm->exe_file, new_exe_file);
1425-
if (old_exe_file) {
1426-
allow_write_access(old_exe_file);
1412+
if (old_exe_file)
14271413
fput(old_exe_file);
1428-
}
14291414
return 0;
14301415
}
14311416

@@ -1464,9 +1449,6 @@ int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
14641449
return ret;
14651450
}
14661451

1467-
ret = deny_write_access(new_exe_file);
1468-
if (ret)
1469-
return -EACCES;
14701452
get_file(new_exe_file);
14711453

14721454
/* set the new file */
@@ -1475,10 +1457,8 @@ int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
14751457
rcu_assign_pointer(mm->exe_file, new_exe_file);
14761458
mmap_write_unlock(mm);
14771459

1478-
if (old_exe_file) {
1479-
allow_write_access(old_exe_file);
1460+
if (old_exe_file)
14801461
fput(old_exe_file);
1481-
}
14821462
return 0;
14831463
}
14841464

0 commit comments

Comments
 (0)