Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Command::spawn on a newly-written file can fail with ETXTBSY due to racing with itself on Unix #114554

Open
SabrinaJewson opened this issue Aug 6, 2023 · 6 comments
Labels
A-process Area: `std::process` and `std::env` C-bug Category: This is a bug. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@SabrinaJewson
Copy link
Contributor

SabrinaJewson commented Aug 6, 2023

The following code:

fn main() {
    thread::spawn(|| loop {
        process::Command::new("/bin/true").status().unwrap();
    });

    for _ in 0..20 {
        let mut file = File::create("/tmp/executable.sh").unwrap();
        file.write_all(b"#!/bin/true\n").unwrap();

        let mode = 0b111_101_101;
        file.set_permissions(Permissions::from_mode(mode)).unwrap();

        drop(file);

        process::Command::new("/tmp/executable.sh")
            .status()
            .unwrap();
    }
}

use std::fs::File;
use std::fs::Permissions;
use std::io::Write as _;
use std::os::unix::prelude::PermissionsExt as _;
use std::process;
use std::thread;

Will almost immediately panic (on Linux) with:

$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.00s
     Running `target/debug/example`
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 26, kind: ExecutableFileBusy, message: "Text file busy" }', src/main.rs:17:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
$ rustc --version --verbose
rustc 1.71.0 (8ede3aae2 2023-07-12)
binary: rustc
commit-hash: 8ede3aae28fe6e4d52b38157d7bfe0d3bceef225
commit-date: 2023-07-12
host: x86_64-unknown-linux-gnu
release: 1.71.0
LLVM version: 16.0.5

(verison info is included for completeness, but this error will occur on all existing versions of Rust).

This is due to the following sequence of events occuring:

  1. The main thread opens the file descriptor for writing into /tmp/executable.sh.
  2. The spawned thread calls fork(2), inheriting the file descriptor that has write access to /tmp/executable.sh.
  3. The main thread closes its copy of the file descriptor — but note that it is not closed in the child process.
  4. The main thread then attempts to execute the file. However, because there exists a process with a file descriptor open that has write access to the file, it fails with ETXTBSY and the process panics.

After this, the spawned process will run execve and the presence of O_CLOEXEC on the file descriptor will cause it to be closed as we desire. However, that is too late: the damage has already been done.

An “ideal” solution: O_CLOFORK

The ideal solution here is for the kernel to support a file-opening flag analogous to O_CLOEXEC, O_CLOFORK, which causes the file descriptor to be closed when fork(2) is called, preventing it from beïng leaked to child processes. The feature has been accepted into POSIX and allegedly exists on AIX, *BSD, Solaris and macOS. However, it is not in Linux despite multiple suggestions:

Userspace Solutions

Wrapping calls to Command::spawn in a mutex

One simple userspace solution to this is to wrap all calls to Command::spawn in a mutex. This prevents the bug by ensuring that calls to Command::spawn cannot occur in between another process calling spawn and execve, as Command::spawn only returns after the cloexec event has occurred. However, this can be quite invasive to an existing codebase, and implementing this in Command::spawn itself may lose performance.

A potential alternative is to build in to Command::spawn a special synchronization primitive that allows for a parallelism-supporting fast-path, only falling back to the slow path when an ETXTBSY error is encountered.

The flocking algorithm

An alternative solution that only affects the parts of the codebase that write to a file and then execute it is to follow the following algorithm:

  1. Open the file with write permissions and write the contents
  2. Lock the file with exclusive access (note that because of how flock(2) works, the resulting lock guard attached to the file descriptor is shared with the file descriptor in any child process; this means that there is no race condition even if the fork happens between the previous step and this one)
  3. Close the file
  4. Reopen the file with read permissions
  5. Lock the file with exclusive or shared access — if any existing writeäble file descriptor is currently open, this will wait for it to close
  6. Close the file again
  7. Call Command::spawn. A fork may still hold an open file descriptor to the file, but it is guaranteed to be read-only and so cannot cause ETXTBSY errors
Implementation of the `flock` algorithm

Note that for simplicity error handling is omitted; in a real implementation you’d check for errors after calling flock.

fn main() {
    thread::spawn(|| loop {
        process::Command::new("/bin/true").status().unwrap();
    });

    for _ in 0..20 {
        let mut file = File::create("/tmp/executable.sh").unwrap();
        file.write_all(b"#!/bin/true\n").unwrap();

        let mode = 0b111_101_101;
        file.set_permissions(Permissions::from_mode(mode)).unwrap();

        std::thread::sleep(std::time::Duration::from_micros(2));
        unsafe { libc::flock(file.as_raw_fd(), libc::LOCK_EX) };

        drop(file);

        let file = File::open("/tmp/executable.sh").unwrap();
        unsafe { libc::flock(file.as_raw_fd(), libc::LOCK_SH) };
        drop(file);

        process::Command::new("/tmp/executable.sh")
            .status()
            .unwrap();
    }
}

use std::fs::File;
use std::fs::Permissions;
use std::io::Write as _;
use std::os::fd::AsRawFd as _;
use std::os::unix::prelude::PermissionsExt as _;
use std::process;
use std::thread;

Given that this algorithm will be needed in relatively rare scenarios, it is unlikely we would want to add it to the standard library anywhere. However, it may be useful to add a note about this in the documentation.

Waiting for an anonymous pipe to close

A somewhat simplified version of the above algorithm involves:

  1. Opening an anonymous O_CLOEXEC pipe before opening the file.
  2. Closing the writer end of the pipe between closing the file and running Command::spawn.
  3. Waiting for an EOF on the reader end of the pipe directly after that.

Since the pipe writer file descriptor’s lifetime definitely exceeds the lifetime of the file’s file descriptor, any fork(2) call that inherits the file’s file descriptor will also inherit the pipe writer. Thus, by refusing to continue until the pipe writer has been closed we also refuse to continue until the file’s file descriptor has been closed.

Note that this trick works based on the assumption that the pipe EOF delivery will happen-after all the CLOEXECs have finished taking effect. I am uncertain of whether this is a guarantee made by Linux.

Implementation of the anonymous pipe trick
fn main() {
    thread::spawn(|| loop {
        process::Command::new("/bin/true").status().unwrap();
    });

    for _ in 0..100 {
        let [reader, writer] = pipe().unwrap();
        let mut reader = File::from(reader);

        let mut file = File::create("/tmp/executable.sh").unwrap();
        file.write_all(b"#!/bin/true\n").unwrap();

        let mode = 0b111_101_101;
        file.set_permissions(Permissions::from_mode(mode)).unwrap();

        drop(file);
        drop(writer);

        loop {
            match reader.read(&mut [0]) {
                Ok(0) => break,
                Err(e) if e.kind() == io::ErrorKind::Interrupted => {}
                _ => panic!(),
            }
        }

        process::Command::new("/tmp/executable.sh")
            .status()
            .unwrap();
    }
}

fn pipe() -> io::Result<[OwnedFd; 2]> {
    let mut fds = [0; 2];
    if unsafe { libc::pipe2(fds.as_mut_ptr(), libc::O_CLOEXEC) } != 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(fds.map(|fd| unsafe { OwnedFd::from_raw_fd(fd) }))
}

use std::fs::File;
use std::fs::Permissions;
use std::io;
use std::io::Write as _;
use std::os::fd::FromRawFd as _;
use std::os::fd::OwnedFd;
use std::os::unix::prelude::PermissionsExt as _;
use std::process;
use std::thread;
use std::io::Read as _;

Sleeping or spinlooping on ETXTBSY

One possible, but hacky, solution is to simply sleep when Command::spawn encounters an ETXTBSY error. The sleeping version was used by the go command from 2012 and removed after it was no longer necessary in 2017. The spinning version has been used elsewhere in Go since 2022.

Performing the write inside a fork

This is an easy and safe solution more oriënted for usage by applications. Simply perform all the writing in a child process: as long as the main process never opens a file descriptor to the executable file, nothing can go wrong.

let mut writer = process::Command::new("sh")
    .args(["-c", "cat >/tmp/executable.sh && chmod +x /tmp/executable.sh"])
    .stdin(process::Stdio::piped())
    .spawn()?;
writer.stdin.take().unwrap().write_all(b"#!/bin/sh\n")?;
writer.wait()?.exit_ok()?;

Other languages with this problem

@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Aug 6, 2023
@Noratrieb Noratrieb added C-bug Category: This is a bug. T-libs Relevant to the library team, which will review and decide on the PR/issue. A-process Area: `std::process` and `std::env` and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Aug 6, 2023
@the8472
Copy link
Member

the8472 commented Aug 7, 2023

Another possible solution on recent linux kernels is to use io_uring and open/write/close the file as a ring-registered file rather than through file descriptors. This way the file descriptor will be owned by the ring, not by the process's file table. Closing it on the ring will close it across forks too since the forks will share the same ring.

@bjorn3
Copy link
Member

bjorn3 commented Aug 15, 2023

It has been suggested to remove ETXTBUSY entirely from Linux: https://lwn.net/Articles/866493/ I don't know if there has been any movement on that front since.

github-merge-queue bot pushed a commit to flox/flox that referenced this issue Feb 13, 2024
Resolve issues with rust unit tests in
`models::container_builder::tests::*`.

The container builder tests spuriously failed with `Text file busy`:

```
thread 'models::container_builder::tests::test_writes_output_to_writer' panicked at flox-rust-sdk/src/models/container_builder.rs:81:54:
called `Result::unwrap()` on an `Err` value: CallContainerBuilder(Os { code: 26, kind: ExecutableFileBusy, message: "Text file busy" })
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: core::result::Result<T,E>::unwrap
             at /build/rustc-1.73.0-src/library/core/src/result.rs:1077:23
   4: flox_rust_sdk::models::container_builder::tests::test_writes_output_to_writer
             at ./src/models/container_builder.rs:81:9
   5: flox_rust_sdk::models::container_builder::tests::test_writes_output_to_writer::{{closure}}
             at ./src/models/container_builder.rs:76:39
   6: core::ops::function::FnOnce::call_once
             at /build/rustc-1.73.0-src/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

```

`Text file busy (26)` may occur when executing a script 
that is has been written to immediately prior.
This is a known bug in rust (and allegedly other languages)
which is tracked in rust-lang/rust#114554.

We typically see this error in tests, where we write a new container
builder script
and immediately execute it within the same process:


https://github.com/flox/flox/blob/main/cli/flox-rust-sdk/src/models/container_builder.rs#L76-L83


In production use, this should not be a problem as the script will be
written
by a different process, i.e. `pkgdb`.

---------

Co-authored-by: Matthew Kenigsberg <matthew@floxdev.com>
@pgerber
Copy link
Contributor

pgerber commented Apr 9, 2024

One workaround, for scripts, is to not rely on shebang (e.g. #!/bin/sh).

So, if you have this ./script:

#!/bin/sh
echo "hello world"

You could simply call /bin/sh directly:

fn main() {
    std::process::Command::new("/bin/sh")
        .arg("./script")
        .status()
        .unwrap();
}

Only works if you know what the shebang will be, or you parse it, but seems like an easy solution in many cases. If we call the interpreter (e.g. /bin/sh) directly, the check if the executable is open for writing, which triggers the ETXTBSY, is done on said interpreter and not the script.

@pgerber
Copy link
Contributor

pgerber commented Apr 10, 2024

Another possible solution on recent linux kernels is to use io_uring and open/write/close the file as a ring-registered file rather than through file descriptors. This way the file descriptor will be owned by the ring, not by the process's file table. Closing it on the ring will close it across forks too since the forks will share the same ring.

There is a sysctl knob to disable uring for security reasons. As result,I expect uring to be unavailable on various hardened systems. Google has indicated that uring is disabled on their servers, Chrome books and on Android, and I would also expect it to be unavailable in sandboxed processes.

brauner added a commit to brauner/linux that referenced this issue Jun 3, 2024
Back in 2021 we already discussed removing deny_write_access() for
executables. Back then I was hesistant because I thought that this might
cause issues in userspace. But even back then I had started taking some
notes on what could potentially depend on this and I didn't come up with
a lot so I've changed my mind and I would like to try this.

Here are some of the notes that I took:

(1) The deny_write_access() mechanism is causing really pointless issues
    such as [1]. If a thread in a thread-group opens a file writable,
    then writes some stuff, then closing the file descriptor and then
    calling execve() they can fail the execve() with ETXTBUSY because
    another thread in the thread-group could have concurrently called
    fork(). Multi-threaded libraries such as go suffer from this.

(2) There are userspace attacks that rely on overwriting the binary of a
    running process. These attacks are _mitigated_ but _not at all
    prevented_ from ocurring by the deny_write_access() mechanism.

    I'll go over some details. The clearest example of such attacks was
    the attack against runC in CVE-2019-5736 (cf. [3]).

    An attack could compromise the runC host binary from inside a
    _privileged_ runC container. The malicious binary could then be used
    to take over the host.

    (It is crucial to note that this attack is _not_ possible with
     unprivileged containers. IOW, the setup here is already insecure.)

    The attack can be made when attaching to a running container or when
    starting a container running a specially crafted image. For example,
    when runC attaches to a container the attacker can trick it into
    executing itself.

    This could be done by replacing the target binary inside the
    container with a custom binary pointing back at the runC binary
    itself. As an example, if the target binary was /bin/bash, this
    could be replaced with an executable script specifying the
    interpreter path #!/proc/self/exe.

    As such when /bin/bash is executed inside the container, instead the
    target of /proc/self/exe will be executed. That magic link will
    point to the runc binary on the host. The attacker can then proceed
    to write to the target of /proc/self/exe to try and overwrite the
    runC binary on the host.

    However, this will not succeed because of deny_write_access(). Now,
    one might think that this would prevent the attack but it doesn't.

    To overcome this, the attacker has multiple ways:
    * Open a file descriptor to /proc/self/exe using the O_PATH flag and
      then proceed to reopen the binary as O_WRONLY through
      /proc/self/fd/<nr> and try to write to it in a busy loop from a
      separate process. Ultimately it will succeed when the runC binary
      exits. After this the runC binary is compromised and can be used
      to attack other containers or the host itself.
    * Use a malicious shared library annotating a function in there with
      the constructor attribute making the malicious function run as an
      initializor. The malicious library will then open /proc/self/exe
      for creating a new entry under /proc/self/fd/<nr>. It'll then call
      exec to a) force runC to exit and b) hand the file descriptor off
      to a program that then reopens /proc/self/fd/<nr> for writing
      (which is now possible because runC has exited) and overwriting
      that binary.

    To sum up: the deny_write_access() mechanism doesn't prevent such
    attacks in insecure setups. It just makes them minimally harder.
    That's all.

    The only way back then to prevent this is to create a temporary copy
    of the calling binary itself when it starts or attaches to
    containers. So what I did back then for LXC (and Aleksa for runC)
    was to create an anonymous, in-memory file using the memfd_create()
    system call and to copy itself into the temporary in-memory file,
    which is then sealed to prevent further modifications. This sealed,
    in-memory file copy is then executed instead of the original on-disk
    binary.

    Any compromising write operations from a privileged container to the
    host binary will then write to the temporary in-memory binary and
    not to the host binary on-disk, preserving the integrity of the host
    binary. Also as the temporary, in-memory binary is sealed, writes to
    this will also fail.

    The point is that deny_write_access() is uselss to prevent these
    attacks.

(3) Denying write access to an inode because it's currently used in an
    exec path could easily be done on an LSM level. It might need an
    additional hook but that should be about it.

(4) The MAP_DENYWRITE flag for mmap() has been deprecated a long time
    ago so while we do protect the main executable the bigger portion of
    the things you'd think need protecting such as the shared libraries
    aren't. IOW, we let anyone happily overwrite shared libraries.

(5) We removed all remaining uses of VM_DENYWRITE in [2]. That means:
    (5.1) We removed the legacy uselib() protection for preventing
          overwriting of shared libraries. Nobody cared in 3 years.
    (5.2) We allow write access to the elf interpreter after exec
          completed treating it on a par with shared libraries.

Yes, someone in userspace could potentially be relying on this. It's not
completely out of the realm of possibility but let's find out if that's
actually the case and not guess.

Link: golang/go#22315 [1]
Link: 49624ef ("Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux") [2]
Link: https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736 [3]
Link: https://lwn.net/Articles/866493
Link: golang/go#22220
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/buildid.go#L724
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/exec.go#L1493
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/script/cmds.go#L457
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/test/test.go#L1557
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/os/exec/lp_linux_test.go#L61
Link: buildkite/agent#2736
Link: rust-lang/rust#114554
Link: https://bugs.openjdk.org/browse/JDK-8068370
Link: dotnet/runtime#58964
Link: https://lore.kernel.org/r/20240531-vfs-i_writecount-v1-1-a17bea7ee36b@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
brauner added a commit to brauner/linux that referenced this issue Jun 10, 2024
Back in 2021 we already discussed removing deny_write_access() for
executables. Back then I was hesistant because I thought that this might
cause issues in userspace. But even back then I had started taking some
notes on what could potentially depend on this and I didn't come up with
a lot so I've changed my mind and I would like to try this.

Here are some of the notes that I took:

(1) The deny_write_access() mechanism is causing really pointless issues
    such as [1]. If a thread in a thread-group opens a file writable,
    then writes some stuff, then closing the file descriptor and then
    calling execve() they can fail the execve() with ETXTBUSY because
    another thread in the thread-group could have concurrently called
    fork(). Multi-threaded libraries such as go suffer from this.

(2) There are userspace attacks that rely on overwriting the binary of a
    running process. These attacks are _mitigated_ but _not at all
    prevented_ from ocurring by the deny_write_access() mechanism.

    I'll go over some details. The clearest example of such attacks was
    the attack against runC in CVE-2019-5736 (cf. [3]).

    An attack could compromise the runC host binary from inside a
    _privileged_ runC container. The malicious binary could then be used
    to take over the host.

    (It is crucial to note that this attack is _not_ possible with
     unprivileged containers. IOW, the setup here is already insecure.)

    The attack can be made when attaching to a running container or when
    starting a container running a specially crafted image. For example,
    when runC attaches to a container the attacker can trick it into
    executing itself.

    This could be done by replacing the target binary inside the
    container with a custom binary pointing back at the runC binary
    itself. As an example, if the target binary was /bin/bash, this
    could be replaced with an executable script specifying the
    interpreter path #!/proc/self/exe.

    As such when /bin/bash is executed inside the container, instead the
    target of /proc/self/exe will be executed. That magic link will
    point to the runc binary on the host. The attacker can then proceed
    to write to the target of /proc/self/exe to try and overwrite the
    runC binary on the host.

    However, this will not succeed because of deny_write_access(). Now,
    one might think that this would prevent the attack but it doesn't.

    To overcome this, the attacker has multiple ways:
    * Open a file descriptor to /proc/self/exe using the O_PATH flag and
      then proceed to reopen the binary as O_WRONLY through
      /proc/self/fd/<nr> and try to write to it in a busy loop from a
      separate process. Ultimately it will succeed when the runC binary
      exits. After this the runC binary is compromised and can be used
      to attack other containers or the host itself.
    * Use a malicious shared library annotating a function in there with
      the constructor attribute making the malicious function run as an
      initializor. The malicious library will then open /proc/self/exe
      for creating a new entry under /proc/self/fd/<nr>. It'll then call
      exec to a) force runC to exit and b) hand the file descriptor off
      to a program that then reopens /proc/self/fd/<nr> for writing
      (which is now possible because runC has exited) and overwriting
      that binary.

    To sum up: the deny_write_access() mechanism doesn't prevent such
    attacks in insecure setups. It just makes them minimally harder.
    That's all.

    The only way back then to prevent this is to create a temporary copy
    of the calling binary itself when it starts or attaches to
    containers. So what I did back then for LXC (and Aleksa for runC)
    was to create an anonymous, in-memory file using the memfd_create()
    system call and to copy itself into the temporary in-memory file,
    which is then sealed to prevent further modifications. This sealed,
    in-memory file copy is then executed instead of the original on-disk
    binary.

    Any compromising write operations from a privileged container to the
    host binary will then write to the temporary in-memory binary and
    not to the host binary on-disk, preserving the integrity of the host
    binary. Also as the temporary, in-memory binary is sealed, writes to
    this will also fail.

    The point is that deny_write_access() is uselss to prevent these
    attacks.

(3) Denying write access to an inode because it's currently used in an
    exec path could easily be done on an LSM level. It might need an
    additional hook but that should be about it.

(4) The MAP_DENYWRITE flag for mmap() has been deprecated a long time
    ago so while we do protect the main executable the bigger portion of
    the things you'd think need protecting such as the shared libraries
    aren't. IOW, we let anyone happily overwrite shared libraries.

(5) We removed all remaining uses of VM_DENYWRITE in [2]. That means:
    (5.1) We removed the legacy uselib() protection for preventing
          overwriting of shared libraries. Nobody cared in 3 years.
    (5.2) We allow write access to the elf interpreter after exec
          completed treating it on a par with shared libraries.

Yes, someone in userspace could potentially be relying on this. It's not
completely out of the realm of possibility but let's find out if that's
actually the case and not guess.

Link: golang/go#22315 [1]
Link: 49624ef ("Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux") [2]
Link: https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736 [3]
Link: https://lwn.net/Articles/866493
Link: golang/go#22220
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/buildid.go#L724
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/exec.go#L1493
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/script/cmds.go#L457
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/test/test.go#L1557
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/os/exec/lp_linux_test.go#L61
Link: buildkite/agent#2736
Link: rust-lang/rust#114554
Link: https://bugs.openjdk.org/browse/JDK-8068370
Link: dotnet/runtime#58964
Link: https://lore.kernel.org/r/20240531-vfs-i_writecount-v1-1-a17bea7ee36b@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
josefbacik pushed a commit to josefbacik/linux that referenced this issue Jul 12, 2024
Back in 2021 we already discussed removing deny_write_access() for
executables. Back then I was hesistant because I thought that this might
cause issues in userspace. But even back then I had started taking some
notes on what could potentially depend on this and I didn't come up with
a lot so I've changed my mind and I would like to try this.

Here are some of the notes that I took:

(1) The deny_write_access() mechanism is causing really pointless issues
    such as [1]. If a thread in a thread-group opens a file writable,
    then writes some stuff, then closing the file descriptor and then
    calling execve() they can fail the execve() with ETXTBUSY because
    another thread in the thread-group could have concurrently called
    fork(). Multi-threaded libraries such as go suffer from this.

(2) There are userspace attacks that rely on overwriting the binary of a
    running process. These attacks are _mitigated_ but _not at all
    prevented_ from ocurring by the deny_write_access() mechanism.

    I'll go over some details. The clearest example of such attacks was
    the attack against runC in CVE-2019-5736 (cf. [3]).

    An attack could compromise the runC host binary from inside a
    _privileged_ runC container. The malicious binary could then be used
    to take over the host.

    (It is crucial to note that this attack is _not_ possible with
     unprivileged containers. IOW, the setup here is already insecure.)

    The attack can be made when attaching to a running container or when
    starting a container running a specially crafted image. For example,
    when runC attaches to a container the attacker can trick it into
    executing itself.

    This could be done by replacing the target binary inside the
    container with a custom binary pointing back at the runC binary
    itself. As an example, if the target binary was /bin/bash, this
    could be replaced with an executable script specifying the
    interpreter path #!/proc/self/exe.

    As such when /bin/bash is executed inside the container, instead the
    target of /proc/self/exe will be executed. That magic link will
    point to the runc binary on the host. The attacker can then proceed
    to write to the target of /proc/self/exe to try and overwrite the
    runC binary on the host.

    However, this will not succeed because of deny_write_access(). Now,
    one might think that this would prevent the attack but it doesn't.

    To overcome this, the attacker has multiple ways:
    * Open a file descriptor to /proc/self/exe using the O_PATH flag and
      then proceed to reopen the binary as O_WRONLY through
      /proc/self/fd/<nr> and try to write to it in a busy loop from a
      separate process. Ultimately it will succeed when the runC binary
      exits. After this the runC binary is compromised and can be used
      to attack other containers or the host itself.
    * Use a malicious shared library annotating a function in there with
      the constructor attribute making the malicious function run as an
      initializor. The malicious library will then open /proc/self/exe
      for creating a new entry under /proc/self/fd/<nr>. It'll then call
      exec to a) force runC to exit and b) hand the file descriptor off
      to a program that then reopens /proc/self/fd/<nr> for writing
      (which is now possible because runC has exited) and overwriting
      that binary.

    To sum up: the deny_write_access() mechanism doesn't prevent such
    attacks in insecure setups. It just makes them minimally harder.
    That's all.

    The only way back then to prevent this is to create a temporary copy
    of the calling binary itself when it starts or attaches to
    containers. So what I did back then for LXC (and Aleksa for runC)
    was to create an anonymous, in-memory file using the memfd_create()
    system call and to copy itself into the temporary in-memory file,
    which is then sealed to prevent further modifications. This sealed,
    in-memory file copy is then executed instead of the original on-disk
    binary.

    Any compromising write operations from a privileged container to the
    host binary will then write to the temporary in-memory binary and
    not to the host binary on-disk, preserving the integrity of the host
    binary. Also as the temporary, in-memory binary is sealed, writes to
    this will also fail.

    The point is that deny_write_access() is uselss to prevent these
    attacks.

(3) Denying write access to an inode because it's currently used in an
    exec path could easily be done on an LSM level. It might need an
    additional hook but that should be about it.

(4) The MAP_DENYWRITE flag for mmap() has been deprecated a long time
    ago so while we do protect the main executable the bigger portion of
    the things you'd think need protecting such as the shared libraries
    aren't. IOW, we let anyone happily overwrite shared libraries.

(5) We removed all remaining uses of VM_DENYWRITE in [2]. That means:
    (5.1) We removed the legacy uselib() protection for preventing
          overwriting of shared libraries. Nobody cared in 3 years.
    (5.2) We allow write access to the elf interpreter after exec
          completed treating it on a par with shared libraries.

Yes, someone in userspace could potentially be relying on this. It's not
completely out of the realm of possibility but let's find out if that's
actually the case and not guess.

Link: golang/go#22315 [1]
Link: 49624ef ("Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux") [2]
Link: https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736 [3]
Link: https://lwn.net/Articles/866493
Link: golang/go#22220
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/buildid.go#L724
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/exec.go#L1493
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/script/cmds.go#L457
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/test/test.go#L1557
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/os/exec/lp_linux_test.go#L61
Link: buildkite/agent#2736
Link: rust-lang/rust#114554
Link: https://bugs.openjdk.org/browse/JDK-8068370
Link: dotnet/runtime#58964
Signed-off-by: Christian Brauner <brauner@kernel.org>
@nekto0n
Copy link

nekto0n commented Sep 18, 2024

Looks like this have been resolved in Linux 6.11?

facebook-github-bot pushed a commit to WhatsApp/erlang-language-platform that referenced this issue Feb 18, 2025
Summary:
A race condition can occur between a thread trying to use the eqWAlizer executable, and the main thread creating the executable, leading to a panic with "text file busy".

See rust-lang/rust#114554 for more details.

This mitigates the issue by attempting to start the eqWAlizer command in a loop, retrying if it fails with ETXTBSY.

Reviewed By: robertoaloi, TD5

Differential Revision: D69775408

fbshipit-source-id: 9f08258bdb762edbe9c75ed813b52cbb0fb303df
@davidlattimore
Copy link
Contributor

Looks like this have been resolved in Linux 6.11?

Sadly the fix was reverted in the next release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-process Area: `std::process` and `std::env` C-bug Category: This is a bug. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

8 participants