[cells] PID not found errors when stopping running executables #534

Open
mccormickt opened this issue Nov 7, 2024 · 9 comments
Labels
Auraed The Aurae Daemon (gRPC Server) Bug

Comments

Contributor

mccormickt commented Nov 7, 2024

Attempting to stop a running executable seems to have the following behavior on my (x86_64 Ubuntu 24.04) system:

  • sh -c <executable> process is created and has its PID tracked in the Executables cache.
  • Child <executable> process is not tracked.
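  • An executable stop command is issued, and Auraed fails with a "PID not found" error.
  • The parent shell process is killed or missing, while the child executable process is left running as an orphan (ps shows it in state S, so it is not strictly a zombie).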
$ ps aux | grep "tail -f"
root     4095196  0.0  0.0   8320  1792 ?        S    09:34   0:00 tail -f /dev/null
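
One possible way to make the tracked PID belong to the executable itself is to have the shell `exec` the command instead of forking it. The sketch below is only an illustration using `std::process`; `spawn_exec` is a hypothetical helper, not auraed's API, and it assumes the executable is launched through `sh -c` as described above.

```rust
use std::io;
use std::process::{Child, Command, Stdio};

/// Spawn `command` through `sh -c`, prefixing it with `exec` so the shell
/// replaces itself with the target program instead of forking a child.
/// The PID reported by `Child::id()` is then the executable's own PID.
///
/// Hypothetical helper, not auraed's API. Only suitable for a single simple
/// command: `exec` does not compose with pipelines or `a && b` lists.
pub fn spawn_exec(command: &str) -> io::Result<Child> {
    Command::new("sh")
        .arg("-c")
        .arg(format!("exec {}", command))
        .stdin(Stdio::null())
        .spawn()
}
```

With this shape, killing the tracked PID signals the executable directly, so no separate untracked child is left behind. Commands that need shell features beyond a single program would need a different approach, such as signalling a whole process group.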

Testing

Rust

Rust tests, configuring new remote client for nested auraed

#[test_helpers_macros::shared_runtime_test]
async fn cells_start_stop_delete() {
    skip_if_not_root!("cells_start_stop_delete");
    skip_if_seccomp!("cells_start_stop_delete");

    let client = common::auraed_client().await;

    // Allocate a cell
    let cell_name = retry!(
        client
            .allocate(
                common::cells::CellServiceAllocateRequestBuilder::new().build()
            )
            .await
    )
    .unwrap()
    .into_inner()
    .cell_name;

    // Start the executable
    let req = common::cells::CellServiceStartRequestBuilder::new()
        .cell_name(cell_name.clone())
        .executable_name("aurae-exe".to_string())
        .build();
    let _ = retry!(client.start(req.clone()).await).unwrap().into_inner();

    // Stop the executable
    let _ = retry!(
        client
            .stop(proto::cells::CellServiceStopRequest {
                cell_name: Some(cell_name.clone()),
                executable_name: "aurae-exe".to_string(),
            })
            .await
    )
    .unwrap();

    // Delete the cell
    let _ = retry!(
        client
            .free(proto::cells::CellServiceFreeRequest {
                cell_name: cell_name.clone()
            })
            .await
    )
    .unwrap();
}
sudo -E cargo test -p auraed --test vms_start_must_start_vm_with_auraed -- --include-ignored
[...snip...]
2024-11-07T01:30:08.068934Z  INFO start: auraed::cells::cell_service::cell_service: CellService: start() executable=ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "sleep 400", description: "description" } request=ValidatedCellServiceStartRequest { cell_name: None, executable: ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "sleep 400", description: "description" }, uid: None, gid: None }
2024-11-07T01:30:08.069353Z  INFO start: auraed::observe::observe_service: Registering channel for pid 1668303 Stdout request=ValidatedCellServiceStartRequest { cell_name: None, executable: ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "sleep 400", description: "description" }, uid: None, gid: None }
2024-11-07T01:30:08.069445Z  INFO start: auraed::observe::observe_service: Registering channel for pid 1668303 Stderr request=ValidatedCellServiceStartRequest { cell_name: None, executable: ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "sleep 400", description: "description" }, uid: None, gid: None }
2024-11-07T01:30:08.103119Z  INFO stop: auraed::cells::cell_service::cell_service: CellService: stop() executable_name=ExecutableName("aurae-exe") request=ValidatedCellServiceStopRequest { cell_name: None, executable_name: ExecutableName("aurae-exe") }
2024-11-07T01:30:08.103377Z ERROR stop: auraed::cells::cell_service::error: executable 'aurae-exe' failed to stop: No child processes (os error 10) request=ValidatedCellServiceStopRequest { cell_name: None, executable_name: ExecutableName("aurae-exe") }
thread 'cells_start_stop_delete' panicked at auraed/tests/cell_list_must_list_allocated_cells_recursively.rs:172:6:
called `Result::unwrap()` on an `Err` value: Status { code: Internal, message: "executable 'aurae-exe' failed to stop: No child processes (os error 10)", metadata: MetadataMap { headers: {"content-type": "application/grpc", "content-length": "0", "date": "Thu, 07 Nov 2024 01:30:08 GMT"} }, source: None }

Manually with aer and cloud-hypervisor

Install cloud-hypervisor and build guest image/kernel

sudo make /opt/aurae/cloud-hypervisor/cloud-hypervisor
sudo make build-guest-kernel
sudo make prepare-image

Run cloud-hypervisor with the auraed pid1 image

sudo cloud-hypervisor --kernel /var/lib/aurae/vm/kernel/vmlinux.bin \
--disk path=/var/lib/aurae/vm/image/disk.raw \
--cmdline "console=hvc0 root=/dev/vda1 rw" \
--cpus boot=4 \
--memory size=4096M \
--net "tap=tap0,mac=aa:ae:00:00:00:01,id=eth0"

Retrieve zone ID from tap0 (13 in my case):

ip link show tap0                                               
13: tap0: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 06:66:42:a8:3f:e1 brd ff:ff:ff:ff:ff:ff

Configure aurae client config in ~/.aurae/config:

[system]
socket = "[fe80::2%13]:8080"
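
The `%13` suffix is the zone ID from `ip link show tap0`; a link-local `fe80::` address is ambiguous without it because the same prefix exists on every interface. As a quick sanity check of the address format, the sketch below parses such a string with Rust's standard library. `parse_scoped` is a hypothetical helper for illustration only and says nothing about how the aurae client parses its config.

```rust
use std::net::{AddrParseError, SocketAddrV6};

/// Parse a bracketed IPv6 socket address that carries a numeric zone ID,
/// e.g. "[fe80::2%13]:8080". Hypothetical helper for illustration.
pub fn parse_scoped(addr: &str) -> Result<SocketAddrV6, AddrParseError> {
    addr.parse::<SocketAddrV6>()
}
```

The parsed value exposes the zone through `scope_id()` and the port through `port()`, which makes it easy to confirm that the index in the config matches the tap interface.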

Verify cells run:

aer cell allocate sleeper
aer cell start --executable-command "sleep 9000" sleeper sleep-forever
aer cell list
aer cell stop sleeper sleep-forever
aer cell free sleeper

dmah42 added the Auraed and Bug labels Nov 10, 2024

Contributor

dmah42 commented Nov 11, 2024

i'm going to see if i can create a cell service level test for this use case.

Contributor

dmah42 commented Nov 11, 2024

i have a test that should be testing this scenario, and it works. note though that it's not using nested cells, so i suspect this is where the issue is.

#535 is the current draft PR.

next step is to change it to use nested cells instead and see if it starts failing :)

Contributor

dmah42 commented Nov 13, 2024

well well well.

2024-11-13T11:02:09.173289Z ERROR start_in_cell: auraed::cells::cell_service::error: cgroup 'ae-test-aab9ac5e-a042-4f05-a7e1-6a0f1ecf70ec' exists on host, but is not controlled by auraed cell_name=CellName("ae-test-aab9ac5e-a042-4f05-a7e1-6a0f1ecf70ec") request=CellServiceStartRequest { cell_name: None, executable: Some(Executable { name: "ae-exec-f94a9213-518d-40b6-8b66-71a1a67d0f03", command: "tail -f /dev/null", description: "" }), uid: None, gid: None }
11:02:09 [ERROR] failed to start in cell: status: FailedPrecondition, message: "cgroup 'ae-test-aab9ac5e-a042-4f05-a7e1-6a0f1ecf70ec' exists on host, but is not controlled by auraed", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Wed, 13 Nov 2024 11:02:09 GMT", "content-length": "0"} }
error: test failed, to rerun pass `-p auraed --lib`

Contributor

dmah42 commented Nov 13, 2024

something very odd is going on with the cell cache. i confirmed that we're inserting into the cache on allocate, but when we try to get the cell back out of the cache it isn't there, but the cgroup exists.

out of time for debugging for now but i'll keep hacking on this later.

Contributor

dmah42 commented Nov 13, 2024

confirmed the cell name is a key in the cache at the moment we call self.cache.get, but this call is returning None.

Contributor

dmah42 commented Nov 13, 2024

leaving this here as a note to myself:

allocated ae-test-start-stop-in-cell
getting ae-test-start-stop-in-cell from cache
get cell ae-test-start-stop-in-cell
cgroup ae-test-start-stop-in-cell exists
cache size: 1
  CellName("ae-test-start-stop-in-cell")
    MATCH
2024-11-13T13:22:37.869166Z ERROR start_in_cell: auraed::cells::cell_service::cells::cells: get cell ae-test-start-stop-in-cell: cell not in cache cell_name=CellName("ae-test-start-stop-in-cell") request=CellServiceStartRequest { cell_name: None, executable: Some(Executable { name: "ae-exec-start-stop-in-cell", command: "tail -f /dev/null", description: "" }), uid: None, gid: None }
2024-11-13T13:22:37.869241Z ERROR start_in_cell: auraed::cells::cell_service::error: cgroup 'ae-test-start-stop-in-cell' exists on host, but is not controlled by auraed cell_name=CellName("ae-test-start-stop-in-cell") request=CellServiceStartRequest { cell_name: None, executable: Some(Executable { name: "ae-exec-start-stop-in-cell", command: "tail -f /dev/null", description: "" }), uid: None, gid: None }

Contributor

dmah42 commented Nov 13, 2024

i think the issue is somewhere between how we "start in cell" and how we "proxy if needed". i'm debating stripping out a lot of the complexity here as i'm not sure it's necessary.

Contributor

dmah42 commented Nov 18, 2024

ok all of this is a red herring caused by things running in parallel. the actual bug is in the executables cache: when we stop, we're returning an error in a non-error case (by the look of it). looks like maybe a bad merge. i'm on it now :)

Contributor

dmah42 commented Nov 19, 2024

i've narrowed this down to Executable::kill. i thought maybe it was because we are calling kill (which waits) and then wait to get the exit status, but replacing the kill with start_kill (which doesn't wait) doesn't resolve the issue.

the error "No child processes (os error 10)" that i'm seeing is coming from the child exiting. however, i can't see why this error is being reported.
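
For reference, "os error 10" on Linux is ECHILD, which `wait` returns when the process has no child left to wait for, for example because it was already reaped elsewhere. The sketch below shows one way a stop path could tolerate that case. It uses `std::process` rather than the async `kill`/`start_kill` API mentioned above, `tolerate_echild` and `stop` are hypothetical names, and whether treating ECHILD as "already reaped" is the right fix for auraed is an assumption, not something this thread confirms.

```rust
use std::io;
use std::process::{Child, ExitStatus};

/// Linux errno for "No child processes" (ECHILD), i.e. "os error 10".
const ECHILD: i32 = 10;

/// Map the result of a `wait` call so that ECHILD means "already reaped"
/// (`Ok(None)`) instead of a failure. All other errors pass through.
pub fn tolerate_echild(
    res: io::Result<ExitStatus>,
) -> io::Result<Option<ExitStatus>> {
    match res {
        Ok(status) => Ok(Some(status)),
        Err(e) if e.raw_os_error() == Some(ECHILD) => Ok(None),
        Err(e) => Err(e),
    }
}

/// Kill a child and collect its exit status, tolerating ECHILD.
/// In std, `Child::kill` only sends SIGKILL; `wait` does the reaping.
pub fn stop(child: &mut Child) -> io::Result<Option<ExitStatus>> {
    child.kill()?;
    tolerate_echild(child.wait())
}
```

Returning `Ok(None)` keeps "stopped, but the exit status is unknown" distinct from a real failure, so a caller can still remove the executable from its cache without reporting an error.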
