vm-builder: Run all processes in a cgroup
More specifically, with this change:

1. There is a new cgroup inside the VM, /neonvm-root
2. There are associated cgroup & mount namespaces such that when inside
   those namespaces, /neonvm-root appears to be the root cgroup
3. All 'command's from the image spec are run inside the /neonvm-root
   cgroup and associated namespaces

In order to support the creation of new cgroups inside /neonvm-root
(like what's used in the test images, or in neondatabase/neon), we have
to run normal processes in a separate /neonvm-root/leaf cgroup so that
we don't violate the "no internal processes" rule.[1]

But other than that, most things should continue to behave as normal.
For example, cgconfigparser (which we use to set up user cgroups)
appears to continue to work just fine.

Also worth noting: Currently sshd also runs in the same cgroup &
namespaces as everything else, so that it continues to match the
perspective of processes running inside the VM.

When running as root, it's possible to break out of the namespaces by
entering PID 1's namespaces:

  nsenter --target=1 --cgroup --mount <command...>

[1]: https://man7.org/linux/man-pages/man7/cgroups.7.html
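
The constraint behind the separate leaf cgroup can be illustrated without touching a real cgroup hierarchy. What follows is a minimal sketch, not part of this commit: `has_member_processes` and `can_delegate_to_children` are hypothetical helper names, and the check assumes only that a cgroup directory exposes a `cgroup.procs` file listing its member PIDs, as described in cgroups(7). It reads the file's contents rather than its size, since files under /sys/fs/cgroup may report a size of zero even when they have content.

```shell
#!/bin/sh
# Hypothetical helpers (not part of this commit).

# Succeeds if the cgroup directory "$1" lists at least one member PID.
has_member_processes() {
    [ -n "$(cat "$1/cgroup.procs" 2>/dev/null)" ]
}

# Succeeds if "$1" has no member processes, i.e. enabling controllers for
# its children would not run into the "no internal processes" rule.
can_delegate_to_children() {
    if has_member_processes "$1"; then
        echo "no: $1 has member processes; move them into a leaf cgroup first"
        return 1
    fi
    echo "yes: $1 has no member processes"
}

# Try it on a scratch directory that mimics the layout from this commit.
scratch="$(mktemp -d)"
mkdir -p "$scratch/neonvm-root/leaf"
: > "$scratch/neonvm-root/cgroup.procs"               # no processes here
echo 123 > "$scratch/neonvm-root/leaf/cgroup.procs"   # processes live in leaf

can_delegate_to_children "$scratch/neonvm-root" || true
can_delegate_to_children "$scratch/neonvm-root/leaf" || true
rm -rf "$scratch"
```

On a real VM the same helpers could be pointed at /sys/fs/cgroup/neonvm-root, where the expectation is that only the leaf cgroup (and any user-created children) hold processes.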
sharnoff committed Sep 25, 2024
1 parent 3e6b8ff commit e533771
Showing 7 changed files with 117 additions and 17 deletions.
10 changes: 10 additions & 0 deletions neonvm/tools/vm-builder/files/Dockerfile.img
@@ -19,12 +19,17 @@ RUN set -e \
COPY helper.move-bins.sh /helper.move-bins.sh

# add udevd and agetty (with shared libs)
#
# We need unshare and nsenter from util-linux-misc because busybox's implementations don't have
# support for cgroup namespaces (at least, master as of 2024-08-11).
RUN set -e \
&& apk add --no-cache --no-progress --quiet \
acpid \
udev \
agetty \
su-exec \
util-linux-misc \
cgroup-tools \
e2fsprogs-extra \
blkid \
flock \
@@ -35,6 +40,8 @@ RUN set -e \
udevadm \
agetty \
su-exec \
unshare nsenter \
cgexec \
resize2fs \
blkid \
flock \
@@ -77,6 +84,9 @@ COPY sshd_config /neonvm/config/sshd_config
RUN chmod +rx /neonvm/bin/vminit /neonvm/bin/vmstart /neonvm/bin/vmshutdown
COPY udev-init.sh /neonvm/bin/udev-init.sh
RUN chmod +rx /neonvm/bin/udev-init.sh
COPY cg-setup.sh /neonvm/bin/cg-setup.sh
COPY cg-run.sh /neonvm/bin/cg-run.sh
RUN chmod +rx /neonvm/bin/cg-setup.sh /neonvm/bin/cg-run.sh
COPY resize-swap.sh /neonvm/bin/resize-swap
RUN chmod +rx /neonvm/bin/resize-swap
COPY set-disk-quota.sh /neonvm/bin/set-disk-quota
16 changes: 16 additions & 0 deletions neonvm/tools/vm-builder/files/cg-run.sh
@@ -0,0 +1,16 @@
#!/neonvm/bin/sh
#
# Helper script to run a program in the root cgroup (+ cgroup namespace).
# This is automatically used to run user-provided programs so that they're transparently included in
# a cgroup that we have control over (and can limit the CPU of, for fractional CPU support).
#
# USAGE: /neonvm/bin/cg-run.sh <COMMAND...>

set -eux

# cgexec ... - run in the neonvm-root/leaf cgroup
# nsenter ... - run in the cgroup namespace
# "$@" - the command we were asked to run
exec /neonvm/bin/cgexec -g cpu,memory:neonvm-root/leaf \
/neonvm/bin/nsenter --cgroup=/tmp/neonvm-user-namespace/cgroup --mount=/tmp/neonvm-user-namespace/mnt \
"$@"
75 changes: 75 additions & 0 deletions neonvm/tools/vm-builder/files/cg-setup.sh
@@ -0,0 +1,75 @@
#!/neonvm/bin/sh
#
# Helper script to set up the root cgroup and cgroup namespace for CPU limiting.
#
# USAGE: /neonvm/bin/cg-setup.sh

set -eu

# enable controllers
echo '+cpu +memory' > /sys/fs/cgroup/cgroup.subtree_control

# create a new cgroup called 'neonvm-root'
mkdir /sys/fs/cgroup/neonvm-root

# create a directory and files in tmp to bind mount a new namespace -- otherwise the namespace will
# be removed once all processes exit. We could alternately keep a long-running process in the
# namespace, but it seemed easier to go this route.
mkdir -m 0600 /tmp/neonvm-user-namespace # 0600 to prevent non-root access
touch /tmp/neonvm-user-namespace/cgroup /tmp/neonvm-user-namespace/mnt

# We now need to:
# 1. enter the cgroup and create a fresh cgroup AND mount namespace
# 2. remount /sys/fs/cgroup so that we only have access to the child cgroup from within the mount
# namespace (via bind to overwrite it)
# 3. OUTSIDE the namespace, we need to bind mount the cgroup and mount namespaces so they're
# persisted.
# 4. Allow the process in the namespace to exit

mkfifo -m 0600 /tmp/neonvm-cgsetup-childpid.pipe
mkfifo -m 0600 /tmp/neonvm-cgsetup-nsdone.pipe

# In the background, wait for the child PID to be known.
#
# We *could* run the cgexec + unshare in the background instead and have one less child, but it's
# MUCH easier to debug if that's in the foreground.
sh -c '
child_pid="$(cat /tmp/neonvm-cgsetup-childpid.pipe)"
# persist the child namespaces by bind mounting them
mount --bind /proc/$child_pid/ns/cgroup /tmp/neonvm-user-namespace/cgroup
mount --bind /proc/$child_pid/ns/mnt /tmp/neonvm-user-namespace/mnt
echo "" >> /tmp/neonvm-cgsetup-nsdone.pipe
' &

# 'cgexec ... neonvm-root' - enter the 'neonvm-root' cgroup
# 'unshare --cgroup --mount' - create new cgroup and mount namespaces
# - at this point, /sys/fs/cgroup still looks the same, although /proc/self/cgroup says we're at
# the root (even though we're in 'neonvm-root')
# 'mount ... /sys/fs/cgroup' - restrict what's visible in /sys/fs/cgroup to just the 'neonvm-root' cgroup
cgexec -g cpu,memory:neonvm-root unshare --cgroup --mount sh -c '
echo $$ >> /tmp/neonvm-cgsetup-childpid.pipe
umount /sys/fs/cgroup
mount -t cgroup2 cgroup2 /sys/fs/cgroup
# wait for namespace binding to finish
cat /tmp/neonvm-cgsetup-nsdone.pipe
'

# done with the pipes, can get rid of them.
rm /tmp/neonvm-cgsetup-childpid.pipe /tmp/neonvm-cgsetup-nsdone.pipe

# The default cgroup will be neonvm-root/leaf to allow creation of other cgroups inside neonvm-root,
# due to the "no internal processes" rule that prevents having processes inside a cgroup when
# cgroup.subtree_control is not empty.
mkdir /sys/fs/cgroup/neonvm-root/leaf
echo "+cpu +memory" > /sys/fs/cgroup/neonvm-root/cgroup.subtree_control

# Allow all users to move processes to/from the root cgroup.
#
# This is required in order to be able to 'cgexec' anything, if the entrypoint is not being run as
# root, because moving tasks between one cgroup and another *requires write access to the
# cgroup.procs file of the common ancestor*, and because the entrypoint isn't already in a cgroup,
# any new tasks are automatically placed in the top-level cgroup.
#
# This *would* be bad for security, if we relied on cgroups for security; but instead because they
# are just used for cooperative signaling, this should be mostly ok.
chmod go+w /sys/fs/cgroup/neonvm-root/cgroup.procs
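
The two-FIFO handshake that cg-setup.sh uses to persist the namespaces can be tried in isolation, without root and without any namespaces. The following is a minimal sketch under that simplification: the scratch directory, the pipe names and the `seen-pid` file are made up for illustration, and the bind mounts are replaced by simply recording the PID.

```shell
#!/bin/sh
set -eu

work="$(mktemp -d)"
mkfifo -m 0600 "$work/childpid.pipe" "$work/done.pipe"

# Background helper: block until the foreground process reports its PID,
# act on it (cg-setup.sh bind mounts /proc/$child_pid/ns/* at this point),
# then signal completion through the second pipe.
(
    child_pid="$(cat "$work/childpid.pipe")"
    echo "$child_pid" > "$work/seen-pid"
    echo "" > "$work/done.pipe"
) &

# Foreground process: report our PID, then stay alive until the helper
# has finished working with it.
sh -c '
    echo $$ > "$1/childpid.pipe"
    cat "$1/done.pipe" > /dev/null
' sh "$work"

wait
seen_pid="$(cat "$work/seen-pid")"
echo "helper observed child PID $seen_pid"
rm -rf "$work"
```

Opening a FIFO blocks until the other end is opened as well, which is what keeps the foreground process alive until the helper is done with it.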
8 changes: 4 additions & 4 deletions neonvm/tools/vm-builder/files/inittab
@@ -4,12 +4,12 @@
::respawn:/neonvm/bin/udevd
::wait:/neonvm/bin/udev-init.sh
::respawn:/neonvm/bin/acpid -f -c /neonvm/acpi
-::respawn:/neonvm/bin/vector -c /neonvm/config/vector.yaml --config-dir /etc/vector --color never
+::respawn:/neonvm/bin/cg-run.sh /neonvm/bin/vector -c /neonvm/config/vector.yaml --config-dir /etc/vector --color never
::respawn:/neonvm/bin/chronyd -n -f /neonvm/config/chrony.conf -l /var/log/chrony/chrony.log
-::respawn:/neonvm/bin/sshd -E /var/log/ssh.log -f /neonvm/config/sshd_config
-::respawn:/neonvm/bin/vmstart
+::respawn:/neonvm/bin/cg-run.sh /neonvm/bin/sshd -E /var/log/ssh.log -f /neonvm/config/sshd_config
+::respawn:/neonvm/bin/cg-run.sh /neonvm/bin/vmstart
{{ range .InittabCommands }}
-::{{.SysvInitAction}}:su -p {{.CommandUser}} -c {{.ShellEscapedCommand}}
+::{{.SysvInitAction}}:/neonvm/bin/cg-run.sh su -p {{.CommandUser}} -c {{.ShellEscapedCommand}}
{{ end }}
{{ end }}
ttyS0::respawn:/neonvm/bin/agetty --8bits --local-line --noissue --noclear --noreset --host console --login-program /neonvm/bin/login --login-pause --autologin root 115200 ttyS0 linux
::shutdown:/neonvm/bin/vmshutdown
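
One way to check which services ended up wrapped by cg-run.sh is to read their cgroup membership from /proc. The following is a minimal sketch, not part of this commit: `cgroup_v2_path` is a hypothetical helper that extracts the cgroup v2 entry (the line of the form `0::<path>`) from `/proc/<pid>/cgroup`-style input. The path is reported relative to the reader's cgroup namespace, so a wrapped process should appear under /neonvm-root/leaf when viewed from PID 1's namespace, and under /leaf when viewed from inside the new namespace.

```shell
#!/bin/sh
# Hypothetical helper (not part of this commit): print the cgroup v2 path
# from /proc/<pid>/cgroup-style input on stdin. The cgroup v2 entry is the
# line starting with "0::"; any cgroup v1 lines are ignored.
cgroup_v2_path() {
    sed -n 's/^0:://p'
}

# Example: show the cgroup of the current shell, if procfs is available.
if [ -r /proc/self/cgroup ]; then
    cgroup_v2_path < /proc/self/cgroup
fi
```

For a specific service, the input would instead be /proc/<pid>/cgroup for that service's PID.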
3 changes: 2 additions & 1 deletion neonvm/tools/vm-builder/files/resize-swap.sh
@@ -26,7 +26,8 @@ if [ ! -f /neonvm/runtime/resize-swap-internal.sh ]; then
exit 1
fi

-/neonvm/bin/sh /neonvm/runtime/resize-swap-internal.sh "$size"
+# we need to break out of the current mount namespace in order to make changes on the host.
+/neonvm/bin/nsenter --mount=/proc/1/ns/mnt /neonvm/bin/sh /neonvm/runtime/resize-swap-internal.sh "$size"
if [ "$once" = 'yes' ]; then
# remove *this* script so that it cannot be called again.
rm /neonvm/bin/resize-swap
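
Whether a process needs this kind of nsenter breakout can be determined by comparing its namespace identifiers with those of PID 1. The following is a minimal sketch, not part of this commit: `ns_id` and `in_same_ns` are hypothetical helpers built on the `/proc/<pid>/ns/<name>` symlinks. Reading another process's namespace links may require root, so an unreadable link is reported separately from a mismatch.

```shell
#!/bin/sh
# Hypothetical helpers (not part of this commit).

# Print the identifier of a namespace, e.g. "mnt:[4026531840]".
# usage: ns_id <pid|self> <ns-name>
ns_id() {
    readlink "/proc/$1/ns/$2" 2>/dev/null
}

# Returns 0 if both processes share the namespace, 1 if they differ,
# and 2 if either link could not be read.
# usage: in_same_ns <pid-a> <pid-b> <ns-name>
in_same_ns() {
    ns_a="$(ns_id "$1" "$3")" || return 2
    ns_b="$(ns_id "$2" "$3")" || return 2
    [ "$ns_a" = "$ns_b" ]
}

# Example: does this shell share PID 1's mount namespace?
in_same_ns self 1 mnt && rc=0 || rc=$?
case "$rc" in
    0) echo "same mount namespace as PID 1" ;;
    1) echo "different mount namespace; nsenter --mount=/proc/1/ns/mnt would be needed" ;;
    *) echo "could not read namespace links (not root, or no procfs)" ;;
esac
```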
16 changes: 4 additions & 12 deletions neonvm/tools/vm-builder/files/vminit
@@ -23,18 +23,7 @@ chmod 0755 /dev/pts
chmod 1777 /dev/shm
mount -t proc proc /proc
mount -t sysfs sysfs /sys
-mount -t cgroup2 cgroup2 /sys/fs/cgroup
-
-# Allow all users to move processes to/from the root cgroup.
-#
-# This is required in order to be able to 'cgexec' anything, if the entrypoint is not being run as
-# root, because moving tasks between one cgroup and another *requires write access to the
-# cgroup.procs file of the common ancestor*, and because the entrypoint isn't already in a cgroup,
-# any new tasks are automatically placed in the top-level cgroup.
-#
-# This *would* be bad for security, if we relied on cgroups for security; but instead because they
-# are just used for cooperative signaling, this should be mostly ok.
-chmod go+w /sys/fs/cgroup/cgroup.procs
+mount -t cgroup2 cgroup2 -o nosuid,nodev,noexec,nsdelegate /sys/fs/cgroup

mount -t devpts -o noexec,nosuid devpts /dev/pts
mount -t tmpfs -o noexec,nosuid,nodev shm-tmpfs /dev/shm
@@ -48,6 +37,9 @@ test -f /neonvm/runtime/mounts.sh && /neonvm/bin/sh /neonvm/runtime/mounts.sh
# set any user-supplied sysctl settings
test -f /neonvm/runtime/sysctl.conf && /neonvm/bin/sysctl -p /neonvm/runtime/sysctl.conf

# cgroups setup
/neonvm/bin/cg-setup.sh

# try resize filesystem
resize2fs /dev/vda
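
The cgroup2 mount in vminit now passes `nsdelegate` along with `nosuid,nodev,noexec`. Whether those options took effect can be checked from /proc/mounts. The following is a minimal sketch, not part of this commit: `cgroup2_mount_opts` and `has_opt` are hypothetical helpers, and the sketch assumes the usual /proc/mounts field order (device, mount point, filesystem type, options).

```shell
#!/bin/sh
# Hypothetical helpers (not part of this commit).

# Print the option list of the first cgroup2 mount found in
# /proc/mounts-style input on stdin.
cgroup2_mount_opts() {
    awk '$3 == "cgroup2" { print $4; exit }'
}

# Succeeds if the comma-separated option list "$1" contains option "$2".
has_opt() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: report on the live system, if procfs is available.
if [ -r /proc/mounts ]; then
    opts="$(cgroup2_mount_opts < /proc/mounts)"
    if has_opt "$opts" nsdelegate; then
        echo "cgroup2 is mounted with nsdelegate"
    else
        echo "cgroup2 is not mounted with nsdelegate (options: ${opts:-none})"
    fi
fi
```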

6 changes: 6 additions & 0 deletions neonvm/tools/vm-builder/main.go
@@ -46,6 +46,10 @@ var (
scriptVmInit string
//go:embed files/udev-init.sh
scriptUdevInit string
//go:embed files/cg-setup.sh
scriptCgSetup string
//go:embed files/cg-run.sh
scriptCgRun string
//go:embed files/resize-swap.sh
scriptResizeSwap string
//go:embed files/set-disk-quota.sh
@@ -336,6 +340,8 @@ func main() {
{"chrony.conf", configChrony},
{"sshd_config", configSshd},
{"udev-init.sh", scriptUdevInit},
{"cg-setup.sh", scriptCgSetup},
{"cg-run.sh", scriptCgRun},
{"resize-swap.sh", scriptResizeSwap},
{"set-disk-quota.sh", scriptSetDiskQuota},
}