Firecracker Snapshots Support #448

Open: 9 commits to be merged into base: main

plamenmpetrov
Contributor

Hello, we have been working on supporting microVM snapshotting in firecracker-containerd, following its introduction in Firecracker. This PR adds new functions to the firecracker-containerd API that together form a complete working prototype for Firecracker snapshots. The prototype, however, contains workarounds for calls that are missing in the go-sdk. We also highlight a couple of issues on which we would like to hear your feedback.

We are open to feedback from the community and would be glad to engage in discussions to finalize and contribute this code to upstream.

Authored by @plamenmpetrov and @ustiugov

Summary

  1. We implement functionality for:

    • Pausing a microVM - PauseVM
    • Creating a snapshot of a microVM - CreateSnapshot
    • Resuming a microVM - ResumeVM
    • Loading a snapshot of a microVM - LoadSnapshot
    • “Offloading” a microVM, which frees up the resources occupied by the microVM - Offload

    We refer to these collectively as microVM snapshotting requests.

  2. The firecracker-go-sdk does not support microVM snapshotting as of now. As a result, we embedded the microVM snapshotting requests inside the runtime as HTTP requests. We use our own fork of the firecracker-go-sdk v0.21.0, in which we provide basic support for the new logging and metrics of the firecracker version that we use (see below). Without these changes in the firecracker-go-sdk, we observe an error in the containerd logs concerning the firecracker logging. This prevents us from seeing the firecracker logs and makes debugging difficult.

  3. We use the following firecracker version in our tests: firecracker.

API Extensions Description

We create an HTTP client upon creating a microVM or loading a microVM snapshot, which is used to send HTTP requests directly to the firecracker process for the respective microVM (contrary to using the firecracker-go-sdk).

ResumeVM, PauseVM and CreateSnapshot

ResumeVM, PauseVM and CreateSnapshot use the HTTP client to send the respective request to the firecracker process. The return code from firecracker is checked to verify that the operation was successful.

Note that CreateSnapshot does not pause the microVM, but assumes that it is paused. This is in line with the prerequisites for creating a microVM snapshot in firecracker.

Offload

Offload kills the firecracker process for the microVM with the respective ID (using SIGKILL) and deletes the firecracker process’ sock file and vsock file so the microVM can later be loaded. This functionality is implemented in the runtime.

In addition, Offload also kills the shim using SIGKILL, so that the resources can be freed up until/if the microVM is loaded in the future. We remove the functionality where the shim directory for the microVM is removed when the shim terminates. This is because in our use case we decide to store the guest memory file and the state file in the shim directory. We also remove the shim socket file and the firecracker shim socket file and recreate the sockets upon LoadSnapshot (see below). This functionality is implemented in the control plugin.

LoadSnapshot

Before doing anything else, the shim needs to be started for the microVM. We recreate the shim socket and the fccontrol shim socket, and start the shim binary. This functionality is implemented in the control plugin.

LoadSnapshot starts a firecracker process listening on the same API socket that the microVM was using before it was offloaded. The HTTP client is recreated and a load snapshot request is sent to the firecracker process. The return code from firecracker is checked to verify that the operation was successful. This functionality is implemented in the runtime.

Note that LoadSnapshot assumes that a tap with the exact same name, IP, and MAC as before the VM was offloaded exists. Currently, we recommend removing the tap after calling Offload and re-creating it before calling LoadSnapshot, because if these two calls are made back to back (as may happen in tests), a “Tap is busy” error occurs.

Limitations

  1. When calling LoadSnapshot immediately after Offload, we encounter an error that the shimSocket address is in use when trying to load the shim on LoadSnapshot. A workaround is to introduce a sleep of 10-100ms after Offload, depending on the system. This does not happen for the fccontrol shim socket.

     ERROR: VM with ID "3" already exists (socket: "/containerd-shim/53d9435747fdf335f1601ccebf98aa71b29f871fcdc68c595c22ca8b0a64d53d.sock")

  2. Calling StopVM on a microVM which has been loaded from a snapshot results in an error, because we lose connection to the agent running inside the microVM.

  3. Performance: in our experiments, re-creating a shim process takes about 30ms before the snapshot can be loaded in Firecracker. We have not yet investigated this issue. Our intuition is that shim start-up should not exceed 5-10ms, as is the case for starting a Firecracker process.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Signed-off-by: Plamen Petrov <[email protected]>
Notes:
1. Uses logging-only branch from ustiugov/firecracker-go-sdk
2. Firecracker logs path is hard-coded.

Signed-off-by: Plamen Petrov <[email protected]>
firecracker update

Signed-off-by: Plamen Petrov <[email protected]>
* Check that shim dir exists when loading shim
* No longer try to create shim dir when loading shim, as it must exist

Signed-off-by: Plamen Petrov <[email protected]>
@kzys
Contributor

kzys commented Sep 24, 2020

This is really cool. Thanks for all the work! I have discussed with @ustiugov a few months back regarding the snapshot support but haven't updated you since then. I apologize for the lack of communication from our side.

@ustiugov
Contributor

hi @kzys
sure thing! we are happy to contribute. We hope that this PR can establish the ground for finalizing snapshotting support together. Also, we did quite a few performance breakdown studies using this code and the boot-based baseline with different functions that we can contribute too once our submitted paper is published.

@kzys
Contributor

kzys commented Oct 5, 2020

For starting a micro VM, the proposed API asks clients to call:

  1. CreateVM to start a Firecracker process and boot Linux inside.
  2. Offload to kill the Firecracker process.
  3. LoadSnapshot to start a new Firecracker process again.

However, if a client is going to call LoadSnapshot, CreateVM doesn't have to load a guest OS. Is that the correct understanding?

I think it would be better to provide a way to create a brand-new VM from its snapshot, rather than "offloading" a booted VM. Could you explain the reasons regarding the design decision?

@ustiugov
Contributor

ustiugov commented Oct 8, 2020

@kzys Sorry for the late response.

We expect a workflow in which an idle VM is snapshotted and its resources are then freed with Offload (i.e., the VM is killed) on the same physical host. When a new request arrives for this VM, the orchestrator (i.e., the containerd client) can restore the VM into a newly created Firecracker process.

LoadSnapshot loads the VM state and resumes the VM from the exact point at which it was previously offloaded. Compared to StartVM, LoadSnapshot expects certain files to be present before loading the VM state (including the guest memory). These files include the disk image (the two disk drives, namely fc-dev-thinpool-snap-9 and ctrstub0 for the first started VM) and the tap device with the same exact characteristics.

As opposed to StopVM, Offload preserves the disk drives whereas the tap is recreated manually (by our custom orchestrator). To simplify the procedure, Offload does NOT remove the VM's shim folder, keeping all the files in place.

I hope these explanations make sense. We are open to feedback.

@kzys
Contributor

kzys commented Oct 8, 2020

Thanks! Now I can understand the assumption here, but I'd like to give more options to customers regarding how to keep micro VM's artifacts.

Moving micro VMs between multiple hosts would be beneficial for customers. For example, if a host has a hardware failure or needs a system update such as a Linux kernel upgrade, the customer may want to keep their micro VMs somewhere (e.g. cloud storage) and run them somewhere else.

While we don't want to integrate cloud storage clients directly into firecracker-containerd, we should make that possible for higher-level orchestrators such as Fargate.


Would you mind if I ask you to split this pull request into a few ones? Pause, Resume and CreateSnapshot are relatively straightforward. We will move some of the implementation from here to the SDK, but the proposed APIs look good to me. The rest may need more design discussions upfront.

@ustiugov
Contributor

ustiugov commented Oct 8, 2020

@kzys Thank you! Indeed, the scenario that you mentioned makes a lot of sense. We took the easier path, limiting our scenario to the same physical host, because it is still not clear to us what firecracker-containerd can assume about the disk state of VMs (both the data and the emulated devices). This is the only piece that we are missing here, and we would really like to hear your opinion/advice on it.

sure, we can split the PR. should we create the following PRs:

  1. PR1 with current implementation of Pause, Resume and CreateSnapshot. Should we merge this one first, and then you will move some parts of the code to the SDK?
  2. PR2 with everything else, namely LoadSnapshot and Offload. Then, we discuss how firecracker-containerd should work with the disk-related state.

What do you think?

@kzys
Contributor

kzys commented Oct 9, 2020

What do you mean by the disk state of VMs? It would be better to make fewer assumptions and let customers decide what they want to do.

Regarding the PRs, the split makes sense but I'd like to have Pause, Resume and CreateSnapshot on the SDK first. We are going to have the SDK's 0.22.0 release next week. Could you help us to have the APIs on the SDK after the release?

@ustiugov
Contributor

ustiugov commented Oct 9, 2020

@kzys I call the disk state the following: 1) the images pulled by firecracker-containerd; 2) the devmapper block devices.
Given the way firecracker-containerd currently manages the VM lifecycle, both must be present on the host machine before loading a VM from a snapshot. So, unless you plan on revisiting these design decisions, we need to find a way to reconstruct this state on the target host. I think that there should be a better way than migrating the complete contents of the block devices.

Yes, I think that we should be able to get Pause, Resume and CreateSnapshot to the SDK first. @plamenmpetrov is our leading developer here.

@plamenmpetrov
Contributor Author

@kzys Just to clarify, version 0.22.0 of the SDK does not seem to offer support for Pause, Resume, or CreateSnapshot. In that case, would you suggest that we wait for the SDK's 0.22.0 release and then submit a PR to the SDK implementing these calls? Once this is done, then we can use the SDK to implement the calls in firecracker-containerd.

@kzys
Contributor

kzys commented Oct 27, 2020

@plamenmpetrov Yes! We've finally released 0.22.0 and Firecracker has released 0.23.0. It would be awesome if you could port Pause, Resume and CreateSnapshot changes to the SDK.

@ustiugov
Contributor

@kzys we merged PauseVM, ResumeVM, and CreateSnapshot into the sdk. what should be our next steps?

@kzys
Contributor

kzys commented Nov 13, 2020

Thanks for the contribution. The next step would be using the new SDK APIs from firecracker-containerd.

While CreateSnapshot is probably complicated since we need to keep a container's root filesystem in addition to the microVM, Pause and Resume should be straightforward.

Can you make a new PR that exposes Pause and Resume from firecracker-containerd?

@ustiugov
Contributor

ustiugov commented Apr 2, 2021

@kzys are there any plans on finalizing the snapshots support? maybe, we could start discussing the next steps

@RoyceDavison RoyceDavison changed the base branch from master to main April 22, 2021 00:09
@haikyuu

haikyuu commented May 4, 2023

Any updates on the snapshots feature, now that loading snapshots is available in the Go SDK v1?
