
Scrub memory before platform-specific invalidation #506

Closed
peterfang wants to merge 6 commits from the tdp-validate-prange branch

Conversation

peterfang
Contributor

@peterfang commented Oct 30, 2024

Instead of invalidating memory, TDP platforms only need to zero out its content before handing it over to the L2 guest. This is not equivalent to page invalidation because the page remains accepted after zeroing.

Always scrub the memory region prior to performing platform-specific invalidation (which is a no-op on TDP platforms). This should have no functional impact on SNP platforms.

@msft-jlange
Contributor

Why is zeroing required at all when invalidating pages on TDP? The point of "invalidate" is to mark the pages as unusable prior to returning them to host control. But in the TDX architecture, there is no platform operation defined to do that. If the page is going to be used again, it must be accepted again. The next call to TDG.MEM.PAGE.ACCEPT would fail if the host declined to retake control of the page as expected, and would zero it if the page went through the correct flow, so there doesn't seem to be anything the L1 needs to do here.

@Freax13
Contributor

Freax13 commented Oct 30, 2024

My guess is that we do this because td_accept_memory (or td_accept_pages) doesn't fail if pages are already validated: https://docs.rs/tdx-tdcall/0.2.1/src/tdx_tdcall/tdx.rs.html#617-619. By ensuring that memory is zero-initialized before releasing it, we know that it's still zero-initialized when we try to accept already accepted memory.

Personally I'm not a big fan of this strategy. What happens if we try to accept memory that's already been accepted, but that also hasn't been zero-initialized yet? AFAICT we wouldn't get an error, but the memory wouldn't be zero-initialized.

Is there a reason we zero-initialize the memory inside the guest instead of letting the TD module do it?
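
To make that concern concrete, here is a hypothetical sketch (not the actual tdx-tdcall code; the function names are made up) of how swallowing the already-accepted status silently drops the "accepted memory is zeroed" guarantee:

```rust
#[derive(Debug, PartialEq)]
enum TdxError {
    PageAlreadyAccepted,
    Other(u64),
}

/// Stand-in for the raw TDG.MEM.PAGE.ACCEPT tdcall.
fn tdcall_accept_page(_gpa: u64) -> Result<(), TdxError> {
    unimplemented!("platform call")
}

/// Lenient wrapper: swallowing the status means a "successful" accept may hand
/// back a page that still holds stale, non-zero data.
fn accept_page_lenient(gpa: u64) -> Result<(), TdxError> {
    match tdcall_accept_page(gpa) {
        Err(TdxError::PageAlreadyAccepted) => Ok(()), // stale contents preserved!
        other => other,
    }
}

/// Strict wrapper: surfacing the status lets the caller treat re-acceptance as
/// fatal, or explicitly scrub the page before reuse.
fn accept_page_strict(gpa: u64) -> Result<(), TdxError> {
    tdcall_accept_page(gpa)
}
```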

@peterfang
Contributor Author

In SVSM zeroing is mainly done to avoid leaking L1 information to L2. Since L1 and L2 share the same GPA space and L1 steals some of L2's lowmem for its own boot flow (e.g. stage1/stage2/kernel ELF/kernel fs), it makes sense to zero these pages out as part of invalidate_early_boot_memory(). We currently don't manipulate L2's e820 table to exclude these lowmem pages because it introduces additional complexity in host QEMU and also doesn't play well with L2 BIOS. In other words, we give early boot memory to L2 once L1 is booted and starts operating in high GPA areas. No page re-acceptance is needed here. You were right that on TDP platforms the guest doesn't return a page to the host after acceptance except when it's returned implicitly during private->shared page conversions.

@msft-jlange
Contributor

Personally I'm not a big fan of this strategy. What happens if we try to accept memory that's already been accepted, but that also hasn't been zero-initialized yet? AFAICT we wouldn't get an error, but the memory wouldn't be zero-initialized.

I couldn't agree more. Ignoring TDX_PAGE_ALREADY_ACCEPTED will lead to fatal security errors. What if the page holds valid data, and there is a mistaken call to re-accept, and then the page is zeroed again? Any valid data already in the page will be corrupted, which is surely not even close to safe behavior.

Is there a reason we zero-initialize the memory inside the guest instead of letting the TD module do it?

I failed to refactor adequately. The zeroing should be in the platform-specific implementation of SvsmPlatform::validate_*_page_range. I can address that in a future PR.

@peterfang
Contributor Author

Personally I'm not a big fan of this strategy. What happens if we try to accept memory that's already been accepted, but that also hasn't been zero-initialized yet? AFAICT we wouldn't get an error, but the memory wouldn't be zero-initialized.

I couldn't agree more. Ignoring TDX_PAGE_ALREADY_ACCEPTED will lead to fatal security errors. What if the page holds valid data, and there is a mistaken call to re-accept, and then the page is zeroed again? Any valid data already in the page will be corrupted, which is surely not even close to safe behavior.

Is there a reason we zero-initialize the memory inside the guest instead of letting the TD module do it?

I failed to refactor adequately. The zeroing should be in the platform-specific implementation of SvsmPlatform::validate_*_page_range. I can address that in a future PR.

I think "page invalidation" probably means something slightly different on TDP. I look forward to discussing with you what that means on TDP platforms. I'm open to addressing the concerns around TDX_PAGE_ALREADY_ACCEPTED with the crate stakeholders if that's not a desirable/safe behavior.

@Freax13
Contributor

Freax13 commented Oct 31, 2024

Is there a reason we zero-initialize the memory inside the guest instead of letting the TD module do it?

I failed to refactor adequately. The zeroing should be in the platform-specific implementation of SvsmPlatform::validate_*_page_range. I can address that in a future PR.

Sorry, I think there's a slight misunderstanding here. I meant "TD module" as in the code running in SEAM root mode aka the code that's handling our TD calls aka the code signed by Intel, not "TD module" as in the module within the SVSM responsible for TDX stuff. Also, I just realized that it's actually called the TDX module, not the TD module, my bad. The zeroing already happens in a platform-specific module within the SVSM, but I was suggesting that we should rather let the TDG.MEM.PAGE.ACCEPT TD call handle the zeroing (of course this only works if we reject TDX_PAGE_ALREADY_ACCEPTED).

In SVSM zeroing is mainly done to avoid leaking L1 information to L2. Since L1 and L2 share the same GPA space and L1 steals some of L2's lowmem for its own boot flow (e.g. stage1/stage2/kernel ELF/kernel fs), it makes sense to zero these pages out as part of invalidate_early_boot_memory()

So far we understood "page invalidation" to refer to making memory inaccessible to the SVSM and guest. On SNP, this unsets the validate bit in the RMP, on TDP the equivalent would be to move the page back to the PENDING state. "page invalidation" does not refer to any actions required to prepare memory to be used by an L2 e.g. scrubbing out secrets by zeroing memory.

I think "page invalidation" probably means something slightly different on TDP. I look forward to discussing with you what that means on TDP platforms. I'm open to addressing the concerns around TDX_PAGE_ALREADY_ACCEPTED with the crate stakeholders if that's not a desirable/safe behavior.

I think "page invalidation" is slightly different on TDP because there's no TD call for a guest to request for a page to be moved back to the PENDING state. Still, the GHCI spec has a VMCALL, TDG.VP.VMCALL<MapGPA>, that the SVSM can use to ask the host to move memory into a shared state and remove private pages. Of course, the SVSM can't trust the host to correctly execute TDG.VP.VMCALL<MapGPA>, but if the host doesn't remove the private pages, we would get TDX_PAGE_ALREADY_ACCEPTED errors upon re-accepting the memory and that's something we can actually detect.

If we assume that well-behaved hypervisors always faithfully execute TDG.VP.VMCALL<MapGPA> by removing the private pages (I'm not sure if this assumption is correct), I think we can always treat TDX_PAGE_ALREADY_ACCEPTED as a fatal error. If this assumption is incorrect, I think that we should detect TDX_PAGE_ALREADY_ACCEPTED and bring the page into a known state by issuing TDG.MEM.PAGE.ATTR.WR TD calls to remove access bits for L2 VMs, flushing the EPT caches of L2 VMs using TDG.VP.INVEPT on all vCPUs, and zeroing the contents.
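
Roughly, the fallback described above could look like this (hypothetical wrapper names for the TD calls; only the control flow is the point):

```rust
#[derive(Debug)]
enum TdxError {
    PageAlreadyAccepted,
    Other(u64),
}

fn tdg_mem_page_accept(_gpa: u64) -> Result<(), TdxError> { unimplemented!() }
fn tdg_mem_page_attr_wr_remove_l2_access(_gpa: u64) -> Result<(), TdxError> { unimplemented!() }
fn tdg_vp_invept_all_vcpus() -> Result<(), TdxError> { unimplemented!() }
fn zero_page(_gpa: u64) { unimplemented!() }

/// Re-accept a page after asking the host to remove it; fall back to explicit
/// cleanup if the host left it accepted.
fn reaccept_page(gpa: u64) -> Result<(), TdxError> {
    match tdg_mem_page_accept(gpa) {
        // A freshly accepted page is zeroed by the TDX module.
        Ok(()) => Ok(()),
        Err(TdxError::PageAlreadyAccepted) => {
            // The host did not remove the private page: bring it into a known state.
            tdg_mem_page_attr_wr_remove_l2_access(gpa)?; // TDG.MEM.PAGE.ATTR.WR
            tdg_vp_invept_all_vcpus()?;                  // TDG.VP.INVEPT on all vCPUs
            zero_page(gpa);                              // scrub stale contents
            Ok(())
        }
        Err(e) => Err(e),
    }
}
```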

@Freax13
Contributor

Freax13 commented Oct 31, 2024

Hmm, so I just took a deeper look into the tdx-tdcall crate and I noticed several flaws as well as other implementation oddities, e.g. tdcall_report using a heap-allocated buffer and a global lock (I'm not convinced either of these is necessary) or some functions halting instead of returning errors. If we also consider the issues around TDX_PAGE_ALREADY_ACCEPTED, maybe it would be better if we ditch the tdx-tdcall crate and write our own. It's not that much code and we wouldn't need to be dependent on the implementation choices made for td-shim. We also wouldn't need to wait for a release to be published to crates.io every time we want to make a change.

@peterfang
Contributor Author

So far we understood "page invalidation" to refer to making memory inaccessible to the SVSM and guest. On SNP, this unsets the validate bit in the RMP, on TDP the equivalent would be to move the page back to the PENDING state. "page invalidation" does not refer to any actions required to prepare memory to be used by an L2 e.g. scrubbing out secrets by zeroing memory.

I think "page invalidation" probably means something slightly different on TDP. I look forward to discussing with you what that means on TDP platforms. I'm open to addressing the concerns around TDX_PAGE_ALREADY_ACCEPTED with the crate stakeholders if that's not a desirable/safe behavior.

I think "page invalidation" is slightly different on TDP because there's no TD call for a guest to request for a page to be moved back to the PENDING state. Still, the GHCI spec has a VMCALL, TDG.VP.VMCALL<MapGPA>, that the SVSM can use to ask the host to move memory into a shared state and remove private pages. Of course, the SVSM can't trust the host to correctly execute TDG.VP.VMCALL<MapGPA>, but if the host doesn't remove the private pages, we would get TDX_PAGE_ALREADY_ACCEPTED errors upon re-accepting the memory and that's something we can actually detect.

This is an interesting idea. I've not seen a TD guest do private->shared->private conversions for the sole purpose of moving it to PENDING state. I believe this would be a pretty expensive operation and the end result is you get a zero private page after the guest accepts it again. I'm not sure if this has any advantages over zeroing the page in L1.

On the topic of page invalidation, I'm curious as to why we would want to make SVSM's early boot memory regions inaccessible to the guest on TDP platforms. We support L2 guests that do not accept memory, so the decision to accept a page is up to L1 itself. Also, our goal is to avoid giving the L2 guest a memory map that has fragmented lowmem. L2 BIOSes (e.g. OVMF) can assume that this range is contiguous during early memory discovery and we try to avoid adding special logic to them.

If we assume that well-behaved hypervisors always faithfully execute TDG.VP.VMCALL<MapGPA> by removing the private pages (I'm not sure if this assumption is correct), I think we can always treat TDX_PAGE_ALREADY_ACCEPTED as a fatal error. If this assumption is incorrect, I think that we should detect TDX_PAGE_ALREADY_ACCEPTED and bring the page into a known state by issuing TDG.MEM.PAGE.ATTR.WR TD calls to remove access bits for L2 VMs, flushing the EPT caches of L2 VMs using TDG.VP.INVEPT on all vCPUs, and zeroing the contents.

Yes, it's reasonable to assume that successful private->shared conversions always result in the removal of the private pages. I think behaviors breaking this assumption should be considered a bug. In the future we plan to introduce a "page acceptance bitmap" which tracks the pages that have already been accepted. We can then treat TDX_PAGE_ALREADY_ACCEPTED as a fatal error unless L2 guests are also capable of issuing TDG.MEM.PAGE.ACCEPT (this is probably not a case we will support initially). This bitmap is already used by OVMF & the Linux kernel running on TDX platforms.
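
As a sketch of how such a bitmap could gate acceptance (hypothetical types and names, not the existing SVSM bitmap):

```rust
const PAGE_SHIFT: u64 = 12;

/// One bit per 4K frame; a set bit means the frame has already been accepted.
struct AcceptBitmap {
    base_gpa: u64,
    bits: Vec<u64>,
}

impl AcceptBitmap {
    fn index(&self, gpa: u64) -> (usize, u64) {
        let frame = (gpa - self.base_gpa) >> PAGE_SHIFT;
        ((frame / 64) as usize, 1u64 << (frame % 64))
    }

    fn is_accepted(&self, gpa: u64) -> bool {
        let (word, mask) = self.index(gpa);
        self.bits[word] & mask != 0
    }

    fn mark_accepted(&mut self, gpa: u64) {
        let (word, mask) = self.index(gpa);
        self.bits[word] |= mask;
    }
}

/// Accept a page and record it; a repeat accept (or TDX_PAGE_ALREADY_ACCEPTED
/// from the TDX module) is treated as fatal rather than silently ignored.
fn accept_tracked(bitmap: &mut AcceptBitmap, gpa: u64) {
    assert!(!bitmap.is_accepted(gpa), "page {gpa:#x} accepted twice");
    tdg_mem_page_accept(gpa).expect("TDG.MEM.PAGE.ACCEPT failed");
    bitmap.mark_accepted(gpa);
}

fn tdg_mem_page_accept(_gpa: u64) -> Result<(), u64> {
    unimplemented!("raw TD call")
}
```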

@Freax13
Contributor

Freax13 commented Oct 31, 2024

This is an interesting idea. I've not seen a TD guest do private->shared->private conversions for the sole purpose of moving it to PENDING state. I believe this would be a pretty expensive operation and the end result is you get a zero private page after the guest accepts it again. I'm not sure if this has any advantages over zeroing the page in L1.

My main problem with just zeroing the page is that it doesn't match the doc-comment on the method:

Marks a physical range of pages as valid or invalid for use as private pages.

I get your argument that we don't gain much from doing a private->shared->private conversion, but I also think that we should implement methods as they are documented. If the requirements don't make sense for a given platform maybe that's a sign that we should reconsider our abstractions.
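
For example, one way the abstraction could be split so that the doc-comment stays truthful on every platform (a sketch only, not the actual SvsmPlatform trait):

```rust
#[derive(Clone, Copy)]
struct MemoryRegion {
    start: u64,
    end: u64,
}

enum PageValidateOp {
    Validate,
    Invalidate,
}

trait PlatformMemory {
    /// Marks a physical range of pages as valid or invalid for use as private
    /// pages. On SNP this maps onto PVALIDATE; on TDP, "invalidate" has no
    /// direct equivalent because nothing undoes TDG.MEM.PAGE.ACCEPT from inside
    /// the TD.
    fn validate_physical_page_range(
        &self,
        region: MemoryRegion,
        op: PageValidateOp,
    ) -> Result<(), ()>;

    /// Scrubs secrets from a range before handing it to an L2 guest. A separate
    /// concern from validity, so zeroing never masquerades as invalidation.
    fn scrub_physical_page_range(&self, region: MemoryRegion) -> Result<(), ()>;
}
```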

We currently have this function because on SNP calling pvalidate on a page that has already been pvalidate'd is very dangerous, and we can't rely on an error like TDX_PAGE_ALREADY_ACCEPTED to tell us that the page had already been accepted. We need to mark the pages as invalid so that OVMF can later safely validate them again. OVMF on SNP expects to be able to do this.

On the topic of page invalidation, I'm curious as to why we would want to make SVSM's early boot memory regions inaccessible to the guest on TDP platforms.

Maybe we don't. If OVMF is fine with already-accepted memory, I don't see a reason why we need to make it inaccessible. We currently need this on SNP, but if OVMF on TDP is fine without this maybe we don't need to do it.

We support L2 guests that do not accept memory, so the decision to accept a page is up to L1 itself.

The COCONUT SVSM doesn't currently support that (i.e. unenlightened guests). This is one of the future goals, but as of right now, this is not supported on any platform.

Also, our goal is to avoid giving the L2 guest a memory map that has fragmented lowmem. L2 BIOSes (e.g. OVMF) can assume that this range is contiguous during early memory discovery and we try to avoid adding special logic to them.

👍

If we assume that well-behaved hypervisors always faithfully execute TDG.VP.VMCALL<MapGPA> by removing the private pages (I'm not sure if this assumption is correct), I think we can always treat TDX_PAGE_ALREADY_ACCEPTED as a fatal error. If this assumption is incorrect, I think that we should detect TDX_PAGE_ALREADY_ACCEPTED and bring the page into a known state by issuing TDG.MEM.PAGE.ATTR.WR TD calls to remove access bits for L2 VMs, flushing the EPT caches of L2 VMs using TDG.VP.INVEPT on all vCPUs, and zeroing the contents.

Yes, it's reasonable to assume that successful private->shared conversions always result in the removal of the private pages. I think behaviors breaking this assumption should be considered a bug. In the future we plan to introduce a "page acceptance bitmap" which tracks the pages that have already been accepted. We can then treat TDX_PAGE_ALREADY_ACCEPTED as a fatal error unless L2 guests are also capable of issuing TDG.MEM.PAGE.ACCEPT (this is probably not a case we will support initially). This bitmap is already used by OVMF & the Linux kernel running on TDX platforms.

This bitmap already exists for SNP in the SVSM and I think maybe we can reuse it on TDP. The bitmap isn't currently part of any platform-specific abstraction, so maybe we don't even need to change anything?

@msft-jlange
Contributor

This thread has gotten a bit contorted, so I'll make a couple of points here.

Nothing in COCONUT-SVSM relies on the property that accessing invalidated pages will fault. Nothing in any L1 should ever rely on that property. PVALIDATE(FALSE) exists on SNP to prevent the possibility of page substitution attacks, but that's an artifact of how SNP works. There is no analogous attack on TDX, so there is no reason to have the logical equivalent of PVALIDATE(FALSE).

It is true that we need the ability to revoke access from an L2 so we can have the property that unexpected access by an L2 will generate a fault. This is needed for any unenlightened code. But we can achieve that property by simply revoking L2 access, which is possible on both SNP and TDX (on SNP, it is achieved through RMPADJUST and on TDX, it's achieved through TDG.MEM.PAGE.ATTR.WR). This revocation is not related to page invalidation.
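
As a sketch of that separation (hypothetical helper names; only the dispatch is the point):

```rust
enum Platform {
    Snp,
    Tdp,
}

/// Revoke an L2 guest's access to a page without touching its validation state.
fn revoke_l2_access(platform: Platform, gpa: u64) -> Result<(), u64> {
    match platform {
        // SNP: drop the lower VMPLs' permissions for the page via RMPADJUST.
        Platform::Snp => rmpadjust_revoke_vmpl_permissions(gpa),
        // TDX: clear the L2 access bits via TDG.MEM.PAGE.ATTR.WR.
        Platform::Tdp => tdg_mem_page_attr_wr_clear_l2_access(gpa),
    }
}

fn rmpadjust_revoke_vmpl_permissions(_gpa: u64) -> Result<(), u64> {
    unimplemented!("RMPADJUST wrapper")
}

fn tdg_mem_page_attr_wr_clear_l2_access(_gpa: u64) -> Result<(), u64> {
    unimplemented!("TD call wrapper")
}
```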

It is definitely the case that it is not possible to rely on TDX_PAGE_ALREADY_ACCEPTED to know whether a page was already accepted because the host can remove/reinsert a page and make reacceptance appear to be safe. I agree that the only way to prevent this is to include a bitmap in the L1 to track which pages have been accepted. This is on the roadmap, but it will require additional memory that scales with the size of the guest, so we can't implement it until we have a contract between L1 and L2 to allow the L1 to consume additional memory. Until we have a commitment to make the required OVMF changes, we can't move forward with this work. Peter has previously suggested that he could change OVMF to accept memory information in IGVM format, which would unblock this work, but that remains a goal and not a plan. The only bitmap that exists in the SVSM today applies to L1-owned memory, which isn't enough to solve the problem generally.

Regardless of whether TDX_PAGE_ALREADY_ACCEPTED is a robust error notification, it is easy to say that it is never benign. If the L1 observes that error, then something is definitely wrong, and that error needs to be visible. Not seeing the error doesn't mean that the operation is safe, but seeing the error means the operation is definitely not safe.

@msft-jlange
Contributor

maybe it would be better if we ditch the tdx-tdcall crate and write our own. It's not that much code and we wouldn't need to be dependent on the implementation choices made for td-shim. We also wouldn't need to wait for a release to be published to crates.io every time we want to make a change.

I am equally nervous about the safety of that crate and I had reached the same conclusion. There is very little special logic in that crate and replicating it (more robustly) in our own project is trivial and would be very worthwhile.

@peterfang
Contributor Author

peterfang commented Nov 4, 2024

This bitmap already exists for SNP in the SVSM and I think maybe we can reuse it on TDP. The bitmap isn't currently part of any platform-specific abstraction, so maybe we don't even need to change anything?

That's pretty much the plan - to reuse that bitmap for TDP platforms as well. I'm glad we're on the same page :-)

Thanks for explaining why page invalidation is needed on SNP. I believe there's no such hard requirement for TDP machines. OVMF does assume that some of the pages (such as pages needed by OVMF initialization code) have already been installed when it runs and it can discover them. Sounds like we're trying to make these interfaces meaningful for both platforms and I completely agree that the implementation in this PR doesn't match the doc-comment on the method.

@peterfang
Contributor Author

Peter has previously suggested that he could change OVMF to accept memory information in IGVM format, which would unblock this work, but that remains a goal and not a plan. The only bitmap that exists in the SVSM today applies to L1-owned memory, which isn't enough to solve the problem generally.

Yes, this is still a goal of mine, and the initial conversation with our OVMF owner was positive. OVMF now has incredibly complicated x86 boot paths because of platform-dependent requirements, and perhaps this could be a step toward convergence. We're also working with them through some challenges we encountered when enabling OVMF for L2 TDP guests. My next step is to work with them and get back to you with their feedback.

@peterfang
Contributor Author

maybe it would be better if we ditch the tdx-tdcall crate and write our own. It's not that much code and we wouldn't need to be dependent on the implementation choices made for td-shim. We also wouldn't need to wait for a release to be published to crates.io every time we want to make a change.

I am equally nervous about the safety of that crate and I had reached the same conclusion. There is very little special logic in that crate and replicating it (more robustly) in our own project is trivial and would be very worthwhile.

Yes, it has some soundness/correctness issues overall. We were trying not to reinvent the wheel, but we may have much higher security requirements than they do. How about we open an issue for this rewrite and address it separately?

@peterfang
Contributor Author

peterfang commented Nov 4, 2024

Thanks for all of the feedback. It is clear now that there's no need for page invalidation on TDP platforms. I will try to address the concerns around page invalidation and update this PR. The goal is to remove any implication that zeroing these pages is equivalent to invalidation.

Add align_down() in addition to align_up() to the Address trait, and use
the generic align_{up,down}() calls while at it.

Signed-off-by: Peter Fang <[email protected]>
@peterfang changed the title from "platform: tdp: Make validate_physical_page_range() more flexible" to "Scrub memory before platform-specific invalidation" on Nov 7, 2024
@peterfang
Contributor Author

Updated the PR based on feedback

Add page_align() in addition to page_align_up(). This mirrors the
methods provided by the Address trait.

Signed-off-by: Peter Fang <[email protected]>
Allow the entire virtual range to be allocated at once. No guard page is
needed in this case.

Signed-off-by: Peter Fang <[email protected]>
There's no need for page invalidation on TDP platforms and zeroing
memory instead makes it confusing. Remove the implementation altogether.

Signed-off-by: Peter Fang <[email protected]>
Page invalidation is not required on TDP platforms. Neutralize the
naming so that these operations don't always imply invalidation. Memory
scrubbing will be introduced in a later commit.

No functional changes.

Signed-off-by: Peter Fang <[email protected]>
Instead of invalidating memory, TDP platforms only need to zero out its
content before handing it over to the L2 guest. This is not equivalent
to page invalidation because the page remains accepted after zeroing.

Always scrub the memory region prior to performing platform-specific
invalidation (which is a no-op on TDP platforms). This should have no
functional impact on SNP platforms.

To scrub a physical memory region, the following strategies are adopted:

- Add a cap to the mapping size per iteration
- Try using 4K temporary mappings when the mapping size is small enough
  (< 2M) and try using 2M temporary mappings otherwise
- When a mapping guard cannot be created, reduce the mapping size by
  half

Signed-off-by: Peter Fang <[email protected]>
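
A rough sketch of the scrubbing loop described in the commit message above (the mapping helper and constants stand in for the real SVSM per-CPU mapping guard types; only the cap/fallback logic is illustrated):

```rust
const PAGE_SIZE: u64 = 0x1000;
const PAGE_SIZE_2M: u64 = 0x20_0000;
// Cap on how much is mapped (and zeroed) per iteration.
const MAX_SCRUB_CHUNK: u64 = 16 * PAGE_SIZE_2M;

/// Zero a physical range; `paddr` and `end` are assumed to be 4K-aligned.
fn scrub_phys_region(mut paddr: u64, end: u64) -> Result<(), ()> {
    while paddr < end {
        let mut len = (end - paddr).min(MAX_SCRUB_CHUNK);
        loop {
            // Use 4K temporary mappings for small chunks, 2M mappings otherwise.
            let align = if len < PAGE_SIZE_2M { PAGE_SIZE } else { PAGE_SIZE_2M };
            // Round down to the mapping granularity, but map at least one 4K page.
            let map_len = (len & !(align - 1)).max(PAGE_SIZE);
            match map_temporary(paddr, map_len) {
                Ok(vaddr) => {
                    // Zero the mapped window, then let the temporary mapping drop.
                    unsafe { core::ptr::write_bytes(vaddr as *mut u8, 0, map_len as usize) };
                    paddr += map_len;
                    break;
                }
                // If the mapping guard cannot be created, halve the size and
                // retry, down to a single 4K page.
                Err(()) if len > PAGE_SIZE => len /= 2,
                Err(()) => return Err(()),
            }
        }
    }
    Ok(())
}

/// Stand-in for creating a temporary per-CPU mapping guard in the SVSM.
fn map_temporary(_paddr: u64, _len: u64) -> Result<u64, ()> {
    unimplemented!()
}
```
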
@msft-jlange
Contributor

I still don't understand the point of this PR. You're zeroing memory before it is invalidated. However, before it can be used again, it has to be validated again, and the process of validating it again will also zero it. So what's the point of zeroing at invalidation time?

It looks like you are guarding against the possibility of leaking data from L1 to L2. However, in the SNP case, the pages are zeroed at the time they are validated by L2. In the TDP case, there is no code yet that knows how to manage an L2 at all, so it's way too premature to be making assumptions about how L1/L2 memory delegation for these specific memory ranges is supposed to work. It would be much better to defer all of this until there's a clear design for how memory delegation is supposed to work on all platforms and then we can make it work correctly on all of them.

@peterfang
Contributor Author

Makes sense. Dropping this PR and will be removing the page invalidation implementation for TDP in a different PR.

@peterfang closed this Nov 7, 2024
@peterfang deleted the tdp-validate-prange branch November 7, 2024 19:34
@msft-jlange
Contributor

msft-jlange commented Nov 7, 2024

Makes sense. Dropping this PR and will be removing the page invalidation implementation for TDP in a different PR.

Thanks for working through this!

I'm planning on submitting a PR that removes references to tdx-tdcall and replaces it with local implementations. I have to touch the TDP version of validation as part of that. I can clean up the TDP invalidation logic at the same time.
