Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sharing CpuAdressableDisplayProvider between eglstreams and gbm platforms #3067

Merged
merged 15 commits into from
Oct 26, 2023

Conversation

mattkae
Copy link
Contributor

@mattkae mattkae commented Oct 6, 2023

What's new?

  • The EglStreams platform now has a CPUAddressableDisplayProvider so that it works in this specific hybrid situation
  • Moved some classes to be shared between GBM and EGLStreams

How to test

sudo XDG_RUNTIME_DIR=/run ./bin/mir_demo_server --platform-rendering-libs mir:gbm-kms

It still SIGABRTs after, but it gets through the selection part.

@RAOF
Copy link
Contributor

RAOF commented Oct 6, 2023

This'll get Mir starting, but it won't actually display anything. To get the CPU buffers on the screen, you'll need to additionally touch DisplayBuffer::overlay to store the Framebuffer being passed in, and DisplayBuffer::post to actually use the Framebuffer in a KMS call.

Here is where that is currently done, and specifically here is where the KMS FB ID is used.

@mattkae
Copy link
Contributor Author

mattkae commented Oct 6, 2023

This'll get Mir starting, but it won't actually display anything. To get the CPU buffers on the screen, you'll need to additionally touch DisplayBuffer::overlay to store the Framebuffer being passed in, and DisplayBuffer::post to actually use the Framebuffer in a KMS call.

Here is where that is currently done, and specifically here is where the KMS FB ID is used.

Oh that's the next_swap business, huh?

@mattkae
Copy link
Contributor Author

mattkae commented Oct 6, 2023

This'll get Mir starting, but it won't actually display anything. To get the CPU buffers on the screen, you'll need to additionally touch DisplayBuffer::overlay to store the Framebuffer being passed in, and DisplayBuffer::post to actually use the Framebuffer in a KMS call.

Here is where that is currently done, and specifically here is where the KMS FB ID is used.

And does that mean that we sometimes use the next_swap and sometimes use the existing path?

@RAOF
Copy link
Contributor

RAOF commented Oct 6, 2023

And does that mean that we sometimes use the next_swap and sometimes use the existing path?

Yes. When there's a non-null std::unique_ptr<Framebuffer> in there it should take the direct KMS path, otherwise the EGL path.

As a practical matter I think any given instance is only ever going to use one of these paths - either always direct KMS, or always EGLStream.

@RAOF
Copy link
Contributor

RAOF commented Oct 9, 2023

Extra bits of relevance:
When making the atomic commit in the CPUAllocator path, the ALLOW_MODESET flag should instead be the pageflip flag. Additionally, the final nullptr there will want to be replaced with something to tie into the ThreadedDRMHandler thing (so that you can get a future that'll resolve when the page flip completes).

@mattkae
Copy link
Contributor Author

mattkae commented Oct 11, 2023

I'm getting a permission denied error on set_crtc for the gbm-kms platform. I am unsure of the reason at the moment

@mattkae mattkae force-pushed the platform-API-merge-fix-eglstreams branch from 6a201e0 to 90649d9 Compare October 11, 2023 19:07
@codecov
Copy link

codecov bot commented Oct 11, 2023

Codecov Report

Merging #3067 (1705d00) into main (4a57bc6) will decrease coverage by 0.07%.
Report is 7 commits behind head on main.
The diff coverage is 26.85%.

@@            Coverage Diff             @@
##             main    #3067      +/-   ##
==========================================
- Coverage   78.43%   78.36%   -0.07%     
==========================================
  Files        1073     1074       +1     
  Lines       74717    74804      +87     
==========================================
+ Hits        58602    58623      +21     
- Misses      16115    16181      +66     
Files Coverage Δ
src/platforms/common/server/kms_framebuffer.h 100.00% <ø> (ø)
...platforms/eglstream-kms/server/drm_event_handler.h 100.00% <ø> (ø)
src/platforms/gbm-kms/server/kms/display.cpp 56.90% <ø> (-0.69%) ⬇️
...rc/platforms/gbm-kms/server/kms/display_buffer.cpp 71.18% <100.00%> (ø)
src/platforms/gbm-kms/server/kms/display_buffer.h 100.00% <ø> (ø)
src/platforms/gbm-kms/server/kms/kms_output.h 100.00% <ø> (ø)
src/platforms/gbm-kms/server/kms/platform.cpp 23.83% <100.00%> (ø)
...atforms/gbm-kms/kms/test_display_multi_monitor.cpp 98.02% <100.00%> (+0.01%) ⬆️
...sts/platforms/gbm-kms/kms/test_real_kms_output.cpp 98.56% <ø> (ø)
...eglstream-kms/server/kms_display_configuration.cpp 0.00% <0.00%> (ø)
... and 8 more

... and 7 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@Saviq
Copy link
Collaborator

Saviq commented Oct 12, 2023

I've not seen this one elsewhere:

 [ RUN ] RealKMSOutputTest.drm_set_gamma
/spread/mir/tests/unit-tests/platforms/gbm-kms/kms/test_real_kms_output.cpp:516: Failure
Actual function call count doesn't match EXPECT_CALL(mock_drm, drmModeCrtcSetGamma(drm_fd, crtc_ids[0], gamma.red.size(), const_cast<uint16_t*>(gamma.red.data()), const_cast<uint16_t*>(gamma.green.data()), const_cast<uint16_t*>(gamma.blue.data())))...
Expected: to be called once
Actual: never called - unsatisfied and active

[ FAILED ] RealKMSOutputTest.drm_set_gamma (0 ms)
[ RUN ] RealKMSOutputTest.drm_set_gamma_failure_does_not_throw
/spread/mir/tests/unit-tests/platforms/gbm-kms/kms/test_real_kms_output.cpp:545: Failure
Actual function call count doesn't match EXPECT_CALL(mock_drm, drmModeCrtcSetGamma(drm_fd, crtc_ids[0], gamma.red.size(), const_cast<uint16_t*>(gamma.red.data()), const_cast<uint16_t*>(gamma.green.data()), const_cast<uint16_t*>(gamma.blue.data())))...
Expected: to be called once
Actual: never called - unsatisfied and active

[ FAILED ] RealKMSOutputTest.drm_set_gamma_failure_does_not_throw (3 ms)

@Saviq
Copy link
Collaborator

Saviq commented Oct 12, 2023

I've not seen this one elsewhere:

Correction. As one could expect, this happens occasionally on platform-API-merge as well.

{
auto const predicted_render_time = 50ms; // Predicted worst case render time for the next frame...

output->set_crtc(*next_swap);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will do an immediate, tearing CRTC update (ie: it will not wait for vsync). I don't think this will trigger the flip event handler, either?

@RAOF
Copy link
Contributor

RAOF commented Oct 16, 2023

I'm getting a permission denied error on set_crtc for the gbm-kms platform. I am unsure of the reason at the moment

Hm. It's not obvious why that would be the case. I was wondering whether we'd need to not do the EGLStream setup if we're using the CPUAllocated path. That is, defer the eglStreamConsumerOutputEXT call until an EGLStream consumer requests it.

But I wouldn't expect that to manifest as permission denied. I'll test this and see what I can immediately dig up tomorrow.

@mattkae
Copy link
Contributor Author

mattkae commented Oct 16, 2023

I'm getting a permission denied error on set_crtc for the gbm-kms platform. I am unsure of the reason at the moment

Hm. It's not obvious why that would be the case. I was wondering whether we'd need to not do the EGLStream setup if we're using the CPUAllocated path. That is, defer the eglStreamConsumerOutputEXT call until an EGLStream consumer requests it.

But I wouldn't expect that to manifest as permission denied. I'll test this and see what I can immediately dig up tomorrow.

Sounds good let me know 👍

@RAOF
Copy link
Contributor

RAOF commented Oct 17, 2023

With --platform-display-libs=mir:eglstream-kms and --platform-rendering-libs=mir:gbm-kms I do not see the permission denied. I do see an abort() in libepoxy, though.

Wait - I could have sworn that a previous version of this set next_swap in overlay, but the current version does not. That's not going to work!

With that fixed, it then doesn't crash, but it does hang indefinitely in pending_flip.get(), presumably because we haven't actually hooked up the flip event handling.

@RAOF
Copy link
Contributor

RAOF commented Oct 17, 2023

OK. The last commit I pushed almost works (and is not suitable for merging). Everything seems to work - post() doesn't block, no exceptions are raised, the frame events get signalled - except that nothing appears on the screen. 🤦‍♀️

@RAOF
Copy link
Contributor

RAOF commented Oct 17, 2023

https://gitlab.freedesktop.org/daniels/kms-quads/-/blob/master/kms.c is a pretty good reference for Atomic KMS shenanigans.

@mattkae
Copy link
Contributor Author

mattkae commented Oct 17, 2023

OK. The last commit I pushed almost works (and is not suitable for merging). Everything seems to work - post() doesn't block, no exceptions are raised, the frame events get signalled - except that nothing appears on the screen. 🤦‍♀️

Reproduced this on my local setup at least 😄 Although I'm getting a mysterious error:

error in client communication (pid 1387)
[destroyed object]: error 4: Client requested unsupported format/modifier combination DRM_FORMAT_ARGB8888/NVIDIA:BLOCK_LINEAR_2D,HEIGHT=4,KIND=254,GEN=0,SEC

Correction: I can flash stuff on the screen (e.g. gedit), but it is clearly mildly broken...

Update: Obviously the cursor update on gedit is making it pop on the screen from time to time. The screens seems to have no persistence though, as though it's getting cleared right away... 🤔


scheduled_fb = std::move(next_swap);
next_swap = nullptr;
auto const predicted_render_time = 50ms; // Predicted worst case render time for the next frame...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't actually need to hook up the recommended sleep stuff here; it's not really hooked up anywhere else and it's not really useful.


if (ret != 0)
{
BOOST_THROW_EXCEPTION((std::system_error{-ret, std::system_category(), "Failed to commit Atomic KMS state"}));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should at least fall back to trying again with ALLOW_MODESET (that would require sharing some more state with configure, notably the “mode blob”).

I don't think we need to have a fallback to drmModeSetCrtc, as we've guaranteed we have the atomic API available

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this was the cause of this issue. The kmscube people always do an allow mode set on the first render, and then turn it off afterwards. Does that sound like something we might want too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well... actually it seems like configure should do the ALLOW_MODESET for us huh?

@@ -316,6 +320,42 @@ uint32_t mgek::EGLOutput::crtc_id() const
return static_cast<uint32_t>(crtc_id);
}

bool mgek::EGLOutput::queue_atomic_flip(FBHandle const& fb, void const* drm_event_userdata)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be shared with configure; it's largely the same code. Probably configure can be changed to call this.

Base automatically changed from platform-API-merge to main October 18, 2023 12:59
@Saviq Saviq marked this pull request as ready for review October 18, 2023 13:03
@mattkae
Copy link
Contributor Author

mattkae commented Oct 18, 2023

@RAOF I'm finding the following error when running ./bin/miral-shell.bin --platform-display-libs=mir:eglstream-kms --platform-rendering-libs=mir:gbm-kms btw:

error in client communication (pid 2859)
[destroyed object]: error 4: Client requested unsupported format/modifier combination DRM_FORMAT_ARGB8888/NVIDIA:BLOCK_LINEAR_2D,HEIGHT=4,KIND=254,GEN=0,SEC

Update: The culprit is the DecorationProviderClient::draw_background in miral-shell

Update again: After fidgeting with it/running it a lot, I can no longer reproduce, which leads me to believe it may be some sort of modesetting issue?

Update again again: I see this after restarting my computer. That's a funny one! I'll see if allowing a modeset later on makes it happy?

Copy link
Contributor

@RAOF RAOF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, works¹ for me, both with just eglstream-kms driving the display + gbm-kms rendering, and with MIR_EXPERIMENTAL_HYBRID so that both amdgpu and nvidia displays are active.

Just a couple of nits, plus some error handling needed.

¹: There's a malloc-corruption crash/EGL_BAD_ALLOC I see with gbm-kms, but I don't think this is related and I'll continue to investigate.

@@ -256,7 +263,7 @@ class DisplayBuffer

std::chrono::milliseconds recommended_sleep() const override
{
return std::chrono::milliseconds{0};
return std::chrono::milliseconds(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This diff is not particularly necessary 😉

drmModeAtomicAddProperty(request.get(), plane_id, plane_props->id_for("CRTC_ID"), _crtc_id);
drmModeAtomicAddProperty(request.get(), plane_id, plane_props->id_for("FB_ID"), fb);

/* We don't monitor the DRM events (yet), so have no userdata */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually do have userdata now 😉

int ret = atomic_commit(fb, drm_event_userdata, DRM_MODE_ATOMIC_NONBLOCK | DRM_MODE_PAGE_FLIP_EVENT);
if (ret != 0)
{
BOOST_THROW_EXCEPTION((std::system_error{-ret, std::system_category(), "Failed to commit Atomic KMS state"}));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, there are a bunch of ways atomic_commit can fail, but transiently. It can require a modeset after switching back from VT switching. It can also fail with EBUSY if there's already a pending commit that hasn't finished (although we shouldn't hit that, having waited for previous page flips.)

Maybe we don't have to handle EBUSY here, and just assume that our page-filp handling will cover it.

We should handle the resume-from-VT-switch, which probably means twiddling a flag in mge::Display::resume() that causes queue_atomic_flip to additionally pass ALLOW_MODESET?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented something that makes a lot of sense, but I get a fatal error whenever I VT switch 🤔

@mattkae mattkae force-pushed the platform-API-merge-fix-eglstreams branch from af206e7 to 8361f3d Compare October 19, 2023 15:37
Copy link
Contributor

@RAOF RAOF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so this does look plausible, but we die with EPERM in atomic-commit when VT switching away.

What's happening here is that we've lost DRM master, but are still in the process of scheduling a flip. We should catch EPERM in this case or, possibly, not throw an exception at all here and just log any failure - that's what gbm-kms does.

next_swap = std::move(fb);
return true;
}
if (auto fb = std::dynamic_pointer_cast<graphics::FBHandle>(renderable_list[0].buffer))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, what? Is this just the same if duplicated twice?

@RAOF
Copy link
Contributor

RAOF commented Oct 23, 2023

Ok, a debugging log:

  • VT switching away works, but VT switching back dies
    • Oh, it's in WaylandConnector::start!?
    • Oh, it's because the VT switch away didn't properly stop the WaylandConnector, and so there's an already running Wayland dispatch thread?!
    • Ooooh, Failure to pause in VT switching silently leaves compositor running #3087
    • Ah, if you stop it in gdb before the compositor-shut-down timeout, you find that it's blocking in DisplayBuffer::post().
    • Aaaah, because it's waiting for a flip event that will never come, because we failed to actually submit the flip, because we've lost access to the DRM node due to VT switch
    • Fix that and now… it still fails when VT switching back? But now it does so with EBUSY.
    • Make modesetting commits blocking, to work around EBUSY?
    • Ok, that gets things mostly working
    • Oh, but repeatedly VT switching it looks like sometimes we hit
[2023-10-23 17:41:47.476460] < - debug - > mirserver: Received logind force-pause event for device 13:64                                                                                                                                    │
│[2023-10-23 17:41:47.476480] < - debug - > mirserver: Received logind force-pause event for device 13:92                                                                                                                                    │
│[2023-10-23 17:41:47.512323] < CRITICAL! > mirserver: terminate_with_current_exception(): /home/chris/Canonical/Mir/mir/main/src/platforms/eglstream-kms/server/egl_output.cpp(355): Throw in function bool mir::graphics::eglstream::kms::E│
│GLOutput::queue_atomic_flip(const FBHandle &, const void *)                                                                                                                                                                                 │
│Dynamic exception type: boost::wrapexcept<std::system_error>                                                                                                                                                                                │
│std::exception::what: Failed to commit Atomic KMS state: Invalid argument                                                                                                                                                                   │
│                                                                                                                                                                                                                                            │
│[2023-10-23 17:41:47.512411] < - debug - > mirserver: Handling Terminated from pid=632670                                                                                                                                                   │
│[2023-10-23 17:41:47.512695] < -warning- > mirserver: wl_surface@9 destroyed before associated role                                                                                                                                         │
│[2023-10-23 17:41:47.516028] <information> evdev-input: Removed /dev/input/event7: Ducky One 3 TKL RGB                                                                                                                                      │
│ERROR: /home/chris/Canonical/Mir/mir/main/src/platforms/eglstream-kms/server/egl_output.cpp(355): Throw in function bool mir::graphics::eglstream::kms::EGLOutput::queue_atomic_flip(const FBHandle &, const void *)                        │
│Dynamic exception type: boost::wrapexcept<std::system_error>                                                                                                                                                                                │
│std::exception::what: Failed to commit Atomic KMS state: Invalid argument

@RAOF
Copy link
Contributor

RAOF commented Oct 23, 2023

So next steps here:

  • Rather than making ALLOW_MODESET commits blocking, we should instead return EBUSY up to post(), and if we hit that, sleep a couple of milliseconds and try again.
  • See if the EINVAL that sometimes shows up still does with a retry-on-EBUSY approach. If it does, enable some DRM kernel debugging and see if that gives any clues as to what is invalid.

KMS kernel debugging: echo 0x14 | sudo tee /sys/module/drm/parameters/debug will (should) enable KMS debugging (0x4) and atomic debugging (0x10). After enabling that, modesetting should be very verbose in dmesg.

The kernel docs show what debug options are available.

@Saviq Saviq assigned RAOF and unassigned mattkae Oct 23, 2023
RAOF added 3 commits October 24, 2023 16:38
Urgh, there are so many ways that an atomic commit can fail, even if we do everything correctly.
We *might* need to punt this way back up to the `Compositor` to handle to get the EGLStream output
VT switch wortking, but that's broken on Mir release anyway 😬
We now handle `EBUSY` higher up the stack, so the `EGLOutput` can always use non-blocking
commits, even for modesetting.
We fully expect VT switching to cancel flip notifications, we don't really
need to log that now.
@RAOF
Copy link
Contributor

RAOF commented Oct 24, 2023

Alright! Apart from #3089 (which happens on 2.15, too 🤦‍♀️), this appears to entirely work for me now. 🎉

VT switching with the CPUAddressableDisplayProvider works reliably, things work with MIR_EXPERIMENTAL_HYBRID_GRAPHICS=1, and the EGLStreamOutput only crashes in ways that the current release does!

RAOF added 2 commits October 24, 2023 17:20
“Basic” is never the best name if there's something else you can think of,
and `kms::` is a much better descriptor for it.
We require KMS “dumb” buffers for the CPU-addressable output, so we might as
well check that the driver claims to support them.
Copy link
Contributor

@RAOF RAOF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now good to go as far as I'm concerned. The only vaguely concerning thing is that set_flags_for_next_flip is thread-unsafe, but this doesn't matter because it's only called from resume(), which is called from DisplayServer::Private::resume(), which calls it before the compositor thread(s) are started.

I can't really approve this now, having authored a bunch of it, so @AlanGriffiths ?

@Saviq
Copy link
Collaborator

Saviq commented Oct 24, 2023

And I can confirm our smoke tests are fine, too :)

  job passed   : Run Mir smoke tests
  job passed   : Run Mir smoke tests as root                                                                                       
  job passed   : Run hybrid Mir smoke tests rendering on CometLake-H GT2 [UHD Graphics]
  job passed   : Run hybrid Mir smoke tests rendering on TU117GLM [Quadro T2000 Mobile / Max-Q]

Copy link
Collaborator

@AlanGriffiths AlanGriffiths left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't really test this, but it looks reasonable.

Some nitpicking to address

Copy link
Collaborator

@AlanGriffiths AlanGriffiths left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK now

Not merging as @mattkae is planning on some (hopefully final) testing

@Saviq Saviq mentioned this pull request Oct 24, 2023
@Saviq
Copy link
Collaborator

Saviq commented Oct 24, 2023

Got a test to fix on clang:

	183 - mir_unit_tests_gbm-kms.MesaDisplayMultiMonitorTest.* (Failed)

@AlanGriffiths
Copy link
Collaborator

183 - mir_unit_tests_gbm-kms.MesaDisplayMultiMonitorTest.* (Failed)

OK, I know what is different, just not (yet) worked out why...

In the clang build drmModeAddFB2() is intercepted once, and so only stores fb_id once. After which, the drmModePageFlip() calls that check the fb_id parameter fail (because they are passed 0).

In the gcc build, drmModeAddFB2() is intercepted and stores fb_id three times. And the drmModePageFlip() calls that get the expected fb_id parameter.

Now to figure out how this needs fixing...

@mattkae
Copy link
Contributor Author

mattkae commented Oct 25, 2023

@RAOF and @AlanGriffiths: I found these two bugs in my final testing, but feel free to tackle them at a later date if you think they shouldn't block this work:

@AlanGriffiths AlanGriffiths added this pull request to the merge queue Oct 26, 2023
Merged via the queue into main with commit deca1ba Oct 26, 2023
32 of 33 checks passed
@AlanGriffiths AlanGriffiths deleted the platform-API-merge-fix-eglstreams branch October 26, 2023 08:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants