-
Notifications
You must be signed in to change notification settings - Fork 948
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow sharing assigns between live navigation #3482
base: main
Are you sure you want to change the base?
Conversation
cb51076
to
5d0b028
Compare
This commit relies on a yet to be finalized feature in Phoenix Channels to perform a custom handover between channel rejoins. It changes the LV code to not leave the old channel before rejoining and also instructs Phoenix to not kill the old process before starting the new channel process. After the new channel is joined, the old one is killed. Any `assign_new` calls in the LV mount will try to fetch assigns from the old LV. This is a backwards compatible optimization, so if a version of Phoenix is used that does not support handover, it just falls back to calling the function supplied to assign_new as usual. Closes #3357.
5d0b028
to
8518e61
Compare
@SteffenDE I originally thought about another way to implement this, which is to let the server say "I have spawned up a new channel" and the client assumes it has taken over control. The benefit of doing it this way is that you don't need to keep the process around. You immediately spawn the new process passing the assigns it needs and that's it. Do you have thoughts about this route? |
I'll actually need to handle push_navigate as well. Forgot about that. I need to think about the consequences of your proposal (e.g. we'll still want to start the new process under the channel's supervisor). Currently, I cannot really imagine what an API for this would look like. We cannot really shorten the time that the process needs to wait though, I think, as we don't know the keys from assign_new upfront - at least without looking through the LV mount's AST or just passing all assigns, which could be rather expensive. So we need to at least wait for the initial mount of the new LV, don't we? |
We will wait for the initial mount but, because we are also the one who called the initial mount, we will know if the initial mount either succeed or failed. We don't need to have a timeout that says "in case I am not pinged in 15 seconds, terminate". |
We don’t wait in the current implementation :) |
So yeah I think the current implementation is fine for client-initiated live navigation. What you described is what we need for push_navigate. |
Oh, I see, we don't wait because it is client initiated! @SteffenDE if we need the version I proposed for push_navigate, it can then be used for both push_navigate and live_navigate, right? 🤔 So it should be more general? |
Another downside of the current implementation is that it only works if you rejoin the same channel. The other version should be a bit more general... but it may also be (much?) harder to implement. :( |
Sorry to disrupt the conversation, but I couldn't understand the end-goal with the issue. |
@greven yes, the end goal is to reduce things like querying the current user from the DB on each live navigation by expanding |
TODO: check flash handling TODO: refactor js to remove duplicatings between replaceMain and handover TODO: check if should keep the existing view class or replace it completely, passing the old channel to the new one?
@josevalim I implemented a proof of concept for server initiated handovers in 2ec6c6a and phoenixframework/phoenix@cd99fcf. It breaks a couple of tests, so very much not finished, but it shows the concept. I did not change the client handovers to use this yet, but we could do it by changing the client from issuing a rejoin to instead send a message and trigger the same path. |
@bcardarella it would, but it's imho probably nothing LiveView will ever try to do do itself. There have been lots of discussions around the double mount in the past. The main issue is that you simply cannot assume that a live render will happen at all as a dead render can be triggered by any crawler, bot, etc. without ever connecting to the WebSocket. Therefore, you'd always need a timeout on how long things would be kept in memory. If people want to do this, it should be possible to implement something like this right now by spawning a globally registered process under a key that is put into the session on the dead render, requesting the data from it during the live render. This could be implemented as a separate library and shared with the community if there is lots of demand for it. I'm not sure if it is really worth it though. Anyways, it's definitely out of scope for this issue. |
I highly vote in favor of this for different reasons; I think LV should always spawn the new process waiting for the connect. Time To First InteractionThe average web visitor will leave after opening a page or site in a very few seconds. It's normal to see average session times in seconds for even the most-read websites worldwide (yes even the sites you are reading for hours sometimes have average session times that would surprise you). Users will open a tab then put the phone down, users will click on a link on another tab and never return, etc. All this means that users who are just connecting and are getting the fresh browser tab will make up a large proportion of the user experience. As such, LiveView should optimize for the first few interactions. By sharing all the Even if the assigns take <50 ms to compute on average, the long tail will be a noticeable lag on the some percentage of socket connection. This means I would like to give developers a default where they don't have to think about bucketing assigns as needing to be cached or shared and not. Methods that start out fast have a way of adding more complexity as a code base matures. Why LV Users CareIf you are a web shop, you care greatly about the time to the first interaction since you hope that the first interaction is placing something in the cart. Heavy web applications with long-running sessions will consider the time to first interaction for sign-up. I don't want my user to bounce because the page froze. Every extra ms of input latency is painful here. Fewer Foot CannonsTreating all Assigns Caching TuningBy unifying assigns for deadview, connect and reconnect it has a nice operational property: We can tune how long to keep processes and their assigns alive. This gives a very nice knob. For example, if I operate a website and I see that my users are having latency issues that last between 500ms and 1.5s, I can tune the LV processes to keep the assigns around for 2 seconds. This means that almost all cell tower-related latency blips will not need to recompute their As an SRE operating a system, I want to have a way to trade off memory to perceived latency. If I have a site where almost no one reconnects (dead view only), then I can turn off speculatively spawning the process or reduce the wait time significantly. Currently, I don't have an easy way to say I am okay with using more memory for a better user experience by default. Source: I have been the SRE/PE in charge of a large web property that made live connections for processing/rendering, and I have been building/using LV for a while now.
|
@elliottneilclark I'm sorry to disappoint you, but this PR as well as the corresponding issue are not about sharing assigns between the dead and connected mount. We're only experimenting with sharing assigns between navigations that happen over the existing websocket connection. If you want to optimize loading times for the connected mount, you can do so by having important data cached in memory. There are a lot of caveats with doing such things, especially when trying to do them by default, and I'm not convinced that this is something we want to focus on - at least right now. That being said, I can imagine an external library you might plug into your LiveView application that does most of the work for you, so if that's something you really think we should have, feel free to explore this. The important bits to do it are already there. I could imagine something like this working:
|
I know, and that's fine; I appreciate any progress on LV. However, before LV goes down the rabbit hole of optimizing assigns for different scenarios, it's important to say that I have seen the more general solution work at scale and that it has benefits. Hyrum's law means it's worth the time before settling on assigns behavior that will become the api. The time to first interact and reducing UI jankiness are essential to the end-user experience. Most web properties try to estimate the time to first interaction with FCP. However, from experience, that's not a great indication of actual user perception of the UI for reactive web properties. Reactive web pages often have an FCP that's good to look at and solves half the problem of slow rendering. The other important metric is when all UI elements on the screen are fully alive/inflated/running and when that UI isn't fully live/running (reconnect, or live navigation). Sharing assigns between navigations, initial mounts, and reconnect seems like a general transition of state that can be improved without the need for different APIs.
Batteries Included did it differently; by having long-running gen servers that cache the data, we made getting
Thank you for the suggestion, we might explore that in the future. |
Note: this is a reply to a now deleted comment.
|
Note: this is a reply to a now deleted comment. Also, fwiw, "I don't understand why a first-class solution is off the table for discussion" is completely innaccurate. We have discussed this extensively. As said early on, it is just out of scope for this issue.
This is very hard to achieve in practice because how do we know what changed between the dead and the live render if there is no live process? We would need to store the data on the server over some period of time and then transfer it to the node that accept the websocket mount, which comes with large implications in terms of application design. You now need to plan for your whole "live state" to be transferred across nodes, it may be more expensive than the database transfer (because the data is structured), and things like resources won't work at all. Perhaps this is easier to achieve by doing WebSockets over HTTP2, as the HTTP2 connection typically sticks to the same server. EDIT: alternatively, we keep the process in another node and route all LiveView requests to another node, but now you are paying a in-cluster round-trip for every single LiveView interaction. Anyway, hijacking an unrelated issue and saying we don't want to discuss something unrelated to the issue is not nice. :) |
That defeats the purpose of a full FCP, and the application that we're creating is more complex than that.I think we are the largest source available Phoenix Live View web application with public-facing, internal, and three-year history of building with LV. Please don't be defensive. I have been trying to keep the discussion around what the average user of LiveView would experience. Not single solutions that only make things better in narrow times.
I discussed why it's better to avoid making a piecemeal solution of assign behavior. Hyrums Law will make changing any of LiveView's behaviors harder to change after developers depend on it. So it will be calcified once you put in new behavior for The top comments in HN echo the same issues that I am here.
There are 19 different mentions of connection/reconnection in the HN threads.
From exerience with correct planning and fallbacks this can be the same behaviors.
Ok. |
Once again, there are two unrelated topics being mixed up. This is about a same-node optimization. Anything multi-node is by definition out of scope. The way this thread has been going is akin to:
Are there general solutions that would solve both distributed and concurrent issues outlined here: yes, there are. But they won't be as good as an optimization that is specific to the concurrent same-node one. In more concrete terms, when we optimize live_navigation, we have complete control over the life-cycle, so we can fully optimize the amount of round trips, latency, and how long the shared data lives. The dead render and reconnections do not have any of these properties! But it would be very silly of us to not leverage these properties for There is no one dismissing the reconnection or dead render issues. They are important. But it is orthogonal. We should leverage concurrency and same node properties when we can. Treating everything as a distributed problem would be a recipe for performance woes and hard to debug issues (since pids, ports, and references cannot be transparently shared across nodes). |
And, fwiw, @elliottneilclark, my previous replies were not direct to you. The comments I have replied to have been removed, so my comments now probably look out of context. |
Btw, here are some examples to drive the point home:
There are more examples where we leverage the fact we control the life-cycle to simplify operations and avoid data copying, such as fastlaning. This issue is, in a nutshell, a similar optimization: it leverages the existing processes to copy data once. Any shared solution would do two copies. If you believe those are single solutions that only make things better in narrow times, I recommend disabling the similar optimizations we have done in the past and assess the performance impact on your application. |
The single-node, more general solution makes the routing of multi-node solutions simply an optimization. It's all about the percentage chance that a subsequent request hits a node with a process already running. If we always start the process after the initial dead view, then in the single node with no delay between HTTP and websocket, it's a 100% chance that assigns don't have to be recomputed. The same goes for push navigation. If we push navigate while using a aging/ttl for old mounted live_views then it's all about the chance that the node hasn't had memory pressure or had to evict. That's why the same solution can help all the jank; it becomes an optimization to get an already running process with filled-in assigns. In a multi-node, it's similar. If the initial HTTP and WebSocket are routed to the same nodes (sticky session), then assigns can be reused. If, in a multi-node, the HTTP and WebSocket don't go to the same node, then we have to fall back and re-compute like the behavior today. You always need this fallback for misrouted requests if your load balancer isn't 100% reliable or if your application nodes can change (Layer 4 and Layer 7 LB's change their quorum frequently in the cloud) |
@elliottneilclark you are confusing two things: single-node is not the same as same-node. My comments above are not assuming it is a single node. It is assuming the fact you have multiple nodes but we know that it will always target the same node (because that's how websockets work). The optimizations we are describing apply 100% of the time. |
We're talking past each other. I do not want to derail this anymore. |
You are saying that you have a local cache that you use it if can, or you recompute it. The odds of hitting the cache is going to When it comes to
The benefit of your solution is that it could be used for operations that can be routed across nodes, such as dead render and reconnects, with the The solution proposed here has the new LiveView communicate with the old one:
If you believe you have a better solution for |
Please note that we're happy to discuss concrete improvements to the dead/live lifecycle in proposals section of the forum :) |
No. I am not saying that. I am saying that the more general optimization solves what this PR would try as well, so rather than single one-off solutions for
Memcopy is several orders of magnitude faster than any remote ecto or HTTP API call. ~500k nanoseconds for for 1mb memcopy. A single round trip packet in a dc is likely around that as well, meaning that a full query is at least 2 orders of magnitude more work. |
So we definitely understand each other, we just don't agree on the trade-offs. :) Your single elegant solution will be quite worse for dead render and worse for live navigation. And I believe the dead render and live navigation are common enough to warrant specialized solutions. This is speaking from experience optimizing stuff like channels fastlaning and pubsub, where avoiding the extra copy showed benefits in both benchmarks and production apps. |
Most of the times the improvements on raw throughput by avoiding data copying were limited, usually between 1-2x, but it consistently lead to meaningfully better performance on the memory department: fewer memory allocations, fewer GC runs, and you avoid unnecessarily sharing large binaries too. |
I deleted my previous comment (as I was also actively participating in the sidetracking of this issue :D) and I have opened #3551 to continue the other discussion. The proposal there should be strictly better than a GenServer as a cache layer, as it does zero or one copy within the same node, and has reduced payloads on the first connected render and reconnects. It is also orthogonal to this issue. It optimizes reconnects and first connected render by adopting LiveViews, but it doesn't apply to |
Considerations
To support sharing assigns between live navigations, there are multiple ways we could implement this:
We could have the transport own an ETS table or provide an API to assign values to it. Then LV could store assigns that use
assign_new
into this storage. Drawbacks: assign_new would have a messaging overhead, sending (potentially big) data to another process, even if it is not needed again. If an assign is updated later, we'd either need to sync those updates (keeping track which assigned keys used assign_new) or live with potentially old data.This is the solution that this PR implements (together with a PR in Phoenix). This solves all the drawbacks mentioned above: it only copies the data that is actually used for the optimization and it also always gets the most recent data from the old LV's assigns. The only new drawback that I can come up with is that the old LV needs to be in memory just a little bit longer than usual, but I think that's fine.
Changes
This PR relies on a yet to be finalized feature in Phoenix Channels (phoenixframework/phoenix#5959) to perform a custom handover between channel rejoins.
It changes the LV code to not leave the old channel before rejoining and also instructs Phoenix to not kill the old process before starting the new channel process. After the new channel is joined, the old one is killed. Any
assign_new
calls in the LV mount will try to fetch assigns from the old LV.This is a backwards compatible optimization, so if a version of Phoenix is used that does not support handover, it just falls back to calling the function supplied to assign_new as usual.
Closes #3357.
cc @josevalim