A few more questions/problems #2644
Replies: 5 comments 8 replies
-
-
AFAIR the router can be scaled out: https://github.com/ACINQ/eclair/blob/master/docs/Cluster.md
Also you can play with mailbox type/size:
-
I think the back router can be scaled pretty easily. The router actor can spin up a bunch of worker actors. The only thing a worker actor can do is call Something like this:
The workers can be created on startup or on demand. On-demand workers can terminate right after the route calculation, so they always receive a fresh network graph. The router can keep track of the number of workers in flight and return a failure once the maximum number of workers has been reached. Or, if slight discrepancies in the network data held by the workers are acceptable, the router can create a fixed number of workers and forward route requests to them according to some strategy (randomly, round robin, based on mailbox size, based on load, etc).
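To make the "workers in flight" idea concrete, here is a minimal sketch using plain Scala futures instead of Akka actors. `BoundedWorkers`, `submit` and `maxInFlight` are hypothetical names made up for illustration, not eclair code; the point is only the bookkeeping: count running calculations and fail fast once the limit is hit.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical stand-in for the router's worker bookkeeping (not eclair code):
// at most `maxInFlight` calculations run at once, extra requests fail fast.
final class BoundedWorkers(maxInFlight: Int)(implicit ec: ExecutionContext) {
  private val inFlight = new AtomicInteger(0)

  def submit[A](calculation: => A): Future[A] = {
    if (inFlight.incrementAndGet() > maxInFlight) {
      // Over the limit: undo the reservation and reject the request.
      inFlight.decrementAndGet()
      Future.failed(new IllegalStateException("too many route calculations in flight"))
    } else {
      // Run the calculation off the caller's thread and release the slot
      // before the returned future completes.
      Future(calculation).andThen { case _ => inFlight.decrementAndGet() }
    }
  }

  // Number of calculations currently reserved or running.
  def current: Int = inFlight.get()
}
```

With an actor-based router the counter would live in the router's state instead of an `AtomicInteger`, but the accept/reject decision would look the same.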
-
I created a crude prototype in this form:
case Event(r: RouteRequest, d) =>
  // capture the sender now: context.sender() must not be called from inside the Future
  val sender = context.sender()
  Future { RouteCalculation.handleRouteRequest(d, nodeParams.currentBlockHeight, r, sender) }
  // original synchronous call:
  // RouteCalculation.handleRouteRequest(d, nodeParams.currentBlockHeight, r, sender)
  stay() using d
It works ~2.5x faster than the stock router on my laptop. For some reason it uses only 4 cores out of 8. Anyway, 4 is a bit greater than 2.5. YMMV tho...
-
@DerEwige can you try this PR #2651? It runs route calculations in parallel, but preserves the |
-
Regarding the database:
Growing payments.sent problem:
I have the issue that my payments.sent table is growing really fast.
To prevent excessive growth of this table, I purge entries from it manually on a regular basis.
The reason for this growth is the many rebalance attempts I make each day (100k+).
Currently I generate one invoice per rebalance attempt.
I’m currently looking into a way to reuse the same invoice multiple times without creating conflicts in my workflow.
In the meantime, I was looking into using eclair functions to purge payments.sent instead of going through an external database connection.
When looking at the file PgPaymentsDb.scala, I did not see any function to remove entries from payments.sent, only from payments.received.
Did I miss anything?
Growing audit.channel_errors:
My audit.channel_errors table has grown a lot (I believe it has increased more since 0.8 was released).
My first question: can I purge entries from this table that I no longer need?
Now to my second question: what might be the reason for my many “ExpiryTooBig” errors?
I believe this has to do with the way findroutebetweennodes works.
Every time I want to create a circular route for rebalancing, I have to use findroutebetweennodes and then add one hop to close the loop
(as per this document: https://github.com/ACINQ/eclair/blob/master/docs/CircularRebalancing.md).
There is currently no option to define a max expiry when using findroutebetweennodes, so adding one hop might exceed the maximum.
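To illustrate what I mean, here is a rough sketch of the check I would have to do myself before sending. `Hop`, `finalExpiryDelta` and `maxExpiryDelta` are hypothetical, simplified names and not eclair's actual types or settings; the numbers in the usage note are made up as well.

```scala
object ExpiryCheck {
  // Hypothetical, simplified hop: only the CLTV delta matters for this check.
  final case class Hop(nodeId: String, cltvExpiryDelta: Int)

  // Total number of blocks the route adds on top of the current block height.
  def totalExpiryDelta(route: Seq[Hop], finalExpiryDelta: Int): Int =
    route.map(_.cltvExpiryDelta).sum + finalExpiryDelta

  // True if the route, including the hop appended to close the loop,
  // stays within the (assumed) maximum expiry delta.
  def fitsMaxExpiry(route: Seq[Hop], closingHop: Hop, finalExpiryDelta: Int, maxExpiryDelta: Int): Boolean =
    totalExpiryDelta(route :+ closingHop, finalExpiryDelta) <= maxExpiryDelta
}
```

For example, a route with two hops of 500 blocks each is fine against a limit of 1100 with a final delta of 30, but appending a closing hop of 144 blocks brings the total to 1174 and would be rejected.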
I also suspect that these settings might add to the problem:
Maybe this is worth investigating?
akka.tell to Router.scala (FSM) with default buffer size = potential disaster?
I encountered some weird behaviour on my node lately that mainly manifested itself in 2 ways:
1.) The CPU load was massively increasing over time
2.) Recently failed channels that should have been excluded were still being used, and were only excluded after several seconds
After a lot of digging I believe I found the issue.
Router.scala implements an akka FSM that handles incoming events in a single thread.
Other classes use akka.tell to send those events to the Router.
Tell is “fire and forget”, meaning the sender does not know when, or even if, the message was processed.
On the Router there is a buffer to receive those events (as I have not found any config for it, I believe it is a default buffer with a size of 1000 events).
I have been sending 10’000s of events to the router in bursts every few hours. I think I overloaded the buffer, which led to delays or even drops of events (not only my events but also eclair-internal events).
After this analysis I changed how the events are created and sent out.
I reduced the number of events sent (by about 80%) per burst and added a small delay between each event.
This seems to have solved my issues.
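For reference, the throttling I ended up with looks roughly like the sketch below. This is a simplified stand-in, not my actual code: `ThrottledSender`, `sendBurst`, `maxPerBurst` and `delay` are hypothetical names, and `send` stands for whatever call actually delivers the event to the node.

```scala
import scala.concurrent.duration._

object ThrottledSender {
  // Sends at most `maxPerBurst` events, sleeping `delay` between consecutive
  // events, and returns the events that were left for a later burst.
  def sendBurst[A](events: Seq[A], maxPerBurst: Int, delay: FiniteDuration)(send: A => Unit): Seq[A] = {
    val (now, later) = events.splitAt(maxPerBurst)
    now.zipWithIndex.foreach { case (event, i) =>
      // No sleep before the first event, a small pause before every other one.
      if (i > 0) Thread.sleep(delay.toMillis)
      send(event)
    }
    later
  }
}
```

The caller keeps the returned remainder and feeds it into the next burst, so nothing is lost, it just arrives at the router more slowly.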
But it made me wonder whether this might be a bottleneck by design:
having the Router handle so many different kinds of events in a single-threaded FSM?