A few more questions/problems #2644

DerEwige · 2023-04-21T10:10:20Z

DerEwige
Apr 21, 2023

Regarding the database:
Growing payments.sent problem:
My payments.sent table is growing really fast.

To prevent excessive growth, I manually purge entries from this table on a regular basis.

The reason for this large growth is the many rebalance attempts I make each day (100k+).
Currently I generate one invoice per rebalance attempt.
I’m looking into a way to reuse the same invoice multiple times without creating conflicts in my workflow.

In the meantime I was looking into using eclair functions to purge payments.sent instead of going through an external database connection.

When looking at the file PgPaymentsDb.scala I did not see any function to remove entries from payments.sent, only from payments.received.
Did I miss anything?

Growing audit.channel_errors:

My audit.channel_errors table has grown a lot (I believe it has been growing faster since 0.8 was released).

My first question: Can I purge entries that I no longer need from this table?

Now to my 2nd question. What might be the reason for my many “ExpiryTooBig” errors?
I believe this has to do with the way findroutebetweennodes works.
Every time I want to create a circular route for a rebalance, I have to use findroutebetweennodes and then add one hop to close the loop
(as per this document: https://github.com/ACINQ/eclair/blob/master/docs/CircularRebalancing.md).

There is currently no option to define a max expiry when using findroutebetweennodes, so adding one hop might exceed the maximum.
I also suspect that these settings might add to the problem:

recipient-final-expiry {
  min-delta = 150 // minimum value to add to the current block height
  max-delta = 350 // maximum value to add to the current block height
}

Maybe this is worth investigating?
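
To make the suspicion concrete, here is a minimal sketch of the arithmetic in Scala. It assumes that the total expiry delta of a payment is the sum of every hop's cltv_expiry_delta plus the delta chosen for the final recipient, and that the payment fails once this total exceeds some maximum. The names (`ExpiryBudget`, `maxExpiryDelta`) and every number except the 150/350 bounds quoted above are hypothetical; none of them are taken from eclair's code.

```scala
object ExpiryBudget {
  /** Blocks added to the current height: every hop's cltv_expiry_delta
    * plus the delta used for the final recipient. */
  def totalExpiryDelta(hopDeltas: Seq[Int], finalDelta: Int): Int =
    hopDeltas.sum + finalDelta

  /** True when the total stays within the (hypothetical) maximum. */
  def fitsWithin(hopDeltas: Seq[Int], finalDelta: Int, maxExpiryDelta: Int): Boolean =
    totalExpiryDelta(hopDeltas, finalDelta) <= maxExpiryDelta

  def main(args: Array[String]): Unit = {
    val maxExpiryDelta = 2016                                  // hypothetical limit
    val foundRoute = Seq(144, 144, 288, 144, 576, 288, 144)    // hypothetical route, sums to 1728
    val closedLoop = foundRoute :+ 144                         // one extra hop to close the loop

    // The route as found still fits with the minimum final delta: 1728 + 150 = 1878
    println(fitsWithin(foundRoute, 150, maxExpiryDelta))
    // After closing the loop it no longer fits: 1872 + 150 = 2022
    println(fitsWithin(closedLoop, 150, maxExpiryDelta))
    // With the maximum final delta it overshoots further: 1872 + 350 = 2222
    println(totalExpiryDelta(closedLoop, 350))
  }
}
```

Under these assumptions, a route that is valid when it is returned can become invalid once the closing hop and a randomly chosen final delta are added.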

akka.tell to Router.scala (FSM) with default buffer size = potential disaster?

I encountered some weird behaviour on my node lately that mainly manifested itself in 2 ways:
1.) The CPU load increased massively over time
2.) Channels that had recently failed and should have been excluded were still being used, and were only excluded after several seconds

After a lot of digging I believe I found the issue.
Router.scala implements an akka FSM that handles incoming events in a single thread.
Other classes use akka.tell to send those events to the Router.
Tell is “fire and forget”, meaning the sender does not know when, or even if, the message was processed.
On the Router there is a buffer that receives those events (since I have not found any config for it, I believe it is a default buffer with a size of 1000 events).

I have been sending 10’000s of events to the router in bursts every few hours, so I think I overloaded the buffer, which led to delays or even drops of events (not only my events but also eclair-internal events).

After this analysis I’ve changed how the events are created and sent out.
I reduced the number of events sent (by about 80%) per burst and added a small delay between each event.
This seems to have solved my issues.

But it made me wonder whether this might be a bottleneck by design:
having the Router handle so many different kinds of events in a single-threaded FSM?

thomash-acinq · 2023-04-21T12:35:35Z

thomash-acinq
Apr 21, 2023
Collaborator

Growing payments.sent problem:
We don't have any function to remove entries from payments.sent because we've never felt the need for it. You could add one.

Growing audit.channel_errors:
Anything in audit.* tables is there to help the node operator but is not used by eclair itself. So you can safely remove entries there.

ExpiryTooBig:
The recipient-final-expiry is obviously not helping, you could set it to 0 (at the cost of some privacy).

1 reply

DerEwige Apr 21, 2023
Author

Growing audit.channel_errors: Anything in audit.* tables is there to help the node operator but is not used by eclair itself. So you can safely remove entries there.

Thanks. I've purged everything older than 2 weeks
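
A purge like this could be sketched over plain JDBC as follows. This is only an illustration with hypothetical names (`PurgeSketch`, `purgeOlderThan`); in particular, the name of the timestamp column is passed in by the caller because the table's actual schema is not shown in this thread and should be checked first.

```scala
import java.sql.{Connection, Timestamp}
import java.time.Instant
import java.time.temporal.ChronoUnit

object PurgeSketch {
  /** Builds the DELETE statement. Table and column names are spliced into
    * the SQL, so they must be trusted constants, never user input. */
  def deleteSql(table: String, timestampColumn: String): String =
    s"DELETE FROM $table WHERE $timestampColumn < ?"

  /** The instant `days` days before `now`. */
  def cutoff(now: Instant, days: Long): Instant =
    now.minus(days, ChronoUnit.DAYS)

  /** Deletes rows older than `olderThan` and returns how many were removed. */
  def purgeOlderThan(conn: Connection, table: String, timestampColumn: String, olderThan: Instant): Int = {
    val statement = conn.prepareStatement(deleteSql(table, timestampColumn))
    try {
      statement.setTimestamp(1, Timestamp.from(olderThan))
      statement.executeUpdate()
    } finally statement.close()
  }
}
```

With an open connection `conn`, a two-week purge would then be `PurgeSketch.purgeOlderThan(conn, "audit.channel_errors", "timestamp", PurgeSketch.cutoff(Instant.now(), 14))`, where the column name `timestamp` is an assumption.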

ExpiryTooBig: The recipient-final-expiry is obviously not helping, you could set it to 0 (at the cost of some privacy).

I've deactivated it for now to see how much the situation improves.

Growing payments.sent problem: We don't have any function to remove entries from payments.sent because we've never felt the need for it. You could add one.

I will look into this

rorp · 2023-04-21T15:15:37Z

rorp
Apr 21, 2023

AFAIR the router can be scaled out: https://github.com/ACINQ/eclair/blob/master/docs/Cluster.md

Also you can play with mailbox type/size:
https://doc.akka.io/docs/akka/current/mailboxes.html
https://doc.akka.io/docs/akka/current/general/configuration-reference.html

3 replies

DerEwige Apr 21, 2023
Author

AFAIR the router can be scaled out: https://github.com/ACINQ/eclair/blob/master/docs/Cluster.md

I believe the router class is part of the back and not the front.
So it could not be scaled, I believe

Also you can play with mailbox type/size: https://doc.akka.io/docs/akka/current/mailboxes.html https://doc.akka.io/docs/akka/current/general/configuration-reference.html

The 2nd link is where I found the 1000 default buffer size.
But I am not sure, where in the code I would need to change these values

rorp Apr 21, 2023

AFAIR the router can be scaled out: https://github.com/ACINQ/eclair/blob/master/docs/Cluster.md
I believe the router class is part of the back and not the front. So it could not be scaled, I believe

Yeah, you're right. I looked at the code and it turns out that the front routers simply forward network updates to the back router...

rorp Apr 21, 2023

The 2nd link is where I found the 1000 default buffer size. But I am not sure, where in the code I would need to change these values

I think that's just a configuration change: akka.actor.default-mailbox.mailbox-type and akka.actor.default-mailbox.mailbox-capacity.
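
As a rough illustration, such a change might look like the fragment below. It is a sketch under assumptions: as far as I understand Akka's reference configuration, the default mailbox type is unbounded and mailbox-capacity only takes effect with a bounded mailbox type, so the type has to be switched as well. The capacity value is hypothetical, and the setting applies to every actor that uses the default mailbox, not only to the Router.

```
akka.actor.default-mailbox {
  // assumption: a bounded type is needed for mailbox-capacity to have any effect
  mailbox-type = "akka.dispatch.BoundedMailbox"
  mailbox-capacity = 10000 // hypothetical value
  mailbox-push-timeout-time = 10s // how long a sender may wait when the mailbox is full
}
```

With a bounded mailbox, messages that cannot be enqueued in time are dropped rather than queued, so this trades memory growth for possible message loss and should be tested carefully.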

rorp · 2023-04-21T19:55:48Z

rorp
Apr 21, 2023

I think the back router can be scaled pretty easily.

The router actor can spin up a bunch of worker actors. The only thing a worker actor can do is call RouteCalculation.handleRouteRequest(). The router forwards route requests to the workers and the workers process the requests in separate threads.

Something like this:

                                                                        +----------+
                                                                        |          |
                                                                        | Worker 1 |
                                                                        |          |
                                                                        +----------+

                        +--------+                                      +----------+                       +--------+
 Event(routeRequest, d) |        | WorkerEvent(routeRequest, d, sender) |          | RouteResponse(routes) |        |
----------------------->| Router |------------------------------------->| Worker 2 |---------------------->| sender |
                        |        |                                      |          |                       |        |
                        +--------+                                      +----------+                       +--------+

                                                                        ...

                                                                        +----------+
                                                                        |          |
                                                                        | Worker n |
                                                                        |          |
                                                                        +----------+

The workers can be created on startup or on demand.

On demand workers can terminate right after route calculation, so that they will always receive a fresh network graph. The router can keep track of the number of workers in flight and return a failure if the max limit of workers has been reached.

Or if slight discrepancies in network data in the workers' mailboxes are allowed, the router can create a fixed number of workers and forward route requests to them according to some strategy (randomly, round robin, based on mailbox size, based on load, etc).
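
The in-flight limit for on-demand workers can be sketched without actors. The example below uses plain Scala Futures instead of Akka workers, purely to show the bookkeeping; `BoundedDispatcher` and `submit` are hypothetical names, not eclair or Akka APIs.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{ExecutionContext, Future}

/** Runs tasks in parallel but rejects new ones once `maxInFlight` are running. */
final class BoundedDispatcher(maxInFlight: Int)(implicit ec: ExecutionContext) {
  private val inFlight = new AtomicInteger(0)

  /** Returns Left(reason) when the limit is reached, otherwise Right(future result). */
  def submit[A](task: () => A): Either[String, Future[A]] = {
    if (inFlight.incrementAndGet() > maxInFlight) {
      // Undo the reservation: this task is rejected and never starts.
      inFlight.decrementAndGet()
      Left(s"too many route calculations in flight (max $maxInFlight)")
    } else {
      val result = Future(task())
      // Free the slot whether the task succeeded or failed.
      result.onComplete(_ => inFlight.decrementAndGet())
      Right(result)
    }
  }

  /** Number of tasks currently holding a slot. */
  def currentInFlight: Int = inFlight.get()
}
```

In an actor-based version, the router would play the role of `submit`: it would either spawn a worker and forward the request, or reply to the sender with a failure right away.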

0 replies

rorp · 2023-04-24T20:32:34Z

rorp
Apr 24, 2023

I created a crude prototype in this form:

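    case Event(r: RouteRequest, d) =>
      // Capture the sender now: context.sender() must not be called from inside the Future.
      val sender = context.sender()
      // Run the route calculation on a thread pool instead of the actor's own thread
      // (requires an implicit ExecutionContext in scope).
      // Stock version: RouteCalculation.handleRouteRequest(d, nodeParams.currentBlockHeight, r, sender)
      Future { RouteCalculation.handleRouteRequest(d, nodeParams.currentBlockHeight, r, sender) }
      stay() using d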

It works ~2.5x faster than the stock router on my laptop. For some reason it uses only 4 cores out of 8. Anyway, 4 is a bit greater than 2.5. YMMV tho...

2 replies

DerEwige Apr 26, 2023
Author

So basically you went from this:

single threaded inbox, that receives and processes the events

To this:

single threaded inbox, that receives the event and submits it to another thread for processing

A few things to add:
I believe there is a tick defined that determines how fast the inbox grabs new events.
It might not use all cores, depending on the parallelism configured
(I don’t know how it works in Scala, but in Java the default parallelism is equal to the physical cores of the system).
Are we sure that all the different “handle(event)” functions for the different events are thread-safe?
(Something that was not important before.)

rorp Apr 27, 2023

single threaded inbox, that receives the event and submits it to another thread for processing

In fact it submits a task to a thread pool's queue.

Another difference is the change in semantics. The current version of Router actor returns route responses to the sender actor in order of incoming route requests. Here the order is undefined, since the requests can be processed in parallel and not one by one.
Luckily, it seems that the Akka HTTP implementation of the findroute* RPC calls creates a disposable actor per HTTP request, and that actor sends exactly one route request to the Router. So order is not an issue here.
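
For callers that do depend on ordering, one generic way to compute in parallel while still replying in request order is to chain the deliveries. The sketch below uses plain Scala Futures and hypothetical names (`OrderedReplies`, `processInOrder`); it is not taken from eclair and does not describe how any particular PR solves this.

```scala
import scala.concurrent.{ExecutionContext, Future}

object OrderedReplies {
  /** Starts `compute` for every request at once, but invokes `deliver`
    * strictly in the order of `requests`. If one computation fails, the
    * returned future fails and later results are not delivered. */
  def processInOrder[A, B](requests: Seq[A])(compute: A => B)(deliver: B => Unit)
                          (implicit ec: ExecutionContext): Future[Unit] = {
    // All computations are started eagerly, so they can run in parallel.
    val started: Vector[Future[B]] = requests.toVector.map(r => Future(compute(r)))
    // Delivery i is chained after delivery i-1, which fixes the output order.
    started.foldLeft(Future.successful(())) { (previous, current) =>
      previous.flatMap(_ => current.map(deliver))
    }
  }
}
```

The trade-off is head-of-line blocking: a slow calculation delays the delivery of every result queued behind it, even if those results are already computed.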

rorp · 2023-05-08T16:49:25Z

rorp
May 8, 2023

@DerEwige can you try this PR #2651? It runs route calculations in parallel, but preserves the Router's semantics.

2 replies

DerEwige Jun 22, 2023
Author

@rorp
Sorry for taking so long.
But there was too big a difference between 0.8.0 and your PR to test it under real conditions.

I have applied your PR to my 0.9.0 node and am running it on the live network now.

DerEwige Jun 27, 2023
Author

@rorp
I've been running the patch for about a week now.
It did not cause any issue.

On my live node, however, I was not able to max out my CPU usage under real live load.
Still, the node feels more responsive during high rebalance load.
