Scaling Farcaster #163
9 comments · 17 replies
-
Wouldn't using compression cut overall global state growth to roughly 3.75 TB per year? Another thought: across the three different types of sync, match message txs against each other; txs that are left unmatched can be assumed to be the missing messages, and the table can be updated with those. You could also potentially take SEI's DB and modify it to run a dual event listener. Is it possible to have dual event processors for shuttle to process twice the number of messages, or would that cause an issue in the write path to the database?
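On the last point: one way to make doubled-up event processors safe is to keep the database write idempotent, so processing the same message twice is a no-op. A minimal sketch with the `pg` client; the table and columns here are hypothetical, not shuttle's actual schema:

```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Idempotent write: if two event processors deliver the same message,
// the second insert becomes a no-op instead of a duplicate row or an error.
export async function storeMessage(hash: Buffer, fid: number, body: Buffer): Promise<void> {
  await pool.query(
    `INSERT INTO messages (hash, fid, body)
     VALUES ($1, $2, $3)
     ON CONFLICT (hash) DO NOTHING`,
    [hash, fid, body],
  );
}
```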
-
Why does shuttle need a reconciler? Could this be solved by having Hubble create a file-based state-change log instead of listening to
-
Bundles may be useful here too.
-
Can you get stats on how often data is getting accessed historically? Based on that you can probably prioritize resources. For example, OpenSearch has something called UltraWarm storage that takes slightly longer to query, which is fine because it's infrequently accessed. Maybe look into this, since you're already storing snapshots on object storage (S3).

In my prior experience, we found that our cache hit ratio was wildly skewed towards recent messages, so we only needed to make recent data readily available, and we could take longer on fetches of historical data by going back to the email server and optimizing for search.

Hub runners will likely have different priorities in the future. Some might only need the last 30 days of casts and could optimize their hubs for that. Other hubs, the main hubs, will likely need to store all historical data and be able to serve it, probably running on much cheaper SSDs or even object storage (S3). The metric here is ROI: why does this storage need to be in memory vs NVMe vs SSD vs HDD vs S3? What is the cost per message saved? These are all trade-offs to consider, but at scale I guarantee there are fewer high-value messages and a lot of low-value ones. It's probably easy to pull a few well-chosen samples to get a representative feel for the above.

Also, if you're likely to use shuttle for reads in the future, why does hub read performance need to be so high? It's already so high that the bottleneck is networking anyway. Hubs are a way to get the data out to all the various downstream databases; shouldn't they be optimized more around adapter patterns for that? Why are hubs also an API? Shuttle should eventually have several adapters that store data in different databases with different trade-offs, where different fields get indexed differently for different use cases. Hubs should provide the data streams that help services like shuttle move data into more API-friendly services. For example, hubs do not have the ability to replay messages from a certain point in time when shuttle fails partway through a gRPC stream.

Perhaps the right metrics are:

- Cost per message written to a hub
- Write throughput to a hub
- Value to users for the cost and performance of writes to a hub

The value to users is unclear right now, so I'm guessing you're optimizing only on the cost and performance dimensions, which makes sense. However, there are some pretty obvious ways to see what's high or low value to people running hubs, how hub data is getting accessed, and what's over-optimized and costing a lot. I'll write this out in a more structured manner in the future, but that's my two cents at first glance.
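One concrete way to test the "reads skew recent" hypothesis on Farcaster data would be a query like the sketch below, assuming a hypothetical `message_reads` access log alongside a replicated `messages` table (both names and columns are illustrative):

```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// What fraction of reads would a "hot last 30 days" tier actually serve?
async function readRecencySplit(): Promise<void> {
  const { rows } = await pool.query(`
    SELECT (r.read_at - m.created_at) < interval '30 days' AS is_recent,
           count(*) AS reads
    FROM message_reads r
    JOIN messages m ON m.hash = r.message_hash
    GROUP BY is_recent
  `);
  console.table(rows); // if the skew holds, is_recent = true should dominate
}

readRecencySplit().catch(console.error);
```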
-
Great doc. Thanks for the detail! Is it worth mentioning any foreseeable risks (+ mitigations) within each of your strategies?
-
This is way too generous. I would change the new-account storage limit to 500 casts, and then 10x the cost per unit per month, i.e. $4 a month for an extra 2.5k casts. People with 5k casts who want to keep those casts around can surely pay.

Most people will be totally fine with starting to lose their old casts after exceeding their limit. 96% of social media's value is recent content. Old, good content gets recreated organically anyway, so most people really won't care about losing it; they can even save the good content they know they'll want again. This isn't YouTube, where people often revisit very old content.

You could also make it easy for people to export their content to compensate. With export, those who want to revive some casts later could, i.e. state expiry. You could even let users mark certain casts as "stored" and exempt them from automatic pruning: the oldest non-stored casts get pruned, letting users keep around only the ~2% of casts they really care about.
-
Big up on increasing how much storage costs. I also think that Solana Turbine-style p2p should be considered to reduce message duplication.
-
I'm curious if you've benchmarked different databases? When I've done very perf-sensitive work in the past, LMDB was dramatically faster than other embedded DBs like RocksDB and SQLite.
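For comparisons like that, a shared harness over a minimal key-value interface keeps the engines on equal footing; the `KvStore` interface below is hypothetical, and each engine (LMDB, RocksDB, SQLite) would need its own small adapter:

```typescript
// Minimal interface each candidate engine gets wrapped in.
interface KvStore {
  put(key: Buffer, value: Buffer): Promise<void>;
  get(key: Buffer): Promise<Buffer | undefined>;
}

// Write N ~300-byte values (roughly one Farcaster message each) and report
// sustained write throughput in messages per second.
export async function benchWrites(store: KvStore, n: number): Promise<number> {
  const value = Buffer.alloc(300, 0xab);
  const start = Date.now();
  for (let i = 0; i < n; i++) {
    await store.put(Buffer.from(`msg:${i}`), value);
  }
  const seconds = (Date.now() - start) / 1000;
  return n / seconds;
}
```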
-
Closing - we've made a bunch of progress on this, and are now moving towards a new architecture in #193
-
Problem
Over the last three months, Farcaster grew 16x from 5,000 to 80,000 daily users.
Stress on the network has increased proportionally. We’re unlikely to keep up with another 10x in growth. Each day, our nodes (hubs) are seeing:
Goals
Farcaster should handle 1 million daily users and make sure that:
It’s important that we achieve this without major architectural changes if feasible, and we want to hold onto the following invariants:
If you're new to Farcaster, check out the Hub overview video for more context.
Proposal
There are three areas we need to examine: state management, which validates new messages and updates a hub's state; sync, which converges all hubs to the same global state; and replication, which copies data to an external data store.
State
When a user says “Hello World!”, a new cast message is created and added to the global network state through a hub. The message's validity is checked and then it is appended to the hub's local disk and gossiped out to the broader network.
A user must pay yearly rent (storage fees) to keep their state live on the network. Renting a unit of storage lets a user store a specific number of messages of each type. The average size of a message is ~ 300 bytes.
Problems
Global state is growing daily by 1.67 GiB and 4.39M messages. At 80k DAU, that's ~21 KiB and 55 messages of growth per user per day. At 1M users, we’d expect to see 21 GiB of state growth per day or 7.5 TB of state growth per year.
Messages are added at an average rate of 50/s. At 1M users, we’d expect 1000/s with spikes up to 10,000/s. We can handle this today but only by using NVMe disks because our system is very latency sensitive. This needs to be rewritten because NVMe disks have limited total capacity relative to general SSDs.
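A quick back-of-the-envelope check of the projections in the two paragraphs above, using only the figures already quoted (a sketch, not a measurement):

```typescript
// Observed today (figures from this section).
const dau = 80_000;
const stateGrowthGiBPerDay = 1.67;
const messagesPerDay = 4_390_000;

// Per-user growth: ~21 KiB and ~55 messages per day.
const kibPerUserPerDay = (stateGrowthGiBPerDay * 1024 * 1024) / dau; // ≈ 21.9
const msgsPerUserPerDay = messagesPerDay / dau;                      // ≈ 54.9

// Projection at 1M DAU, assuming per-user behavior stays constant.
const targetDau = 1_000_000;
const gibPerDay = (kibPerUserPerDay * targetDau) / (1024 * 1024); // ≈ 21 GiB/day
const tibPerYear = (gibPerDay * 365) / 1024;                      // ≈ 7.5 TiB/year
const avgMsgsPerSec = (msgsPerUserPerDay * targetDau) / 86_400;   // ≈ 635/s average

console.log({ kibPerUserPerDay, msgsPerUserPerDay, gibPerDay, tibPerYear, avgMsgsPerSec });
```

The roughly 635 msg/s average is consistent with the ~1000/s planning figure above once growth headroom and uneven traffic are factored in.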
Storage problems aren't hair-on-fire urgent yet, but if growth continues at this pace we'll likely need to deal with them in the next 6-12 months.
Metrics
The following data should be available on public dashboards:
We should also periodically run the following benchmarks:
Proposed Changes
Investigate latency sensitivity: is NVMe truly a requirement? Run the benchmarks on our ideal system and build the dashboards.
Hub hardware limits: increase the minimum storage requirement for starting a new hub to some dynamic multiple of recent storage growth, so that new hubs don't fill up and shut down. As long as we stay under $1000/month we should be fine.
Compression: a message is roughly 300 bytes, but we use ~400 bytes to store it on disk. Using compression or changing indices might reduce this by 50% (to ~200 bytes).
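As a rough illustration of what compression could buy, here is a sketch using Node's built-in zlib on synthetic messages. The message shape is made up (real messages are protobuf-encoded), and most real savings would likely come from block or dictionary compression inside RocksDB rather than compressing messages one at a time, since tiny inputs compress poorly on their own:

```typescript
import { brotliCompressSync, gzipSync } from "node:zlib";

// A stand-in roughly the size of a real ~300-byte cast message.
function makeMessage(i: number): Buffer {
  return Buffer.from(JSON.stringify({
    fid: 10_000 + i,
    type: "MESSAGE_TYPE_CAST_ADD",
    timestamp: 98_765_432 + i,
    text: `Hello World! This is cast number ${i} with some ordinary everyday text in it.`,
    hash: Buffer.from(String(i).padStart(20, "0")).toString("hex"),
  }));
}

const single = makeMessage(0);
// Compressing messages in blocks amortizes shared structure across entries,
// which is where most of the savings would come from.
const block = Buffer.concat(Array.from({ length: 1000 }, (_, i) => makeMessage(i)));

const gzipSingle = gzipSync(single).length / single.length;
const gzipBlock = gzipSync(block).length / block.length;
const brSingle = brotliCompressSync(single).length / single.length;
const brBlock = brotliCompressSync(block).length / block.length;

console.log(`gzip:   single ${(gzipSingle * 100).toFixed(0)}%, 1000-message block ${(gzipBlock * 100).toFixed(0)}%`);
console.log(`brotli: single ${(brSingle * 100).toFixed(0)}%, 1000-message block ${(brBlock * 100).toFixed(0)}%`);
```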
Reducing limits: making the storage limits for each unit 50% smaller would probably cause a 3-10% decrease in total state growth on the network.
Raising fees: increasing storage costs would slow the rate at which new users are added, which would give us more headroom to grow.
Sync
A new hub will do a snapshot sync to download messages up to the last day, and gossip sync to stream all messages going forward. It will also do a "diff sync" with another hub to download messages from the current day, which were not in the snapshot. This diff sync is also re-run periodically to catch messages that were dropped due to lossy gossip or other downtime.
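Roughly, the bootstrap sequence looks like the sketch below. The function names are hypothetical stand-ins for Hubble's sync internals, not its actual API:

```typescript
// Hypothetical dependencies standing in for Hubble's sync internals.
interface SyncDeps {
  downloadSnapshot(upTo: Date): Promise<void>;               // snapshot sync
  diffSyncWithPeer(peer: string): Promise<number>;           // returns messages merged
  subscribeToGossip(onMessage: (msg: Buffer) => void): void; // gossip sync
}

export async function bootstrapHub(deps: SyncDeps, peer: string): Promise<void> {
  // 1. Snapshot sync: bulk-load state up to roughly the last day.
  const yesterday = new Date(Date.now() - 24 * 60 * 60 * 1000);
  await deps.downloadSnapshot(yesterday);

  // 2. Gossip sync: start streaming immediately so nothing new is missed
  //    while the remaining gap is being filled.
  deps.subscribeToGossip((_msg) => {
    // validate + merge into local state
  });

  // 3. Diff sync: fill the gap between the snapshot and now, then re-run
  //    periodically to recover messages dropped by lossy gossip or downtime.
  await deps.diffSyncWithPeer(peer);
  setInterval(() => void deps.diffSyncWithPeer(peer), 10 * 60 * 1000);
}
```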
Sync can be thought of as a three-phase process:
Problems
Gossip sync consumes ~700 KiB/s on hubs while state is growing by ~20 KiB/s. That’s a 35x overhead, when it should be closer to 10x.
Hubs are exhibiting odd sync behavior on the margins - some hubs have more messages than expected while others are falling behind perpetually. The root cause is still unclear.
Diff sync can only handle 70-100 msg/s at peak, so if a hub goes offline for even a short while, it will never catch up to the current state.
Hubs sometimes run into rate limits on blockchain nodes, likely due to using a free plan. This breaks sync silently and they diverge from the rest of the network until they are re-synced.
Metrics
The following data should be available on dashboards:
The following data should be periodically benchmarked:
Proposed Changes
Replication
When hubs receive messages, they are stored locally in RocksDB. Apps need to transfer this data to a higher-level store like Postgres to be able to run queries on it efficiently. Farcaster has a library called shuttle which helps copy messages into a database table and keep it in sync.
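At its core this is an event-stream consumer plus an idempotent writer with a resumable cursor. A minimal sketch under assumed shapes (the `HubEventStream` interface, table names, and columns below are illustrative, not shuttle's real API):

```typescript
import { Pool } from "pg";

// Illustrative event shape and stream; the real shuttle package wraps the
// hub's gRPC subscribe API, but the exact interfaces differ.
interface HubEvent { id: number; hash: Buffer; fid: number; body: Buffer; }
interface HubEventStream { subscribe(fromId: number): AsyncIterable<HubEvent>; }

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function replicate(stream: HubEventStream): Promise<void> {
  // Resume from the last committed event so a crash or redeploy does not
  // force a full replay (or a reconciliation pass) every time.
  const { rows } = await pool.query("SELECT last_event_id FROM replication_cursor");
  const fromId: number = rows[0]?.last_event_id ?? 0;

  for await (const event of stream.subscribe(fromId)) {
    // Idempotent write keyed on the message hash, then advance the cursor.
    await pool.query(
      `INSERT INTO messages (hash, fid, body) VALUES ($1, $2, $3)
       ON CONFLICT (hash) DO NOTHING`,
      [event.hash, event.fid, event.body],
    );
    await pool.query("UPDATE replication_cursor SET last_event_id = $1", [event.id]);
  }
}
```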
Problems
Metrics
System Requirements
A hub should be able to run on commodity cloud hardware for < $1000/month while successfully replicating to a Postgres database. For today’s hubs, we recommend provisioning at least the following:
This costs a little over $150/month on [latitude.sh](http://latitude.sh) today. Larger providers like AWS and GCP tend to be a bit more expensive.
We expect our requirements at 10x to be closer to: