Scaling Farcaster #163
9 comments · 17 replies
-
Wouldn't using compression cut overall global state growth to roughly 3.75 TB per year? Another thought: across the three different types of sync, match message txs against each other; txs that are left unmatched can be assumed to be the missing messages, and the table can be updated with those. You could also potentially take SEI's DB and modify it to run a dual event listener. Is it possible to have dual event processors for shuttle to process twice the number of messages, or would that cause an issue in the write path to the database?
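On the last point: one way to make doubled-up event processors safe is to keep the database write idempotent, so processing the same message twice is a no-op. A minimal sketch with the `pg` client; the table and columns here are hypothetical, not shuttle's actual schema:

```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Idempotent write: if two event processors deliver the same message,
// the second insert becomes a no-op instead of a duplicate row or an error.
export async function storeMessage(hash: Buffer, fid: number, body: Buffer): Promise<void> {
  await pool.query(
    `INSERT INTO messages (hash, fid, body)
     VALUES ($1, $2, $3)
     ON CONFLICT (hash) DO NOTHING`,
    [hash, fid, body],
  );
}
```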
-
Why does shuttle need a reconciler? Could this be solved by having Hubble create a file-based state-change log instead of listening to
-
Bundles may be useful here too.
-
Can you get stats on how often data is getting accessed historically? Based on that you can probably prioritize resources. For example, OpenSearch has something called UltraWarm storage that takes slightly longer to query, which is fine because it's infrequently accessed. Maybe look into this, since you're already storing snapshots on object storage (S3).

In my prior experience, we found that our cache hit ratio was wildly skewed towards recent messages, so we only needed to make recent data readily available, and we could take longer on fetches of historical data by going back to the email server and optimizing for search.

Hub runners will likely have different priorities in the future. Some might only need the last 30 days of casts and could optimize their hubs for that. Other hubs, the main hubs, will likely need to store all historical data and be able to serve it, probably running on much cheaper SSDs or even object storage (S3). The metric here is ROI: why does this storage need to be in memory vs NVMe vs SSD vs HDD vs S3? What is the cost per message saved? These are all trade-offs to consider, but at scale I guarantee there are fewer high-value messages and a lot of low-value ones. It's probably easy to pull a few well-chosen samples to get a representative feel for the above.

Also, if you're likely to use shuttle for reads in the future, why does hub read performance need to be so high? It's already so high that the bottleneck is networking anyway. Hubs are a way to get the data out to all the various downstream databases; shouldn't they be optimized more around adapter patterns for that? Why are hubs also an API? Shuttle should eventually have several adapters that store data in different databases with different trade-offs, where different fields get indexed differently for different use cases. Hubs should provide the data streams that help services like shuttle move data into more API-friendly services. For example, hubs do not have the ability to replay messages from a certain point in time when shuttle fails partway through a gRPC stream.

Perhaps the right metrics are:

- Cost per message written to a hub
- Write throughput to a hub
- Value to users for the cost and performance of writes to a hub

The value to users is unclear right now, so I'm guessing you're optimizing only on the cost and performance dimensions, which makes sense. However, there are some pretty obvious ways to see what's high or low value to people running hubs, how hub data is getting accessed, and what's over-optimized and costing a lot. I'll write this out in a more structured manner in the future, but that's my two cents at first glance.
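One concrete way to test the "reads skew recent" hypothesis on Farcaster data would be a query like the sketch below, assuming a hypothetical `message_reads` access log alongside a replicated `messages` table (both names and columns are illustrative):

```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// What fraction of reads would a "hot last 30 days" tier actually serve?
async function readRecencySplit(): Promise<void> {
  const { rows } = await pool.query(`
    SELECT (r.read_at - m.created_at) < interval '30 days' AS is_recent,
           count(*) AS reads
    FROM message_reads r
    JOIN messages m ON m.hash = r.message_hash
    GROUP BY is_recent
  `);
  console.table(rows); // if the skew holds, is_recent = true should dominate
}

readRecencySplit().catch(console.error);
```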
-
Great doc. Thanks for the detail! Is it worth mentioning any foreseeable risks (+ mitigations) within each of your strategies?
-
This is way too generous. I would change the new-account storage limit to 500 casts, and then 10x the cost per unit per month, i.e. $4 a month for an extra 2.5k casts. People with 5k casts who want to keep those casts around can surely pay.

Most people will be totally fine with starting to lose their old casts after exceeding their limit. 96% of social media's value is recent content. Old, good content gets recreated organically anyway, so most people really won't care about losing it; they can even save the good content they know they'll want again. This isn't YouTube, where people often revisit very old content.

You could also make it easy for people to export their content to compensate. With export, those who want to revive some casts later could, i.e. state expiry. You could even let users mark certain casts as "stored" and exempt them from automatic pruning: the oldest non-stored casts get pruned, letting users keep around only the ~2% of casts they really care about.
-
Big up on increasing how much storage costs. I also think that Solana Turbine-style p2p should be considered to reduce message duplication.
-
I'm curious if you've benchmarked different databases? When I've done very perf-sensitive work in the past, LMDB was dramatically faster than other embedded DBs like RocksDB and SQLite.
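For comparisons like that, a shared harness over a minimal key-value interface keeps the engines on equal footing; the `KvStore` interface below is hypothetical, and each engine (LMDB, RocksDB, SQLite) would need its own small adapter:

```typescript
// Minimal interface each candidate engine gets wrapped in.
interface KvStore {
  put(key: Buffer, value: Buffer): Promise<void>;
  get(key: Buffer): Promise<Buffer | undefined>;
}

// Write N ~300-byte values (roughly one Farcaster message each) and report
// sustained write throughput in messages per second.
export async function benchWrites(store: KvStore, n: number): Promise<number> {
  const value = Buffer.alloc(300, 0xab);
  const start = Date.now();
  for (let i = 0; i < n; i++) {
    await store.put(Buffer.from(`msg:${i}`), value);
  }
  const seconds = (Date.now() - start) / 1000;
  return n / seconds;
}
```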
-
Closing - we've made a bunch of progress on this, and are now moving towards a new architecture in #193
-
Problem
Over the last three months, Farcaster grew 16x from 5,000 to 80,000 daily users.
Stress on the network has increased proportionally. We’re unlikely to keep up with another 10x in growth. Each day, our nodes (hubs) are seeing:
Goals
Farcaster should handle 1 million daily users and make sure that:
It’s important that we achieve this without major architectural changes if feasible, and we want to hold onto the following invariants:
If you're new to Farcaster, check out the Hub overview video for more context.
Proposal
There are three areas we need to examine: state management, which validates new messages and updates a hub's state; sync, which converges all hubs to the same global state; and replication, which copies data to an external data store.
State
When a user says “Hello World!”, a new cast message is created and added to the global network state through a hub. The message's validity is checked and then it is appended to the hub's local disk and gossiped out to the broader network.
A user must pay yearly rent (storage fees) to keep their state live on the network. Renting a unit of storage lets a user store a specific number of messages of each type. The average size of a message is ~ 300 bytes.
Problems
Global state is growing daily by 1.67 GiB and 4.39M messages. At 80k DAU, that's ~21 KiB and 55 messages of growth per user per day. At 1M users, we’d expect to see 21 GiB of state growth per day or 7.5 TB of state growth per year.
Messages are added at an average rate of 50/s. At 1M users, we’d expect 1000/s with spikes up to 10,000/s. We can handle this today but only by using NVMe disks because our system is very latency sensitive. This needs to be rewritten because NVMe disks have limited total capacity relative to general SSDs.
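A quick back-of-the-envelope check of the projections in the two paragraphs above, using only the figures already quoted (a sketch, not a measurement):

```typescript
// Observed today (figures from this section).
const dau = 80_000;
const stateGrowthGiBPerDay = 1.67;
const messagesPerDay = 4_390_000;

// Per-user growth: ~21 KiB and ~55 messages per day.
const kibPerUserPerDay = (stateGrowthGiBPerDay * 1024 * 1024) / dau; // ≈ 21.9
const msgsPerUserPerDay = messagesPerDay / dau;                      // ≈ 54.9

// Projection at 1M DAU, assuming per-user behavior stays constant.
const targetDau = 1_000_000;
const gibPerDay = (kibPerUserPerDay * targetDau) / (1024 * 1024); // ≈ 21 GiB/day
const tibPerYear = (gibPerDay * 365) / 1024;                      // ≈ 7.5 TiB/year
const avgMsgsPerSec = (msgsPerUserPerDay * targetDau) / 86_400;   // ≈ 635/s average

console.log({ kibPerUserPerDay, msgsPerUserPerDay, gibPerDay, tibPerYear, avgMsgsPerSec });
```

The roughly 635 msg/s average is consistent with the ~1000/s planning figure above once growth headroom and uneven traffic are factored in.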
Storage problems aren't hair-on-fire urgent yet, but if growth continues at this pace we'll likely need to deal with them in the next 6-12 months.
Metrics
The following data should be available on public dashboards:
We should also periodically run the following benchmarks:
Proposed Changes
Investigate latency sensitivity: is NVMe truly a requirement? Run the benchmarks on our ideal system and build the dashboards.
Hub hardware limits: increase the minimum storage requirement for starting a new hub to some dynamic multiple of recent storage growth, so that new hubs don't fill up and shut down. As long as we stay under $1000/month we should be fine.
Compression: a message is roughly 300 bytes, but we use ~400 bytes to store it on disk. Using compression or changing indices might reduce this by 50% (to ~200 bytes).
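As a rough illustration of what compression could buy, here is a sketch using Node's built-in zlib on synthetic messages. The message shape is made up (real messages are protobuf-encoded), and most real savings would likely come from block or dictionary compression inside RocksDB rather than compressing messages one at a time, since tiny inputs compress poorly on their own:

```typescript
import { brotliCompressSync, gzipSync } from "node:zlib";

// A stand-in roughly the size of a real ~300-byte cast message.
function makeMessage(i: number): Buffer {
  return Buffer.from(JSON.stringify({
    fid: 10_000 + i,
    type: "MESSAGE_TYPE_CAST_ADD",
    timestamp: 98_765_432 + i,
    text: `Hello World! This is cast number ${i} with some ordinary everyday text in it.`,
    hash: Buffer.from(String(i).padStart(20, "0")).toString("hex"),
  }));
}

const single = makeMessage(0);
// Compressing messages in blocks amortizes shared structure across entries,
// which is where most of the savings would come from.
const block = Buffer.concat(Array.from({ length: 1000 }, (_, i) => makeMessage(i)));

const gzipSingle = gzipSync(single).length / single.length;
const gzipBlock = gzipSync(block).length / block.length;
const brSingle = brotliCompressSync(single).length / single.length;
const brBlock = brotliCompressSync(block).length / block.length;

console.log(`gzip:   single ${(gzipSingle * 100).toFixed(0)}%, 1000-message block ${(gzipBlock * 100).toFixed(0)}%`);
console.log(`brotli: single ${(brSingle * 100).toFixed(0)}%, 1000-message block ${(brBlock * 100).toFixed(0)}%`);
```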
Reducing limits: making the storage limits for each unit 50% smaller would probably cause a 3-10% decrease in total state growth on the network.
Raising fees: increasing storage costs would slow the rate at which new users are added, which would give us more headroom to grow.
Sync
A new hub will do a snapshot sync to download messages up to the last day, and gossip sync to stream all messages going forward. It will also do a "diff sync" with another hub to download messages from the current day, which were not in the snapshot. This diff sync is also re-run periodically to catch messages that were dropped due to lossy gossip or other downtime.
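Roughly, the bootstrap sequence looks like the sketch below. The function names are hypothetical stand-ins for Hubble's sync internals, not its actual API:

```typescript
// Hypothetical dependencies standing in for Hubble's sync internals.
interface SyncDeps {
  downloadSnapshot(upTo: Date): Promise<void>;               // snapshot sync
  diffSyncWithPeer(peer: string): Promise<number>;           // returns messages merged
  subscribeToGossip(onMessage: (msg: Buffer) => void): void; // gossip sync
}

export async function bootstrapHub(deps: SyncDeps, peer: string): Promise<void> {
  // 1. Snapshot sync: bulk-load state up to roughly the last day.
  const yesterday = new Date(Date.now() - 24 * 60 * 60 * 1000);
  await deps.downloadSnapshot(yesterday);

  // 2. Gossip sync: start streaming immediately so nothing new is missed
  //    while the remaining gap is being filled.
  deps.subscribeToGossip((_msg) => {
    // validate + merge into local state
  });

  // 3. Diff sync: fill the gap between the snapshot and now, then re-run
  //    periodically to recover messages dropped by lossy gossip or downtime.
  await deps.diffSyncWithPeer(peer);
  setInterval(() => void deps.diffSyncWithPeer(peer), 10 * 60 * 1000);
}
```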
Sync can be thought of as a three-phase process:
Problems
Gossip sync consumes ~700 KiB/s on hubs while state is growing by ~20 KiB/s. That’s a 35x overhead, when it should be closer to 10x.
Hubs are exhibiting odd sync behavior on the margins - some hubs have more messages than expected while others are falling behind perpetually. The root cause is still unclear.
Diff sync can only handle 70-100 msg/s at peak, so if a hub goes offline for even a short while, it will never catch up to the current state.
Hubs sometimes run into rate limits on blockchain nodes, likely due to using a free plan. This breaks sync silently and they diverge from the rest of the network until they are re-synced.
Metrics
The following data should be available on dashboards:
The following data should be periodically benchmarked:
Proposed Changes
Replication
When hubs receive messages, they are stored locally in RocksDB. Apps need to transfer this data to a higher-level store like Postgres to be able to run queries on it efficiently. Farcaster has a library called shuttle which helps copy messages into a database table and keep it in sync.
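At its core this is an event-stream consumer plus an idempotent writer with a resumable cursor. A minimal sketch under assumed shapes (the `HubEventStream` interface, table names, and columns below are illustrative, not shuttle's real API):

```typescript
import { Pool } from "pg";

// Illustrative event shape and stream; the real shuttle package wraps the
// hub's gRPC subscribe API, but the exact interfaces differ.
interface HubEvent { id: number; hash: Buffer; fid: number; body: Buffer; }
interface HubEventStream { subscribe(fromId: number): AsyncIterable<HubEvent>; }

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function replicate(stream: HubEventStream): Promise<void> {
  // Resume from the last committed event so a crash or redeploy does not
  // force a full replay (or a reconciliation pass) every time.
  const { rows } = await pool.query("SELECT last_event_id FROM replication_cursor");
  const fromId: number = rows[0]?.last_event_id ?? 0;

  for await (const event of stream.subscribe(fromId)) {
    // Idempotent write keyed on the message hash, then advance the cursor.
    await pool.query(
      `INSERT INTO messages (hash, fid, body) VALUES ($1, $2, $3)
       ON CONFLICT (hash) DO NOTHING`,
      [event.hash, event.fid, event.body],
    );
    await pool.query("UPDATE replication_cursor SET last_event_id = $1", [event.id]);
  }
}
```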
Problems
Metrics
System Requirements
A hub should be able to run on commodity cloud hardware for < $1000/month while successfully replicating to a Postgres database. For today’s hubs, we recommend provisioning at least the following:
This costs a little over $150/month on [latitude.sh](http://latitude.sh) today. Larger providers like AWS and GCP tend to be a bit more expensive.
We expect our requirements at 10x to be closer to: