Add options IGNORE_ARCHIVED_POSTS and IGNORE_REPLY_POSTS #21
Users importing old posts from Twitter/X can flood a feed with old posts that match a keyword. To avoid this, ignore posts with a `created_at` timestamp that is older than 4 minutes.
I am not sure that this is the right approach. For example, the firehose could be down for 10 minutes. Then it comes back up and starts sending old posts. You need to process such posts, not ignore them. Another case is when your firehose consumer goes down and you want to backfill events. So you move the cursor back to 2 days ago and start processing old events. These cases are common.
Maybe set it to four days? Unfortunately, an old createdAt date is currently the only way to identify archived posts. It would be great if there were an archived flag in the taxonomy. Bluesky's recommendation is to ignore far-future and far-past timestamps.
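To make the window idea concrete, here is a minimal sketch of a far-past/far-future check. The names `MAX_PAST`, `MAX_FUTURE`, `parse_timestamp`, and `is_outside_window` are hypothetical and not part of this PR. The four-day value comes from the suggestion above; the ten-minute future tolerance is purely an assumption.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds: four days is the value floated in this thread,
# the future tolerance is an assumed placeholder.
MAX_PAST = timedelta(days=4)
MAX_FUTURE = timedelta(minutes=10)


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, treating naive values as UTC."""
    parsed = datetime.fromisoformat(value.replace('Z', '+00:00'))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def is_outside_window(created_at: str, now: datetime) -> bool:
    """Return True when created_at is too far in the past or the future."""
    created = parse_timestamp(created_at)
    return created < now - MAX_PAST or created > now + MAX_FUTURE
```

Passing `now` explicitly keeps the check easy to test. Note that it would still drop genuine posts during a backfill longer than `MAX_PAST`, which is the concern raised earlier in this conversation.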
This looks like a specific feature. I do not think that it should be in the general template. Someone who needs it could implement it themselves.
How about this: I removed the call to ... That way, they understand what is going on when something like this happens.
I am okay with keeping this function if we provide an option to enable it (disabled by default); right now, it is dead code.
- `IGNORE_REPLY_POSTS`
- `IGNORE_OLD_POSTS`
@MarshalX Done. I've also added an option to ignore reply posts.
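For readers following along, opt-in flags that are disabled by default could be read roughly as follows. This is a sketch, not the code in this PR: `env_flag`, `is_reply`, and `should_skip` are hypothetical names, and the record is assumed to be a plain dict, whereas the template may hand you a model object with a `reply` attribute instead.

```python
import os


def env_flag(name: str, environ=os.environ) -> bool:
    """Read a boolean option; anything unset or unrecognised means disabled."""
    value = environ.get(name, '')
    return value.strip().lower() in ('1', 'true', 'yes', 'on')


def is_reply(record: dict) -> bool:
    """A post record is a reply when it carries a `reply` reference."""
    return record.get('reply') is not None


def should_skip(record: dict, ignore_replies: bool) -> bool:
    """Skip replies only when the option has been switched on."""
    return ignore_replies and is_reply(record)
```

A caller would evaluate `env_flag('IGNORE_REPLY_POSTS')` once at startup and pass the result to `should_skip`, so the default behaviour of the template stays unchanged.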
An archived post is one whose creation date is at least 24 hours before the actual publish date shown in clients. Usually those kinds of posts are imported from another social media site such as 𝕏/Twitter and Mastodon.

SkyFeed (used by many user-created feeds including @\aither.bsky.social's LoveLive! feed) filters them out anyway, and their presence adds a bit more workload to remove false positives based on their content.

To make sure false positives (usually those posted directly to Bluesky) aren't left out, especially after being disconnected from the firehose for long enough, we compare the post creation date against the commit timestamp given by the firehose instead of the current time.

Inspired by MarshalX/bluesky-feed-generator#21

Also move the porn label check to its own function and add comments explaining what each key-value pair in the record create dict represents.
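The comparison described in that commit message can be sketched as below. This is an illustration under assumptions, not the referenced commit's code: `ARCHIVED_THRESHOLD`, `parse_timestamp`, and `is_archived_post` are hypothetical names, and both timestamps are assumed to arrive as ISO 8601 strings.

```python
from datetime import datetime, timedelta, timezone

# "At least 24 hours before" the publish time, per the description above.
ARCHIVED_THRESHOLD = timedelta(hours=24)


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, treating naive values as UTC."""
    parsed = datetime.fromisoformat(value.replace('Z', '+00:00'))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def is_archived_post(created_at: str, commit_time: str) -> bool:
    """Compare the post's claimed creation time with the firehose commit time.

    Using the commit time rather than the current time means a consumer that
    reconnects or backfills does not mistake genuine posts for archived ones.
    """
    age_at_publish = parse_timestamp(commit_time) - parse_timestamp(created_at)
    return age_at_publish >= ARCHIVED_THRESHOLD
```

Because the check never looks at the wall clock, replaying events from two days ago gives the same answer as processing them live.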