Add options IGNORE_ARCHIVED_POSTS and IGNORE_REPLY_POSTS #21

Open
wants to merge 6 commits into
base: main

seanthegeek
Contributor

Users importing old posts from Twitter/X can cause a feed to get flooded with old posts that match a keyword. To avoid this, ignore posts with a `created_at` timestamp that is older than 4 minutes.
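
A minimal sketch of the check described above, for illustration only. It is not the code from this pull request: the function name `looks_archived`, the constant name, and the choice to treat naive timestamps as UTC are all assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Threshold taken from the description above; the constant name is hypothetical.
ARCHIVED_THRESHOLD = timedelta(minutes=4)


def parse_created_at(created_at: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2024-12-07T12:00:00.000Z."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat().
    parsed = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: a timestamp without an offset is treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def looks_archived(created_at: str, now: Optional[datetime] = None) -> bool:
    """Return True when created_at is more than ARCHIVED_THRESHOLD before now."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - parse_created_at(created_at) > ARCHIVED_THRESHOLD
```

Because the comparison uses the current wall-clock time, every post looks archived while a consumer is catching up on old events.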
@MarshalX
Owner

MarshalX commented Dec 7, 2024

I am not sure that this is the right approach. For example, the firehose could be down for 10 minutes. Then it will come back up and start sending old posts. You need to process such posts, not ignore them. Another case is when your firehose consumer goes down and you want to backfill events, so you set the cursor back to 2 days ago and start processing old events. These cases are common.

@seanthegeek
Contributor Author

seanthegeek commented Dec 8, 2024

Maybe set it to four days? Unfortunately an old createdAt date is currently the only way to identify archived posts. It would be great if there was an archived flag in the taxonomy.

Bluesky's recommendation is to ignore far future and far past timestamps.
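
As a rough illustration of a two-sided check, the sketch below accepts a timestamp only when it falls inside a window around a reference time. The four-day lower bound is the value floated above. The five-minute allowance for future timestamps, the function name, and the constant names are assumptions and do not come from Bluesky's guidance.

```python
from datetime import datetime, timedelta, timezone

MAX_PAST = timedelta(days=4)       # the "four days" suggested above
MAX_FUTURE = timedelta(minutes=5)  # assumption: small allowance for clock skew


def created_at_is_plausible(created_at: datetime, reference: datetime) -> bool:
    """Return True when created_at is neither far in the past nor far in the future."""
    return reference - MAX_PAST <= created_at <= reference + MAX_FUTURE
```

Both arguments should be timezone-aware `datetime` objects; mixing aware and naive values raises a `TypeError` on comparison.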

@MarshalX
Owner

MarshalX commented Dec 8, 2024

Looks like this is a specific feature. I do not think that it should be in the general template. Someone who needs it could implement it themselves.

@seanthegeek
Contributor Author

How about this:

I removed the call to is_archive_record from the example filter, but kept the function in and added more documentation to it explaining the pros and cons. That should help new feed creators understand the issues.

That way, they understand what is going on when something like this happens.

@MarshalX
Owner

I am okay with keeping this function if we provide an option to enable it (disabled by default); right now, it is dead code.

@seanthegeek seanthegeek changed the title Ignore archive posts Add options IGNORE_ARCHIVED_POSTS and IGNORE_REPLY_POSTS Dec 11, 2024
@seanthegeek
Contributor Author

seanthegeek commented Dec 11, 2024

@MarshalX Done. I've also added an option to ignore reply posts.
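
One way such opt-in options could be wired up is sketched below. This is not the code from the pull request. Reading the option from an environment variable, the helper names `env_flag` and `should_skip`, and representing the post record as a plain `dict` are all assumptions. The sketch relies only on the fact that a reply post carries a `reply` field in its record.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean option from the environment; unset means `default`."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def is_reply(record: dict) -> bool:
    """A post record that is a reply has a non-empty `reply` field."""
    return record.get("reply") is not None


def should_skip(record: dict, *, ignore_replies: bool) -> bool:
    """Return True when the record should be left out of the feed."""
    return ignore_replies and is_reply(record)


# Disabled unless explicitly enabled, as requested in the review above.
IGNORE_REPLY_POSTS = env_flag("IGNORE_REPLY_POSTS")
```

With the variable unset, `should_skip` never drops a post, so the default behavior of the template stays unchanged.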

p1timmy added a commit to p1timmy/bsky-feeds that referenced this pull request Dec 30, 2024
An archived post is one whose creation date is at least 24 hours
before the actual publish date shown in clients. Usually those kinds
of posts are imported from another social media site such as 𝕏/Twitter
and Mastodon.

SkyFeed (used by many user created feeds including @\aither.bsky.social's
LoveLive! feed) filters them out anyway, and their presence adds a
bit more workload to remove false positives based on their content.

To make sure false positives (usually those posted directly to Bluesky)
aren't left out especially after being disconnected from the firehose
for long enough, we compare the post creation date against the commit
timestamp given by the firehose instead of the current time.

Inspired by MarshalX/bluesky-feed-generator#21

Also move the porn label check to its own function and add comments
explaining what each key-value pair in record create dict represents.
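
The comparison described in this commit message can be sketched roughly as follows. This is an illustration, not the code from the referenced commit: the function and parameter names are hypothetical, and both values are assumed to be already parsed into timezone-aware `datetime` objects.

```python
from datetime import datetime, timedelta, timezone

# "At least 24 hours" comes from the commit message; the constant name is hypothetical.
ARCHIVE_GAP = timedelta(hours=24)


def is_archived(created_at: datetime, commit_time: datetime) -> bool:
    """Return True when the post was created at least ARCHIVE_GAP before its commit."""
    return commit_time - created_at >= ARCHIVE_GAP


# A post written one minute before its commit is not archived, no matter how
# long after the commit the consumer gets around to processing it.
created = datetime(2024, 12, 28, 10, 0, tzinfo=timezone.utc)
committed = datetime(2024, 12, 28, 10, 1, tzinfo=timezone.utc)
fresh_post_is_archived = is_archived(created, committed)
```

Since neither input depends on the current time, the result is the same during live consumption and during a backfill from an older cursor.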
2 participants