Social media crawling issues #160

Closed
makew0rld opened this issue Aug 29, 2022 · 11 comments

Comments

@makew0rld

Tested with ReplayWeb.page Linux AppImage, version 1.6.0. Issues were corroborated with https://replayweb.page/ in Chromium.

Twitter

Facebook

@ikreymer
Member

Thanks, these are some of the more complicated issues, as social media platforms change how they serve video all the time, and there's likely multiple issues. Some questions:

  • can you check the URL tab and filter by video to see if the videos were in fact captured, but don't replay?
  • can you share the WARC/WACZ for each?
  • what version of browsertrix crawler and/or browsertrix cloud was used?
  • were profiles used?

We can look at these one at a time at some point, but these questions will help identify general issues.
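The first question above (were the videos captured at all?) can also be checked outside the ReplayWeb.page URL tab. The following is a minimal sketch, not part of any Webrecorder tool: it assumes the WACZ is a ZIP file whose CDXJ index sits under `indexes/` and that each index line ends in a JSON object with `url`, `mime` and `status` keys. The function names are hypothetical.

```python
import gzip
import json
import zipfile

# Assumed index file extensions inside the WACZ; other files are skipped.
INDEX_SUFFIXES = (".cdx", ".cdxj", ".cdx.gz", ".cdxj.gz")


def iter_cdxj_entries(wacz_file):
    """Yield the JSON part of every CDXJ index line found in a WACZ."""
    with zipfile.ZipFile(wacz_file) as zf:
        for name in zf.namelist():
            if not (name.startswith("indexes/") and name.endswith(INDEX_SUFFIXES)):
                continue
            raw = zf.read(name)
            if name.endswith(".gz"):
                raw = gzip.decompress(raw)
            for line in raw.decode("utf-8").splitlines():
                # Assumed line layout: "<sort key> <timestamp> <json>"
                brace = line.find("{")
                if brace == -1:
                    continue
                yield json.loads(line[brace:])


def find_video_captures(wacz_file):
    """Return (url, status, mime) for every indexed record with a video/* MIME type."""
    return [
        (entry.get("url"), entry.get("status"), entry.get("mime"))
        for entry in iter_cdxj_entries(wacz_file)
        if str(entry.get("mime", "")).lower().startswith("video/")
    ]
```

An empty result would suggest a capture problem rather than a replay problem, although streamed video may be indexed under other MIME types (for example playlist files), so this filter is only a first pass.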

@YurkoWasHere

I will send you the WACZ files out of band

The version may be difficult to assess, as it was "latest" at the time.
I just realized we never captured the Docker image hash, which is an oversight.

Git commit of browsertrix: cdefb8d06e98cf0e73083d120bab0b07e0def125
Capture happened around 7-24-2022

Yes we had a profile (global profile for all captures)

@ikreymer
Member

I will send you the WACZ files out of band

Thanks!

The version may be difficult to assess, as it was "latest" at the time. I just realized we never captured the Docker image hash, which is an oversight.

Git commit of browsertrix: cdefb8d06e98cf0e73083d120bab0b07e0def125
Capture happened around 7-24-2022

We've opened an issue to plan adding this to the WACZ automatically, possibly as the git hash, the container image, or both:
webrecorder/specs#127
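Until the crawler version is recorded automatically, whatever the WACZ already says about its creator can be inspected by hand. This is a minimal sketch under one assumption: that `datapackage.json` at the root of the WACZ may carry an optional `software` string. Whether that string identifies a specific crawler build depends on the tool that wrote the file, and the function name is hypothetical.

```python
import json
import zipfile


def read_wacz_software(wacz_file):
    """Return the 'software' string from datapackage.json, or None if absent."""
    with zipfile.ZipFile(wacz_file) as zf:
        try:
            raw = zf.read("datapackage.json")
        except KeyError:
            # The archive has no datapackage.json at its root.
            return None
    return json.loads(raw).get("software")
```

A `None` result means the file gives no hint, which matches the situation described above where only the git commit and an approximate capture date are known.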

@edsu
Contributor

edsu commented Aug 30, 2022

For https://twitter.com/Angelo4justice3/status/1515753929996337159 did you see something like this:

[Image: Screen Shot 2022-08-29 at 8 25 20 PM]

I noticed the warning when I viewed the tweet while not logged in. If you use a Twitter profile to do the crawl and update the user account settings to allow "sensitive content" it should get collected?

[Image: Screen Shot 2022-08-29 at 8 29 24 PM]

@edsu
Contributor

edsu commented Aug 30, 2022

@ikreymer I think I can reproduce the first problem with browsertrix-crawler main. I created a snapshot of https://twitter.com/Angelo4justice3/status/1515753929996337159 with the following settings, which used a profile for a user who was logged into Twitter:

collection: Angelo4justice3
profile: /crawls/profiles/twitter-baltimoreup.tar.gz
generateWACZ: true
text: true
scopeType: page
behaviors:
  - autoplay
screencastPort: 9037
seeds:
  - url: https://twitter.com/Angelo4justice3/status/1515753929996337159

The resulting WACZ can be found at https://inkdroid.org/tmp/Angelo4justice3.wacz

I noticed several problems in the devtools console when loading. An early one that looks particularly bad occurs when loading the Twitter client https://abs.twimg.com/responsive-web/client-web/main.f906e9b8.js: a JavaScript error appears to be thrown on line 15. At first I thought the JavaScript was being truncated, but after comparing the WARC record with what is on the live web, it appears to be complete.

[Image: Screen Shot 2022-08-30 at 3 46 49 AM]

I'm seeing the same issue with https://twitter.com/avalaina/status/1548681345248989186 which is available here: https://inkdroid.org/tmp/avalaina.wacz

@makew0rld
Author

I can confirm all the Twitter issues still occur with the latest Browsertrix. Haven't been able to test Facebook yet.

@edsu
Contributor

edsu commented Aug 30, 2022

I'm seeing the same with v0.6.0 so this is definitely a change on the Twitter side and not a regression.

@ikreymer
Member

ikreymer commented Aug 30, 2022

@edsu thanks for taking a look. Fortunately, it appears that this particular JS error does not affect the replay; the issue there is that the behavior doesn't click on the sensitive content link. To track things better, I'm going to leave some comments, first on Twitter.

Twitter

  • Some JS errors occur on replay, though do not appear to affect the replay

  • Sensitive content button needs to be clicked (tracked via "Twitter Behavior: Click on sensitive content button automatically", browsertrix-behaviors#20)

  • Videos should play on regular tweets (fixed in replayweb.page 1.6.5)

  • Videos should play on embedded tweets (possible capture and replay issue, looking at this)

  • Long threads: There is a timeout on how long the behavior runs, so the crawler most likely won't be able to archive the entire thread, but the timeout can be increased with the --behaviorTimeout and --timeout settings. Not sure yet if there's an issue there or just that the behavior timeout has been reached. Can you clarify what the timeout was and what behavior you were expecting? @makeworld-the-better-one @YurkoWasHere
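For the long-thread case, raising the two timeouts mentioned in the last bullet could look roughly like the sketch below. It only builds the command line and does not run anything. The helper name, the default values and the `crawls` directory are hypothetical; the image name `webrecorder/browsertrix-crawler` and the assumption that both timeouts are given in seconds are not stated in this thread, and `--scopeType page` and `--generateWACZ` mirror the YAML settings posted earlier.

```python
import os


def build_crawl_command(url, behavior_timeout=600, page_timeout=180,
                        crawls_dir="crawls",
                        image="webrecorder/browsertrix-crawler"):
    """Build (but do not run) a docker command for a single-page crawl."""
    # Bind mounts are most portable with an absolute host path.
    host_dir = os.path.abspath(crawls_dir)
    return [
        "docker", "run",
        "-v", f"{host_dir}:/crawls/",
        image,
        "crawl",
        "--url", url,
        "--scopeType", "page",
        # Assumed to be seconds: how long in-page behaviors may keep running.
        "--behaviorTimeout", str(behavior_timeout),
        # Assumed to be seconds: how long to wait for the page itself to load.
        "--timeout", str(page_timeout),
        "--generateWACZ",
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Keeping the two timeouts as separate parameters makes it easier to tell a slow page load apart from a behavior that simply ran out of time.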

@makew0rld
Author

makew0rld commented Aug 30, 2022

I assume you are referring to this bit from my original issue comment:

To be honest, I can't really reproduce this. I would leave it for now, and we'll come back to you if there's an issue in the future.

The issue with broken images and missing replies was still observed, however.

@ikreymer
Member

ikreymer commented Sep 3, 2022

Browsertrix Crawler 0.7.0-Beta.3 includes a number of fixes:

  • Native Twitter videos on regular tweets should archive and replay when tested with ReplayWeb.page 1.6.5.
  • Native videos on oembed.link tweets should capture and replay in full.
  • Tweets with a 'sensitive content' warning should now load.

(Note: this doesn't yet apply to YouTube or other custom video players in tweets.)

@ikreymer
Member

ikreymer commented Sep 9, 2022

Twitter issues should now be fixed, created a separate issue for Facebook (#163), so closing this.

@ikreymer ikreymer closed this as completed Sep 9, 2022