Social media crawling issues #160
Comments
Thanks, these are some of the more complicated issues, as social media platforms change how they serve video all the time, and there are likely multiple issues. Some questions:
We can look at these one at a time at some point, but these questions will help identify general issues.
I will send you the WACZ files out of band. The version may be difficult to assess, as it was "latest" at the time. Git commit of browsertrix: cdefb8d06e98cf0e73083d120bab0b07e0def125. Yes, we had a profile (a global profile for all captures).
Thanks!
We've added an issue to plan to add this to the WACZ automatically, possibly as git hash and/or the container image (or both).
For https://twitter.com/Angelo4justice3/status/1515753929996337159 did you see something like this:
I noticed the warning when I viewed the tweet while not logged in. If you use a Twitter profile to do the crawl and update the user account settings to allow "sensitive content", it should get collected?
@ikreymer I think I can reproduce the first problem with browsertrix-crawler main. I created a snapshot of https://twitter.com/Angelo4justice3/status/1515753929996337159 with the following settings, which used a profile for a user who was logged into Twitter:

collection: Angelo4justice3
profile: /crawls/profiles/twitter-baltimoreup.tar.gz
generateWACZ: true
text: true
scopeType: page
behaviors:
- autoplay
screencastPort: 9037
seeds:
- url: https://twitter.com/Angelo4justice3/status/1515753929996337159

The resulting WACZ can be found at https://inkdroid.org/tmp/Angelo4justice3.wacz

I noticed several problems in the devtools console when loading. An early one that looks particularly bad: when loading the Twitter client https://abs.twimg.com/responsive-web/client-web/main.f906e9b8.js, a JavaScript error is thrown on line 15. At first I thought the JavaScript was being truncated, but after looking at the WARC record and comparing it to what is on the live web, it appears to be complete.

I'm seeing the same issue with https://twitter.com/avalaina/status/1548681345248989186 which is available here: https://inkdroid.org/tmp/avalaina.wacz
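For anyone who wants to inspect one of these WACZ files before digging into individual WARC records, here is a minimal sketch using only the Python standard library. It assumes the usual WACZ layout (a ZIP with WARC data under `archive/` and a page list at `pages/pages.jsonl`); the function name `summarize_wacz` is made up for illustration and is not part of any Webrecorder tool.

```python
import io
import json
import zipfile


def summarize_wacz(source):
    """Return (warc_names, page_urls) for a WACZ given a path or file-like object.

    Assumes the standard WACZ layout: WARCs under archive/ and an
    optional pages/pages.jsonl listing the captured pages.
    """
    with zipfile.ZipFile(source) as zf:
        names = zf.namelist()
        # WARC data lives under archive/; skip bare directory entries.
        warcs = [n for n in names if n.startswith("archive/") and not n.endswith("/")]
        pages = []
        if "pages/pages.jsonl" in names:
            with zf.open("pages/pages.jsonl") as fh:
                for raw in io.TextIOWrapper(fh, encoding="utf-8"):
                    raw = raw.strip()
                    if not raw:
                        continue
                    entry = json.loads(raw)
                    # The first line is a header record without a "url" key.
                    if "url" in entry:
                        pages.append(entry["url"])
    return warcs, pages
```

After downloading a file locally, `summarize_wacz("Angelo4justice3.wacz")` would list the WARC members and the page URLs recorded in it. Checking whether a specific response such as the `main.f906e9b8.js` record is complete still requires reading the WARC itself, which this sketch does not attempt.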
I can confirm all the Twitter issues still occur with the latest Browsertrix. Haven't been able to test Facebook yet.
I'm seeing the same with v0.6.0 so this is definitely a change on the Twitter side and not a regression.
@edsu thanks for taking a look. Fortunately, it appears that this particular JS error does not affect the replay; the issue there is that the behavior doesn't click on the sensitive content link. To track things better, I'm going to leave some comments, first on Twitter.
I assume you are referring to this bit from my original issue comment:
To be honest, I can't really reproduce this. I would leave it for now, and we'll come back to you if there's an issue in the future. The issue with broken images and missing replies was still observed, however.
Browsertrix Crawler 0.7.0-Beta.3 includes a number of fixes:
(Note: this doesn't yet apply to YouTube or other custom video players in tweets)
Twitter issues should now be fixed. I created a separate issue for Facebook (#163), so closing this.
Tested with ReplayWeb.page Linux AppImage, version 1.6.0. Issues were corroborated with https://replayweb.page/ in Chromium.
Twitter
Facebook