Extract information about shared links (and media) #9
Python goose: https://github.com/grangier/python-goose/
Cool, good to know. I don't have any of my Go code checked in yet (it's quite messy), but I will soon. I'm currently grabbing and storing data from Facebook with great success, so I have a bunch of shared links coming through that I'd like a little more insight on. That said, given that Social Harvest is being ported to Go (from PHP and Node.js), I'm looking to keep tools like this in Go. That's not to say someone else couldn't use Python and filter the data that ultimately ends up being stored. One of the goals of Social Harvest is to allow users to use whatever programming language they want. Any database, for that matter, too.
@tmaiaroto interesting. I'd suggest building things as services. I'm not sure whether it's the right or wrong approach, but I feel SOA is nice: Application -> Queues -> Service (this way the service can be written in whatever language fits best). We can get the output in JSON or some other format. Looking forward to seeing your code when it's pushed. Watching closely.
For the most part I do plan to work like that, though I was thinking about streams and piping data through a set of filters. Basically, with Fluentd I'm tailing a bunch of log files for all the different types of data. This data, before it hits those log files, should be able to be streamed and filtered. I will likely have to think about a queue though. Any filter on the data done outside of the core server app will indeed get the data in JSON, line by line. I need to pay extra-careful attention to this process, so I expect it won't be complete as fast as some other things. But one thing I think can be done immediately (and always) is filtering via Fluentd.
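A line-by-line JSON filter like the one described is easy to sketch in Go: read JSON objects from stdin one per line, keep only the records carrying a shared link, and emit the links. The `record` shape here is an assumption for illustration; the real harvested data will have its own fields.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// record is a hypothetical shape for one harvested JSON line;
// the actual Social Harvest schema may differ.
type record struct {
	Message string `json:"message"`
	Link    string `json:"link"`
}

// keep reports whether a record carries a shared link worth crawling.
func keep(r record) bool {
	return r.Link != ""
}

func main() {
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		var r record
		if err := json.Unmarshal(sc.Bytes(), &r); err != nil {
			continue // skip malformed lines rather than halting the stream
		}
		if keep(r) {
			fmt.Println(r.Link)
		}
	}
}
```

A filter built this way composes with pipes (`tail -f harvest.log | linkfilter`), which fits the stream-and-filter model without committing to a queue yet.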
Virality Score has a whole Node.js crawler to do this, and it even extracts entities. Social Harvest needs a tool like this too... but in Go, of course.
I stumbled upon this: https://github.com/advancedlogic/GoOse
It appears to do some of what I had done in Node.js, though a bit more basic. It sounds like a good starting point.
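For a sense of the kind of metadata a GoOse-style extractor pulls out, here is a standard-library-only sketch that grabs Open Graph properties from raw HTML. This is not GoOse's actual API (which should be checked against its README); a real extractor would use a proper HTML parser rather than a regexp, and would also pull body text, images, and so on.

```go
package main

import (
	"fmt"
	"regexp"
)

// ogProperty pulls a single Open Graph <meta> property out of raw HTML.
// A regexp is fragile against real-world markup; it's used here only to
// illustrate the kind of metadata shared-link extraction is after.
func ogProperty(html, prop string) string {
	re := regexp.MustCompile(`<meta[^>]+property="og:` + regexp.QuoteMeta(prop) + `"[^>]+content="([^"]*)"`)
	if m := re.FindStringSubmatch(html); m != nil {
		return m[1]
	}
	return "" // property not present
}

func main() {
	page := `<html><head>
	<meta property="og:title" content="Example Article"/>
	<meta property="og:description" content="A short summary."/>
	</head></html>`
	fmt.Println(ogProperty(page, "title"))       // Example Article
	fmt.Println(ogProperty(page, "description")) // A short summary.
}
```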
The information extracted from shared links can tell a lot about what exactly is being shared. It contains meatier information to run through various filters to get a sense of topics, etc. Shared links that do end up being crawled for additional information need only exist once in the data set: while a link may have been shared a thousand times, we only need the summary/extracted/semantic/meta data from it stored once.
Think about how to reduce duplicate lookups, though. Once data is extracted, we don't want to make another 1,000 HTTP requests to fetch data that's already in the database. The challenge here is that we don't know what database will be used. Perhaps an OK solution for now is to create a log file listing what was already discovered by the particular harvester. This would mean a different harvester could still request a duplicate URL, but at least that's better than making thousands of needless requests.
This is quite easy to get and store... but being efficient about it will require some thinking.