This repository has been archived by the owner on Apr 27, 2018. It is now read-only.
This is not a use case we've considered so far, but it wouldn't be too hard to implement: loadArchive ultimately calls a Hadoop InputFormat to read ARCs and WARCs, so we would need a corresponding Hadoop OutputFormat to implement the converse functionality. A saveAsWarcArchive method would then call this OutputFormat.
@lintool, loadArchive returns an RDD[ArchiveRecord], so at that point we have lost the information on the request and response headers (except for the URL, date, and MIME type), right?
Of course, if we don't care about those headers, we can create a new archive with dummy request and response headers. That would be fine for my current use case.
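The "dummy headers" idea above can be sketched independently of warcbase. A WARC response record is just a block of WARC headers framed by Content-Length, followed by the captured HTTP response; a placeholder status line and Content-Type stand in for the lost headers. This is an illustrative Python sketch, not warcbase code, and it omits fields a fully valid record needs (WARC-Record-ID, payload digests):

```python
from datetime import datetime, timezone

def make_warc_response(url, payload, content_type="text/html"):
    """Build a minimal WARC/1.0 response record with dummy HTTP headers.

    Illustrative only: real records also need WARC-Record-ID,
    block/payload digests, etc.
    """
    # Reconstruct a placeholder HTTP response around the payload.
    http = (b"HTTP/1.1 200 OK\r\n"
            b"Content-Type: " + content_type.encode() + b"\r\n\r\n"
            + payload)
    headers = [
        b"WARC/1.0",
        b"WARC-Type: response",
        b"WARC-Target-URI: " + url.encode(),
        b"WARC-Date: " + datetime.now(timezone.utc)
                          .strftime("%Y-%m-%dT%H:%M:%SZ").encode(),
        b"Content-Type: application/http; msgtype=response",
        b"Content-Length: " + str(len(http)).encode(),
    ]
    # WARC headers, blank line, content block, two trailing newlines.
    return b"\r\n".join(headers) + b"\r\n\r\n" + http + b"\r\n\r\n"

record = make_warc_response("http://example.com/", b"<html>hello</html>")
```

A record built this way loses the original crawl-time headers, which is exactly the trade-off described above.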
I didn't write a sophisticated Spark writer/serializer, but it does the job.
If you're interested, I can integrate this code into your warcbase project.
We need to process a WARC archive, filter it based on keywords, and write out a new WARC archive. Something like this:
(saving both the request and response records)
Is this possible with warcbase? If not, any idea how to achieve it?
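The read → filter-by-keyword → write pipeline being asked about can be sketched outside of Spark with a few lines of Python. This is a toy sketch, not warcbase: the records are greatly simplified, and splitting on the `WARC/1.0` header magic is a shortcut (a real reader, such as a Hadoop InputFormat, frames records by Content-Length, since a payload could itself contain that string):

```python
# Two toy WARC-style records (simplified; real records carry full WARC
# and HTTP headers and are framed by Content-Length, not delimiters).
rec_a = b"WARC/1.0\r\nWARC-Target-URI: http://a.example/\r\n\r\nclimate report\r\n\r\n"
rec_b = b"WARC/1.0\r\nWARC-Target-URI: http://b.example/\r\n\r\nsports scores\r\n\r\n"

def filter_records(stream: bytes, keyword: bytes) -> bytes:
    """Keep only records whose content mentions `keyword`."""
    # Naive split on the record header magic -- fine for this sketch,
    # wrong in general (a payload could contain b"WARC/1.0").
    parts = [p for p in stream.split(b"WARC/1.0") if p]
    kept = [b"WARC/1.0" + p for p in parts if keyword in p]
    return b"".join(kept)

filtered = filter_records(rec_a + rec_b, b"climate")
# `filtered` now contains only rec_a
```

In warcbase terms, the filter step would be an RDD filter over ArchiveRecord contents; the write step is the missing saveAsWarcArchive discussed above.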