This repository has been archived by the owner on Apr 27, 2018. It is now read-only.

load a warc archive, filter it, and produce another warc archive #257

dportabella opened this issue Oct 17, 2016 · 4 comments

@dportabella

We need to process a WARC archive, filter it based on keywords, and produce a new WARC archive. Something like this:

RecordLoader.loadArchives(in, sc)
  .keepValidPages()
  .filter(r => r.getContentString.contains("my keyword"))
  .saveAsWarcArchive("/path/out.warc.gz")

(saving request and response)

Is this possible with warcbase? If not, any idea of how to achieve it?

@lintool
Owner

lintool commented Oct 17, 2016

This is not a use case we've considered thus far. It wouldn't be too hard to implement: loadArchives ultimately calls a Hadoop InputFormat to read ARCs and WARCs, so we would need a corresponding Hadoop OutputFormat to implement the converse functionality. saveAsWarcArchive would then call this OutputFormat.
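
Roughly, the shape could look like this. It is only a sketch: WarcOutputFormat does not exist in warcbase yet, the ArchiveRecord accessor getContentBytes and the package paths are assumptions, and real WARC record framing (headers, per-record gzip) is omitted.

import java.io.DataOutputStream
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.mapreduce.{RecordWriter, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import org.apache.spark.rdd.RDD
import org.warcbase.spark.archive.io.ArchiveRecord

// Hypothetical OutputFormat: writes each value's bytes verbatim to the task's part file.
// A real implementation would emit proper WARC record headers and optionally gzip each record.
class WarcOutputFormat extends FileOutputFormat[NullWritable, BytesWritable] {
  override def getRecordWriter(context: TaskAttemptContext): RecordWriter[NullWritable, BytesWritable] = {
    val path = getDefaultWorkFile(context, ".warc")
    val out: DataOutputStream = path.getFileSystem(context.getConfiguration).create(path, false)
    new RecordWriter[NullWritable, BytesWritable] {
      override def write(key: NullWritable, value: BytesWritable): Unit =
        out.write(value.getBytes, 0, value.getLength)
      override def close(context: TaskAttemptContext): Unit = out.close()
    }
  }
}

// Hypothetical saveAsWarcArchive helper: wraps Spark's saveAsNewAPIHadoopFile with the OutputFormat above.
def saveAsWarcArchive(rdd: RDD[ArchiveRecord], path: String): Unit =
  rdd
    .map(r => (NullWritable.get, new BytesWritable(r.getContentBytes)))
    .saveAsNewAPIHadoopFile(
      path,
      classOf[NullWritable],
      classOf[BytesWritable],
      classOf[WarcOutputFormat])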

@ianmilligan1
Collaborator

Just re-pinging this to keep it alive. I think I have a good use case for this too. Now to find time...

@dportabella
Author

@lintool, loadArchives returns an RDD[ArchiveRecord], so at this point we have lost the information about the request and response headers (except for URL, date, and MIME type), right?

Of course, if we don't care about those headers, we can create a new archive with dummy request and response headers. That would be OK for my current use case.
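
To make the dummy-header idea concrete, here is a minimal sketch that re-frames a record as a WARC response record using only the fields that survive. It assumes warcbase accessors getUrl, getCrawlDate, getMimeType and getContentString; a spec-compliant record would carry the original HTTP message, which is exactly what we no longer have.

// Sketch only: fabricate a WARC response record from the surviving fields.
def toWarcResponseRecord(r: ArchiveRecord): String = {
  val payload = r.getContentString
  val headers = Seq(
    "WARC/1.0",
    "WARC-Type: response",
    "WARC-Target-URI: " + r.getUrl,
    "WARC-Date: " + r.getCrawlDate,  // WARC-Date must be ISO 8601, so getCrawlDate likely needs reformatting
    "WARC-Record-ID: <urn:uuid:" + java.util.UUID.randomUUID + ">",
    "Content-Type: " + r.getMimeType,  // simplification: a true response record uses application/http; msgtype=response
    "Content-Length: " + payload.getBytes("UTF-8").length
  ).mkString("\r\n")
  headers + "\r\n\r\n" + payload + "\r\n\r\n"
}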

@dportabella
Author

Hi,
I've created a gist to filter a WARC archive using Spark and store the result back to a WARC archive:
https://gist.github.com/dportabella/3caf261c218a4448a03a14dbc06fe730

I did not create a sophisticated Spark writer/serializer, but it does the job.
If you are interested, I can integrate this code into your warcbase project.
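
For reference, the core idea is roughly the following. This is a simplified sketch, not the gist's actual code: toWarcResponseRecord is the dummy-header builder sketched above, and saveAsTextFile produces a directory of gzipped part files (text, not binary-safe) rather than a single .warc.gz.

import org.apache.hadoop.io.compress.GzipCodec

val filtered = RecordLoader.loadArchives(in, sc)
  .keepValidPages()
  .filter(r => r.getContentString.contains("my keyword"))

// Serialize each surviving record with fabricated headers and write gzip-compressed output.
filtered
  .map(toWarcResponseRecord)
  .saveAsTextFile("/path/out.warc.gz", classOf[GzipCodec])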
