
Rewrite database importer to work in memory #14

Open
jlnr opened this issue Jul 19, 2016 · 2 comments

jlnr (Member) commented Jul 19, 2016

Instead of downloading the full Wikipedia dump, extracting it, and then running a Ragel script over the XML file, can we just do it all in memory? Pseudocode: `curl -s http://dumps.wikimedia.org/.../enwiki-20170220-pages-articles-multistream.xml.bz2 | bzcat | ./extract-movies enwiki`
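A minimal sketch of that pipeline, assuming the dumps.wikimedia.org URL layout shown above and that `extract-movies` reads the uncompressed XML on stdin (the wiki name and dump date below are placeholders, not confirmed values):

```sh
# Sketch only: WIKI and DUMP_DATE are placeholders.
WIKI=enwiki
DUMP_DATE=20170220
curl -s "https://dumps.wikimedia.org/${WIKI}/${DUMP_DATE}/${WIKI}-${DUMP_DATE}-pages-articles-multistream.xml.bz2" \
  | bzcat \
  | ./extract-movies "${WIKI}"
```

Nothing touches the disk: curl streams the archive, bzcat decompresses on the fly, and the extractor sees plain XML on stdin.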

Rationale: Having 100 GB of free space is a rare occurrence for me.

jlnr (Member, Author) commented Feb 28, 2017

Update: This is now possible because I've replaced the Ragel script with a small C++ tool that can stream its input.

It is also about 10× slower, taking 33 minutes instead of 3 to process the zhwiki dump. If the enwiki script can still finish overnight (<8 h), that's good enough.

jlnr (Member, Author) commented May 22, 2017

It all works in memory now; you just need to set EN_DATE/ZH_DATE. The final step would be to determine the latest dump date automatically via the JSON status files (https://dumps.wikimedia.org/enwiki/20170501/dumpstatus.json etc.)
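A hedged sketch of that last step, assuming dumpstatus.json exposes a per-job `status` field and that the multistream articles job is keyed `articlesmultistreamdump` (both the key and the use of `jq` are assumptions and may need adjusting):

```sh
# Pick the newest dump date whose multistream articles job reports "done".
# JSON key and jq dependency are assumptions, not verified against the importer.
WIKI=enwiki
for date in $(curl -s "https://dumps.wikimedia.org/${WIKI}/" | grep -oE '[0-9]{8}' | sort -urn); do
  status=$(curl -s "https://dumps.wikimedia.org/${WIKI}/${date}/dumpstatus.json" \
             | jq -r '.jobs.articlesmultistreamdump.status' 2>/dev/null)
  if [ "$status" = "done" ]; then
    echo "EN_DATE=${date}"
    break
  fi
done
```

The same loop could set ZH_DATE by swapping in zhwiki.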
