Instead of downloading the full Wikipedia dump, extracting it, and then running a ragel script over the XML file, can we just do it all in memory? Pseudocode:

curl -s http://dumps.wikimedia.org/.../enwiki-20170220-pages-articles-multistream.xml.bz2 | bzcat | ./extract-movies enwiki

Rationale: Having 100 GB of free space is a rare occurrence for me.
Update: This is now possible because I've replaced the ragel script with a little C++ tool that is capable of streaming.
It is also slower by a factor of 10, taking 33 minutes instead of 3 to process the zhwiki dump. If the enwiki run can finish overnight (<8h), that's still good enough.
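For reference, here is a minimal sketch (not the actual extract-movies code) of how a streaming extractor can consume the decompressed XML from stdin, which is what makes the `curl | bzcat | ./extract-movies` pipeline work without materializing the dump on disk. The `<title>` scanning below is only an illustrative stand-in for the real extraction logic:

```cpp
// Sketch of a streaming extractor: reads decompressed XML from stdin line by
// line, so the full dump never has to exist on disk or in memory at once.
#include <iostream>
#include <string>

int main() {
    std::ios::sync_with_stdio(false);
    std::string line;
    // std::getline consumes the pipe as data arrives from bzcat,
    // keeping only one line in memory at a time.
    while (std::getline(std::cin, line)) {
        const auto start = line.find("<title>");
        if (start == std::string::npos) continue;
        const auto end = line.find("</title>", start);
        if (end == std::string::npos) continue;
        // Print the page title; a real tool would parse the page body instead.
        std::cout << line.substr(start + 7, end - start - 7) << '\n';
    }
    return 0;
}
```

Invoked as the last stage of the pipeline (curl piped through bzcat), it processes pages as they are decompressed, at the cost of being bound by single-threaded parsing speed.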