
GSoC Road Map & Weekly Status


GSoC 2015

This project is being done as part of GSoC 2015 by Naveen Madhire.

This is a sketch of the Road Map.

Milestones:

  • Identify the functions that are a priority for re-generating Spotlight models - [Warm up/Bonding]

  • Find/fix a Wikipedia dump parser (e.g. JsonPedia, Cloud9, Bliki, ...) - [Weeks: 1, 2, 3]

Week 1 - 25th May to 29th May

Analyzed and modified the JSON-wikipedia code to include paragraphs and links in the output file. This will be useful downstream for calculating token counts.

Week 2 - 1st June to 5th June

This week was about finding the right parser instead of relying on JSON-wikipedia for parsing the wiki dumps. I've evaluated different wiki parsers, including Sweble, JsonPedia, JSON-wikipedia, and Cloud9.

Below is the comparison table I created after analyzing the different Wikipedia parsers.

| Parser | License | Format | Parsing Logic | Working with Spark | Clean Text | Other Cons | Multi-Language |
| --- | --- | --- | --- | --- | --- | --- | --- |
| JsonPedia | Y - Apache 2.0 | JSON | | JSON output can be read in Spark | Y | Uses Jackson to convert to JSON; no start and end index of the links, just the link information | Claims to handle all languages |
| JsonWikipedia | N | JSON | Good parsing with templates | JSON output can be read in Spark | Need to add one more method to clean text | Uses GSON to convert to JSON | Already supports a few languages; one has to create property files to use other languages |
| Sweble | Y - Apache 2.0 | AST, Plain Text | | Yes | Y | May have to write some custom code to convert the AST to plain text | Yes |
| Cloud9 | Y - Apache 2.0 | Plain Text | Good parsing logic for getting plain text from the wikitext | May work with Spark | Y | We have to add a few functions to get the real paragraphs and their associated links from the wikitext | Looks like only Arabic, Chinese, Czech, German, Spanish, Swedish, Turkish |
| wtf_wikipedia | N | JSON | At a high level | Looks like it doesn't integrate well with others | | | N |

Week 3 - 8th June to 12th June

Work on implementing the JSON-wikipedia parsing logic so that the XML dump is parsed into elements of a Spark RDD.

Plan & Progress

Modified JSON-wikipedia to remove the boilerplate templates from the wiki text. [Json Wikipedia Code](https://github.com/naveenmadhire/json-wikipedia-dbspotlight)

Created a wikiparser.scala program that parses the XML dump and creates individual articles in JSON format as elements of a Spark RDD.

[Wikipedia Extractor Code](https://github.com/naveenmadhire/wikipedia-stats-extractor)
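For illustration, here is a minimal Scala/Spark sketch of the idea behind wikiparser.scala: cut the XML dump into `<page>` fragments and emit one JSON document per article as an RDD element. The regex-based splitting and the placeholder JSON output are simplifications; the real code delegates the page parsing to Json-wikipedia, and a production job would use an XML-aware input format instead of reading the whole file at once.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WikiXmlToJsonSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("wiki-xml-to-json").setMaster("local[*]"))

    // For a small dump, read the whole file and cut it into <page>...</page>
    // fragments; a real job would use an XML-aware InputFormat so that the
    // dump can be split across executors.
    val dump  = sc.wholeTextFiles(args(0)).map { case (_, content) => content }
    val pages = dump.flatMap("""(?s)<page>.*?</page>""".r.findAllIn(_))

    // Stand-in for the Json-wikipedia article parser: turn one <page>
    // fragment into one JSON document (here only the title is kept).
    val jsonArticles = pages.map { pageXml =>
      val title = """<title>(.*?)</title>""".r
        .findFirstMatchIn(pageXml).map(_.group(1)).getOrElse("")
      s"""{"title": "$title"}"""
    }

    jsonArticles.saveAsTextFile(args(1))   // one JSON article per line
  }
}
```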

The JSON output of Json-wikipedia contains a few Unicode escape sequences, for example "\u0027". These need to be converted back to regular text in Scala after parsing.
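One possible way to do this conversion, assuming Apache Commons Lang is on the classpath (the actual fix in the project may differ), is to unescape the Java-style \uXXXX sequences after reading each JSON line:

```scala
import org.apache.commons.lang3.StringEscapeUtils

// The JSON output may contain the 6-character sequence \u0027
// (backslash, 'u', '0', '0', '2', '7') instead of an apostrophe.
val raw     = "Dylan" + "\\" + "u0027s first album"   // built this way to keep the backslash literal
val cleaned = StringEscapeUtils.unescapeJava(raw)     // "Dylan's first album"
println(cleaned)
```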

Made changes to Json-wikipedia to fetch the redirect information from a class instead of the individual language property files.

Currently working on fetching the category links information from a class instead of property files.

Task moved to next week: fix parsing issues with Json-wikipedia.

Week 4 - 15th June to 19th June

Learning from previous weeks: over the course of the last 3 weeks, the main learning was understanding wiki parsing and making changes to suit the needs of the DBpedia Spotlight models.

Fix the parsing issues with Json-wikipedia and write test cases for the parsing logic. I will use a small wiki dataset for testing and verifying the whole parsing logic.

Week 5 - 22nd June to 26th June

Fixed most of the parsing issues in the Json-wikipedia code. Next, implement DataFrames for reading the parsed RDDs so that the various counts can be calculated later on.
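A minimal sketch of what that could look like with the Spark 1.x DataFrame API (the column name used below is an assumption, not the parser's actual output schema):

```scala
import org.apache.spark.sql.SQLContext

// Read the one-JSON-article-per-line output produced by the parser into a
// DataFrame; Spark infers the schema from the JSON fields.
val sqlContext = new SQLContext(sc)          // sc: an existing SparkContext
val articles   = sqlContext.read.json("parsed-articles/")

articles.printSchema()                       // inspect the inferred schema
// "title" is a hypothetical column; the real field names depend on the parser output.
articles.groupBy("title").count().show(5)
```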

Progress

This week I've worked on fixing the json-wikipedia parsing issues:

  1. Removed the reference tags from the article text
  2. Fixed language identifiers

The URI counts logic has been implemented in Scala and Spark here.
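As a rough illustration of the counting pattern (a sketch only, not the actual wikipedia-stats-extractor code), URI counts boil down to a map/reduceByKey over the links extracted from every article, where each link is treated as an (anchor text, target URI) pair:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UriCountsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("uri-counts").setMaster("local[*]"))

    // Each element is one extracted wiki link: (anchor text, target URI).
    // In the real pipeline these come from the parsed JSON articles.
    val links = sc.parallelize(Seq(
      ("Berlin", "Berlin"),
      ("the German capital", "Berlin"),
      ("Paris", "Paris")))

    // URI counts: how often each DBpedia resource is the target of a link.
    val uriCounts = links
      .map { case (_, uri) => (uri, 1L) }
      .reduceByKey(_ + _)

    uriCounts.collect().foreach(println)   // (Berlin,2), (Paris,1)
  }
}
```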

Challenges and Learning

  1. Implementation using Scala and Spark
  2. Faced a few errors during testing and was able to overcome most of the issues.

Week 6 - 29th June to 3rd July

I worked on testing the FSA Spotter in Spotlight and started working on surface form counts this week. Most of the coding for surface form counts is complete. I faced a few issues with akka/scala version mismatches between Spark and Spotlight, which were resolved by changing the versions and rebuilding the jar.
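Conceptually, the surface form counts are the mirror image of the URI counts sketched above: the anchor text, rather than the link target, becomes the key. A spark-shell style sketch (the pair-based input shape is an assumption, not the project's exact schema):

```scala
import org.apache.spark.rdd.RDD

// Surface form counts: how often each anchor text is used as a link,
// regardless of which resource it points to.
def surfaceFormCounts(links: RDD[(String, String)]): RDD[(String, Long)] =
  links
    .map { case (anchorText, _) => (anchorText, 1L) }
    .reduceByKey(_ + _)
```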

Week 7 - 5th July to 10th July

This week my task is to complete the coding and testing of entity, surface form, and pair counts, and to start working on token counts, in order to stay in line with the plan to complete the project within the GSoC timeframe.

Week 8 - 13th July to 17th July

Completed the token counts and redirects implementation. Tested with a small wikidump set.
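For context, a rough sketch of the token-count idea (whitespace tokenisation and a (URI, paragraph text) input shape are simplifying assumptions, not the project's exact code): for every resource, count the tokens occurring in the paragraphs that link to it.

```scala
import org.apache.spark.rdd.RDD

// Token counts: for each (uri, paragraph text) pair produced by the parser,
// emit one ((uri, token), 1) record per token and sum them up, giving a
// per-resource context token distribution.
def tokenCounts(paragraphs: RDD[(String, String)]): RDD[((String, String), Long)] =
  paragraphs
    .flatMap { case (uri, text) =>
      text.toLowerCase.split("\\s+").filter(_.nonEmpty)
        .map(token => ((uri, token), 1L))
    }
    .reduceByKey(_ + _)
```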

Week 9 - 20th July to 24th July

This week's task is to test the various counts generated from a big dataset. I am planning to test with a 500 MB wikidump and generate the counts. Encountered a few OOM issues during testing; currently adjusting a few configuration parameters for generating the counts on the large dataset.
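For reference, these are the kinds of Spark settings one would typically adjust for such OOM issues; the values below are examples only, not the configuration actually used in the project.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Example knobs for memory pressure: bigger executors and more,
// smaller partitions. Values are illustrative.
val conf = new SparkConf()
  .setAppName("wikipedia-stats-extractor")
  .set("spark.executor.memory", "6g")
  .set("spark.default.parallelism", "200")   // more partitions -> smaller tasks
// Driver memory usually has to be set before the JVM starts,
// e.g. spark-submit --driver-memory 4g
val sc = new SparkContext(conf)
```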

Week 10 - 27th July to 31st July

Completed the coding and testing of all the Wikipedia counts. Tested on a 500 MB Wikipedia dump.

Week 11 - 3rd Aug to 8th Aug

  1. The tentative plan for this week is to complete the raw wiki text generation, with entities replaced by their DBpedia URLs.

  2. Write test cases for different scenarios

  3. Run the latest code on the whole Wikipedia dump.

  • Re-write additional non-priority functions - [Weeks: 8]
  • Update scripts to generate Spotlight models (Quickstarter, Dbpedia-Spotlight/bin) - [Weeks: 9]
  • SF discount (better automation) [Optional] - [Weeks: 10, 11]
  • Generate new models for supported languages - [Weeks: 12]