GSoC Road Map & Weekly Status
This project is being done as part of GSoC 2015 by Naveen Madhire
This is a sketch of the Road Map.
Milestones:
- Identify functions that are priority for re-generating Spotlight models - [Warm up/Bonding]
- Find/Fix a wikipedia dump parser (...Jsonpedia, Cloud9, Bliki..) - [Weeks: 1, 2, 3]
Modified and analyzed the JSON-wikipedia code to output paragraphs and links in the output file. This would be useful downstream for calculating token counts.
This week was about finding the right parser instead of relying on JSON-wikipedia for parsing the wiki dumps. I checked different wiki parsers such as Sweble, JsonPedia, JSON-wikipedia, and Cloud9.
Below is the comparison table I created after analyzing the different Wikipedia parsers.
Parser | License | Format | Parsing Logic | Working with Spark | Clean Text | Other | Cons | MultiLanguage
---|---|---|---|---|---|---|---|---
JsonPedia | Y - Apache 2.0 | JSON | | JSON output can be read in Spark | Y | Uses Jackson to convert to JSON | No start and end index of the links, just the link information | Claims to handle all languages
JsonWikipedia | N | JSON | Good parsing with templates | JSON output can be read in Spark | Need to add 1 more method to clean text | Uses GSON to convert to JSON | | Already supports a few languages; one has to create property files to add languages
Sweble | Y - Apache 2.0 | AST, Plain Text | Yes | Y | May have to write some custom code to convert the AST to plain text | | | Yes
Cloud9 | Y - Apache 2.0 | Plain Text | Good parsing logic for getting the plain text from the wikitext | May work with Spark | Y | | We have to add a few functions to get the real paragraphs from the wikitext and the associated links | Looks like only Arabic, Chinese, Czech, German, Spanish, Swedish, Turkish
wtf_wikipedia | N | JSON | Parses at a high level; looks like it doesn't integrate well with others | N | | | |
Next, work on implementing the JSON-wikipedia parsing logic by parsing the XML dump into elements of a Spark RDD.
Plan & Progress
Modified JSON-wikipedia to remove the boilerplate templates from the wiki text. [Json Wikipedia Code](https://github.com/naveenmadhire/json-wikipedia-dbspotlight)
Created a wikiparser.scala program to parse the XML dump and create individual articles in JSON format as elements of a Spark RDD.
[Wikipedia Extractor Code](https://github.com/naveenmadhire/wikipedia-stats-extractor)
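
A rough sketch of the idea (not the actual wikiparser.scala): `parsePageToJson` is a hypothetical wrapper around the JSON-wikipedia parser, and the input is assumed to have already been split so that each line holds one `<page>` element; a real run would need an XML-aware input format.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WikiParserSketch {
  // Placeholder for the JSON-wikipedia call; the real project delegates to its parser.
  def parsePageToJson(pageXml: String): String = ???

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wikiparser-sketch"))
    val pages = sc.textFile(args(0))              // dump pre-split into one page per line
    val articlesJson = pages.map(parsePageToJson) // one JSON article per RDD element
    articlesJson.saveAsTextFile(args(1))
    sc.stop()
  }
}
```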
The JSON output of JSON-wikipedia contains a few Unicode escape sequences, for example "\u0027". These need to be converted to regular text after the parsing in Scala.
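
One possible way to do this (an assumption, not necessarily the approach finally used) is to unescape the Java-style `\u0027` sequences after parsing, e.g. with Apache Commons Lang:

```scala
import org.apache.commons.lang3.StringEscapeUtils

val escaped = "It\\u0027s the capital of Germany"      // string containing a literal \u0027
val plain   = StringEscapeUtils.unescapeJava(escaped)  // "It's the capital of Germany"
```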
Made changes to JSON-wikipedia to fetch the redirect information from a class instead of individual language property files.
Currently working on fetching the category links information from a class instead of property files.
Task moved to next week: fix parsing issues with JSON-wikipedia.
Learning from previous weeks: over the course of the last 3 weeks, the main learning was understanding wiki parsing and making changes to suit the needs of the DBpedia Spotlight models.
Parsing issues with JSON-wikipedia. Writing test cases for the parsing logic; I will use the small wiki dataset for testing and verifying the whole parsing logic.
Fixed most of the parsing issues in the JSON-wikipedia code. Next: use DataFrames for reading the parsed RDDs so that the various counts can be calculated later on.
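
A minimal sketch of that step, assuming the parsed articles are written out as one JSON document per line (the path and table name are illustrative, Spark 1.x style to match the GSoC 2015 timeframe; in spark-shell `sc` is already provided):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)                       // sc: existing SparkContext
val articles  = sqlContext.read.json("parsed-articles/")  // infers a schema from the JSON
articles.printSchema()
articles.registerTempTable("articles")                    // enables SQL-style count queries
```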
Progress
This week I worked on fixing the json-wikipedia parsing issues:
- Removed the reference tags from the article text
- Fixed language identifiers
URI counts logic has been implemented in Scala and Spark in the [Wikipedia Extractor Code](https://github.com/naveenmadhire/wikipedia-stats-extractor) repository.
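
A minimal sketch of the URI-count idea (how often each DBpedia resource is linked to), shown on a tiny in-memory example; the real job reads the parsed articles instead, and the names here are illustrative:

```scala
// sc is an existing SparkContext (e.g. in spark-shell)
val linksPerArticle = sc.parallelize(Seq(
  Seq("Berlin", "Germany"),
  Seq("Berlin")
))

val uriCounts = linksPerArticle
  .flatMap(identity)               // one record per link occurrence
  .map(uri => (uri, 1L))
  .reduceByKey(_ + _)              // (URI, total count)

uriCounts.collect().foreach(println)   // e.g. (Berlin,2), (Germany,1)
```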
Challenges and Learning
- Implementation using Scala and Spark
- Faced a few errors during testing and was able to overcome most of the issues.
I worked on testing the FSA Spotter in Spotlight. I started working on surface form counts this week; most of the coding for surface form counts is complete. Faced a few issues with the Akka/Scala version mismatch between Spark and Spotlight, which were resolved by changing the version and rebuilding the jar.
This week my task is to complete the coding and testing of entity, surface form, and pair counts, and to start working on token counts, in order to stay in line with the plan to complete the project within the GSoC timeframe.
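
A minimal sketch of pair counts, i.e. how often a given surface form links to a given DBpedia URI; surface-form counts alone simply drop the URI from the key. The `(surfaceForm, uri)` pairs are illustrative, not the project's exact data structures:

```scala
val links = sc.parallelize(Seq(
  ("Berlin", "Berlin"),
  ("the German capital", "Berlin"),
  ("Berlin", "Berlin_(band)")
))

val pairCounts = links
  .map { case (sf, uri) => ((sf, uri), 1L) }
  .reduceByKey(_ + _)                          // ((surface form, URI), count)

val surfaceFormCounts = links
  .map { case (sf, _) => (sf, 1L) }
  .reduceByKey(_ + _)                          // (surface form, count)
```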
Completed the token counts and redirects implementation. Tested with a small wiki dump.
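
A minimal sketch of the token-count idea: for each linked DBpedia URI, count the tokens of the surrounding paragraph text. Whitespace tokenisation here is a simplification of whatever tokenisation the project actually uses:

```scala
val paragraphPerUri = sc.parallelize(Seq(
  ("Berlin", "Berlin is the capital of Germany"),
  ("Germany", "Germany is a country in central Europe")
))

val tokenCounts = paragraphPerUri
  .flatMap { case (uri, text) =>
    text.toLowerCase.split("\\s+").map(token => ((uri, token), 1L))
  }
  .reduceByKey(_ + _)                          // ((URI, token), count)
```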
This week's task is to test the various counts generated from a bigger dataset. I am planning to test with a 500 MB wiki dump and generate the counts. Encountered a few OOM issues during testing; trying to adjust a few configuration parameters to generate the counts on the large dataset.
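
An example of the kind of configuration tuning this involves; the exact values here are illustrative, not the settings finally used for the 500 MB dump:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("wikipedia-stats-extractor")
  .set("spark.executor.memory", "4g")         // more heap per executor
  .set("spark.driver.memory", "4g")           // larger driver heap if results are collected
  .set("spark.default.parallelism", "64")     // more partitions -> less memory per task
```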
Completed the coding and testing of all the wikipedia counts. Tested on a 500 MB wikipedia dump.
- Tentative plan for this week is to complete the raw wiki text generation with entities being replaced by DBpedia URLs.
- Write test cases for different scenarios.
- Run the whole Wikipedia dump with the latest code.
- Re-write additional non-priority functions - [Weeks: 8]
- Update Scripts to generate Spotlight Models (Quickstarter, Dbpedia-Spotlight/bin) - [Weeks: 9]
- SF discount (better automation) [Optional] - [Weeks: 10, 11]
- Generate new models for supported Languages - [Weeks: 12]