Mid Term Report
This project is a complete rewrite of PigNlproc. PigNlproc parses the Wikipedia dump and creates URI counts, surface form counts, and token counts, which are then used by DBpedia Spotlight to create models. This project is therefore useful for building models in different languages.
GSOC Plan
This project has two main modules:
- Parsing the raw Wikipedia dump.
- Creating the respective counts from the Wikipedia text.
After a thorough analysis of different wiki parsers (JWPL, JsonPedia, JsonWikipedia, Cloud9, wtf_wikipedia), JsonWikipedia seemed the best fit for parsing each wiki article from the raw Wikipedia dump into JSON. The resulting JSON can then be processed with the Apache Spark framework to calculate the counts.
JSON-WikiPedia Changes
I made the changes below to json-wikipedia to make it suitable for extracting links, paragraph text, and other details needed for calculating the counts downstream:
- Added the start and end spans of the links identified in the wiki text during parsing.
- Removed the logic for parsing external links.
- Added a new ParagraphLinks element to the JSON output, which holds the paragraph text and all links (with start and end spans) associated with that paragraph. Example
- Added logic to output clean wiki article text. Example
- Added logic for parsing redirects.
- Removed unwanted and unused fields from the JSON output.
- Added logic to remove template ([TEMPLATE]) elements from the paragraph text and article text.
- Added logic to remove reference tags (such as <ref>…</ref>) from the article text. The logic is here
- Added additional language modifiers that were identified during testing. Logic
- Added a test case for link extraction. Here
- Schema of the Output JSON
|-- integerNamespace: long (nullable = true)
|-- lang: string (nullable = true)
|-- links: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- description: string (nullable = true)
| | |-- end: long (nullable = true)
| | |-- id: string (nullable = true)
| | |-- start: long (nullable = true)
|-- namespace: string (nullable = true)
|-- paragraphsLink: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- links: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- description: string (nullable = true)
| | | | |-- end: long (nullable = true)
| | | | |-- id: string (nullable = true)
| | | | |-- start: long (nullable = true)
| | |-- paraText: string (nullable = true)
|-- redirect: string (nullable = true)
|-- title: string (nullable = true)
|-- type: string (nullable = true)
|-- wid: long (nullable = true)
|-- wikiText: string (nullable = true)
|-- wikiTitle: string (nullable = true)
- Sample JSON Output
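As an illustration of the cleanup described above, the sketch below shows how [TEMPLATE] markers and reference tags can be stripped with regular expressions. Note that the regexes and the `TextCleaner` name are my own illustrative assumptions, not the actual json-wikipedia code (which is linked above):

```scala
// Illustrative sketch only -- the actual cleanup logic lives in json-wikipedia.
object TextCleaner {
  // [TEMPLATE] placeholder markers left behind after template handling.
  private val TemplateMarker = """\[TEMPLATE\]""".r
  // <ref .../> self-closing tags, or <ref ...>...</ref> pairs ((?s) lets . match newlines).
  private val RefTag = """(?s)<ref[^>]*/>|<ref[^>]*>.*?</ref>""".r

  def clean(wikiText: String): String = {
    val noRefs = RefTag.replaceAllIn(wikiText, "")
    val noTemplates = TemplateMarker.replaceAllIn(noRefs, "")
    // Collapse any double spaces the removals left behind.
    noTemplates.replaceAll("""\s{2,}""", " ").trim
  }
}
```

For example, `TextCleaner.clean("Berlin[TEMPLATE] is the capital<ref>cite</ref> of Germany.")` yields plain article text with both artifacts removed.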
Various Counts Changes
I implemented the logic for URI counts using Apache Spark and Scala.
- Apache Spark version 1.3.1 is used for calculating the various counts. Each article, parsed into JSON format, is stored as an element of an RDD. I used DataFrames to parse each JSON element and read only the relevant fields. Example below:
val pageRDDs = readFile(inputWikiDump, sc)
val sqlContext = new SQLContext(sc)
val dfWikiRDD = sqlContext.jsonRDD(pageRDDs)
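The URI-counting step itself can be sketched in plain Scala (without Spark) to show the logic: collect all extracted links, group them by their target URI, and count. The `Link` case class and `uriCounts` name here are illustrative assumptions; the actual implementation runs the equivalent logic over the Spark RDD shown above.

```scala
// A link as it appears in the parsed JSON schema (id = target URI,
// description = anchor text, start/end = character spans).
case class Link(id: String, description: String, start: Long, end: Long)

// Count how often each URI is linked across all articles.
// Spark equivalent (sketch): linksRDD.map(l => (l.id, 1)).reduceByKey(_ + _)
def uriCounts(articles: Seq[Seq[Link]]): Map[String, Int] =
  articles.flatten                                   // all links, all articles
    .groupBy(_.id)                                   // group by target URI
    .map { case (uri, links) => uri -> links.size }  // count per URI
```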
- The detailed parsing logic and URI-counts logic is present here.
Source Code to be Delivered
Where I am at Mid-Term
I have finished parsing the Wikipedia dump into JSON format and implementing the URI counts in Spark.
What to do in rest of GSOC
Below is the high-level plan for the rest of the summer.
- Adding test cases for the counts in Spark with a sample Wikipedia XML.
- Implementing the other counts that were used in PigNlproc:
  - Surface Form Counts
  - Pair Counts
  - Token Counts
- Testing parsing and counts for languages other than English.
- Resolving DBpedia identifiers (i.e. resolving redirects).
- Rewriting additional non-priority functions.
- Updating scripts to generate Spotlight models (Quickstarter, dbpedia-spotlight/bin).
- Surface form discount (better automation) [Optional].
- Generating new models for the supported languages.
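The redirect-resolution step in the plan can be sketched as follows: follow the redirect map until a canonical (non-redirect) title is reached, guarding against redirect cycles. This is an assumed sketch of the idea, not the planned implementation:

```scala
// Sketch: resolve a title through a redirect map (title -> redirect target),
// stopping at the first canonical title or when a cycle is detected.
def resolveRedirect(title: String, redirects: Map[String, String]): String = {
  @scala.annotation.tailrec
  def follow(t: String, seen: Set[String]): String =
    redirects.get(t) match {
      case Some(target) if !seen(target) => follow(target, seen + t)
      case _                             => t  // canonical title or cycle
    }
  follow(title, Set.empty)
}
```

For example, with redirects `"U.S." -> "USA"` and `"USA" -> "United States"`, resolving `"U.S."` follows the chain to `"United States"`, while a title with no redirect entry is returned unchanged.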
Challenges so far
- Understanding the various terminology used in the Wikipedia dump format.
- Transitioning from Java to Scala was tedious at the beginning. However, a little practice and the concepts in Twitter's Scala School Link helped me through the transition.