Mid Term Report
This project is a complete rewrite of PigNlproc. PigNlproc parses the Wikipedia dump and creates URI counts, surface form counts, and token counts, which are then used by DBpedia Spotlight to create models. This project is therefore useful for building models in different languages.
GSOC Plan
This project has two main modules:
- Parsing the raw Wikipedia dump.
- Creating the respective counts from the Wikipedia text.
After a thorough analysis of different wiki parsers (JWPL, JsonPedia, JsonWikipedia, Cloud9, wtf_wikipedia), JsonWikipedia seemed the best fit for parsing each wiki article from the raw Wikipedia dump into JSON. The resulting JSON can then be processed with the Apache Spark framework to calculate the counts.
JSON-WikiPedia Changes
I made the changes below to json-wikipedia to make it suitable for extracting links, paragraph text, and other details needed for calculating the counts downstream:
- Added the start and end spans of the links identified in the wiki text during parsing.
- Removed the logic for parsing external links.
- Added a new ParagraphLinks element to the JSON output, which holds the paragraph text and all links (with start and end spans) associated with that paragraph. Example
- Added logic to output clean wiki article text. Example
- Added logic for parsing redirects.
- Removed unwanted and unused fields from the JSON output.
- Added logic to remove template ([TEMPLATE]) elements from the paragraph text and article text.
- Added logic to remove reference tags (such as <ref>…</ref>) from the article text. The logic is here
- Added additional language modifiers that were identified during testing. Logic
- Added a test case for link extraction. Here
- Schema of the Output JSON
|-- integerNamespace: long (nullable = true)
|-- lang: string (nullable = true)
|-- links: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- description: string (nullable = true)
| | |-- end: long (nullable = true)
| | |-- id: string (nullable = true)
| | |-- start: long (nullable = true)
|-- namespace: string (nullable = true)
|-- paragraphsLink: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- links: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- description: string (nullable = true)
| | | | |-- end: long (nullable = true)
| | | | |-- id: string (nullable = true)
| | | | |-- start: long (nullable = true)
| | |-- paraText: string (nullable = true)
|-- redirect: string (nullable = true)
|-- title: string (nullable = true)
|-- type: string (nullable = true)
|-- wid: long (nullable = true)
|-- wikiText: string (nullable = true)
|-- wikiTitle: string (nullable = true)
- Sample JSON Output
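As an illustration of the cleanup described above, the sketch below shows how [TEMPLATE] markers and reference tags can be stripped with regular expressions. Note that the regexes and the `TextCleaner` name are my own illustrative assumptions, not the actual json-wikipedia code (which is linked above):

```scala
// Illustrative sketch only -- the actual cleanup logic lives in json-wikipedia.
object TextCleaner {
  // [TEMPLATE] placeholder markers left behind after template handling.
  private val TemplateMarker = """\[TEMPLATE\]""".r
  // <ref .../> self-closing tags, or <ref ...>...</ref> pairs ((?s) lets . match newlines).
  private val RefTag = """(?s)<ref[^>]*/>|<ref[^>]*>.*?</ref>""".r

  def clean(wikiText: String): String = {
    val noRefs = RefTag.replaceAllIn(wikiText, "")
    val noTemplates = TemplateMarker.replaceAllIn(noRefs, "")
    // Collapse any double spaces the removals left behind.
    noTemplates.replaceAll("""\s{2,}""", " ").trim
  }
}
```

For example, `TextCleaner.clean("Berlin[TEMPLATE] is the capital<ref>cite</ref> of Germany.")` yields plain article text with both artifacts removed.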
Various Counts Changes
I implemented the logic for URI counts using Apache Spark and Scala.
- Apache Spark version 1.3.1 is used for calculating the various counts. Each article, parsed into JSON format, is stored as an element of an RDD. I used DataFrames to parse each JSON element and read only the relevant fields. Example below:
val pageRDDs = readFile(inputWikiDump, sc)
val sqlContext = new SQLContext(sc)
val dfWikiRDD = sqlContext.jsonRDD(pageRDDs)
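The URI-counting step itself can be sketched in plain Scala (without Spark) to show the logic: collect all extracted links, group them by their target URI, and count. The `Link` case class and `uriCounts` name here are illustrative assumptions; the actual implementation runs the equivalent logic over the Spark RDD shown above.

```scala
// A link as it appears in the parsed JSON schema (id = target URI,
// description = anchor text, start/end = character spans).
case class Link(id: String, description: String, start: Long, end: Long)

// Count how often each URI is linked across all articles.
// Spark equivalent (sketch): linksRDD.map(l => (l.id, 1)).reduceByKey(_ + _)
def uriCounts(articles: Seq[Seq[Link]]): Map[String, Int] =
  articles.flatten                                   // all links, all articles
    .groupBy(_.id)                                   // group by target URI
    .map { case (uri, links) => uri -> links.size }  // count per URI
```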
- The detailed parsing logic and URI-counts logic is present here.
Source Code to be Delivered
Where I am at Mid-Term
I have finished parsing the Wikipedia dump into JSON format and implementing the URI counts in Spark.
What to do in rest of GSOC
Below is the high-level plan for the rest of the summer.
- Adding test cases for the counts in Spark with a sample Wikipedia XML.
- Implementing the other counts that were used in PigNlproc:
  - Surface Form Counts
  - Pair Counts
  - Token Counts
- Testing parsing and counts for languages other than English.
- Resolving DBpedia identifiers (i.e. resolving redirects).
- Rewriting additional non-priority functions.
- Updating scripts to generate Spotlight models (Quickstarter, dbpedia-spotlight/bin).
- Surface form discount (better automation) [Optional].
- Generating new models for the supported languages.
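The redirect-resolution step in the plan can be sketched as follows: follow the redirect map until a canonical (non-redirect) title is reached, guarding against redirect cycles. This is an assumed sketch of the idea, not the planned implementation:

```scala
// Sketch: resolve a title through a redirect map (title -> redirect target),
// stopping at the first canonical title or when a cycle is detected.
def resolveRedirect(title: String, redirects: Map[String, String]): String = {
  @scala.annotation.tailrec
  def follow(t: String, seen: Set[String]): String =
    redirects.get(t) match {
      case Some(target) if !seen(target) => follow(target, seen + t)
      case _                             => t  // canonical title or cycle
    }
  follow(title, Set.empty)
}
```

For example, with redirects `"U.S." -> "USA"` and `"USA" -> "United States"`, resolving `"U.S."` follows the chain to `"United States"`, while a title with no redirect entry is returned unchanged.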
Challenges so far
- Understanding the various terminology used in the Wikipedia dump format.
- Transitioning from Java to Scala was tedious at the beginning. However, a little practice and the concepts in Twitter's Scala School Link helped me through the transition.