Naveen Madhire edited this page Aug 24, 2015 · 4 revisions

Re-modelling PigNlproc Project - Final Report for GSoC 2015

This project is a complete rewrite of PigNlproc. PigNlproc parses the Wikipedia dump and produces URI counts, surface form counts, and token counts, which DBpedia Spotlight then uses to build its models. The project is therefore useful for creating Spotlight models for different languages.

Below are the high-level activities covered as part of this project during GSoC 2015:

  1. Compute the various wikiStats counts: UriCounts, SfTotalCounts, PairCounts and TokenCounts.
  2. Replace links with DBpedia URIs throughout the raw wiki text.
  3. Convert the Wikipedia XML dump into a JSON dump that other projects can extend in the future.
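
As a rough illustration of the first two activities, here is a minimal Python sketch (not the project's actual code, which lives in the repos listed on this page). It assumes simplified `[[Target|surface form]]` wiki links, the English DBpedia namespace `http://dbpedia.org/resource/`, and hypothetical helper names (`to_uri`, `extract_links`, `wiki_counts`, `replace_links`). It only counts annotated surface forms, so it is narrower than the real SfTotalCounts and does not cover TokenCounts.

```python
import re
from collections import Counter

# Simplified link pattern: [[Target]] or [[Target|surface form]].
# Assumption: no nested links, file links, or templates are handled.
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")

# Assumption: English DBpedia resource namespace.
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"


def to_uri(target):
    """Map a link target to a DBpedia-style URI (hypothetical helper)."""
    title = target.strip().replace(" ", "_")
    # Wikipedia titles start with an upper-case letter.
    return DBPEDIA_PREFIX + title[:1].upper() + title[1:]


def extract_links(text):
    """Yield (uri, surface_form) for every link found in the text."""
    for match in WIKI_LINK.finditer(text):
        target = match.group(1)
        # Without an explicit label, the target itself is the surface form.
        surface = (match.group(2) or target).strip()
        yield to_uri(target), surface


def wiki_counts(text):
    """Return (uri_counts, sf_counts, pair_counts) for annotated links."""
    uri_counts, sf_counts, pair_counts = Counter(), Counter(), Counter()
    for uri, surface in extract_links(text):
        uri_counts[uri] += 1
        sf_counts[surface] += 1
        pair_counts[(surface, uri)] += 1
    return uri_counts, sf_counts, pair_counts


def replace_links(text):
    """Replace each wiki link with its DBpedia-style URI."""
    return WIKI_LINK.sub(lambda m: to_uri(m.group(1)), text)
```

Calling `wiki_counts(article_text)` on a single article returns three counters. A full run would sum them over every article in the dump, which is the kind of aggregation a distributed job is suited for.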

Below are the repos used as part of this project:

https://github.com/naveenmadhire/dbpedia-spotlight/tree/feature/scala-2.10

https://github.com/naveenmadhire/wikipedia-stats-extractor

https://github.com/naveenmadhire/json-wikipedia-dbspotlight

#### Known Issues or Logic to be modified

There are a few existing issues which can be addressed:

Issues
