Final Report
Naveen Madhire edited this page Aug 24, 2015 · 4 revisions
This project is a complete rewrite of PigNlproc. PigNlproc parses the Wikipedia dump and produces URI counts, surface form counts, and token counts, which DBpedia Spotlight then uses to create its models. The project is therefore useful for building models for different languages.
Below are the high-level activities covered as part of this project during GSoC 2015:
- Various wikiStats counts: UriCounts, SfTotalCounts, PairCounts and TokenCounts.
- Links replaced by DBpedia URIs throughout the raw wiki text.
- Conversion of the Wikipedia XML dump into a JSON dump, which other projects can extend in the future.
Below are the repos used as part of this project:
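To make the counts listed above more concrete, here is a minimal Python sketch of how URI counts, pair counts and token counts could be derived from raw wiki text. It is not the project's actual implementation. The function names (`to_uri`, `wiki_stats`), the link regex, and the exact definition of each count are assumptions made for illustration, and SfTotalCounts is left out because its definition is not given here.

```python
import re
from collections import Counter

# Assumption: links look like [[Target]] or [[Target|anchor text]].
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"


def to_uri(title):
    """Map a wiki page title to a DBpedia-style URI (hypothetical helper)."""
    title = title.strip().replace(" ", "_")
    return DBPEDIA_PREFIX + title[:1].upper() + title[1:]


def wiki_stats(wikitext):
    """Return (uri_counts, pair_counts, token_counts) for one text.

    Assumed definitions, for illustration only:
      uri_counts   - how often each URI is a link target
      pair_counts  - how often each (surface form, URI) pair occurs
      token_counts - how often each lowercased token occurs in the text
    """
    uri_counts = Counter()
    pair_counts = Counter()
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1)
        # Without explicit anchor text, the target itself is the surface form.
        surface = (match.group(2) or target).strip()
        uri = to_uri(target)
        uri_counts[uri] += 1
        pair_counts[(surface, uri)] += 1

    # Strip link markup, keeping only the visible text, before tokenizing.
    plain = LINK_RE.sub(lambda m: m.group(2) or m.group(1), wikitext)
    token_counts = Counter(re.findall(r"\w+", plain.lower()))
    return uri_counts, pair_counts, token_counts


sample = ("[[Berlin]] is the capital of [[Germany|the German state]]. "
          "[[Berlin]] is large.")
uri_counts, pair_counts, token_counts = wiki_stats(sample)
```

On a full dump the same counting would be distributed across many pages and the per-page counters merged, which this single-text sketch does not attempt.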
https://github.com/naveenmadhire/dbpedia-spotlight/tree/feature/scala-2.10
https://github.com/naveenmadhire/wikipedia-stats-extractor
https://github.com/naveenmadhire/json-wikipedia-dbspotlight
#### Known Issues or Logic to be modified
There are a few existing issues which can be addressed: