Named Entity Recognition and Named Entity Linking with Stanza to create a Knowledge Graph out of RSS News and Subreddits

"data" folder: Contains all the data of the project.

"data/raw" folder: Contains all news crawled from Reddit and RSS feeds between 22.03.2021 and 05.04.2021. Data crawled with the 0_crawlReddit.py and 0_crawlRSS.py scripts.

"data/prepprocessed" folder: Contains all the preprocessed news. Files created by 1_preprocess.py.

"data/news_stanza.json": Contains all news processed by Stanza, with the ids of connected entities. File created by 2_NER_stanza.py.

"data/entities.json": Contains all entities detected by Stanza, including connected news_ids. File created by 2_NER_stanza.py.

"data/entities_wikified.json": Contains all entities, including DBpedia resources if available. File created by 3_SPARQL_Wikify.py.

"data/entities_wikified_checked.json": Contains all entities, including DBpedia resources and a boolean indicating whether the Stanza type matches the DBpedia ontology class of the resource. File created by 4_SPARQL_Check.py.

0_crawlReddit.py: Crawls the subreddit "news" using the python library praw and stores new posts into the "data/raw/reddit" folder.
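A minimal sketch of such a crawl step, not the script itself. The stored fields, the one-file-per-post layout, and the helper names (`submission_to_record`, `store_if_new`, `crawl`) are assumptions for illustration; only `praw.Reddit`, `subreddit("news")`, and `.new()` are taken from the praw library.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def submission_to_record(submission):
    """Pick a few fields from a praw Submission-like object (field choice is an assumption)."""
    created = datetime.fromtimestamp(submission.created_utc, tz=timezone.utc)
    return {
        "id": submission.id,
        "title": submission.title,
        "url": submission.url,
        "created": created.isoformat(),
    }


def store_if_new(record, out_dir):
    """Write the record to <out_dir>/<id>.json unless that file already exists."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{record['id']}.json"
    if path.exists():
        return False
    path.write_text(json.dumps(record), encoding="utf-8")
    return True


def crawl(client_id, client_secret, user_agent, out_dir="data/raw/reddit", limit=100):
    """Fetch the newest posts of r/news and store the ones not seen before."""
    import praw  # imported lazily so the helpers above work without praw installed

    reddit = praw.Reddit(
        client_id=client_id, client_secret=client_secret, user_agent=user_agent
    )
    stored = 0
    for submission in reddit.subreddit("news").new(limit=limit):
        stored += store_if_new(submission_to_record(submission), out_dir)
    return stored
```

Running `crawl(...)` repeatedly only adds posts whose id has no file yet, which is one simple way to store "new posts" only.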

0_crawlRSS.py: Crawls the declared RSS feeds using the python library feedparser and stores new feed entries into the "data/raw/FEED_NAME" folder.
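The RSS crawl could look roughly like the sketch below. The feed dictionary, the hashed file names, and the helper names are hypothetical; the only library call used is `feedparser.parse(url).entries`.

```python
import hashlib
import json
from pathlib import Path

# Hypothetical feed declaration; the real script declares its own feeds.
FEEDS = {"example_feed": "https://example.com/rss.xml"}


def entry_key(entry):
    """Derive a stable file name from the entry id, link, or title."""
    raw = entry.get("id") or entry.get("link") or entry.get("title", "")
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def entry_to_record(entry):
    """Keep a small, JSON-serialisable subset of the entry (field choice is an assumption)."""
    return {
        "title": entry.get("title"),
        "link": entry.get("link"),
        "published": entry.get("published"),
        "summary": entry.get("summary"),
    }


def store_entries(feed_name, entries, root="data/raw"):
    """Store entries under <root>/<feed_name>/ and return how many were new."""
    folder = Path(root) / feed_name
    folder.mkdir(parents=True, exist_ok=True)
    new = 0
    for entry in entries:
        path = folder / f"{entry_key(entry)}.json"
        if not path.exists():
            path.write_text(json.dumps(entry_to_record(entry)), encoding="utf-8")
            new += 1
    return new


def crawl_all(feeds=FEEDS, root="data/raw"):
    """Parse every declared feed and store its unseen entries."""
    import feedparser  # imported lazily so the helpers above work without it

    return {
        name: store_entries(name, feedparser.parse(url).entries, root)
        for name, url in feeds.items()
    }
```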

1_preprocess.py: Preprocesses all reddit and rss data to align field names and datetimes, and creates an index for all news.
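To illustrate what aligning field names and datetimes can mean, here is a hedged sketch. It assumes Reddit records carry an epoch timestamp in `created_utc` and RSS records carry an RFC 822 date string in `published`; the target field names and the integer index are illustrative choices, not the script's actual schema.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def normalize_reddit(raw):
    """Map a raw Reddit record to the shared schema (UTC ISO 8601 datetime)."""
    created = datetime.fromtimestamp(raw["created_utc"], tz=timezone.utc)
    return {
        "source": "reddit",
        "title": raw["title"],
        "url": raw["url"],
        "datetime": created.isoformat(),
    }


def normalize_rss(raw, feed_name):
    """Map a raw RSS record to the shared schema.

    Assumes the date string carries a timezone; a naive date would be
    interpreted as local time by astimezone().
    """
    published = parsedate_to_datetime(raw["published"]).astimezone(timezone.utc)
    return {
        "source": feed_name,
        "title": raw["title"],
        "url": raw["link"],
        "datetime": published.isoformat(),
    }


def build_index(items):
    """Sort items chronologically and assign consecutive integer news ids."""
    ordered = sorted(items, key=lambda item: item["datetime"])
    return {news_id: item for news_id, item in enumerate(ordered)}
```

Because every datetime is converted to UTC before formatting, the ISO strings sort chronologically, so the index order is the publication order.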

2_NER_stanza.py: Runs stanza named entity recognition pipeline on the preprocessed news and stores detected entities into "data/entities.json" and news into "data/news_stanza.json".
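The sketch below shows one way to produce the two cross-referenced structures (news with entity ids, entities with news ids). It assumes each news item has a `title` field and that entities are keyed by text and type; both are assumptions. The Stanza part uses `stanza.Pipeline` with the `tokenize,ner` processors and reads `doc.ents`, which requires the English models to be downloaded first with `stanza.download("en")`.

```python
def stanza_extractor():
    """Return a function that maps text to a list of (entity text, entity type) pairs."""
    import stanza  # imported lazily; needs stanza.download("en") beforehand

    nlp = stanza.Pipeline(lang="en", processors="tokenize,ner")

    def extract(text):
        return [(ent.text, ent.type) for ent in nlp(text).ents]

    return extract


def link_entities(news, extract):
    """Build news records with entity ids and entity records with news ids.

    `news` maps news_id -> item; `extract` is any callable returning
    (text, type) pairs, such as the one from stanza_extractor().
    """
    entities = {}  # (text, type) -> entity record
    news_out = {}
    for news_id, item in news.items():
        entity_ids = []
        for text, ent_type in extract(item["title"]):
            key = (text, ent_type)
            if key not in entities:
                entities[key] = {
                    "id": len(entities),
                    "text": text,
                    "type": ent_type,
                    "news_ids": [],
                }
            record = entities[key]
            if news_id not in record["news_ids"]:
                record["news_ids"].append(news_id)
            if record["id"] not in entity_ids:
                entity_ids.append(record["id"])
        news_out[news_id] = {**item, "entity_ids": entity_ids}
    return news_out, list(entities.values())
```

With Stanza installed, `link_entities(news, stanza_extractor())` would yield the content for both JSON files; passing the extractor as a parameter keeps the linking logic testable without loading a model.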

3_SPARQL_Wikify.py: Runs a SPARQL query against the DBpedia SPARQL endpoint to check whether a resource exists for each entity.
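As an illustration, an existence check could match the entity text against English `rdfs:label` values on the public DBpedia endpoint. This lookup strategy is an assumption; the queries actually used are listed in SPARQL_Queries.txt. The sketch uses only the standard library and the SPARQL protocol's `query` parameter.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://dbpedia.org/sparql"


def build_lookup_query(label):
    """Build a SELECT query for a resource whose English label equals `label`."""
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        f'SELECT ?resource WHERE {{ ?resource rdfs:label "{escaped}"@en }} LIMIT 1'
    )


def run_select(query, endpoint=ENDPOINT):
    """Send the query via HTTP GET and return the JSON result bindings."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["results"]["bindings"]


def first_resource(bindings):
    """Return the URI of the first matching resource, or None if there is none."""
    return bindings[0]["resource"]["value"] if bindings else None
```

A call such as `first_resource(run_select(build_lookup_query("Berlin")))` would return a resource URI or `None`, which could then be stored next to the entity.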

4_SPARQL_Check.py: Runs a SPARQL query against the dbpedia SPARQL endpoint to check if the detected stanza class matches the resource ontology.
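One possible shape for this check is a SPARQL `ASK` query per entity. The mapping from Stanza entity types to DBpedia ontology classes below is a small illustrative assumption, not the mapping used by the script.

```python
import json
import urllib.parse
import urllib.request

# Assumed mapping from Stanza NER types to DBpedia ontology classes.
TYPE_TO_CLASS = {
    "PERSON": "Person",
    "ORG": "Organisation",
    "GPE": "Place",
    "LOC": "Place",
}


def build_type_check_query(resource_uri, stanza_type):
    """Build an ASK query testing the resource's rdf:type, or None if the type is unmapped."""
    dbo_class = TYPE_TO_CLASS.get(stanza_type)
    if dbo_class is None:
        return None
    # Reject characters that are not allowed inside a SPARQL IRI reference.
    if any(ch in resource_uri for ch in '<>"{}|\\^` '):
        raise ValueError(f"unsafe resource URI: {resource_uri!r}")
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
        f"ASK {{ <{resource_uri}> rdf:type dbo:{dbo_class} }}"
    )


def ask(query, endpoint="https://dbpedia.org/sparql"):
    """Run an ASK query and return its boolean answer."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return bool(json.load(response)["boolean"])
```

Entity types without a mapped class yield no query, so the caller has to decide whether to record such entities as unchecked or as non-matching.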

5_News.yarrr.yml / rules.rml.ttl / graph.ttl: Use the following command to generate the RML mapping from the YARRRML mapping file:

yarrrml-parser -i .\5_News.yarrr.yml -o rules.rml.ttl

In a next step, the following command generates the Turtle triples:

java -jar .\rmlmapper.jar -m .\rules.rml.ttl -o graph.ttl

6_CountEntities.py: Python script to count the named entities and dbpedia resources without a type.
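A counting step along these lines might look as follows. The field names `type`, `resource`, and `type_match` are assumptions about the layout of "data/entities_wikified_checked.json", and "without a type" is interpreted here as linked resources whose type did not match.

```python
from collections import Counter


def count_entities(entities):
    """Summarise entity counts per type and how many linked resources lack a matching type."""
    per_type = Counter(entity["type"] for entity in entities)
    linked = [entity for entity in entities if entity.get("resource")]
    unmatched = [entity for entity in linked if not entity.get("type_match")]
    return {
        "total": len(entities),
        "per_type": dict(per_type),
        "linked": len(linked),
        "linked_without_matching_type": len(unmatched),
    }
```

The list passed in could come from `json.load` on the checked entities file, provided the records use the assumed field names.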

requirements.txt: All used python libraries.

SPARQL_Queries.txt: Used queries and results.

obensch/NER_NEL_KG
