A data processing pipeline that schedules and runs content harvesters, normalizes their data, and outputs that normalized data to a variety of output streams. Data collected can be explored at https://osf.io/share/, and viewed at https://osf.io/api/v1/share/. Developer docs can be viewed at https://osf.io/wur56/wiki

stitchinthyme/scrapi

 
 

scrapi

Getting started

  • You will need to:
    • Install the Python requirements
    • Install Elasticsearch
    • Install Cassandra
    • Install the harvesters
    • Install RabbitMQ

Requirements

  • Create and activate a virtual environment for scrapi, then go to the top-level project directory. From there, run
$ pip install -r requirements.txt

to download and install the project's Python requirements.
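To confirm that the install step above took effect, something like the following sketch could be used. It is not part of scrapi: the function names are hypothetical, it assumes a modern Python 3 interpreter (for `importlib.metadata`), and it only understands simple `name==version` style lines, not URLs or editable installs.

```python
import re
from importlib import metadata


def requirement_names(lines):
    """Extract bare distribution names from requirements.txt-style lines."""
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options (-r, -e, ...)
            continue
        # The name is the leading run of characters before any version specifier.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_requirements(names):
    """Return the names that are not installed in the current environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


def check_file(path="requirements.txt"):
    """Read a requirements file and report which entries are not installed."""
    with open(path) as handle:
        return missing_requirements(requirement_names(handle))
```

Calling `check_file()` from the top-level project directory, inside the virtual environment, should return an empty list once every requirement is installed.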

Installing Cassandra and Elasticsearch

Note: JDK 7 must be installed for Cassandra and Elasticsearch to run.

Mac OS X

$ brew install cassandra
$ brew install elasticsearch

Then start both services:

$ cassandra
$ elasticsearch

Or, to keep Cassandra in the foreground, bound to your current terminal session, run:

$ cassandra -f

and you should be good to go.
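To check that both services actually came up, a small sketch like the one below can probe their TCP ports. It is an illustration, not part of scrapi: the function names are hypothetical, and the ports are assumed defaults (9042 for Cassandra's native CQL transport, 9200 for Elasticsearch's HTTP API). Depending on version and configuration, your installation may listen elsewhere.

```python
import socket

# Assumed default ports; adjust to match your local configuration.
DEFAULT_PORTS = {"cassandra": 9042, "elasticsearch": 9200}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


def service_status(ports=DEFAULT_PORTS, host="127.0.0.1"):
    """Map each service name to whether its port accepts connections."""
    return {name: port_open(host, port) for name, port in ports.items()}
```

`service_status()` returns a dictionary keyed by service name. A `False` value only means nothing accepted a connection on that port, so check the service's own logs before assuming it failed to start.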

Running the server

  • From the scrapi/website/ directory, run
$ python server.py

and the server should be up and running.

Harvesters

  • To set up the harvesters for the first time, run
$ invoke init_harvesters

This installs the harvesters specified in the manifest files of the worker_manager, along with their requirements.

RabbitMQ

Mac OS X

$ brew install rabbitmq

Ubuntu

$ sudo apt-get install rabbitmq-server

Running the scheduler

  • From the top-level project directory, run
$ invoke celery_beat

to start the scheduler, and

$ invoke celery_worker

to start the worker.

Testing

  • To run the project's tests, run
$ invoke test

This runs all of the tests in the 'tests/' directory.
