PDF Transcribe is a demo project for PyBossa that shows how you can crowdsource a PDF transcription problem.
This project uses the Mozilla PDF.JS library to load an external PDF file and render it directly in the web browser without any third-party plugin.
With PDF.JS we can render almost any PDF that is hosted on an HTTP server, and then use a customized form to capture the data that we want to extract from it.
In this simple demo project, we load a PDF file on one side of the page and a form on the other, where the volunteer transcribes the PDF page by typing its text into the input field. Although this example is deliberately simple, adapting the template to extract specific pieces of information from the PDF is easy: you only need to add more HTML input fields with instructions about what you want to extract. For example, you could extract specific items from the documents, such as captions, tabular data, authorship, or institutions.
The provided script for creating the tasks is very simple: you only need to tell it the URL where the PDF file is hosted and which pages you want to turn into tasks. By default, this demo covers the 14 pages of the example PDF file.
You need to install the pybossa-pbs library first. Use of a virtual environment is recommended:
$ virtualenv env
$ source env/bin/activate
$ pip install -r requirements.txt
Now that you have all the requirements installed, you need a PyBossa account:
- Create an account on your PyBossa server (use Crowdcrafting if you want).
- Copy your API-KEY (you can find it on your profile page).
The pybossa-pbs command line tool can be configured with a config file, so that you do not have to type the API-KEY and the server every time you take an action on your project. For this reason, we recommend that you create the config file. To create it, follow these steps:
$ cd ~
$ editorofyourchoice .pybossa.cfg
That will create the file. Now paste the following into it:
[default]
server: http://yourpybossaserver.com
apikey: yourapikey
Save the file, and you are done! From now on, pybossa-pbs will always use the default section to run your commands.
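If you want to double-check the file before running pbs, the following is a minimal sketch that parses the same layout with Python's standard configparser module. It is only an illustration: the function name parse_pbs_config and the SAMPLE text are hypothetical, and pbs itself may read the file differently.

```python
import configparser

# Sample contents mirroring the config file shown above (placeholder values).
SAMPLE = """\
[default]
server: http://yourpybossaserver.com
apikey: yourapikey
"""


def parse_pbs_config(text, section="default"):
    """Return (server, apikey) from config text laid out like ~/.pybossa.cfg.

    Hypothetical helper for illustration; not part of pybossa-pbs.
    """
    # interpolation=None keeps any '%' characters in the values literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if section not in parser:
        raise KeyError("missing section [%s]" % section)
    return parser[section]["server"], parser[section]["apikey"]


if __name__ == "__main__":
    server, apikey = parse_pbs_config(SAMPLE)
    # Print only the server so the API key does not end up in logs.
    print("server:", server)
```

To check your real file, pass the contents of ~/.pybossa.cfg to parse_pbs_config instead of SAMPLE.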
Now that we have everything in place, creating the project is as simple as running this command:
$ pbs create_project
Adding tasks is very simple too. A sample tasks CSV file named 'pdf_tasks.csv' is included here. You can adapt it to the URLs of your own PDF files, and then just let pbs do the job:
$ pbs add_tasks --tasks-file pdf_tasks.csv
Note that the file has 3 columns (or keys, if you work with an equivalent JSON file) that this template uses:
- pdf_url: the URL where the PDF file will be loaded from.
- question: some text you want to display, giving instructions on what the user needs to do.
- page: an optional field. When set, only the specified page is displayed if the PDF document has multiple pages. If omitted, the whole document is shown (with pagination).
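As an illustration of the three columns above, here is a minimal sketch that generates one task row per page of a single PDF. The helper names, the example URL, the question text, the column order, and the assumption that pages are numbered from 1 are all hypothetical; compare the output with the bundled pdf_tasks.csv before relying on it.

```python
import csv
import io
import sys

# Column names taken from the list above; their order here is an assumption.
FIELDNAMES = ["pdf_url", "question", "page"]


def build_task_rows(pdf_url, question, pages):
    """Build one task dictionary per requested page (hypothetical helper)."""
    return [
        {"pdf_url": pdf_url, "question": question, "page": page}
        for page in pages
    ]


def tasks_csv_text(rows):
    """Serialize the task rows to CSV text, header line first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder URL and question; 14 pages to match the demo's default.
    rows = build_task_rows(
        "http://example.com/document.pdf",
        "Please transcribe this page",
        range(1, 15),
    )
    sys.stdout.write(tasks_csv_text(rows))
```

Redirect the output to a file of your choice and pass that file to pbs add_tasks with the --tasks-file option shown above.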
You can also use the built-in Dropbox importer that comes with PyBossa servers (if configured by the admin). For more details, please visit the PyBossa documentation.
Now that we have some data to process, let's add the required templates to our project: a better description of the project, the presenter that shows the tasks to our users, and a small tutorial for the volunteers:
$ pbs update_project
NOTE: we also provide templates for Bootstrap v2, in case your PyBossa server is using Bootstrap 2 instead of Bootstrap 3. See the rest of the files.
Please check the full documentation on how to create a project from the command line with pbs:
http://docs.pybossa.com/en/latest/user/pbs.html
Usually you will have a set of PDF files that you are currently serving via a web server.
If you use the project as it is, you will see that the PDFs fail to load, even though the URLs are fine and the PDF pages are correct in the Google Spreadsheet that you have created. The problem is that you need to enable CORS in order to get access to your PDF files.
The Enable CORS website explains how to configure most web servers properly, so that this project can load the PDF files from other domains without problems. For example, for an Apache web server, all you have to do is enable the mod_headers module:
$ sudo a2enmod headers
Then open the site config file, e.g. /etc/apache2/sites-enabled/000-default, and add the following to the VirtualHost section:
Header set Access-Control-Allow-Origin "*"
Finally, restart the web server and you are done! The PDFs should now load without problems. Note: you can also use .htaccess files so that CORS is not enabled for your whole site or, if you prefer, place the previous directive inside a Directory or Location section instead of at the VirtualHost level.
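To confirm that the header is actually being sent, you could inspect the response headers of one of your PDF files. The sketch below is one possible way to do that with Python's standard library; the function names are hypothetical, and it only checks the Access-Control-Allow-Origin header, not the rest of the CORS rules that a browser applies.

```python
import sys
from urllib.request import Request, urlopen


def allows_origin(headers, origin):
    """Return True if the headers allow `origin` ("*" or an exact match).

    `headers` is any mapping of header names to values; names are compared
    case-insensitively. Hypothetical helper for illustration.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    allowed = lowered.get("access-control-allow-origin")
    return allowed is not None and allowed.strip() in ("*", origin)


def fetch_headers(pdf_url, origin):
    """Send a HEAD request with an Origin header and return the response headers."""
    request = Request(pdf_url, method="HEAD", headers={"Origin": origin})
    with urlopen(request, timeout=10) as response:
        return dict(response.headers.items())


if __name__ == "__main__":
    # Usage: python check_cors.py <pdf_url> <origin of your PyBossa server>
    if len(sys.argv) == 3:
        pdf_url, origin = sys.argv[1], sys.argv[2]
        ok = allows_origin(fetch_headers(pdf_url, origin), origin)
        print("CORS header present" if ok else "CORS header missing")
```

The script name check_cors.py is only a suggestion. Run it with the URL of a PDF and the origin of your PyBossa server as arguments.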
Alternatively, if you are using a PyBossa server configured to be integrated with Dropbox (like Crowdcrafting) you can use the built-in Dropbox importer to serve the PDF files directly from a Dropbox account. Check the PyBossa docs for more details.
We recommend that you read the Build with PyBossa section and follow the step-by-step tutorial.
NOTE: This project uses the pybossa-pbs library in order to simplify the development of the project and its usage. Check the documentation.
Please see the COPYING file.
The thumbnail has been created using a photo from TempusVolat (license CC BY-NC-SA 2.0).
Special thanks to Miquel Herrera for his JS libraries for the canvas scrolling, and to the Mozilla Foundation for their PDF.JS library.