Tutorial 1
This is a wiki page started by Ross Mounce (contributions/edits/forks welcome!) on 22/07/2014 at Mozilla (London) as part of #mozsprint. I will be attempting to document Peter Murray-Rust's attempt to write a journal scraper for the quickscrape project for the journal Acta Crystallographica E.
Eventually this will be turned into a serviceable tutorial for others to follow...
Screencrop shows the navigation bar which has links to seven different objects:
- html : full text in HTML SCRAPE-IT!
- pdf : full text in PDF SCRAPE-IT!
- cif : the crystal data in CIF format SCRAPE-IT!
- 3d view : a wonderful 3d interactive viewer (Jmol). Probably not worth scraping
- structure factors : the raw data. Very valuable for reproducibility. CrystalGeekScrape
- supplementary materials : crystal data in HTML (actually duplicates what is in the CIF, but avoids transformation). SCRAPE-IT!
- checkCIF : machine review of data quality. Really valuable for reproducibility. SCRAPE-IT!
- PMR Googles a known DOI string of an Acta Crystallographica E article: 10.1107/S1600536814009556
- This is one of the links returned: http://scripts.iucr.org/cgi-bin/paper?hb0002
- Apparently hb0002 is the 'journal identifier' PMR may need to explain this... !!!
- To view the source HTML of a landing page in a modern browser (e.g. Firefox/Chrome/Chromium) press Ctrl+U (on Win/Linux) or Cmd+U (Mac)
- It should then show you the plain HTML of the page you were looking at in a new tab/window:
Finding the links is trivial (this isn't true of most journals). We'll look at the source of the landing page (later we'll have clickable tools to make this easier). The purple/pink bar containing all the links to the academic content is represented by a single very long line in the HTML (which is difficult to read), line 75:
<!-- End CrossMark Snippet --> <div class="buttonlinks">
<div class="buttonlinks"><a href="http://journals.iucr.org/e/issues/2014/08/00/hb0002/index.html" target="_parent"><img src="http://journals.iucr.org/e//graphics/htmlborder.gif" alt="HTML version" align="top" border="0" /></a><a href="http://journals.iucr.org/e/issues/2014/08/00/hb0002/hb0002.pdf" ><img src="http://journals.iucr.org/e//graphics/pdfborder.gif" alt="pdf version" align="top" border="0" /></a><a href="http://scripts.iucr.org/cgi-bin/sendcif?hb0002sup1" ><img src="http://journals.iucr.org/e//graphics/cifborder.gif" alt="cif file" align="top" border="0" /></a><a href="http://scripts.iucr.org/cgi-bin/sendcif?hb0002sup1&Qmime=cif" ><img src="http://journals.iucr.org/e//graphics/3dviewborder.gif" alt="3d view" align="top" border="0" /></a><a href="http://journals.iucr.org/e/issues/2014/08/00/hb0002/hb0002Isup2.hkl" ><img src="http://journals.iucr.org/e//graphics/structurefactorsborder.gif" alt="structure factors" align="top" border="0" /></a><a href="http://journals.iucr.org/e/issues/2014/08/00/hb0002/hb0002sup0.html" ><img src="http://journals.iucr.org/e//graphics/supplementarymaterialsborder.gif" alt="supplementary materials" align="top" border="0" /></a><a href="http://scripts.iucr.org/cgi-bin/paper?cnor=hb0002&checkcif=yes" ><img src="http://journals.iucr.org/e//graphics/checkcifborder.gif" alt="CIF check report" align="top" border="0" /></a> <a href="http://journals.iucr.org/services/openaccess.html"><img src="http://journals.iucr.org//../logos/free.gif" alt="Open access" align="top" border="0" /></a></div><p class="scheme"><img src="http://journals.iucr.org/e/issues/2014/08/00/hb0002/hb0002contents.gif" alt="hb0002 scheme" /></p><div class="bibline"><p><i>Acta Cryst.</i> (2014). E<b>70</b>, 44-47 [ <font size="2"><a title="Open URL link" href="http://dx.doi.org/10.1107/S1600536814009556">doi:10.1107/S1600536814009556</a></font> ]</p></div>
The CrossMark comment is not relevant. Formatting the rest (and adding line breaks for human readers) we get:
<div class="buttonlinks">
<a href="http://journals.iucr.org/e/issues/2014/08/00/hb0002/index.html" target="_parent"><img src="http://journals.iucr.org/e//graphics/htmlborder.gif" alt="HTML version" align="top" border="0" /></a>
<a href="http://journals.iucr.org/e/issues/2014/08/00/hb0002/hb0002.pdf" ><img src="http://journals.iucr.org/e//graphics/pdfborder.gif" alt="pdf version" align="top" border="0" /></a>
<a href="http://scripts.iucr.org/cgi-bin/sendcif?hb0002sup1" ><img src="http://journals.iucr.org/e//graphics/cifborder.gif" alt="cif file" align="top" border="0" /></a>
<a href="http://scripts.iucr.org/cgi-bin/sendcif?hb0002sup1&Qmime=cif" ><img src="http://journals.iucr.org/e//graphics/3dviewborder.gif" alt="3d view" align="top" border="0" /></a>
<a href="http://journals.iucr.org/e/issues/2014/08/00/hb0002/hb0002Isup2.hkl" ><img src="http://journals.iucr.org/e//graphics/structurefactorsborder.gif" alt="structure factors" align="top" border="0" /></a>
<a href="http://journals.iucr.org/e/issues/2014/08/00/hb0002/hb0002sup0.html" ><img src="http://journals.iucr.org/e//graphics/supplementarymaterialsborder.gif" alt="supplementary materials" align="top" border="0" /></a>
<a href="http://scripts.iucr.org/cgi-bin/paper?cnor=hb0002&checkcif=yes" ><img src="http://journals.iucr.org/e//graphics/checkcifborder.gif" alt="CIF check report" align="top" border="0" /></a>
<a href="http://journals.iucr.org/services/openaccess.html"><img src="http://journals.iucr.org//../logos/free.gif" alt="Open access" align="top" border="0" /></a>
</div>
The rest of the line is not relevant:
Acta Cryst. (2014). E70, 44-47 [ doi:10.1107/S1600536814009556 ]
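To see which links the navigation bar offers without reading the long line by eye, something like the following may help. It is only a sketch using Python's standard html.parser, not part of quickscrape; NAV_BAR is a trimmed copy of the bar (two of the links) and ButtonLinkParser and extract_links are hypothetical helper names.

```python
from html.parser import HTMLParser

# Trimmed copy of the navigation bar from the landing page (two links only).
NAV_BAR = (
    '<div class="buttonlinks">'
    '<a href="http://journals.iucr.org/e/issues/2014/08/00/hb0002/index.html" target="_parent">'
    '<img src="http://journals.iucr.org/e//graphics/htmlborder.gif" alt="HTML version" /></a>'
    '<a href="http://journals.iucr.org/e/issues/2014/08/00/hb0002/hb0002.pdf">'
    '<img src="http://journals.iucr.org/e//graphics/pdfborder.gif" alt="pdf version" /></a>'
    '</div>'
)


class ButtonLinkParser(HTMLParser):
    """Collect {img alt text: href of the enclosing <a>} pairs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.current_href = None  # href of the <a> element we are currently inside
        self.links = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.current_href = attrs.get("href")
        elif tag == "img" and self.current_href is not None:
            # The alt text of the button image labels the link around it.
            self.links[attrs.get("alt")] = self.current_href

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None


def extract_links(html):
    """Return a dict mapping each button's alt text to its link target."""
    parser = ButtonLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


for alt_text, href in extract_links(NAV_BAR).items():
    print(alt_text, "->", href)
```

The alt texts printed here ("HTML version", "pdf version") are the values a selector can later key on.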
DON'T PANIC!
It's actually easy as you will probably cut and paste this and edit the bits you want. It uses JSON as a (declarative) programming language, BUT you DON'T need to learn it.
The structure is
{ "url": "mdpi", "elements": {
... a list of scrapables ...
} }
The "url" (we should rename this?) is actually a regular expression ("regex") used to match the URL of the landing page. It reads: find any URL with the string "mdpi" in it. Our landing page is http://journals.iucr.org/e/issues/2014/08/00/hb0002/index.html so we'll write
"url": "iucr",
as IUCr (the abbreviation for the International Union of Crystallography) is probably unique. (If there are ambiguities we'll show you how to tighten the regex.)
Make sure the brackets are balanced (i.e. for each { you have a corresponding }) and that the quotes are closed (i.e. each " is ended by another "). The colons (:) and commas (,) also matter; JSON does not use semicolons (;). If you get things wrong it's almost certain that you've got the punctuation wrong. Note that lists (e.g. the scrapables or the contents of a scrapable) have elements separated by commas EXCEPT after the last one. A very common error is an unwanted comma at the end of the list, but with care you'll get it right. If you do foul up (and we all do!) DON'T PANIC but contact the ContentMine community (contentmine.org).
We'll write the scrapable for HTML:
All we want from here is the string http://journals.iucr.org/e/issues/2014/08/00/hb0002/index.html and to label it as fulltext_html. (The labels are reserved names: you must get them exactly right, including case.) So our HTML scrapable has the structure:
"fulltext_html": { "selector": [to be written now], "attribute": [to be written now], "download": true },
This will mean that the scraped result has a chunk labelled "fulltext_html" and that we want to download it now (true). Now we just have to write the selector and the attribute...
HTML (and XML) consist of elements (started with a "<") and attributes of the form foo="bar", where foo is the attribute name and bar is the attribute value [that's all you need to know]. Because no publishers document their HTML externally we have to guess how to identify each element. There are lots of "a" elements so we need to find specific attributes. We'll guess that:
target="_parent"
is unique for the link which points to the HTML. In XPath (further tips/syntax: http://www.w3schools.com/xpath/xpath_syntax.asp) this reads:
//a[@target='_parent']
(The "//" means anywhere in the document. It's a sledgehammer which works for most landing pages.) Read this as: find all "a" elements with a "target" attribute whose value is "_parent". This finds what we want and we write:
"selector": "//a[@target='_parent']",
Make sure you get the quotes (") and apostrophes (') correct and balanced. To get the value of the link we ask for its href attribute:
"attribute": "href",
and the whole scrapable is
"fulltext_html": { "selector": "//a[@target='_parent']", "attribute": "href", "download": true }
That's it! That's all you have to know. Make sure you have a comma after each directive inside the scrapable except the last, and a comma after the scrapable's closing "}" if another scrapable follows it. Take a rest. Then we'll test it...
... one very large coffee later. We assemble the growing scraper:
{ "url": "iucr", "elements": { "fulltext_html": { "selector": "//a[@target='_parent']", "attribute": "href", "download": true } } }
(as there is only one scrapable, there is no comma after it) and see if it runs... (Don't be too ambitious: write a scraper for one file type and then add the others later.) To run it, create a file iucr.json in the journal-scrapers directory with this content, then run:
quickscrape --url http://scripts.iucr.org/cgi-bin/paper?S1600536814009556 --scraper journal-scrapers/iucr.json --output iucr-test
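The two things this scraper relies on can also be checked offline before touching the live site. The sketch below is plain Python, not quickscrape itself. It assumes the "url" value is searched for anywhere in the page URL (as described above), and it uses the limited XPath subset of xml.etree.ElementTree as a stand-in for a full XPath engine. NAV_BAR is a trimmed, hand-tidied copy of the navigation bar and dry_run is a hypothetical helper.

```python
import re
import xml.etree.ElementTree as ET

# The scraper assembled above, written as a Python dict.
SCRAPER = {
    "url": "iucr",
    "elements": {
        "fulltext_html": {
            "selector": "//a[@target='_parent']",
            "attribute": "href",
            "download": True,
        }
    },
}

PAGE_URL = "http://scripts.iucr.org/cgi-bin/paper?S1600536814009556"

# Trimmed, well-formed copy of the navigation bar (two links only).
NAV_BAR = (
    '<div class="buttonlinks">'
    '<a href="http://journals.iucr.org/e/issues/2014/08/00/hb0002/index.html" target="_parent">'
    '<img alt="HTML version" /></a>'
    '<a href="http://journals.iucr.org/e/issues/2014/08/00/hb0002/hb0002.pdf">'
    '<img alt="pdf version" /></a>'
    '</div>'
)


def dry_run(scraper, page_url, page_xml):
    """Mimic the two scraper steps: match the URL, then apply each selector."""
    if re.search(scraper["url"], page_url) is None:
        return None  # this scraper would not be chosen for the page
    root = ET.fromstring(page_xml)
    results = {}
    for name, rule in scraper["elements"].items():
        # ElementTree needs a relative path, so "//a[...]" becomes ".//a[...]".
        matches = root.findall("." + rule["selector"])
        results[name] = [match.get(rule["attribute"]) for match in matches]
    return results


print(dry_run(SCRAPER, PAGE_URL, NAV_BAR))
```

With the selector as written, the result holds a single href for fulltext_html, the index.html link. A mistyped attribute name or value in the selector gives an empty list instead.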
We made three mistakes worth detailing in the course of all this:
- missed out a "}" which gave:
undefined:0
^
SyntaxError: Unexpected end of input
- mistyped the condition which gave:
info: Single matching scraper found
warn: no elements were extracted from url: http://scripts.iucr.org/cgi-bin/paper?S1600536814009556
- leaving the last element in the journal scraper with a comma e.g. }, which gives:
undefined:22
}
^
SyntaxError: Unexpected token }
but when we corrected them we got a directory full of files and a results.json which contained
[
{
"fulltext_html": "http://journals.iucr.org/e/issues/2014/08/00/hb0002/index.html"
}
]
i.e. exactly one file...
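The first and third mistakes above are plain JSON syntax errors, so they can be caught before quickscrape is run at all. A minimal sketch, assuming Python's standard json module is an acceptable stand-in for the parser quickscrape uses (the wording of its errors will differ from the messages shown above); check_scraper is a hypothetical helper name.

```python
import json


def check_scraper(text):
    """Return None if text parses as JSON, else where and why parsing failed."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


GOOD = (
    '{ "url": "iucr", "elements": { "fulltext_html": { '
    '"selector": "//a[@target=\'_parent\']", '
    '"attribute": "href", "download": true } } }'
)
# Mistake 1: the final "}" is missing.
MISSING_BRACE = GOOD.rstrip()[:-1]
# Mistake 3: a comma is left after the last scrapable.
TRAILING_COMMA = GOOD.replace("true }", "true }, ")

for label, text in [
    ("good", GOOD),
    ("missing brace", MISSING_BRACE),
    ("trailing comma", TRAILING_COMMA),
]:
    print(label, "->", check_scraper(text) or "OK")
```

The second mistake (a mistyped condition) is still valid JSON, so a check like this will not catch it; only a test run against a real page will.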
Final version with ALL the scrapables for this journal:
{
"url": "iucr",
"elements": {
"fulltext_html": {
"selector": "//a[@target='_parent']",
"attribute": "href",
"download": true
},
"fulltext_pdf": {
"selector": "//a[img[@alt='pdf version']]",
"attribute": "href",
"download": true
},
"iucr:cif": {
"selector": "//a[img[@alt='cif file']]",
"attribute": "href",
"download": true
},
"iucr:supplementary": {
"selector": "//a[img[@alt='supplementary materials']]",
"attribute": "href",
"download": true
},
"iucr:cif_check": {
"selector": "//a[img[@alt='CIF check report']]",
"attribute": "href",
"download": true
}
}
}
NOTE: there is no comma after the last scrapable, and the attribute values must contain the exact number of spaces. With this scraper we get all 5 files, all listed in the results.json ...
When you are proficient (and know the target journal) this scraper would probably take:
- 10 minutes to find the HTML links in the landing page
- 5 minutes to write the scrapables (here they are very nicely set out) - some journals might take 30 minutes
- 5 minutes to correct your errors :-)
- 5 minutes to test a few more URLs
- 5 minutes to document what you have done and commit
So about 30 minutes in all. However, some journals are poorly structured and you may have to use positional information, regexes on attribute values, etc. We'll write more tutorials for that...