This repository contains the complete scripts and data files used to construct the Computational History of Philosophy of Science (Comp HOPOS) dataset. This readme file contains information on reproducing the dataset (and modifying the scripts for other purposes); a license statement; a brief overview of the method by which the dataset is constructed; and an overview of the files included in the repository.
The dataset can be downloaded at https://doi.org/10.5281/zenodo.1400633. The downloads include a data dictionary. A paper describing the motivations and construction method in more detail is available at .
These scripts were developed by Daniel J. Hicks, Rick Morris, and Evelyn Brister. The repository is maintained by Daniel J. Hicks. To report errors or other issues, please use the issue tracker for this repository (preferred) or email [email protected].
In principle, the dataset can be reproduced by running the scripts (the R
files) in numerical order. There are, of course, a number of further details and complications.
-
With the exception of the gender attribution websites accessed in
06_gender.R
, all of the software tools and data are free (or free for non-commercial use). -
The
R
files are run using the R programming language. -
04_dedupe_names.R
is essentially a wrapper around a separate (included) Python script. This script requires Python 3. The R script needs to be configured to call the correct Python installation. -
The following input files are required:
00_Boston.csv
and00_Western_Ontario.csv
, which record volume-level information for the Boston and Western Ontario book series, respectively. These files can be generated by conducting a search on SpringerLink and then clicking on the downward-arrow button in the upper-right.00_Minnesota.xlsx
, which records chapter-level information for the Minnesota book series. This file was created by Evelyn Brister with assistance from Aaron Crespo.01_canonical_titles.csv
, which includes canonical names for journals. This file can also be constructed manually from the output file01_journal_data.csv
.
-
Every script has a number of dependencies, and the dependencies vary from script to script. In particular, note the following:
- Unless otherwise noted, all dependencies should be available using standard package managers (e.g., R's built-in package manager,
pip
or Anaconda for Python). - The Python dependencies used in
04_dupe_remove.py
are unique to that script. 06_gender.R
accesses files in another repository.06_gender.R
also accesses APIs for two subscription services, NamSor and genderize.io. After subscribing to those services, create a fileapi_keys.R
in the working directory, and add API keys with the following format:
namsor_key = 'stringofcharacters' namsor_user = 'namsor.com/[email protected]/12345' genderize.io_key = 'otherstringofcharacters'
- Unless otherwise noted, all dependencies should be available using standard package managers (e.g., R's built-in package manager,
-
The two scripts that retrieve larges amounts of data,
01_journals.R
and02_springer_books.R
, have relatively long runtimes, approximately 20 and 45 minutes respectively. -
Because of these two last points, it is recommended that each script be run individually. While the entire process could be automated once all of the dependencies are satisfied, this will not be necessary in many cases (because the dataset has already been constructed).
-
Note on the CrossRef API from the RCRossRef documentation:
Be nice and share your email with Crossref
The Crossref team encourage requests with appropriate contact information and will forward you to a dedicated API cluster for improved performance when you share your email address with them. https://github.com/CrossRef/rest-api-doc#good-manners--more-reliable-service
To pass your email address to Crossref via this client, simply store it as environment variable in .Renviron like this:
- Open file: file.edit("~/.Renviron")
- Add email address to be shared with Crossref crossref_email = [email protected]
- Save the file and restart your R session
Don’t wanna share your email any longer? Simply delete it from ~/.Renviron
Copyright (c) 2018 Daniel J. Hicks, Rick Morris, and Evelyn Brister
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
The primary source of data for the Comp HOPOS dataset is CrossRef, which maintains registration records for the (vast) majority of digital object identifiers (DOIs). Recently, many scholarly publishers have been "minting" DOIs for their archives; combined with an elegant R API, this makes it possible to easily and rapidly retrieve a complete set of metadata records for many scholarly journals.
Two other sources of data are incorporated into the Comp HOPOS dataset. Chapter-level DOIs for the Boston and Western Ontario book series are scraped from search results in Springer's public search engine. CrossRef is then used to retrieve the metadata for a chapter in a standard format. For the Minnesota book series, Evelyn Brister retrieved the data from the University of Minnesota website manually, with assistance from Aaron Crespo.
After combining these data sources, author names are disambiguated. Using canonical names, philosophers of science are identified and genders are attributed to philosophers of science based on author names.
Currently, philosophers of science are identified as authors who (after name disambiguation) have 2 or more articles in an identified set of "primary philosophy of science" journals (including the three book series).
00_Boston.csv
: Volume-level information for Boston book series00_Minnesota.xlsx
: Chapter-level information for the Minnesota book series00_Western_Ontario.csv
: Volume-level information for the Western Ontario book series01_canonical_titles.csv
: Canonical titles for journals01_journals.R
: Retrieve metadata for journal articles using CrossRef API02_springer_books.R
: Retrieve metadata for Springer-published book chapters using Springer website + CrossRef API03_combine.R
: Combine data retrieved in previous two scripts04_dedupe_names.R
: R wrapper around04_dupe_remove.py
04_dupe_remove.py
: Primary Python script used for author name disambiguation04_qc_file_diffs.py
: Auxiliary Python script used for author name disambiguation04_readme.txt
: Readme for Python scripts05_verify_samples.R
: Generates subsets of author disambiguation output files for manual review06_gender.R
: Identifies philosophers of science in the dataset and attributes gender based on author name07_dataset.R
: Combines metadataset with gender attribution into dataset suitable for public distributionreadme.md
: This readme