Literary Language Toolkit (LLTK)

Corpora, models, and tools for the study of complex language.

Quickstart

See this notebook for a more interactive quickstart (run the code here on Binder).

Install

Open a terminal, Jupyter, or Colab notebook and type:

pip install -qU lltk-dh

# or, for the very latest version:
#pip install -qU git+https://github.com/quadrismegistus/lltk

Show available corpora:

lltk show

Or, within Python, show them as Markdown:

import lltk
lltk.show()

Play with corpora

See below for available corpora.

# Load/install a corpus
import lltk
corpus = lltk.load('ECCO_TCP')           # load the corpus by name or ID

# Metadata
meta = corpus.meta                       # metadata as data frame
smpl = meta.query('1770<year<1830')      # easy query access         

# Data
mfw = corpus.mfw()                       # get the 10K most frequent words as a list
dtm = corpus.dtm()                       # get a document-term matrix as a pandas dataframe
dtm = corpus.dtm(tfidf=True)             # get DTM as tf-idf
mdw = corpus.mdw('gender')               # get most distinctive words for a metadata group

Play with texts

# accessing text objs
texts = corpus.texts()                   # get a list of corpus's text objects
texts_smpl = corpus.texts(smpl)          # text objects from df/list of ids 
texts_rad = corpus.au.Radcliffe          # hit "tab" after typing e.g. "Rad" to autocomplete 
text = corpus.t                          # get a random text object from corpus

# metadata access
text_meta = text.meta                    # get text metadata as dictionary
author = text.author                     # get common metadata as attributes    
title = text.title
year = text.year
dec = text.decade                        # few inferred attributes too
dec_str = text.decade_str                # '1890-1900' rather than 1890

# data access
txt = text.txt                           # get plain text as string
xml = text.xml                           # get xml as string

# simple nlp
words  = text.words                      # get list of words (excl punct)
sents = text.sents                       # get list of sentences
counts = text.counts                     # get word counts as dictionary (from JSON if saved)

# other nlp
tnltk = text.nltk                        # get nltk Text object
tblob = text.blob                        # get TextBlob object
tstanza = text.stanza                    # get list of stanza objects (one per para)
tspacy = text.spacy                      # get list of spacy objects (one per para)
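
For readers who want a feel for what attributes such as `words`, `counts`, `decade`, and `decade_str` return, here is a minimal sketch in plain Python. It is not LLTK's implementation: the helper functions `tokenize`, `word_counts`, `decade`, and `decade_str` are hypothetical names, and the regex tokenizer is an assumption about how punctuation might be dropped.

```python
import re
from collections import Counter


def tokenize(txt):
    """Lowercase the text and return word tokens, dropping punctuation (assumed rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", txt.lower())


def word_counts(txt):
    """Return a word -> count dictionary, similar in spirit to `text.counts`."""
    return dict(Counter(tokenize(txt)))


def decade(year):
    """Return the first year of the decade containing `year`."""
    return (year // 10) * 10


def decade_str(year):
    """Return the decade as a range string, e.g. 1890 -> '1890-1900'."""
    start = decade(year)
    return f"{start}-{start + 10}"


sample = "The castle was dark; the night, darker."
print(tokenize(sample))
print(word_counts(sample)["the"])
print(decade(1794), decade_str(1794))
```

The real attributes may tokenize differently (for example around hyphens or numerals), so treat the output of this sketch as illustrative only.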

Available corpora

LLTK has built-in functionality for the following corpora. Some (🌞) are freely downloadable from the links below or through the LLTK interface. Others (☂) require first obtaining the raw data through an institutional or other subscription. Some corpora are mixed: part of the data is open under fair research use (e.g. metadata, freqs) and part is closed (e.g. txt, xml, raw).

name desc license metadata freqs txt xml raw
ARTFL American and French Research on the Treasury of the French Language Academic ☂️ ☂️
BPO British Periodicals Online Commercial ☂️ ☂️
CLMET Corpus of Late Modern English Texts Academic 🌞 🌞 ☂️ ☂️
COCA Corpus of Contemporary American English Commercial ☂️ ☂️ ☂️ ☂️
COHA Corpus of Historical American English Commercial ☂️ ☂️ ☂️ ☂️
Chadwyck Chadwyck-Healey Fiction Collections Mixed 🌞 🌞 ☂️ ☂️ ☂️
ChadwyckDrama Chadwyck-Healey Drama Collections Mixed ☂️ ☂️ ☂️ ☂️ ☂️
ChadwyckPoetry Chadwyck-Healey Poetry Collections Mixed ☂️ ☂️ ☂️ ☂️ ☂️
Chicago U of Chicago Corpus of C20 Novels Academic 🌞 🌞 ☂️
DTA Deutsches Text Archiv Free 🌞 🌞 🌞 🌞 🌞
DialNarr Dialogue and Narration separated in Chadwyck-Healey Novels Academic 🌞 🌞 ☂️
ECCO Eighteenth Century Collections Online Commercial ☂️ ☂️ ☂️ ☂️ ☂️
ECCO_TCP ECCO (Text Creation Partnership) Free 🌞 🌞 🌞 🌞 🌞
EEBO_TCP Early English Books Online (curated by the Text Creation Partnership) Free 🌞 🌞 🌞 🌞
ESTC English Short Title Catalogue Academic ☂️
EnglishDialogues A Corpus of English Dialogues, 1560-1760 Academic 🌞 🌞 🌞
EvansTCP Early American Fiction Free 🌞 🌞 🌞 🌞 🌞
GaleAmericanFiction Gale American Fiction, 1774-1920 Academic 🌞 🌞 ☂️ ☂️
GildedAge U.S. Fiction of the Gilded Age Academic 🌞 🌞 🌞
HathiBio Biographies from Hathi Trust Academic 🌞 🌞
HathiEngLit Fiction, drama, verse word frequencies from Hathi Trust Academic 🌞 🌞
HathiEssays Hathi Trust volumes with "essay(s)" in title Academic 🌞 🌞
HathiLetters Hathi Trust volumes with "letter(s)" in title Academic 🌞 🌞
HathiNovels Hathi Trust volumes with "novel(s)" in title Academic 🌞 🌞
HathiProclamations Hathi Trust volumes with "proclamation(s)" in title Academic 🌞 🌞
HathiSermons Hathi Trust volumes with "sermon(s)" in title Academic 🌞 🌞
HathiStories Hathi Trust volumes with "story/stories" in title Academic 🌞 🌞
HathiTales Hathi Trust volumes with "tale(s)" in title Academic 🌞 🌞
HathiTreatises Hathi Trust volumes with "treatise(s)" in title Academic 🌞 🌞
InternetArchive 19th Century Novels, curated by the U of Illinois and hosted on the Internet Archive Free 🌞 🌞 🌞
LitLab Literary Lab Corpus of 18th and 19th Century Novels Academic 🌞 🌞 ☂️
MarkMark Mark Algee-Hewitt's and Mark McGurl's 20th Century Corpus Academic 🌞 🌞 ☂️
OldBailey Old Bailey Online Free 🌞 🌞 🌞 🌞
RavenGarside Raven & Garside's Bibliography of English Novels, 1770-1830 Academic ☂️
SOTU State of the Union Addresses Free 🌞 🌞 🌞
Sellers 19th Century Texts compiled by Jordan Sellers Free 🌞 🌞 🌞
SemanticCohort Corpus used in "Semantic Cohort Method" (2012) Free 🌞
Spectator The Spectator (1711-1714) Free 🌞 🌞 🌞
TedJDH Corpus used in "Emergence of Literary Diction" (2012) Free 🌞 🌞 🌞
TxtLab A multilingual dataset of 450 novels Free 🌞 🌞 🌞 🌞

Documentation

Incomplete for now. See this sample notebook for some examples.

New corpus

Import a corpus into LLTK:

# use the "import" command with these options:
#   -path_txt       a folder of txt files (use -path_xml for xml)
#   -path_metadata  a metadata csv/tsv/xls about those txt files
#   -col_fn         .txt/.xml filename col in metadata (use -col_id if no ext)
lltk import \
  -path_txt mycorpus/txts \
  -path_metadata mycorpus/meta.xls \
  -col_fn filename
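
The import command expects a folder of text files plus a metadata table with a column naming each file. The sketch below builds such a layout with the standard library so there is something to point the command at. The function name `build_corpus_layout`, the `author`, `title`, and `year` columns, and the sample rows are all illustrative assumptions; only the `filename` column corresponds to the `-col_fn` option shown above. It writes a CSV rather than the `.xls` used in the example.

```python
import csv
import tempfile
from pathlib import Path


def build_corpus_layout(root, records):
    """Write one .txt file per record and a meta.csv describing them.

    `records` is a list of dicts, each with a 'filename' key, a 'txt' key,
    and any other metadata columns (hypothetical in this sketch).
    """
    root = Path(root)
    txt_dir = root / "txts"
    txt_dir.mkdir(parents=True, exist_ok=True)

    # Every key except the text body becomes a metadata column
    meta_cols = [k for k in records[0] if k != "txt"]
    for rec in records:
        (txt_dir / rec["filename"]).write_text(rec["txt"], encoding="utf-8")

    meta_path = root / "meta.csv"
    with meta_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=meta_cols)
        writer.writeheader()
        for rec in records:
            writer.writerow({k: rec[k] for k in meta_cols})
    return txt_dir, meta_path


records = [
    {"filename": "a.txt", "author": "Author A", "title": "First", "year": 1794, "txt": "Some text."},
    {"filename": "b.txt", "author": "Author B", "title": "Second", "year": 1811, "txt": "More text."},
]
workdir = tempfile.mkdtemp()
txt_dir, meta_path = build_corpus_layout(Path(workdir) / "mycorpus", records)
print(sorted(p.name for p in txt_dir.iterdir()))
print(meta_path.read_text(encoding="utf-8").splitlines()[0])
```

The resulting `txts` folder and `meta.csv` file could then be passed to `-path_txt` and `-path_metadata`, with `-col_fn filename`.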

Or create a new one:

lltk create

Most frequent words

corpus.mfw_df(
    n=None,                            # Number of top words overall
    by_ntext=False,                    # Count number of documents not number of words
    by_fpm=False,                      # Count by within-text relative sums
    min_count=None,                    # Minimum count of word

    yearbin=None,                      # Average relative counts across `yearbin` periods
    col_group='period',                # Which column to periodize on
    n_by_period=None,                  # Number of top words per period
    keep_periods=True,                 # Keep periods in output dataframe
    n_agg='median',                    # How to aggregate across periods
    min_periods=None,                  # minimum number of periods a word must appear in

    excl_stopwords=False,              # Exclude stopwords (at `PATH_TO_ENGLISH_STOPWORDS`)
    excl_top=0,                        # Exclude words ranked 1:`excl_top`
    valtype='fpm',                     # valtype to compute top words by
    **attrs
)
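
To make the counting options more concrete, here is a minimal sketch of three ways to rank words: by raw count, by the number of texts containing the word (compare `by_ntext`), and by frequency per million words (assuming that is what `fpm` denotes). It is a plain-Python illustration, not LLTK's implementation; the function `top_words` and its `by` parameter are hypothetical, and the corpus-wide rate used here is simpler than the within-text relative sums described for `by_fpm`.

```python
from collections import Counter


def top_words(docs, n=3, by="count"):
    """Rank words across `docs` (a dict of text id -> list of tokens).

    by="count": total occurrences across all texts
    by="ntext": number of texts containing the word
    by="fpm":   occurrences per million tokens in the whole corpus
    """
    totals = Counter()
    ntexts = Counter()
    for tokens in docs.values():
        totals.update(tokens)
        ntexts.update(set(tokens))  # each text counts a word at most once

    if by == "count":
        scores = dict(totals)
    elif by == "ntext":
        scores = dict(ntexts)
    elif by == "fpm":
        corpus_size = sum(totals.values())
        scores = {w: c / corpus_size * 1_000_000 for w, c in totals.items()}
    else:
        raise ValueError(f"unknown ranking: {by}")

    # Sort by score (descending), then alphabetically so ties are stable
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


docs = {
    "t1": "the ghost of the castle".split(),
    "t2": "the ghost ghost ghost".split(),
    "t3": "a quiet castle".split(),
}
print(top_words(docs, by="count"))
print(top_words(docs, by="ntext"))
```

Ranking by number of texts dampens words that are very frequent in only one document, which is why `ghost` leads the count ranking but only ties in the per-text ranking.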

Document term matrix

corpus.dtm(
    words=[],                          # words to use in DTM
    n=25000,                           # if not `words`, how many mfw?
    texts=None,                        # set texts to use explicitly, otherwise use all
    tf=False,                          # return term frequencies, not counts
    tfidf=False,                       # return tfidf, not counts
    meta=False,                        # include metadata (e.g. ["gender","nation"])
    **mfw_attrs,                       # all other attributes passed to self.mfw()
)
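
The difference between raw counts, term frequencies (`tf`), and tf-idf (`tfidf`) can be sketched in plain Python. This is an illustration under stated assumptions, not LLTK's implementation: `build_dtm` is a hypothetical function, it returns plain lists rather than a pandas dataframe, and the tf-idf variant used is `tf * log(N / df)` with no smoothing, which may differ from what the library computes.

```python
import math
from collections import Counter


def build_dtm(docs, words=None, n=5, tf=False, tfidf=False):
    """Return (vocab, rows): rows maps text id -> list of values, one per vocab word.

    docs:  dict of text id -> list of tokens
    words: explicit vocabulary; if empty, the `n` most frequent words are used
    """
    counts = {tid: Counter(tokens) for tid, tokens in docs.items()}
    if not words:
        totals = Counter()
        for c in counts.values():
            totals.update(c)
        ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
        words = [w for w, _ in ranked[:n]]

    # Document frequency: number of texts containing each word
    df = {w: sum(1 for c in counts.values() if c[w] > 0) for w in words}
    n_docs = len(docs)

    rows = {}
    for tid, c in counts.items():
        length = sum(c.values()) or 1  # avoid dividing by zero for empty texts
        row = []
        for w in words:
            value = c[w]
            if tf or tfidf:
                value = value / length
            if tfidf:
                # Assumed variant: tf * log(N / df); libraries differ on smoothing
                value *= math.log(n_docs / df[w]) if df[w] else 0.0
            row.append(value)
        rows[tid] = row
    return list(words), rows


docs = {"t1": ["ghost", "castle", "ghost"], "t2": ["castle", "garden"]}
vocab, rows = build_dtm(docs, words=["castle", "ghost"])
print(vocab, rows)
_, weighted = build_dtm(docs, words=["castle", "ghost"], tfidf=True)
print(weighted)
```

In this variant a word that occurs in every text, such as `castle` here, receives a tf-idf weight of zero, which is the intended effect of the inverse document frequency term.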

Most distinctive words

corpus.mdw(                                 
    groupby,                           # metadata categorical variable to group by
    words=[],                          # explicitly set words to use
    texts=None,                        # explicitly set texts to use
    tfidf=True,                        # use tfidf as mdw calculation
    keep_null_cols=False,              # remove texts which do not have `groupby` set
    remove_zeros=True,                 # remove rows summing to zero
    agg='median',                      # aggregate by `agg`
)
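
As a rough illustration of the idea behind grouping by a metadata variable and aggregating with a median, the sketch below scores each word by its median relative frequency inside a group minus its median relative frequency in all other texts. This scoring rule is a stand-in chosen for simplicity, not LLTK's method (which by default uses tf-idf, per the `tfidf=True` option above); `distinctive_words` and its parameters are hypothetical.

```python
from collections import Counter
from statistics import median


def distinctive_words(docs, groups, n=2):
    """Rank words per group by median in-group minus median out-of-group frequency.

    docs:   dict of text id -> list of tokens
    groups: dict of text id -> group label; texts without a label are skipped
    """
    # Per-text relative frequencies, only for texts that have a group label
    rel = {}
    for tid, tokens in docs.items():
        if tid not in groups or not tokens:
            continue
        c = Counter(tokens)
        rel[tid] = {w: k / len(tokens) for w, k in c.items()}

    vocab = sorted({w for freqs in rel.values() for w in freqs})
    result = {}
    for label in sorted(set(groups[tid] for tid in rel)):
        inside = [tid for tid in rel if groups[tid] == label]
        outside = [tid for tid in rel if groups[tid] != label]
        scores = {}
        for w in vocab:
            med_in = median(rel[t].get(w, 0.0) for t in inside)
            med_out = median(rel[t].get(w, 0.0) for t in outside) if outside else 0.0
            scores[w] = med_in - med_out
        # Highest score first; alphabetical order breaks ties
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[label] = [w for w, _ in ranked[:n]]
    return result


docs = {
    "t1": "storm castle storm".split(),
    "t2": "storm ruin".split(),
    "t3": "garden tea garden".split(),
    "t4": "garden visit".split(),
}
groups = {"t1": "gothic", "t2": "gothic", "t3": "domestic", "t4": "domestic"}
print(distinctive_words(docs, groups))
```

Skipping texts that lack a group label mirrors the documented behaviour of `keep_null_cols=False`; the median makes the score less sensitive to a single text that uses a word unusually often.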