Corpora, models, and tools for the study of complex language.
See this notebook for a more interactive quickstart (run the code here on Binder).
Open a terminal, Jupyter, or Colab notebook and type:
pip install -qU lltk-dh
# or, for the very latest version:
#pip install -qU git+https://github.com/quadrismegistus/lltk
Show available corpora:
lltk show
Or, within Python, show in markdown:
import lltk
lltk.show()
See below for available corpora.
# Load/install a corpus
import lltk
corpus = lltk.load('ECCO_TCP') # load the corpus by name or ID
# Metadata
meta = corpus.meta # metadata as data frame
smpl = meta.query('1770<year<1830') # easy query access
# Data
mfw = corpus.mfw() # get the 10K most frequent words as a list
dtm = corpus.dtm() # get a document-term matrix as a pandas dataframe
dtm = corpus.dtm(tfidf=True) # get DTM as tf-idf
mdw = corpus.mdw('gender') # get most distinctive words for a metadata group
# accessing text objs
texts = corpus.texts() # get a list of corpus's text objects
texts_smpl = corpus.texts(smpl) # text objects from df/list of ids
texts_rad = corpus.au.Radcliffe # hit "tab" after typing e.g. "Rad" to autocomplete
text = corpus.t # get a random text object from corpus
# metadata access
text_meta = text.meta # get text metadata as dictionary
author = text.author # get common metadata as attributes
title = text.title
year = text.year
dec = text.decade # a few inferred attributes too
dec_str = text.decade_str # '1890-1900' rather than 1890
# data access
txt = text.txt # get plain text as string
xml = text.xml # get xml as string
# simple nlp
words = text.words # get list of words (excl punct)
sents = text.sents # get list of sentences
counts = text.counts # get word counts as dictionary (from JSON if saved)
# other nlp
tnltk = text.nltk # get nltk Text object
tblob = text.blob # get TextBlob object
tstanza = text.stanza # get list of stanza objects (one per para)
tspacy = text.spacy # get list of spacy objects (one per para)
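As a rough illustration of what attributes like `text.words`, `text.counts`, `text.decade`, and `text.decade_str` return, here is a minimal standard-library sketch. The helper names (`tokenize`, `decade_of`, `decade_str`) and the regular expression are assumptions made for this example; LLTK's own tokenizer and inference rules may differ.

```python
import re
from collections import Counter


def tokenize(txt):
    """Return lowercase word tokens, dropping punctuation (hypothetical tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", txt.lower())


def decade_of(year):
    """Round a year down to the start of its decade, e.g. 1794 -> 1790."""
    return year // 10 * 10


def decade_str(year):
    """Render the decade as a range string, e.g. 1894 -> '1890-1900'."""
    dec = decade_of(year)
    return f"{dec}-{dec + 10}"


txt = "The castle was dark; the night, darker still."
words = tokenize(txt)      # list of words, punctuation excluded
counts = Counter(words)    # word -> count mapping
```

Here `counts` behaves like the dictionary of word counts described above, and `decade_str` reproduces the `'1890-1900'` format shown in the comment.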
LLTK has built-in functionality for the following corpora. Some (🌞) are freely downloadable from the links below or through the LLTK interface. Others (☂) require first obtaining the raw data through your institutional or other subscription. Some corpora are mixed: part of the data is open under fair research use (e.g. metadata, freqs) and part is closed (e.g. txt, xml, raw).
Incomplete for now. See this sample notebook for some examples.
Import a corpus into LLTK:
# "lltk import" flags:
#   -path_txt       a folder of txt files (use -path_xml for xml)
#   -path_metadata  a metadata csv/tsv/xls about those txt files
#   -col_fn         .txt/.xml filename col in metadata (use -col_id if no ext)
lltk import \
    -path_txt mycorpus/txts \
    -path_metadata mycorpus/meta.xls \
    -col_fn filename
Or create a new one:
lltk create
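For the import route, the inputs are a folder of text files plus a metadata table with one column naming each file. The sketch below builds such a layout with the standard library. The function name `make_sample_corpus`, the extra metadata columns, and the placeholder text are assumptions for illustration only; the import command itself only requires the filename column described above.

```python
import csv
import tempfile
from pathlib import Path


def make_sample_corpus(root):
    """Write a txts/ folder and a meta.csv whose 'filename' column points at each file."""
    root = Path(root)
    txt_dir = root / "txts"
    txt_dir.mkdir(parents=True, exist_ok=True)

    # Illustrative metadata rows; the text bodies written below are placeholders.
    rows = [
        {"filename": "radcliffe_1794.txt", "author": "Radcliffe, Ann",
         "title": "The Mysteries of Udolpho", "year": 1794},
        {"filename": "austen_1813.txt", "author": "Austen, Jane",
         "title": "Pride and Prejudice", "year": 1813},
    ]
    for row in rows:
        path = txt_dir / row["filename"]
        path.write_text("Placeholder text for " + row["title"], encoding="utf-8")

    meta_path = root / "meta.csv"
    with meta_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["filename", "author", "title", "year"])
        writer.writeheader()
        writer.writerows(rows)
    return txt_dir, meta_path


corpus_root = Path(tempfile.mkdtemp()) / "mycorpus"
txt_dir, meta_path = make_sample_corpus(corpus_root)
```

With real files in place of the placeholders, the resulting paths would be passed as `-path_txt`, `-path_metadata`, and `-col_fn filename`.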
corpus.mfw_df(
n=None, # Number of top words overall
by_ntext=False, # Count number of documents, not number of words
by_fpm=False, # Count by within-text relative sums
min_count=None, # Minimum count of word
yearbin=None, # Average relative counts across `yearbin` periods
col_group='period', # Which column to periodize on
n_by_period=None, # Number of top words per period
keep_periods=True, # Keep periods in output dataframe
n_agg='median', # How to aggregate across periods
min_periods=None, # minimum number of periods a word must appear in
excl_stopwords=False, # Exclude stopwords (at `PATH_TO_ENGLISH_STOPWORDS`)
excl_top=0, # Exclude words ranked 1:`excl_top`
valtype='fpm', # valtype to compute top words by
**attrs
)
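To make a few of these options concrete (`n`, `by_ntext`, `min_count`, and stopword exclusion), here is a minimal standard-library sketch of ranking most frequent words. It is not LLTK's implementation: the function `most_frequent_words`, its `stopwords` argument, and the alphabetical tie-breaking are assumptions for this example, and the period-based options are left out.

```python
from collections import Counter


def most_frequent_words(docs, n=None, by_ntext=False, min_count=None, stopwords=()):
    """Rank words across tokenized documents.

    docs      -- iterable of token lists, one per text
    by_ntext  -- if True, count the number of texts containing each word
    min_count -- drop words whose count falls below this value
    n         -- keep only the top n words (None keeps all)
    """
    totals = Counter()
    for tokens in docs:
        # With by_ntext, each text contributes at most 1 per word.
        totals.update(set(tokens) if by_ntext else tokens)

    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [
        (word, count) for word, count in ranked
        if word not in stopwords and (min_count is None or count >= min_count)
    ]
    return kept[:n]


docs = [
    ["the", "castle", "the", "night"],
    ["the", "abbey"],
    ["castle", "abbey", "abbey"],
]
by_tokens = most_frequent_words(docs)
by_texts = most_frequent_words(docs, by_ntext=True)
no_stops = most_frequent_words(docs, n=2, stopwords={"the"})
```

Counting by texts rather than tokens dampens the effect of a single long text that repeats a word many times.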
corpus.dtm(
words=[], # words to use in DTM
n=25000, # if not `words`, how many mfw?
texts=None, # set texts to use explicitly, otherwise use all
tf=False, # return term frequencies, not counts
tfidf=False, # return tfidf, not counts
meta=False, # include metadata (e.g. ["gender","nation"])
**mfw_attrs, # all other attributes passed to self.mfw()
)
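The `tf` and `tfidf` switches change what each cell of the matrix holds. The sketch below shows one common way to compute the three variants with the standard library. The function `build_dtm`, its return shape, and the specific tf-idf formula (relative frequency times `log(N / df)`) are assumptions for illustration; LLTK returns a pandas dataframe and may weight terms differently.

```python
import math
from collections import Counter


def build_dtm(docs, words=None, n=25000, tf=False, tfidf=False):
    """Build a document-term matrix from {text_id: token list}.

    Returns (vocab, rows), where rows maps text_id to a list of values
    aligned with vocab. Cells are raw counts unless tf or tfidf is set.
    """
    counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}

    if not words:
        # Fall back to the n most frequent words across all documents.
        totals = Counter()
        for c in counts.values():
            totals.update(c)
        ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
        words = [w for w, _ in ranked[:n]]

    ndocs = len(docs)
    # Document frequency: number of texts in which each word occurs.
    df = {w: sum(1 for c in counts.values() if c[w] > 0) for w in words}

    rows = {}
    for doc_id, c in counts.items():
        total = sum(c.values()) or 1
        row = []
        for w in words:
            val = c[w]
            if tf or tfidf:
                val = val / total                      # relative frequency
            if tfidf:
                idf = math.log(ndocs / df[w]) if df[w] else 0.0
                val = val * idf                        # down-weight common words
            row.append(val)
        rows[doc_id] = row
    return list(words), rows


docs = {"a": ["castle", "castle", "night"], "b": ["castle", "abbey"]}
vocab, raw = build_dtm(docs)
_, rel = build_dtm(docs, tf=True)
_, weighted = build_dtm(docs, tfidf=True)
```

Under this formula, a word that occurs in every text (here `castle`) gets a tf-idf weight of zero, which is why tf-idf is often preferred when looking for distinguishing vocabulary.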
corpus.mdw(
groupby, # metadata categorical variable to group by
words=[], # explicitly set words to use
texts=None, # explicitly set texts to use
tfidf=True, # use tfidf as mdw calculation
keep_null_cols=False, # remove texts which do not have `groupby` set
remove_zeros=True, # remove rows summing to zero
agg='median', # aggregate by `agg`
)
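The statistic LLTK uses to rank distinctive words is not spelled out here, so the following is only one simple possibility, sketched with the standard library: aggregate each word's per-text values within a group (median by default, mirroring `agg='median'`), then score each word by how far its group aggregate exceeds the highest aggregate in any other group. The function `distinctive_words` and its input shapes are assumptions for this example.

```python
from statistics import median


def distinctive_words(rows, vocab, groups, agg=median):
    """Rank words per group by (group aggregate - best other-group aggregate).

    rows   -- {text_id: list of values aligned with vocab}
    vocab  -- list of words
    groups -- {text_id: group label, or None if the metadata field is unset}
    """
    by_group = {}
    for text_id, label in groups.items():
        if label is None:
            continue  # skip texts that have no value for the grouping field
        by_group.setdefault(label, []).append(rows[text_id])

    # Aggregate each vocabulary column within each group.
    agg_by_group = {
        label: [agg(col) for col in zip(*vectors)]
        for label, vectors in by_group.items()
    }

    result = {}
    for label, values in agg_by_group.items():
        others = [v for other, v in agg_by_group.items() if other != label]
        scores = []
        for i, word in enumerate(vocab):
            rest = max((o[i] for o in others), default=0.0)
            scores.append((word, values[i] - rest))
        # Highest score first; alphabetical order breaks ties.
        result[label] = sorted(scores, key=lambda kv: (-kv[1], kv[0]))
    return result


vocab = ["castle", "ball", "letter"]
rows = {"t1": [4, 0, 1], "t2": [2, 1, 1], "t3": [0, 3, 1], "t4": [1, 5, 1], "t5": [9, 9, 9]}
groups = {"t1": "gothic", "t2": "gothic", "t3": "domestic", "t4": "domestic", "t5": None}
ranked = distinctive_words(rows, vocab, groups)
```

Using the median rather than the mean keeps one unusually long or repetitive text from dominating a group's profile.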