ruTS is a library for extracting statistics from texts in Russian.
To install it, run the following command:
$ pip install ruts
Dependencies:
- python 3.8-3.10
- nltk
- pymorphy2
- razdel
- scipy
- spaCy
- numpy
- pandas
- matplotlib
- graphviz
The main functions are based on the textacy statistics adapted to the Russian language. The library works both with raw texts and with Doc objects of the spaCy library.
See the API to explore the available functions.
The library lets you create your own tools for extracting sentences and words from a text, which can then be used to compute statistics (see the sketch after the example below).
Example:
import re
from nltk.corpus import stopwords
from ruts import SentsExtractor, WordsExtractor
text = "Не имей 100 рублей, а имей 100 друзей"
se = SentsExtractor(tokenizer=re.compile(r', '))
se.extract(text)
('Не имей 100 рублей', 'а имей 100 друзей')
we = WordsExtractor(use_lexemes=True, stopwords=stopwords.words('russian'), filter_nums=True, ngram_range=(1, 2))
we.extract(text)
('иметь', 'рубль', 'иметь', 'друг', 'иметь_рубль', 'рубль_иметь', 'иметь_друг')
we.get_most_common(3)
[('иметь', 2), ('рубль', 1), ('друг', 1)]
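The extractors above can be plugged into the statistics classes so that the counts are computed over your own tokenization. A minimal sketch, assuming BasicStats accepts the extractors via sents_extractor and words_extractor keyword arguments (these parameter names are an assumption, not shown in the excerpt above):
import re
from ruts import BasicStats, SentsExtractor, WordsExtractor
text = "Не имей 100 рублей, а имей 100 друзей"
se = SentsExtractor(tokenizer=re.compile(r', '))          # split sentences on commas
we = WordsExtractor(use_lexemes=True, filter_nums=True)   # lemmatize and drop numbers
bs = BasicStats(text, sents_extractor=se, words_extractor=we)  # assumed keyword arguments
bs.print_stats()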
The library allows extracting the following statistics from a text:
- the number of sentences
- the number of words
- the number of unique words
- the number of long words
- the number of complex words
- the number of simple words
- the number of monosyllabic words
- the number of polysyllabic words
- the number of symbols
- the number of letters
- the number of spaces
- the number of syllables
- the number of punctuation marks
- word distribution by the number of letters
- word distribution by the number of syllables
Example:
from ruts import BasicStats
text = "Существуют три вида лжи: ложь, наглая ложь и статистика"
bs = BasicStats(text)
bs.get_stats()
{'c_letters': {1: 1, 3: 2, 4: 3, 6: 1, 10: 2},
'c_syllables': {1: 5, 2: 1, 3: 1, 4: 2},
'n_chars': 55,
'n_complex_words': 2,
'n_letters': 45,
'n_long_words': 3,
'n_monosyllable_words': 5,
'n_polysyllable_words': 4,
'n_punctuations': 2,
'n_sents': 1,
'n_simple_words': 7,
'n_spaces': 8,
'n_syllables': 18,
'n_unique_words': 8,
'n_words': 9}
bs.print_stats()
Статистика | Значение
------------------------------
Предложения | 1
Слова | 9
Уникальные слова | 8
Длинные слова | 3
Сложные слова | 2
Простые слова | 7
Односложные слова | 5
Многосложные слова | 4
Символы | 55
Буквы | 45
Пробелы | 8
Слоги | 18
Знаки препинания | 2
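The counters returned by get_stats() are also available as attributes of the BasicStats object (this kind of attribute access is shown later for the spaCy component, e.g. doc._.basic.c_letters). A short sketch, assuming the attribute names mirror the get_stats() keys:
from ruts import BasicStats
text = "Существуют три вида лжи: ложь, наглая ложь и статистика"
bs = BasicStats(text)
print(bs.n_words)         # 9  (assumed attribute, mirrors the 'n_words' key)
print(bs.n_unique_words)  # 8
print(bs.c_letters)       # {1: 1, 3: 2, 4: 3, 6: 1, 10: 2}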
The library allows computing the following readability metrics:
- Flesch Reading Ease
- Flesch-Kincaid Grade Level
- Coleman-Liau Index
- SMOG Index
- Automated Readability Index
- LIX readability measure
Coefficients for the Russian language were borrowed from the Plain Russian Language project, which derives readability coefficients from a special corpus of texts with age labels.
Example:
from ruts import ReadabilityStats
text = "Ног нет, а хожу, рта нет, а скажу: когда спать, когда вставать, когда работу начинать"
rs = ReadabilityStats(text)
rs.get_stats()
{'automated_readability_index': 0.2941666666666656,
'coleman_liau_index': 0.2941666666666656,
'flesch_kincaid_grade': 3.4133333333333304,
'flesch_reading_easy': 83.16166666666666,
'lix': 48.333333333333336,
'smog_index': 0.05}
rs.print_stats()
Метрика | Значение
--------------------------------------------------
Тест Флеша-Кинкайда | 3.41
Индекс удобочитаемости Флеша | 83.16
Индекс Колман-Лиау | 0.29
Индекс SMOG | 0.05
Автоматический индекс удобочитаемости | 0.29
Индекс удобочитаемости LIX | 48.33
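As a cross-check of how these metrics are derived from the basic counts, here is a rough sketch that recomputes a Flesch-style reading-ease score from BasicStats output. The classic English coefficients (206.835, 1.015, 84.6) are used for illustration only; the library itself uses coefficients adapted for Russian by the Plain Russian Language project, so the result will not match rs.get_stats() exactly.
from ruts import BasicStats
text = "Ног нет, а хожу, рта нет, а скажу: когда спать, когда вставать, когда работу начинать"
stats = BasicStats(text).get_stats()
asl = stats['n_words'] / stats['n_sents']      # average sentence length
asw = stats['n_syllables'] / stats['n_words']  # average syllables per word
# Classic Flesch Reading Ease formula (English coefficients, illustration only)
fre = 206.835 - 1.015 * asl - 84.6 * asw
print(round(fre, 2))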
The library allows computing the following lexical diversity metrics for a text:
- Type-Token Ratio (TTR)
- Root Type-Token Ratio (RTTR)
- Corrected Type-Token Ratio (CTTR)
- Herdan Type-Token Ratio (HTTR)
- Summer Type-Token Ratio (STTR)
- Mass Type-Token Ratio (MTTR)
- Dugast Type-Token Ratio (DTTR)
- Moving Average Type-Token Ratio (MATTR)
- Mean Segmental Type-Token Ratio (MSTTR)
- Measure of Textual Lexical Diversity (MTLD)
- Moving Average Measure of Textual Lexical Diversity (MAMTLD)
- Hypergeometric Distribution D (HD-D)
- Simpson's Diversity Index
- Hapax Legomena Index
Some of the implementations were borrowed from the lexical_diversity project.
Example:
from ruts import DiversityStats
text = "Ног нет, а хожу, рта нет, а скажу: когда спать, когда вставать, когда работу начинать"
ds = DiversityStats(text)
ds.get_stats()
{'ttr': 0.7333333333333333,
'rttr': 2.840187787218772,
'cttr': 2.008316044185609,
'httr': 0.8854692840710253,
'sttr': 0.2500605793160845,
'mttr': 0.0973825075623254,
'dttr': 10.268784661968104,
'mattr': 0.7333333333333333,
'msttr': 0.7333333333333333,
'mtld': 15.0,
'mamtld': 11.875,
'hdd': -1,
'simpson_index': 21.0,
'hapax_index': 431.2334616537499}
ds.print_stats()
Метрика | Значение
----------------------------------------------------------------------
Type-Token Ratio (TTR) | 0.92
Root Type-Token Ratio (RTTR) | 7.17
Corrected Type-Token Ratio (CTTR) | 5.07
Herdan Type-Token Ratio (HTTR) | 0.98
Summer Type-Token Ratio (STTR) | 0.96
Mass Type-Token Ratio (MTTR) | 0.01
Dugast Type-Token Ratio (DTTR) | 85.82
Moving Average Type-Token Ratio (MATTR) | 0.91
Mean Segmental Type-Token Ratio (MSTTR) | 0.94
Measure of Textual Lexical Diversity (MTLD) | 208.38
Moving Average Measure of Textual Lexical Diversity (MTLD) | 1.00
Hypergeometric Distribution D (HD-D) | 0.94
Индекс Симпсона | 305.00
Гапакс-индекс | 2499.46
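For intuition, the simplest of these metrics, TTR, is just the number of unique tokens divided by the total number of tokens. A quick sanity check using the WordsExtractor shown earlier (assuming it can be constructed with default parameters); the result may differ slightly from DiversityStats if the internal tokenization differs:
from ruts import WordsExtractor
text = "Ног нет, а хожу, рта нет, а скажу: когда спать, когда вставать, когда работу начинать"
we = WordsExtractor()
words = we.extract(text)
ttr = len(set(words)) / len(words)  # unique tokens / total tokens
print(round(ttr, 2))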
The library allows extracting the following morphological features:
- part of speech
- animacy
- aspect
- case
- gender
- involvement
- mood
- number
- person
- tense
- transitivity
- voice
Morphological analysis is performed with pymorphy2. Descriptions of the morphological features were borrowed from OpenCorpora.
Example:
from ruts import MorphStats
text = "Постарайтесь получить то, что любите, иначе придется полюбить то, что получили"
ms = MorphStats(text)
ms.pos
('VERB', 'INFN', 'CONJ', 'CONJ', 'VERB', 'ADVB', 'VERB', 'INFN', 'CONJ', 'CONJ', 'VERB')
ms.get_stats()
{'animacy': {None: 11},
'aspect': {None: 5, 'impf': 1, 'perf': 5},
'case': {None: 11},
'gender': {None: 11},
'involvement': {None: 10, 'excl': 1},
'mood': {None: 7, 'impr': 1, 'indc': 3},
'number': {None: 7, 'plur': 3, 'sing': 1},
'person': {None: 9, '2per': 1, '3per': 1},
'pos': {'ADVB': 1, 'CONJ': 4, 'INFN': 2, 'VERB': 4},
'tense': {None: 8, 'futr': 1, 'past': 1, 'pres': 1},
'transitivity': {None: 5, 'intr': 2, 'tran': 4},
'voice': {None: 11}}
ms.explain_text(filter_none=True)
(('Постарайтесь',
{'aspect': 'perf',
'involvement': 'excl',
'mood': 'impr',
'number': 'plur',
'pos': 'VERB',
'transitivity': 'intr'}),
('получить', {'aspect': 'perf', 'pos': 'INFN', 'transitivity': 'tran'}),
('то', {'pos': 'CONJ'}),
('что', {'pos': 'CONJ'}),
('любите',
{'aspect': 'impf',
'mood': 'indc',
'number': 'plur',
'person': '2per',
'pos': 'VERB',
'tense': 'pres',
'transitivity': 'tran'}),
('иначе', {'pos': 'ADVB'}),
('придется',
{'aspect': 'perf',
'mood': 'indc',
'number': 'sing',
'person': '3per',
'pos': 'VERB',
'tense': 'futr',
'transitivity': 'intr'}),
('полюбить', {'aspect': 'perf', 'pos': 'INFN', 'transitivity': 'tran'}),
('то', {'pos': 'CONJ'}),
('что', {'pos': 'CONJ'}),
('получили',
{'aspect': 'perf',
'mood': 'indc',
'number': 'plur',
'pos': 'VERB',
'tense': 'past',
'transitivity': 'tran'}))
ms.print_stats('pos', 'tense')
---------------Часть речи---------------
Глагол (личная форма) | 4
Союз | 4
Глагол (инфинитив) | 2
Наречие | 1
-----------------Время------------------
Неизвестно | 8
Настоящее | 1
Будущее | 1
Прошедшее | 1
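Because the per-token tags are exposed as plain tuples (e.g. ms.pos above), they can be aggregated or filtered with standard Python tools; for example, counting parts of speech directly reproduces the 'pos' entry of get_stats():
from collections import Counter
from ruts import MorphStats
text = "Постарайтесь получить то, что любите, иначе придется полюбить то, что получили"
ms = MorphStats(text)
print(Counter(ms.pos))
# e.g. Counter({'VERB': 4, 'CONJ': 4, 'INFN': 2, 'ADVB': 1})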
The library allows working with several preprocessed datasets:
- sov_chrest_lit - soviet reading-books for literature classes
- stalin_works - the collected works of Stalin
One can work either with the texts alone (without header information) or with texts and their metadata. Texts can also be filtered by various criteria.
Example:
from pprint import pprint
from ruts.datasets import SovChLit
sc = SovChLit()
sc.info
{'description': 'Корпус советских хрестоматий по литературе',
'url': 'https://dataverse.harvard.edu/file.xhtml?fileId=3670902&version=DRAFT',
'Наименование': 'sov_chrest_lit'}
for i in sc.get_records(max_len=100, category='Весна', limit=1):
    pprint(i)
{'author': 'Е. Трутнева',
'book': 'Родная речь. Книга для чтения в I классе начальной школы',
'category': 'Весна',
'file': PosixPath('../ruTS/ruts_data/texts/sov_chrest_lit/grade_1/155'),
'grade': 1,
'subject': 'Дождик',
'text': 'Дождик, дождик, поливай, будет хлеба каравай!\n'
'Дождик, дождик, припусти, дай гороху подрасти!',
'type': 'Стихотворение',
'year': 1963}
for i in sc.get_texts(text_type='Басня', limit=1):
    pprint(i)
('— Соседка, слышала ль ты добрую молву? — вбежавши, крысе мышь сказала:\n'
'— Ведь кошка, говорят, попалась в когти льву. Вот отдохнуть и нам пора '
'настала!\n'
'— Не радуйся, мой свет,— ей крыса говорит в ответ,— и не надейся '
'по-пустому.\n'
'Коль до когтей у них дойдёт, то, верно, льву не быть живому: сильнее кошки '
'зверя нет.')
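Dataset texts can also be fed into the statistics classes described earlier; for example, computing a readability score for the first few fables in the corpus (the choice of metric and limit here is illustrative):
from ruts import ReadabilityStats
from ruts.datasets import SovChLit
sc = SovChLit()
for fable in sc.get_texts(text_type='Басня', limit=3):
    rs = ReadabilityStats(fable)
    print(round(rs.get_stats()['flesch_reading_easy'], 2))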
The library allows visualizing texts with the help of the following graphs:
- Zipf's law
- Literature Fingerprinting
- Word Tree
Example:
from collections import Counter
from nltk.corpus import stopwords
from ruts import WordsExtractor
from ruts.datasets import SovChLit
from ruts.visualizers import zipf
sc = SovChLit()
text = '\n'.join([text for text in sc.get_texts(limit=100)])
we = WordsExtractor(use_lexemes=True, stopwords=stopwords.words('russian'), filter_nums=True)
tokens_with_count = Counter(we.extract(text))
zipf(tokens_with_count, num_words=100, num_labels=10, log=False, show_theory=True, alpha=1.1)
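For a quick look at the frequency-rank curve without the visualizer, the same Counter can be plotted directly with matplotlib (already a dependency). A minimal sketch reusing tokens_with_count from the example above:
import matplotlib.pyplot as plt
counts = [count for _, count in tokens_with_count.most_common(100)]
ranks = range(1, len(counts) + 1)
plt.loglog(ranks, counts, marker='.')  # Zipf's law appears as a roughly straight line on log-log axes
plt.xlabel('rank')
plt.ylabel('frequency')
plt.show()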
The library allows adding the following classes as spaCy pipeline components:
- BasicStats
- DiversityStats
- MorphStats
- ReadabilityStats
The Russian-language spaCy model can be downloaded by running the following command:
$ python -m spacy download ru_core_news_sm
Example:
import ruts
import spacy
nlp = spacy.load('ru_core_news_sm')
nlp.add_pipe('basic', last=True)
doc = nlp("Существуют три вида лжи: ложь, наглая ложь и статистика")
doc._.basic.c_letters
{1: 1, 3: 2, 4: 3, 6: 1, 10: 2}
doc._.basic.get_stats()
{'c_letters': {1: 1, 3: 2, 4: 3, 6: 1, 10: 2},
'c_syllables': {1: 5, 2: 1, 3: 1, 4: 2},
'n_chars': 55,
'n_complex_words': 2,
'n_letters': 45,
'n_long_words': 3,
'n_monosyllable_words': 5,
'n_polysyllable_words': 4,
'n_punctuations': 2,
'n_sents': 1,
'n_simple_words': 7,
'n_spaces': 8,
'n_syllables': 18,
'n_unique_words': 8,
'n_words': 9}
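The other statistics classes can presumably be added to the pipeline in the same way. A sketch assuming they are registered under names like 'readability' and 'diversity' with matching extension attributes (only the 'basic' component is shown above, so these names are an assumption):
import ruts
import spacy
nlp = spacy.load('ru_core_news_sm')
nlp.add_pipe('readability', last=True)  # assumed component name, by analogy with 'basic'
nlp.add_pipe('diversity', last=True)    # assumed component name
doc = nlp("Существуют три вида лжи: ложь, наглая ложь и статистика")
print(doc._.readability.get_stats())    # assumed extension attribute names
print(doc._.diversity.get_stats())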
Project structure:
- docs - project documentation
- ruts:
  - basic_stats.py - basic text statistics
  - components.py - spaCy components
  - constants.py - main constants
  - diversity_stats.py - lexical diversity metrics
  - extractors.py - tools for extracting objects from a text
  - morph_stats.py - morphological statistics
  - readability_stats.py - readability metrics
  - utils.py - auxiliary tools
  - datasets:
    - dataset.py - base class for working with datasets
    - sov_chrest_lit.py - soviet reading-books for literature classes
    - stalin_works.py - the collected works of Stalin
  - visualizers - tools for text visualization:
    - fingerprinting.py - Literature Fingerprinting
    - word_tree.py - Word Tree
    - zipf.py - Zipf's law
- tests:
  - test_basic_stats.py - tests for basic text statistics
  - test_components.py - tests for spaCy components
  - test_diversity_stats.py - tests for lexical diversity metrics
  - test_extractors.py - tests for object extraction tools
  - test_morph_stats.py - tests for morphological statistics
  - test_readability_stats.py - tests for readability metrics
  - datasets - tests for datasets:
    - test_dataset.py - tests for the base dataset class
    - test_sov_chrest_lit.py - tests for the soviet reading-books dataset
    - test_stalin_works.py - tests for the collected works of Stalin dataset
  - visualizers - tests for text visualization tools:
    - test_fingerprinting.py - tests for the Literature Fingerprinting visualization
    - test_word_tree.py - tests for the Word Tree visualization
    - test_zipf.py - tests for the Zipf's law visualization
Authors:
- Sergey Shkarin ([email protected])
- Ekaterina Smirnova ([email protected])
If you use ruTS in your research or software, please cite it with the following BibTeX entry. Citations help the continued development and maintenance of this library.
@software{ruTS,
author = {Sergey Shkarin},
title = {{ruTS, a library for statistics extraction from texts in Russian}},
year = 2023,
publisher = {Moscow},
url = {https://github.com/SergeyShk/ruTS}
}