Creating an X (Twitter) dataset for clustering and evaluating different models on it as a part of CTU thesis "Clustering social network texts"

anabolicobsession/tweet-clustering

Introduction

This repository contains the code for the thesis "Clustering Social Network Texts". It lets you generate a dataset of custom size, evaluate combinations of models (text embeddings, dimensionality reduction techniques, and clustering algorithms) on it, and analyze the results.

Setup

The required libraries are listed in misc/requirements.txt and can be installed with pip:

pip install -r misc/requirements.txt

Running on a GPU is highly recommended.

Dataset Generation

To generate a dataset with topics of any size, open generate_dataset.ipynb and set the topic constants at the beginning of the notebook (see the comments there for details). Then run all cells and use the last cell to save the dataset under a name of your choice.

Because the notebook is modular, you can also add your own topics with custom noise removal: copy the code from any existing cell that builds a topic dataset and modify it as needed.
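The idea of per-topic sizes and per-topic noise removal can be sketched as follows. This is not the notebook's code: TOPIC_SIZES, remove_urls, build_dataset, and the sample texts are all hypothetical names and made-up data chosen for illustration, and the real notebook may structure these steps differently.

```python
import random
import re

# Hypothetical topic constants: topic name -> number of texts to keep.
TOPIC_SIZES = {"sports": 2, "politics": 3}


def remove_urls(text):
    """Example noise filter (an assumption): strip URLs and collapse whitespace."""
    return " ".join(re.sub(r"https?://\S+", "", text).split())


def build_dataset(raw_by_topic, topic_sizes, cleaners=None, seed=0):
    """Return a shuffled list of (text, topic) pairs with the requested size per topic."""
    rng = random.Random(seed)
    cleaners = cleaners or {}
    rows = []
    for topic, size in topic_sizes.items():
        # Each topic may have its own cleaner; fall back to the default one.
        clean = cleaners.get(topic, remove_urls)
        cleaned = (clean(text) for text in raw_by_topic[topic])
        # Drop empty texts and duplicates while preserving order.
        texts = list(dict.fromkeys(t for t in cleaned if t))
        if len(texts) < size:
            raise ValueError(f"topic {topic!r} has only {len(texts)} usable texts")
        rows.extend((text, topic) for text in rng.sample(texts, size))
    rng.shuffle(rows)
    return rows


# Made-up texts, purely for illustration.
raw = {
    "sports": [
        "Great match today https://example.com/a",
        "Great match today",
        "Final score 2:1",
        "Transfer rumours again",
    ],
    "politics": [
        "Election results are in",
        "New bill passed http://example.com/bill",
        "Debate tonight",
        "Turnout was high",
    ],
}

dataset = build_dataset(raw, TOPIC_SIZES)
```

Adding a topic in this sketch means adding one entry to TOPIC_SIZES and, if the topic needs special handling, one function in the cleaners mapping.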

Evaluation Pipeline

The evaluation pipeline lives in clustering.ipynb. It has three kinds of interchangeable parts: text embeddings, dimensionality reduction techniques, and clustering algorithms. You can run the existing parts with your own parameters or, because the notebook is modular, add new ones. The code at the end of the notebook (EvaluationTable) saves the evaluation results for further analysis in analysis.ipynb. The evaluation metrics are defined in evaluation.ipynb, and results are saved to the evaluation directory by default.
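The three-stage structure can be illustrated with a minimal, standard-library-only sketch. Everything here is an assumption made for illustration: the toy bag-of-words embedding, the variance-based reducer, the small k-means, and the purity metric stand in for the real models and metrics, and the results list is not the repository's EvaluationTable.

```python
import csv
import io
from collections import Counter
from itertools import product


def embed_bow(texts):
    """Toy embedding: bag-of-words counts over the shared vocabulary."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    return [[t.lower().split().count(w) for w in vocab] for t in texts]


def reduce_identity(vectors):
    """No dimensionality reduction."""
    return vectors


def reduce_top_variance(vectors, dims=4):
    """Toy reducer: keep the `dims` columns with the highest variance."""
    n = len(vectors)
    columns = list(zip(*vectors))

    def variance(col):
        mean = sum(col) / n
        return sum((x - mean) ** 2 for x in col) / n

    order = sorted(range(len(columns)), key=lambda i: variance(columns[i]), reverse=True)
    keep = order[:dims]
    return [[v[i] for i in keep] for v in vectors]


def kmeans(vectors, k=2, iterations=10):
    """Small Lloyd's k-means, initialised with the first k vectors."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def purity(true_labels, cluster_labels):
    """Fraction of texts that belong to the majority topic of their cluster."""
    total = 0
    for c in set(cluster_labels):
        members = [t for t, label in zip(true_labels, cluster_labels) if label == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(true_labels)


# Made-up texts and topics, purely for illustration.
texts = ["goal scored in the match", "election vote results",
         "match goal highlights", "vote count election"]
topics = ["sports", "politics", "sports", "politics"]

# Each stage is a name -> callable mapping, so adding a model is one new entry.
EMBEDDINGS = {"bow": embed_bow}
REDUCERS = {"none": reduce_identity, "top_variance": reduce_top_variance}
CLUSTERERS = {"kmeans": kmeans}

results = []
for (e_name, embed), (r_name, reducer), (c_name, cluster) in product(
        EMBEDDINGS.items(), REDUCERS.items(), CLUSTERERS.items()):
    labels = cluster(reducer(embed(texts)))
    results.append({"embedding": e_name, "reduction": r_name,
                    "clustering": c_name, "purity": purity(topics, labels)})

# Serialise the table as CSV; a file opened in a results directory would work the same way.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(results[0]))
writer.writeheader()
writer.writerows(results)
csv_text = buffer.getvalue()
```

Keeping each stage as a plain callable means every combination can be evaluated in one loop, which is one way to read "modular" here; the notebooks may organise this differently.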
