GitHub - anabolicobsession/tweet-clustering: Creating an X (Twitter) dataset for clustering and evaluating different models on it as a part of CTU thesis "Clustering social network texts"

Introduction

This repository contains the code for the CTU thesis "Clustering Social Network Texts". It lets you generate an X (Twitter) dataset of custom size and evaluate arbitrary models on it (text embeddings, dimensionality reduction techniques, and clustering algorithms), with tooling for further analysis of the results.

Setup

The required libraries are listed in misc/requirements.txt and can be installed with pip:

pip install -r misc/requirements.txt

Running on a GPU is highly recommended.

Dataset Generation

You can generate a dataset with topics of any size in generate_dataset.ipynb. Set the topic constants at the beginning of the notebook (see the comments there for details), run all cells, and use the last cell to save the dataset under a name of your choice.

You can also add your own topics with custom noise removal, since the notebook is modular: copy the code from any cell that builds a topic dataset into a new cell and modify it as required.
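As an illustration of what a custom noise-removal step for a new topic might look like, here is a minimal sketch. The function names (remove_noise, clean_topic), the min_words threshold, and the specific rules (dropping URLs and @mentions, keeping hashtag words, removing short and duplicate tweets) are assumptions made for this example and are not taken from generate_dataset.ipynb.

```python
import re
from typing import Iterable, List

# Hypothetical patterns; the notebook may use different rules.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def remove_noise(text: str) -> str:
    """Strip URLs and @mentions, keep hashtag words, collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "")  # "#football" becomes "football"
    return WHITESPACE_RE.sub(" ", text).strip()


def clean_topic(tweets: Iterable[str], min_words: int = 3) -> List[str]:
    """Clean every tweet, then drop very short ones and case-insensitive duplicates."""
    seen = set()
    cleaned = []
    for tweet in tweets:
        text = remove_noise(tweet)
        key = text.lower()
        if len(text.split()) >= min_words and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    raw = [
        "The match was great https://t.co/x",
        "the match was great",
        "@bot hi",
        "What a goal tonight #football",
    ]
    print(clean_topic(raw))
```

A function of this shape could be applied to the raw tweets of a new topic before they are merged with the other topics.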

Evaluation Pipeline

In clustering.ipynb, you can use the existing parts of the pipeline (text embeddings, dimensionality reduction techniques, and clustering algorithms) with your own parameters, or add new ones, since the notebook is modular. The code at the end of the notebook (EvaluationTable) saves the evaluation results for further analysis (see analysis.ipynb). Evaluation metrics are defined in evaluation.py, and results are saved to the evaluation directory by default.
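The three-stage structure described above (embedding, then dimensionality reduction, then clustering, followed by evaluation) can be sketched with interchangeable functions. This is a dependency-free illustration only, not the code from clustering.ipynb: the stages are deliberately simple stand-ins (bag-of-words counts, variance-based feature selection, plain k-means), purity is used as an example metric, and all function names are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence

Vector = List[float]


def embed_bow(texts: Sequence[str]) -> List[Vector]:
    """Stand-in embedding: bag-of-words counts over the corpus vocabulary."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    return [[float(Counter(t.lower().split())[w]) for w in vocab] for t in texts]


def reduce_top_variance(X: List[Vector], n_dims: int) -> List[Vector]:
    """Stand-in reduction: keep the n_dims columns with the highest variance."""
    n = len(X)

    def variance(j: int) -> float:
        col = [row[j] for row in X]
        mean = sum(col) / n
        return sum((v - mean) ** 2 for v in col) / n

    keep = sorted(sorted(range(len(X[0])), key=variance, reverse=True)[:n_dims])
    return [[row[j] for j in keep] for row in X]


def cluster_kmeans(X: List[Vector], k: int, n_iter: int = 20) -> List[int]:
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""

    def dist2(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centers = [X[0]]
    while len(centers) < k:
        centers.append(max(X, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = [0] * len(X)
    for _ in range(n_iter):
        # Assignment step: nearest center for every point.
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in X]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(X, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def purity(true_labels: Sequence[int], pred_labels: Sequence[int]) -> float:
    """Fraction of items that belong to the majority topic of their cluster."""
    clusters = {}
    for t, p in zip(true_labels, pred_labels):
        clusters.setdefault(p, []).append(t)
    majority = sum(Counter(m).most_common(1)[0][1] for m in clusters.values())
    return majority / len(true_labels)


def run_pipeline(texts: Sequence[str],
                 embed: Callable, reducer: Callable, cluster: Callable) -> List[int]:
    """Chain the three stages; any stage can be swapped for another callable."""
    return cluster(reducer(embed(texts)))


# Toy data with two topics (illustrative, not from the thesis dataset).
texts = [
    "football match goal", "football team goal", "match team football",
    "election vote president", "president election campaign", "vote campaign election",
]
topics = [0, 0, 0, 1, 1, 1]

labels = run_pipeline(
    texts,
    embed=embed_bow,
    reducer=lambda X: reduce_top_variance(X, n_dims=2),
    cluster=lambda X: cluster_kmeans(X, k=2),
)
score = purity(topics, labels)
print(labels, score)
```

Because each stage is just a callable, replacing one of them (for example, with a neural text embedding or a different clustering algorithm) does not require changes to the other stages, which mirrors the modular design described for the notebook.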

data
evaluation
misc
plots
README.md
analysis.ipynb
clustering.ipynb
constants.py
deep_embedded_clustering.py
evaluation.py
generate_dataset.ipynb
thesis.pdf
utils.py
