2.3. Implement clustering algorithm

Motivation

Previously, we evaluated our fruit fly by classifying documents from three datasets: Web of Science, Wikipedia, and 20newsgroups. Further tests, however, indicated that the fruit fly's performance was less satisfactory in clustering tasks.

We have found a way to significantly improve the quality of the Web document representations returned by our flies. Technically, this means adding a dimensionality reduction step, based on topological data analysis, to the fly's input layer.
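To make the idea concrete, here is a minimal sketch of a reduction step placed in front of a fruit-fly style hash. It is not the project's implementation. PCA (computed with an SVD) stands in for the topological reduction method, the hash is a generic sparse random projection with winner-take-all, and every function name, parameter and toy value is an assumption made for illustration.

```python
import numpy as np

def reduce_dimensions(X, n_components):
    """PCA via SVD: a simple stand-in for the reduction step (assumption)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

def fly_hash(X, n_kenyon_cells, n_active, n_inputs_per_cell, seed=0):
    """Sparse random projection followed by winner-take-all binarisation."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    projection = np.zeros((n_features, n_kenyon_cells))
    for cell in range(n_kenyon_cells):
        # Each output cell listens to a few randomly chosen input dimensions.
        inputs = rng.choice(n_features, size=n_inputs_per_cell, replace=False)
        projection[inputs, cell] = 1.0
    activations = X @ projection
    # Keep only the n_active most strongly activated cells per document.
    top = np.argpartition(activations, -n_active, axis=1)[:, -n_active:]
    codes = np.zeros_like(activations, dtype=np.uint8)
    np.put_along_axis(codes, top, 1, axis=1)
    return codes

rng = np.random.default_rng(42)
docs = rng.random((20, 100))  # toy stand-in for 20 document vectors
reduced = reduce_dimensions(docs, n_components=16)
codes = fly_hash(reduced, n_kenyon_cells=64, n_active=8, n_inputs_per_cell=4)
```

Each row of `codes` is a binary vector with exactly `n_active` bits set. In this sketch the dense, low-dimensional vectors replace the raw document vectors as the input to the random projection.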

Dimensionality reduction and clustering

Code at https://github.com/PeARSearch/PeARS-fruit-fly/tree/main/dense_fruit_fly

Here are two samples, one from a philosophy cluster and one from a physics cluster.

Integration with Web Map

Code at https://github.com/PeARSearch/PeARS-fruit-fly/tree/main/web_map/umap

We have written code that downloads the whole of Wikipedia in a given language, produces a dimensionality-reduced representation of each article, passes those representations to the fruit fly, and returns binary representations that appear to cluster well.
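One rough way to check whether binary representations cluster is to test whether each article's nearest neighbour in Hamming space carries the same category label. The sketch below shows that check on invented toy codes and labels. It is not the evaluation used in the repository, and the function names are hypothetical.

```python
import numpy as np

def hamming_distances(codes):
    """Pairwise Hamming distances between the rows of a 0/1 matrix."""
    codes = codes.astype(np.int32)
    ones = codes.sum(axis=1)
    # For binary vectors a and b, differing bits = |a| + |b| - 2 * (a . b).
    return ones[:, None] + ones[None, :] - 2 * (codes @ codes.T)

def nearest_neighbour_agreement(codes, labels):
    """Fraction of items whose nearest neighbour (excluding itself) shares their label."""
    dist = hamming_distances(codes).astype(float)
    np.fill_diagonal(dist, np.inf)  # never pick an item as its own neighbour
    nearest = dist.argmin(axis=1)
    return float((labels[nearest] == labels).mean())

# Toy binary codes and labels, invented for illustration only.
codes = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
], dtype=np.uint8)
labels = np.array(["philosophy", "philosophy", "physics", "physics"])
score = nearest_neighbour_agreement(codes, labels)
```

A score close to 1 means that neighbours in Hamming space mostly share a label. A score near the chance level for the label distribution would suggest that the codes do not separate the categories.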

Further work section

Analyse the quality of Wikipedia's meta-categories (our labels) as compared to the text extracted from external links.
Compare classification with Common Crawl data.
Classify texts with Common Crawl labels.
