2.3. Implement clustering algorithm

Aurelie Herbelot edited this page May 1, 2022 · 10 revisions

Motivation

Previously, we evaluated our fruit fly by classifying documents from three different datasets: Web of Science, Wikipedia, and 20newsgroups. Further tests have, however, indicated that the fruit fly's performance was less satisfactory in clustering tasks.

We have found a way to significantly improve the quality of the Web document representations returned by our flies. Technically speaking, this means adding a dimensionality reduction step, based on topological data analysis, in the fly's input layer.
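To make the idea concrete, here is a minimal sketch of how a fly could hash a dense, already-reduced document vector into a sparse binary code. It is not the code from the repository: the function names, the layer sizes, and the `reduced_doc` vector are all hypothetical, and the dimensionality reduction itself is not shown. The sketch assumes the usual fruit fly scheme of a random sparse projection followed by winner-take-all.

```python
import random


def make_projections(input_dim, num_kenyon_cells, connections_per_cell, seed=0):
    """Give each Kenyon cell a fixed random subset of input dimensions."""
    rng = random.Random(seed)
    return [
        rng.sample(range(input_dim), connections_per_cell)
        for _ in range(num_kenyon_cells)
    ]


def fly_hash(dense_vector, projections, top_k):
    """Project a dense vector, then keep the top_k most active cells as 1s."""
    # Activation of a cell = sum of the input values it is connected to.
    activations = [sum(dense_vector[i] for i in cell) for cell in projections]
    # Winner-take-all: indices of the top_k largest activations.
    winners = sorted(
        range(len(activations)), key=lambda j: activations[j], reverse=True
    )[:top_k]
    code = [0] * len(activations)
    for j in winners:
        code[j] = 1
    return code


# Hypothetical 16-dimensional output of a dimensionality reduction step
# (random numbers stand in for a real reduced document vector).
rng = random.Random(1)
reduced_doc = [rng.random() for _ in range(16)]

projections = make_projections(
    input_dim=16, num_kenyon_cells=64, connections_per_cell=4
)
binary_code = fly_hash(reduced_doc, projections, top_k=8)
print(sum(binary_code), "active cells out of", len(binary_code))
```

Because the input layer here is small and dense, every Kenyon cell sees informative values, which is one plausible reason why reducing dimensionality before the projection could help.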

Dimensionality reduction and clustering

Code at https://github.com/PeARSearch/PeARS-fruit-fly/tree/main/dense_fruit_fly

Here are two samples, one from a philosophy cluster and one from a physics cluster.

Integration with Web Map

Code at https://github.com/PeARSearch/PeARS-fruit-fly/tree/main/web_map/umap

We have written code which downloads the whole of Wikipedia in a given language, produces dimensionality-reduced representations of each article, passes those to the fruit fly, and finally returns binary representations that seem to cluster nicely.
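One simple way to inspect whether such binary representations group related articles is to look at nearest neighbours under Hamming distance. The sketch below is only an illustration: the helper names are hypothetical, and the four 8-bit codes are made up by hand rather than produced by the fly.

```python
def hamming(a, b):
    """Number of positions at which two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbours(query_title, codes, n=2):
    """Return the n titles whose codes are closest to the query's code."""
    query = codes[query_title]
    others = [
        (hamming(query, code), title)
        for title, code in codes.items()
        if title != query_title
    ]
    return [title for _, title in sorted(others)[:n]]


# Toy, hand-written codes (NOT real fly output), for illustration only.
codes = {
    "Epistemology":      [1, 1, 0, 0, 1, 0, 0, 0],
    "Metaphysics":       [1, 1, 0, 0, 0, 0, 1, 0],
    "Quantum mechanics": [0, 0, 1, 1, 0, 1, 0, 0],
    "Thermodynamics":    [0, 0, 1, 1, 0, 0, 0, 1],
}

print(nearest_neighbours("Epistemology", codes, n=1))
```

With real fly output, the same check could be run over a sample of articles to see whether neighbours tend to share a topic.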

Further work

  • Analyse the quality of Wikipedia's meta-categories (our labels) as compared to the text extracted from external links.
  • Compare classification with Common Crawl data.
  • Classify texts with Common Crawl labels.