2.3. Implement clustering algorithm
Previously, we evaluated our fruit fly by classifying documents from three different datasets: Web of Science, Wikipedia, and 20newsgroups. Further tests have indicated, however, that the fruit fly's performance is less satisfactory in clustering tasks.
We have found a way to significantly improve the quality of the Web document representations returned by our flies. Technically speaking, this means adding a dimensionality reduction step, based on topological data analysis, to the fly's input layer.
Code at https://github.com/PeARSearch/PeARS-fruit-fly/tree/main/dense_fruit_fly
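To make the idea of feeding a dense, reduced vector into the fly's input layer more concrete, here is a minimal sketch. It is not the code from the repository: it assumes the usual fruit-fly hashing scheme (a sparse random projection onto a larger layer of Kenyon cells, followed by winner-take-all), and the names `make_projection` and `fly_hash`, the layer sizes, and the toy input vector are all hypothetical. The dimensionality reduction step itself is not shown; the input is simply assumed to be already reduced.

```python
import random


def make_projection(input_dim, num_kenyon_cells, connections_per_cell, seed=0):
    """Build a sparse random projection (hypothetical helper).

    Each Kenyon cell is connected to a small random subset of the
    input dimensions, stored as a list of input indices.
    """
    rng = random.Random(seed)
    return [
        rng.sample(range(input_dim), connections_per_cell)
        for _ in range(num_kenyon_cells)
    ]


def fly_hash(vector, projection, top_k):
    """Turn a dense vector into a binary code with top_k active bits."""
    # Activation of a Kenyon cell = sum of the inputs it is connected to.
    activations = [sum(vector[i] for i in cell) for cell in projection]
    # Winner-take-all: only the top_k most active cells fire.
    winners = sorted(
        range(len(activations)),
        key=lambda j: activations[j],
        reverse=True,
    )[:top_k]
    code = [0] * len(activations)
    for j in winners:
        code[j] = 1
    return code


# Toy values standing in for a dimensionality-reduced document vector
# (assumption: in the real pipeline this comes from the reduction step).
reduced_doc = [0.12, -0.40, 0.88, 0.05, -0.31, 0.67, 0.23, -0.09]

projection = make_projection(input_dim=8, num_kenyon_cells=32, connections_per_cell=3)
code = fly_hash(reduced_doc, projection, top_k=4)

print(code)
print(sum(code))  # number of active bits, equal to top_k
```

Because the input is low-dimensional and dense, every Kenyon cell receives a signal from each of its connections, unlike a sparse bag-of-words input where most connections would read zeros.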
Here are two samples, one from a philosophy cluster and one from a physics cluster.
Code at https://github.com/PeARSearch/PeARS-fruit-fly/tree/main/web_map/umap
We have written code which downloads the whole of Wikipedia in a given language, produces dimensionality-reduced representations of each article, passes those to the fruit fly, and finally returns binary representations that seem to cluster nicely.
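One simple way to check whether binary representations cluster by topic is to compare Hamming distances within and across categories. The sketch below is an illustration under stated assumptions, not part of the repository: the function names are hypothetical, and the codes and labels are hand-made toy data, not output from the fly.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


def mean_distances(codes, labels):
    """Mean Hamming distance for same-label and different-label pairs.

    Assumes at least one pair of each kind exists; otherwise the
    division below would fail.
    """
    within, between = [], []
    for (i, a), (j, b) in combinations(enumerate(codes), 2):
        bucket = within if labels[i] == labels[j] else between
        bucket.append(hamming(a, b))
    return sum(within) / len(within), sum(between) / len(between)


# Hand-made toy codes (assumption: real codes would come from the fly).
codes = [
    [1, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0, 0, 1],
]
labels = ["philosophy", "philosophy", "physics", "physics"]

within, between = mean_distances(codes, labels)
print(within, between)
```

If documents from the same category get similar codes, the within-category mean should be clearly lower than the between-category mean.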
- Analyse the quality of Wikipedia's meta-categories (our labels) as compared to the text extracted from external links.
- Compare classification with Common Crawl data.
- Classify texts with Common Crawl labels.