Neural search engine for discovering semantically similar Python repositories on GitHub.
TODO --- Update the gif file!!!
Searching an indexed repository:
RepoSnipy is a neural search engine built with Streamlit and DocArray. You can query a public Python repository hosted on GitHub and find popular repositories that are semantically similar to it.
Compared to the previous generation of RepoSnipy, the latest version adds the following new features:
- It uses RepoSim4Py, which is based on the RepoSim4Py pipeline, to create multi-level embeddings for Python repositories.
- Multi-level embeddings --- code, doc, readme, requirement, and repository.
- It uses the SciBERT model to analyse repository topics and to generate embeddings for topics.
- Clustering by topics --- it merges multiple topics into one cluster, using a KMeans model (kmeans_model_topic_scibert) to analyse topic embeddings and cluster repositories based on topics.
- Clustering by code snippets --- it uses a KMeans model (kmeans_model_code_unixcoder) to analyse code embeddings and to cluster repositories based on code snippets.
- It uses the SimilarityCal model, a binary classifier that calculates cluster-based similarity from repository-level embeddings and cluster assignments (topic or code cluster number). SimilarityCal treats a pair of repositories in the same cluster as label 1, and otherwise as label 0. Its input features are the concatenation of the two repositories' embeddings, with the binary labels defined as above; its output is a pair of scores indicating how similar or dissimilar the two repositories are.
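As a rough illustration of the pairing scheme described above, the sketch below builds the input features and binary label for one repository pair. The helper names and embedding dimensions are hypothetical, not taken from the actual SimilarityCal code:

```python
import numpy as np

def make_pair_features(emb_a, emb_b):
    # SimilarityCal's input is the concatenation of two repositories'
    # repository-level embeddings (helper name is hypothetical)
    return np.concatenate([emb_a, emb_b])

def make_pair_label(cluster_a, cluster_b):
    # Label 1 when both repositories fall in the same cluster, else 0
    return int(cluster_a == cluster_b)

# Toy 4-dimensional embeddings; real repository embeddings are much larger
emb_a = np.array([0.1, 0.2, 0.3, 0.4])
emb_b = np.array([0.4, 0.3, 0.2, 0.1])

features = make_pair_features(emb_a, emb_b)  # concatenated vector of length 8
label = make_pair_label(3, 3)                # same cluster, so label 1
```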
We have created a vector dataset (stored as a docarray index) of approximately 9,700 GitHub Python repositories that have a license and over 300 stars as of March 2024. The corresponding clusters were put into two JSON datasets (repo_topic_clusters and repo_code_clusters), stored as repo-cluster key-value pairs.
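The repo-cluster JSON datasets are plain mappings from repository name to cluster id. A minimal sketch of how such a file could be written and loaded back (the entries below are made up, not from the real dataset):

```python
import json

# Hypothetical excerpt: repository full name -> cluster id
repo_topic_clusters = {
    "owner-a/repo-a": 12,
    "owner-b/repo-b": 12,  # same topic cluster as repo-a
    "owner-c/repo-c": 7,
}

# Store the mapping as JSON, the format used by the cluster datasets
with open("repo_topic_clusters.json", "w") as f:
    json.dump(repo_topic_clusters, f)

# The app can then load the repo-cluster mapping at startup
with open("repo_topic_clusters.json") as f:
    clusters = json.load(f)
```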
- Python 3.11
- pip
We recommend first creating a conda environment with Python 3.11, then downloading the repository. See below:
conda create --name py311 python=3.11
conda activate py311
git clone https://github.com/RepoMining/RepoSnipy
After downloading the repository, you need to install the required packages. Make sure the python and pip you use are both from the conda environment! Run the following:
cd RepoSnipy
pip install -r requirements.txt
Then run the app on your local machine using:
streamlit run app.py
or
python -m streamlit run app.py
Importantly, to avoid unnecessary conflicts (such as version conflicts or package location conflicts), make sure the streamlit you use also comes from the conda environment!
We deployed RepoSnipy on HuggingFace Space. You can try it directly in the RepoSnipy Space.
As mentioned above, RepoSnipy needs the vector dataset, the cluster JSON datasets (repo_topic_clusters and repo_code_clusters), the KMeans models (kmeans_model_topic_scibert and kmeans_model_code_unixcoder), and the SimilarityCal model at startup. For your convenience, we have uploaded them to the data folder of this repository.
For research purposes, we also provide the following scripts so you can recreate them:
create_index.py # For creating vector dataset (binary files)
generate_clusters.py # For creating useful cluster models and information (KMeans models and json files representing repo-clusters, including repo-topic_cluster and repo-code_cluster)
For more details, refer to the two scripts above. Running them produces the following files:
- Generated by create_index.py:
repositories.txt # the original repositories file
invalid_repositories.txt # the invalid repositories file
filtered_repositories.txt # the final repositories file, removing duplicated and invalid repositories
index{i}_{i * target_sub_length}.bin # the sub-index files, where i is the number of sub-repositories and target_sub_length is the sub-repositories length
index.bin # the index file, merged from the sub-index files with numpy zero arrays removed
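The merge-and-filter step can be pictured with plain numpy. This is a simplified stand-in, assuming the sub-indexes reduce to embedding arrays; the real script operates on docarray index files, and the names here are illustrative:

```python
import numpy as np

# Embedding rows from two hypothetical sub-index files
sub_index_1 = np.array([[0.1, 0.2], [0.0, 0.0]])  # second row is an all-zero array
sub_index_2 = np.array([[0.3, 0.4]])

# Merge the sub-indexes, then drop rows that are entirely zero,
# mirroring how index.bin is assembled from the index{i}_*.bin files
merged = np.vstack([sub_index_1, sub_index_2])
index = merged[~np.all(merged == 0.0, axis=1)]
```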
- Generated by generate_clusters.py:
repo_topic_clusters.json # a json file representing repo-topic_cluster dictionary
kmeans_model_topic_scibert.pkl # a pickle file for storing kmeans model based on topic embeddings generated by SciBERT model
repo_code_clusters.json # a json file representing repo-code_cluster dictionary
kmeans_model_code_unixcoder.pkl # a pickle file for storing kmeans model based on code embeddings generated by UniXCoder model
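A minimal sketch of how such a KMeans pickle could be produced and reused. Toy random data stands in for the real SciBERT topic embeddings, and the cluster count here is illustrative only:

```python
import pickle
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for topic embeddings produced by the SciBERT model
rng = np.random.default_rng(0)
topic_embeddings = rng.normal(size=(100, 16))

# Fit and persist the model, mirroring kmeans_model_topic_scibert.pkl
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(topic_embeddings)
with open("kmeans_model_topic_scibert.pkl", "wb") as f:
    pickle.dump(kmeans, f)

# Reload the pickle and assign a cluster to an embedding
with open("kmeans_model_topic_scibert.pkl", "rb") as f:
    model = pickle.load(f)
cluster_id = int(model.predict(topic_embeddings[:1])[0])
```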
- Generated by SimilarityCal project:
SimilarityCal_model_NO1.pt # the SimilarityCal NO.1 model, based on PyTorch
The evaluation script finds all combinations of repository pairs in the dataset and calculates the cosine similarity between their embeddings. It also checks whether the two repositories share at least one topic (excluding python and python3). Then we compare the two and use the ROC AUC score to evaluate the embeddings' performance.
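The evaluation logic can be sketched as follows, using toy embeddings and topics rather than real data; the actual script runs over all pairs in the full dataset:

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import roc_auc_score

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings and topics for four repositories (illustrative only)
embeddings = {
    "repo_a": np.array([1.0, 0.0, 0.1]),
    "repo_b": np.array([0.9, 0.1, 0.0]),
    "repo_c": np.array([0.0, 1.0, 0.2]),
    "repo_d": np.array([0.1, 0.9, 0.0]),
}
topics = {
    "repo_a": {"ml", "python"},
    "repo_b": {"ml"},
    "repo_c": {"web"},
    "repo_d": {"web", "python3"},
}
IGNORED = {"python", "python3"}  # generic topics that don't count as shared

scores, labels = [], []
for r1, r2 in combinations(embeddings, 2):
    scores.append(cosine_similarity(embeddings[r1], embeddings[r2]))
    # Label 1 if the pair shares at least one non-generic topic
    labels.append(int(bool((topics[r1] & topics[r2]) - IGNORED)))

auc = roc_auc_score(labels, scores)
```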
The resultant dataframe, containing the cosine similarities of all pairs, can be downloaded from here; it includes evaluations of embeddings at all 5 levels (code, doc, readme, requirement, repository).
The resultant ROC AUC score for each embedding level is as follows:
- code embeddings with 0.839.
- doc embeddings with 0.808.
- readme embeddings with 0.781.
- requirement embeddings with 0.638.
- repository embeddings with 0.827.
Distributed under the MIT License. See LICENSE for more information.
The model and the fine-tuning dataset used: