RepoSim

An approach to detecting semantically similar Python repositories using pre-trained language models.

About

This repository contains the notebooks and scripts for our approach to detecting semantically similar Python repositories using pre-trained language models.

Our best-performing model so far is UniXCoder fine-tuned on the code search task with the AdvTest dataset. For evaluations of different language models on repository similarity comparison, please refer to this Jupyter notebook: notebooks/BiEncoder/Embeddings_evaluation.ipynb

More details on the implementation and applications of our approach can be found in the scripts folder.
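
To give a rough idea of what comparing repositories by embeddings can look like, here is a minimal sketch. It is not the code from the notebooks or scripts: the mean pooling step, the helper names `mean_pool` and `cosine_similarity`, and the toy 3-dimensional vectors are all assumptions made for illustration. In practice the vectors would come from a language model such as the fine-tuned UniXCoder mentioned above.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector (assumed aggregation)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical per-function embeddings for three repositories (toy values).
repo_a_functions = [[1.0, 0.0, 1.0], [0.8, 0.2, 1.0]]
repo_b_functions = [[0.9, 0.1, 0.9]]
repo_c_functions = [[0.0, 1.0, 0.0]]

# One embedding per repository, then pairwise similarity.
emb_a = mean_pool(repo_a_functions)
emb_b = mean_pool(repo_b_functions)
emb_c = mean_pool(repo_c_functions)

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
print(f"A vs B: {sim_ab:.3f}, A vs C: {sim_ac:.3f}")
```

With these toy values, repositories A and B score much higher than A and C, which is the behaviour a similarity search relies on.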

Applications

RepoSnipy is a neural search engine for discovering similar Python repositories on GitHub, powered by RepoSim. Please feel free to give it a try!

Directory Structure

RepoSim
├── LICENSE
├── README.md
├── data
│   ├── df2txt.py  # Convert PoolC dataset for clone detection fine-tuning script
│   ├── repo_topic.json # Topic-Repos mapping
│   └── repo_topic.py  # Script to select repos from topics
├── notebooks
│   ├── BiEncoder
│   │   ├── Embeddings_evaluation.ipynb  # Evaluations for comparing different language models
│   │   ├── RepoSim.ipynb  # Our approach's implementation
│   │   └── UnixCoder_C4_Evaluation.ipynb
│   └── CrossEncoder
│       ├── Clone_Detection_C4_Evaluation.ipynb
│       ├── HungarianAlgorithm.ipynb  # Cross-encoder approaches for repo similarity comparison
│       └── keonalgorithms-TheAlgorithmsPython.csv  # Evaluation results from HungarianAlgorithm.ipynb
└── scripts
    ├── LICENSE
    ├── PlayGround.ipynb  # For experimenting with repo embeddings
    ├── README.md
    ├── pipeline.py  # Our approach's implementation as a HuggingFace pipeline
    ├── repo_sim.py
    └── requirements.txt

License

Distributed under the MIT License. See LICENSE for more information.

Acknowledgments