RepoSim

An approach to detecting semantically similar Python repositories using pre-trained language models.

About

This repository contains the notebooks and scripts for our approach to detecting semantically similar Python repositories using pre-trained language models.

Currently, our best-performing model is UniXCoder fine-tuned on the code search task with the AdvTest dataset. For evaluations of different language models on repository similarity comparison, see the Jupyter notebook notebooks/BiEncoder/Embeddings_evaluation.ipynb.

More details on the implementation and applications of our approach can be found in the scripts folder.
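
As a rough illustration of embedding-based repository comparison, the sketch below averages per-function embedding vectors into one vector per repository and compares repositories by cosine similarity. The mean pooling, the function names (mean_vector, cosine_similarity, repo_similarity), and the toy vectors are assumptions made for illustration; they are not taken from the scripts in this repository, and real vectors would come from a language model such as UniXCoder.

```python
import math


def mean_vector(vectors):
    """Average equal-length vectors into a single vector (assumed pooling step)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def repo_similarity(function_embeddings_a, function_embeddings_b):
    """Compare two repositories via the cosine similarity of their mean embeddings."""
    return cosine_similarity(
        mean_vector(function_embeddings_a),
        mean_vector(function_embeddings_b),
    )


# Toy 2-dimensional "embeddings" standing in for model output (hypothetical data).
repo_a = [[1.0, 0.0], [0.0, 1.0]]
repo_b = [[2.0, 2.0]]
repo_c = [[1.0, -1.0]]

print(repo_similarity(repo_a, repo_b))  # same direction: close to 1
print(repo_similarity(repo_a, repo_c))  # orthogonal: close to 0
```

Cosine similarity ignores vector length, so under this sketch a repository with many functions and one with few can still score as similar when their averaged embeddings point in the same direction.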

Applications

RepoSnipy is a neural search engine for discovering similar Python repositories on GitHub, powered by RepoSim. Please feel free to give it a try!

Directory Structure

RepoSim
├── LICENSE
├── README.md
├── data
│   ├── df2txt.py  # Converts the PoolC dataset for the clone detection fine-tuning script
│   ├── repo_topic.json # Topic-Repos mapping
│   └── repo_topic.py  # Script to select repos from topics
├── notebooks
│   ├── BiEncoder
│   │   ├── Embeddings_evaluation.ipynb  # Evaluations for comparing different language models
│   │   ├── RepoSim.ipynb  # Our approach's implementation
│   │   └── UnixCoder_C4_Evaluation.ipynb
│   └── CrossEncoder
│       ├── Clone_Detection_C4_Evaluation.ipynb
│       ├── HungarianAlgorithm.ipynb  # Cross-encoder approaches for repo similarity comparison
│       └── keonalgorithms-TheAlgorithmsPython.csv  # Evaluation results from HungarianAlgorithm.ipynb
└── scripts
    ├── LICENSE
    ├── PlayGround.ipynb  # For experimenting with repo embeddings
    ├── README.md
    ├── pipeline.py  # Our approach's implementation as a HuggingFace pipeline
    ├── repo_sim.py
    └── requirements.txt

License

Distributed under the MIT License. See LICENSE for more information.

Acknowledgments

GraphCodeBERT
UniXCoder
AdvTest
Sentence Transformers
awesome-python
Original work of the customized GraphCodeBERT model by @snoop2head
Python clone dataset from dacon
Python clone dataset shared by PoolC
