A recommender for daily arXiv papers, customized with your prompt. Minimal, hackable, no boilerplate.
"I like innovative papers in large foundation models, multimodal methods, symbolic reasoning and automation."
We've all been overwhelmed by papers on arXiv: with ~300 new submissions per day in the cs.AI section alone, sifting through them can be daunting. This project scrapes the daily feed from https://arxiv.org/list/{namespace}/new, collects author data, and performs two-stage ranking:
- Coarse ranking: use the authors' impact index and a CPU-friendly embedding model (per the MTEB leaderboard 🤗) to reduce the candidate pool to ~20 papers via weighted Copeland ranking.
- Reranking: optionally use GPT-4 to choose the top k and write a summary (cheap, since it is just one call per day).
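The coarse stage can be pictured as follows. This is an illustrative sketch of weighted Copeland ranking over two score dimensions, not the project's actual implementation; the `impact` and `similarity` keys and the weights are hypothetical.

```python
def copeland_rank(papers, weights):
    """Rank papers by weighted pairwise wins (Copeland-style).

    papers: list of dicts whose keys include every key in `weights`.
    For each score dimension, a paper earns that dimension's weight
    for every pairwise win, and half the weight for a tie.
    """
    n = len(papers)
    points = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            for key, w in weights.items():
                if papers[i][key] > papers[j][key]:
                    points[i] += w
                elif papers[i][key] == papers[j][key]:
                    points[i] += w / 2
    order = sorted(range(n), key=lambda k: points[k], reverse=True)
    return [papers[k] for k in order]

papers = [
    {"title": "A", "impact": 10, "similarity": 0.2},
    {"title": "B", "impact": 5,  "similarity": 0.9},
    {"title": "C", "impact": 1,  "similarity": 0.1},
]
ranked = copeland_rank(papers, {"impact": 0.4, "similarity": 0.6})
# With similarity weighted higher, B edges out A despite lower impact.
```

A nice property of Copeland-style ranking is that it combines incomparable scales (citation counts vs. cosine similarities) without normalizing them, since only pairwise comparisons matter.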
Prepare environment
conda create -n "arxplorer" python=3.11
conda activate arxplorer
pip install -r requirements.txt
(Recommended) Use an OpenAI key for summarization and better ranking.
echo 'OPENAI_API_KEY=your_api_key_here' >> .env
GO!
python run.py
You can customize your preferences or interests with
echo 'INSTRUCTION="I like ..."' >> .env
Use `namespace` to specify the arXiv section to scrape from (make sure https://arxiv.org/list/{namespace}/new can be visited). Use `top_k` to specify the final number of feeds you want to see. `coarse_k` is the intermediate pool size from coarse ranking and should always be larger than `top_k`.
python run.py --namespace="cs.AI" --top_k=10 --coarse_k=20
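Parsing and validating these flags could look like the sketch below. The flag names and defaults mirror the example command, but this is an assumption about `run.py`'s interface, not its actual parser.

```python
import argparse

parser = argparse.ArgumentParser(description="Rank today's arXiv feed")
parser.add_argument("--namespace", default="cs.AI",
                    help="arXiv listing to scrape, e.g. cs.AI or cs.LG")
parser.add_argument("--top_k", type=int, default=10,
                    help="final number of papers to show")
parser.add_argument("--coarse_k", type=int, default=20,
                    help="intermediate pool size; must exceed top_k")

args = parser.parse_args(["--namespace=cs.AI", "--top_k=10", "--coarse_k=20"])
assert args.coarse_k > args.top_k, "coarse_k must be larger than top_k"
```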
`fast_mode` is set to True by default, which skips author-related features. Collecting author data reliably (using scholarly and free-proxy) can be painfully slow at first, and it speeds up as `authors_cache.db` builds up its cache. If you are deploying on a server or have ~1 hour to let it run,
python run.py --fast_mode=False
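The speedup from `authors_cache.db` comes from the cache absorbing repeat lookups. A hypothetical sketch of such a SQLite-backed cache is below; the table schema, function names, and the `fetch` callback are illustrative, not the project's real code.

```python
import sqlite3

def get_hindex(conn, name, fetch):
    """Return an author's h-index, calling `fetch(name)` only on a cache miss."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS authors (name TEXT PRIMARY KEY, hindex INTEGER)"
    )
    row = conn.execute(
        "SELECT hindex FROM authors WHERE name = ?", (name,)
    ).fetchone()
    if row is not None:
        return row[0]            # fast path: already cached
    h = fetch(name)              # slow path, e.g. a scholarly lookup via proxy
    conn.execute("INSERT OR REPLACE INTO authors VALUES (?, ?)", (name, h))
    conn.commit()
    return h
```

Since every author seen once is served from disk afterwards, the first run pays the full lookup cost while later runs mostly hit the cache.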
This ranker is soooo biased, and I'm pretty sure some cool papers get overlooked. Still, I find it helpful for capturing part of what I would otherwise regret missing.
I'll create a Twitter bot soon to serve this project as a daily feed. Feel free to contact me @billxbf with suggestions, or contribute more features, faster pipelines, etc. :)