This app crawls through Wikipedia and stores the pages in JSON files. The JSONs can be used for RAG and LLM fine-tuning. This is still a work in progress, so please expect some bugs.
This Wikipedia Crawler exposes APIs to crawl through Wikipedia and store the pages in JSON files for RAG and LLM fine-tuning. The app is built with Flask and uses MongoDB for data storage. It is containerized with Docker and can be deployed using GitHub Actions.
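Once a crawl has run, the stored JSON files can be loaded and chunked into passages for a RAG pipeline. The sketch below is illustrative only: the `data/` directory and the `title`, `url`, and `text` fields are assumptions, not the app's confirmed output schema.

```python
# Illustrative sketch: load crawled pages and split them into fixed-size chunks
# for a RAG pipeline. The "data/" directory and the "title"/"url"/"text" fields
# are assumed names, not the app's confirmed output schema.
import json
from pathlib import Path


def load_passages(data_dir: str = "data", chunk_size: int = 1000) -> list[dict]:
    """Read every crawled JSON file and return fixed-size text chunks."""
    passages = []
    for path in Path(data_dir).glob("*.json"):
        page = json.loads(path.read_text(encoding="utf-8"))
        text = page.get("text", "")
        for start in range(0, len(text), chunk_size):
            passages.append(
                {
                    "title": page.get("title"),
                    "url": page.get("url"),
                    "chunk": text[start : start + chunk_size],
                }
            )
    return passages


if __name__ == "__main__":
    chunks = load_passages()
    print(f"Loaded {len(chunks)} passages from crawled pages")
```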
- Integration with NGINX and GUNICORN
- Simplified structure for easy project initiation
- Use of best practices and recommended plugins
- Integration with Docker for easy deployment
- Use of MongoDB for data storage and Redis for caching (a rough sketch of this pattern follows the list)
- Integration with GitHub Actions for deployment
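As a rough illustration of how the MongoDB/Redis pairing can work, the sketch below upserts a crawled page into MongoDB and uses a Redis set to skip titles that were already fetched. The connection URLs, the `wikipedia`/`pages` database and collection names, and the `crawled_titles` key are assumptions for the example, not the app's actual configuration.

```python
# Hypothetical sketch of the storage/caching pattern: MongoDB holds the page
# documents, Redis remembers which titles were already crawled.
# Connection URLs, database/collection names, and the Redis key are assumptions.
import redis
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")
pages = mongo["wikipedia"]["pages"]  # assumed database and collection names
cache = redis.Redis.from_url("redis://localhost:6379/0")


def save_page(title: str, url: str, text: str) -> bool:
    """Store a crawled page once; return False if it was already seen."""
    if cache.sismember("crawled_titles", title):
        return False  # already crawled, skip
    pages.update_one(
        {"title": title},
        {"$set": {"title": title, "url": url, "text": text}},
        upsert=True,
    )
    cache.sadd("crawled_titles", title)
    return True
```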
To get started, follow these steps:
- Clone the repository.

  ```bash
  git clone https://github.com/adhishthite/wikipedia-RAG-app.git
  ```

- Navigate to the repository.

  ```bash
  cd wikipedia-RAG-app
  ```

- Rename the `.env-t` file to `.env` and add/update the required environment variables.

  ```bash
  mv .env-t .env
  ```

- Build and start the containers using docker-compose. Once the stack is up, you can run the optional sanity check sketched after this list.

  ```bash
  docker-compose up --build
  ```
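After `docker-compose up` finishes, a quick way to confirm the data services are reachable is to ping them. The snippet below assumes MongoDB and Redis are exposed on their default localhost ports (27017 and 6379); adjust the hosts and ports if the compose file maps them differently.

```python
# Optional sanity check after `docker-compose up`.
# Assumes MongoDB and Redis are exposed on their default localhost ports.
import redis
from pymongo import MongoClient

MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=3000).admin.command("ping")
print("MongoDB is reachable")

redis.Redis(host="localhost", port=6379, socket_connect_timeout=3).ping()
print("Redis is reachable")
```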
[WIP]
I welcome feedback and suggestions. Please feel free to open an issue or submit a pull request.