Wikipedia Crawler

This app crawls Wikipedia and stores the pages as JSON files. The JSON files can be used for RAG and LLM fine-tuning. This is still a work in progress, so please expect some bugs.

Overview

The Wikipedia Crawler exposes APIs that crawl Wikipedia and store the pages as JSON files. The app is built with Flask, uses MongoDB for data storage, is containerized with Docker, and can be deployed using GitHub Actions.
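
To make the crawl-and-store idea concrete, here is a minimal sketch of fetching one page and writing it to a JSON file. It is not the code used in this repository: it assumes the public MediaWiki Action API (`action=query` with `prop=extracts`), and the function names, the `User-Agent` string, and the JSON field names (`title`, `page_id`, `text`) are hypothetical choices made for illustration only.

```python
import json
import re
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the English Wikipedia endpoint of the MediaWiki Action API.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_query_url(title: str) -> str:
    """Build a query URL that asks for the plain-text extract of one page."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "redirects": 1,
        "format": "json",
        "titles": title,
    }
    return f"{API_URL}?{urlencode(params)}"


def parse_page(payload: dict) -> dict:
    """Reduce the API response to a small record (field names are hypothetical)."""
    page = next(iter(payload["query"]["pages"].values()))
    return {
        "title": page["title"],
        "page_id": page.get("pageid"),
        "text": page.get("extract", ""),
    }


def fetch_page(title: str) -> dict:
    """Download one page over HTTP and return the parsed record."""
    request = Request(
        build_query_url(title),
        headers={"User-Agent": "wiki-crawler-sketch/0.1"},  # hypothetical agent name
    )
    with urlopen(request, timeout=10) as response:
        return parse_page(json.load(response))


def save_page(page: dict, out_dir: Path) -> Path:
    """Write the record to <out_dir>/<sanitized title>.json and return the path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    safe_name = re.sub(r"[^A-Za-z0-9_-]+", "_", page["title"]).strip("_") or "page"
    path = out_dir / f"{safe_name}.json"
    path.write_text(json.dumps(page, ensure_ascii=False, indent=2), encoding="utf-8")
    return path
```

With network access, `save_page(fetch_page("Alan Turing"), Path("pages"))` would write a file named `Alan_Turing.json` under `pages/`. Keeping the parsing and saving steps separate from the HTTP call means they can be exercised without touching Wikipedia.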

Features

Integration with NGINX and Gunicorn
Simplified structure for easy project initiation
Use of best practices and recommended plugins
Integration with Docker for easy deployment
Use of MongoDB for data storage and Redis for caching
Integrated with GitHub Actions

Getting Started

To get started with the app, follow these steps:

Clone the repository.
```
git clone https://github.com/adhishthite/wikipedia-RAG-app.git
```

Navigate to the repository.
```
cd wikipedia-RAG-app
```
Rename the .env-t file to .env and add/update the required environment variables.
```
mv .env-t .env
```
Build the Docker image and start the containers using docker-compose.
```
docker-compose up --build
```

[WIP]

License

Feedback

I welcome feedback and suggestions. Please feel free to open an issue or submit a pull request.
