This app crawls through Wikipedia and stores the pages in JSON files. The JSONs can be used for RAG and LLM fine-tuning. This is still a work in progress, so please expect some bugs.
This Wikipedia Crawler exposes APIs to crawl through Wikipedia and store the pages in JSON files for RAG and LLM fine-tuning. The app is built with Flask and uses MongoDB for data storage. It is containerized with Docker and can be deployed using GitHub Actions.
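Once a crawl has run, the stored JSON files can be loaded and chunked into passages for a RAG pipeline. The sketch below is illustrative only: the `data/` directory and the `title`, `url`, and `text` fields are assumptions, not the app's confirmed output schema.

```python
# Illustrative sketch: load crawled pages and split them into fixed-size chunks
# for a RAG pipeline. The "data/" directory and the "title"/"url"/"text" fields
# are assumed names, not the app's confirmed output schema.
import json
from pathlib import Path


def load_passages(data_dir: str = "data", chunk_size: int = 1000) -> list[dict]:
    """Read every crawled JSON file and return fixed-size text chunks."""
    passages = []
    for path in Path(data_dir).glob("*.json"):
        page = json.loads(path.read_text(encoding="utf-8"))
        text = page.get("text", "")
        for start in range(0, len(text), chunk_size):
            passages.append(
                {
                    "title": page.get("title"),
                    "url": page.get("url"),
                    "chunk": text[start : start + chunk_size],
                }
            )
    return passages


if __name__ == "__main__":
    chunks = load_passages()
    print(f"Loaded {len(chunks)} passages from crawled pages")
```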
- Integration with NGINX and GUNICORN
- Simplified structure for easy project initiation
- Use of best practices and recommended plugins
- Integration with Docker for easy deployment
- Use of MongoDB for data storage and Redis for caching (a rough sketch of this pattern follows the list)
- Integration with GitHub Actions for deployment
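As a rough illustration of how the MongoDB/Redis pairing can work, the sketch below upserts a crawled page into MongoDB and uses a Redis set to skip titles that were already fetched. The connection URLs, the `wikipedia`/`pages` database and collection names, and the `crawled_titles` key are assumptions for the example, not the app's actual configuration.

```python
# Hypothetical sketch of the storage/caching pattern: MongoDB holds the page
# documents, Redis remembers which titles were already crawled.
# Connection URLs, database/collection names, and the Redis key are assumptions.
import redis
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")
pages = mongo["wikipedia"]["pages"]  # assumed database and collection names
cache = redis.Redis.from_url("redis://localhost:6379/0")


def save_page(title: str, url: str, text: str) -> bool:
    """Store a crawled page once; return False if it was already seen."""
    if cache.sismember("crawled_titles", title):
        return False  # already crawled, skip
    pages.update_one(
        {"title": title},
        {"$set": {"title": title, "url": url, "text": text}},
        upsert=True,
    )
    cache.sadd("crawled_titles", title)
    return True
```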
To get started, follow these steps:
- Clone the repository.

  ```bash
  git clone https://github.com/adhishthite/wikipedia-RAG-app.git
  ```

- Navigate to the repository.

  ```bash
  cd wikipedia-RAG-app
  ```

- Rename the `.env-t` file to `.env` and add/update the required environment variables.

  ```bash
  mv .env-t .env
  ```

- Build and start the containers using docker-compose. Once the stack is up, you can run the optional sanity check sketched after this list.

  ```bash
  docker-compose up --build
  ```
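After `docker-compose up` finishes, a quick way to confirm the data services are reachable is to ping them. The snippet below assumes MongoDB and Redis are exposed on their default localhost ports (27017 and 6379); adjust the hosts and ports if the compose file maps them differently.

```python
# Optional sanity check after `docker-compose up`.
# Assumes MongoDB and Redis are exposed on their default localhost ports.
import redis
from pymongo import MongoClient

MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=3000).admin.command("ping")
print("MongoDB is reachable")

redis.Redis(host="localhost", port=6379, socket_connect_timeout=3).ping()
print("Redis is reachable")
```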
[WIP]
I welcome feedback and suggestions. Please feel free to open an issue or submit a pull request.