A Docker-based RAG (Retrieval Augmented Generation) system that provides intelligent querying of DSPy documentation using Ollama, Qdrant, and LlamaIndex.
- **Optimized for Apple MacBook Air M3 (Apple Silicon, 8 GB unified memory)**
- RAG Implementation: Uses LlamaIndex for document processing and retrieval
- Vector Storage: Qdrant for efficient vector storage and similarity search
- Local LLM: Ollama integration with orca-mini model
- Optimizations:
- Leverages Metal Performance Shaders (MPS) on Apple Silicon (see the device-selection sketch after this list)
- Docker Abstraction
- Robust Error Handling
- Lightweight Inference Models
- Streamlit Query Implementation
- Sentence-level chunking with overlap
- Cross-encoder reranking
- Custom prompt templates
- Efficient vector similarity search
- Asynchronous query processing
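A minimal sketch of the MPS optimization, assuming PyTorch and the modern `llama_index` package layout: the embedding model targets `mps` when available and falls back to CPU.

```python
import torch
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Prefer Metal Performance Shaders (MPS) on Apple Silicon; fall back to CPU.
device = "mps" if torch.backends.mps.is_available() else "cpu"

# The embedding model runs on the selected device.
embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    device=device,
)
```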
- macOS with Apple Silicon (M1/M2/M3)
- Docker Desktop
- Ollama Desktop
- Streamlit
```
┌─────────────────┐      ┌──────────────┐      ┌──────────────┐
│     RAG App     │      │    Qdrant    │      │    Ollama    │
│ - LlamaIndex    │─────▶│  Vector DB   │      │  Local LLM   │
│ - HF Embeddings │◀─────│              │      │              │
└─────────────────┘      └──────────────┘      └──────────────┘
```
- Clone the repository:
```bash
git clone <repository-url>
cd <repository-name>
```
- Start the Ollama Desktop application
- Pull the required model:
  - This should happen automatically via the Dockerfile
- Build and run with Docker Compose:
```bash
docker compose up --build
```
- Persistent storage for document embeddings
- Efficient similarity search
- Scalable vector database (see the connection sketch below)
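A sketch of how the app might connect to the Qdrant container, assuming the default HTTP port 6333 and an illustrative collection name `dspy_docs`:

```python
from qdrant_client import QdrantClient
from llama_index.core import StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Connect to the Qdrant container (default HTTP port 6333).
client = QdrantClient(host="localhost", port=6333)

# Wrap the collection as a LlamaIndex vector store; "dspy_docs" is illustrative.
vector_store = QdrantVectorStore(client=client, collection_name="dspy_docs")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
```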
- Sentence-level chunking (1024 tokens with 200-token overlap)
- HuggingFace embeddings (all-MiniLM-L6-v2)
- Filename-based document tracking (see the ingestion sketch below)
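A sketch of the ingestion step under these settings: the splitter values match the parameters above, and `filename_as_id=True` keys each document by its filename for tracking.

```python
from llama_index.core import Settings, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Chunking and embedding configuration matching the parameters above.
Settings.embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
Settings.node_parser = SentenceSplitter(chunk_size=1024, chunk_overlap=200)

# filename_as_id=True enables filename-based document tracking.
documents = SimpleDirectoryReader("docs", filename_as_id=True).load_data()
nodes = Settings.node_parser.get_nodes_from_documents(documents)
```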
- Cross-encoder reranking (ms-marco-MiniLM-L-2-v2)
- Custom prompt templates for consistent responses
- Top-k retrieval with reranking (see the sketch after this list)
- Async query processing
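A sketch of the query-engine wiring. The retrieval cutoffs (`similarity_top_k=10`, `top_n=3`) and the prompt wording are illustrative; the repo's actual values and template may differ.

```python
from llama_index.core import PromptTemplate, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.postprocessor import SentenceTransformerRerank

# Cross-encoder reranker named above; keep the 3 best chunks after reranking.
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2",
    top_n=3,
)

# Hypothetical prompt template; the repo's actual wording may differ.
qa_prompt = PromptTemplate(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Answer the query using only the context above.\n"
    "Query: {query_str}\n"
    "Answer: "
)

index = VectorStoreIndex.from_documents(SimpleDirectoryReader("docs").load_data())
query_engine = index.as_query_engine(
    similarity_top_k=10,             # retrieve broadly first...
    node_postprocessors=[reranker],  # ...then rerank down to top_n
    text_qa_template=qa_prompt,
)
response = query_engine.query("What is a DSPy signature?")
```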
- Local inference with orca-mini
- Optimized for Apple Silicon
- Configurable timeout and retry logic (see the sketch below)
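A sketch of the Ollama configuration with a request timeout plus a simple retry wrapper; `complete_with_retry` is illustrative, not part of the repo.

```python
import time

from llama_index.llms.ollama import Ollama

# Local Ollama endpoint as seen from inside a Docker container.
llm = Ollama(
    model="orca-mini",
    base_url="http://host.docker.internal:11434",
    request_timeout=120.0,  # seconds
)

def complete_with_retry(prompt: str, retries: int = 3, backoff: float = 2.0):
    """Illustrative retry wrapper: linear backoff between failed attempts."""
    for attempt in range(retries):
        try:
            return llm.complete(prompt)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))
```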
- Place PDF documentation in the `docs` directory
- Start the system:
```bash
docker compose up
```
- Interact with the system through the Streamlit UI (see the sketch below)
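A minimal sketch of what the Streamlit front end might look like; `rag_setup` is a hypothetical module standing in for the query-engine construction shown earlier, and the repo's actual UI may differ.

```python
import streamlit as st

from rag_setup import query_engine  # hypothetical; see the query-engine sketch

st.title("DSPy Documentation Q&A")

question = st.text_input("Ask a question about the DSPy docs:")
if question:
    with st.spinner("Retrieving context and generating an answer..."):
        response = query_engine.query(question)
    st.write(str(response))
```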
- Built with Python 3.9
- Uses LlamaIndex for document processing
- Async support for concurrent operations
- Comprehensive logging and error handling (see the async sketch below)
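A sketch of async querying with logging and error handling, again reusing the hypothetical `rag_setup` module; `aquery()` is LlamaIndex's async counterpart to `query()`.

```python
import asyncio
import logging

from rag_setup import query_engine  # hypothetical; see the query-engine sketch

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

async def ask(question: str):
    """Run one query asynchronously, logging success and failure."""
    try:
        response = await query_engine.aquery(question)  # async query entry point
        logger.info("Answered: %s", question)
        return response
    except Exception:
        logger.exception("Query failed: %s", question)
        raise

if __name__ == "__main__":
    asyncio.run(ask("How do DSPy optimizers work?"))
```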
- Qdrant for vector similarity search
- Persistent storage across sessions
- Efficient index management
- Ollama for local inference
- Optimized for Apple Silicon
- Configurable model parameters
- Ollama Connection Issues
  - Ensure Ollama Desktop is running
  - Check the model is pulled: `ollama list`
  - Verify `host.docker.internal` is accessible (see the check below)
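A quick connectivity check from inside the container; Ollama's `/api/tags` endpoint lists the locally available models.

```python
import requests

# List the models the Ollama server exposes.
resp = requests.get("http://host.docker.internal:11434/api/tags", timeout=5)
resp.raise_for_status()
models = [m["name"] for m in resp.json().get("models", [])]
print("Available models:", models)  # "orca-mini" should appear here
```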
- Memory Issues
  - Adjust Docker resource limits
  - Reduce the chunk size and overlap
  - Consider using a smaller model
- Performance Issues
  - Enable MPS acceleration
  - Adjust batch sizes
  - Monitor resource usage
Akshay Pachaar, Avi Chawla - A Crash Course on Building RAG Systems – Part 1 (With Implementation)