This project implements a research paper co-authored with Dr. Randy Lin (Algoma University) on leveraging LLMs to generate accurate and relevant image captions. The paper has been accepted at the IEEE/ICCA'24 conference (Sixth Edition, BUE).
Blind individuals and those with severe vision impairments face significant challenges in navigating web content, especially when it comes to understanding images. Since images frequently carry essential information or context, that content becomes inaccessible without accompanying text alternatives.
Smart Caption AI is a Chrome extension that generates image captions using Large Language Models (LLMs). Its distinctive approach is to first summarize the webpage content or article and then use that summary as context for generating image captions. This lets the LLM "understand" the details and context surrounding each image, resulting in more accurate and relevant descriptions. The tool employs a multi-agent system:
- Proxy Agent: Controls the conversation among other agents
- WebSurfer Agent: Surfs and summarizes webpage content
- Image Agent: Converts images to text

This approach enables Smart Caption AI to provide more contextually appropriate and accurate image captions by leveraging the surrounding content of the webpage. Besides generating image captions, the tool also simplifies the webpage (removing ads, unnecessary information, etc.) and provides a text-to-speech feature.
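As a rough illustration, the three agents could be wired together with Pyautogen along the following lines. This is a minimal sketch, not the project's exact configuration: the agent names, system messages, and `gpt-4` model choice are assumptions.

```python
# Hypothetical sketch of the three-agent pipeline using pyautogen.
# Agent names, system messages, and model choice are illustrative assumptions.
import autogen

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_OPENAI_API_KEY"}]}

# Proxy Agent: drives the conversation between the other agents
proxy = autogen.UserProxyAgent(
    name="proxy_agent",
    human_input_mode="NEVER",
    code_execution_config=False,
)

# WebSurfer Agent: summarizes the webpage to build caption context
web_surfer = autogen.AssistantAgent(
    name="websurfer_agent",
    system_message="Fetch the article and produce a concise summary.",
    llm_config=llm_config,
)

# Image Agent: describes each image, using the summary as context
image_agent = autogen.AssistantAgent(
    name="image_agent",
    system_message="Given the article summary, caption each image accurately.",
    llm_config=llm_config,
)

# A group chat lets the proxy route messages between the two workers
group_chat = autogen.GroupChat(
    agents=[proxy, web_surfer, image_agent], messages=[], max_round=6
)
manager = autogen.GroupChatManager(groupchat=group_chat, llm_config=llm_config)

proxy.initiate_chat(manager, message="Summarize the article, then caption its images.")
```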
Below is the workflow of the tool:
- Python 3.11
- OpenAI API key, or your own open-source LLM host
- Framework and libraries: Flask-RESTful, Pyautogen, Readability.js, etc.
- Clone the repository
- Go to the `smart-caption-ai` folder
- Install Python: https://www.python.org/downloads/
- Set up a Python virtual environment:
  ```
  python -m venv .venv
  ```
- Activate the virtual environment:
  - On Windows:
    ```
    .venv\Scripts\activate
    ```
  - On Unix or macOS:
    ```
    source .venv/bin/activate
    ```
- Install library dependencies:
  ```
  pip install -r requirements.txt
  ```
- Open the `.env` file in the `server` folder and configure your LLM backend:
  - If you use OpenAI, declare your `OPENAI_API_KEY`.
  - If you host an open-source LLM yourself, declare the URL of the server hosting it as `BASE_URL`.
- Run the app:
  ```
  python -m server.main
  ```
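As an illustration, the `.env` file could look like the following; the values are placeholders, and only `OPENAI_API_KEY` and `BASE_URL` are known variables from this README:

```
# Option 1: OpenAI (placeholder value)
OPENAI_API_KEY=your-openai-api-key

# Option 2: self-hosted open-source LLM (illustrative URL)
BASE_URL=http://localhost:8000
```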
To verify the backend is running, you can query it with:

```
curl --location 'http://127.0.0.1:5000/ai/convert' \
  --header 'Content-Type: application/json' \
  --data '{
    "article_url": "https://www.aljazeera.com/gallery/2021/3/18/families-forced-into-a-deadly-spiral-in-central-african-republic",
    "images_url": [
      { "url": "https://www.aljazeera.com/wp-content/uploads/2021/03/3-3.jpg?fit=1170%2C746&quality=80" }
    ]
  }'
```

- `article_url`: the URL of the webpage content
- `images_url`: the list of image URLs in the article
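For reference, a minimal sketch of what the `/ai/convert` resource could look like with Flask-RESTful is shown below. The handler body, the response shape, and the stubbed caption logic are hypothetical stand-ins for the actual agent pipeline.

```python
# Minimal Flask-RESTful sketch of the /ai/convert endpoint.
# The caption logic and response shape are hypothetical placeholders.
from flask import Flask, request
from flask_restful import Api, Resource

app = Flask(__name__)
api = Api(app)

class ConvertResource(Resource):
    def post(self):
        payload = request.get_json(force=True)
        article_url = payload["article_url"]
        images_url = payload.get("images_url", [])
        # In the real tool, the multi-agent pipeline summarizes the article
        # and captions each image; here we return a stub response.
        captions = [
            {"url": image["url"], "caption": "<generated caption>"}
            for image in images_url
        ]
        return {"article_url": article_url, "captions": captions}

api.add_resource(ConvertResource, "/ai/convert")

if __name__ == "__main__":
    app.run(port=5000)
```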
- Open the Chrome browser and go to Settings / Extensions / Manage Extensions
- Enable 'Developer Mode' in the top-right corner
- Click 'Load unpacked' in the top-left corner
- Select the 'chrome-extension' folder from the repository you cloned earlier. The extension will then be listed as shown below:
- Open the extension and 'pin' it for easy access.
- Now you can open an article (e.g., GitHub Pages) and click the tool's icon. The result looks like the following:
- Fix issues with displaying images on some websites: the tool is not compatible with all sites, as some use iframe structures that prevent it from displaying all of the images.
- Optimize processing time by sending the article content directly instead of the URL, reducing the agent's 'surfing' time (a hypothetical request shape is sketched below).
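One way that optimization might look from the client side is sketched below; the `article_content` field is an assumption for this proposed feature and is not part of the current API.

```python
# Hypothetical request for the proposed optimization: sending the extracted
# article text directly instead of a URL. The "article_content" field is an
# assumption, not part of the current API.
import requests

payload = {
    "article_content": "<full article text extracted client-side>",
    "images_url": [{"url": "https://example.com/photo.jpg"}],
}
response = requests.post("http://127.0.0.1:5000/ai/convert", json=payload)
print(response.json())
```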