This project implements a research paper co-authored with Dr. Randy Lin (Algoma University) on leveraging LLMs to generate accurate and relevant image captions. The paper has been accepted at the IEEE/ICCA'24 conference (Sixth Edition, BUE).
Blind individuals and those with severe vision impairments face significant challenges in navigating web content, especially when it comes to understanding images. Since images frequently carry essential information or context, that content becomes inaccessible without accompanying text alternatives.
Smart Caption AI is a Chrome extension that generates image captions using Large Language Models (LLMs). Its distinctive approach is to first summarize the webpage content or article and then use that summary as context for generating image captions. This lets the LLM "understand" the details and context surrounding each image, resulting in more accurate and relevant descriptions. The tool employs a multi-agent system:
- Proxy Agent: Controls the conversation among other agents
- WebSurfer Agent: Surfs and summarizes webpage content
- Image Agent: Converts images to text

This approach enables Smart Caption AI to provide more contextually appropriate and accurate image captions by leveraging the surrounding content of the webpage. Besides generating image captions, the tool also simplifies the webpage (removing ads, unnecessary information, etc.) and provides a text-to-speech feature.
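As a rough illustration, the three agents could be wired together with Pyautogen along the following lines. This is a minimal sketch, not the project's exact configuration: the agent names, system messages, and `gpt-4` model choice are assumptions.

```python
# Hypothetical sketch of the three-agent pipeline using pyautogen.
# Agent names, system messages, and model choice are illustrative assumptions.
import autogen

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_OPENAI_API_KEY"}]}

# Proxy Agent: drives the conversation between the other agents
proxy = autogen.UserProxyAgent(
    name="proxy_agent",
    human_input_mode="NEVER",
    code_execution_config=False,
)

# WebSurfer Agent: summarizes the webpage to build caption context
web_surfer = autogen.AssistantAgent(
    name="websurfer_agent",
    system_message="Fetch the article and produce a concise summary.",
    llm_config=llm_config,
)

# Image Agent: describes each image, using the summary as context
image_agent = autogen.AssistantAgent(
    name="image_agent",
    system_message="Given the article summary, caption each image accurately.",
    llm_config=llm_config,
)

# A group chat lets the proxy route messages between the two workers
group_chat = autogen.GroupChat(
    agents=[proxy, web_surfer, image_agent], messages=[], max_round=6
)
manager = autogen.GroupChatManager(groupchat=group_chat, llm_config=llm_config)

proxy.initiate_chat(manager, message="Summarize the article, then caption its images.")
```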
Below is the workflow of the tool:
- Python 3.11
- OpenAI API key, or your own open-source LLM host
- Framework and libraries: Flask-RESTful, Pyautogen, Readability.js, etc.
- Clone the repository
- Go to the `smart-caption-ai` folder
- Install Python: https://www.python.org/downloads/
- Set up a Python virtual environment:
  ```
  python -m venv .venv
  ```
- Activate the virtual environment:
  - On Windows:
    ```
    .venv\Scripts\activate
    ```
  - On Unix or macOS:
    ```
    source .venv/bin/activate
    ```
- Install library dependencies:
  ```
  pip install -r requirements.txt
  ```
- Open the `.env` file in the `server` folder and configure your LLM backend:
  - If you use OpenAI, declare your `OPENAI_API_KEY`.
  - If you host an open-source LLM yourself, declare the URL of the server hosting it as `BASE_URL`.
- Run the app:
  ```
  python -m server.main
  ```
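As an illustration, the `.env` file could look like the following; the values are placeholders, and only `OPENAI_API_KEY` and `BASE_URL` are known variables from this README:

```
# Option 1: OpenAI (placeholder value)
OPENAI_API_KEY=your-openai-api-key

# Option 2: self-hosted open-source LLM (illustrative URL)
BASE_URL=http://localhost:8000
```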
To verify the backend is running, you can query it with:

```
curl --location 'http://127.0.0.1:5000/ai/convert' \
  --header 'Content-Type: application/json' \
  --data '{
    "article_url": "https://www.aljazeera.com/gallery/2021/3/18/families-forced-into-a-deadly-spiral-in-central-african-republic",
    "images_url": [
      { "url": "https://www.aljazeera.com/wp-content/uploads/2021/03/3-3.jpg?fit=1170%2C746&quality=80" }
    ]
  }'
```

- `article_url`: the URL of the webpage content
- `images_url`: the list of image URLs in the article
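For reference, a minimal sketch of what the `/ai/convert` resource could look like with Flask-RESTful is shown below. The handler body, the response shape, and the stubbed caption logic are hypothetical stand-ins for the actual agent pipeline.

```python
# Minimal Flask-RESTful sketch of the /ai/convert endpoint.
# The caption logic and response shape are hypothetical placeholders.
from flask import Flask, request
from flask_restful import Api, Resource

app = Flask(__name__)
api = Api(app)

class ConvertResource(Resource):
    def post(self):
        payload = request.get_json(force=True)
        article_url = payload["article_url"]
        images_url = payload.get("images_url", [])
        # In the real tool, the multi-agent pipeline summarizes the article
        # and captions each image; here we return a stub response.
        captions = [
            {"url": image["url"], "caption": "<generated caption>"}
            for image in images_url
        ]
        return {"article_url": article_url, "captions": captions}

api.add_resource(ConvertResource, "/ai/convert")

if __name__ == "__main__":
    app.run(port=5000)
```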
- Open the Chrome browser and go to Settings / Extensions / Manage Extensions
- Enable 'Developer Mode' in the top-right corner
- Click 'Load unpacked' in the top-left corner
- Select the 'chrome-extension' folder from the repository you cloned earlier. The extension will then be listed as shown below:
- Open the extension and 'pin' it for easy access.
- Now you can open an article (e.g., GitHub Pages) and click the tool's icon. The result looks like the following:
- Fix issues with displaying images on some websites: the tool is not compatible with all sites, as some use iframe structures that prevent it from displaying all of the images.
- Optimize processing time by sending the article content directly instead of the URL, reducing the agent's 'surfing' time (a hypothetical request shape is sketched below).
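One way that optimization might look from the client side is sketched below; the `article_content` field is an assumption for this proposed feature and is not part of the current API.

```python
# Hypothetical request for the proposed optimization: sending the extracted
# article text directly instead of a URL. The "article_content" field is an
# assumption, not part of the current API.
import requests

payload = {
    "article_content": "<full article text extracted client-side>",
    "images_url": [{"url": "https://example.com/photo.jpg"}],
}
response = requests.post("http://127.0.0.1:5000/ai/convert", json=payload)
print(response.json())
```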