Building Human-in-the-Loop AI Tools for Accessible Image Descriptions
Overview
In recent discussions about artificial intelligence and accessibility, skepticism rightly abounds. As an accessibility innovation strategist who helps run an AI-for-accessibility grant program, I share many of the concerns raised about current AI capabilities—especially around generating alternative text for images. However, I also believe that with thoughtful design, AI can become a powerful ally rather than a threat. This tutorial presents a practical approach to building AI-assisted alt text tools that integrate human oversight, focusing on context-aware image description, distinguishing decorative from informative images, and leveraging large language models to provide meaningful starting points. By the end, you’ll understand how to create a system that amplifies human effort rather than replacing it.
Prerequisites
Before diving into the step-by-step instructions, ensure you have the following:
- Basic programming knowledge – familiarity with Python (3.8+) and REST APIs will be helpful.
- Access to AI services – a computer vision API (e.g., Azure Computer Vision, Google Cloud Vision) and a large language model API (e.g., OpenAI GPT‑4, Anthropic Claude).
- Sample images – a few diverse images: photographs, graphs, charts, and purely decorative elements (like a spacer image).
- Understanding of accessibility guidelines – WCAG 2.1 success criterion 1.1.1 (Non-text Content) is essential.
Step-by-Step Instructions
1. Set Up Your Environment
Create a new Python project and install required libraries:
pip install requests pillow python-dotenv
Store your API keys in environment variables (do not hard‑code them). Example .env file:
VISION_KEY=your_key_here
VISION_ENDPOINT=https://your-region.api.cognitive.microsoft.com/
LLM_KEY=your_openai_key
Load them in your script using os.getenv().
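Note that os.getenv() reads the process environment, not .env files directly. A minimal sketch using the python-dotenv package (installed above) bridges the gap:

import os
from dotenv import load_dotenv

load_dotenv()  # copies values from the .env file into the process environment

VISION_KEY = os.getenv('VISION_KEY')
VISION_ENDPOINT = os.getenv('VISION_ENDPOINT')
LLM_KEY = os.getenv('LLM_KEY')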
2. Build a Function to Get Initial Image Description
Use a computer vision model to obtain a raw caption. Many modern models output a short description and a list of objects. Example:
import requests, os

def get_vision_description(image_url):
    # Ask the vision API for a single best-guess caption of the image.
    headers = {
        'Ocp-Apim-Subscription-Key': os.getenv('VISION_KEY'),
        'Content-Type': 'application/json'
    }
    data = {'url': image_url}
    response = requests.post(
        os.getenv('VISION_ENDPOINT') + 'vision/v3.2/describe?maxCandidates=1',
        headers=headers, json=data
    )
    response.raise_for_status()
    # Fall back to an empty string if the API returns no caption.
    return response.json().get('description', {}).get('captions', [{}])[0].get('text', '')
This gives a raw description. However, as noted in the original criticism, such descriptions lack context. For example, a photo of a person at a desk might be described as “a man sitting in front of a computer,” but we don’t know if the image is purely illustrative or contains key information.
3. Add Context Analysis – Is the Image Decorative?
To determine whether an image is decorative or informative, we need to analyze its role on the page. One approach is to use the image’s surrounding HTML or metadata. For a simple version, we can ask a human editor to tag the image’s purpose, or infer it from the image’s file name. A more advanced method is to train a small classifier on features like image size, position, and surrounding text; a rough heuristic version of those signals appears at the end of this step. For this tutorial, we’ll simulate the human-in-the-loop step by prompting the editor:
def ask_human_decorative(image_url):
    # In production, present a UI; here we simulate with user input.
    answer = input("Is this image decorative? (y/n): ")
    return answer.lower().startswith('y')
If the image is decorative, the alt text should be empty (alt=""). Otherwise, we proceed to generate alt text with AI assistance.
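As a complement to the manual check, here is a rough heuristic sketch of the file-name and size signals mentioned above. The name patterns and the 50-pixel threshold are illustrative assumptions, not a substitute for editorial judgment:

import re
from io import BytesIO
import requests
from PIL import Image

# Illustrative file-name patterns that often signal decorative images.
DECORATIVE_NAME_HINTS = re.compile(r'spacer|divider|border|bullet|background', re.I)

def guess_decorative(image_url):
    if DECORATIVE_NAME_HINTS.search(image_url):
        return True
    # Tiny images (1x1 tracking pixels, spacers) are almost always decorative.
    image = Image.open(BytesIO(requests.get(image_url).content))
    width, height = image.size
    return width < 50 or height < 50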
4. Generate a Starting Point with a Large Language Model
Even poor initial descriptions can be improved by feeding them into an LLM along with context. For example, if the image comes from a web page about “spring flowers,” we can create a prompt:
def generate_alt_start(image_caption, page_context):
    prompt = f"""
The image shows: {image_caption}
Page context: {page_context}
Write a concise, meaningful alt text (1-2 sentences) that describes the image's essential information for a blind user. If the caption seems wrong, say 'Starting point not reliable' and then give your best guess.
"""
    # Call GPT-4 or a similar chat model
    headers = {
        'Authorization': f'Bearer {os.getenv("LLM_KEY")}',
        'Content-Type': 'application/json'
    }
    data = {
        'model': 'gpt-4',
        'messages': [{'role': 'user', 'content': prompt}],
        'max_tokens': 100
    }
    response = requests.post('https://api.openai.com/v1/chat/completions', headers=headers, json=data)
    response.raise_for_status()
    return response.json()['choices'][0]['message']['content']
This approach leverages the LLM’s ability to reason about the image in context, addressing the weakness of isolated vision models. Even if the raw caption is “a man and a dog,” the LLM, given the page topic “adoptable pets at shelter,” may produce a more accurate description.
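Putting the pieces so far together, a minimal flow might look like this; the URL and page context are placeholders:

image_url = 'https://example.com/images/shelter-dog.jpg'  # placeholder URL
if ask_human_decorative(image_url):
    alt_text = ''  # decorative: empty alt attribute
else:
    caption = get_vision_description(image_url)
    alt_text = generate_alt_start(caption, 'Adoptable pets at the city shelter')
print(alt_text)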
5. Implement the Human‑in‑the‑Loop Review
Present the generated alt text to a human editor alongside the image. The editor can accept, modify, or reject it. This is critical because, as Joe Dolson rightly points out, current AI is far from perfect. The tool should record feedback to improve future generations. A simple flow:
def review_alt_text(initial, image_url):
    print(f"Image: {image_url}")
    print(f"Suggested alt text: {initial}")
    choice = input("Accept (a), Edit (e), Reject (r): ")
    if choice == 'a':
        return initial
    elif choice == 'e':
        return input("Enter corrected alt text: ")
    else:
        return None  # Caller falls back to fully manual alt text
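This step also calls for recording editor feedback. One lightweight option is a CSV log that a later prompt-tuning pass can read; the schema here is an assumption:

import csv
from datetime import datetime, timezone

def log_review(image_url, suggested, final, decision, path='alt_text_reviews.csv'):
    # Append one row per review so editor corrections can be analyzed later.
    with open(path, 'a', newline='') as f:
        csv.writer(f).writerow([
            datetime.now(timezone.utc).isoformat(),
            image_url, suggested, final or '', decision
        ])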
6. Handle Complex Images (Graphs, Charts)
Data visualizations require longer, structured alt text. For these, the LLM can be prompted differently—ask it to extract data points from the image caption or from a data table if available. An advanced technique is to combine OCR (Optical Character Recognition) with an LLM. Example (pseudo‑code):
def describe_chart(image_url):
    # Step A: use OCR to get axis labels and numbers
    ocr_text = do_ocr(image_url)
    # Step B: ask the LLM to describe the chart from the OCR output
    prompt = f"From these OCR results: {ocr_text}, write a concise text description of the chart suitable for screen readers."
    return query_llm(prompt)
Again, a human review is mandatory because OCR and vision models can misinterpret chart types.
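The helpers do_ocr and query_llm are left undefined in the pseudo-code above. Here is a minimal sketch of both, assuming the same Azure Computer Vision and OpenAI credentials from earlier steps; the v3.2 OCR endpoint and its regions/lines/words response shape are one option, and any OCR service with text output would work equally well:

def do_ocr(image_url):
    # One possible implementation using Azure's v3.2 OCR endpoint.
    response = requests.post(
        os.getenv('VISION_ENDPOINT') + 'vision/v3.2/ocr',
        headers={'Ocp-Apim-Subscription-Key': os.getenv('VISION_KEY'),
                 'Content-Type': 'application/json'},
        json={'url': image_url}
    )
    response.raise_for_status()
    # Flatten the nested regions/lines/words structure into plain text.
    words = []
    for region in response.json().get('regions', []):
        for line in region.get('lines', []):
            words.extend(word['text'] for word in line.get('words', []))
    return ' '.join(words)

def query_llm(prompt):
    # Reuses the same chat-completions call as generate_alt_start.
    headers = {'Authorization': f'Bearer {os.getenv("LLM_KEY")}',
               'Content-Type': 'application/json'}
    data = {'model': 'gpt-4',
            'messages': [{'role': 'user', 'content': prompt}],
            'max_tokens': 300}
    response = requests.post('https://api.openai.com/v1/chat/completions',
                             headers=headers, json=data)
    response.raise_for_status()
    return response.json()['choices'][0]['message']['content']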
Common Mistakes
- Trusting the AI’s first guess without review – Vision models frequently miss context, provoking the “What is this BS?” reaction the original criticism describes. Always include a human editor.
- Using the same prompt for all image types – Decorative images should have empty alt text; informative images need short descriptions; complex images need long, structured descriptions. Tailor prompts to the image type (see the sketch after this list).
- Ignoring the “missing alt” problem – Many websites have no alt text at all. A human‑in‑the‑loop tool is only useful if it’s actually used, so integrate it with authoring workflows (CMS plugins, etc.).
- Over‑promising to stakeholders – As skepticism in the original article shows, AI alt text is not production‑ready without human vetting. Set expectations transparently.
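To make the second point concrete, one way to tailor prompts is a small template table keyed by image type; the wording and categories below are illustrative:

PROMPT_TEMPLATES = {
    'informative': (
        'The image shows: {caption}\nPage context: {context}\n'
        'Write concise alt text (1-2 sentences) covering the essential information.'
    ),
    'complex': (
        'OCR results: {ocr_text}\n'
        'Write a structured description of this chart, including the overall '
        'trend and key data points, suitable for screen readers.'
    ),
}

def build_prompt(image_type, **fields):
    if image_type == 'decorative':
        return None  # decorative images get alt="" and no LLM call
    return PROMPT_TEMPLATES[image_type].format(**fields)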
Summary
Building an AI‑powered tool for accessible image descriptions is possible, but it must be anchored in human oversight. By combining computer vision for initial captions, large language models for contextual reasoning, and a feedback loop with human editors, you can create a system that amplifies accessibility efforts rather than undermining them. The key lessons: always verify AI output, treat complex images differently, and measure success by whether the final alt text truly conveys the image’s meaning to a person using assistive technology.