AI Image Description: A Report


What can AI (Artificial Intelligence) do for you? Apparently, it can produce a decent description of charts, graphs and infographics. I’m generally fairly skeptical about AI applications, but some tools, such as the ASU Image Accessibility Creator tool, can truly save time for someone needing a description.

But like all technology, you need to understand the limitations. Here are some of my observations about AI image description that could be useful.

OCR (Optical Character Recognition) Success!!

Believe it or not, one of the most tedious chores of image description is transcribing text. Sighted people can read text embedded in a photograph or graphic, but screen readers cannot.

Optical Character Recognition (OCR) tools have been developed to extract text, but traditional tools still have drawbacks including interpreting reading order of complex layouts such as in infographics, org charts or bar charts.

AI Results

Many of the new AI tools do a fantastic job just with transcribing text. In comparison with more traditional OCR tools, they seem to do a better job of interpreting text, even in very small fonts, and understanding where the pieces go.

For English text, I haven’t found any major issues, even with less than stellar image quality. But I haven’t tested any phonetics or non-English yet.

See an example below of an infographic with a tricky layout about grand juries vs. trial juries.

Note: I asked the AI to not only transcribe the image, but to provide some of the output in an accessible HTML file. Although I had to do some tweaks, the process took about five minutes to complete. The original AI description included icon descriptions, but editors can determine if they are decorative (and can be omitted) or not.

 

Infographic: Grand Jury v. Trial Jury in the Federal Court System.

Infographic of a comparison between Grand Jury and Trial Jury in the Federal Court System.  See details below.

Description: A comparison chart of Grand Jury vs Trial Jury in the Federal Court System, including jury size, question addressed, what happens, who presents evidence, and duration of service, with icons and descriptions for each aspect. See details below.

Caption: This is a teaching tool with information available at uscourts.gov Educational Resources. Source: <uscourts.gov> Educational Resources.

The image is an infographic titled “Grand Jury v. Trial Jury in the Federal Court System Simplified.” It is divided into two vertical sections: “Grand Jury” on the left and “Trial Jury” on the right. A series of horizontal comparisons are made between the two types of juries, highlighting differences in jury size, the questions they address, the process of presenting evidence, the parties involved, and the duration of service.

  • The top section includes the title, with an illustration of people and the word “Simplified.”
  • Below the title, a brief text indicates that the infographic is a teaching tool with more information available at “uscourts.gov” in the Educational Resources section.
  • The headings “Grand Jury” and “Trial Jury” have arrow-like borders pointing inward to the vertical line dividing the sections.
  • “Grand Jury” points include 16-23 jurors, deciding if there’s enough evidence to bring charges, government presenting evidence in closed hearings, U.S. attorneys (prosecutors) presenting evidence, and the duration of service being one day a month for several months.
  • “Trial Jury” points include 6-12 jurors, finding guilt or liability in criminal or civil cases, evidence presented in open court by both sides (prosecution and defense), attorneys from both sides presenting evidence, and the duration of service being one trial.
  • The visual line connecting these points includes icons like a question mark, stack of papers, and microphone, and ends with a calendar icon.
  • At the bottom, the source is mentioned as “<uscourts.gov> Educational Resources.”
Alt-Text:

Infographic comparing Grand Jury and Trial Jury in the federal court system, covering jury size, roles, evidence presentation, involved parties, and service duration.

Editor’s Note: A side by side comparison chart has been converted to the table below.

Comparison of Grand Jury vs. Trial Jury in the Federal Court System
Aspect | Grand Jury | Trial Jury
Jury Size | 16-23 Jurors | 6-12 Jurors
Question | Decide if the Prosecutor Has Enough Evidence to Bring Charges | Find "Guilty/Not Guilty" (Criminal Cases) or Find "Liable/Not Liable" (Civil Cases)
What Happens | The Government Presents Evidence in Closed Hearing | Prosecution and Defense Present Evidence in Open Court
Who Presents Evidence | U.S. Attorneys (Prosecutors) | Attorneys for Both Sides: Prosecution and Defense
Duration of Service | One Day a Month for Several Months | One Trial
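
For anyone who wants to script the HTML step, here is a minimal sketch in Python of how a comparison like the one above could be written out as an accessible HTML table. The function name accessible_table is hypothetical (it is not part of any AI tool mentioned here), and the sketch assumes that a caption plus scoped column and row headers is enough for this simple layout.

```python
from html import escape


def accessible_table(caption, headers, rows):
    """Build a simple HTML table with a caption, column headers and row headers.

    Hypothetical helper for illustration; the first cell of each row is
    treated as the row header.
    """
    lines = ["<table>", f"  <caption>{escape(caption)}</caption>", "  <thead>", "    <tr>"]
    for header in headers:
        lines.append(f'      <th scope="col">{escape(header)}</th>')
    lines += ["    </tr>", "  </thead>", "  <tbody>"]
    for first, *rest in rows:
        lines.append("    <tr>")
        lines.append(f'      <th scope="row">{escape(first)}</th>')
        for cell in rest:
            lines.append(f"      <td>{escape(cell)}</td>")
        lines.append("    </tr>")
    lines += ["  </tbody>", "</table>"]
    return "\n".join(lines)


# Rows copied from the comparison table in this post.
jury_rows = [
    ("Jury Size", "16-23 Jurors", "6-12 Jurors"),
    ("Question",
     "Decide if the Prosecutor Has Enough Evidence to Bring Charges",
     'Find "Guilty/Not Guilty" (Criminal Cases) or Find "Liable/Not Liable" (Civil Cases)'),
    ("What Happens",
     "The Government Presents Evidence in Closed Hearing",
     "Prosecution and Defense Present Evidence in Open Court"),
    ("Who Presents Evidence",
     "U.S. Attorneys (Prosecutors)",
     "Attorneys for Both Sides: Prosecution and Defense"),
    ("Duration of Service", "One Day a Month for Several Months", "One Trial"),
]

html_table = accessible_table(
    "Comparison of Grand Jury vs. Trial Jury in the Federal Court System",
    ["Aspect", "Grand Jury", "Trial Jury"],
    jury_rows,
)
print(html_table)
```

The scope attributes tell a screen reader which header goes with each cell, so a listener hears "Jury Size, Grand Jury, 16-23 Jurors" rather than a run of unlabeled values.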

Technical Diagrams – Pretty Darn Good

Another common task in academic image description is handling technical diagrams such as bar charts, line charts and maps. So long as your charts aren’t too exotic, AI does a good job providing a description, including transcribing the text of titles, keys, legends, axis labels and other snippets. AI will also describe graphic elements, icons and colors, all of which are helpful when interpreting data.

Missing Data Points

One gap I have noticed is that AI doesn’t always estimate data values. One description of a Harrisburg climate chart mentioned a temperature peak in July, but not an approximate value. Data estimation based on visuals alone can be tricky though. It’s always good to have the data sourced as a table or some sort of spreadsheet/data file.

Harrisburg Climate Chart

Combination of an annual precipitation bar chart and temperature line graph for Harrisburg. The bar chart shows average monthly precipitation, and three line charts show changes in average high temperature, low temperature and overall temperature.
Chart generated using data from the National Weather Service. Public Domain, but available on Wikimedia Commons.

AI Description


Image Description: The image is a combined bar and line graph representing the Monthly Climate Normals for the Middletown-Harrisburg Area, PA, from 1991 to 2020. The vertical axis on the left is labeled "Precipitation (inches)" and scales from 0 to 6 in increments of 0.6. The vertical axis on the right is labeled "Temperature (°F)" and scales from 15 to 90 in increments of 15. The horizontal axis lists months from January to December.

Green vertical bars illustrate total precipitation normal for each month, showing that the precipitation peaks around July [at about 4.5 inches]. There are three lines indicating different temperature measures: the blue line represents the mean minimum temperature normal, the brown line the mean average temperature normal, and the red line the mean maximum temperature normal. All temperature lines show a bell-shaped curve, with the higher temperatures in the middle of the year and lower temperatures at the beginning and end of the year. The min and max temps are listed below:

The lines display a clear seasonal pattern, with the mean max temperature (red) peaking in the middle of the year during the summer months, the mean avg temperature (brown) following closely, and the mean min temperature (blue) being consistently below the other two lines. The graph also contains a note "Click and drag to zoom to a shorter time interval," suggesting interactive functionality.
Alt-Text: "Graph of Monthly Climate Normals for Middletown-Harrisburg Area, PA, displaying precipitation as green bars and temperature with a blue, brown, and red line for mean min, avg, and max temperature normals respectively, showing seasonal variations."

Text in the image:
  • Monthly Climate Normals (1991-2020) – Middletown-Harrisburg Area, PA (ThreadEx)
  • Click and drag to zoom to a shorter time interval
  • Total Precipitation Normal
  • Mean Max Temperature Normal
  • Mean Min Temperature Normal
  • Mean Avg Temperature Normal
  • Powered by ACIS

With Added Data Points

Approximate Data by Quarter
Measure | Jan | April | July | Oct
Total Precipitation (Inches) | 3.0 | < 3.6 | < 4.8 | > 3.8
Mean Max Temperature (°F) | 43 | 66 | 86 | 66
Mean Avg Temperature (°F) | 36 | 56 | 78 | 54
Mean Min Temperature (°F) | 18 | 42 | 66 | 45
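
When the source data is available, the missing values can be filled in with a few lines of code rather than estimated by eye. Below is a minimal sketch in Python that uses the approximate quarterly temperature values from the table above; the function name peak_sentence is hypothetical, and the precipitation row is left out because its entries are bounds rather than single numbers.

```python
MONTHS = ["Jan", "April", "July", "Oct"]

# Approximate values copied from the data table in this post (°F).
APPROX_TEMPS = {
    "Mean Max Temperature": [43, 66, 86, 66],
    "Mean Avg Temperature": [36, 56, 78, 54],
    "Mean Min Temperature": [18, 42, 66, 45],
}


def peak_sentence(label, values, months=MONTHS):
    """Return a one-line summary naming the month and value of the peak."""
    peak = max(values)
    month = months[values.index(peak)]  # first month that reaches the peak
    return f"{label} peaks in {month} at about {peak} °F."


sentences = [peak_sentence(label, values) for label, values in APPROX_TEMPS.items()]
for sentence in sentences:
    print(sentence)
```

Sentences like these can be pasted into an AI description that mentions a July peak without giving a value.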

SME Matters

I would recommend having an SME (subject matter expert) review anything very unusual. I fed AI a syntax chart from a paper of mine, and the results were good, but what took AI about 3-4 paragraphs to describe could be condensed by the SME to 1-2 sentences depending on the context. Below is another syntax tree which AI described well, but I did tweak the descriptions.

Syntax tree for the sentence John hit the ball.  See details below.
File from Wikimedia Commons, released into public domain.

AI and SME Description

AI Description

Image Description: The image depicts a syntactic tree diagram used in linguistics to illustrate the structure of a sentence. The tree is composed of green text and lines, representing different components of a sentence. At the top of the tree is the letter “S,” indicating the sentence level. Below “S,” the tree branches into “NP” (Noun Phrase) and “VP” (Verb Phrase). Further, the “VP” splits into “V” (Verb) and another “NP,” which is marked as containing a “Det” (Determiner) and “N” (Noun).

Alt-text: Syntactic tree diagram of a sentence [John hit the ball].

Text in Image:

  • S
  • NP
  • VP
  • V
  • NP
  • Det
  • N
SME Description

A syntax tree for the sentence (S) “John hit the ball.” The top S node branches into a noun phrase (NP) ‘John’ and verb phrase (VP) ‘hit the ball’. The VP (corresponding to a traditional predicate) branches further into V ‘hit’ and NP ‘the ball.’ The last NP ‘the ball’ branches into a determiner (D) ‘the’ and noun (N) ‘ball.’

Non-Visual Representations

Even in mainstream academia, there are technology limits to the universal use of graphics. Fortunately, there are often text-based alternatives that can be used. Here’s a linearized representation of the tree above that could be used in some cases.

Linearized Tree

Sentence: John hit the ball.

[S [NP John] [VP [V hit] [NP [D the] [N ball]]]]
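
Bracket notation like this can also be generated from a nested data structure. Here is a minimal sketch in Python; the tuple layout and the function name linearize are assumptions made for illustration, and the verb is labeled V to follow the SME description above.

```python
def linearize(node):
    """Return bracket notation for a tree of (label, content) tuples.

    content is either a word (a string) or a list of child nodes.
    """
    label, content = node
    if isinstance(content, str):
        return f"[{label} {content}]"
    children = " ".join(linearize(child) for child in content)
    return f"[{label} {children}]"


# The tree for "John hit the ball", as described by the SME.
tree = ("S", [
    ("NP", "John"),
    ("VP", [
        ("V", "hit"),
        ("NP", [("D", "the"), ("N", "ball")]),
    ]),
])

linear = linearize(tree)
print(linear)  # [S [NP John] [VP [V hit] [NP [D the] [N ball]]]]
```

A screen reader reads the result from left to right like any other line of text, which is the point of the linearized form.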

Repeated Graphics

Editing descriptions is also very important for repeated graphics, in which perhaps only one element in a particular graphic really matters because it represents a change or a focus element. Once a person has become familiar with the format, it’s more efficient to focus on these first and skip what is already known.

Many people not directly involved in accessibility wonder why SME input is important at the beginning. Reviewing image descriptions in an academic context from both perspectives has made me realize that bridging the gap early makes accessibility more efficient for everyone.

 

Weaknesses

Facial Recognition

Do you recognize at least one woman in the photo below? It’s a photo of First Lady Barbara Bush and Princess Diana taken in the White House. Many people are able to recognize celebrities, but the AI tools I have used are not able to do so. Instead they describe women sitting on a sofa and their outfits.

Barbara Bush and Princess Diana sitting in the White House reading a book.  See details below.
Wikimedia Commons: First Lady Barbara Bush and Princess Diana have tea in the Yellow Oval Room and Mrs. Bush gives the Princess a copy of "Millie’s Book as Dictated to Barbara Bush." October 5, 1990 Photo Credit: George Bush Presidential Library and Museum. Public Domain.
AI Caption: Two women seated on a yellow sofa, one with short white hair in a blue outfit, and the other with short blonde hair in a pink suit, looking at a book.

Google Reverse Image Search

If you are in a situation where you cannot identify people or objects, I do recommend Google Reverse Image search or something similar to find other "matching" photos.

Math Equations

Although the ASU tool can theoretically generate MathML, its OCR engine has not produced optimal results. I would still recommend Equatio or MathPix for equation OCR.

Note: I want to add that this functionality is slightly beyond what is intended. Image description can mean many things depending on the context.

The equation is the number 17 to the 240th power under a series of nth roots - specifically the 4th root of the 5th root of the 3rd root of the square root of 17^240.

Original Equation (MathML, shown here as LaTeX)

\sqrt[4]{\sqrt[5]{\sqrt[3]{\sqrt{17^{240}}}}}

According to

ASU LaTeX Raw (Spaces Removed)

\sqrt[4]{\sqrt[5]{\sqrt[3]{\sqrt[17]{240}}}}=

ASU MathML (shown here as LaTeX)

\sqrt[4]{\sqrt[5]{\sqrt[3]{\sqrt[17]{240}}}}=
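
A quick way to see how far apart the two readings are is to simplify each one. The worked steps below are my own sanity check based on the verbal description of the equation above; they are not output from the ASU tool.

```latex
% Original: nested roots multiply their indices, 4 * 5 * 3 * 2 = 120
\sqrt[4]{\sqrt[5]{\sqrt[3]{\sqrt{17^{240}}}}}
  = 17^{240/(4 \cdot 5 \cdot 3 \cdot 2)}
  = 17^{240/120}
  = 17^{2}
  = 289

% ASU transcription: 17 is read as a root index and 240 as the radicand
\sqrt[4]{\sqrt[5]{\sqrt[3]{\sqrt[17]{240}}}}
  = 240^{1/(4 \cdot 5 \cdot 3 \cdot 17)}
  = 240^{1/1020}
  \approx 1.005
```

The two expressions look similar on the page but evaluate to very different numbers, which is why equation OCR output needs to be proofed.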

 

Note: In my experience, more advanced math students, even those using screen readers, usually avoid "spoken" math. It’s easier to process math symbols, LaTeX or MathML/equation editor outputs.

Art Objects

This was an interesting area. For some fine arts, AI was able to make useful descriptions of color and composition, but often missed details that an art historian would probably comment on.

For example, in a piece of embroidery, AI might be able to identify that there was embroidery, but miss pretty key points like the presence of cross stitch elements or which animals and flowers were being depicted. Both of these are important for identifying artistic genres, periods or locations of art.

Closeups

Adding a closeup of an embroidered piece provides a different perspective for AI. In the closeup, AI did pick up the cross stitch, but had problems identifying the Runes as Runes; it did much better with the full picture of all the runes. Context does matter a lot.

Closeup of cross stitch band with Runic letters in blue cross stitch, red Latin letters below done in backstitch and traditional motifs above done in a mix of red and blue threads.
Closeup of one band of a modern Runic band sampler.

AI: Image Description: The image shows a piece of white fabric with a cross-stitch design. The design features two rows of thick, intertwined red and blue lines at the top and bottom. In between these rows are two sets of letters. The upper set consists of large, blue letters spelled out in an unusual, varied typeface. Below these, there is a series of smaller, red letters in a simpler typeface. The fabric has a visible weave, typical of cross-stitch materials.

Text: Upper set: “MMROAN” (actually Runic Letters) Lower set: “EMLNGOD”

 

Final Warning: Proof Your Output

As wonderful as the AI tools are, you need to proof the output.

Hallucinations

AI can "hallucinate," meaning it may make an incorrect assumption, and using that assumption, make further wildly incorrect statements.

I asked it to describe a modern Runic embroidered sampler (below) and it identified one motif as a "Viking ship" and mentioned it had four rows of oars. It also said the piece was a bookmark (it’s really six inches wide).

Closeup of embroidered object
Embroidered abstract animal design misidentified as a ship. Design and photo by Elizabeth J. Pyatt

In reality, the Viking ship was a modified version of an abstract animal with a head similar to a ship’s head motif, but with actual legs and hind quarters. Not a bad miss in terms of shape, but very inaccurate conceptually.

As IBM notes in the link, hallucinations may be the result of incomplete exposure to training data in a given subject domain. Because academic research may be filling in current knowledge gaps, there may always be a slight lag in what AI can do with academic content.

Missed Details

A description could be very close to correct, but miss a crucial detail. In one case of a PowerPoint screenshot with a utility window open, the main screen was described, but not the window on the right. As with other OCR tools, you could upload just a cropped section and redo the scan to receive a transcript. Or you could tweak the prompt.

Slide deck showing three proposed redesigns of the Pennsylvania flag and the slide ordering window with the title at the bottom and the left most image on the right.

The AI tool described the slide content, but not the screen order elements on the right. Reloading just the missing element will usually result in a usable description.

AI Description

Image Description:

The image presents a question on whether the flag of Pennsylvania should be redesigned, showing three different flag designs.

  • To the left is the "Official (1799)" flag, featuring a dark blue background with a detailed crest in the center that includes a bald eagle, horses, and a shield with a sailing ship, plow, and three sheaves of wheat.
  • In the middle is a modern alternative design titled "Keystone + State," which displays a simplified solid blue field with a central orange keystone shape containing three white stars above and four below.
  • To the right is another alternative titled "Keystone Stripes," depicting a horizontal tri-color flag with blue, white, and gold stripes, and a centered keystone emblem.
  • The image source credits are at the bottom with URLs and the names "Jeffcook2756" and "Marmoceet". Images from vexillology.fandom.com (Pennsylvania) Licensed by Creative Commons.

Summary

Image description is one case where AI is coming into its own. But as with automatic speech recognition and other applications, AI output needs some auditing to be 100% accurate.
