Web Design Reference Guide

Usability Tips You Can Use: Designing Accessible Audio

Last updated Oct 17, 2003.

Text is hands-down the most accessible format for conveying information to the broadest possible Web audience. Unlike images, video, and audio, text can be both seen and heard. Users who can’t hear can read text, and users who can’t see can have text read aloud by software. This doesn’t mean that we should forgo the richness of presentation that comes from images and audio—pages thick with text present their own set of challenges—but that information conveyed through these channels must also be available as text. Fortunately, Web technology offers a variety of methods for presenting media content with equivalent text.

Accessible audio is a perfect illustration of the broad benefits of universal design because access to equivalent text is helpful for everyone. A text transcript can be read and indexed by software, making audio- and video-based content easier to find. Users who have technical issues with Web-based media (slow Internet access, software incompatibilities, and so on) can access the information via the text transcript. A transcript can also be more efficient than audio alone: most people read faster than they can listen, so people reading a transcript may grasp the information sooner.

So how do you go about putting into text the information contained in an audio recording? Keep reading to find out, and to learn what to do with a transcript once you have one.

Audio Only

Say you’ve recorded a commentary on the ethics of breeding fluorescent pigs. You plan to publish the commentary as a podcast and also post it on your blog. For your commentary to be accessible to users who can’t hear, you need a text transcript.

Transcribing the Audio

Preparing a transcript is a snap if you started with a script. In this case, just review the recording and revise the script for accuracy, noting any sounds that add to the narrative (for example, giggles, grunts, snorts). Preparing a transcript without a script is another story. In this case, you’ll need to play and pause, play and pause, while typing the text. For those of us who lack training, the process of transcribing what’s spoken may be slow and tedious, but is certainly not impossible.

"Whoa there, Missy," you say. "With printed documents, we use software to scan pages and convert the dots on the page into characters and words. Why can’t we use a similar tool to convert spoken words to text?" Well, you can. Speech-to-text technology is available in tools such as Dragon NaturallySpeaking. If you’re doing regular commentaries, it might make sense to spend the time configuring the software to recognize your voice and create accurate transcripts. On the other hand, the technology is not as robust or accurate as the OCR technology used for scanning a printed document, and you may spend more time correcting errors than you would have had you started from scratch.

For those with money to spend, contracting with a transcription service may be the best option. Some Web-based companies offer low prices and quick turnaround for audio transcripts of online media. I can’t recommend a specific service, but googling "podcast transcription service" should provide some options.

Publishing the Transcript

With a complete and accurate transcript in hand, post the text using standards-based, semantic markup and then sit back and reap the benefits of improved findability and broader access. For an example of a site that provides audio transcripts, see NASA Podcasting.
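As a rough illustration of what "standards-based, semantic markup" might look like for a transcript, here is a minimal Python sketch that wraps transcript paragraphs in plain HTML headings and paragraphs. The function name, the (speaker, text) structure, and the sample lines are all hypothetical; your publishing tool may well do this for you.

```python
from html import escape


def transcript_to_html(title, paragraphs):
    """Wrap transcript paragraphs in plain semantic HTML.

    `paragraphs` is a list of (speaker, text) pairs. This structure is an
    assumption made for the example, not something the article prescribes.
    """
    parts = [f"<h1>{escape(title)}</h1>"]
    for speaker, text in paragraphs:
        # Speaker label in <strong>, spoken text as an ordinary paragraph.
        parts.append(
            f"<p><strong>{escape(speaker)}:</strong> {escape(text)}</p>"
        )
    return "\n".join(parts)


# Invented sample content; note the bracketed sound description.
sample = [
    ("Host", "Welcome to the commentary. [snorts]"),
    ("Guest", "Pigs & ethics make a tricky mix."),
]
print(transcript_to_html("Commentary: Fluorescent Pigs", sample))
```

Escaping the text keeps characters such as `&` and `<` from breaking the markup, and real headings and paragraphs give search engines and screen readers something meaningful to work with.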

Audio and Video

Now suppose that your commentary contains a video track—photos or video footage of fluorescent pigs, perhaps. For truly accessible audio, you’ll need to go one step further and synchronize the text and video to create captions.

Adding Timecode

For use as captions, the transcript needs to contain timecodes that indicate when to display each text segment along with the video. Some transcription services provide timecodes as part of their service. For the do-it-yourself captioner, try Media Access Generator (known as MAGpie) from the National Center for Accessible Media (NCAM). MAGpie doesn’t automagically add timecodes to an audio transcript. Instead, it facilitates the work, offering a comfortable and efficient environment in which to create captions (see Figure 1).

To create captions using MAGpie, you import the audio transcript, play the associated video, and press a function key to mark each break. MAGpie inserts timecodes and creates a text file that can be used for captions:

The realist in the area of race relations seeks to combine the truths of two opposites

while avoiding the extremes of both.

So the realist would agree with the optimist that we have come a long, long way

but he would balance this by agreeing with the pessimist

that we have a long, long way to go before this problem is solved.

And it is this realistic position that I would like to use as a basis for our thinking together

as we think of the future of race relations in the United States.

We have made significant strides.

We have come a long, long way.

But we have a long, long way to go.
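
To make the idea of a timecoded caption file concrete, here is a minimal Python sketch that pairs each text segment with a start time. The bracketed `[hh:mm:ss.ff]` layout and the start times are assumptions chosen for illustration; they are not MAGpie's documented output, so check the format your player expects.

```python
def format_timecode(seconds):
    """Format a time in seconds as [hh:mm:ss.ff], where ff is hundredths.

    The bracketed layout is an assumption made for this example.
    """
    total_hundredths = round(seconds * 100)
    hundredths = total_hundredths % 100
    total_seconds = total_hundredths // 100
    hours, remainder = divmod(total_seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"[{hours:02d}:{minutes:02d}:{secs:02d}.{hundredths:02d}]"


def build_captions(segments):
    """Turn (start_seconds, text) pairs into timecoded caption lines."""
    lines = []
    for start, text in segments:
        lines.append(format_timecode(start))
        lines.append(text)
    return "\n".join(lines)


# Start times are invented for illustration only.
segments = [
    (0.0, "We have made significant strides."),
    (3.5, "We have come a long, long way."),
    (7.25, "But we have a long, long way to go."),
]
print(build_captions(segments))
```

Each press of the function key in a captioning tool corresponds to one `(start, text)` pair here: the tool records the playback position, and the text that follows is shown from that moment on.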

Synchronizing Captions with Video

A variety of methods are available for synchronizing captions and video, and the method you choose depends in large part on the format of your media. To keep things simple, let’s look at publishing captioned QuickTime video using Synchronized Multimedia Integration Language (SMIL).

SMIL, pronounced "smile," is a happy little markup language for multimedia presentations. Using SMIL is a bit like working with page layout software to pull together and position different elements into one presentation. In the following SMIL example, the location of the elements on the page is defined in the <layout> section and the file source and duration are defined in the <body> section. Enclosing the video and text in a <par> element causes them to display in parallel (simultaneously).

<smil>
  <head>
    <layout>
      <root-layout width="400" height="380" />
      <region id="video" width="400" height="300" />
      <region id="caption" width="360" height="60" left="20" top="320" />
    </layout>
  </head>
  <body>
    <par dur="1:00:00.00">
      <video dur="1:00:00.00" region="video" src="" alt="Video" />
      <textstream dur="1:00:00.00" region="caption" src="captions.txt" alt="Captions" />
    </par>
  </body>
</smil>


For those familiar with markup languages, SMIL is just another dialect. For the faint of heart, don’t panic. Authoring tools such as MAGpie and GoLive generate SMIL markup.
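
If you would rather script the markup than type it, the same structure can be produced with a few lines of Python. This is only a sketch using the standard library's ElementTree module: the function name and the `commentary.mov` filename are hypothetical, and the layout values simply mirror the example above.

```python
import xml.etree.ElementTree as ET


def build_smil(video_src, caption_src, duration):
    """Build a SMIL document that plays a video and its captions in parallel."""
    smil = ET.Element("smil")

    # <head>/<layout>: where each element sits on the page.
    head = ET.SubElement(smil, "head")
    layout = ET.SubElement(head, "layout")
    ET.SubElement(layout, "root-layout", width="400", height="380")
    ET.SubElement(layout, "region", id="video", width="400", height="300")
    ET.SubElement(layout, "region", id="caption",
                  width="360", height="60", left="20", top="320")

    # <body>/<par>: children of <par> are displayed simultaneously.
    body = ET.SubElement(smil, "body")
    par = ET.SubElement(body, "par", dur=duration)
    ET.SubElement(par, "video", dur=duration, region="video",
                  src=video_src, alt="Video")
    ET.SubElement(par, "textstream", dur=duration, region="caption",
                  src=caption_src, alt="Captions")

    return ET.tostring(smil, encoding="unicode")


# "commentary.mov" is a placeholder filename, not one from the article.
print(build_smil("commentary.mov", "captions.txt", "1:00:00.00"))
```

Building the document as a tree rather than pasting strings together means every opening tag gets its matching close, which is the kind of bookkeeping authoring tools handle for you.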


To learn more about audio transcription and captioning, check these resources: