This chapter is from the book

Multimedia accessibility

We’ve talked about the keyboard accessibility of the video element, but what about transcripts and captions for multimedia? After all, there is no alt attribute for video or audio as there is for <img>. The fallback content between the tags is meant only for browsers that can’t cope with native video, not for people whose browsers can display the media but can’t see or hear it due to disability or situation (for example, being in a noisy environment or needing to conserve bandwidth).

There are two methods of attaching synchronized text alternatives (captions, subtitles, and so on) to multimedia, called in-band and out-of-band. In-band means that the text file is included in the multimedia container; an MP4 file, for example, is actually a container for H.264 video and AAC audio, and can hold other metadata files too, such as subtitles. WebM is a container (based on the open standard Matroska Media Container format) that holds VP8 video and Vorbis audio. Currently, WebM doesn’t support subtitles, as Google is waiting for the Working Groups to specify the HTML5 format: “WHATWG/W3C RFC will release guidance on subtitles and other overlays in HTML5 <video> in the near future. WebM intends to follow that guidance.” (Of course, even if the container can contain additional metadata, it’s still up to the media player or browser to expose that information to the user.)

Out-of-band text alternatives are those that aren’t inside the media container but are held in a separate file and associated with the media file with a child <track> element:

<video controls>
<source src=movie.webm>
<source src=movie.mp4>
<track src=english.vtt kind=captions srclang=en>
<track src=french.vtt kind=captions srclang=fr>
<p>Fallback content here with links to download video files</p>
</video>

This example associates two caption tracks with the video, one in English and one in French. Browsers will have some UI mechanism to allow the user to select the one she wants (listing any in-band tracks, too).
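Where a page wants to make that choice from script rather than leave it to the browser's own UI, one possible approach is sketched below. It assumes the track list that a video element exposes as video.textTracks, and the kind, language, and mode properties of each track in it. The helper names isTextAlternative, pickTrack, and showOnly are hypothetical, invented for this illustration.

```javascript
// Treat only caption and subtitle tracks as user-selectable text alternatives.
function isTextAlternative(track) {
  return track.kind === "captions" || track.kind === "subtitles";
}

// Return the first caption/subtitle track whose language matches the
// user's preferences (in order), or null if none matches.
function pickTrack(tracks, preferredLanguages) {
  const candidates = Array.from(tracks).filter(isTextAlternative);
  for (const lang of preferredLanguages) {
    const match = candidates.find((track) => track.language === lang);
    if (match) return match;
  }
  return null;
}

// Show the chosen track and switch every other caption/subtitle track off.
function showOnly(tracks, chosen) {
  for (const track of Array.from(tracks)) {
    if (isTextAlternative(track)) {
      track.mode = track === chosen ? "showing" : "disabled";
    }
  }
}
```

In a browser, this might be called as showOnly(video.textTracks, pickTrack(video.textTracks, ["fr", "en"])), which would prefer the French captions from the example above and fall back to the English ones.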

The <track> element doesn’t presuppose any particular format, but the browsers will probably begin by implementing the new WebVTT format (previously known as WebSRT, as it’s based on the SRT format).

This format is still in development by WHATWG, with lots of feedback from people who really know, such as the BBC, Netflix, and Google (the organisation with probably the most experience of subtitling web-delivered video via YouTube). Because it’s still in flux, we won’t look in-depth at syntax here, as it will probably be slightly different by the time you read this.

WebVTT is just a UTF-8 encoded text file that begins with a WEBVTT header line. At its simplest, it looks like this:

WEBVTT

00:00:11.000 --> 00:00:13.000
Luftputefartøyet mitt er fullt av ål

This puts the subtitle text “Luftputefartøyet mitt er fullt av ål” over the video starting at 11 seconds from the beginning, and removes it when the video reaches the 13-second mark (not 13 seconds later).
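To make the point about absolute end times concrete, here is a minimal sketch of how a script might read a timing line and decide whether a cue should be on screen. It is not how any browser or polyfill is known to do it. The function names vttTimeToSeconds, parseCueTiming, and isCueActive are hypothetical, and the sketch handles only plain hh:mm:ss.ttt or mm:ss.ttt timestamps.

```javascript
// Convert a timestamp such as "00:00:11.000" (or "00:11.000") to seconds.
function vttTimeToSeconds(stamp) {
  const parts = stamp.trim().split(":");
  if (parts.length < 2 || parts.length > 3) {
    throw new Error("Unrecognised timestamp: " + stamp);
  }
  // Pad the short mm:ss.ttt form so we always have hours, minutes, seconds.
  while (parts.length < 3) parts.unshift("0");
  const [hours, minutes, seconds] = parts.map(Number);
  if ([hours, minutes, seconds].some(Number.isNaN)) {
    throw new Error("Unrecognised timestamp: " + stamp);
  }
  return hours * 3600 + minutes * 60 + seconds;
}

// Parse "start --> end" into start and end times, both measured in
// seconds from the beginning of the media.
function parseCueTiming(line) {
  const [start, rest] = line.split("-->");
  if (rest === undefined) {
    throw new Error("Missing '-->' in timing line: " + line);
  }
  // Ignore anything that follows the end timestamp on the same line.
  const end = rest.trim().split(/\s+/)[0];
  return { start: vttTimeToSeconds(start), end: vttTimeToSeconds(end) };
}

// A cue is visible from its start time up to, but not including, its end time.
function isCueActive(cue, currentTime) {
  return currentTime >= cue.start && currentTime < cue.end;
}
```

With the cue above, isCueActive is true at 12 seconds and false at 13 seconds, because the second timestamp is a position in the video and not a duration.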

No browser currently supports WebVTT or <track>, but there are a couple of polyfills available. Julien Villetorte (@delphiki) has written Playr, a lightweight script that adds support for these features to all browsers that support HTML5 video (Figure 4.6).

Figure 4.6 Remy reading Shakespeare’s Sonnet 155, with Welsh subtitle displayed by Playr.

WebVTT also allows for bold, italic, and colour text, vertical text for Asian languages, right-to-left text for languages like Arabic and Hebrew, ruby annotations (see Chapter 2), and moving text away from its default position (so it doesn’t obscure key text on the screen, for example). You need to use these features only if you want them.

The format is deliberately made to be as simple as possible, and that’s vital for accessibility: If it’s hard to write, people won’t do it, and all the APIs in the world won’t help video be accessible if there are no subtitled videos.

Let’s also note that having plain text isn’t just important for people with disabilities. Textual transcripts can be spidered by search engines, pleasing the Search Engine Optimists. And, of course, text can be selected, copied, pasted, resized, and styled with CSS, translated by websites, mashed up, and all other kinds of wonders. As Shakespeare said in Sonnet 155, “If thy text be selectable/’tis most delectable.”
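As a rough illustration of how easily that plain text can be reused, the sketch below turns the contents of a simple WebVTT file into a plain transcript that could be placed on the page. The function name vttToTranscript is hypothetical, and the sketch assumes cues are separated by blank lines and that any markup inside a cue is written as angle-bracket tags.

```javascript
// Build a plain-text transcript (one line per cue) from WebVTT file contents.
function vttToTranscript(vttText) {
  return vttText
    .replace(/\r\n?/g, "\n")                    // normalise line endings
    .split(/\n{2,}/)                            // blocks are separated by blank lines
    .filter((block) => block.includes("-->"))   // keep only blocks with a timing line
    .map((block) => {
      const lines = block.split("\n");
      const timingIndex = lines.findIndex((line) => line.includes("-->"));
      return lines
        .slice(timingIndex + 1)                 // the cue text follows the timing line
        .join(" ")
        .replace(/<[^>]*>/g, "")                // drop tags such as <i> or <b>
        .trim();
    })
    .filter((text) => text.length > 0)
    .join("\n");
}
```

The result is ordinary text, so it can be indexed, selected, copied, and translated like any other content on the page.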
