CaptionCrafter - Customizable Multilingual Video Captions
Introduction
CaptionCrafter addresses the critical issue of digital accessibility in multimedia content, specifically for individuals with diverse and intersecting needs. Millions of people who are Deaf or hard of hearing, multilingual users, language learners, and those with cognitive disabilities face significant barriers when consuming video content online. For Deaf and hard of hearing individuals, captions are not merely a convenience but a necessity for understanding content. Multilingual users and language learners often struggle with content in non-native languages, while people with cognitive disabilities may find standard captioning overwhelming or difficult to process. These accessibility gaps create digital divides that exclude large populations from educational resources, entertainment, and critical information shared through video media.
CaptionCrafter is a web-based application that improves video accessibility through highly customizable dual-language captioning with keyword highlighting. The tool allows users to independently customize each language’s appearance (font, size, color), position captions (overlaid on or below the video), and visually track important keywords across languages. By integrating open-source translation systems with precise timestamp synchronization, CaptionCrafter creates an interactive captioning experience that meets diverse accessibility needs while supporting language acquisition. Rather than providing a one-size-fits-all solution, the application acknowledges the individuality of accessibility requirements and lets users tailor their viewing experience to their specific needs, whether they are learning a new language, need visual adjustments to caption presentation, or want help identifying key concepts within content.
Client Input
Who Were Our Clients?
Our clients are individuals with diverse and intersecting accessibility needs. This includes people who are Deaf or hard of hearing, multilingual users, and individuals with cognitive disabilities. Many clients spoke several languages or were learning another language and expressed interest in turning their reliance on captions into a language learning experience. The tool was initially designed to improve accessibility for multilingual audiences, but through feedback from clients, it evolved to meet a broader range of needs. CaptionCrafter addresses the needs of users who require captions for understanding audio and video content, non-native speakers or language learners who benefit from multilingual support, and people with cognitive disabilities who find keyword highlighting and customizable features, like adjustable font size and color, helpful in making content more engaging and easier to process. These clients could use CaptionCrafter to have a more inclusive and accessible digital experience.
What Did We Learn About Their Access Needs?
Through our work with clients, we learned that customization and precise multilingual translation are crucial for enhancing the accessibility of our product. Each client had different needs when it came to how captions should be presented. For example, we found that clients with cognitive disabilities required captions that were simple, organized, and highlighted important keywords to help them focus on the key points of the content. Meanwhile, clients who spoke multiple languages emphasized the importance of accurate multilingual translations to ensure they fully understood the material. Through these interactions, we realized that customizability, such as adjusting font size, text style, and language options, is essential for meeting the specific access needs of our clients and broadening the accessibility of our app in general.
We also found that the intersection of language learning and accessibility needs has gone unaddressed in many previous applications. There are still many harmful assumptions about who language learners are and what tools they do and don’t need. To develop a tool that is truly meaningful for anyone to use, we need to set aside assumptions about a user’s experience.
Client Feedback
The positive impact of our customization features was highlighted by client testimonials that reinforced the importance of flexibility in accessibility tools. As one multilingual client who regularly uses captions for both accessibility and language learning emphasized:
“Having the option to adjust captions and choose the translation language makes it so much easier to follow along.”
This feedback directly validates our approach to customizable, multilingual captioning and reinforces how critical these features are for users navigating content in multiple languages.
Lo-Fi Prototype Feedback
Figure 1: Lo-fi prototype showing customizable captions overlaid on a video with custom font styles, colors, and sizes.
During a lo-fi prototype interview session, one of the key features that emerged was the idea of making each language customizable separately. This feedback was invaluable, as our client had not considered this feature during the initial round of interviews, possibly because they assumed it was already part of the prototype. Presenting our version of the prototype helped both sides realize that we had different expectations going into the design process. As designers, we’re grateful we were able to catch this misalignment early and make the necessary adjustments. This change was crucial, as it allowed us to incorporate a feature that is highly important to our clients and better aligned the prototype with their needs for the app.
Additionally, we realized that granular changes in font size were not always necessary and it was far more accessible to simply provide 3-4 options per font customization type. Simplifying the panel of customizations streamlined the user experience and further improved the accessibility of the application.
Final Concept
Description of Final Concept
Figure 1: The image shows how dual-language captions can be placed below the video, with a small bar beneath the captions for adjusting the spacing between them. On the right, a panel lists the computed keywords, shown in blue, for the user to find throughout the captions.
Figure 2: This image illustrates how captions can be overlaid on top of the video, showing multilingual captions with the keyword “voice” highlighted in blue.
Figure 3: All panels can be collapsed for easier viewing as shown in the image above. The video can be enlarged and captions can be directly overlaid, making the user viewing experience simple and elegant.
From a design perspective, our final concept is centered on providing a highly customizable and accessible captioning experience. Building a web interface rather than a Chrome extension makes the video viewing experience significantly simpler and more accessible, without additional content surrounding the page. The interface has an intuitive, user-friendly layout that lets users adjust caption settings and orientation based on their individual preferences. Key design elements include adjustable font and size, dark mode, and options for highlighting keywords to enhance readability. The design also supports multilingual captions, with each language customizable separately so that users can tailor the experience to their needs. The interface is responsive, so it accommodates various screen sizes, and individual features can be toggled out of the way to simplify the view. To track keywords throughout the captions, users can click the blue highlighted text to turn it green. We aimed for an interactive captioning experience that anyone can use and that serves diverse user needs, identities, and preferences.
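The keyword highlighting described above can be illustrated with a minimal Python sketch. This is not CaptionCrafter's actual code: the function name `highlight_keywords`, the `keyword` CSS class, and the choice to emit `<span>` markup are all assumptions made for illustration. It wraps each whole-word keyword match in a span so that a stylesheet could color it blue, or green once clicked.

```python
import html
import re


def highlight_keywords(caption: str, keywords: list[str], css_class: str = "keyword") -> str:
    """Return HTML for one caption line with keyword matches wrapped in <span>.

    Hypothetical helper: the name, markup, and CSS class are illustrative only.
    """
    keywords = [k for k in keywords if k]  # ignore empty strings
    if not keywords:
        return html.escape(caption)

    # Try longer keywords first so multi-word phrases win over their sub-words.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in ordered) + r")\b",
        re.IGNORECASE,
    )

    parts = []
    last = 0
    for match in pattern.finditer(caption):
        # Escape the plain text between matches, then wrap the match itself.
        parts.append(html.escape(caption[last:match.start()]))
        parts.append(f'<span class="{css_class}">{html.escape(match.group(0))}</span>')
        last = match.end()
    parts.append(html.escape(caption[last:]))
    return "".join(parts)


print(highlight_keywords("Your voice matters", ["voice"]))
```

One limitation of this sketch: the `\b` word boundary relies on spaces or punctuation between words, so scripts written without spaces (Japanese, for example) would need a different matching strategy.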
The application integrates an open-source keyword extraction and translation system, using YAKE for multilingual keyword extraction and M2M100-small100 for efficient bidirectional translation, both optimized with caching and batch processing. A caption management system synchronizes transcripts via the YouTube API, ensuring precise timestamp alignment, real-time tracking, and customizable display options, including support for right-to-left languages. Built on a Flask-based REST API, the backend enables parallel processing, caching, and scalable data management, which keeps the tool free to use while supporting multilingual video captioning with advanced customization and accessibility features. From a technical perspective, the tool shows what fast, lightweight multilingual models make possible, but it also reveals limited language coverage that still needs to be improved.
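To make the timestamp alignment concrete, here is a minimal Python sketch of looking up the caption that should be visible at a given playback time. It is an illustration under stated assumptions, not the project's implementation: the `Cue` and `CaptionTrack` names are hypothetical, and each cue is assumed to carry a start time and a duration in seconds. A binary search over the sorted start times keeps each lookup fast even for long videos.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cue:
    """One caption line (hypothetical structure; times are in seconds)."""
    start: float
    duration: float
    text: str


class CaptionTrack:
    """Sorted cues for one language, searchable by playback time."""

    def __init__(self, cues: list[Cue]) -> None:
        self.cues = sorted(cues, key=lambda c: c.start)
        self._starts = [c.start for c in self.cues]

    def at(self, t: float) -> Optional[Cue]:
        """Return the cue active at time t, or None during a gap."""
        # Index of the last cue whose start time is <= t.
        i = bisect_right(self._starts, t) - 1
        if i < 0:
            return None
        cue = self.cues[i]
        return cue if t < cue.start + cue.duration else None


track = CaptionTrack([Cue(0.0, 2.0, "Hello"), Cue(2.5, 1.5, "world")])
current = track.at(1.0)
print(current.text if current else "(no caption)")
```

In a dual-language setup, one such track per language could be queried with the same playback time so that the original and translated lines stay in step.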
Addressing Intersectional Needs
This project serves a diverse range of identities, including people who are multilingual, visually impaired, or Deaf or hard of hearing, people who have cognitive disabilities, and anyone who requires captions for accessibility. Our app is designed to benefit users who identify with one or more of these needs.
We address multilingual individuals through a dual-language captioning experience and are adding language-specific fonts, such as Shizuru for Japanese, to improve readability. We are careful to acknowledge that not all languages are supported at the moment and are working to add better multilingual support.
Multilingual users have varying levels of comprehension in the languages they speak. We want to support their language learning journey by allowing transcripts to appear below the captions for a slower-paced viewing experience, and by providing keywords to search for in the language they are learning to support reading comprehension. Keyword searching draws the user’s attention to the correspondence between their native and secondary languages, so that captioning goes beyond a surface-level language learning experience.
For those with reading and visual needs, captions can be set in larger sizes and different styles, and several of the available fonts, such as Lexend, are well-established choices for readers with dyslexia. We hope that the dark mode will also support users who struggle with eye strain and screen fatigue.
Our client population is composed of multilingual users who all wanted captions that were more fun or easier to read and that could address their language learning needs. We believe that keyword searching paired with a range of customization options can make users’ intersectional identities feel seen and acknowledged.
Potential Risks and Harms
One potential harm is that the app might not fully meet the diverse needs of all users, particularly if features like multilingual support or captions aren’t as accurate or comprehensive as they should be. Building multilingual support for dialect and language variation is a continuous effort, and we acknowledge that it is unfinished. We understand the risk of poor language support or misleading translated captions, so we provide a disclaimer at the bottom of the application stating that captioning may be imperfect or incomplete. We also note that we will work to improve language coverage.
Another risk is that the highlighted keywords may include negative words that should not be brought to the user’s attention, such as curse words or slurs. We have carefully considered this but are not able to fully address it yet. For now, we minimize the risk where possible by suggesting TED Talks and other well-scripted videos, which are less likely to contain harmful language. In the future, we plan to use word embeddings to favor keywords with more positive connotations.
Innovation and Value Proposition
Our multilingual captioning tool enhances video accessibility by providing real-time synchronized captions across multiple languages with automatic keyword detection. Unlike existing solutions, it uniquely integrates open-source machine translation and keyword extraction to let users seamlessly track key terms and concepts while watching videos. The interface is fully customizable, offering adjustable fonts, colors, and layouts to support diverse accessibility needs, including right-to-left language formatting. Designed for learners, educators, and researchers, this tool is the first to combine precise keyword mapping with synchronized multilingual captions, eliminating the need for separate translation tools or manual adjustments.
VPAT (Accessibility Audit)
Accessibility Features and Limitations
The application is largely accessible, with most key features fully supporting accessibility standards. The video content includes captions, keywords, and a transcript, ensuring that non-text content is accessible. Headings are correctly marked, and all buttons and toggles are properly labeled for compatibility with assistive technologies. The app also allows users to customize font sizes for captions and adjust settings at any time. Users can choose between light or dark mode and adjust the contrast of the background to meet their needs. The focus order is logically structured, and all interactive elements are labeled correctly to assist with navigation. Additionally, the app provides real-time status messages to inform users when captions are being processed.
However, there are a few areas with limitations. Resizing the captions too large can affect the overlay feature, causing the captions to cover a significant portion of the video, though the functionality remains intact. Another limitation is that the tool works best in a landscape layout; in other orientations, the user will not be able to expand video content. Overall, the application is accessible with some minor exceptions.
VPAT Table
Criteria | Conformance Level | Explanation |
---|---|---|
Success Criterion 1.1.1 - Non-text Content | Supports | The only image on the page is the video, which has captions, keywords, and a transcript available. Technically there aren’t any still images, so this could be N/A. |
Success Criterion 1.3.1 - Info and Relationships | Supports | All headings are marked as headings, and all buttons and toggles are labeled correctly to be compatible with assistive technology. |
Success Criterion 1.3.4 - Orientation | Supports with Exceptions | Since there is video content, this tool is designed to be viewed horizontally. This creates the best viewing experience, and users should avoid vertical viewing so they have the ability to expand the video content. That said, the application is fully responsive, so a user could watch the video from their mobile device. |
Success Criterion 1.4.3 - Contrast (Minimum) | Supports with Exceptions | All text, buttons, and background colors have a higher than minimum contrast ratio. |
Success Criterion 1.4.4 - Resize Text | Supports | Users have the ability to customize font size for each line of captions separately. They also have the ability to adjust these settings at any time. |
Success Criterion 1.4.10 - Reflow | Supports with Exceptions | Resizing the captions to be large will affect the overlay feature. All other functionalities will work, and technically overlay will too, but the captions will be extremely large and cover a lot of the video. |
Success Criterion 1.4.11 - Non-text Contrast | Supports | Users have the opportunity to choose light or dark mode, and based on their video, they can choose the contrast for the background that feels appropriate to them. |
Success Criterion 2.4.3 - Focus Order | Supports | When using assistive technology, the order will go through the right side panel with all the caption settings first. We ensured it goes through each option in order and does not miss one. Then, it will go to the video content and captions on the left. When testing with users, this order made sense. |
Success Criterion 4.1.2 - Name, Role, Value | Supports | Since our app supports many customization options, this is an important guideline to triple check. We have ensured all buttons, sliders, and toggles are labeled correctly to inform the user of available actions. |
Success Criterion 4.1.3 - Status Messages | Supports | The main time when this occurs is when originally loading the video by pasting the URL from YouTube. It takes a while to extract captions in several languages, and our app will indicate to the user that it’s still processing. When done, the app will display a message indicating that captions in xx languages were extracted. |
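
The contrast check behind criterion 1.4.3 can be reproduced numerically. The Python sketch below follows the relative luminance and contrast ratio definitions from WCAG 2.x, where level AA requires at least 4.5:1 for normal text and 3:1 for large text. The function names are illustrative and are not taken from the CaptionCrafter codebase. A check like this could be run against any caption color and background a user selects.

```python
def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to its linear value (WCAG 2.x formula)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    """Relative luminance of an sRGB color, from 0.0 (black) to 1.0 (white)."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    """WCAG contrast ratio between two colors, from 1.0 up to 21.0."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def meets_aa(fg: tuple[int, int, int], bg: tuple[int, int, int], large_text: bool = False) -> bool:
    """True if the pair meets the WCAG AA minimum contrast for text."""
    return contrast_ratio(fg, bg) >= (3.0 if large_text else 4.5)


# Example: white caption text on a black background.
print(round(contrast_ratio((255, 255, 255), (0, 0, 0)), 2))
```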