Microsoft is committed to pushing the boundaries of technology to improve and positively influence all parts of society. Recent advances in deep learning and related AI techniques have resulted in significant strides in automated image captioning. However, current image captioning systems are not well-aligned with the needs of a community that can benefit greatly from them: people who are blind or with low vision.
We recently completed a competitive process to find an academic research team to help us change that. We’re excited to partner with The University of Texas at Austin for our new Microsoft Ability Initiative. This companywide initiative aims to create a public dataset that can ultimately be used to advance the state of the art in AI systems for automated image captioning. We recently spent two days with the research team in Austin to kick off this exciting new collaboration.
Microsoft researchers involved in this effort have specialized experience in accessible technologies, human-centric AI systems, and computer vision. Their efforts are complemented by colleagues in other divisions of the company, including the AI for Accessibility program, which helps fund the initiative, and Microsoft 365 accessibility. The Microsoft Ability Initiative is one of a growing number of efforts at Microsoft in which researchers and product developers are coming together in a cross-company push to spur new research and development in accessible technologies.
“We are excited about this new initiative,” said Wendy Chisholm, Principal Program Manager with the AI for Accessibility program at Microsoft. “The goal of creating public data resources that can accelerate innovations with AI that empower people who are blind or with low vision is a fantastic example of the kind of impact Microsoft hopes to have through its AI for Accessibility program.”
UT Austin stood out among a select group of universities with specialized experience that were invited last year to participate in the competitive process to identify an academic partner for the initiative. Principal investigator Professor Danna Gurari and Professor Kenneth R. Fleischmann are leading the team at UT Austin, which also includes several graduate students.
Professor Gurari has a record of success in creating public datasets to advance the state of the art in AI and accessibility, having co-founded the VizWiz Grand Challenge. The UT Austin team, which we’ll collaborate with over a period of 18 months, plans to take a user-centered approach to the problem, including working with people who are blind or with low vision to better understand their expectations of AI captioning tools. The team also plans to launch community challenges to engage a broad swath of researchers and developers in building these next-generation tools.
“I hope to build a community that links the diversity of researchers and practitioners with a shared interest in developing accessible methods in order to accelerate the conversion of cutting-edge research into market products that assist people who are blind or with low vision in their daily lives,” said Gurari.
This collaboration with UT Austin builds upon prior Microsoft research that has identified a need for new approaches at the intersection of computer vision and accessibility. Such work includes studies on how end users who are blind interpret the output of AI image labeling systems and the types of detail missing from automated image descriptions. We’ve also built a prototype exploring new techniques for interacting with image captions that takes advantage of the more detailed and structured caption content future AI systems may provide. Our prior research has identified many key challenges in this realm, and we’re looking forward to working with UT Austin to make strides toward actionable solutions. Our Cognitive Services and Azure cloud computing resources provide a technical foundation that will support the joint research effort.
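As one concrete illustration of that foundation, the sketch below shows how a caller might request automated captions from the Computer Vision service in Azure Cognitive Services. It is a minimal example rather than the approach the joint research effort will necessarily use, and the endpoint, key, and image URL are placeholders to be replaced with values from your own Azure resource.

```python
# Minimal sketch: requesting automated captions from the Azure Cognitive
# Services Computer Vision API. Endpoint, key, and image URL are placeholders.
from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from msrest.authentication import CognitiveServicesCredentials

ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com/"  # placeholder
KEY = "<your-subscription-key>"  # placeholder

client = ComputerVisionClient(ENDPOINT, CognitiveServicesCredentials(KEY))

# Ask the service to describe a publicly accessible image with up to three
# candidate captions, each returned with a confidence score.
result = client.describe_image("https://example.com/photo.jpg", max_candidates=3)

for caption in result.captions:
    print(f"{caption.text} (confidence: {caption.confidence:.2f})")
```

Output like this is the kind of automated description that today's systems produce; the initiative's user-centered work aims to understand where such captions fall short for people who are blind or with low vision and what a richer dataset should capture.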
Professor Gurari noted that the initiative will not only advance the state of the art of vision-to-language technology, continuing the progress Microsoft has made with such tools and resources as the Seeing AI mobile phone application and the Microsoft Common Objects in COntext (MS COCO) dataset, but it will also be a teaching opportunity for students at UT Austin.
“I love to see the excitement in so many of my students when they realize that they can use their skills to make a difference in the world, especially for people who are blind or with low vision,” she said.
We came away from our meetings at The University of Texas at Austin even more energized about the potential for this initiative to have real impact on the lives of millions of people around the world, and we couldn’t be more excited. We expect that, at the end of this joint effort, the broader research community will leverage the new dataset to jump-start yet another wave of innovative research leading to new technologies for people who are blind or with low vision.