Facebook's Rosetta Can Understand Text in Images

Facebook has created and deployed a large-scale machine learning system named Rosetta. It extracts text, in a wide variety of languages, from more than a billion public Facebook and Instagram images and video frames.

This is excellent news for people who use screen readers, because it will make the text in images, videos, and memes available to people who are blind or visually impaired. I applaud Facebook’s initiative to make the internet more accessible for people with vision impairments.

In addition, Rosetta will be able to understand not only the text in images but also the context in which that text appears. Rosetta can extract text daily and in real time, then feed it into a text recognition model that has been trained on classifiers to understand the context of the text and the image together.

That means Rosetta can help Facebook’s systems proactively identify inappropriate or harmful content. In other words, it will be able to tell the difference between an image that really is harmless and one that looks harmless at first but includes text that violates Facebook and Instagram’s hate-speech policy.

Personally, I think Rosetta has wonderful potential not only to help those who are visually impaired understand what is in a particular image or video, but also to help clean up Facebook and Instagram.

My hope is that Rosetta will be able to recognize when an otherwise uncontroversial image (such as a woman standing in a field, or a photo of a gorilla) includes text that is hate speech. Ideally, this will lead to faster removal of hate speech and make Facebook a kinder place for all of its users.