a tall building lit up at night

Microsoft Research Lab – Asia

Revolutionizing Document AI with Multimodal Document Foundation Models

Share this page

Have you ever been overwhelmed by invoices with different pieces of information like payables, dates, quantity of goods, unit prices and amounts? When dealing with essential business contracts, are you worried about getting a decimal point wrong, causing incalculable financial losses? Have you ever read numerous resumes while looking for top talent? Business people have to deal with all these tasks and a great variety of documents, including insurance policies, business reports, emails, and shipping orders.

In the digital age, companies often must digitize all these documents, along with various charts and images, to streamline their procedures. However, manually digitizing documents is an inefficient practice because many scanned documents are of varying quality, while web pages and electronic documents can have different layouts. How can we efficiently extract, organize and analyze the information in these different documents? The answer is Document AI technology, which frees employees and companies from this repetitive and tedious work.

graph
Scanned images of business documents with different layouts and formats

Document AI mainly refers to the use of AI technology for automated extraction, classification, and understanding of information with rich typesetting formats from webpages, digital-born documents, or scanned documents. It is an important research area at the intersection of natural language processing (NLP) and computer vision (CV). The surge of deep learning techniques has dramatically boosted the development of Document AI, with significant performance improvements in visual information extraction and document layout analysis, as well as in document visual question and answering and document image classification, etc. Document AI also plays an important role in helping enterprises save operational costs, improve employee efficiency, and reduce human errors.

From Text to Multimodal Models: Document AI Gradually Develops New Skills.

Microsoft Research Asia’s series of studies on Document AI began in 2019. In an in-depth study of deep learning, researchers wanted to extract useful information from publicly available documents to build a knowledge base that could support the pre-training task of deep learning models. However, since real-world documents do not contain structured data, extracting structured textual information from cluttered documents was the first problem researchers had to solve.

To tackle this problem, Microsoft Research Asia proposed UniLM, a unified pre-trained language model that reads documents and automatically generates content. The UniLM model has achieved strong results in natural language understanding and generation tasks. Furthermore, researchers provided the system with a function to expand English NLP tasks into several languages by developing a cross-lingual pre-trained model (InfoXLM). In the real world, documents contain not only text information but also layout and style information (e.g., fonts, colors, and underlines). As a result, models that only process textual information cannot be applied in real-world scenarios where multimodal programs are needed.

At the end of 2019, Microsoft Research Asia introduced LayoutLM, a general-purpose pre-trained Document Foundation Model that combined NLP and CV technologies. This is the first model that can learn both text and layout information in a single framework for document-level pre-training. LayoutLM is pre-trained on approximately 11 million scanned document images from the IIT-CDIP Test Collection 1.0 dataset. It can also be easily trained in a self-supervised way with the large-scale use of unlabeled scanned document images, outperforming other models in form and receipt understanding as well as image classification tasks.  In an updated model called LayoutLMv2, researchers subsequently incorporated visual information in the pre-training process to improve its image understanding capability. This new model managed to unify document text, layout, and visual information in a single end-to-end framework that can learn cross-modal interactions.

graphical user interface, application, Teams
Document AI Research Progress from Microsoft Research Asia

Moreover, researchers developed LayoutXLM, a multimodal pre-trained model that is based on LayoutLMv2 but can perform multilingual document understanding, to meet the needs of diverse users using various languages. The LayoutXLM model not only integrates textual and visual information from multilingual documents but also exploits their local invariance property. LayoutXLM can process documents in nearly 200 languages. To accurately evaluate the performance of pre-trained models for multilingual document understanding, researchers also created the multilingual form understanding benchmark dataset XFUND, which covers seven languages (i.e., Chinese, Japanese, Spanish, French, Italian, German, and Portuguese).

Unlike fixed-layout documents that include scanned document images and digital-born PDF files, many markup-language-based documents, such as HTML-based web pages and XML-based Office documents, are often rendered in real-time. For this reason, researchers developed the MarkupLM model to process the source code of markup-language-based documents and understand them without additional computational resources. The experimental results show that MarkupLM significantly outperforms previous fixed-layout-based methods and is highly practical.

Microsoft Research Asia continues to iterate on Document AI techniques to provide them with the ability to process different types of data, including text, layout and image information. This year, Microsoft Research Asia released LayoutLMv3, the latest multimodal pre-trained model that can achieve unified masked text and image modeling. LayoutLMv3 is the first model that mitigates the discrepancy between text and image multimodal representation learning by masking the prediction of both text and images. Additionally, LayoutLMv3 is pre-trained to achieve word-patch alignment, which means that it can learn cross-modal alignment by predicting whether the corresponding image patch of a word is masked. In terms of model architecture, LayoutLMv3 does not rely on a pre-trained CNN backbone to extract visual features. However, it directly utilizes document image patches, which significantly saves parameters, eliminates region annotations, and avoids complex document pre-processing. These simple unified architecture and training objectives make LayoutLMv3 a general-purpose pretrained model for both text-centric and image-centric Document AI tasks.

“The Layout(X)LM family models play a crucial role in our fundamental research on pushing ‘The Big Convergence’ of foundation models and large-scale self-supervised pretraining across tasks, languages, and modalities”, stated Furu Wei, Partner Research Manager at Microsoft Research Asia.

graph
The architecture and pre-training objectives of LayoutLMv3

“We see a research trend towards a big convergence of different modalities, where scientists from different fields are researching unified models, including NLP, CV, and others. The first two versions of LayoutLM focused on language processing, while the advantage of LayoutLMv3 is that it can handle tasks in both NLP and CV modalities, making a big breakthrough in the field of computer vision,” said Lei Cui, Principal Research Manager at Microsoft Research Asia.

Microsoft Research Asia Series Document AI Models
GitHub link: https://github.com/microsoft/unilm (opens in new tab)

Industry-leading Models

The Layout(X)LM series models are leading the way in leveraging large-scale unlabeled data and integrating text and images with multimodal, multi-page, and multilingual content. In particular, the generality and superiority of LayoutLMv3 have made it a benchmark model for Document AI industry research. For example, the Layout(X)LM series models have been adopted by many Document AI products from many leading companies, especially in the Robotic Process Automation (RPA) domain.

“Microsoft Research Asia has not only achieved significant results in modeling innovation and benchmark datasets but it has also developed many applications that allow users to perform multiple tasks with just one single model architecture. Many colleagues in academia and industry are using Layout(X)LM to conduct meaningful scientific explorations and advancing Document AI,” Lei Cui said.

Microsoft is leading the way in the field, with a range of Microsoft Research Asia’s Document AI models now being used in many Microsoft-related products, such as Azure Form Recognizer, AI Builder, and Microsoft Syntex. “We are excited to work with these top researchers at Microsoft Research Asia. The Document Foundation Models have greatly improved our development and application efficiency and have contributed to the popularity of Document AI. We look forward to more exciting advances in this area in the future,” said Cha Zhang, Partner Engineering Manager from Microsoft Azure AI.

Next Step in Document AI: Developing General-purpose and Unified Frameworks

Over the course of time, technological advances in Document AI have led to its application in various industries such as finance, healthcare, energy, government services, and logistics, saving people in these industries a lot of time as they can now avoid manual processing. For example, in the financial industry, Document AI enables financial report analysis, intelligent decision analysis, and automated information extraction for invoices and orders; in the healthcare industry, it facilitates case digitization, analyzes the relevance of medical literature and cases, and suggests potential treatment options.

However, Microsoft Research Asia will not rest on its laurels, Lei Cui indicated. Its researchers are planning to further advance fundamental research in Document AI in three aspects: increasing the scale of models, scaling up training data, and unifying frameworks. “The GPT-3 in NLP demonstrates that large language models can significantly improve performance. The training data for current Document AI models equates to less than one-tenth of web-scale data, so there is still room for improvement. We will focus on scaling up data and models in future research to achieve unification across Document AI frameworks.”

The Natural Language Computing Group at Microsoft Research Asia is looking for researchers and interns. We welcome people who are interested in this research area or related fields to join us!

To apply for a full-time position, please send your resume to fuwei@microsoft.com.