As the world becomes increasingly digital, enterprise documents such as contracts, reports, invoices, and receipts have grown to feature intricate layouts. These documents are increasingly interpreted and analyzed automatically, driving the development of AI-powered solutions. Challenges arise, however, from the rich semantics that sit at the intersection of the textual and spatial modalities in these complex layouts, where visual cues are crucial for efficient interpretation.
While Document AI (DocAI) has made significant progress in areas such as question answering, categorization, and extraction, real-world applications still face persistent hurdles related to accuracy, reliability, contextual understanding, and generalization to new domains.
To address these issues, a team of researchers from JPMorgan AI Research has introduced DocLLM, a lightweight extension of conventional Large Language Models (LLMs) designed specifically for reasoning over visual documents. DocLLM accounts for both textual semantics and spatial layout, making it inherently multi-modal.
Unlike traditional methods, DocLLM incorporates the bounding box coordinates obtained from optical character recognition (OCR) to capture spatial layout information. This design choice eliminates the need for a sophisticated visual encoder, reducing processing time and only slightly increasing model size, while preserving the causal decoder architecture.
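Although the article does not include implementation details, the idea of feeding layout to the model as coordinates rather than pixels can be pictured with a short sketch. The `OcrToken` structure and the normalization step below are illustrative assumptions, not the authors' actual code.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class OcrToken:
    """A text token paired with its bounding box from OCR (illustrative)."""
    text: str
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page pixels

def normalize_bboxes(
    tokens: List[OcrToken], page_w: float, page_h: float
) -> List[Tuple[float, float, float, float]]:
    """Scale bounding boxes to [0, 1] so layout is independent of page size."""
    return [
        (t.bbox[0] / page_w, t.bbox[1] / page_h, t.bbox[2] / page_w, t.bbox[3] / page_h)
        for t in tokens
    ]

# Example: two OCR tokens from an invoice header
tokens = [OcrToken("Invoice", (72, 40, 160, 60)), OcrToken("#1234", (170, 40, 230, 60))]
print(normalize_bboxes(tokens, page_w=612, page_h=792))
```

The normalized coordinates would then be embedded alongside the text tokens, which is how the model can skip an image encoder entirely.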
The team found that for several document intelligence tasks, such as form comprehension, table alignment, and visual question answering, the spatial layout structure alone is sufficient. By disentangling spatial information from textual information, DocLLM extends the standard transformer self-attention mechanism to capture cross-modal interactions.
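The attention mechanism is only described at a high level here. The following single-head sketch shows one plausible reading of such disentangled text/spatial attention, assuming separate projections for text and bounding-box embeddings and scalar mixing weights; it is an illustration of the idea, not DocLLM's exact formulation.

```python
import torch
import torch.nn as nn

class DisentangledSelfAttention(nn.Module):
    """Single-head sketch: text and layout get separate query/key projections,
    and their cross terms are mixed into one attention score matrix."""

    def __init__(self, d_model: int, lam_ts: float = 1.0, lam_st: float = 1.0, lam_ss: float = 1.0):
        super().__init__()
        self.q_t = nn.Linear(d_model, d_model)   # text queries
        self.k_t = nn.Linear(d_model, d_model)   # text keys
        self.q_s = nn.Linear(d_model, d_model)   # spatial (bbox) queries
        self.k_s = nn.Linear(d_model, d_model)   # spatial (bbox) keys
        self.v = nn.Linear(d_model, d_model)     # values come from text only
        self.lam_ts, self.lam_st, self.lam_ss = lam_ts, lam_st, lam_ss
        self.scale = d_model ** -0.5

    def forward(self, text_emb: torch.Tensor, spatial_emb: torch.Tensor) -> torch.Tensor:
        # text_emb, spatial_emb: (batch, seq_len, d_model)
        qt, kt = self.q_t(text_emb), self.k_t(text_emb)
        qs, ks = self.q_s(spatial_emb), self.k_s(spatial_emb)
        scores = (
            qt @ kt.transpose(-2, -1)                      # text-to-text
            + self.lam_ts * (qt @ ks.transpose(-2, -1))    # text-to-layout
            + self.lam_st * (qs @ kt.transpose(-2, -1))    # layout-to-text
            + self.lam_ss * (qs @ ks.transpose(-2, -1))    # layout-to-layout
        ) * self.scale
        # Causal mask, matching the decoder-only setup described above
        seq_len = text_emb.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        attn = torch.softmax(scores, dim=-1)
        return attn @ self.v(text_emb)
```

The key design point is that the cross terms let a token attend to where other tokens are on the page, not just what they say.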
Visual documents often contain fragmented text sections, erratic layouts, and heterogeneous information. To address this, the study modifies the pre-training objective used during the self-supervised pre-training phase. This adjustment allows the model to handle mixed data types, complex layouts, contextual completions, and misaligned text more effectively.
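The article does not spell out the modified objective. As a purely illustrative example, an infilling-style target, where one block of OCR text is hidden and reconstructed from its neighbors, could be assembled as follows (the helper and mask token are hypothetical):

```python
from typing import List, Tuple

def make_block_infilling_example(
    blocks: List[List[str]], masked_idx: int, mask_token: str = "<mask>"
) -> Tuple[List[str], List[str]]:
    """Hypothetical illustration: hide one text block and ask the model to
    reconstruct it from the surrounding blocks (and, in DocLLM, their layout)."""
    context: List[str] = []
    for i, block in enumerate(blocks):
        context.extend([mask_token] if i == masked_idx else block)
    target = blocks[masked_idx]
    return context, target

# Blocks as they might come from OCR on a receipt
blocks = [["Store:", "ACME"], ["Total:", "$42.00"], ["Thank", "you"]]
print(make_block_infilling_example(blocks, masked_idx=1))
# (['Store:', 'ACME', '<mask>', 'Thank', 'you'], ['Total:', '$42.00'])
```

Working over blocks rather than single next tokens is one way a model could cope with the fragmented, out-of-order text that OCR produces on complex pages.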
To adapt DocLLM's pre-trained knowledge to different document intelligence tasks, the team fine-tuned it on instruction data drawn from several datasets. These tasks include document classification, visual question answering, natural language inference, and key information extraction.
The instruction-tuning data covers both single- and multi-page documents, and layout cues such as field separators, titles, and captions are included to help the model grasp the logical structure of each document. The modifications DocLLM makes to the Llama2-7B model yield notable performance gains, ranging from 15% to 61%, on four of the five previously unseen datasets.
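The prompt formats themselves are not reproduced in the article; the sketch below shows one hypothetical way such instruction examples might be assembled for the tasks listed above (the template wording and field names are assumptions, not the dataset's actual format).

```python
def build_instruction_example(task: str, document_text: str, question: str = "") -> str:
    """Hypothetical prompt templates for document instruction tuning."""
    templates = {
        "classification": "Classify the document type.\nDocument:\n{doc}\nAnswer:",
        "vqa": "Answer the question using the document.\nDocument:\n{doc}\nQuestion: {q}\nAnswer:",
        "kie": "Extract the key fields and their values.\nDocument:\n{doc}\nFields:",
    }
    return templates[task].format(doc=document_text, q=question)

print(build_instruction_example("vqa", "Invoice #1234\nTotal: $42.00", "What is the total amount?"))
```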
In summary, the team’s primary contributions are as follows:
- Introducing a lightweight extension of a typical LLM specifically designed for visual document interpretation.
- Proposing a disentangled attention mechanism that separates textual from spatial information, enabling efficient capture of cross-modal alignment between layout and text.
- Outlining a pre-training objective that addresses the challenges posed by irregular layouts in visual documents.
- Designing a specialized instruction-tuning dataset for effective fine-tuning of the model for visual document intelligence tasks.
- Conducting in-depth experiments that provide valuable insights into the behavior and capabilities of the proposed model when handling visual documents.
With the introduction of DocLLM, the field of document intelligence takes a significant step toward overcoming the challenges posed by complex layouts. By incorporating both textual semantics and spatial layout, DocLLM offers a comprehensive approach that improves accuracy, reliability, and contextual understanding. As enterprises continue to rely on AI-driven solutions for document analysis, DocLLM stands out as a valuable tool for improving efficiency and effectiveness.