Semi-structured vs. Unstructured Data

Which one is easier to understand and extract?

In the document understanding and processing world, people often think unstructured data is more challenging to extract than semi-structured data. It is typically true that you need more comprehension and domain knowledge to understand paragraphs of language, such as legal contracts. Even so, a knowledgeable person can comprehend documents of this nature just by listening to them being read aloud. A pure Natural Language Processing (NLP) model can be applied in those situations, and different intelligent document processing (IDP) vendors can deliver similar results once they master current transfer learning-based language models, such as Google’s BERT.

Now, imagine you are an accountant or a financial expert, and you try listening to a person reading an invoice aloud. An invoice is semi-structured data: it has both language context and a visual arrangement of different items in different parts of the document. In this scenario, the task would be difficult both for you as the listener and for the person reading the invoice to you. This is because comprehending semi-structured data requires visual information as well as context information.
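The listening thought experiment can be made concrete with a small sketch. It assumes OCR output is available as text tokens with page coordinates. The `Token` class, the `value_right_of` helper, and the invoice values are all hypothetical, invented for illustration, and this is not how any particular IDP product works. It only shows that a field's value becomes easy to find once position is available, while the flattened "read aloud" string carries no such signal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One OCR token: its text plus the top-left corner of its box (hypothetical units)."""
    text: str
    x: float
    y: float


def value_right_of(tokens, label, row_tolerance=5.0):
    """Return the text of the nearest token to the right of `label` on the same row.

    Returns None if the label is missing or nothing sits to its right.
    """
    anchor = next((t for t in tokens if t.text == label), None)
    if anchor is None:
        return None
    same_row = [
        t for t in tokens
        if abs(t.y - anchor.y) <= row_tolerance and t.x > anchor.x
    ]
    if not same_row:
        return None
    return min(same_row, key=lambda t: t.x).text


# A made-up two-column invoice header: labels and values share a row.
tokens = [
    Token("Invoice No.", 40, 100), Token("INV-1042", 160, 100),
    Token("Due Date", 320, 100), Token("2024-03-31", 420, 100),
    Token("PO No.", 40, 120), Token("PO-77", 160, 120),
    Token("Total", 320, 120), Token("1,250.00", 420, 120),
]

# What a listener hears: one flat stream with the layout stripped away.
spoken = " ".join(t.text for t in tokens)

# What a reader sees: the value sits right next to its label.
total = value_right_of(tokens, "Total")
invoice_no = value_right_of(tokens, "Invoice No.")
```

Real invoices are far messier than this fixed grid, which is why a hand-written rule like this one does not generalize and why both the words and their positions end up as model inputs.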

In that sense, I would argue that semi-structured data takes more dimensions, or “features” in machine learning terms, to comprehend. Some IDP vendors use computer vision (CV) models to locate each field on the page, and then apply OCR to transcribe its content. While this can work well with more structured data, such as fixed-format forms, it struggles with complex semi-structured documents whose layouts vary in large and small ways. Because the CV model takes image pixels as input and does not “read” the content of the document, it is as though you were asking an illiterate person to understand an invoice visually. That is doable when the documents are more structured, but a precise task, such as pulling a unique string out of a description field that contains a paragraph of words, will be very challenging for this person.

This is where our AI Pathfinder approach comes into play: it combines NLP and CV for a better comprehension of semi-structured documents. As a result, we are getting higher model accuracy with far fewer training samples, which addresses the nuance and challenge presented by semi-structured data. To learn more about how Singularity Systems, now AYR, achieves this, check out our brochure. You can also request a demo here.
