Updated: Oct 23
Scenario: The first and core feature of the business is to make credit card payments convenient and reliable in the app. That is possible only when customers' credit card transactions are available to the app, and customers are willing to share their card statements as PDFs. The challenge is to extract information from the PDF statements with 100% accuracy and in under a second, because this is the first interaction for a new customer. Any incorrectly identified data poses a high credibility risk and a barrier to user adoption. Initial trials with commercial software did not meet the accuracy and performance requirements, and the cost would be high with millions of customers.
The Ask: The data extractor should be configurable for every bank and should be 100% accurate. It should have built-in, configurable checks that flag data as incorrect when it does not meet the quality requirements. The solution should parse a statement in real time, in under a second, to support the app's workflows, and should also handle millions of statements in the monthly payment cycle.
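To make the idea of configurable checks concrete, here is a minimal sketch in Java. It is an illustration only: the class name `StatementChecks`, both method names, and the two checks themselves (a per-bank field pattern, and reconciling parsed amounts against a total printed on the statement) are assumptions, not the checks the actual solution used.

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical quality checks; names and rules are assumptions for illustration.
final class StatementChecks {

    /** True when the whole field value matches the pattern configured for a bank. */
    static boolean fieldMatches(String value, Pattern configuredPattern) {
        return configuredPattern.matcher(value).matches();
    }

    /**
     * True when the parsed transaction amounts add up exactly to the total
     * printed on the statement. BigDecimal avoids floating-point rounding,
     * and compareTo ignores differences in scale (100.0 vs 100.00).
     */
    static boolean totalsReconcile(List<BigDecimal> amounts, BigDecimal printedTotal) {
        BigDecimal sum = BigDecimal.ZERO;
        for (BigDecimal amount : amounts) {
            sum = sum.add(amount);
        }
        return sum.compareTo(printedTotal) == 0;
    }
}
```

A statement that fails any such check would be classified as incorrect rather than shown to the customer. A bank whose statements print dates as dd/mm/yyyy could, for example, be configured with `Pattern.compile("\\d{2}/\\d{2}/\\d{4}")`.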
Solution: The parser works in two steps: it first converts the PDF document to HTML, then interprets the HTML using spatial and data patterns. The PDF-to-HTML conversion is powered by an excellent open source package, and the rest of the solution is built in Java. Reconstructing tabular data was the key challenge: handling multiple headers, data spanning pages, text wrapped within cells, and mapping data to the correct columns. The parser achieved 100% accuracy, and where it could not, it detected the exception and routed the statement to an exception flow. A card statement of 4-5 pages is processed in under 200 ms on average, and the solution scales well to millions of statements per hour.
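The column-mapping and wrapped-text problems described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual parser: it assumes the HTML step yields text fragments with x/y positions, that column boundaries have already been derived from the header row, and that a fragment in the first column (for example the date) starts a new transaction row. The names `ColumnMapper`, `Fragment`, `Column`, and `toRows` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of spatial column mapping; not the production parser.
final class ColumnMapper {

    /** A text fragment with its left (x) and top (y) position on the page. */
    record Fragment(double x, double y, String text) {}

    /** A table column covering the horizontal range [left, right). */
    record Column(String name, double left, double right) {}

    /** Returns the index of the column whose range contains x, or -1 if none does. */
    static int columnIndex(List<Column> columns, double x) {
        for (int i = 0; i < columns.size(); i++) {
            Column c = columns.get(i);
            if (x >= c.left() && x < c.right()) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Builds table rows from fragments sorted top to bottom, then left to right.
     * A fragment in the first column opens a new row; fragments in other columns
     * are appended to the current row, so wrapped cell text is merged into one cell.
     */
    static List<String[]> toRows(List<Column> columns, List<Fragment> fragments) {
        List<String[]> rows = new ArrayList<>();
        String[] current = null;
        for (Fragment f : fragments) {
            int col = columnIndex(columns, f.x());
            if (col < 0) {
                continue; // outside the table; a real parser might flag this instead
            }
            if (col == 0) {
                current = new String[columns.size()];
                Arrays.fill(current, "");
                rows.add(current);
            }
            if (current == null) {
                continue; // text found before the first row started
            }
            current[col] = current[col].isEmpty() ? f.text() : current[col] + " " + f.text();
        }
        return rows;
    }
}
```

With columns Date, Description, and Amount, a description that wraps onto a second line has no date fragment on that line, so it is folded into the previous row's Description cell rather than starting a new transaction.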