
The first and core feature of the business was to make credit card payments convenient and reliable on the app. This was possible only when a customer's credit card transactions were available to the app, and customers expected their card statements in PDF form. The challenge was to extract information from the PDF statements with 100% accuracy and in under a second, as this was the first interaction for a new customer. Any incorrectly identified data posed a high credibility risk and a barrier to user adoption. Initial trials with commercial software did not meet the accuracy and performance requirements, and the cost would have been high with millions of customers.
The data extractor had to be configurable for every bank and 100% accurate. It needed built-in, configurable checks that classified data as incorrect whenever it did not meet the quality requirements. The solution had to parse a statement in real time, in under a second, to support the app's workflows, and also handle millions of statements for the monthly payment cycle.
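One way to picture such a configurable check is a reconciliation rule: the extracted transaction amounts must add up to the total printed on the statement, within a tolerance set per bank. The sketch below is only an illustration under that assumption; the class name StatementChecks, the method reconcile, and the rule itself are hypothetical and not taken from the original system.

```java
import java.math.BigDecimal;
import java.util.List;

// Hypothetical sketch: the names and the reconciliation rule are assumptions,
// not the checks used in the original system.
class StatementChecks {

    /** Outcome of a check: accepted, or routed to an exception flow. */
    enum Outcome { ACCEPTED, EXCEPTION }

    /**
     * Accepts an extraction only if the parsed amounts add up to the total
     * printed on the statement, within a configurable (per-bank) tolerance.
     */
    static Outcome reconcile(List<BigDecimal> amounts,
                             BigDecimal printedTotal,
                             BigDecimal tolerance) {
        if (amounts == null || amounts.isEmpty() || printedTotal == null) {
            return Outcome.EXCEPTION; // missing data is never treated as correct
        }
        BigDecimal sum = BigDecimal.ZERO;
        for (BigDecimal amount : amounts) {
            if (amount == null) {
                return Outcome.EXCEPTION; // an unparsed amount fails the check
            }
            sum = sum.add(amount);
        }
        // compareTo ignores scale, so 0.00 and 0 compare as equal.
        BigDecimal difference = sum.subtract(printedTotal).abs();
        return difference.compareTo(tolerance) <= 0
                ? Outcome.ACCEPTED
                : Outcome.EXCEPTION;
    }
}
```

The design choice worth noting is that the check fails closed: anything missing or inconsistent is classified as an exception rather than passed on as correct data.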
The parser worked in two steps: first converting the PDF document to HTML, then interpreting the HTML through spatial and data patterns. The PDF-to-HTML conversion was powered by an excellent open source package, and the rest of the solution was built in Java. Reconstructing tabular data was a key challenge: the parser had to handle multiple headers, data spanning pages, wrapped text within cells, and the mapping of data to the respective columns. The parser achieved 100% accuracy, spotting any statement that fell short of it and sending that statement to an exception flow. A card statement of 4-5 pages was processed in under 200 ms on average, and the solution scaled well to handle millions of statements per hour.
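The spatial side of table reconstruction can be sketched as follows: each piece of positioned text is assigned to a column by its horizontal offset, and a visual line with nothing in the first column is treated as wrapped text belonging to the previous row. This is a minimal sketch under stated assumptions; the class TableRebuilder, the Fragment type, the column model based on left edges, and the wrapped-line rule are all hypothetical and are not the original parser's design.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the class names, the left-edge column model and the
// "empty first column means wrapped line" rule are assumptions for illustration.
class TableRebuilder {

    /** A positioned piece of text, as might be read from the converted HTML. */
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Index of the last column whose left edge is at or before x, or -1 if none. */
    static int columnOf(double x, double[] columnLefts) {
        int col = -1;
        for (int i = 0; i < columnLefts.length; i++) {
            if (x >= columnLefts[i]) {
                col = i; // columnLefts is assumed to be sorted ascending
            }
        }
        return col;
    }

    /**
     * Rebuilds rows from fragments already sorted top to bottom. Fragments whose
     * y values differ by at most yTolerance form one visual line; a line with no
     * text in column 0 is merged into the previous row as wrapped cell text.
     */
    static List<String[]> rebuild(List<Fragment> sorted, double[] columnLefts, double yTolerance) {
        List<String[]> rows = new ArrayList<>();
        String[] line = null;
        double lineY = 0;
        for (Fragment f : sorted) {
            int col = columnOf(f.x, columnLefts);
            if (col < 0) {
                continue; // left of the first column: outside the table in this sketch
            }
            if (line == null || Math.abs(f.y - lineY) > yTolerance) {
                flush(line, rows);
                line = new String[columnLefts.length];
                lineY = f.y;
            }
            line[col] = (line[col] == null) ? f.text : line[col] + " " + f.text;
        }
        flush(line, rows);
        return rows;
    }

    /** Adds the line as a new row, or merges it into the previous row if it is a wrap. */
    private static void flush(String[] line, List<String[]> rows) {
        if (line == null) {
            return;
        }
        if (line[0] == null && !rows.isEmpty()) {
            String[] previous = rows.get(rows.size() - 1);
            for (int i = 0; i < line.length; i++) {
                if (line[i] != null) {
                    previous[i] = (previous[i] == null) ? line[i] : previous[i] + " " + line[i];
                }
            }
        } else {
            rows.add(line);
        }
    }
}
```

A real statement would also need per-bank column definitions, repeated headers skipped on later pages, and validation of each cell against expected data patterns, none of which this sketch attempts.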