Create an AWS Lambda API to extract information from PDF

If you have not created an AWS Lambda yet, feel free to follow my previous post to create your first hello-world Lambda using Quarkus and Java.

In this post, I will show how to extract data from PDFs (especially receipts or invoices) using the Amazon Textract service.

Here is my requirement: if I upload an invoice or receipt PDF to the application endpoint, it should return an Expense JSON (in the format below) with the details filled in. The endpoint should also save the uploaded PDF to S3 for future reference.

{
  "description" : "string",
  "amountWithTax" : 11000.11,
  "taxAmount" : 122.23,
  "receiptDate" : "10-10-2021",
  "documentUrl" : "fileName.pdf"
}
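The Expense class itself is not shown in this post. As a rough sketch, a plain Java model matching the JSON above could look like the following. The hand-written builder, the getter names, and keeping receiptDate as text are assumptions made for illustration; a Lombok-generated builder would serve the same purpose.

```java
import java.math.BigDecimal;

// Illustrative stand-in for the Expense model; field names follow the JSON above.
class Expense {
    private final String description;
    private final BigDecimal amountWithTax;
    private final BigDecimal taxAmount;
    private final String receiptDate; // kept as text, e.g. "10-10-2021"
    private final String documentUrl;

    private Expense(Builder b) {
        this.description = b.description;
        this.amountWithTax = b.amountWithTax;
        this.taxAmount = b.taxAmount;
        this.receiptDate = b.receiptDate;
        this.documentUrl = b.documentUrl;
    }

    static Builder builder() { return new Builder(); }

    String getDescription() { return description; }
    BigDecimal getAmountWithTax() { return amountWithTax; }
    BigDecimal getTaxAmount() { return taxAmount; }
    String getReceiptDate() { return receiptDate; }
    String getDocumentUrl() { return documentUrl; }

    // Fluent builder: fields that are never set simply stay null.
    static final class Builder {
        private String description;
        private BigDecimal amountWithTax;
        private BigDecimal taxAmount;
        private String receiptDate;
        private String documentUrl;

        Builder description(String v) { this.description = v; return this; }
        Builder amountWithTax(BigDecimal v) { this.amountWithTax = v; return this; }
        Builder taxAmount(BigDecimal v) { this.taxAmount = v; return this; }
        Builder receiptDate(String v) { this.receiptDate = v; return this; }
        Builder documentUrl(String v) { this.documentUrl = v; return this; }
        Expense build() { return new Expense(this); }
    }
}
```

BigDecimal is used for the amounts so that values such as 11000.11 are not subject to floating-point rounding.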

To start with, we need an endpoint to which a PDF file can be uploaded. I have created a controller with an endpoint for this purpose. The controller just receives multipart form data and delegates the input to DocumentUploadService.

@Path("/pdf")
public class PdfUploadResource {

    @Inject
    DocumentUploadService documentUploadService;

    @POST
    @Path("/upload")
    @Consumes(MediaType.MULTIPART_FORM_DATA)
    public Response upload(@MultipartForm MultipartFormDataInput input) {
        // Delegate to the service's public entry point, which uploads the file and analyses it.
        return Response.ok()
                .entity(documentUploadService.analyseDocument(input)).build();
    }
}

To use multipart form data in the controller, we have to include the following dependency in our pom.xml:

<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-resteasy-multipart</artifactId>
</dependency>

DocumentUploadService uploads the PDF to S3 using AwsS3Service and then invokes ExpenseAnalyserService to analyse the document and extract its text.

@ApplicationScoped
public class DocumentUploadService {

    @Inject
    AwsS3Service awsS3Service;

    @Inject
    ExpenseAnalyserService expenseAnalyserService;

    @SneakyThrows
    public Expense analyseDocument(MultipartFormDataInput input) {
        for (InputPart inputPart : input.getFormDataMap().get("file")) {
            String documentUrl = uploadFile(inputPart);
            return expenseAnalyserService.analyseDocument(documentUrl);
        }
        return null;
    }

    private String uploadFile(InputPart inputPart) throws IOException {
        InputStream inputStream = inputPart.getBody(InputStream.class, null);
        String fileName = getFileName(inputPart.getHeaders());
        writeFile(inputStream, fileName);
        return fileName;
    }

    private void writeFile(InputStream inputStream,String fileName)
            throws IOException {
        byte[] bytes = IOUtils.toByteArray(inputStream);
        awsS3Service.saveFile(fileName, bytes);
    }
}
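The getFileName helper called in uploadFile is not shown above. One possible version, sketched here under assumptions, reads the filename from the part's Content-Disposition header. The class name FileNames and the fallback name "unknown.pdf" are hypothetical. The method accepts a plain Map<String, List<String>>, which the MultivaluedMap returned by inputPart.getHeaders() also satisfies.

```java
import java.util.List;
import java.util.Map;

class FileNames {

    // Fallback used when the part carries no usable filename (an assumption for this sketch).
    static final String DEFAULT_NAME = "unknown.pdf";

    static String getFileName(Map<String, List<String>> headers) {
        List<String> dispositions = headers.get("Content-Disposition");
        if (dispositions == null || dispositions.isEmpty()) {
            return DEFAULT_NAME;
        }
        // A typical value: form-data; name="file"; filename="invoice.pdf"
        for (String token : dispositions.get(0).split(";")) {
            String trimmed = token.trim();
            if (trimmed.startsWith("filename=")) {
                String name = trimmed.substring("filename=".length()).trim().replace("\"", "");
                // Keep only the last path segment so client-side directories do not end up in the S3 key.
                int lastSeparator = Math.max(name.lastIndexOf('/'), name.lastIndexOf('\\'));
                name = name.substring(lastSeparator + 1);
                return name.isEmpty() ? DEFAULT_NAME : name;
            }
        }
        return DEFAULT_NAME;
    }
}
```

This sketch looks the header up by its exact name and does not handle the encoded filename*= form, so treat it as a starting point rather than a complete parser.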

The main part here is ExpenseAnalyserService. To use the Amazon Textract service, we first have to add the Amazon Textract SDK to our pom.xml.

        <dependency>
            <groupId>software.amazon.awssdk</groupId>
            <artifactId>textract</artifactId>
            <version>${textract.version}</version>
        </dependency>

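Here I am using the AnalyzeExpense API, which is more appropriate for invoice and receipt documents. Note that UrlConnectionHttpClient, used below, ships in the SDK's separate url-connection-client artifact, so that dependency is needed as well.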


import org.jboss.logging.Logger;
import software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.textract.TextractClient;
import software.amazon.awssdk.services.textract.model.AnalyzeExpenseRequest;
import software.amazon.awssdk.services.textract.model.AnalyzeExpenseResponse;
import software.amazon.awssdk.services.textract.model.Document;
import software.amazon.awssdk.services.textract.model.S3Object;

import javax.enterprise.context.ApplicationScoped;
import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.Collection;
import java.util.Comparator;
import java.util.Optional;

@ApplicationScoped
public class ExpenseAnalyserService {
    private static final Logger LOG = Logger.getLogger(ExpenseAnalyserService.class);

    public Expense analyseDocument(String uploadFileName) {
        TextractClient textractclient = TextractClient.builder()
                .httpClientBuilder(UrlConnectionHttpClient.builder())
                .region(Region.US_EAST_1).build();

        AnalyzeExpenseResponse result = textractclient.analyzeExpense(
                analyseExpenseRequest(uploadFileName));

        return Expense.builder()
                .amountWithTax(total(result).orElse(BigDecimal.ZERO))
                .documentUrl(uploadFileName)
                .build();
    }

    private AnalyzeExpenseRequest analyseExpenseRequest(String uploadFileName) {
        return AnalyzeExpenseRequest.builder()
                .document(
                        Document.builder().s3Object(S3Object.builder().name(uploadFileName)
                                .bucket(System.getenv("INVOICE_PDF_BUCKET_NAME")).build()).build())
                .build();
    }

    private Optional<BigDecimal> total(AnalyzeExpenseResponse result) {
        try {
            return result.expenseDocuments().stream()
                    .map(expenseDocument -> expenseDocument.summaryFields())
                    .flatMap(Collection::stream)
                    .filter(expenseField -> expenseField.type().text().equalsIgnoreCase("TOTAL"))
                    .filter(expenseField -> expenseField.type().confidence() > 80)
                    .max(Comparator.comparing(expenseField -> expenseField.type().confidence()))
                    .map(expenseField -> expenseField.valueDetection().text())
                    .map(BigDecimal::new);
        } catch (Exception e) {
            LOG.error("Error while analysing expense total", e);
            return Optional.empty();
        }
    }
    
}
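One caveat with passing the detected text straight to BigDecimal::new: receipts often print totals such as "$11,000.11", which BigDecimal cannot parse. In the code above that exception is caught, so the total silently becomes empty. A small normalising step can help. The sketch below is not part of the original service. The AmountParser class and parseAmount method are hypothetical names, and it assumes '.' is the decimal separator and ',' only groups thousands.

```java
import java.math.BigDecimal;
import java.util.Optional;

class AmountParser {

    // Strips currency symbols, spaces and thousands separators, then parses what is left.
    // Assumes '.' is the decimal separator; European-style "1.234,56" is NOT handled.
    static Optional<BigDecimal> parseAmount(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String cleaned = raw.replaceAll("[^0-9.\\-]", "");
        try {
            return Optional.of(new BigDecimal(cleaned));
        } catch (NumberFormatException e) {
            // Nothing numeric survived the clean-up (or the text was malformed).
            return Optional.empty();
        }
    }
}
```

With such a helper, the last step of the stream could become `.flatMap(AmountParser::parseAmount)` instead of `.map(BigDecimal::new)`.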

The AnalyzeExpense API analyses the document and returns an object containing every field it could identify. Each field identified by Textract comes with a confidence value.

To pick the most reliable value, it is better to take the field with the highest confidence. For the invoice total, the field type returned by AWS Textract was “TOTAL”.

There are many other fields that can be extracted from the invoice.

- INVOICE_RECEIPT_ID - indicating the invoice id
- INVOICE_RECEIPT_DATE - indicating invoice date
- TAX - indicating the total tax in invoice
- VENDOR_NAME - indicating the vendor
- DUE_DATE - indicating invoice due date
- ITEM - indicating each item in the invoice
- QUANTITY - indicating quantity of each item in the invoice
- PRICE - indicating price of each item in the invoice
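
The filter-then-max pattern used for TOTAL carries over to any of the field types listed above. Here is a minimal sketch of a reusable lookup. To keep it runnable without the AWS SDK, it uses a hypothetical Field class as a stand-in for Textract's ExpenseField (type label, confidence of that label, detected value). FieldPicker and best are illustrative names as well.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class FieldPicker {

    // Hypothetical stand-in for the SDK's ExpenseField.
    static final class Field {
        final String type;
        final float confidence;
        final String value;

        Field(String type, float confidence, String value) {
            this.type = type;
            this.confidence = confidence;
            this.value = value;
        }
    }

    // Returns the value of the most confident field of the given type,
    // ignoring candidates at or below minConfidence.
    static Optional<String> best(List<Field> fields, String type, float minConfidence) {
        return fields.stream()
                .filter(f -> f.type.equalsIgnoreCase(type))
                .filter(f -> f.confidence > minConfidence)
                .max(Comparator.comparingDouble((Field f) -> f.confidence))
                .map(f -> f.value);
    }
}
```

For example, `best(fields, "TAX", 80f)` would feed taxAmount and `best(fields, "INVOICE_RECEIPT_DATE", 80f)` would feed receiptDate, each still needing its own parsing step.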

Important: you need appropriate permissions to use the Textract APIs. Since I am using AWS SAM for my Lambda here, granting the necessary privilege just means adding one more policy (TextractPolicy) to my sam.yaml file.

Published by Khan