In this blog post, we will look at how to process the SROIE dataset and extract key information from invoices.

SROIE dataset

For the invoice dataset, we use the dataset from the ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction (SROIE).

Problem statement

The dataset has 1000 whole scanned receipt images. Each receipt image contains around four key text fields, such as goods name, unit price and total cost. The text annotated in the dataset mainly consists of digits and English characters. An example scanned receipt is shown below:


The competition is divided into 3 tasks:

  1. Scanned Receipt Text Localisation: The aim of this task is to accurately localize texts with 4 vertices.
  2. Scanned Receipt OCR: The aim of this task is to accurately recognize the text in a receipt image. No localisation information is provided, or is required.
  3. Key Information Extraction from Scanned Receipts: The aim of this task is to extract texts of a number of key fields from given receipts, and save the texts for each receipt image in a json file.
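As a rough illustration of the output expected from the third task, the sketch below writes the extracted fields of one receipt to its own JSON file. The field names (`company`, `date`, `address`, `total`), the sample values and the helper name `save_receipt_fields` are assumptions made for this example and should be checked against the competition kit.

```python
import json
import os
import tempfile

# Assumed key fields; the official field list may differ.
KEY_FIELDS = ("company", "date", "address", "total")


def save_receipt_fields(fields, path):
    """Write the key fields of one receipt to a JSON file (one file per image)."""
    # Keep only the expected keys and default missing ones to an empty string.
    record = {key: str(fields.get(key, "")) for key in KEY_FIELDS}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=4)
    return record


# Hypothetical values for a single receipt; the address was not found.
example = {"company": "ABC STORE", "date": "01/01/2019", "total": "12.50"}
out_path = os.path.join(tempfile.mkdtemp(), "receipt_0001.json")
saved = save_receipt_fields(example, out_path)
print(saved)
```

Defaulting missing fields to an empty string keeps every output file on the same schema, which makes downstream evaluation simpler.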

We will go a step further and deploy this as a Flask app, making a generic tool out of the models.

The concepts below are key to the processing that follows.


PyTorch is a Python-based package, developed by Facebook, that serves as a replacement for NumPy and provides flexibility as a deep learning development platform.


Tensors are similar to NumPy’s ndarrays, with the addition that tensors can also be used on a GPU to accelerate computing.

Tensors are multi-dimensional matrices.


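This will create an X by Y dimensional tensor. If the constructor form torch.Tensor(x, y) is used, the values are uninitialized memory contents that only look random; torch.rand(x, y) draws them from a uniform distribution on [0, 1).

To create a 6×4 tensor with values randomly selected from a uniform distribution between -1 and 1: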

torch.Tensor(6, 4).uniform_(-1, 1)

Tensors have a size() method that can be called to check their size; the shape attribute holds the same information.
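The tensor operations described above can be sketched as follows. This is a minimal example that assumes PyTorch is installed; the variable names are illustrative only.

```python
import torch

# Allocate a 6x4 tensor, then fill it in place with uniform samples
# between -1 and 1.
x = torch.empty(6, 4).uniform_(-1, 1)

# torch.rand samples from [0, 1); rescaling gives the same range.
y = torch.rand(6, 4) * 2 - 1

print(x.size())   # size() is a method and returns a torch.Size
print(x.shape)    # shape is the attribute form of the same information

# Move the tensor to the GPU only when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = x.to(device)
```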



CTPN stands for Connectionist Text Proposal Network.
CTPN is a deep learning method that accurately predicts text lines in a natural image. It is an end-to-end trainable model that consists of both CNN and RNN layers.

This algorithm detects text or words in any kind of image, including both scanned documents and natural images. It accurately localizes text lines by detecting each line as a sequence of fine-scale text proposals directly in convolutional feature maps. CTPN works reliably on multi-scale and multi-language text without further post-processing, departing from previous bottom-up methods that require multi-step post-processing. It is computationally efficient, at 0.14 s/image using the very deep VGG16 model.


The VGG network architecture was introduced by Simonyan and Zisserman in their 2014 paper, Very Deep Convolutional Networks for Large-Scale Image Recognition.
This network is characterized by its simplicity, using only 3×3 convolutional layers stacked on top of each other in increasing depth. Reducing volume size is handled by max pooling. Two fully-connected layers, each with 4,096 nodes, are then followed by a softmax classifier.
In 2014, 16 and 19 layer networks were considered very deep (although we now have the ResNet architecture, which can be successfully trained at depths of 50-200 for ImageNet and over 1,000 for CIFAR-10).
Due to its depth and number of fully-connected nodes, the serialized weights are large: over 533MB for VGG16 and 574MB for VGG19.
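To see where that size comes from, the following sketch counts the parameters of VGG16 layer by layer. It assumes the standard 16-layer configuration with a 224×224 RGB input and a 1000-class output; the function name is made up for this example. Stored as 32-bit floats, the raw weights come to roughly half a gigabyte, the same order as the file sizes quoted above (the exact figure depends on the serialization format).

```python
# Standard VGG16 layout: numbers are conv output channels, "M" is 2x2 max pooling.
VGG16_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]


def count_vgg16_parameters(num_classes=1000, input_size=224):
    """Return (conv_params, fc_params) for VGG16 with 3x3 convolutions."""
    conv_params, in_channels, size = 0, 3, input_size
    for layer in VGG16_CONV:
        if layer == "M":
            size //= 2  # max pooling halves the spatial resolution
        else:
            # 3x3 kernel weights per input channel, plus one bias per output channel
            conv_params += (3 * 3 * in_channels + 1) * layer
            in_channels = layer

    fc_params = 0
    in_features = in_channels * size * size  # flattened feature map
    for out_features in (4096, 4096, num_classes):
        fc_params += (in_features + 1) * out_features
        in_features = out_features
    return conv_params, fc_params


conv_params, fc_params = count_vgg16_parameters()
total = conv_params + fc_params
size_mb = total * 4 / 1e6  # 4 bytes per float32 parameter
print(conv_params, fc_params, total, round(size_mb, 1))
```

Most of the parameters sit in the fully-connected layers, which is one reason CTPN keeps only the convolutional part of VGG16.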

CTPN processes an image in the following steps:

  1. First, the input image is passed through a pretrained VGG16 model (trained on the ImageNet dataset).
  2. The feature maps output by the last convolutional layer of the VGG16 model are taken.
  3. A 3×3 spatial window is slid over these feature maps.
  4. The outputs of the 3×3 spatial window are passed through a 256-D bi-directional Recurrent Neural Network (RNN).
  5. The recurrent output is then fed to a 512-D fully connected layer.
  6. Finally comes the output layer, which produces 3 different outputs: 2k vertical coordinates, 2k text/non-text scores and k side refinement values.
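The meaning of the three outputs is easier to see with numbers. The sketch below assumes the anchor design described in the CTPN paper: k = 10 anchors per location, a fixed width of 16 pixels, and heights growing from 11 pixels by a factor of 1/0.7. The vertical coordinates are assumed to be encoded relative to the anchor as v_c = (c_y - c_y_a) / h_a and v_h = log(h / h_a). All function names are illustrative.

```python
import math

ANCHOR_WIDTH = 16   # assumed fixed horizontal width of every proposal, in pixels
NUM_ANCHORS = 10    # k in the text


def anchor_heights(k=NUM_ANCHORS, smallest=11.0, ratio=0.7):
    """k anchor heights, each obtained by dividing the previous one by `ratio`."""
    return [smallest / (ratio ** i) for i in range(k)]


def decode_vertical(v_c, v_h, anchor_center_y, anchor_height):
    """Invert v_c = (c_y - c_y_a) / h_a and v_h = log(h / h_a)."""
    center_y = v_c * anchor_height + anchor_center_y
    height = math.exp(v_h) * anchor_height
    return center_y - height / 2.0, center_y + height / 2.0


def outputs_per_location(k=NUM_ANCHORS):
    """Number of values predicted at each position of the feature map."""
    return {
        "vertical_coordinates": 2 * k,  # (v_c, v_h) for each anchor
        "scores": 2 * k,                # text / non-text for each anchor
        "side_refinement": k,           # horizontal offset for each anchor
    }


heights = anchor_heights()
top, bottom = decode_vertical(0.0, 0.0, anchor_center_y=100.0, anchor_height=20.0)
print([round(h) for h in heights], (top, bottom), outputs_per_location())
```

A zero prediction for both vertical values simply returns the anchor box itself, which is a handy sanity check when debugging a decoder.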

The results of CTPN are passed as inputs to the next module, CRNN.


We use a CRNN (Convolutional Recurrent Neural Network) to recognize the scene text.

This is a PyTorch implementation of CRNN, which is based on @meijieru’s repository.

  1. A convolutional neural network extracts features from the image.
  2. A recurrent neural network predicts a sequential output per time step.
  3. The CTC loss function acts as the transcription layer, turning the per-time-step predictions into the final output sequence.

The Convolutional Recurrent Neural Network (CRNN) is a combination of a DCNN and an RNN: a neural network architecture that integrates feature extraction, sequence modeling and transcription into a unified framework.
The architecture has four distinctive properties:

  1. It is end-to-end trainable, in contrast to most existing algorithms, whose components are trained and tuned separately.
  2. It naturally processes sequences of arbitrary length without involving character segmentation or horizontal scale normalization.
  3. It is not limited to any predefined vocabulary and achieves significant performance on both dictionary-free and dictionary-based scene text recognition tasks.
  4. It produces an effective and much smaller model, which is more practical for real-world scenarios. Experiments on standard benchmark data sets including the IIIT-5K, Street View Text and ICDAR data sets demonstrate that the proposed algorithm is more advantageous than the prior art.


Convolutional Neural Network (ConvNet/CNN) is a Deep Learning algorithm which can take in an input image, assign importance (learnable weights and biases) to various aspects/objects in the image and be able to differentiate one from the other. The pre-processing required in a ConvNet is much lower as compared to other classification algorithms. While in primitive methods filters are hand-engineered, with enough training, ConvNets have the ability to learn these filters/characteristics.

The architecture of a ConvNet is analogous to that of the connectivity pattern of Neurons in the Human Brain and was inspired by the organization of the Visual Cortex. Individual neurons respond to stimuli only in a restricted region of the visual field known as the Receptive Field. A collection of such fields overlap to cover the entire visual area.
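To make the idea of a filter and its receptive field concrete, here is a small pure-Python sketch of the sliding-window operation used in convolutional layers (strictly a cross-correlation, as in most deep learning libraries). The kernel is a hand-engineered vertical edge detector and the image is a toy example; in a ConvNet the kernel values would be learned instead.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding and return the response map."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Each output value only sees a k_h x k_w patch: its receptive field.
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(k_h) for j in range(k_w)))
        output.append(row)
    return output


# Toy 5x5 image: bright on the left two columns, dark elsewhere.
image = [[1, 1, 0, 0, 0] for _ in range(5)]

# Hand-engineered 3x3 filter that responds to vertical edges.
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]

response = conv2d_valid(image, kernel)
print(response)
```

The response is non-zero only where the 3×3 window overlaps the bright-to-dark transition and zero in the flat region on the right.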

A Recurrent Neural Network is a generalization of a feedforward neural network that has an internal memory. An RNN is recurrent in nature: it performs the same function for every input, while the output for the current input depends on the previous computation. After the output is produced, it is copied and sent back into the recurrent network. To make a decision, the network considers the current input together with what it has learned from the previous inputs.

Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition. In other neural networks, all the inputs are treated as independent of each other, whereas in an RNN each input is processed in the context of the previous ones.
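The role of the internal state can be shown with a toy scalar recurrence. This is only a sketch with made-up weights (`w_x`, `w_h`), not a trained model: the same function is applied at every step, and the hidden value carries information from earlier inputs forward.

```python
import math


def rnn_forward(inputs, w_x=0.5, w_h=0.8, bias=0.0):
    """Apply h_t = tanh(w_x * x_t + w_h * h_(t-1) + bias) over a sequence."""
    hidden = 0.0
    states = []
    for x in inputs:
        hidden = math.tanh(w_x * x + w_h * hidden + bias)
        states.append(hidden)
    return states


# Same values, different order: the final state differs because of the memory.
early = rnn_forward([1.0, 0.0, 0.0])
late = rnn_forward([0.0, 0.0, 1.0])

# With w_h = 0 the memory is switched off and only the current input matters.
no_memory = rnn_forward([1.0, 0.0, 0.0], w_h=0.0)
print(early[-1], late[-1], no_memory[-1])
```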

CTC Loss:
CTC stands for Connectionist Temporal Classification.

To train a neural network, we need to calculate the loss function. CTC loss is somewhat different from other deep learning losses. Computing it takes two steps:

  1. Sum the probabilities of all possible alignments of the text present in the image.
  2. Take the negative logarithm of this sum to obtain the loss.

Let's say the text “cat” is present in the image and there are 5 input time steps. We then need to sum over every way “cat” can appear in the output across those five time steps. Enumerating the alignments directly can be very expensive, but CTC loss uses dynamic programming to compute the sum, which makes it much faster.
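The two ways of computing that sum can be compared on the “cat” example. The sketch below is a pure-Python illustration, not the implementation used in training: it assumes a tiny alphabet, a non-empty target, and uniform per-step probabilities, and checks that brute-force enumeration and the dynamic-programming (forward) recursion agree.

```python
import itertools
import math

BLANK = "-"  # CTC blank symbol


def collapse(path):
    """Merge repeated symbols, then drop blanks (the CTC collapsing rule)."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return "".join(s for s in merged if s != BLANK)


def ctc_prob_brute_force(probs, alphabet, target):
    """Sum the probability of every alignment that collapses to `target`."""
    total = 0.0
    for path in itertools.product(range(len(alphabet)), repeat=len(probs)):
        if collapse([alphabet[i] for i in path]) == target:
            total += math.prod(probs[t][i] for t, i in enumerate(path))
    return total


def ctc_prob_forward(probs, alphabet, target):
    """Same quantity via the forward recursion (assumes a non-empty target)."""
    extended = [BLANK]
    for ch in target:
        extended += [ch, BLANK]  # blanks around every character
    index = {s: i for i, s in enumerate(alphabet)}
    alpha = [0.0] * len(extended)
    alpha[0] = probs[0][index[extended[0]]]
    alpha[1] = probs[0][index[extended[1]]]
    for t in range(1, len(probs)):
        new = [0.0] * len(extended)
        for s, sym in enumerate(extended):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one
            if s >= 2 and sym != BLANK and sym != extended[s - 2]:
                total += alpha[s - 2]             # skip the blank in between
            new[s] = total * probs[t][index[sym]]
        alpha = new
    return alpha[-1] + alpha[-2]


alphabet = [BLANK, "a", "c", "t"]
probs = [[0.25] * len(alphabet) for _ in range(5)]  # 5 time steps, uniform
p_brute = ctc_prob_brute_force(probs, alphabet, "cat")
p_forward = ctc_prob_forward(probs, alphabet, "cat")
loss = -math.log(p_forward)
print(p_brute, p_forward, loss)
```

The brute-force version visits every one of the 4^5 paths, while the forward recursion only touches each (time step, symbol) pair once, which is what makes CTC practical for long sequences.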

In scanned receipts each text usually contains several words. We add the blank space between words to the alphabet for LSTM prediction and thus improve the network from single word recognition to multiple words recognition. Moreover, we double the input image width to tackle the overlap problem of long texts after max-pooling and stack one more LSTM, enhancing the accuracy per character in the training set from 62% to 83%.
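The effect of adding the space character to the alphabet can be sketched with a greedy CTC decoder. The alphabet contents, the index layout (0 reserved for the CTC blank) and the function names are assumptions for this example; the per-frame indices stand in for the argmax of the LSTM outputs.

```python
import string

BLANK_INDEX = 0  # reserved for the CTC blank, which is not a real character

# Assumed alphabet: digits, upper-case letters, some punctuation,
# plus the space character so that multi-word lines can be predicted.
ALPHABET = string.digits + string.ascii_uppercase + ".,:/-$" + " "


def encode(text):
    """Map characters to class indices (shifted by one to leave room for blank)."""
    return [ALPHABET.index(ch) + 1 for ch in text]


def greedy_decode(indices):
    """Collapse repeated indices, then drop blanks."""
    chars = []
    previous = BLANK_INDEX
    for idx in indices:
        if idx != BLANK_INDEX and idx != previous:
            chars.append(ALPHABET[idx - 1])
        previous = idx
    return "".join(chars)


# Simulated per-frame predictions: each character spans two frames,
# followed by one blank frame.
frames = []
for idx in encode("ROOM 22"):
    frames += [idx, idx, BLANK_INDEX]

decoded = greedy_decode(frames)
merged = greedy_decode(encode("ROOM"))  # no blank between the two O's
print(repr(decoded), repr(merged))
```

Without a blank frame between repeated characters the two O's are merged into one, which is the kind of overlap that a wider input image (more frames per character) helps to avoid.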

Key information extraction as character-wise classification with LSTM

This method tackles the key information extraction problem as a character-wise classification problem with a simple stacked bidirectional LSTM. The method first formats the text from an image into a single sequence. The sequence is then fed into a two-layer bidirectional LSTM to produce, for each character, a classification label from 5 classes: four key information categories and one “others” class. The method is simple, with just a two-layer bidirectional LSTM implemented in PyTorch, and proves to be sufficient for understanding the context of a receipt text and outputting highly accurate results.
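The data formatting behind this approach can be sketched without any model. The example below joins the recognized lines into one sequence, builds a per-character label from the 5 classes, and shows the inverse step of reading fields back from per-character labels. The class names (`company`, `date`, `address`, `total`), the sample receipt and both helper functions are assumptions for illustration; in the real pipeline the labels would come from the bidirectional LSTM.

```python
# Class 0 is "other"; the four field names are assumed for this example.
CLASSES = ("other", "company", "date", "address", "total")


def label_characters(text, fields):
    """Return one class id per character of `text` (training targets)."""
    labels = [0] * len(text)
    for name, value in fields.items():
        start = text.find(value)
        if value and start >= 0:
            class_id = CLASSES.index(name)
            for i in range(start, start + len(value)):
                labels[i] = class_id
    return labels


def extract_fields(text, labels):
    """Inverse step: collect the characters assigned to each field class."""
    result = {}
    for class_id, name in enumerate(CLASSES):
        if class_id == 0:
            continue
        chars = [ch for ch, lab in zip(text, labels) if lab == class_id]
        result[name] = "".join(chars).strip()
    return result


# Hypothetical OCR output for one receipt, one string per text line.
lines = ["ABC STORE", "12 MAIN ROAD", "01/01/2019", "TOTAL 12.50"]
text = "\n".join(lines)  # a single sequence for the LSTM
fields = {"company": "ABC STORE", "date": "01/01/2019",
          "address": "12 MAIN ROAD", "total": "12.50"}

labels = label_characters(text, fields)
recovered = extract_fields(text, labels)
print(labels)
print(recovered)
```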

CRNN’s network architecture consists of three parts, from the bottom up: the convolutional layers, the recurrent layers and the transcription layer.

At the bottom of the CRNN, the convolutional layers automatically extract a feature sequence from each input image. On top of the convolutional network, a recurrent network is built to make a prediction for each frame of the feature sequence output by the convolutional layers. The transcription layer at the top of the CRNN converts the per-frame predictions of the recurrent layers into a label sequence. Although CRNN is composed of different kinds of network architectures (such as CNN and RNN), it can be trained jointly with a single loss function.

CRNN can take input images of different sizes and produce predictions of different lengths. It runs directly on coarse-grained labels (such as words), so there is no need to annotate each individual element (such as a character) during the training phase. In addition, since CRNN abandons the fully connected layers used in traditional neural networks, it yields a more compact and efficient model. All of these attributes make CRNN an excellent method for image-based sequence recognition.
Experiments on scene text recognition benchmark datasets show that CRNN achieves superior or highly competitive performance compared to traditional methods and other CNN and RNN based algorithms. This confirms the advantages of the proposed algorithm. In addition, CRNN significantly outperforms other competitors on an optical music recognition (OMR) benchmark dataset, which verifies the generality of CRNN.

output of CRNN

We have used the architecture above for text recognition with CRNN and have developed a simple web application that consumes the models.

Visit for a detailed explanation on this blog.

You may refer to the repository and create pull requests for further enhancements.