Shiraz University - CSE Dept - Machine Learning Project # 2 by M.M.Haji
mehdi.haji@gmail.com

Text Segmentation Using Decision Trees

I. Introduction

In this project, we apply decision trees to separate text from non-text blocks in a document image. This is an important part of Document Image Analysis and Recognition, because the text must be detected first and then recognized. So various approaches have been proposed to solve the problem and most of them use texture classification and image segmentation based on Fourier power spectrum, mean and covariance, spatial texture energy, DCT and wavelet transform (and ...) features and finally a 2-means classification algorithm may be applied on these features to classify the image blocks as text or non-text. Hence these methods does not need training data. In our new method, we use the training data generated in project 1 to learn a decision tree for the text segmentation problem. In the rest of this report we describe our new method, show some experimental results, compare it with the method described in [2] and finally draw some conclusions.

 II. Proposed Segmentation Method

At first, we train a decision tree using the continuous data we generated in porject 1. We have about 100,000 training example but we selected 1000 of them randomly and used them to train the decision tree faster, as you see in make_text_dtree.m. After training, the decision tree have a good generalization power even with the 1000 training samples as you will see in part III of this report. We segment the input image using the following procedure:

    1. Segment the gray-scale input image to non-overlapping 8 x 8 blocks.
    2. Apply DCT to each block and select the 18 coefficients that described in project 1 as the feature vector.
    3. Show this vector to classifier. If the classifier's output is Yes or above .5 label this block as text and otherwise non-text.
    4. Post-process the output  image to reduce noise efffect and improve segmentation accuracy.

The Matlab code for this procedure is remove_nontext4.m. In the Matlab environment type 'help remove_nontext4' to see an example. In step 4 of the procedure we reduced noise effect using rule-based smoothing and some morphological operations. The most successful smoothing scheme we tried was similar to the set of rules in [3], as described below:

0

0

0

0

1

0

0

0

0


Isolated text becomes graph

0

0

0

0

0

0

0

0

0

 

1

1

1

1

0

1

1

1

1


Isolated graph becomes text

1

1

1

1

1

1

1

1

1

 

0

1

1

1

1

1

1

1

1

0

1

1

1

1

1

1

 

 

1

1

0

1

1

1

1

1

1

1

1

0

1

1

1

1

 

Semi-rectangular text regions become rectangular

These rules are implemented in C++ to be as fast as possible; the executable file is rbsmooth.exe (send me an email if you want a Linux binary) which accepts the input file name as its first command line argument, the output file name as the second, smoothing type as the third and the number of repetitions as the last argument. The input file is assumed to be a black and white (bi-level) in BMP format.

 III. Experimental Results

We input the image 1 to our segmentation method, the output without post-processing of step 4, is shown in image 2 and the final output (after post-processing) is shown in image 3:

(1) Input image

(2) Output image before post-processing

(3) Output image after post-processing

And if we apply the mask to the original image:

We have implemented remove_nontext1.m that is another text segmentation method as described in [2], the method is based on high frequency wavelet coefficients. If we input the same image to the algorithm, the following images will be resulted:

Output image before post-processing

Output image after post-processing

Segmented image

It is obvious that segmented image is obtained by applying the post-processed image (mask) to the original image. Below you see the result of another experiment:

Input image

Segmented image by our new method

Segmented image by the method [2]

The input image contains a "white" text region and some textures; as you see the wavelet based method, identified the text correctly but also took into account the textures. By these two  image and some other test images, it is clear that our new method, outperforms their wavelet based method. The only advantage of their wavelet based method is its speed, their method is about 4 times faster than ours.

IV. Conclusions

It seems that our new method gives promising results for the text segmentation problem. As mentioned before, we used a very small fraction of our training data - that was selected randomly - to learn the decision tree. One method that may improve classification accuracy is to train several decision trees (at least three) by several random fraction of the original data set and then use the majority vote to find the classification result. Using some other image transforms and features instead of DCT-18 and other classifiers, such as neural networks and bayes classifiers, instead of decision trees, may further improve the classification accuracy. Analyzing the effect of proposed  methods and a comparison to some of the past approaches, are left as another projects.

References:

1. K. Konstantinides and D. Tretter, "A JPEG Variable Quantization Method for Compound Documents", IEEE Transactions on Image Processing, Vol.9, No.7, July 2000, p2.

2. Shulan Deng and Shahram Latifi, "Fast Text Segmentation Using Wavelet for Document Processing", Department of Electrical and Computer Engineering, University of Nevada, Las Vegas, 2000.

3. J. Minguillon, J. Pujol and K. Zeger, "Progressive Classification Scheme for Document Layout", SPIE Proceedings, Mathematical Modeling, Bayesian Estimation, and Inverse Problems, 1999.