mehdi.haji@gmail.com
Text Segmentation Using Decision Trees
I. Introduction
In this project, we apply decision trees to separate text from non-text blocks in a document image. This is an important part of Document Image Analysis and Recognition, because the text must be detected first and then recognized. So various approaches have been proposed to solve the problem and most of them use texture classification and image segmentation based on Fourier power spectrum, mean and covariance, spatial texture energy, DCT and wavelet transform (and ...) features and finally a 2-means classification algorithm may be applied on these features to classify the image blocks as text or non-text. Hence these methods does not need training data. In our new method, we use the training data generated in project 1 to learn a decision tree for the text segmentation problem. In the rest of this report we describe our new method, show some experimental results, compare it with the method described in [2] and finally draw some conclusions.
II. Proposed Segmentation Method
At first, we train a decision tree using the continuous data we generated in
porject 1. We have about
100,000 training example but we selected 1000 of them randomly and used them to
train the decision tree faster, as you see in make_text_dtree.m. After training, the decision tree
have a good generalization power even with the 1000 training samples as you
will see in part
1. Segment the gray-scale input image to non-overlapping
8 x 8 blocks.
2. Apply DCT to each block and select the 18 coefficients
that described in project 1 as the feature vector.
3. Show this vector to classifier. If the classifier's
output is Yes or above .5 label this block as text and otherwise non-text.
4. Post-process the output image to reduce noise efffect and improve segmentation accuracy.
The Matlab code for this procedure is remove_nontext4.m. In the Matlab
environment type 'help remove_nontext4' to see an example. In step 4 of the
procedure we reduced noise effect using rule-based smoothing and some
morphological operations. The most successful smoothing scheme we tried was
similar to the set of rules in [3], as described below:
|
→ |
|
|
→ |
|
|
→ |
|
|
→ |
|
||||||||||||||||
|
|
|
||||||||||||||||||||
|
→ |
|
|
→ |
|
||||||||||||||||
|
|
|||||||||||||||||||||
|
Semi-rectangular text regions become rectangular |
|||||||||||||||||||||
These rules are implemented in C++ to be as fast as possible; the executable file is rbsmooth.exe (send me an email if you want a Linux binary) which accepts the input file name as its first command line argument, the output file name as the second, smoothing type as the third and the number of repetitions as the last argument. The input file is assumed to be a black and white (bi-level) in BMP format.
We input the image 1 to our segmentation method, the output without
post-processing of step 4, is shown in image 2 and the final output (after
post-processing) is shown in image 3:
|
(1) Input image |
(2) Output image before post-processing |
(3) Output image after post-processing |
And if we apply the mask to the original image:

We have implemented remove_nontext1.m that
is another text segmentation method as described in [2], the method is based on
high frequency wavelet coefficients. If we input the same image to the
algorithm, the following images will be resulted:
|
Output image before post-processing |
Output image after post-processing |
Segmented image |
It is obvious that segmented image is obtained by applying the
post-processed image (mask) to the original image. Below you see the result of
another experiment:
|
Input image |
Segmented image by our new method |
Segmented image by the method [2] |
The input image contains a "white" text region and some textures; as you see the wavelet based method, identified the text correctly but also took into account the textures. By these two image and some other test images, it is clear that our new method, outperforms their wavelet based method. The only advantage of their wavelet based method is its speed, their method is about 4 times faster than ours.
IV. Conclusions
It seems that our new method gives promising results for the text segmentation problem. As mentioned before, we used a very small fraction of our training data - that was selected randomly - to learn the decision tree. One method that may improve classification accuracy is to train several decision trees (at least three) by several random fraction of the original data set and then use the majority vote to find the classification result. Using some other image transforms and features instead of DCT-18 and other classifiers, such as neural networks and bayes classifiers, instead of decision trees, may further improve the classification accuracy. Analyzing the effect of proposed methods and a comparison to some of the past approaches, are left as another projects.
References:
1. K. Konstantinides and D. Tretter,
"A JPEG Variable Quantization Method for Compound Documents", IEEE Transactions
on Image Processing, Vol.9, No.7, July 2000, p2.
2. Shulan
Deng and Shahram Latifi,
"Fast Text Segmentation Using Wavelet for Document Processing",
Department of Electrical and Computer Engineering, University of Nevada, Las
Vegas, 2000.
3. J. Minguillon, J. Pujol and K. Zeger, "Progressive
Classification Scheme for Document Layout", SPIE Proceedings, Mathematical
Modeling, Bayesian Estimation, and Inverse Problems, 1999.