mehdi.haji@gmail.com
Generating Training Data For The Text Segmentation Problem
In this project, we show how to generate training data for the text segmentation problem, that is, the problem of separating text blocks from non-text components in a mixed document image. Text segmentation is an important part of Document Image Analysis and Recognition, and various approaches have been proposed to solve it. Most of these methods, however, focus on choosing appropriate texture features and classifiers rather than on generating appropriate training data, which is the aim of this project. With a large training set in hand, tuning and comparing various classifiers is straightforward.
We extracted the training data from the following set of images, selected from a wide range of categories. The images contain both machine-printed and handwritten text, in Farsi and English, in different fonts and sizes:
[Training images]
The text / non-text concept is defined for blocks of an image; in this project we used the typical 8 x 8 blocks. For each block we compute the DCT, which gives an 8 x 8 matrix of DCT coefficients. We used the DCT-18 features because they effectively capture the difference between text and non-text blocks. These are the 18 elements of the coefficient matrix at positions [4 5 6 12 13 14 20 21 22 44 45 46 52 53 54 60 61 62], counting from 1 and proceeding row by row.
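The feature extraction described above can be sketched in Python as follows. This is only an illustration, not the author's gen_text_data.m: it assumes an orthonormal 2-D DCT-II (the same scaling as Matlab's dct2), and the function names dct2_8x8 and dct18_features are hypothetical.

```python
import math

N = 8  # block size used in the text

# 1-based, row-by-row positions of the DCT-18 coefficients listed above
DCT18_POSITIONS = [4, 5, 6, 12, 13, 14, 20, 21, 22,
                   44, 45, 46, 52, 53, 54, 60, 61, 62]


def dct2_8x8(block):
    """Orthonormal 2-D DCT-II of an 8 x 8 block (a list of 8 rows of 8 numbers).

    Assumption: orthonormal scaling; the original script may scale differently.
    """
    def alpha(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    # cos_table[k][n] = cos(pi * (2n + 1) * k / (2N))
    cos_table = [[math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                  for n in range(N)] for k in range(N)]

    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for r in range(N):
                for c in range(N):
                    total += block[r][c] * cos_table[u][r] * cos_table[v][c]
            coeffs[u][v] = alpha(u) * alpha(v) * total
    return coeffs


def dct18_features(block):
    """Return the 18 DCT coefficients at the positions listed in the text."""
    coeffs = dct2_8x8(block)
    features = []
    for pos in DCT18_POSITIONS:
        row, col = divmod(pos - 1, N)  # convert 1-based row-major position
        features.append(coeffs[row][col])
    return features


# Usage: a uniform block has no energy outside the DC coefficient,
# so all 18 features are (numerically) zero.
flat_block = [[128] * N for _ in range(N)]
flat_features = dct18_features(flat_block)
```

A direct quadruple loop is slow for full images but keeps the definition visible; a production version would more likely use a library DCT.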
The target value is 1 for a text block and 0 for a non-text block. To obtain it, we manually created a mask image for each of the training images above:
[Mask images]
If the corresponding 8 x 8 block of the mask image contains more than 32 white pixels (that is, more than half of its 64 pixels), we label the image block as text, so the target value is 1; otherwise we label it as non-text. The Matlab source code for this project is in gen_text_data.m.
Executing this script generates a text file. The first line of the file contains the attribute names followed by the name of the target concept, and each subsequent line represents one training example: the first 18 floating-point numbers are the attributes and the last number, 0 or 1, is the target value.
The attribute values are continuous, however, and so are not suitable for learning algorithms that expect discrete attributes, such as the ID3 decision tree and the basic (categorical) naive Bayes classifier. We should therefore either use C4.5, which handles continuous attributes, in place of ID3, or convert the data to discrete form. In reformat_dtree.m we used the following set of rules to convert a continuous-valued variable x into discrete form:
replace x < -250 by 'S3'
replace -250 <= x < -150 by 'S2'
replace -150 <= x < -50 by 'S1'
replace -50 <= x < 50 by 'CE'
replace 50 <= x < 150 by 'B1'
replace 150 <= x < 250 by 'B2'
replace 250 <= x by 'B3'
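The labeling rule and the interval rules above can be sketched in Python as follows. This is a minimal illustration rather than the contents of gen_text_data.m or reformat_dtree.m: the function names label_block and discretize are hypothetical, and the sketch assumes that white mask pixels are stored as 255.

```python
def label_block(mask_block, white=255, threshold=32):
    """Return 1 (text) if the mask block has more than `threshold` white pixels.

    Assumption: white pixels are encoded as 255 in the mask image.
    """
    white_pixels = sum(1 for row in mask_block for p in row if p == white)
    return 1 if white_pixels > threshold else 0


# (exclusive upper bound, symbol) pairs, in increasing order
BINS = [(-250, "S3"), (-150, "S2"), (-50, "S1"),
        (50, "CE"), (150, "B1"), (250, "B2")]


def discretize(x):
    """Map a continuous attribute value to one of the seven symbols."""
    for upper, symbol in BINS:
        if x < upper:
            return symbol
    return "B3"  # 250 <= x


# Usage: values exactly on a boundary fall into the interval above it.
values = [-300.0, -250.0, -50.0, 49.9, 50.0, 250.0]
symbols = [discretize(x) for x in values]
# symbols == ['S3', 'S2', 'CE', 'CE', 'B1', 'B3']
```

Keeping the boundaries in one ordered table makes it easy to try a different binning without touching the lookup logic.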
The output file generated by reformat_dtree.m follows the input conventions of dTree, a free Java application that can be downloaded from http://www.cs.ubc.ca/labs/lci/CIspace/. However, dTree is intended for teaching purposes and is not suitable for a data set as large as ours.
References:
1. K. Konstantinides and D. Tretter, "A JPEG Variable Quantization Method for Compound Documents", IEEE Transactions on Image Processing, Vol. 9, No. 7, July 2000, p. 1282.
2. "Image Classification for Coding" technical report, dept of ..., Aug 1999, http://ise.stanford.edu/class/ee368b/projects2000/Projects/mkalman/report.html