Tesseract: C# Tesseract 3.02: how do I access each character of a word in an image?

Created on 12 Jan 2014 · 3 Comments · Source: charlesw/tesseract

Hi, I'm a newbie here.
First, I need to draw a rectangle around each character of a word in an image.
In the old version of Tesseract, I found that each character can be accessed with:

foreach (tessnet2.Character c in word.CharList)
e.Graphics.DrawRectangle..........

But now I'm working on a C# WinForms application with Tesseract 3.02:

TesseractEngine a = new TesseractEngine(@"./tessdata", "eng", EngineMode.TesseractAndCube);
Tesseract.Page page1 = a.Process(image);
foreach ( ....... in page1)
{
// draw rectangle from (bounding box of each character)
}

Question 1: How do I access each character of page1?

I tried many approaches, such as PageIteratorLevel, and could get parts of the page like the first line, first word, or first block, but I can't get the first character of any of them.
I also noticed that in the hOCR text output for page1, each element (word, line, block) has a bounding box value.

Question 2: How do I get the bounding box value of each element? (I found only one method, "TryGetBoundingBox", and it returns only a boolean.)

Thank you.

ominouse

Most helpful comment

Answer for Question 1:

Check out the console sample provided, which shows how to iterate through the results. Something like the following should work:

using (var iter = page.GetIterator()) {
    iter.Begin();
    // Next(level, element) advances to the next element and returns false
    // once the last element within the given level has been reached.
    do {
        do {
            do {
                do {
                    do {
                        if (iter.IsAtBeginningOf(PageIteratorLevel.Block)) {
                            // do whatever you need to do when a block (top-most level result) is encountered.
                        }
                        if (iter.IsAtBeginningOf(PageIteratorLevel.Para)) {
                            // do whatever you need to do when a paragraph is encountered.
                        }
                        if (iter.IsAtBeginningOf(PageIteratorLevel.TextLine)) {
                            // do whatever you need to do when a line of text is encountered.
                        }
                        if (iter.IsAtBeginningOf(PageIteratorLevel.Word)) {
                            // do whatever you need to do when a word is encountered.
                        }

                        // get bounding box for symbol
                        Rect symbolBounds;
                        if (iter.TryGetBoundingBox(PageIteratorLevel.Symbol, out symbolBounds)) {
                            // do whatever you want with bounding box for the symbol
                        }
                    } while (iter.Next(PageIteratorLevel.Word, PageIteratorLevel.Symbol));
                } while (iter.Next(PageIteratorLevel.TextLine, PageIteratorLevel.Word));
            } while (iter.Next(PageIteratorLevel.Para, PageIteratorLevel.TextLine));
        } while (iter.Next(PageIteratorLevel.Block, PageIteratorLevel.Para));
    } while (iter.Next(PageIteratorLevel.Block));
}

Note that the general result hierarchy is as follows:

Block -> Para -> TextLine -> Word -> Symbol

I.e. the result set can contain many Blocks, which can in turn contain many Paragraphs and so on.

Answer for Question 2:

As shown above, the TryGetBoundingBox method returns the bounds in an out parameter, much like Dictionary.TryGetValue does. The boolean return value only indicates whether a bounding box was available.
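
For the rectangle-drawing case in the question, a minimal sketch could look like the following. It assumes a WinForms project referencing System.Drawing and this wrapper. The SymbolBoxes class and its Draw method are hypothetical names used only for illustration and are not part of the library. It walks the page symbol by symbol in a single flat loop and outlines each symbol whose bounding box is available.

```csharp
using System.Drawing;
using Tesseract;

// Hypothetical helper, not part of the Tesseract wrapper.
static class SymbolBoxes
{
    // Outlines every recognized symbol (character) on the page.
    public static void Draw(Page page, Graphics graphics, Pen pen)
    {
        using (var iter = page.GetIterator())
        {
            iter.Begin();
            do
            {
                Rect bounds;
                // bounds is only meaningful when the call returns true.
                if (iter.TryGetBoundingBox(PageIteratorLevel.Symbol, out bounds))
                {
                    graphics.DrawRectangle(pen, bounds.X1, bounds.Y1, bounds.Width, bounds.Height);
                }
            } while (iter.Next(PageIteratorLevel.Symbol)); // false after the last symbol
        }
    }
}
```

From a Paint handler this might be called as SymbolBoxes.Draw(page1, e.Graphics, Pens.Red). The boxes are in image pixel coordinates, so they should only line up if the image is drawn unscaled at the origin of the same Graphics surface.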

charlesw on 13 Jan 2014
