Developing Bot

Antidote 12-16-2014, 08:47 AM

#78

To train Tesseract you'll first need the following files (this example trains the Supercell Magic font and names the new language 'coc'):

coc.supercell-magic.exp0.tif (image file containing letters/numbers/symbols you need to train Tesseract with)
coc.supercell-magic.exp0.box (contains the location of each character inside the tif file and the character it represents)
coc.font_properties
coc.words_list
coc.frequent_words_list
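
If you want to double-check the folder before training, here is a small Python sketch that builds the five file names listed above and reports which ones are missing. The function names (required_files, missing_files) are made up for this example, and the exp0 index is just the one used in this post.

```python
from pathlib import Path

LANG = "coc"               # language prefix used in this post
FONT = "supercell-magic"   # font name as it appears in the file names


def required_files(lang=LANG, font=FONT, exp=0):
    """Return the five training file names for the given language and font."""
    stem = f"{lang}.{font}.exp{exp}"
    return [
        f"{stem}.tif",
        f"{stem}.box",
        f"{lang}.font_properties",
        f"{lang}.words_list",
        f"{lang}.frequent_words_list",
    ]


def missing_files(folder, lang=LANG, font=FONT, exp=0):
    """Return the required file names that are not present in folder."""
    folder = Path(folder)
    return [name for name in required_files(lang, font, exp)
            if not (folder / name).is_file()]
```

Calling missing_files("path/to/your/folder") returns an empty list once everything is in place.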

Now I'll explain how to prepare each file.

You need to install the Supercell font on your system first. Download it from here: http://www71.zippyshare.com/v/46981357/file.html
Open Supercell-Magic_5.ttf and click install.

Then download jTessBoxEditor from this link: http://sourceforge.net/projects/vietocr/...p/download
It's a program written in Java that helps automate the training process. Unzip and run it. If your PC doesn't have Java, download and install it from here: http://www.oracle.com/technetwork/java/j...33155.html

Once you open the program, go to Tiff/Box Generator, click on the font and choose Supercell-Magic. For font size I chose 18pt, but I'm not sure which size is optimal. For noise I set 2. Output is coc (the language prefix).

For Input, you can either type directly into the text box or load a txt file. I used http://psychicscience.org/random.aspx to generate a list of 500 random numbers from 0 to 999999. (It allows a maximum of 2500; I tried that, but it took so long that I gave up after 2 hours.)
http://textmechanic.com/Generate-List-of-Numbers.html allows you to generate the complete list from 0 to 999999, but jTessBoxEditor crashed when I tried to import it.
Once you have the list, copy and paste it into the text box and click Generate. It will generate two files: coc.supercell-magic.exp0.tif and coc.supercell-magic.exp0.box. Move both to a new folder.
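
If you'd rather not rely on a website for the number list, a few lines of Python can produce the same kind of input. This is only a sketch: the function names and the output file name are made up for the example, and the count and range simply mirror the 500 numbers from 0 to 999999 used above.

```python
import random


def make_training_text(count=500, low=0, high=999999, seed=None):
    """Return `count` random integers in [low, high], one per line."""
    rng = random.Random(seed)  # pass a seed to get the same list every time
    return "\n".join(str(rng.randint(low, high)) for _ in range(count))


def write_training_text(path, **kwargs):
    """Write the generated list to a text file (file name is up to you)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(make_training_text(**kwargs) + "\n")
```

For example, write_training_text("coc_numbers.txt") creates a file you can paste from or load as the txt input.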

[Image: DDiISub.png]

For coc.font_properties, I'm not exactly sure which properties are correct for Supercell Magic. The standard format is: <fontname> <italic> <bold> <fixed> <serif> <fraktur>, where each flag is 0 or 1. I put: supercell-magic 1 1 0 0 0
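
As a rough illustration of that format, the sketch below assembles the line from named flags so it's harder to mix up the order. The helper name font_properties_line is invented for this example, and the flag values are the same guess as above, not confirmed properties of the font.

```python
def font_properties_line(font, italic=0, bold=0, fixed=0, serif=0, fraktur=0):
    """Build one line in the <fontname> <italic> <bold> <fixed> <serif> <fraktur> format."""
    flags = (italic, bold, fixed, serif, fraktur)
    if any(flag not in (0, 1) for flag in flags):
        raise ValueError("each flag must be 0 or 1")
    return f"{font} " + " ".join(str(flag) for flag in flags)


# Same values as in the post (a guess, not verified for this font)
line = font_properties_line("supercell-magic", italic=1, bold=1)

# To save it: open("coc.font_properties", "w").write(line + "\n")
```

Here line ends up as "supercell-magic 1 1 0 0 0", matching what I put in the file.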

For coc.frequent_words_list, I put the digits 0 1 2 3 4 5 6 7 8 9 (one per line); same for coc.words_list.
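
Both word list files can also be written by a short script instead of by hand. A minimal sketch, assuming the same digits-only content described above; write_word_lists is a made-up helper name.

```python
from pathlib import Path


def write_word_lists(folder, lang="coc", words=None):
    """Write <lang>.words_list and <lang>.frequent_words_list, one word per line."""
    if words is None:
        words = [str(digit) for digit in range(10)]  # 0 through 9, as in the post
    folder = Path(folder)
    written = []
    for suffix in ("words_list", "frequent_words_list"):
        path = folder / f"{lang}.{suffix}"
        path.write_text("\n".join(words) + "\n", encoding="utf-8")
        written.append(path)
    return written
```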

Move all the files to one folder, then go to Trainer in jTessBoxEditor, choose the folder with your files under Training Data, make sure the mode is Train with Existing Box, and press Run. For 500 numbers it only took around 5 minutes and produced coc.traineddata. Move your coc.traineddata to the tessdata folder in Tesseract-OCR and you're good to go :)
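
The last step can be scripted too. The sketch below copies the trained file into a tessdata folder and builds a tesseract command line that selects the new language with -l. The helper names are invented for the example, the tessdata location depends on where Tesseract-OCR is installed on your machine, and the image and output names are placeholders.

```python
import shutil
from pathlib import Path


def install_traineddata(traineddata, tessdata_dir):
    """Copy e.g. coc.traineddata into the tessdata folder and return the new path."""
    src = Path(traineddata)
    dest_dir = Path(tessdata_dir)
    if not dest_dir.is_dir():
        raise FileNotFoundError(f"tessdata folder not found: {dest_dir}")
    return Path(shutil.copy2(src, dest_dir / src.name))


def ocr_command(image, output_base, lang="coc"):
    """Build the command: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


# To actually run it (requires tesseract on your PATH):
# import subprocess
# subprocess.run(ocr_command("screenshot.png", "result"), check=True)
```

With those placeholder names, tesseract would write the recognized text to result.txt.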