So it's been a really long time since my last post...oops ;p Sorry? There's no single explanation to excuse my absence, and I don't care to waste space detailing the reasons right now, so I'll get right on with the main topic of this entry: OCR software.
What is OCR? OCR stands for Optical Character Recognition, and what OCR programs do is basically read an image file (like a JPG, PNG, TIF, etc.) and output any text they recognize. As you can imagine, this greatly reduces the amount of manual labor needed to turn an image of text into an actual text file. And as Linux/BSD enthusiasts know, text files are incredibly easy to slice, dice, maneuver, and just generally utilize for various purposes. For example, even a general computer user can search a text file for instances of a specific term; a task that could take minutes, if not hours or days, when reading through images by eye may take only a few seconds once an OCR program has converted them to text. With the aid of text processing tools such as sed, awk, or Perl, the possibilities for what can be accomplished with a typical text file are almost limitless.
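To make the "seconds instead of hours" point a bit more concrete, here's a minimal sketch of searching and slicing text once it exists as a plain file. The sample lines are made up and simply stand in for whatever an OCR run might produce.

```shell
# Hypothetical sample text standing in for the output of an OCR run.
ocr_text=$(mktemp)
printf '%s\n' 'Invoice 1042' 'Total due: 37.50' 'invoice 1043' > "$ocr_text"

# Count the lines that mention a term, ignoring case.
invoice_lines=$(grep -ci 'invoice' "$ocr_text")

# Collect the second field of every line whose first field is "invoice"
# (any case), joined by single spaces.
invoice_ids=$(awk 'tolower($1) == "invoice" { printf "%s%s", sep, $2; sep = " " }
                   END { print "" }' "$ocr_text")

echo "Lines mentioning invoices: $invoice_lines"
echo "Invoice numbers: $invoice_ids"

rm -f "$ocr_text"
```

None of this is possible while the words are still locked inside a JPG or PNG, which is the whole appeal of running OCR first.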
So having established the great advantage(s) text files have over image files, where can you get started? Well, I'm glad you asked! Just for your convenience, I've (lightly) tried out three different OCR programs that are available in Ubuntu's repositories: gocr, ocrad, and (the self-proclaimed "commercial quality") tesseract. These or others should be available in most other major distributions' repositories as well. The test image I used incorporates various font effects and sizes, making it a good candidate for comparing the capabilities of the 'wares. So come on, let's see how they all fare.
The first program I tried was gocr, which can be installed on Ubuntu with the command
sudo apt-get install gocr
Right out of the box, gocr claims it only supports PBM, PNM, PPM, and PCX image files; however, if you install the package "netpbm", it'll also handle pnm.gz, pnm.bz2, JPG, JPEG (what's the diff?), TIFF, GIF, BMP, PS (single pages only), and EPS files. So unless you have P*M or PCX files lying around your filesystem, you'll probably want to go ahead and install the "netpbm" package (simply replace "gocr" with "netpbm" in the above command). Technically, you can also use GIMP to save your image as any of those filetypes, so that's an alternative.
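If you'd rather convert images by hand than rely on gocr calling netpbm for you, something like the following sketch could work. The function names pnm_converter_for and to_pnm are my own inventions, and the converter names (pngtopnm, jpegtopnm, and so on) are the ones I know netpbm to ship; your version may name some of them differently, so treat this as an assumption to check.

```shell
# Print the netpbm converter that turns the given image into PNM,
# judging only by the file extension (hypothetical helper).
pnm_converter_for() {
    ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
    case "$ext" in
        png)      echo pngtopnm ;;
        jpg|jpeg) echo jpegtopnm ;;
        tif|tiff) echo tifftopnm ;;
        gif)      echo giftopnm ;;
        bmp)      echo bmptopnm ;;
        *)        return 1 ;;
    esac
}

# Convert an image to PNM next to the original, e.g. test.png -> test.pnm.
to_pnm() {
    conv=$(pnm_converter_for "$1") || { echo "unsupported type: $1" >&2; return 1; }
    "$conv" "$1" > "${1%.*}.pnm"
}

# Example (assumes netpbm is installed and test.png exists):
#   to_pnm test.png && gocr -i test.pnm -o text
pnm_converter_for photo.JPG
```

The last line only prints which converter would be picked, so the snippet is safe to run even without netpbm installed.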
At this point, you'll need to have a file of a supported type, and all you have to do to extract the text of a given image file is issue the command
gocr -i path/to/image/file -o path/to/desired/output/file
Example:
gocr -i test.png -o text
Keep in mind that Unix/Linux programs don't typically rely on file extensions, which is why the output file in the example has none; if you must, you can name it "text.txt". What this command does is take the input image file "test.png", have gocr process it, and write the recognized text to the file "text". Simple enough, yeah? Unfortunately, gocr isn't the most accurate program ever, and I had to hand edit several places in the newly-created text file. By and large, however, the output was the same as the original (which is good!).
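If you have more than one image to process, the same -i/-o invocation can be wrapped in a loop. This is just a sketch: ocr_dir is a name I made up, and it assumes the images have already been converted to .pnm and that gocr is installed.

```shell
# Run gocr on every .pnm file in a directory, writing NAME.txt beside
# NAME.pnm and printing each output file name as it is created.
ocr_dir() {
    for img in "$1"/*.pnm; do
        [ -e "$img" ] || continue      # the glob matched nothing
        out="${img%.pnm}.txt"
        gocr -i "$img" -o "$out" && echo "$out"
    done
}

# Example (assumes gocr is installed and ./scans holds .pnm files):
#   ocr_dir ./scans
```

I've used a .txt extension here purely so the results are easy to spot next to the images; as noted above, nothing requires it.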
The next specimen, ocrad, seems to be pickier about its file formats: you must provide a PBM, PGM, PNM, or PPM file (GIMP may come in handy here). The command used for ocrad was
ocrad path/to/file -o path/to/desired/output/file
Example:
ocrad test.ppm -vo text -x results
In this example, I've also used the -v option (for verbose mode), which was legally combined with the -o option to make -vo, and the -x option (to export the detailed OCR results to a separate file, which is completely optional). The result was very similar to gocr's, although the errors were slightly different.
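I compared the outputs by eye, but diff can put a rough number on "slightly different". The sketch below uses made-up sample lines in place of real OCR output, and count_bad_lines is a hypothetical helper; it only counts lines of the hand-corrected reference that come out changed or missing, so it's a crude measure at best.

```shell
# Hypothetical stand-ins: a hand-corrected reference and two OCR results.
work=$(mktemp -d)
printf '%s\n' 'Hello world' 'OCR on Linux' 'The end' > "$work/reference"
printf '%s\n' 'He11o world' 'OCR on Linux' 'The end' > "$work/gocr"
printf '%s\n' 'Hello world' '0CR on Linux' 'The emd' > "$work/ocrad"

# Count reference lines that are changed or missing in the OCR output.
# diff marks those lines with a leading "< ".
count_bad_lines() {
    diff "$1" "$2" | grep -c '^< ' || true
}

gocr_bad=$(count_bad_lines "$work/reference" "$work/gocr")
ocrad_bad=$(count_bad_lines "$work/reference" "$work/ocrad")

echo "gocr: $gocr_bad line(s) differ; ocrad: $ocrad_bad line(s) differ"
rm -rf "$work"
```

With real files you'd point the two arguments at your corrected text and at each program's output in turn.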
Finally, we get to tesseract, which was apparently developed by HP between 1985 and 1995. The actual package name is "tesseract-ocr", so that's the name to use when installing it. My experience with this contender started out a bit flaky-like. I tried to use it with a command of the form
tesseract test.png text
However, I ran into an error:
unable to load unicharset file /usr/share/tesseract-ocr/tessdata/eng.unicharset
Being the curious type that I can be, I ventured into the stated directory, and I immediately found the reason why the file couldn't be loaded: it didn't exist! Another file with a similar name did, though ("deu.unicharset"). Rather than do full-blown research on this issue, however, I simply made a symbolic link named "eng.unicharset" to the existing "deu.unicharset" file. That apparently did the trick for that specific error, but when I tried the command again, I was greeted with yet another error that was very similar to the previous one. Since I was feeling kind of lazy (or smart, depending on your view), I just wrote a simple one-liner to automate the task of creating links for the other files tesseract was complaining about:
for i in deu.*; do sudo ln -s "$i" "$(echo "$i" | sed 's;deu;eng;')"; done
And this was executed from within the /usr/share/tesseract-ocr/tessdata directory, of course. Back to the original command I went, which without fail produced yet another error; this time I was informed for the first time that the input file must be either a TIFF or MDI file. At last, I converted the image file into the TIFF format with GIMP, and success: an output file was created.

True to its word, the text output was indeed the best of the bunch in terms of accuracy. There were far fewer mistakes, most of which were made with the smaller text. On the other hand, white space was completely disregarded, so the text was all squished together vertically. Still, it was an improvement overall, and I think I'd recommend tesseract out of all of them. Perhaps there's an even better Linux OCR alternative that's not in Ubuntu's repositories, but for now, tesseract will do for me and the general user.
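Going back to the linking loop for a moment: since it runs under sudo inside a system directory, it may be worth rehearsing in a scratch directory first. The sketch below does that with empty placeholder files (the names are only stand-ins for the real deu.* data files), and it swaps the sed call for shell parameter expansion, which is my substitution and not what I originally ran. Also note that Ubuntu packages English data separately as "tesseract-ocr-eng", which is probably the cleaner fix than linking German data under an English name.

```shell
# Rehearse the linking loop in a scratch directory instead of the real tessdata.
scratch=$(mktemp -d)

# Empty placeholder files standing in for the deu.* data files.
touch "$scratch/deu.unicharset" "$scratch/deu.inttemp"

(
    cd "$scratch" || exit 1
    for f in deu.*; do
        # ${f#deu.} strips the leading "deu." so only the prefix is swapped.
        ln -s "$f" "eng.${f#deu.}"
    done
)

# Show the resulting links; remove the directory with rm -rf "$scratch" when done.
ls -l "$scratch"
```

If the listing shows each eng.* name pointing at its deu.* counterpart, the same loop should behave the same way in the real directory.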