Revealed
Key role

This is why those annoying CAPTCHA boxes that make you type in words to prove you’re a human are actually great

Next time you are confronted by one of these, you can take pride in the fact that you are doing your part to preserve history

ONE of the larger annoyances of our day-to-day dealings online is having to constantly prove we are human amid a sea of robots by filling out those painful CAPTCHAs.

You know the things: You are presented with a few wavy, blurry words and you have to dutifully type them into a box, your teeth clenched, your mouth yelling, “I can’t read that mess.”

 We’ve all filled a CAPTCHA box, but did you know what you were really doing?

Well, next time you are confronted by one of these, you can take pride in the fact that you are doing your part to preserve history.

Yes, those annoying CAPTCHAs are actually being used to help digitalise decades of old texts — books, magazines and newspapers — that scanning programs struggle to decipher.

The reason the words are blurry or warped isn’t to test your patience; they are taken from scanned texts, which are often misread by auto-digitising programs, or optical character recognition (OCR) software if you wish to get technical. That’s where we step in.

Through the use of CAPTCHAs, humans around the world digitalised 20 years’ worth of New York Times back issues in mere months. Within the first year, 440 million words had been deciphered: the equivalent of 17,600 books.

 You probably didn’t realise this annoying process is part of a really cool project

Google bought the technology in 2009, and is using it as the cornerstone of its ambitious Google Books project, which digitalises ancient, rare, and out-of-print works and offers them for free.

The technology came to be used in this way after the inventor of the CAPTCHA, Luis von Ahn, realised that while it only took a few seconds to type the letters, collectively humans were wasting hundreds of thousands of man-hours each day doing so, and so he set about discovering the best way to harness this energy.

“Human computation” is the less-than-charming term von Ahn uses to describe the process he arrived at. The updated software was dubbed the reCAPTCHA.

Initially CAPTCHAs would work by offering up a series of jumbled letters and intentionally warping these just enough that humans could easily read them but robots could not.

In the case of a ticketing company, this would stop scalpers from using software to automatically buy multiple tickets.

But the same inherent flaw that allowed CAPTCHAs to trip up robots also meant that OCR programs often failed to accurately decipher scanned text with any imperfections.

Fading, damage to the paper, and printing flaws mean that OCR software incorrectly reads around 20 per cent of words, an unacceptable amount by any standard.

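The program corrects this by pairing a word the OCR software could not decipher with a control word whose correct reading is already known. If you type the control word correctly, the program trusts your reading of the unknown word too, and once enough people have typed the same answer, the program can assume it is correct.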
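For the technically curious, the pairing idea can be sketched in a few lines of Python. This is only an illustration, not the real reCAPTCHA code: the `Transcriber` class, the `submit` method, the `scan-17` label and the three-vote threshold are all made up for the example.

```python
from collections import Counter, defaultdict

# Assumption: three matching answers are enough to settle a word.
VOTES_NEEDED = 3


class Transcriber:
    """Toy model of the control-word check (hypothetical, not Google's code)."""

    def __init__(self, votes_needed=VOTES_NEEDED):
        self.votes_needed = votes_needed
        self.votes = defaultdict(Counter)  # unknown word id -> tally of guesses
        self.resolved = {}                 # unknown word id -> accepted reading

    def submit(self, control_answer, control_typed, unknown_id, unknown_typed):
        """Return True if the person passes the challenge."""
        # The control word has a known answer, so it is the actual human test.
        if control_typed.strip().lower() != control_answer.lower():
            return False  # failed the test, so the other guess is ignored

        # Passed: count the reading of the unknown word as one vote.
        guess = unknown_typed.strip().lower()
        if unknown_id not in self.resolved:
            self.votes[unknown_id][guess] += 1
            if self.votes[unknown_id][guess] >= self.votes_needed:
                self.resolved[unknown_id] = guess
        return True


if __name__ == "__main__":
    t = Transcriber()
    for typed in ["morning", "morning", "mourning", "morning"]:
        t.submit("upon", "upon", "scan-17", typed)
    print(t.resolved)
```

In this toy version one stray "mourning" is simply outvoted, and anyone who gets the control word wrong has no say over the unknown word at all.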

Note that both words are warped further by the program in order to decrease the chances that another OCR program — essentially one being used for cyber attacks — can also read the text.

Otherwise, this would defeat the CAPTCHA’s initial purpose — to halt such automated attacks.

The CyLab institute at Carnegie Mellon University, which developed the software, reports a 99.1 per cent accuracy rate with the program, a level of success “comparable to the best human professional transcription services”.

You may have also noticed photos of numbers appearing in your CAPTCHA.

This is an even more ambitious plan: to digitalise street numbers scanned by Google Street View.

Of course, this is a less altruistic undertaking than the preservation of important literature from the past, and this implementation of the reCAPTCHA system might mesh more with critical views that this is simply Google employing free labour for its own commercial ends.

As the old adage goes: if you aren’t paying for a product, you are the product, and while this might be true of Google’s ultimate business model — which is to collect and sell information — there is still a nice feeling that we are all pulling together for a few seconds each day in service of something bigger — no matter how annoying the process may be.
