Sunday, 28 August 2011

A brief analysis of Yahoo captchas

Captchas, initially a huge annoyance, are now generally recognized as a necessary evil. They stop bots from abusing your services, and there are a lot of interesting variants to use. The biggest is Google's reCAPTCHA, which is so popular that even Microsoft uses it occasionally. Today my attention is on Yahoo's implementation. You'll know them; they look like this:

In a nutshell: I had to type a few of these lately, and the character distribution didn't look quite right. I grabbed a hundred captchas, laboriously typed them out, and broke the results down by character.
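A tally like this is easy to script once the captchas are typed out. The following is a minimal sketch, assuming the transcriptions are collected in a list of strings; the function name and the sample strings are made up for illustration and are not the captchas from this post.

```python
from collections import Counter


def tally_characters(captchas):
    """Count how often each character appears across all transcribed captchas."""
    counts = Counter()
    for text in captchas:
        counts.update(text)  # a string is iterated character by character
    return counts


# Hypothetical transcriptions, invented for illustration only.
sample = ["Td4pHe", "B7nzV2s", "GrM3yc8"]
counts = tally_characters(sample)
average_length = sum(len(text) for text in sample) / len(sample)

# Case is kept as typed, so 'T' and 't' would be counted separately.
for char, n in sorted(counts.items()):
    print(char, n)
print("average length:", round(average_length, 1))
```

Keeping upper and lower case separate matters here, since the interesting pattern is which case each letter shows up in.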


What you can't see here: Yahoo works with the traditional 'random combinations of letters and numbers' form of captcha. They use at least three different fonts, which are then physically skewed in a variety of ways. There's no additional visible interference between you and the letters, and the average length is 7.2 characters.

What you can see: Yahoo captchas use a relatively small subset of alphanumeric characters. A, B, F, G, H, J, L, M, T and V appear only in uppercase, while c, d, e, n, p, r, s, u, w, y and z appear only in lowercase. Out of the numbers we have only 2, 3, 4, 5, 6, 7 and 8. This leaves 8 alphanumeric characters completely unrepresented: i, k, o, q, x, 1, 9 and 0.
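The size of the remaining character set can be checked with a few lines of Python. This is only a sketch, and it assumes (as described above) that each letter shows up in a single case, so letters are counted once regardless of case.

```python
import string

# Case-insensitive alphanumerics: 26 letters plus 10 digits.
all_symbols = set(string.ascii_lowercase + string.digits)

# The eight characters reported above as never appearing.
unrepresented = set("ikoqx190")

# Whatever is left is the set of characters a captcha can contain.
represented = all_symbols - unrepresented

print(len(all_symbols), len(unrepresented), len(represented))
print("".join(sorted(represented)))
```

Under that assumption, 36 symbols minus the 8 missing ones leaves 28 distinct characters.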

Most of these seem to be omitted due to possible confusion. O, o and 0 are easily mistaken and so all are avoided, and the same goes for l/1 and K/X. Additionally, some two-character combinations which look similar to existing characters are omitted.
In this example the letter d is very easily mistaken for either 'cl' or 'ol' due to the font. However, lowercase l and o never appear in the captchas, presumably for this reason. The letter p suffers similarly, while B and 8 manage to escape despite sometimes being difficult to distinguish.

I'm not entirely sure of the strategy here. They're purposefully obfuscating the word by overlapping the characters, but at the same time dramatically reducing the number of characters that could be present. By cutting down the total alphanumeric characters from 62 to 28 they're making it easier for OCR to render their technique ineffective.
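To put a rough number on that reduction, here is a small sketch comparing how many distinct strings each alphabet allows. It assumes a fixed length of 7 characters (close to the observed average of 7.2) and that every character is chosen independently and uniformly, which Yahoo may well not do.

```python
import math


def search_space(alphabet_size, length):
    """Number of distinct strings of the given length over the alphabet."""
    return alphabet_size ** length


length = 7  # assumption: close to the observed average of 7.2

full = search_space(62, length)     # a-z, A-Z and 0-9
reduced = search_space(28, length)  # the 28 characters actually observed

print("62 characters:", full)
print("28 characters:", reduced)
print("reduction factor:", round(full / reduced))
print("bits per character:", round(math.log2(62), 2), "vs", round(math.log2(28), 2))
```

This only measures blind guessing. For OCR the benefit is more direct: a classifier has 28 possible classes per glyph to tell apart rather than 62.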

2 comments:

  1. They're painful but they work. Honestly, I think they do discourage busy people from leaving comments sometimes. The codes are tedious and then the font style- ugh. The image based captchas are better for blog owners who want to encourage comments on their blogs.

  2. Yahoo captchas are just horrible. Google has come up with an innovative image based approach
