Dealing with Unicode in Python

I haven’t touched the code for the blog client I’d written in quite a while. This is largely because it works well for my purposes and I haven’t needed to add support for other features.

It has had one major shortcoming, however, that I hadn’t taken the time to investigate and correct. Often, when quoting text from an article on the web, I would get a Unicode decode error (Python’s UnicodeDecodeError) triggered by the blob of text I’d copied from the browser.

Now, I understood in general terms what the problem was: stray characters within the copied text were not ASCII characters and markdown chokes on those characters. I had an inelegant workaround that kept me from properly dealing with the problem: I’d scan the text for offending characters, typically punctuation, and replace them with reasonable ASCII equivalents. It was a pain, but it worked.

Like all workarounds, this method had limitations. Specifically, letters carrying umlauts, tildes, or grave or acute accents have no exact ASCII equivalent, so they could not be substituted cleanly. The fact that I didn’t run into that problem often kept me from dealing with it sooner. Also, scanning a block of text for non-ASCII offenders is tedious.

What I failed to understand at the time was that the characters on a web page arrive as bytes in some encoding, UTF-8 for example. For the unaccented letters, digits, and basic punctuation (the ASCII range), UTF-8 bytes are identical to ASCII bytes, so nothing goes wrong. The problem comes in with everything else: a character such as é is stored as more than one byte in UTF-8, and those bytes are not valid ASCII. What I finally came to understand was that the encoded web page text needed to be decoded into Unicode prior to processing. The concept seems so blisteringly obvious now that I’m actually perplexed as to how I never grasped it originally.
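To make the bytes-versus-characters distinction concrete, here is a minimal sketch. It assumes Python 3, where bytes and str are separate types; the blog client in this post was written for Python 2, where the same decode call is made on a plain str. The sample phrase is just an illustration, not text from the client.

```python
# "naive cafe" with a diaeresis and an acute accent, written with escapes
# so the example does not depend on the source file's own encoding.
original = u"na\u00efve caf\u00e9"

# What a UTF-8 web page would actually deliver: bytes, not characters.
raw = original.encode("utf-8")

# The ASCII letters take one byte each; the two accented letters take two.
print(len(original), len(raw))  # 10 characters, 12 bytes

# Decoding with the right encoding recovers the text exactly.
text = raw.decode("utf-8")
print(text == original)  # True

# Decoding as ASCII fails on the first byte outside the 0-127 range.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError as err:
    ascii_ok = False
    print("ascii failed at byte offset", err.start)
```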

So I finally fixed the problem. Or, perhaps better put, I came up with a solution with a better set of trade-offs, because actually “fixing” the problem would require always knowing how the text had been encoded. Unfortunately, the program can’t know that for certain: the bytes it receives don’t reliably say which encoding produced them.

But it can make some educated guesses.

Here’s the basic code that fixes the problem:

for encoding in ['ascii', 'utf-8', 'utf-16', 'iso-8859-1']:
    try:
        # Decode the raw bytes first, then hand Unicode text to markdown.
        xhtml = markdown.convert(text.decode(encoding))
    except UnicodeError:
        # Wrong guess (UnicodeDecodeError is a subclass); try the next one.
        continue
    except Exception:
        print("Unexpected Error: %s\n" % sys.exc_info()[0])
        sys.exit(1)
    else:
        return helperfunc(xhtml)

In this case, markdown is an object that converts Markdown-formatted text to XHTML. Prior to passing the text to the markdown object, I decode it using each of the encodings I’m most likely to run into, in turn. If a decode fails, a UnicodeDecodeError is raised and caught by the first except clause. That clause merely passes control back to the for loop, where the next encoding is selected and tried. Rinse, repeat. When no exception is raised, control passes to the else clause, where normal program flow continues with the XHTML returned from markdown.
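The same try-each-encoding loop can be pulled out into a standalone function, which makes it easy to experiment with. This is only a sketch under assumptions: it targets Python 3, leaves the Markdown conversion out, and the name decode_with_fallback is made up for illustration rather than taken from the blog client.

```python
DEFAULT_ENCODINGS = ("ascii", "utf-8", "utf-16", "iso-8859-1")


def decode_with_fallback(raw, encodings=DEFAULT_ENCODINGS):
    """Return (text, encoding) for the first encoding that decodes raw.

    Hypothetical helper: it mirrors the loop in the post but returns the
    decoded text instead of passing it on to markdown.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeError:
            # Covers UnicodeDecodeError; move on to the next candidate.
            continue
    raise ValueError("none of the candidate encodings could decode the input")


# Pure ASCII input is accepted by the first candidate.
plain_text, plain_enc = decode_with_fallback(b"plain text")

# UTF-8 input with an accented letter fails ASCII, then succeeds as UTF-8.
utf8_text, utf8_enc = decode_with_fallback(u"caf\u00e9".encode("utf-8"))

# Latin-1 input of odd length fails ASCII, UTF-8, and UTF-16 in turn.
latin_text, latin_enc = decode_with_fallback(u"caf\u00e9!".encode("iso-8859-1"))

print(plain_enc, utf8_enc, latin_enc)
```

One caveat with this ordering: the utf-16 codec accepts many even-length byte strings, so Latin-1 input can be "successfully" decoded as UTF-16 into the wrong characters before iso-8859-1 is ever tried. And because iso-8859-1 maps every byte to a character, it never fails, which is why it works as the last resort.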

This section of code eliminates, in my case, almost all occurrences of the Unicode problems described above. But that’s because the vast majority of web pages I quote from are encoded in UTF-8. I’ve since added a command line option to specify the encoding to use for decoding. This should cover the other situations that arise. When the user specifies an encoding on the command line, it supersedes the built-in list and is the only one tried. The presumption is that the user knows what they are doing.

The code to support that looks like this:

if charset:
    # An explicit charset from the command line overrides the guesses.
    encodings = [charset]
else:
    encodings = ['ascii', 'utf-8', 'utf-16', 'iso-8859-1']

for encoding in encodings:
 .
 .
 .

The rest of the code looks identical to the above snippet.
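For completeness, here is one way the command line side could be wired up. The post doesn’t show that part, so everything here is an assumption: the flag name --charset, the use of argparse, and the helper names known_codec and candidate_encodings are all hypothetical. Checking the name with codecs.lookup up front means a typo is reported as a usage error instead of surfacing later in the decode loop.

```python
import argparse
import codecs

DEFAULT_ENCODINGS = ["ascii", "utf-8", "utf-16", "iso-8859-1"]


def known_codec(name):
    """argparse type hook: reject names Python has no codec for."""
    try:
        codecs.lookup(name)
    except LookupError:
        raise argparse.ArgumentTypeError("unknown encoding: %s" % name)
    return name


def candidate_encodings(charset=None):
    """A user-supplied charset replaces the default guesses entirely."""
    return [charset] if charset else list(DEFAULT_ENCODINGS)


parser = argparse.ArgumentParser(description="Sketch of the charset override.")
# --charset is a hypothetical flag name; the real option is not given in the post.
parser.add_argument("--charset", type=known_codec, default=None,
                    help="encoding to use instead of the built-in guesses")

# Parse an explicit argument list so the sketch runs without a real command line.
args = parser.parse_args(["--charset", "koi8-r"])
print(candidate_encodings(args.charset))  # ['koi8-r']
```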

It was a good exercise for me to muddle through, as I now understand the Unicode problems that can arise and how to deal with them. The basic rules are:

  1. Decode text going into the program.
  2. Encode text coming out of the program.
  3. Use Unicode string literals (u'...') within the program.

These should help keep me out of Unicode trouble in the future.
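The three rules fit in a few lines. The sketch below is illustrative only: quote_block is a hypothetical function, not part of the blog client, and it assumes the caller already knows the input encoding. It runs on Python 3; the u'...' prefix is kept so the literals would also be Unicode under Python 2.

```python
def quote_block(raw, in_encoding, out_encoding="utf-8"):
    """Wrap a byte string in curly quotes, following the three rules."""
    text = raw.decode(in_encoding)                  # rule 1: decode on the way in
    quoted = u"\u201c" + text.strip() + u"\u201d"   # rule 3: Unicode literals inside
    return quoted.encode(out_encoding)              # rule 2: encode on the way out


# Latin-1 bytes in, UTF-8 bytes out; all the work in between is on Unicode text.
latin1_bytes = u"d\u00e9j\u00e0 vu\n".encode("iso-8859-1")
result = quote_block(latin1_bytes, "iso-8859-1")
print(result)
```

Keeping the bytes at the edges and Unicode in the middle means the encoding question gets answered once, at input, instead of resurfacing wherever two strings are joined.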
