I’ve been working on a piece of code to convert a blogpost into Markdown text. Yes, I’m aware of the html2text.py module (mine is html2md, so nyeah!) All I’ll say is how the heck does one learn anything if they keep relying on other people’s work?
Anyway, I’ve got a naive implementation working now (won’t handle more complicated nestings like lists in blockquotes) when I ran into a snag involving unicode. Upon retrieval of a particular post, I got the following error:
UnicodeEncodeError: ‘ascii’ codec can’t encode character u’xa0′ in position 174: ordinal not in range(128)
It came up in the context of passing a string to the iterparse
function of the cElementTree module. The character in question is a ‘non-breaking space.’ Frankly, I wasn’t sure how it got there, but I verified it’s presence in the string and then set to figuring a way to deal with it. I believe this is an instance of mixing strings and unicode together, rather than dealing solely in one or the other.
I found the unicodedata
module for a solution. There is a function call normalize
which will map unicode characters to the local encoding. In this case, ASCII. Now this solution is far from perfect, since special characters (say, from the Russian alphabet or characters with tildes above them) are just converted to rough ASCII equivalents.
The following code additions fixed my problem:
import unicodedata as ud
.
.
.
uthml = ud.normalize('NFKD', unicode(html))
Where html
is a string of html directly from the website. I can then pass uhtml
to the iterparse
function and the error is gone (because the u’xa0′ characters are translated to ASCII space characters) and the rest of the program is able to do it’s thing.
I don’t know if this will be the final solution, but it allows me to continue with the development I had been originally interested in. I was aware of the potential for unicode issues, but had hoped they wouldn’t crop up. This gives me a simple 2 line way to deal with it for the now.