Thing-a-Day: Day Three — Batch Character Replacement

My favourite ebook reader for my Android phone is FBReader. It seems to handle ePub files the best, but not every ebook I download is available in that format. I convert those files with Calibre (which I also use to organize my library), but in certain cases the conversion isn’t perfect. The most annoying problem is incorrect character encoding. A lot of the time I just deal with it because the only effect is to ugly up the formatting a bit, but a book I was trying to read today was missing all of its em dashes. Usually FBReader will display � or Ͱ or whichever very wrong character applies, but these em dashes were replaced by nothing at all. Words on either side would runtogether likethis, and the author had used what I would consider an unacceptable number of em dashes, really, so the text was unreadable. Well, I could read it, but it was making me very angry.

The issue with this book turned out to be that most of the text was correctly encoded as UTF-8, but the em dashes were encoded as something else entirely (they were probably CP1252 originally). I could explicitly specify the input as either encoding before converting, but that gave me only two choices: em dashes incorrect or all other punctuation incorrect. This will probably be of no use or interest to anyone, but what I did (so I can remember later!) was this:

I changed my ePub file’s extension to .zip (because ePub files are zip files with some particular contents) and dug into the html files that were inside. I used ClipSpy to determine the hex code of the invalid em dash character that Calibre was choking on, this table of characters to figure out the correct UTF-8 hex code, and Useful File Utilities with the Batch Replace plugin to go through and replace every instance of 0xc2 0x97 with 0xe2 0x80 0x94. When I checked the results in Chrome things looked good, so I zipped everything back up and loaded it back into Calibre. Great success was had.
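
The same steps could probably be scripted instead of clicked through. What follows is a rough Python sketch, not the process described above. It also shows where those bytes most likely come from: CP1252 stores the em dash as the single byte 0x97, and if that byte is read as Latin-1 and saved as UTF-8 it turns into 0xc2 0x97, which is the invisible control character U+0097. The names fix_bytes, fix_epub and TEXT_SUFFIXES are made up for this example, and the list of file extensions is an assumption about how the book is laid out inside the zip.

```python
import zipfile

# CP1252 stores the em dash as the single byte 0x97. Read as Latin-1 and
# re-saved as UTF-8, that byte becomes the control character U+0097.
BAD = b"\x97".decode("latin-1").encode("utf-8")   # b"\xc2\x97"
GOOD = b"\x97".decode("cp1252").encode("utf-8")   # b"\xe2\x80\x94", a real em dash

# Assumption: the book's text lives in files with these extensions.
TEXT_SUFFIXES = (".html", ".htm", ".xhtml")


def fix_bytes(data: bytes) -> bytes:
    """Swap every mangled em dash for a correctly encoded one."""
    return data.replace(BAD, GOOD)


def fix_epub(src, dst) -> int:
    """Copy the ePub at src to dst, patching its text files on the way.

    src and dst may be file names or file-like objects.
    Returns the number of em dashes that were repaired.
    """
    repaired = 0
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        # infolist() keeps the original order, so "mimetype" stays first.
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename.lower().endswith(TEXT_SUFFIXES):
                repaired += data.count(BAD)
                data = fix_bytes(data)
            # The ePub container format wants "mimetype" stored uncompressed.
            if info.filename == "mimetype":
                method = zipfile.ZIP_STORED
            else:
                method = zipfile.ZIP_DEFLATED
            zout.writestr(info.filename, data, compress_type=method)
    return repaired
```

Something like fix_epub("book.epub", "book-fixed.epub") (both file names are placeholders) would write a patched copy and leave the original alone, which seems safer than editing in place. Images and other binary files are copied untouched, since the same two bytes could turn up in them by coincidence.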

It took me far too long to figure this process out, so I went ahead and fixed a whole lot of other ugly stuff from other books. Then, since I was on a roll, I went through my library and made sure everything had the correct metadata while also deleting file formats I don’t need. Now everything in Calibre looks nice and uniform. Also, poof! Half of my day gone. I get a bit obsessive when I start doing stuff like this… However, I will never suffer from having to stare at ����� in my ebooks again.

Goggling at hex code made my eyes droopy, so I went to sleep and didn’t wake back up until midnight. Posting this at 12:22am, but I won’t consider it to be the 4th until I wake up after my “night’s sleep”. So shush. It’s how I always operate. Makes things less confusing when you’re the sort of person who doesn’t generally wake up until sometime in the afternoon! TV guides don’t flip the date until something like 5am, so… let’s assume I’m on that system, shall we?
