NaNoGenMo and Text Encoding

Before anything else, a little housekeeping:

I’m back from the Hawaii honeymoon! Here’s a picture of far too many rainbows. We also have a picture of far too many dolphins. Hawaii was like that. (It was sort of like visiting a Lisa Frank folder.)

In the wake of my month-long wedding hiatus, I’ve been reevaluating my priorities (apparently getting married will do that to you). I love working on Sibyl Moon (and I’m always encouraged by everyone’s boundless enthusiasm!) but I need to spend some time in lower gear.

For the time being, I’m going to keep Sibyl Moon on a “releases as they come” basis. I’m especially grateful to everyone who’s been supporting Sibyl Moon through Patreon, and I want to retain your enthusiasm and trust, so I’m also keeping monthly Patreon payments turned off for now.

Housekeeping over. Onward!

Things that are really annoying: misconverted text documents where you only see � in place of useful characters like apostrophes, quotation marks, and ellipses.

I ran headlong into this problem while working on my NaNoGenMo project. NaNoGenMo – National Novel Generation Month – is the brainchild of Darius Kazemi.

A few IF community members have tackled this in the past, including Nick Montfort (World Clock), Andrew Plotkin (Redwreath and Goldstar Have Traveled to Deathsgate), Aaron Reed (Aggressive Passive), and doubtless others that I’m forgetting (apologies!).

Because it sounds like fun (and possibly because there is something wrong with me), my project is “Choice of Someone Else’s Novel”, which reads a text source, breaks it down, and uses the building blocks to create a ChoiceScript interactive novel, composing prose primarily with grammar-based Markov chains. If all goes well, CSEN will also create characters (with relationship stats) and stats that can be increased and tested and viewed and so forth. And if all goes poorly instead, then CSEN will create a static novel, because that much is already working. (Hooray!)

Right now, it’s overenthusiastic about punctuation and bad with quotation marks, but I should have that smoothed out by the end of the month.

Back to misconverted text documents. When I was trying to read my source text, I discovered that I couldn’t recognize apostrophes, quotation marks, ellipses, and so forth. They were printing in the console window as unhappy �’s, and I couldn’t get them to register with substrings or regular expressions, and when I finally tried reading the source file and immediately writing it back to a different file, more �’s appeared.

The culprit is text encoding. There’s not really any such thing as a “plain text file” because different computers will store text in different formats, such as ASCII or UTF-8 or ISO 8859-1. There’s a good briefing on this subject in the Joel on Software article “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)”, but to sum up, Windows was saving my text file in one format, and I was reading it in another. Hence, �.
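To make the mismatch concrete, here is a minimal sketch of what was probably happening. It assumes the reading side was defaulting to UTF-8 (I never checked at the time), and the class and method names are made up for illustration. The three bytes are “I’m” as Windows-1252 stores it, with the curly apostrophe as the single byte 0x92.

```csharp
using System;
using System.Text;

public static class MismatchDemo
{
    // Decode bytes as UTF-8, the way a reader with the wrong assumption would.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        // "I’m" as Windows-1252 bytes: 'I', curly apostrophe (0x92), 'm'.
        byte[] source = { 0x49, 0x92, 0x6D };

        // 0x92 on its own is not valid UTF-8, so the decoder substitutes U+FFFD (�).
        string text = DecodeAsUtf8(source);
        Console.WriteLine(text == "I\uFFFDm"); // True

        // Writing the string back out stores the � itself; the original byte is gone.
        byte[] writtenBack = Encoding.UTF8.GetBytes(text);
        Console.WriteLine(BitConverter.ToString(writtenBack)); // 49-EF-BF-BD-6D
    }
}
```

That second step would explain why reading the file and writing it straight back out only produced more �’s: by then the replacement character is real data in the string, not just a display glitch.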

The bad news: unless you know what kind of encoding is being used, it’s very hard to figure it out. There’s supposed to be a file header, but even the header isn’t reliable (as per this Stack Exchange discussion).
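Assuming the “file header” here means a byte order mark (BOM), checking for one looks roughly like the sketch below. The BomSniffer class and DetectBom method are hypothetical names, and the check only covers the UTF-8 and UTF-16 marks.

```csharp
using System;
using System.Text;

public static class BomSniffer
{
    // Returns the encoding named by a leading byte order mark, or null if there is none.
    // Only the UTF-8 and UTF-16 marks are checked; UTF-32 is ignored in this sketch.
    public static Encoding DetectBom(byte[] bytes)
    {
        if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
            return Encoding.UTF8;
        if (bytes.Length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
            return Encoding.Unicode;          // UTF-16 little-endian
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
            return Encoding.BigEndianUnicode; // UTF-16 big-endian
        return null; // no BOM: Windows-1252, BOM-less UTF-8, or something else entirely
    }
}
```

A null result is the unhelpful case: single-byte encodings like Windows-1252 never carry a BOM, and plenty of UTF-8 files skip it too, which is a big part of why the header can’t be trusted on its own.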

However, per the same Stack Exchange discussion, Firefox is pretty good at detecting encoding. (Open the file with Firefox, then go to View > Character Encoding.) My source file was in the Windows-1252 encoding, and hence needed to be read with Encoding.GetEncoding(1252). Once I set that up, it worked perfectly.

…for the time being. If anyone else wants to run my code, their source text will have to be in Windows-1252. Which is not ideal, so, if I have time, I’ll try to work some encoding detection into the system. (And if not – oh well. It’s all for fun.)
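One cheap form of detection, sketched here rather than taken from CSEN: try to decode the bytes as strict UTF-8, and only fall back to a legacy encoding if that fails. This is a heuristic, not real detection, and LenientReader and Decode are hypothetical names.

```csharp
using System;
using System.Text;

public static class LenientReader
{
    // Hypothetical helper: decode as strict UTF-8, and use the fallback
    // encoding only when the bytes are not valid UTF-8.
    public static string Decode(byte[] bytes, Encoding fallback)
    {
        // throwOnInvalidBytes: true makes malformed input raise an exception
        // instead of being silently replaced with U+FFFD.
        var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                          throwOnInvalidBytes: true);
        try
        {
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            return fallback.GetString(bytes);
        }
    }
}
```

Usage would be something like LenientReader.Decode(File.ReadAllBytes(path), Encoding.GetEncoding(1252)). It works reasonably well because text in Windows-1252 that contains curly quotes or ellipses is almost never valid UTF-8 by accident, but it can’t tell one single-byte encoding from another. On newer .NET runtimes, code page 1252 may also need Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) before GetEncoding(1252) will work.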

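My actual code, for reference (uses System.IO and System.Text):

// Read the file as raw bytes so nothing is decoded with the wrong encoding.
byte[] sourceBytes = File.ReadAllBytes(path);
// Reinterpret the bytes as Windows-1252 and convert them to UTF-16 (Encoding.Unicode).
byte[] unicodeBytes = Encoding.Convert(Encoding.GetEncoding(1252), Encoding.Unicode, sourceBytes);
// Decode the UTF-16 bytes into an ordinary .NET string.
rawtext = Encoding.Unicode.GetString(unicodeBytes);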

…and I could manipulate my rawtext and expect everything to be intact from there.



  1. Procedurally generating a choicescript game is such an intriguing and ambitious idea: I hope it pans out!

    • I will warn you – even in its best form, this will probably be on the incoherent/poetic side. Here’s the kind of thing I’m getting right now (source text: Caelyn Sandel’s horror short story “Mara and the Bottleman”):

      Through?” the Were?” you Drink?” water After,” about the bedsheets pierce her handshell rang key into Mara slapped a nauseating sucking noise of her shotgun.

      Buried.” in Instead, she. All, Mara was finished she rush into the lights while dim were. Finally, the.” bottle

      Oh, I’m. An?” alarm By?” all You’re, welcome hun Mara turn, back to reclaim her.

      I’m, going to embrace a deep breath and logging onto one at the surface of. Mara, stumbled down them, again for me this. Seized, with drink, and went to live together in love Mara said. Craft, from; her side and dropped the dry heaves; for a raised eyebrow from her short-barreled shotgun to do either, Lily said matter-of-factly from her mouth open.

      It’ll improve once I get punctuation and capitalization under control, but it’ll still be more dreamlike than direct.

  2. Another IF community member: I Waded In Clear Water, which is the best thing ever, is by Allison Parrish, who co-wrote Earl Grey and wrote Which Describes How You’re Feeling.

  3. Notepad++ can convert encodings for you with a click — CP1252 is so awkward! Anyway, in general I find that having a procedure/script to standardize all my input documents into one encoding (practically speaking, UTF8, because that’s the most widely encompassing and multilingual-friendly) saves a lot of grief down the line. And also saves you from having to fuss with detecting and accommodating the very many encodings which exist out there (although of course if you’re sticking to strictly monolingual English text, the set is greatly reduced in size!).

    What (computer, not human) language is that? :O
