Before anything else, a little housekeeping:
I’m back from the Hawaii honeymoon! Here’s a picture of far too many rainbows. We also have a picture of far too many dolphins. Hawaii was like that. (It was sort of like visiting a Lisa Frank folder.)
In the wake of my month-long wedding hiatus, I’ve been reevaluating my priorities (apparently getting married will do that to you). I love working on Sibyl Moon (and I’m always encouraged by everyone’s boundless enthusiasm!) but I need to spend some time in lower gear.
For the time being, I’m going to keep Sibyl Moon on a “releases as they come” basis. I’m especially grateful to everyone who’s been supporting Sibyl Moon through Patreon, and I want to retain your enthusiasm and trust, so I’m also keeping monthly Patreon payments turned off for now.
Housekeeping over. Onward!
Things that are really annoying: misconverted text documents where you only see � in place of useful characters like apostrophes, quotation marks, and ellipses.
I ran headlong into this problem while working on my NaNoGenMo project. NaNoGenMo – National Novel Generation Month – is the brainchild of Darius Kazemi.
Hey, who wants to join me in NaNoGenMo: spend the month writing code that generates a 50k word novel, share the novel & the code at the end
— Darius Kazemi (@tinysubversions) November 1, 2013
A few IF community members have tackled this in the past, including Nick Montfort (World Clock), Andrew Plotkin (Redwreath and Goldstar Have Traveled to Deathsgate), Aaron Reed (Aggressive Passive), and doubtless others that I’m forgetting (apologies!).
Because it sounds like fun (and possibly because there is something wrong with me), my project is “Choice of Someone Else’s Novel”, which reads a text source, breaks it down, and uses the building blocks to create a ChoiceScript interactive novel, composing prose primarily with grammar-based Markov chains. If all goes well, CSEN will also create characters (with relationship stats) and stats that can be increased and tested and viewed and so forth. And if all goes poorly instead, then CSEN will create static novels, because that much is already working. (Hooray!)
Right now, it’s overenthusiastic about punctuation and bad with quotation marks, but I should have that smoothed out by the end of the month.
Back to misconverted text documents. When I was trying to read my source text, I discovered that I couldn’t recognize apostrophes, quotation marks, ellipses, and so forth. They were printing in the console window as unhappy �’s, and I couldn’t get them to register with substrings or regular expressions. When I finally tried reading the source file and immediately writing it back to a different file, more �’s appeared.
The culprit is text encoding. There’s not really any such thing as a “plain text file” because different computers will store text in different formats, such as ASCII or UTF-8 or ISO 8859-1. There’s a good briefing on this subject at Joel on Software’s article “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)”, but to sum up, Windows was saving my text file in one format, and I was reading it in another. Hence, �.
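To make the mismatch concrete, here is a minimal sketch (not part of CSEN; the class and method names are made up for illustration). It decodes the same four bytes two ways: once as UTF-8, which turns the apostrophe into �, and once as Windows-1252, which recovers it. It assumes a modern .NET runtime, where the Windows code pages have to be registered before use. On the older .NET Framework or Mono, code page 1252 is built in and the registration line should be dropped.

```csharp
using System;
using System.Text;

public static class MismatchDemo
{
    // The word "It’s" as Windows-1252 stores it: 0x92 is the curly apostrophe.
    public static readonly byte[] Sample = { 0x49, 0x74, 0x92, 0x73 };

    // Wrong guess: a lone 0x92 is not valid UTF-8, so the decoder
    // substitutes U+FFFD, the replacement character shown as �.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Right guess: code page 1252 maps 0x92 to U+2019 (’).
    public static string DecodeAs1252(byte[] bytes)
    {
        // Assumption: modern .NET, where code pages must be registered first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252).GetString(bytes);
    }

    public static void Main()
    {
        // Compare each decoding against the string we expect from it.
        Console.WriteLine(DecodeAsUtf8(Sample) == "It\uFFFDs");
        Console.WriteLine(DecodeAs1252(Sample) == "It\u2019s");
    }
}
```

Once the � is in the string, the original byte is gone, which is why writing the text back out to a new file only preserves the damage.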
The bad news: unless you know what kind of encoding is being used, it’s very hard to figure it out. There’s supposed to be a file header, but even the header isn’t reliable (as per this Stack Exchange discussion).
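Assuming the "file header" here means a byte order mark (BOM), checking for one takes only a few lines. This is a rough sketch with a hypothetical helper name, not code from the project, and it only knows the UTF-8 and UTF-16 marks. UTF-32 is ignored, and a UTF-32 little-endian file would be mislabelled as UTF-16 LE.

```csharp
public static class BomSniffer
{
    // Returns a label for the byte order mark at the start of the data,
    // or null when no known mark is present.
    public static string DetectBom(byte[] b)
    {
        if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
        if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16 LE";
        if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16 BE";
        return null; // No BOM: the file could still be in almost any encoding.
    }
}
```

A call such as `BomSniffer.DetectBom(File.ReadAllBytes(path))` would do the check on a real file. The catch is the `null` case: Windows-1252 files never carry a BOM and plenty of UTF-8 files don't either, so a missing mark says very little.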
However, per the same Stack Exchange discussion, Firefox is pretty good at detecting encoding. (Open the file with Firefox, then View > Character Encoding.) My source file was in the Windows-1252 format, and hence needed to be read with Encoding.GetEncoding(1252). Once I set that up, it worked perfectly.
…for the time being. If anyone else wants to run my code, their source text will have to be in Windows-1252. Which is not ideal, so, if I have time, I’ll try to work some encoding detection into the system. (And if not – oh well. It’s all for fun.)
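My actual code, for reference (uses System.IO and System.Text):
byte[] sourceBytes;
byte[] unicodeBytes;
// Read the raw bytes so nothing gets a chance to guess at the encoding.
sourceBytes = File.ReadAllBytes (path);
// Convert from Windows-1252 to UTF-16 (which is what Encoding.Unicode means in .NET).
unicodeBytes = Encoding.Convert (Encoding.GetEncoding (1252), Encoding.Unicode, sourceBytes);
// Decode the UTF-16 bytes into an ordinary string.
rawtext = Encoding.Unicode.GetString (unicodeBytes);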
…and I could manipulate my rawtext and expect everything to be intact from there.
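As for the encoding detection mentioned earlier, one common heuristic (a sketch only, with hypothetical names, and not something CSEN does) is to try a strict UTF-8 decode first and fall back to Windows-1252 when that fails. It works because Windows-1252 text with curly quotes or accented letters almost never happens to be valid UTF-8. It is still a guess: it can't tell Windows-1252 apart from other single-byte encodings. As above, the registration line assumes modern .NET.

```csharp
using System;
using System.IO;
using System.Text;

public static class TextLoader
{
    // Decode bytes as UTF-8 if they are valid UTF-8, else as Windows-1252.
    public static string Decode(byte[] bytes)
    {
        try
        {
            // Second argument true: throw on invalid bytes instead of emitting U+FFFD.
            string text = new UTF8Encoding(false, true).GetString(bytes);
            // GetString keeps a leading BOM as U+FEFF, so strip it if present.
            return text.TrimStart('\uFEFF');
        }
        catch (DecoderFallbackException)
        {
            // Assumption: modern .NET, where code pages must be registered first.
            Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
            return Encoding.GetEncoding(1252).GetString(bytes);
        }
    }

    // Convenience wrapper for reading a whole file.
    public static string ReadAllTextGuessing(string path)
    {
        return Decode(File.ReadAllBytes(path));
    }
}
```

Plain ASCII files decode identically either way, so the guess only matters for files that contain characters like ’ or ….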