ASCII Rules!

(Or, Couldn't We Bid Binary Files Good-Bye?)

Microsoft Word did it to me again the other day: it corrupted a large document and cost me several days' work repairing it. How I wish we could store documents in something that Notepad can edit. But I'm getting ahead of myself ...

I've been putting in long hours writing my latest book, and the chapter I'm working on has grown to about 200 pages. (Before you tell me that I shouldn't use Word for large documents, let me point out that a 400KB file is hardly large in an age when my digital camera creates 14MB pictures.) After a productive day, I saved my document, closed Word with no error messages, copied the chapter's file to another directory for safekeeping, and trundled off to bed.

The next morning, I opened the Word document, intending to pick up where I left off, but discovered that the entire document was in boldface 20-point type. For some reason, the document had lost its formatting and become just one paragraph style. Word's document format is undocumented: "We don't give out that kind of information," Microsoft Product Support Services (PSS) told me. In the end, I had to manually reformat more than 150 pages of text because I couldn't create an automated tool to repair the damage.

As I was reformatting the paragraphs, I thought, "I wouldn't have to do this if this document had been stored in some type of ASCII format." But I'm not simply slamming Microsoft. Every text-editing program I've ever worked with has, at some point, corrupted a document, which doesn't surprise me given the current standards for commercial software. No, my thought is this: If word processing software vendors can't guarantee that defects in their products won't destroy our work—and they can't—then I wish they'd make recovering from those defects easier. For example, these vendors could document their file formats; however, I'd prefer that they just adopt some type of ASCII file format with markup codes in it. Then I could use one of my favorite repair tools—Notepad.

You might think that you've never heard of an ASCII file format with markup codes, but you have: HTML. I would guess that for most of the half-century plus that we've had computers, programs have stored documents in some simple ASCII format. Over the years, I've used ASCII markup formats such as troff, Standard Generalized Markup Language (SGML), WordStar International's WordStar (which was basically an ASCII file format), and others. In the early 90s, I was a big fan of a Lotus word processor called AmiPro. AmiPro was as balky and problem-prone as any Windows word processor, but it had a truly wonderful and undocumented feature—an ASCII-based file format. Whenever AmiPro corrupted a document, I could open the document in Notepad. A bit of poking around showed me how AmiPro files were structured, and I was able to stitch corrupted files back together, saving me days of work.

Don't think, however, that ASCII-based document formats are a thing of the past. Unless you've been sequestered in the basement with Milton from the movie "Office Space" for the past 5 years, you've heard about XML as a portable way to format and transfer data, often in an e-commerce context. But did you know that you can use XML to format documents? Do a Google search for "XSLT" or "XHTML," and you'll find a whole world of document markup and preparation technology based on XML, including XHTML, a stricter reformulation of HTML as XML. As a matter of fact, if modern Web browsers adhered to the letter of the latest HTML specifications, you'd be able to use HTML for virtually all documents.

In my opinion, document files composed of ASCII text files with embedded markup codes offer several benefits. First, these documents are eminently repairable. A tool as simple as Notepad lets you view and edit them, and you can often use tools such as Perl or VBScript to write scripts to repair systematic damage. Think of this as "fault-tolerant document storage."
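To make the repair-script idea concrete, here is a minimal sketch in Python rather than Perl or VBScript. It assumes a made-up markup in which every paragraph opens with a tag such as <p style="NAME">; that format, the style names, and the repair_styles function are all hypothetical and do not describe any real word processor's files. The point is only that once the damage is visible as text, a systematic fix is a few lines of pattern matching.

```python
import re

# Hypothetical markup: each paragraph opens with <p style="NAME">.
# This format is invented for illustration; it is not Word's or AmiPro's.
STYLE_ATTR = re.compile(r'(<p style=")([^"]*)(">)')


def repair_styles(text, bad_style, good_style):
    """Retag every paragraph that carries bad_style with good_style.

    Returns the repaired text and the number of paragraphs changed.
    """
    count = 0

    def fix(match):
        nonlocal count
        if match.group(2) == bad_style:
            count += 1
            return match.group(1) + good_style + match.group(3)
        # Leave paragraphs with any other style untouched.
        return match.group(0)

    return STYLE_ATTR.sub(fix, text), count


damaged = (
    '<p style="Title">ASCII Rules!</p>\n'
    '<p style="Bold20">First paragraph.</p>\n'
    '<p style="Bold20">Second paragraph.</p>\n'
)
repaired, changed = repair_styles(damaged, "Bold20", "Body")
print(changed)
print(repaired)
```

A real repair would probably need more judgment than this (headings and body text would not all map to one style), but even a crude pass like this one can turn days of manual reformatting into a much shorter cleanup.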

Second, ASCII files are easy to inspect. If you're concerned that a file contains a macro virus, open the file with Notepad and look through it for some incomprehensible garbage; that's probably your encrypted virus. Just delete that text and the tags around it, and the file is cleaned.
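The same eyeball check can be roughed out in code. The sketch below, again in Python, flags lines in which a large share of the characters fall outside printable ASCII. The suspicious_lines name and the 0.3 threshold are my own assumptions, and this is only a pointer to where a human should look, not a virus scanner.

```python
def suspicious_lines(text, threshold=0.3):
    """Return 1-based numbers of lines that look like binary garbage.

    A line is flagged when the fraction of its characters outside
    printable ASCII (tab is allowed) exceeds threshold.
    """
    flagged = []
    # Split on "\n" only; str.splitlines() would also break on some
    # control characters that may appear inside the garbage itself.
    for number, line in enumerate(text.split("\n"), start=1):
        if not line:
            continue
        odd = sum(1 for ch in line if not (ch == "\t" or 32 <= ord(ch) <= 126))
        if odd / len(line) > threshold:
            flagged.append(number)
    return flagged


sample = (
    "Dear reader,\n"
    "\x01\x02\xff\x90\x13\x07payload\x00\x00\x00\n"
    "Sincerely,\n"
)
print(suspicious_lines(sample))  # prints [2]
```

Legitimate non-English text would also trip this check, so the threshold and the definition of "odd" would need tuning for real documents.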

Third, a file's transparent nature means that you can make it do what you want. Let's say that you don't like that your word processor's bulleted list feature puts a quarter inch between the bullet character and the text. No problem: Simply change the macro that describes bulleted lists.

And finally, such files are cross-platform. Not only are they simple to transport from one platform to another (e.g., Windows to Linux or Macintosh), but they are also easy to move from application to application. Sure, incompatibilities would exist, as every vendor would use a markup language's power to extend its word processor in different ways. But by looking at each vendor's style sheets, you could determine what changes they had made and make document translation simpler.

Not long ago, word processors differentiated themselves by offering unique features and functionality. One program might have a great equation editor, another might work well with tables, and another might be a wizard at fonts. But these days, I'd assert that all word processors offer basically the same features. The difference among word processors nowadays isn't what they do but how they do it. Some people love Word's feel, fit, and finish; others long for Corel WordPerfect's no-nonsense, dedicated word processing capabilities. Unreconstructed old cranks like me don't mind typing markup codes into a text editor directly. If what I'm saying is right (i.e., that people are more attached to a word processor's interface than its file format), then current word processing software vendors could adopt an ASCII file format with embedded markup codes and still retain their market share. If that happened, we could all use the tools that make us most comfortable, we could easily share documents, and we would have a fighting chance of getting our text back when the gremlins ate it.

I know that this idea is just fantasy, but think of it as my Christmas wish or perhaps a suggested New Year's resolution for the software industry—which reminds me: Thanks for letting me visit with you another year. Happy Holidays!
