Manually writing the byte order mark (BOM) for an encoding into a stream

I recently discovered a problem with our WebCopy and Cyotek Sitemap Creator products to do with "corruption" of plain text documents, where non-ANSI characters appeared incorrectly. It didn't take long to realize that these programs appeared to be saving text content as ANSI files, which I found curious, as the Crawler library they use detects the response encoding and uses this to save the files.

Or does it? Consider the code below:

string fileName;
byte[] data;
Encoding encoding;

fileName = Path.GetTempFileName();
data = new byte[0]; // assume you have a populated byte array!
encoding = Encoding.UTF8;

using (FileStream stream = new FileStream(fileName, FileMode.Create))
{
  using (BinaryWriter writer = new BinaryWriter(stream, encoding))
    writer.Write(data);
}

Looking at this, you might be tempted to assume (as I did) that this code would save the content in the given encoding. When I opened one of the files generated by similar code in Notepad++, I found it was detected as an ANSI file. Switching the encoding to UTF-8 immediately displayed the file correctly, without the "corruption". So the bytes themselves were fine, but the byte order mark (BOM) isn't written by the BinaryWriter - it only uses the given encoding when converting strings and characters to bytes, and a byte array is written exactly as supplied. Without a BOM, an editor has to guess the encoding, and it can guess wrong. All this time I assumed files were being saved as recognizable UTF-8 (or whatever the response encoding was) and properly supported Unicode, and all this time I was wrong.
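To see the difference for yourself, here is a minimal sketch (not the Crawler library's actual code) that writes the same text through a BinaryWriter and a StreamWriter and dumps the resulting bytes. The sample string and the use of a MemoryStream in place of a file are my own assumptions, purely for illustration.

```csharp
using System;
using System.IO;
using System.Text;

// Sample text containing one non-ASCII character (é = U+00E9)
string text = "caf\u00e9";
Encoding encoding = Encoding.UTF8;

byte[] viaBinaryWriter;
using (MemoryStream stream = new MemoryStream())
{
  using (BinaryWriter writer = new BinaryWriter(stream, encoding))
    writer.Write(encoding.GetBytes(text)); // byte[] is written as-is, no BOM is added
  viaBinaryWriter = stream.ToArray();
}

byte[] viaStreamWriter;
using (MemoryStream stream = new MemoryStream())
{
  using (StreamWriter writer = new StreamWriter(stream, encoding))
    writer.Write(text); // StreamWriter writes the preamble at the start of the stream
  viaStreamWriter = stream.ToArray();
}

Console.WriteLine(BitConverter.ToString(viaBinaryWriter)); // 63-61-66-C3-A9
Console.WriteLine(BitConverter.ToString(viaStreamWriter)); // EF-BB-BF-63-61-66-C3-A9
```

The three extra bytes at the start of the second line are the UTF-8 BOM. A StreamWriter is only an option when you have the content as text; if you already hold the encoded bytes, as in the code above, you need to write the BOM yourself.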

So how do you manually write a BOM into a document? The oddly named GetPreamble method of the Encoding class is what you need - this returns the bytes that comprise the BOM (or an empty array if the encoding doesn't use one), and you can then write this directly to your stream:

string fileName;
byte[] data;
Encoding encoding;

fileName = Path.GetTempFileName();
data = new byte[0]; // assume you have a populated byte array!
encoding = Encoding.UTF8;

using (FileStream stream = new FileStream(fileName, FileMode.Create))
{
  using (BinaryWriter writer = new BinaryWriter(stream, encoding))
  {
    writer.Write(encoding.GetPreamble());
    writer.Write(data);
  }
}
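If you are curious what GetPreamble actually returns, the following short sketch prints the preamble for a handful of the built-in encodings. The choice of encodings is just my own selection for illustration; note that some return an empty array, in which case writing the preamble is harmless as nothing is written.

```csharp
using System;
using System.Text;

Encoding[] encodings =
{
  Encoding.UTF8,             // EF-BB-BF
  new UTF8Encoding(false),   // UTF-8 configured not to emit a BOM, empty preamble
  Encoding.Unicode,          // UTF-16 little endian, FF-FE
  Encoding.BigEndianUnicode, // UTF-16 big endian, FE-FF
  Encoding.ASCII             // no BOM, empty preamble
};

foreach (Encoding encoding in encodings)
{
  byte[] preamble = encoding.GetPreamble();

  Console.WriteLine("{0}: {1}", encoding.WebName, preamble.Length == 0 ? "(none)" : BitConverter.ToString(preamble));
}
```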

Note that you only need to write a BOM if your document is actually supposed to be a text file - if it is "normal" binary data (such as an image or a gzip stream) then you definitely do not want to write a BOM, or you truly will have a corrupt file.
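One way of guarding against this is sketched below. The WriteContent and StartsWith helpers and the isText flag are hypothetical names of my own, not part of WebCopy or the Crawler library, and how you decide that a response is text (for example from its content type) is left to you. The sketch also assumes you don't want a second BOM if the data already begins with one.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: writes the BOM only for text content that doesn't already start with it
static void WriteContent(Stream stream, byte[] data, Encoding encoding, bool isText)
{
  byte[] preamble = encoding.GetPreamble(); // empty if the encoding has no BOM

  if (isText && preamble.Length > 0 && !StartsWith(data, preamble))
    stream.Write(preamble, 0, preamble.Length);

  stream.Write(data, 0, data.Length);
}

// Hypothetical helper: true if data begins with the bytes in prefix
static bool StartsWith(byte[] data, byte[] prefix)
{
  if (data.Length < prefix.Length)
    return false;

  for (int i = 0; i < prefix.Length; i++)
  {
    if (data[i] != prefix[i])
      return false;
  }

  return true;
}

using (MemoryStream stream = new MemoryStream())
{
  WriteContent(stream, Encoding.UTF8.GetBytes("caf\u00e9"), Encoding.UTF8, true);
  Console.WriteLine(BitConverter.ToString(stream.ToArray())); // EF-BB-BF-63-61-66-C3-A9
}
```

Passing false for isText writes the data untouched, which is what you want for images, archives and other binary content.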

Now the files produced by WebCopy and Sitemap Creator are encoded correctly and I can be happy with yet another bug squashed, unhappy at yet another reminder of why I need to write a proper set of automated tests for the libraries I use, but happy again that I had another (albeit brief) tip to post on this blog.

About The Author

The founder of Cyotek, Richard enjoys creating new blog content for the site. Much more though, he likes to develop programs, and can often be found writing reams of code. A long-term gamer, he has aspirations of one day creating an epic video game. Until that time, he is mostly content with adding new bugs to WebCopy and the other Cyotek products.
