Fun with Code Pages

by Mike 14. August 2008 07:28

On a recent contract, one of the requirements was to run an external process to generate a license. The process prints the license information to standard output. Since this is a C# / ASP.NET project, I needed a fairly simple approach: run the process, capture the output, and store it in a database until the actual license file is needed.

This of course led me to the typical way of handling it: create a System.Diagnostics.Process object, feed in the path and argument info, and capture the output from a redirected standard output stream. All is good, right? Well, I would not be writing this post if it were. It turns out another little detail of this project is that the license file is actually going to be used on a Linux machine.

Let's start with some sample code:
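As a minimal sketch of that kind of capture (not the exact code from the project): the class name `LicenseRunner`, the method `CaptureOutput`, and its parameters are placeholders I made up for illustration.

```csharp
using System.Diagnostics;

public static class LicenseRunner
{
    // Runs an external program and returns everything it wrote to standard output.
    // exePath and arguments are hypothetical stand-ins for the real license generator.
    public static string CaptureOutput(string exePath, string arguments)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = exePath,
            Arguments = arguments,
            UseShellExecute = false,        // required before a stream can be redirected
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            // Read first, then wait, so a full pipe buffer cannot block the child process.
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return output;
        }
    }
}
```

Note that `ReadToEnd` returns a string, so the raw bytes from the process have already been decoded with some encoding by the time the caller sees them.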

This all looks good, until I write the output to the binary response stream in ASP.NET. The code itself works fine and the file is written. The problem is that the license won't work when read by the app on the Linux system. A little investigation showed that the output was a certain number of bytes (say 120) coming directly off standard output, but the written file was considerably larger (180 or so). I banged my head against the wall a little and then it hit me: it was the encoding. (Note: this really only applies to binary files. Text files are usually handled better across platforms.)

A little more code:
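The write step can be sketched roughly as below. This is an illustration under assumptions, not the project's code: it writes to a plain `Stream` rather than the ASP.NET response so it can run anywhere, and `EncodingComparison` and `WriteWith` are hypothetical names.

```csharp
using System.IO;
using System.Text;

public static class EncodingComparison
{
    // Encodes the captured text with the given encoding, writes the bytes
    // to the target stream, and returns how many bytes were written.
    public static int WriteWith(Encoding encoding, string captured, Stream target)
    {
        byte[] bytes = encoding.GetBytes(captured);
        target.Write(bytes, 0, bytes.Length);
        return bytes.Length;
    }
}
```

For the three-character string "A\u00E9\u00FF", `Encoding.ASCII` writes 3 bytes (the two non-ASCII characters become '?'), `Encoding.UTF8` writes 5, and `Encoding.Unicode` writes 6. The same text produces a different file for each encoding, which is the kind of size change described above.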

My first guess was to get the bytes as ASCII and write those to the stream. This still didn't work: it came out at around 150-160 bytes or so. UTF8, UTF7, and the rest all failed as well. I was pretty sure I was on the right track but just going about things the wrong way. Time for a little reading and some experimentation.

First stop was simply to read about strings in .NET. Everything is stored as Unicode (UTF-16), which means two bytes per character for the characters I was dealing with. I looked at a Unicode map, and it looked like for most of the character set I was using, the first byte of each character would be the byte I wanted. I wrote a quick little loop to give me every other byte and tried writing that out to the byte stream. Yeah, that didn't work either.
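A sketch of that every-other-byte loop shows one likely reason it fails. The names `EveryOtherByte` and `LowBytes` are made up for this illustration.

```csharp
using System.Text;

public static class EveryOtherByte
{
    // Keeps only the low byte of each UTF-16 code unit in the string.
    public static byte[] LowBytes(string text)
    {
        // Encoding.Unicode is little-endian UTF-16, so the low byte comes first.
        byte[] utf16 = Encoding.Unicode.GetBytes(text);
        byte[] result = new byte[utf16.Length / 2];
        for (int i = 0; i < result.Length; i++)
        {
            result[i] = utf16[i * 2];
        }
        return result;
    }
}
```

This only recovers the original byte when the character's code point is below 256. The euro sign is U+20AC, so `LowBytes("A\u20AC")` gives 0x41 followed by 0xAC, while Windows-1252 stores the euro sign as the single byte 0x80. Any byte that was decoded into a character like that comes back wrong.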

Finally, I started searching around a little more, and in a few forums I found the magic of the "code page" (see Wikipedia). For the unfamiliar, a code page is basically a character set: the map of characters to bytes. Changing the code above, we simply need to get the bytes from an Encoding created for a specific code page:
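A minimal sketch of that change, using `Encoding.GetEncoding(int)`; the wrapper class `CodePageBytes` and method `ToBytes` are hypothetical. One assumption to flag: on .NET Core and later, legacy code pages such as 437, 850, and 1252 are only available after calling `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)`, which the classic .NET Framework does not need.

```csharp
using System.Text;

public static class CodePageBytes
{
    // Converts the captured text back to bytes using the given code page.
    public static byte[] ToBytes(string captured, int codepage)
    {
        // Throws if the code page is unknown or not supported on this runtime.
        Encoding encoding = Encoding.GetEncoding(codepage);
        return encoding.GetBytes(captured);
    }
}
```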

Where "codepage" is the integer value of the given code page. I tried the two mentioned in several posts (437 and 850), but neither worked properly. Finally, I found the right one, 1252, and everything worked exactly as expected. Of course, this was one of those situations where I fixed the problem without really understanding how to avoid it in the future.

It's been a couple of weeks since I worked out the above, so I decided to dig into exactly what was happening with the code pages. I remembered the old chcp.com program, which lets you view or change the code page of a given command prompt. Running it showed that the default code page was 437. Huh? If that's the case, why didn't setting my Encoding to code page 437 work? There must be a difference between what cmd.exe defaults to and the code page used by a System.Diagnostics.Process. After some experimentation, I found the following:
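One way to get at this, sketched under assumptions: ask the redirected reader which encoding it used via `StreamReader.CurrentEncoding`, then encode with that same encoding. I am assuming this property is the source of the value; `ProcessCodePage` and `CaptureBytes` are made-up names. If only the raw bytes matter, reading `process.StandardOutput.BaseStream` directly would skip the decode step altogether.

```csharp
using System.Diagnostics;
using System.IO;
using System.Text;

public static class ProcessCodePage
{
    // Runs the program, reports the code page its output was decoded with,
    // and returns the output re-encoded with that same encoding.
    public static byte[] CaptureBytes(string exePath, string arguments, out int codepage)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = exePath,
            Arguments = arguments,
            UseShellExecute = false,
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            StreamReader reader = process.StandardOutput;
            string text = reader.ReadToEnd();
            process.WaitForExit();

            // Query the encoding after reading, in case the reader adjusted it.
            Encoding encoding = reader.CurrentEncoding;
            codepage = encoding.CodePage;
            return encoding.GetBytes(text);
        }
    }
}
```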

The variable now holds the code page the Process actually used to read standard output, and in my case it was 1252. In other words, rather than hard-coding a code page, the best option for working across systems is to always get the code page / Encoding for your system in code and get the bytes with that same encoding.
