Setup

Inside the MSI file format.

I've sat down to write this blog no less than four times in the last week. Each time something has come up that has pulled me away from actually getting far enough into writing that it becomes basically self-propelled. Now tonight, I know there is at least one person out there focused on getting her homework done so I thought I'd buckle down and plow through a bit of writing myself.

Let’s talk about MSI files. First, MSI files got their name back when we thought Darwin was going to be called the “Microsoft Installer”. Thus the file extension MSI made some sense. Unfortunately, it was so late when the name changed to “Windows Installer” that it wasn’t really feasible to safely change the extension that everyone had come to know and love as MSI.

Anyway, the vision driving development of Darwin was that setup needed to be a transacted set of changes to a target machine that could be aborted and cleaned up if an error occurred or the user cancelled setup. This means the setup logic must be declarative so that an engine can interpret the logic and calculate not only the changes to the target but also the changes necessary to undo any of those changes should something go wrong. There are many ways to define data declaratively (XML being my personal favorite these days) but back around 1995 (when Darwin was first started) the team decided the setup logic should be in a database. Unfortunately, all of the database technologies back then required substantial amounts of setup before they could be used. Since a setup technology is kinda’ needed before you can set up anything, it wasn’t really feasible to use any of the database engines that existed. Think classic chicken and egg problem.
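To make the "declarative" point a bit more concrete, here is a minimal sketch in Python. It is not how Darwin is implemented. The action names (`create_dir`, `write_file`, `remove`, `fail`) and the dictionary standing in for the target machine are all hypothetical. The only thing it tries to show is that when changes are described as data, an engine can work out the undo step before it applies each change.

```python
# Toy model: the "machine" is a dict mapping paths to contents (None = directory).
# All action names are made up for illustration; they are not real installer actions.

def plan_undo(machine, action):
    """Return the action that would reverse `action`, given the current state."""
    kind, path = action[0], action[1]
    if kind == "write_file" and path in machine:
        return ("write_file", path, machine[path])  # restore the old contents
    if kind in ("create_dir", "write_file") and path not in machine:
        return ("remove", path)                     # the path is new, so delete it
    return ("noop", path)                           # nothing to reverse


def apply_action(machine, action):
    """Apply one declarative action to the toy machine."""
    kind, path = action[0], action[1]
    if kind == "create_dir":
        machine.setdefault(path, None)
    elif kind == "write_file":
        machine[path] = action[2]
    elif kind == "remove":
        machine.pop(path, None)
    elif kind == "fail":
        raise RuntimeError("simulated failure")     # stands in for an error or a cancel
    # "noop" intentionally does nothing


def install(machine, script):
    """Apply every action in `script`; roll back everything if any action fails."""
    undo_log = []
    try:
        for action in script:
            undo = plan_undo(machine, action)       # computed before the change is made
            apply_action(machine, action)
            undo_log.append(undo)
    except Exception:
        for undo in reversed(undo_log):             # undo in reverse order
            apply_action(machine, undo)
        return False
    return True
```

If the script ends with a `("fail", ...)` action, `install` returns `False` and the dictionary ends up exactly as it started, including any file contents that were overwritten along the way.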

So, the Darwin team decided to build a custom relational database. As an aside, in my humble opinion, building this custom relational database to store all the setup logic was unnecessary and generated a lot of overhead over the years (especially for those of us that have to create the flipping MSI files). However, my opinion is based on hindsight and we all know we see better when looking back on history. Anyway, I just wanted to be up front that I can’t provide a really strong justification for why MSI files had to be relational databases.

Okay, so say you’re in the middle of the 1990’s and you need to build a relational database, what do you do? Well, if you’re in Office (like the Darwin team was at the time) and you look at the Word and Excel file formats you might think, “Hey, those structured storage file thingies are really cool! I bet we could use that!”

So, MSI files are actually little databases laid out in a structured storage file. For those of you that haven’t played with structured storage files let me talk about them a little. A structured storage file exists on disk as a single file but can contain many “streams” and/or “sub-storages”. Streams are essentially just a bunch of bits with a name stored inside a structured storage file. Sub-storages are just structured storage files embedded in another structured storage file. I’ve seen people compare structured storage files to typical file systems where “files” map to “streams” and “directories” map to “sub-storages”. Structured storage files are also often called “compound documents” or sometimes “OLE documents”.
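If the file system analogy helps, here is a toy in-memory model of it in Python. To be clear, this is only a sketch of the shape of the thing. Real structured storage is reached through COM interfaces such as IStorage and IStream, and nothing below touches an actual file. The `Storage` class, the `walk` helper, and the stream names in the usage note are all my own inventions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Storage:
    """A named container holding streams (raw bytes) and nested storages."""
    name: str
    streams: dict = field(default_factory=dict)    # stream name -> bytes
    storages: dict = field(default_factory=dict)   # storage name -> Storage

    def add_stream(self, name, data):
        """Store a named blob of bytes, like a file in a directory."""
        self.streams[name] = bytes(data)

    def add_storage(self, name):
        """Create (or return the existing) nested storage, like a subdirectory."""
        return self.storages.setdefault(name, Storage(name))


def walk(storage, prefix=""):
    """Yield a slash-separated path for every stream, depth first."""
    base = f"{prefix}{storage.name}/"
    for stream_name in sorted(storage.streams):
        yield base + stream_name
    for child_name in sorted(storage.storages):
        yield from walk(storage.storages[child_name], base)
```

For example, building a root storage named `package.msi` with a stream named `File` and a sub-storage named `Nested.msi` that holds a stream named `Property` makes `list(walk(root))` return `["package.msi/File", "package.msi/Nested.msi/Property"]`. Those names are made up; the real stream names inside an MSI file look different.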

There are a few advantages to using structured storage files as the basis for your file format. First, the format provides a very natural way to separate your data with the streams and sub-storages. The MSI file uses separate streams for each of the tables in the database. Second, you can store multiple files in a single structured storage file, which is nice when you want to have a single redistributable. For example, streams are used to store things like UI graphics, CustomAction DLLs, and even the binaries to be installed in many cases. Also, sub-storages are used to nest one MSI file inside another MSI file (note: you should never do this, but I’ll talk about nested installs another day). Finally, structured storage files have built-in transaction semantics. Having someone else provide the transaction functionality for you is really nice when you’re trying to build a database on top of the format.

There are also a few disadvantages to structured storage files. First, the names of streams can only be something around 63 characters. This restriction isn’t particularly restrictive but it can cause some really wacky error messages. Second, structured storage files don’t shrink. If you add data to a structured storage file and then delete it, the file maintains its largest size. This design works out okay if you consider the case where a user is writing a document. In those cases, the user spends most of the time adding data and any deletes are often replaced with more data. Editing MSI files does not necessarily follow the same pattern so it is possible to end up with bloated MSI files if you’re not careful. Finally, structured storage files don’t handle multiple writers well at all. For example, open an MSI file in Orca then try to install the MSI by double-clicking on it. You’ll get a lovely message box that says something like:

This installation package could not be opened. Verify that the package exists and that you can access it, or contact the application vendor to verify that this is a valid Windows Installer package. [OK]

Okay? No, not okay but whatever. Every time I see that message box I wonder how many hours have been lost trying to figure out what the heck is wrong with an MSI file only to find that it was held open for editing in Orca. K, a buddy of mine at work, was just about pulling his hair out one day trying to figure out what was going wrong with one of his MSI files until I pointed out that he had Orca editing the file on one of his other test machines.

Anyway, there are a couple other things I want to say about the MSI file format.

In the mid-1990’s Microsoft was still shipping Office on 3.5” floppies. Granted Office ‘97 shipped on something like 39 floppy disks but CD-ROMs weren’t quite popular enough (i.e. weren’t cheap enough). So one of the things the Darwin team needed to do was make the MSI files as small as possible so that the setup logic would fit on a single floppy disk (trying to read a structured storage file that spanned multiple floppy disks was not an option). This need led to the creation of the “string pool” and many dreaded “string pool corruption” bugs.

More detail. If you’re familiar with relational databases, you know that primary key identifiers are duplicated everywhere you have a foreign key reference. Well, primary key identifiers in MSI are strings that are recommended to be 72 characters or fewer. It’s not hard to imagine how quickly all those identifiers could add up to create unnecessarily large MSI files. To combat this bloat there is a single stream in the MSI file that holds all the strings. This stream, called the string pool, contains a single entry for each unique string. That way a string column in a table is just an integer offset into the string pool.
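Here is the idea in a few lines of Python. This is only a sketch of string interning in general, not the real on-disk layout of the string pool, and the `StringPool` class, its method names, and the sample rows are all hypothetical. I also hand out list indexes rather than byte offsets to keep it short.

```python
class StringPool:
    """Stores each unique string once and hands out small integer ids."""

    def __init__(self):
        self._strings = []   # id -> string
        self._ids = {}       # string -> id

    def intern(self, text):
        """Return the id for `text`, adding it to the pool on first sight."""
        if text not in self._ids:
            self._ids[text] = len(self._strings)
            self._strings.append(text)
        return self._ids[text]

    def lookup(self, string_id):
        """Return the string stored under `string_id`."""
        return self._strings[string_id]

    def __len__(self):
        return len(self._strings)


pool = StringPool()

# Made-up rows: (file key, component key). The component key repeats in
# every row because it is a foreign key reference.
rows = [
    ("readme.txt", "MainComponent"),
    ("app.exe", "MainComponent"),
    ("app.dll", "MainComponent"),
]

# Each table cell becomes an integer; the strings themselves live in the pool.
encoded = [tuple(pool.intern(cell) for cell in row) for row in rows]
decoded = [tuple(pool.lookup(string_id) for string_id in row) for row in encoded]
```

Three rows hold six strings, but the pool only stores four of them, because "MainComponent" is stored once and referenced three times. The longer and more repeated the identifiers, the bigger the savings.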

The string pool can save quite a bit of space. It was also pretty tricky to get right. I wasn’t directly involved, but I remember quite a few late night bugs when I was an intern where my mentor spent the whole night tracking down why the wrong string or a corrupt string was coming out of the string pool. Then there were the nights trying to figure out why localized strings were coming out corrupt. Anyway, if you ever come across a copy of the original msival.exe you’ll see a command-line switch that would run tests to detect string pool corruption. Fortunately, the string pool code is stable now and that isn’t necessary any more.

On the note of localized strings, I should note that the MSI file format is not Unicode. I’m not an expert on localization and there is a pretty detailed topic in MSDN about localizing MSI files so I’m not going to say much more. Just keep in mind that you have to deal with codepages when storing localized strings in an MSI file. Yeah, I know, “Ick.”
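If you have never been bitten by codepages, this little Python sketch shows the kind of thing that goes wrong. It has nothing to do with the MSI APIs. It just uses Python's built-in `cp1252` (Western European) and `cp1251` (Cyrillic) codecs as stand-ins for whatever codepage a package happens to use, and the sample strings are my own.

```python
text = "café"

# Store the string as bytes using the Western European codepage.
stored = text.encode("cp1252")

# Reading the bytes back with the same codepage gives the original string.
roundtrip = stored.decode("cp1252")

# Reading the very same bytes with a Cyrillic codepage silently produces a
# different string. No error is raised; the text is just wrong.
misread = stored.decode("cp1251")


def fits_codepage(value, codepage):
    """Return True if every character in `value` exists in `codepage`."""
    try:
        value.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False
```

The bytes on disk do not say which codepage they were written with, so the reader has to know. And some strings simply cannot be stored at all: `fits_codepage("日本語", "cp1252")` returns `False`.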

So there’s a bunch of detail about MSI files at a level that is probably not terribly useful. Next blog I’ll actually try to answer Jim’s question about creating custom tables in an MSI file. However, now it is time to go to sleep and search out happy dreams in the synaptic gaps.