I've sat down to write this blog no less than four times in the last week. Each time something has come up that has pulled me away from actually getting far enough into writing that it becomes basically self-propelled. Now tonight, I know there is at least one person out there focused on getting her homework done so I thought I'd buckle down and plow through a bit of writing myself.
Let's talk about MSI files. First, MSI files got their name back when we thought Darwin was going to be called the "Microsoft Installer". Thus the file extension MSI made some sense. Unfortunately, it was so late when the name changed to "Windows Installer" that it wasn't really feasible to safely change the extension that everyone had come to know and love as MSI.
Anyway, the vision driving development of Darwin was that setup needed to be a transacted set of changes to a target machine that could be aborted and cleaned up if an error occurred or the user cancelled setup. This means the setup logic must be declarative so that an engine can interpret the logic and calculate not only the changes to the target but also the changes necessary to undo any of those changes should something go wrong. There are many ways to define data declaratively (XML being my personal favorite these days) but back around 1995 (when Darwin was first started) the team decided the setup logic should be in a database. Unfortunately, all of the database technologies back then required substantial amounts of setup before they could be used. Since a setup technology is kinda' needed before you can set up anything, it wasn't really feasible to use any of the database engines that existed. Think classic chicken and egg problem.
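To make the "declarative so it can be undone" idea concrete, here is a minimal sketch in Python. It is not how the Windows Installer engine is implemented. The `SetValue` record, the `apply_transacted` function, the `fail_at` switch, and the dictionary standing in for the target machine are all hypothetical names invented for illustration. The point is only that when each change is plain data, an engine can record the inverse before applying it and replay those inverses in reverse order if something fails.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class SetValue:
    """One declarative change: 'this key should hold this value'."""
    key: str
    value: str


class InstallError(Exception):
    """Raised to simulate a failure or a user cancel in the middle of setup."""


def apply_transacted(machine: Dict[str, str],
                     changes: List[SetValue],
                     fail_at: Optional[int] = None) -> bool:
    """Apply every change or none. Returns True on success, False after rollback."""
    # Each undo entry is (key, previous value), where None means the key was absent.
    undo: List[Tuple[str, Optional[str]]] = []
    try:
        for step, change in enumerate(changes):
            if fail_at == step:
                raise InstallError(f"simulated failure at step {step}")
            # Because the change is data, the inverse can be recorded before applying it.
            undo.append((change.key, machine.get(change.key)))
            machine[change.key] = change.value
        return True
    except InstallError:
        # Roll back in reverse order so the machine ends up exactly as it started.
        for key, previous in reversed(undo):
            if previous is None:
                machine.pop(key, None)
            else:
                machine[key] = previous
        return False


# Hypothetical "machine state" and change list, for illustration only.
machine = {"version": "1.0"}
changes = [SetValue("version", "2.0"), SetValue("path", "C:\\App")]
succeeded = apply_transacted(machine, changes, fail_at=1)
```

With `fail_at=1` the first change is applied and then undone, so `machine` ends up unchanged and `succeeded` is `False`.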
So, the Darwin team decided to build a custom relational database. As an aside, in my humble opinion, building this custom relational database to store all the setup logic was unnecessary and generated a lot of overhead over the years (especially for those of us that have to create the flipping MSI files). However, my opinion is based on hindsight and we all know we see better when looking back on history. Anyway, I just wanted to be up front that I can't provide a really strong justification for why MSI files had to be relational databases.
Okay, so say you're in the middle of the 1990's and you need to build a relational database, what do you do? Well, if you're in Office (like the Darwin team was at the time) and you look at the Word and Excel file formats you might think, "Hey, those structured storage file thingies are really cool! I bet we could use that!"
So, MSI files are actually little databases laid out in a structured storage file. For those of you that haven't played with structured storage files let me talk about them a little. A structured storage file exists on disk as a single file but can contain many "streams" and/or "sub-storages". Streams are essentially just a bunch of bits with a name stored inside a structured storage file. Sub-storages are just structured storage files embedded in another structured storage file. I've seen people compare structured storage files to typical file systems where "files" map to "streams" and "directories" map to "sub-storages". Structured storage files are also often called "compound documents" or sometimes "OLE documents".
There are a few advantages to using structured storage files as the basis for your file format. First, the format provides a very natural way to separate your data with the streams and sub-storages. The MSI file uses a separate stream for each of the tables in the database. Second, you can store multiple files in a single structured storage file, which is nice when you want to have a single redistributable. For example, streams are used to store things like UI graphics, CustomAction DLLs, and even the binaries to be installed in many cases. Also, sub-storages are used to nest one MSI file inside another MSI file (note: you should never do this, but I'll talk about nested installs another day). Finally, structured storage files have built-in transaction semantics. Having someone else provide the transaction functionality for you is really nice when you're trying to build a database on top of the format.
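The file system analogy can be sketched as a toy data structure. This is an in-memory model only, not the real on-disk structured storage format and not any real API. The `Storage` class, its `walk` method, and the stream names below are hypothetical, and real MSI stream names are not stored this plainly.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator


@dataclass
class Storage:
    """Toy stand-in for a storage: named streams plus named sub-storages."""
    streams: Dict[str, bytes] = field(default_factory=dict)
    storages: Dict[str, "Storage"] = field(default_factory=dict)

    def walk(self, prefix: str = "") -> Iterator[str]:
        """Yield every stream as a slash-separated path, like files in directories."""
        for name in sorted(self.streams):
            yield prefix + name
        for name in sorted(self.storages):
            yield from self.storages[name].walk(prefix + name + "/")


# Hypothetical package: one stream per table, one stream for a graphic,
# and a nested package held in a sub-storage.
package = Storage(
    streams={"Property": b"table data", "Banner.bmp": b"image bits"},
    storages={"Nested.msi": Storage(streams={"Property": b"table data"})},
)
paths = list(package.walk())
```

Here `paths` lists the two top-level streams followed by the stream inside the nested storage, which is the "files in directories" picture described above.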
There are also a few disadvantages to structured storage files. First, the names of streams can only be something around 63 characters. This restriction isn't particularly restrictive but it can cause some really wacky error messages. Second, structured storage files don't shrink. If you add data to a structured storage file and then delete it, the file keeps its largest size. This design works out okay if you consider the case where a user is writing a document. In those cases, the user spends most of the time adding data and any deletes are often replaced with more data. Editing MSI files does not necessarily follow the same pattern so it is possible to end up with bloated MSI files if you're not careful. Finally, structured storage files don't handle multiple writers well at all. For example, open an MSI file in Orca then try to install the MSI by double-clicking on it. You'll get a lovely message box that says something like:
This installation package could not be opened. Verify that the package exists and that you can access it, or contact the application vendor to verify that this is a valid Windows Installer package. [OK]
Okay? No, not okay but whatever. Every time I see that message box I wonder how many hours have been lost trying to figure out what the heck is wrong with an MSI file only to find that it was held open for editing in Orca. K, a buddy of mine at work, was just about pulling his hair out one day trying to figure out what was going wrong with one of his MSI files until I pointed out that he had Orca editing the file on one of his other test machines.
Anyway, there are a couple other things I want to say about the MSI file format.
In the mid-1990's Microsoft was still shipping Office on 3.5" floppies. Granted, Office '97 shipped on something like 39 floppy disks, but CD-ROMs weren't quite popular enough (i.e. weren't cheap enough). So one of the things the Darwin team needed to do was make the MSI files as small as possible so that the setup logic would fit on a single floppy disk (trying to read a structured storage file that spanned multiple floppy disks was not an option). This need led to the creation of the "string pool" and many dreaded "string pool corruption" bugs.
More detail. If you're familiar with relational databases, you know that primary key identifiers are duplicated everywhere you have a foreign key reference. Well, primary key identifiers in MSI are strings that are recommended to be 72 characters or fewer. It's not hard to imagine how quickly all those identifiers could add up to create unnecessarily large MSI files. To combat this bloat there is a single stream in the MSI file that holds all the strings. This stream, called the string pool, contains a single entry for each unique string. That way a string column in a table is just an integer offset into the string pool.
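Here is a minimal sketch of the string pool idea in Python. It does not reproduce the actual layout of the string pool stream inside an MSI file. The `StringPool` class, its `intern` and `lookup` methods, and the sample identifiers are hypothetical, and the sketch uses a simple list index where the real format has its own encoding. What it does show is that each unique string is stored once and table rows hold only small integers.

```python
from typing import Dict, List


class StringPool:
    """Stores each unique string once and hands back an integer id for it."""

    def __init__(self) -> None:
        self._strings: List[str] = []      # id -> string
        self._index: Dict[str, int] = {}   # string -> id

    def intern(self, text: str) -> int:
        """Return the id for text, adding it to the pool only if it is new."""
        if text not in self._index:
            self._index[text] = len(self._strings)
            self._strings.append(text)
        return self._index[text]

    def lookup(self, string_id: int) -> str:
        """Turn an id stored in a table cell back into its string."""
        return self._strings[string_id]

    def __len__(self) -> int:
        return len(self._strings)


# Hypothetical rows: the file row's foreign key repeats the component's primary key.
pool = StringPool()
component_row = (pool.intern("MyComponent"), pool.intern("INSTALLDIR"))
file_row = (pool.intern("MyFile.txt"), pool.intern("MyComponent"))
```

The primary key in `component_row` and the foreign key in `file_row` are the same integer, and the pool holds three strings even though four string cells were written.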
The string pool can save quite a bit of space. It was also pretty tricky to get right. I wasn't directly involved, but I remember quite a few late night bugs when I was an intern where my mentor spent the whole night tracking down why the wrong string or a corrupt string was coming out of the string pool. Then there were the nights trying to figure out why localized strings were coming out corrupt. Anyway, if you ever come across a copy of the original msival.exe you'll see a command-line switch that would run tests to detect string pool corruption. Fortunately, the string pool code is stable now and that isn't necessary any more.
On the note of localized strings, I should note that the MSI file format is not Unicode. I'm not an expert on localization and there is a pretty detailed topic in MSDN about localizing MSI files so I'm not going to say much more. Just keep in mind that you have to deal with codepages when storing localized strings in an MSI file. Yeah, I know, "Ick."
So there's a bunch of detail about MSI files at a level that is probably not terribly useful. Next blog I'll actually try to answer Jim's question about creating custom tables in an MSI file. However, now it is time to go to sleep and search out happy dreams in the synaptic gaps.
6 Comments
Comment by chris penney on Tuesday, September 15, 2009 9:34 AM
I've found your blogs very interesting, as I'm sure a lot of others do too.
I'm hoping you may know the answer to a question which is puzzling me, "What criteria does the MSI engine use to determine the order to write the registry table rows to the target machine's registry?"
This has caused me to spend a lot of time trying to figure it out but as yet, no luck.
Kind Regards,
Chris
Comment by Rob Mensching on Wednesday, September 16, 2009 7:11 AM
However, I don't understand why the order matters. The registry keys (like most tables) are processed declaratively. Order shouldn't matter.
Comment by Tassdar on Saturday, May 01, 2010 5:57 AM
Thanks for your introduction to MSI file structure. It has helped me a lot, but there are still some problems confusing me. Can you help me?
I'm trying to code a program to identify the installer version of a package. Do you have any suggestions?
Right now I'm decoding the MSI file using UltraEdit and hoping that I can find the differences among different versions. Unfortunately, it's a huge task and it seems that the packages made by different installer versions are the same. I even doubt whether it is possible to distinguish the version.
I know my English is poor and I hope you are able to understand what I mean. I would appreciate it if you could reply to my question.
Kind Regards,
Tassdar
Comment by Giri on Sunday, May 08, 2011 12:22 AM
Thanks for describing this in simple words. It helped me a lot to understand MSIs.
Comment by Gaurav on Monday, April 09, 2012 1:32 PM