Friday, March 9, 2012

Why case-insensitive filesystems need to go away

It happens all the time. Someone who is used to Linux or another Unix system uses a Mac for the first time and runs into a problem caused by the mix of case-sensitive and case-insensitive filesystem handling. It also happens in reverse. So the obvious line-up is that the people most comfortable with MacOS defend case insensitivity, and those most familiar with traditional Unix and Unix-like systems defend case sensitivity.

The reality is that case sensitivity is the only sane option, but it has nothing to do with tradition or Unix history. It has to do with the idea of upper and lower case and what they mean.

In the modern day, most systems support not just the Western (specifically American) subset of characters called ASCII in filesystems, but very nearly every character that is in use around the world. These expanded character sets exist within a framework called Unicode, and in the Unicode world the question of which characters count as "the same" is rather a lot more complex. For example, on my Mac, I just created a file with the name "一". This is the Japanese and Chinese character for the word "one". In fact, it can be used interchangeably with the numeral "1". So why is it that, on my Mac, I can create a file named "一" and another file in the same directory named "1"?

Oh, but that's just the start.

There are full-width versions of all of the ASCII characters, like "Ｄ", which is the full-width "D". On a Mac, you can create a file whose name is "D" and another whose name is "Ｄ" in the same directory. Not only are these the same conceptual letter, but they look almost exactly the same in a directory listing. So why? Because the Apple filesystem people rightly determined that trying to fold every variation of every "glyph" onto every other variant of that same glyph was not only prohibitively complicated, but guaranteed to be wrong in many circumstances (in some cases, for example, 一 doesn't mean the same thing as 1, and you could reasonably use 1 as a way to resolve ambiguity). The "wrong" behavior of mapping upper- and lower-case attributes to each other is no less wrong, but it was Apple tradition, and breaking with it would have created problems for Apple users. So they kept it, but they weren't foolish enough to try to expand it to every one of the possible glyph-folding permutations.
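To make the distinction concrete, here is a small Python sketch using the standard unicodedata module. It only inspects the strings themselves and says nothing about what any particular filesystem does with them; the variable names are mine, chosen for illustration.

```python
import unicodedata

ascii_d = "D"            # U+0044 LATIN CAPITAL LETTER D
fullwidth_d = "\uFF24"   # U+FF24 FULLWIDTH LATIN CAPITAL LETTER D
kanji_one = "\u4E00"     # U+4E00, the character for "one"
digit_one = "1"          # U+0031 DIGIT ONE

# Different code points, so a plain comparison sees two different names,
# and so does a comparison that only folds case.
same_d = ascii_d == fullwidth_d
same_d_folded = ascii_d.casefold() == fullwidth_d.casefold()

# Compatibility normalization (NFKC) is a different, broader kind of folding:
# it does map the full-width D onto the ASCII D ...
nfkc_d = unicodedata.normalize("NFKC", fullwidth_d)

# ... but it leaves the character for "one" alone, even though Unicode
# records a numeric value of 1 for it.
nfkc_one = unicodedata.normalize("NFKC", kanji_one)
numeric_one = unicodedata.numeric(kanji_one)

print(same_d, same_d_folded, nfkc_d == ascii_d, nfkc_one == digit_one, numeric_one)
```

Each extra kind of folding is a separate policy decision with its own surprises, which is the point: case is just one of several ways two names can be "the same letter" to a human.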

So, the next time something breaks because a user checked a file in from a Linux system with a name that conflicted with an existing, but upcased filename, before you blame that user or the Linux filesystem semantics, consider that the OS you're using is preserving part of a historical glitch that should never have been perpetuated in the first place.
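If you have to live with both kinds of filesystem, one defensive measure is to check for names that differ only by case before they are checked in. The following is a minimal sketch, not a description of any real tool: find_case_collisions is a hypothetical helper, and Python's str.casefold is only an approximation of whatever folding table a given filesystem actually uses.

```python
from collections import defaultdict

def find_case_collisions(names):
    """Return groups of distinct names that become identical after case folding.

    Hypothetical helper for illustration; a real filesystem may fold
    differently from str.casefold().
    """
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    # Only groups with more than one distinct spelling are collisions.
    return [group for group in groups.values() if len(set(group)) > 1]

collisions = find_case_collisions(["README", "Readme", "Makefile", "src/main.c"])
print(collisions)
```

On a case-sensitive filesystem the first two names are different files; on a case-insensitive one, only one of them can exist in the directory.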

For a more complete treatment of the complexities of case-folding, let me direct you to the Wikipedia "Letter case" article, which contains a section on Unicode case folding and further points out the complexities of certain edge cases:

  • The German letter ß exists only in lowercase (but see Capital ß), and is capitalized as "SS".
  • The Greek letter Σ has two different lowercase forms: "ς" in word-final position and "σ" elsewhere.
  • The Cyrillic letter Ӏ usually has only a capital form, which is also used in lowercase text.
  • Unlike most Latin-script languages that use uppercase "I" and lowercase "i", Turkish has dotted and dotless I independent of case.

I wasn't even aware of a couple of these. I can't imagine how you would try to handle the German case. That's just ugly.
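For what it's worth, here is how one Unicode-aware runtime behaves on three of those edge cases. This is a sketch using only Python's built-in string methods, which are locale-independent; it shows what Python does, not what any filesystem does, and the variable names are mine.

```python
# German sharp s: uppercasing expands one letter into two,
# and lowercasing the result does not bring the original back.
sharp_s_upper = "ß".upper()
round_trip = "ß".upper().lower()
sharp_s_folded = "ß".casefold()

# Greek sigma: lowercasing a whole word picks the word-final form for the
# last letter, while case folding treats both lowercase forms as equal.
word_lower = "ΟΔΟΣ".lower()
ends_with_final_sigma = word_lower.endswith("ς")
sigmas_fold_equal = "σ".casefold() == "ς".casefold()

# Turkish I: without locale information, "I" lowercases to the dotted "i",
# although Turkish expects the dotless "ı" here. The dotless "ı" still
# uppercases to a plain "I".
plain_i_lower = "I".lower()
dotless_upper = "ı".upper()

print(sharp_s_upper, round_trip, sharp_s_folded)
print(word_lower, ends_with_final_sigma, sigmas_fold_equal)
print(plain_i_lower, dotless_upper)
```

So the German case gets "handled" by accepting that the mapping is not reversible: once "ß" has become "SS", nothing in the string says whether it should go back to "ß" or to "ss".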