In this modern day, HTML entities can reference arbitrary Unicode codepoints. For example, &#x2603; is the entity for ☃. Not surprisingly, WebKit appears to use UTF-16 internally to represent Unicode strings, or at the very least when interpreting HTML entities. One of the big attractions of UTF-16 is that every character is nominally represented by 2 bytes (the 16 in UTF-16 means 16 bits). Contrast this with UTF-8, where a single character can be represented by anywhere from 1 to 4 bytes, or UTF-32, where every character requires 4 bytes (i.e. twice as much as UTF-16). UTF-16 certainly seems useful: it’s not too much larger than UTF-8 for ASCII strings (only double the size), and you can jump to any character with a simple index into the string.
However, one of the often-ignored aspects of UTF-16 is the surrogate pair. Unicode contains more than 0xFFFF codepoints, and yet a single UTF-16 “character” (or unichar) can only reference up to U+FFFF. The solution is to take the codepoints in Unicode planes 1-16 (U+10000 through U+10FFFF) and represent each of them as 2 unichars. This is a surrogate pair. You can find more information on this in the Wikipedia entry for UTF-16, but to put it simply, a surrogate pair takes a range of codepoints that don’t represent real characters (U+D800 through U+DFFF) and uses them in combination to represent all the characters in the other planes.
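
The surrogate-pair arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the standard UTF-16 mapping, not anything taken from WebKit; the function names `to_surrogate_pair` and `from_surrogate_pair` are made up for this example.

```python
from typing import Tuple

HIGH_START = 0xD800  # first high (lead) surrogate
LOW_START = 0xDC00   # first low (trail) surrogate


def to_surrogate_pair(codepoint: int) -> Tuple[int, int]:
    """Split a codepoint from planes 1-16 into (high, low) UTF-16 code units."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = codepoint - 0x10000          # leaves a 20-bit value
    high = HIGH_START + (offset >> 10)    # top 10 bits
    low = LOW_START + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the original codepoint."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


high, low = to_surrogate_pair(0x1D367)
print(f"U+1D367 -> 0x{high:04X} 0x{low:04X}")  # prints "U+1D367 -> 0xD834 0xDF67"
```

Each half carries 10 bits of the 20-bit offset, which is why the 2048 reserved codepoints are enough to cover all 16 supplementary planes.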
This matters because it exposes a quirk in how WebKit interprets HTML entities. WebKit properly converts entities that represent characters outside of plane 0 into a surrogate pair. For example, &#x1D367; (𝍧) gets converted into 0xD834DF67, i.e. the two unichars 0xD834 and 0xDF67. The quirk is that if you give it the surrogate pair codepoints directly, it doesn’t realize they’re not real characters individually and passes them through unscathed, so that same character can be written as &#xD834;&#xDF67; (𝍧). This doesn’t seem particularly harmful, except that if you only write the first of these entities, WebKit gets very confused: it ends up throwing away the entire rest of the line of rendered text. Interestingly, it starts displaying text again after a line break, even if it’s just an implicit line break.
The ideal behavior here is for WebKit to silently ignore any entity that references a codepoint in the surrogate range. The fact that it doesn’t do so doesn’t really hurt anything, but I thought it was worth documenting.
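
To make the suggested behavior concrete, here is a rough Python sketch of a numeric-entity decoder that drops references to surrogate codepoints. It is not how WebKit is implemented; the regular expression, the function name `decode_numeric_entities`, and the choice to leave out-of-range references untouched are all assumptions made for the example.

```python
import re

# Matches hexadecimal (&#x2603;) and decimal (&#9731;) numeric entities.
NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric_entities(text: str) -> str:
    """Decode numeric entities, dropping any that name a surrogate codepoint."""

    def replace(match):
        hex_digits, dec_digits = match.groups()
        codepoint = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if 0xD800 <= codepoint <= 0xDFFF:
            return ""                  # surrogate half: silently ignore it
        if codepoint > 0x10FFFF:
            return match.group(0)      # assumption: leave out-of-range text as-is
        return chr(codepoint)

    return NUMERIC_ENTITY.sub(replace, text)


print(decode_numeric_entities("snow&#x2603;man"))   # the entity becomes ☃
print(decode_numeric_entities("a&#xD834;b"))        # prints "ab"
```

With this policy a lone surrogate entity can never reach the renderer, so the rest of the line is left alone.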
Update: A question was raised on Twitter about how surrogate pairs affect indexing into a UTF-16 string. I didn’t know the answer and, strangely, couldn’t find anything on Google about how to handle it either, so I tested empirically.
NSString uses UTF-16 internally, so it was a great way to test. What I found was that each half of a surrogate pair is counted as a separate character. The -length of the NSString increases by 2 when you add a surrogate pair, and -substringFromIndex: will happily split up the surrogate pair for you. Of course, if you do split a surrogate pair, then attempting to convert the NSString into another encoding, even with the simple -UTF8String, will return NULL, as such a conversion is illegal. A generated Unicode stream has to be well-formed, so it cannot contain half of a surrogate pair; converted naively, half of a surrogate pair in UTF-16 would become a single invalid 3-byte UTF-8 sequence.
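
The same experiment can be approximated outside of Cocoa. Python’s str indexes by codepoint rather than by UTF-16 code unit, so the sketch below works on an explicit list of code units to mimic what NSString does; the helpers `utf16_units` and `units_to_str` are hypothetical names for this example, and the comparison with -length, -substringFromIndex: and -UTF8String is only an analogy.

```python
import struct


def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


def units_to_str(units) -> str:
    """Rebuild a str from UTF-16 code units, letting lone surrogates through."""
    data = struct.pack(f"<{len(units)}H", *units)
    return data.decode("utf-16-le", errors="surrogatepass")


units = utf16_units("a\U0001D367b")
print([hex(u) for u in units])      # four code units for three characters

tail = units_to_str(units[2:])      # the cut lands between the two halves
try:
    tail.encode("utf-8")            # strict conversion, like -UTF8String
except UnicodeEncodeError as error:
    print("strict UTF-8 conversion refused:", error.reason)

# Forcing the conversion shows the ill-formed 3-byte sequence for the lone half.
print(tail.encode("utf-8", errors="surrogatepass"))
```

The strict encoder refuses the string for the same reason -UTF8String returns NULL: the only bytes it could emit for the lone half would not be well-formed UTF-8.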