Archive for the “Web” Category
In this modern day, HTML entities can reference arbitrary Unicode codepoints. For example, &#x2603; is the entity for ☃. Not surprisingly, WebKit appears to use UTF-16 internally to represent Unicode strings, or at the very least when interpreting HTML entities. One of the big benefits of using UTF-16 is that every character is represented by 2 bytes (the 16 in UTF-16 means 16 bits). Contrast this with UTF-8, where a single character can be represented by anywhere from 1 to 4 bytes, or UTF-32, where every character requires 4 bytes (i.e. twice as much as UTF-16). Clearly UTF-16 seems to be useful, as it’s not too much larger than UTF-8 for ASCII strings (only double the size), and you can jump to any character with a simple index into the string.
However, one of the often-ignored aspects of UTF-16 is the surrogate pair. Unicode contains more than 0xFFFF codepoints, and yet a single UTF-16 “character” (or unichar) can only reference up to U+FFFF. The solution to this is to take the codepoints in Unicode planes 1-16 (U+10000 - U+10FFFF) and represent them as 2 unichars. This is a surrogate pair. You can find more information on this in the Wikipedia entry for UTF-16, but to put it simply, a surrogate pair uses a range of codepoints that don’t represent real characters (U+D800 - U+DFFF) and uses them in combination to represent all the characters in the other planes.
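The arithmetic behind a surrogate pair is short enough to sketch in Python. This is only an illustration of the UTF-16 encoding rule, not WebKit's code, and the function names are made up for the example.

```python
from typing import Tuple


def to_surrogate_pair(codepoint: int) -> Tuple[int, int]:
    """Split a codepoint from planes 1-16 into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = codepoint - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go in the low surrogate
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the original codepoint."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1D367)
print(f"0x{high:04X} 0x{low:04X}")
```

Note that the high half always falls in U+D800 - U+DBFF and the low half in U+DC00 - U+DFFF, which is how a decoder can tell which half of a pair it is looking at.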
The reason this is interesting is that it exposes a quirk in how WebKit interprets HTML entities. WebKit properly converts entities that represent characters outside of plane 0 into a surrogate pair, such as &#x1D367; (𝍧). This gets converted into 0xD834DF67. The quirk is that if you give it the surrogate pair codepoints directly, it doesn’t realize they’re not real characters individually and passes them through unscathed, so that same character can be written as &#xD834;&#xDF67; (𝍧). Now this doesn’t seem particularly harmful, except that if you only write the first of these entities, WebKit gets very confused: it ends up throwing away the entire rest of the line of rendered text. Interestingly, it starts displaying text again after a line break, even if it’s just an implicit line break.
The ideal behavior here is that WebKit should just silently ignore any entities which reference a codepoint that’s part of a surrogate pair. The fact that it doesn’t do so doesn’t really hurt anything, but I thought it was worth documenting.
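Here is a minimal sketch of that "silently ignore" behavior in Python. It is a hypothetical pre-filter for illustration, not how WebKit or any real parser is implemented, and the names `NUMERIC_ENTITY` and `strip_surrogate_entities` are invented for the example. For comparison, Python's own `html.unescape` takes a different approach and substitutes U+FFFD for such entities rather than dropping them.

```python
import html
import re

# Matches numeric character references such as &#9731; or &#x2603;
NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def is_surrogate(codepoint):
    """True for codepoints reserved for UTF-16 surrogate halves."""
    return 0xD800 <= codepoint <= 0xDFFF


def strip_surrogate_entities(markup):
    """Drop numeric entities that name a surrogate; leave all others alone."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        codepoint = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return "" if is_surrogate(codepoint) else match.group(0)

    return NUMERIC_ENTITY.sub(replace, markup)


# The lone high surrogate is removed; the real character survives.
cleaned = strip_surrogate_entities("a&#xD834;b &#x1D367;")
print(html.unescape(cleaned))
```

With this filter, a character spelled out as two surrogate entities disappears entirely, which matches the idea that such entities never name a real character on their own.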
Update: A question was raised on Twitter about how surrogate pairs affect indexing into a UTF-16 string. I didn’t know the answer, and strangely, I couldn’t find information on how to handle it with Google either, so I tested empirically.
NSString uses UTF-16 internally, so it was a great way to test. What I found was that each half of a surrogate pair is counted as a separate character. The -length of the NSString increases by 2 when you add a surrogate pair, and -substringFromIndex: will happily split up the surrogate pair for you. Of course, if you do split a surrogate pair, then attempting to convert the NSString into another encoding, even with the simple -UTF8String, will return NULL, as such a conversion is illegal. When you generate a Unicode stream it has to be well-formed, so you cannot generate a stream containing half of a surrogate pair; half of a surrogate pair in UTF-16 would be converted into a single invalid 3-byte UTF-8 sequence.
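The same effect can be reproduced outside of Cocoa. The sketch below uses Python rather than NSString, so it only mirrors the behavior described above: it counts UTF-16 code units by hand, cuts a string in the middle of a surrogate pair, and shows that strict UTF-8 encoding then refuses the result. The helper names are invented for the example.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def can_encode_utf8(text):
    """True if text can be written out as well-formed UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


text = "a\U0001D367b"
units = utf16_units(text)
print(len(text), len(units))   # codepoints vs. UTF-16 code units

# Cutting at code-unit index 2 keeps "a" plus only the high surrogate.
head = "".join(chr(unit) for unit in units[:2])
print(can_encode_utf8(text), can_encode_utf8(head))

# Forcing the lone surrogate through anyway yields the ill-formed 3-byte form.
print(chr(units[1]).encode("utf-8", "surrogatepass"))
```

The `surrogatepass` error handler exists precisely to let lone surrogates through on request; without it, the encoder rejects the string, much like -UTF8String returning NULL.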
Comments Off on WebKit and handling of surrogate pairs in HTML entities
There’s a new site that’s been getting really popular lately called Twitter.
If you haven’t heard of it, it’s basically a status message, like IM, but actually usable.
Twitter provides XML/JSON feeds, as well as a way to post updates over the web, over
IM (using Jabber), or over SMS. You can also receive “tweets” (Twitter updates) that your
friends posted with any of these mechanisms. There are also various applications (such as
Twitterific), if you want yet another way to send/receive tweets.
By far the most powerful aspect of Twitter is the SMS support. I have free SMS with my
Sidekick data plan, but I’ve never had a reason to use it before. But now I can receive
tweets on my phone all day long, and send my own no matter where I am. For example, I sent
this tweet while in class.
If you want to keep track of what I’m doing, just take a gander at my profile.
And if you sign up for your own account, you can add me as a friend to receive my tweets.
If you add me and I recognize your name, I’ll probably add you back too.
There’s a BBC News Twitter profile. How cool is that?
Comments Off on Tweet Tweet
After months of absence, Typosphere has returned from the dead!
We migrated off of Planet Argon and onto DreamHost, where we should have more control.
We also upgraded to Trac 0.10.3 and turned off anonymous editing (users now have to register to file a ticket).
This should (hopefully) prevent the issue that led to Typosphere dying in the first place.
One important thing to note is that as part of this process, we also moved the Subversion repository.
Unfortunately, the old repository was hosted as an svn:// URI using the typosphere.org domain, which meant
there was no way to preserve this URI (since we can’t run long-lived background daemons on DreamHost). The
new URI uses HTTP and a new subdomain, so if necessary we can move the repository without moving the website.
The new repository URL is http://svn.typosphere.org/typo/trunk.
Haml is a new markup format for Ruby on Rails apps that just hit 1.0. At first glance
it looks pretty odd, but it turns out to be really easy to write in, and it’s shorter and,
actually, easier to read than the equivalent eRB.
I think I’m going to convert Typo to use Haml for all its templates. I already did Azure’s layout
file and it was pretty simple.
Last night I tried searching MacUpdate for bittorrent or Azureus and got zero results. I thought that to be very strange, and wondered if Joel decided to take an anti-piracy stance (despite the fact that BitTorrent is very useful for distribution of perfectly legal materials). But today when I looked again, I got all the results I expected. Very strange.
Comments Off on MacUpdate oddity
My friend from college just sent me a link to The Ultimate Showdown of Ultimate Destiny, which is the greatest thing I have seen in a while. I managed to find the website for the guy(s?) who did the music. Their website can be found at eviltrailmix.com/lemondream/, although they hit their bandwidth limit today; they should be back tomorrow.
Update: Comments disabled because I’m tired of all the inane comments I’ve been receiving. There’s really no discussion to be had here, so no point in having comments.
Comments Off on The Ultimate Showdown of Ultimate Destiny
Inspired by a script mentioned on ranchero.com, I wrote a Ruby script that generates an RSS feed for all the Crash Reporter logs on your machine. Just create a New Special Subscription in NetNewsWire, point it at the script, and you’re all set.
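The general shape of such a script can be sketched as follows. This is not the original Ruby script; it is a rough Python illustration, and the `build_feed` name, the `*.crash.log` file pattern, and the channel title are all assumptions made for the example.

```python
from email.utils import formatdate
from pathlib import Path
from xml.etree import ElementTree as ET


def build_feed(log_dir, pattern="*.crash.log"):
    """Build an RSS 2.0 document with one item per crash log, newest first."""
    directory = Path(log_dir).resolve()
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Crash Reporter logs"
    ET.SubElement(channel, "link").text = directory.as_uri()
    ET.SubElement(channel, "description").text = "Crash logs found in " + str(directory)

    # Sort by modification time so the most recent crash appears first.
    logs = sorted(directory.glob(pattern), key=lambda p: p.stat().st_mtime, reverse=True)
    for log in logs:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = log.name
        ET.SubElement(item, "link").text = log.as_uri()
        ET.SubElement(item, "pubDate").text = formatdate(log.stat().st_mtime)
        # The log body becomes the item text; ElementTree escapes < and &.
        ET.SubElement(item, "description").text = log.read_text(errors="replace")

    return ET.tostring(rss, encoding="unicode")
```

A script subscription would then just print the result of `build_feed` for whatever directory holds the logs, since the reader consumes the feed from the script's standard output.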
So why has been talking about MouseHole lately. If you’re unaware, MouseHole is a Ruby script that acts as a web proxy and filters HTML documents via Ruby scripts. Or for a much better look at it, go look at what why wrote.
I haven’t actually done anything with this new capability yet aside from a rather silly example script which simply adds a button to Google that asks MouseHole for a random number. I’d really like to extend Google Maps, but that will require delving into how it works, and it’s far too late to do that now, as I need to get to bed.
Anyhow, I put together a patch and sent it to the MouseHole Scripters mailing list. If you check out MouseHole from the CVS repository, you can apply my patch and test it. Or you can just wait to see if it gets added to MouseHole.
From the people who brought us Who Owns The Fish? comes a new logic puzzle, School of Government. I just ran across this today, and the answer deadline is tomorrow, so that was cutting it a little close, but with a pen and a pad of paper I think I’ve solved it. And no, I’m not going to give you the solution (like I did last time). Go solve it on your own.
Comments Off on The fish returns
I ran across a link on SvN to an interesting logic puzzle called Whose Fish?. It purports to be a logic puzzle created by Albert Einstein, and claims Einstein said 98% of the population would not be able to solve it. So naturally, I was intrigued.
If you don’t want to solve the puzzle yourself, feel free to check out some of my steps. Just please, don’t submit this as your own answer. If you’re going to submit an answer, do the work yourself. So anyway, here are some of my steps:
And here’s the final solution.