
A somewhat related story dealing with MaxInt in Javascript.

One of the worst bugs I encountered, years ago, involved JavaScript converting integers from strings to numbers. JavaScript has no 64-bit integer type: every number is an IEEE 754 double, whose 53-bit significand can represent integers exactly only up to 2^53, while most other languages have a 64-bit long. When the backend language generated JSON containing integers larger than that, the horror started at the frontend. JavaScript happily rounded them to the nearest representable double while parsing. It was not a happy tale, since those long integers were account numbers: the wrong accounts ended up getting updated, seemingly at random.



I think the lesson there is that numeric types should only be used for things you actually want to do arithmetic with. An account ID that just happens to be all digits should still be stored and transmitted as a string.


The lesson I got was to be very careful about data type limitations when crossing language boundaries. The problem is not limited to numeric types: different encodings and code pages can screw up string values as well.


If you're not using UTF-8 everywhere then you're doing it wrong. Exceptions made for legacy systems, but you should get that data into UTF-8 as soon as possible.


It's unwise to lazily adopt a silver bullet without understanding the context and thinking through the consequences. I could just as well say: if you are not using XML with an explicit encoding declaration to encode everything everywhere, you are doing it wrong, and you should get all your data into XML as soon as possible. Of course, that sounds ludicrous.


XML is just one data storage and exchange format among many, with no particularly interesting properties and no compelling reason to use it. UTF-8 is the only encoding that is ASCII compatible, widely accepted and expected, and able to represent any text you'll ever encounter.

I can come up with half a dozen reasons to use something other than XML for data storage. I've yet to hear anyone give me a compelling reason to use something other than UTF-8 for encoding strings. Just because what I said is absurd when you replace UTF-8 with XML doesn't mean the original was absurd.


UTF-8 is not efficient for random access.

I don't have a problem with UTF-8. I have a problem with the silver-bullet attitude of advocating one approach for all cases without thought. That's just intellectually lazy.


No encoding that can handle all the necessary languages will be efficient for random access.

I'm not saying don't think about it. But once you think about it, I think there's really only one sane conclusion to reach.


Never say never. UTF-32 handles them just fine.


Precomposed versus decomposed accents? Jamo versus precomposed Hangul syllables? A Unicode code point is rarely a useful thing to know about on its own, and code that assumes one code point equals one "character", for whatever definition of character is in use, is likely to work poorly with UTF-32.



