Suppose you want to encode the Australian flag. You might consider this to be one simple emoji character, but you may be in for a surprise: emoji aren't always represented as a single character. Many emoji are combinations of multiple Unicode code points.
For instance, the Australian flag may be represented as two code points, the regional indicator symbols for the letters A and U, written here with eight hex digits each:
U+0001F1E6
U+0001F1FA
This happens to display as a single glyph, the Australian flag, on some platforms, but may also display as two separate glyphs.
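To see the two code points concretely, here is a minimal Python sketch. The variable name `flag` is just for illustration; `unicodedata` is part of the standard library.

```python
import unicodedata

flag = "\U0001F1E6\U0001F1FA"  # the two code points listed above

# One line per code point, regardless of how many glyphs a platform draws.
for ch in flag:
    print(f"U+{ord(ch):08X}", unicodedata.name(ch))

print(len(flag))  # counts code points, not glyphs
```

In Python 3 this should report a length of 2, even where the string renders as a single flag.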
However, some escape syntaxes only support code points that fit in four hex digits (16 bits), those in the range U+0000 to U+FFFF. JSON's \uXXXX escape is one of these. When we escape characters outside that range (which, note, do not need to be escaped under the JSON spec), each character becomes two escapes, so the result looks like two different code points. Quoth RFC 7159:
To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a 12-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".
So in fact, the Australian flag may be concretely represented in escaped JSON as this string:
"\ud83c\udde6\ud83c\uddfa"
As you can see, the last two hex digits of the second escape in each pair (e6 and fa) match the last two hex digits of the code points above.
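The trailing digits line up because of how UTF-16 builds a surrogate pair: subtract 0x10000 from the code point, put the top 10 bits into the high surrogate (starting at 0xD800), and put the bottom 10 bits, unchanged, into the low surrogate (starting at 0xDC00). The sketch below works through that arithmetic in Python and compares the result with the standard `json` module. The helper names `to_surrogate_pair` and `escape_non_bmp` are made up for this example, and the second helper assumes every character in its input lies outside the Basic Multilingual Plane.

```python
import json


def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the range U+10000 to U+10FFFF")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def escape_non_bmp(text):
    """Write each character as a 12-character escape (assumes all are non-BMP)."""
    pairs = (to_surrogate_pair(ord(ch)) for ch in text)
    return "".join(f"\\u{high:04x}\\u{low:04x}" for high, low in pairs)


flag = "\U0001F1E6\U0001F1FA"
manual = '"' + escape_non_bmp(flag) + '"'

print(manual)
# json.dumps escapes non-ASCII characters by default (ensure_ascii=True).
print(manual == json.dumps(flag))
# The escaped and unescaped forms decode to the same string.
print(json.loads(manual) == json.loads(json.dumps(flag, ensure_ascii=False)) == flag)
```

Passing `ensure_ascii=False` to `json.dumps` leaves the two code points unescaped, which is equally valid JSON as long as the document is encoded as UTF-8.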