Emojis with links in rich text field produce weird behaviour

For anyone else dealing with this issue 5+ years later, the underlying cause has to do with Unicode strings and how Prismic seems to be counting them for the span’s start / end.

I’m building an integration to translate content from Prismic, and have managed to do a bit of reverse engineering as to the cause. Or rather, how to get around it.

Note: I’m using the API directly, not any clients.

I’m guessing Prismic devs are using JS, where strings are UTF-16 and lengths / indexes are counted in UTF-16 code units. But a JSON payload is UTF-8, so I’m guessing there’s a piece of code generating these start / end counts as UTF-16 and then not accounting for other encodings. Classic.

So here’s a high-level workflow on getting the counts to match up:

  • Get your JSON payload, which will be UTF-8.
  • Prismic returns escaped Unicode, so my dirty hack is to run the escaped string through a JSON encode / decode, which removes the escapes (it’s still UTF-8 at this point).
  • Now convert the string to UTF-16. I use UTF-16LE for reasons.

Now your string is in the same encoding as used by the JS file generating the start / end counts. Count the code units of the string and you’ll see it aligns with your span’s start / end numbers.
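Here’s a minimal Python sketch of that workflow. The payload is made up for illustration (the field names mirror what I see in the API response, but treat them as assumptions), and utf16_len / slice_utf16 are my own helper names, not anything from Prismic.

```python
import json

def utf16_len(text: str) -> int:
    """Count UTF-16 code units (each unit is 2 bytes in UTF-16LE)."""
    return len(text.encode("utf-16-le")) // 2

def slice_utf16(text: str, start: int, end: int) -> str:
    """Return the part of text covered by UTF-16 code-unit offsets [start, end)."""
    units = text.encode("utf-16-le")
    # Offsets are in code units, so multiply by 2 to get byte offsets.
    return units[start * 2:end * 2].decode("utf-16-le")

# Hypothetical payload: escaped emoji followed by linked text.
raw = (
    '{"text": "Hi \\ud83d\\udc49 click here", '
    '"spans": [{"start": 6, "end": 16, "type": "hyperlink"}]}'
)

block = json.loads(raw)   # decoding the JSON removes the \u escapes
text = block["text"]
span = block["spans"][0]

linked = slice_utf16(text, span["start"], span["end"])
naive = text[span["start"]:span["end"]]   # code-point slicing, off by one here

print(linked)  # click here
print(naive)   # lick here
```

The naive slice drifts by one because the emoji is one code point but two UTF-16 code units, so every span after an emoji ends up shifted.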

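Heads up that your programming language might or might not count code units with the default len() or strlen() funcs. For example, Python 3’s len() counts code points and PHP’s strlen() counts bytes, so neither matches the span numbers once an emoji is involved. Check your documentation to confirm.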
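To see the difference, here’s a quick Python check of the three ways the same string can be counted. The sample string is just an illustration.

```python
s = "a\U0001F449b"  # "a", a pointing-hand emoji, "b"

code_points = len(s)                           # what Python's len() gives you
utf8_bytes = len(s.encode("utf-8"))            # what a byte-counting strlen gives you
utf16_units = len(s.encode("utf-16-le")) // 2  # what JS's .length gives you

print(code_points, utf8_bytes, utf16_units)  # 3 6 4
```

Only the UTF-16 code unit count lines up with the span start / end values.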

Hope this helps someone, reach out if you get stuck 👉 aaron@onpurpose.studio