What is a span's start / end actually counting? Code Units or Graphemes?

Imagine I have the following text:

We’re big fans of pizza 🍕. Pizza is an Italian, specifically Neapolitan, dish typically consisting of a flat base of leavened wheat-based dough topped with tomato, cheese, and other ingredients, baked at a high temperature, traditionally in a wood-fired oven.

When I fetch this content via the API, the non-ASCII characters in the output are escaped, as seen here:

{
  "content": [
    {
      "type": "paragraph",
      "text": "We\u2019re big fans of pizza \ud83c\udf55. Pizza is an Italian, specifically Neapolitan, dish typically consisting of a flat base of leavened wheat-based dough topped with tomato, cheese, and other ingredients, baked at a high temperature, traditionally in a wood-fired oven.",
      "spans": [
        { "start": 18, "end": 23, "type": "strong" },
        {
          "start": 28,
          "end": 33,
          "type": "hyperlink",
          "data": {
            "link_type": "Web",
            "url": "https://en.wikipedia.org/wiki/Pizza",
            "target": "_self"
          }
        }
      ],
      "direction": "ltr"
    }
  ]
}

Can you confirm whether start and end count code units, graphemes, or something else? I am working with a client to translate their Prismic content, and will be relying very heavily on non-ASCII Unicode characters.

Thanks.

It’s based on the number of characters in the text. Out of curiosity, what’s the goal behind checking this? If it’s for implementing your own Rich Text logic, we’d recommend avoiding that: our SDKs already handle this automatically!

“It’s based on the number of characters in the text.”

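You’re not wrong, but specifically it’s counting UTF-16 code units, not Unicode code points or graphemes. JSON is encoded as UTF-8, so the start / end values won’t line up if you index the text by bytes or by code points. In the example above, the pizza emoji takes two UTF-16 code units (the surrogate pair \ud83c\udf55), which is why the hyperlink span on the second “Pizza” starts at 28 rather than 27. You can see my discovery in this thread.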
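To make that concrete, here is a minimal Python sketch that checks the two spans from the response above. Python strings index by code point, so the text is encoded to UTF-16 first and sliced at two bytes per code unit. The helper name utf16_slice is my own, not part of any Prismic SDK, and the text is a shortened copy of the paragraph.

```python
def utf16_slice(text: str, start: int, end: int) -> str:
    """Return the substring covered by UTF-16 code unit offsets [start, end)."""
    # Every UTF-16 code unit is exactly 2 bytes, so unit offset N is byte offset 2 * N.
    encoded = text.encode("utf-16-le")
    return encoded[start * 2:end * 2].decode("utf-16-le")


# Shortened copy of the paragraph; U+2019 is the apostrophe, U+1F355 the pizza emoji.
text = "We\u2019re big fans of pizza \U0001F355. Pizza is an Italian dish."

print(utf16_slice(text, 18, 23))  # pizza  (the "strong" span)
print(utf16_slice(text, 28, 33))  # Pizza  (the "hyperlink" span)

# Indexing by code point instead lands one character late after the emoji.
print(repr(text[28:33]))          # 'izza '
```

If an offset ever falls between the two halves of a surrogate pair, the decode step raises UnicodeDecodeError, which is a reasonable signal that the span itself is malformed.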

“Out of curiosity, what’s the goal behind checking this?”

I’ve built a Prismic integration for the purpose of translating content for enterprise clients (onpurpose.studio). Yes, your SDK allows me to go from Rich Text → HTML, but when that HTML gets through translation it needs to go HTML → Rich Text before it can be pushed back to Prismic as the target language.

For this reason I’ve built the ability to convert Rich Text bidirectionally, into and out of other languages.
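For the return trip, the offsets have to be written back as UTF-16 code units. Below is a rough Python sketch of that step, assuming the converter already knows each span's position as code point indices into the translated text. The names utf16_len and make_span are hypothetical, and the span's "data" object is left out for brevity.

```python
def utf16_len(s: str) -> int:
    """Number of UTF-16 code units needed to encode s."""
    return len(s.encode("utf-16-le")) // 2


def make_span(text: str, cp_start: int, cp_end: int, span_type: str) -> dict:
    """Build a span from code point indices, emitting UTF-16 code unit offsets."""
    return {
        # The UTF-16 offset of a position is the UTF-16 length of everything before it.
        "start": utf16_len(text[:cp_start]),
        "end": utf16_len(text[:cp_end]),
        "type": span_type,
    }


# Shortened sample text containing one character outside the Basic Multilingual Plane.
text = "We\u2019re big fans of pizza \U0001F355. Pizza is an Italian dish."

cp_start = text.index("Pizza")  # 27 when counted in code points
span = make_span(text, cp_start, cp_start + len("Pizza"), "hyperlink")
print(span)  # {'start': 28, 'end': 33, 'type': 'hyperlink'}
```

Re-encoding the prefix for every span is quadratic in the worst case. For long documents, a single pass that builds a code point to code unit offset table would likely be a better fit.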
