What is a span's start / end actually counting? Code Units or Graphemes?

APOnPurpose · October 29, 2025, 8:29pm

Imagine I have the following text:

We’re big fans of pizza 🍕. Pizza is an Italian, specifically Neapolitan, dish typically consisting of a flat base of leavened wheat-based dough topped with tomato, cheese, and other ingredients, baked at a high temperature, traditionally in a wood-fired oven.

When I get this content via the API, the output is escaping the unicode characters, as seen here:

{
  "content": [
    {
      "type": "paragraph",
      "text": "We\u2019re big fans of pizza \ud83c\udf55. Pizza is an Italian, specifically Neapolitan, dish typically consisting of a flat base of leavened wheat-based dough topped with tomato, cheese, and other ingredients, baked at a high temperature, traditionally in a wood-fired oven.",
      "spans": [
        { "start": 18, "end": 23, "type": "strong" },
        {
          "start": 28,
          "end": 33,
          "type": "hyperlink",
          "data": {
            "link_type": "Web",
            "url": "https://en.wikipedia.org/wiki/Pizza",
            "target": "_self"
          }
        }
      ],
      "direction": "ltr"
    }
  ]
}

Can you confirm whether start and end are counted in code units, graphemes, or something else? I am working with a client to translate their Prismic content, and will be relying very heavily on Unicode characters.

Thanks.

Pau · October 31, 2025, 7:31pm

It’s based on the number of characters in the text. Out of curiosity, what’s the goal behind checking this? If it’s for implementing your own Rich Text logic, we’d recommend avoiding that; our SDKs already handle this automatically!

APOnPurpose · November 25, 2025, 6:30pm

It’s based on the number of characters in the text.

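You’re not wrong, but specifically it’s counting UTF-16 code units. The JSON response is UTF-8 encoded, so the start / end values only line up if you index the text in UTF-16 code units. Count bytes or code points instead and the offsets drift as soon as the text contains characters like the emoji above. You can see my discovery in this thread.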
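
For anyone landing here from a language whose strings are not UTF-16 (Python, for example), here is a minimal sketch of how the spans from my example can be applied. It assumes the offsets are UTF-16 code units, which is what the sample response above suggests; `utf16_slice` is my own helper name, not part of any Prismic SDK, and the text is shortened for readability.

```python
def utf16_slice(text: str, start: int, end: int) -> str:
    """Slice `text` using offsets counted in UTF-16 code units."""
    # UTF-16-LE uses exactly 2 bytes per code unit and adds no BOM,
    # so code-unit offsets map to byte offsets by doubling.
    encoded = text.encode("utf-16-le")
    return encoded[start * 2 : end * 2].decode("utf-16-le")


# Shortened version of the paragraph text from the API response.
text = "We\u2019re big fans of pizza \U0001F355. Pizza is an Italian dish."

strong = utf16_slice(text, 18, 23)  # the "strong" span
link = utf16_slice(text, 28, 33)    # the "hyperlink" span
naive = text[28:33]                 # Python slices by code point instead

print(strong)       # pizza
print(link)         # Pizza
print(repr(naive))  # 'izza '
```

The naive slice is off by one because the pizza emoji is a single code point but two UTF-16 code units. Note that an offset landing in the middle of a surrogate pair would make the `decode` call raise, so malformed spans fail loudly rather than silently.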

Out of curiosity, what’s the goal behind checking this?

I’ve built a Prismic integration for the purpose of translating content for enterprise clients (onpurpose.studio). Yes, your SDK allows me to go from Rich Text → HTML, but when that HTML gets through translation it needs to go HTML → Rich Text before it can be pushed back to Prismic as the target language.

For this reason I’ve built the ability to convert Rich Text bi-directionally and in / out of other langs.
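
Going the other way (HTML → Rich Text) means producing start / end values in that same unit. Below is a rough sketch of the offset conversion, again assuming UTF-16 code units; `to_utf16_offset` is a hypothetical helper name, and the span dict only mirrors the shape shown in the response above.

```python
def to_utf16_offset(text: str, index: int) -> int:
    """Convert a code-point index into `text` to a UTF-16 code-unit offset."""
    # Encode only the prefix before `index`; each code unit is 2 bytes.
    return len(text[:index].encode("utf-16-le")) // 2


# Shortened version of the paragraph text from the API response.
text = "We\u2019re big fans of pizza \U0001F355. Pizza is an Italian dish."

# Locate the linked word by code-point index, as Python does natively.
start_cp = text.find("Pizza")
end_cp = start_cp + len("Pizza")

span = {
    "start": to_utf16_offset(text, start_cp),
    "end": to_utf16_offset(text, end_cp),
    "type": "hyperlink",
}

print(start_cp, end_cp)            # 27 32
print(span["start"], span["end"])  # 28 33
```

The converted values match the hyperlink span in the original response, which is a useful round-trip check before pushing translated content back.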
