How does the "similar" feature work?

The "similar" filter is vaguely documented here: https://prismic.io/docs/technologies/query-similar-documents-graphql

I found a tiny bit more information here: How does Prismic work out "Similar content"?

As things stand, I have so far shied away from this feature (for years, over several projects) because it is too vaguely documented. Instead I've felt much more confident writing my own code to do things like match tags because I'll actually understand how it's working, and I will be able to communicate it to my client in a way they'll understand too.

Please expand significantly on the documentation for that feature. I have a lot of questions. For example:

  1. Which types of fields are considered? Rich text and key text were mentioned but I'm not sure if that's exhaustive.
  2. Does it matter if the rich text is a single block or multi-block?
  3. Are all possible block types of a rich text field considered, e.g. headings, paragraphs, list items? Are images, embeds, etc. completely ignored, or will chunks of URLs and the like sneak their way in?
  4. What about alt text of image fields?
  5. What about alt text of image blocks of rich text fields?
  6. Does key text just include "key text" itself or are you lumping in things like "select" with that, which I presume are stored very similarly?
  7. What if the field is in a repeatable group? Is it still indexed?
  8. What if the field is in the non-repeatable area of a slice? Is it still indexed?
  9. What if the field is in the repeatable area of a slice? Is it still indexed?
  10. It sounds like it works by indexing words. How is a "word" defined? Can you share a regex or similar which the underlying routine uses to match a "word"?
  11. Is the search exhaustive, or does it find "enough" matches then stop, even if more relevant ones might have been found later?
  12. Do results come back in any particular order?

Hey @bart,

Thanks for these questions! As you've probably guessed, I'm going to have to have a conversation with our back-end developer team to get most of the answers. I'll get back to you soon with the answers, and take your suggestion to update the documentation at the same time :slight_smile:

Best,
Sam

Hey @bart,

I've got a response from our dev team on your questions:

  • Which types of fields are considered? Rich text and key text were mentioned but I'm not sure if that's exhaustive.

RichText, KeyText, Select, UID.

  • Does it matter if the rich text is a single block or multi-block?

No it doesn't.

  • Are all possible block types of a rich text field considered, e.g. headings, paragraphs, list items? Are images, embeds, etc. completely ignored, or will chunks of URLs and the like sneak their way in?

Only text blocks are taken into account. A text block can be a heading, a paragraph, a list item. Embeds and images are ignored.

  • What about alt text of image fields?

They are ignored.

  • What about alt text of image blocks of rich text fields?

They are ignored.

  • Does key text just include "key text" itself or are you lumping in things like "select" with that, which I presume are stored very similarly?

Although KeyText and Select are very similar, they are treated as independent field types.

  • What if the field is in a repeatable group? Is it still indexed?

Yes

  • What if the field is in the non-repeatable area of a slice? Is it still indexed?

Yes

  • What if the field is in the repeatable area of a slice? Is it still indexed?

Yes
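To make the answers so far concrete, here is a minimal sketch of which strings would end up being considered, based only on what is stated above. It is not Prismic's implementation. The `field_types` mapping and its labels (`rich_text`, `key_text`, `select`, `group`, `slices`) are hypothetical and stand in for the custom type model, and treating `preformatted` blocks as non-text is an assumption.

```python
# Assumption: rich text blocks carry a "type" such as "heading1".."heading6",
# "paragraph", "list-item", "o-list-item", "image" or "embed".
TEXT_BLOCK_PREFIXES = ("heading", "paragraph", "list-item", "o-list-item")


def indexable_text(fields, field_types):
    """Collect strings from the field types the answers say are considered."""
    out = []
    for name, value in fields.items():
        kind = field_types.get(name)  # hypothetical label for this field
        if value is None:
            continue
        if kind == "rich_text":
            # Only text blocks count; image and embed blocks are skipped,
            # so their alt text and URLs never reach the output.
            out += [
                block["text"]
                for block in value
                if block.get("type", "").startswith(TEXT_BLOCK_PREFIXES)
                and block.get("text")
            ]
        elif kind in ("key_text", "select"):
            out.append(value)
        elif kind == "group":
            # Fields inside a repeatable group are still considered.
            for item in value:
                out += indexable_text(item, field_types)
        elif kind == "slices":
            # Both the non-repeatable and repeatable areas are considered.
            for slice_ in value:
                out += indexable_text(slice_.get("primary", {}), field_types)
                for item in slice_.get("items", []):
                    out += indexable_text(item, field_types)
        # Anything else (image fields, links, dates, ...) is ignored.
    return out


def document_text(doc, field_types):
    """UID first (if any), then everything found in the document's data."""
    uid = [doc["uid"]] if doc.get("uid") else []
    return uid + indexable_text(doc.get("data", {}), field_types)


doc = {
    "uid": "hello-world",
    "data": {
        "title": [{"type": "heading1", "text": "Hello world", "spans": []}],
        "category": "News",
        "hero": {"url": "https://example.com/a.png", "alt": "A picture"},
        "body": [
            {
                "slice_type": "text",
                "primary": {"intro": "Short intro"},
                "items": [
                    {
                        "content": [
                            {"type": "paragraph", "text": "First item", "spans": []},
                            {"type": "image", "url": "https://example.com/b.png", "alt": "ignored"},
                        ]
                    }
                ],
            }
        ],
    },
}
field_types = {
    "title": "rich_text",
    "category": "select",
    "hero": "image",
    "body": "slices",
    "intro": "key_text",
    "content": "rich_text",
}
texts = document_text(doc, field_types)
```

Here `texts` contains the UID, the heading, the select value, the key text and the paragraph, while both alt texts and both image URLs are left out.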

  • It sounds like it works by indexing words. How is a "word" defined? Can you share a regex or similar which the underlying routine uses to match a "word"?

Our search engine is able to take a word, find its root, and match the term with all its variants/conjugations, depending on the locale of the content.

  • Is the search exhaustive, or does it find "enough" matches then stop, even if more relevant ones might have been found later?

The search is exhaustive, and our API offers pagination to go over each result.

  • Do results come back in any particular order?

Yes, most relevant first. We give more priority to RichText content, especially headings.
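Since the search is exhaustive and the results are paginated with the most relevant first, walking every page in order should give the full ranked list. Below is a small sketch of that loop. `fetch_page` is a hypothetical callable you would supply (for example a thin wrapper around an HTTP request); the only things assumed about the response are the `results` and `total_pages` keys of the REST API's paginated responses. `fake_fetch` exists purely to make the example runnable offline.

```python
def iter_all_results(fetch_page, page_size=20):
    """Yield every result in API order by following the pagination."""
    page = 1
    while True:
        response = fetch_page(page, page_size)
        yield from response["results"]
        # Stop once the last page has been consumed (also covers 0 pages).
        if page >= response["total_pages"]:
            break
        page += 1


def fake_fetch(page, page_size):
    """Stand-in for a real request: serves five made-up IDs in pages."""
    ids = ["doc-a", "doc-b", "doc-c", "doc-d", "doc-e"]
    start = (page - 1) * page_size
    total_pages = -(-len(ids) // page_size)  # ceiling division
    return {
        "results": [{"id": i} for i in ids[start:start + page_size]],
        "total_pages": total_pages,
    }


ranked_ids = [result["id"] for result in iter_all_results(fake_fetch, page_size=2)]
```

Because the generator only concatenates pages, the order in `ranked_ids` is whatever order the API returned, so the relevance ranking is preserved.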

Thanks for posting this question. Let me know if you have others!

Best,
Sam


That's really helpful; thank you. It would be very nice if that information could make its way into the documentation.


Actually I do have a followup question here.

With the question about how a "word" is defined, I meant how the text is tokenized, but it sounds like the engine is more sophisticated than I was imagining, so my intended question is probably moot.

This answer brings up a new question though, since clearly how this is working is language-sensitive. Does this take language into account, i.e. does it work in languages other than English?


@bart Cool, I'm glad this is useful! And yes, I will be adding some of this information to the Predicate reference documents (which I'm in the process of rewriting).

For the word indexing, the engine uses your content's locale. So, if the locale is fr-fr, it will process the content in French.
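As a rough illustration of what that looks like from the querying side, here is a sketch that builds a REST search URL for documents similar to a given one in a given locale. The repository name, ref and document ID are placeholders, and the helper `similar_search_url` is hypothetical. It assumes the `similar("<id>", <n>)` predicate form and the `lang`, `page` and `pageSize` query parameters of the REST API, so check those against the current reference before relying on it. The locale-aware word processing itself happens on Prismic's side; the `lang` parameter only selects which locale's documents are queried.

```python
import json
from urllib.parse import urlencode


def similar_search_url(repo, ref, document_id, max_value, lang, page=1, page_size=20):
    """Build a search URL using the `similar` predicate for one locale."""
    # json.dumps gives a correctly quoted and escaped string literal.
    predicate = f"[[similar({json.dumps(document_id)}, {int(max_value)})]]"
    params = {
        "ref": ref,            # content ref, obtained from the API root
        "q": predicate,
        "lang": lang,          # e.g. "fr-fr" to query the French content
        "page": page,
        "pageSize": page_size,
    }
    return f"https://{repo}.cdn.prismic.io/api/v2/documents/search?{urlencode(params)}"


# Placeholder values, not real identifiers.
url = similar_search_url("your-repo-name", "YOUR_REF", "X123", 10, "fr-fr")
```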
