How does the "similar" feature work?

The "similar" filter is vaguely documented here: Fetch Data with GraphQL - Documentation - Prismic

I found a tiny bit more information here: How does Prismic work out "Similar content"?

As things stand, I have so far shied away from this feature (for years, over several projects) because it is too vaguely documented. Instead I've felt much more confident writing my own code to do things like match tags because I'll actually understand how it's working, and I will be able to communicate it to my client in a way they'll understand too.

Please expand significantly on the documentation for that feature. I have a lot of questions. For example:

  1. Which types of fields are considered? Rich text and key text were mentioned but I'm not sure if that's exhaustive.
  2. Does it matter if the rich text is a single block or multi-block?
  3. Are all possible block types of a rich text field considered, eg headings, paragraph, list item? Are images and embeds etc completely ignored, or will chunks of URLs etc sneak their way in?
  4. What about alt text of image fields?
  5. What about alt text of image blocks of rich text fields?
  6. Does key text just include "key text" itself or are you lumping in things like "select" with that, which I presume are stored very similarly?
  7. What if the field is in a repeatable group? Is it still indexed?
  8. What if the field is in the non-repeatable area of a slice? Is it still indexed?
  9. What if the field is in the repeatable area of a slice? Is it still indexed?
  10. It sounds like it works by indexing words. How is a "word" defined? Can you share a regex or similar which the underlying routine uses to match a "word"?
  11. Is the search exhaustive, or does it find "enough" matches then stop, even if more relevant ones might have been found later?
  12. Do results come back in any particular order?

Hey @bart,

Thanks for these questions! As you've probably guessed, I'm going to have to have a conversation with our back-end developer team to get most of the answers. I'll get back to you soon with the answers, and take your suggestion to update the documentation at the same time :)

Best,
Sam

Hey @bart,

I've got a response from our dev team on your questions:

  • Which types of fields are considered? Rich text and key text were mentioned but I'm not sure if that's exhaustive.

RichText, KeyText, Select, UID.

  • Does it matter if the rich text is a single block or multi-block?

No it doesn't.

  • Are all possible block types of a rich text field considered, eg headings, paragraph, list item? Are images and embeds etc completely ignored, or will chunks of URLs etc sneak their way in?

Only text blocks are taken into account. A text block can be a heading, a paragraph, a list item. Embeds and images are ignored.

  • What about alt text of image fields?

They are ignored.

  • What about alt text of image blocks of rich text fields?

They are ignored.

  • Does key text just include "key text" itself or are you lumping in things like "select" with that, which I presume are stored very similarly?

Even if KeyText and Select are very close, they are independent.

  • What if the field is in a repeatable group? Is it still indexed?

Yes

  • What if the field is in the non-repeatable area of a slice? Is it still indexed?

Yes

  • What if the field is in the repeatable area of a slice? Is it still indexed?

Yes

  • It sounds like it works by indexing words. How is a "word" defined? Can you share a regex or similar which the underlying routine uses to match a "word"?

Our search engine is able to take a word, find its root, and match the term with all its variants and conjugations, depending on the locale of the content.

  • Is the search exhaustive, or does it find "enough" matches then stop, even if more relevant ones might have been found later?

The search is exhaustive, and our API offers pagination to go over every result.

  • Do results come back in any particular order?

Yes, most relevant first. We give more priority to RichText content, especially headings.

Thanks for posting this question. Let me know if you have others!

Best,
Sam

That's really helpful; thank you. It would be very nice if that information could make its way into the documentation.

Actually I do have a followup question here.

I actually meant how is it tokenized, but it sounds like it's more sophisticated than I was imagining so my intended question is probably moot.

This answer brings up a new question though, since clearly how this is working is language-sensitive. Does this take language into account, i.e. does it work in languages other than English?

@bart Cool, I'm glad this is useful! And yes, I will be adding some of this information to the Predicate reference documents (which I'm in the process of rewriting).

For the word indexing, the engine uses your content's locale. So, if the locale is fr-fr, it will process the content in French.

@samlittlefair were you able to update the docs with all this info? I haven't found anything about this anywhere, and I found all these questions very helpful.

I have one more question that I haven't seen answered: how long does it take for the data to get indexed? I have a few documents created with some RichText fields, KeyText fields, and a UID, but every time I run the similar query I get nothing in return. Is there any way to manually force the re-indexing process?

Thanks!

Hi @levijesica,

Yes, we published a Technical Reference for the REST API, which includes some information about how the fulltext predicate works. The similar predicate follows the same principles.

Prismic content is cached by ref, but if you're using our development kits, you should always be querying with the latest ref automatically. That means that nothing should be cached, and you should get the expected results. If the similar predicate isn't working as expected, I could take a look to see if there's an issue.
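For anyone querying without a development kit, here is a minimal sketch of picking the latest (master) ref out of the API's root response before running a query. It assumes the response body contains a `refs` list whose master entry is flagged with `isMasterRef`; the sample data below is hand-written for illustration, so check the shape against your own repository's response.

```python
from typing import Any


def master_ref(api_response: dict) -> str:
    """Return the ref string of the entry flagged as the master ref.

    Assumption: the root API response has a "refs" list and the master
    entry carries "isMasterRef": True. Verify this against your own repo.
    """
    refs: list[dict[str, Any]] = api_response.get("refs", [])
    for ref in refs:
        if ref.get("isMasterRef"):
            return ref["ref"]
    raise ValueError("no master ref found in API response")


# Hypothetical response fragment, not real data.
sample_response = {
    "refs": [
        {"id": "release-1", "ref": "ZZZ-release", "label": "Draft release"},
        {"id": "master", "ref": "YYY-master", "label": "Master", "isMasterRef": True},
    ]
}

print(master_ref(sample_response))
```

Passing a freshly fetched ref with every query is what keeps a stale cached response from being served.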

If you want, you could share your project files with me as a ZIP file or GitHub repo. Alternatively, you could share your repo name (plus an access token, if necessary) and a detailed description of the query you're performing. You can send all of this in a DM if you like.

Sam

Hey @samlittlefair, sorry for the late response. After a while we were able to get the results back. I'm still not sure how long it takes to index everything, but now at least we get some results back. It'd be great if the embedded data could also be used as a reference to query similar items.

Thanks for the response anyway!

Thanks for the update, @levijesica! I'll add the note about embedded data as a feature request. I'll also check with the API team to see if there's any reason why you'd have a delay on the indexing, and get back to you.

Hi @samlittlefair, after I replied to you we found one document that for some reason is not getting any results back when querying the similar API. But if we query a different document ID, then we do get results back, even though the documents are the same kind and their data is also very similar. That's the reason why I asked before if the values were cached or if it was possible to reindex the similar results.

Is there any way you can take a look at these documents and let me know what the differences could be? I can give you the IDs of both of them.

@levijesica Yes, I'd be happy to take a look. Let me know the repo name and the document IDs. (You can send it in a DM if you like.)

Thanks for sharing that info, @levijesica. I looked at that document on the API, and it worked for me. Here's the syntax for the predicate:

similar( documentID, value )

... where the value is the maximum number of documents that a term may appear in to still be considered relevant. If the value is low, the results will be limited. If the value is high, the results will be broader. Looking at your API, I didn't get any results for the document you sent me if the value was under 30, but if I increased the value above 30, I got results as expected.
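As a rough sketch, this is how the predicate might be assembled into a REST API search URL. The double-bracket `q` format, the `documents/search` path, and the `page` parameter reflect my understanding of the REST API and should be checked against the Technical Reference; the repository name, ref, and document ID are placeholders.

```python
import json
from urllib.parse import urlencode


def similar_predicate(document_id: str, max_docs: int) -> str:
    """Build the similar predicate in the bracketed form used by the q parameter."""
    if max_docs < 1:
        raise ValueError("max_docs must be at least 1")
    # json.dumps wraps the ID in double quotes and escapes it safely.
    return f"[[similar({json.dumps(document_id)}, {max_docs})]]"


def search_url(repo: str, ref: str, document_id: str, max_docs: int, page: int = 1) -> str:
    """Build a search URL (assumed endpoint layout; verify for your repository)."""
    base = f"https://{repo}.cdn.prismic.io/api/v2/documents/search"
    params = {"ref": ref, "q": similar_predicate(document_id, max_docs), "page": page}
    return f"{base}?{urlencode(params)}"


# Placeholder values for illustration only.
print(similar_predicate("abc123", 30))
print(search_url("your-repo-name", "some-ref", "abc123", 30))
```

Because the second argument is a ceiling on how common a term may be, raising it (for example from 10 to 50) widens the result set rather than narrowing it.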

It might be counterintuitive, but this actually means that the content in that article is probably very similar to the content in lots of other articles. There are no words in that document that occur in fewer than 30 other documents in your repository, so when value is below 30, there are no results. Does that make sense?
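To make that behaviour concrete, here is a toy model of a "maximum document frequency" cutoff. It is not Prismic's implementation: the scoring (a count of shared terms) and the choice to include the source document in each term's frequency are assumptions made purely for illustration.

```python
from collections import Counter


def document_frequency(corpus: dict) -> Counter:
    """Count how many documents each term appears in."""
    df = Counter()
    for terms in corpus.values():
        df.update(set(terms))
    return df


def toy_similar(corpus: dict, doc_id: str, max_doc_freq: int) -> list:
    """Return IDs of other documents sharing at least one 'rare enough' term."""
    df = document_frequency(corpus)
    # Drop source terms that occur in more than max_doc_freq documents.
    # Assumption: the count includes the source document itself.
    usable = {t for t in corpus[doc_id] if df[t] <= max_doc_freq}
    scores = {
        other: len(usable & set(terms))
        for other, terms in corpus.items()
        if other != doc_id
    }
    matches = [d for d, s in scores.items() if s > 0]
    # Most shared terms first; ties broken by ID for a stable order.
    return sorted(matches, key=lambda d: (-scores[d], d))


corpus = {
    "a": {"cms", "api", "guide"},
    "b": {"cms", "api", "news"},
    "c": {"cms", "recipe"},
    "d": {"cms", "api", "guide", "intro"},
}

for limit in (1, 2, 3, 4):
    print(limit, toy_similar(corpus, "a", limit))
```

In this toy corpus every term of document `a` occurs in at least two documents, so a limit of 1 returns nothing, and each increase of the limit lets more common terms count and brings in more matches.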

I could be wrong in my analysis, so let me know if anything I said sounds inaccurate. And let me know if this leaves anything unanswered.

Thanks,
Sam

@samlittlefair Yes, that makes sense, and like you say, it's not very intuitive. In fact, I had to read the explanation many times until I finally understood how it works. I thought the value was the number of documents to filter, not the number of documents a term may occur in. I feel like it's very hard to control that number, since the amount of related content could grow over time. I believe it would be much better if that value were the minimum number of documents that have similar content in common instead of a maximum, so that if the number grows it won't affect the results. In any case, thanks for the explanation.

That's a useful suggestion. Thanks, @levijesica. I'll share it with the product team :)
