How to perform a fuzzy search on the elements of a text array

Hi team, I’m using CrateDB 5.10.11, now I have a table to store application log in object type, and some column is ARRAY(TEXT), I want to perform fuzzy query on these column with ARRAY(TEXT). How to implement it? Thanks.

cr> select * from sdtest where log['billIds'] like '%484575b%';
UnsupportedFunctionException[Unknown function: (doc.sdtest.log['billIds'] LIKE '%484575b%'), no overload found for matching argument types: (text_array, text). Possible candidates: op_like(text, text):boolean, op_like(text, text, text):boolean]

io.crate.exceptions.UnsupportedFunctionException: Unknown function: (doc.sdtest.log['billIds'] LIKE '%484575b%'), no overload found for matching argument types: (text_array, text). Possible candidates: op_like(text, text):boolean, op_like(text, text, text):boolean
        at io.crate.metadata.Functions.raiseUnknownFunction(Functions.java:370)
        at io.crate.metadata.Functions.resolveFunctionBySignature(Functions.java:262)
        at io.crate.metadata.Functions.get(Functions.java:163)
        at io.crate.metadata.Functions.get(Functions.java:141)
        at io.crate.analyze.expressions.ExpressionAnalyzer.allocateBuiltinOrUdfFunction(ExpressionAnalyzer.java:1239)
        at io.crate.analyze.expressions.ExpressionAnalyzer.allocateFunction(ExpressionAnalyzer.java:1212)
        at io.crate.analyze.expressions.ExpressionAnalyzer.allocateFunction(ExpressionAnalyzer.java:1203)
        at io.crate.analyze.expressions.ExpressionAnalyzer$InnerExpressionAnalyzer.visitLikePredicate(ExpressionAnalyzer.java:875)
        at io.crate.analyze.expressions.ExpressionAnalyzer$InnerExpressionAnalyzer.visitLikePredicate(ExpressionAnalyzer.java:449)
        at io.crate.sql.tree.LikePredicate.accept(LikePredicate.java:61)
        at io.crate.analyze.expressions.ExpressionAnalyzer.convert(ExpressionAnalyzer.java:250)
        at io.crate.analyze.expressions.ExpressionAnalyzer.lambda$generateQuerySymbol$1(ExpressionAnalyzer.java:259)
        at java.base/java.util.Optional.map(Optional.java:260)

The table definition are as follows.

CREATE TABLE IF NOT EXISTS "doc"."sdtest" (
   "log" OBJECT(DYNAMIC) AS (
      "zipkinSpanId" TEXT,
      "billIds" ARRAY(TEXT),
      "optUser" TEXT,
      "auditLog" OBJECT(DYNAMIC) AS (
         "fields" OBJECT(DYNAMIC) AS (
            "isUpload" TEXT,
            "retailPrice" TEXT,
            "updTime" TEXT,
            "uploadStatusMessage" TEXT
         ),
         "afterSave" OBJECT(DYNAMIC) AS (
            "isUpload" TEXT,
            "retailPrice" TEXT,
            "updTime" TEXT,
            "uploadStatusMessage" TEXT
         ),
         "beforeSave" OBJECT(DYNAMIC) AS (
            "isUpload" TEXT,
            "retailPrice" TEXT,
            "updTime" TEXT,
            "uploadStatusMessage" TEXT
         )
      ),
      "level" TEXT,
      "typeName" TEXT,
      "ccloudTraceId" TEXT,
      "txId" TEXT,
      "zipkinTraceId" TEXT,
      "env" TEXT,
      "message" TEXT,
      "type" TEXT,
      "esIndex" TEXT,
      "prodLine" TEXT,
      "application" TEXT,
      "id" TEXT,
      "optTime" TEXT,
      "ccloudTransactionId" TEXT,
      "zipkinParentId" TEXT
   )
)
CLUSTERED INTO 8 SHARDS
WITH (
   column_policy = 'strict',
   number_of_replicas = '0-1'
)

Hi, that error message can be a bit cryptic :face_with_spiral_eyes:. It means that the function that gets called with the operator LIKE does not haven an implementation when the input is a type text_array, which basically means that LIKE cannot be used with a text array, as far as I know it is the case with arrays in general.

LIKE operator would not be good for this use case as in some cases it wouldn’t be able to use the index and would have to iterate through all the rows.

For fuzzy search I’d recommend to use fulltext-search:

CREATE TABLE t (
  arr array(text),
  INDEX arr_ft using fulltext (arr)
)
INSERT INTO
  t
VALUES
  (
    [ 
        '54e68f71-d3e6-400b-b36d-d2b2a74dd29a',
        '46c87882-b521-435c-ac47-2e706d35b7a8'
     ]
)
SELECT
  _score,
  *
FROM
  t
WHERE
  match(arr_ft, 'd2b2a74dd29a')

In your case using an object you could do something like:

CREATE TABLE t (
  logs OBJECT AS (
    bills_id array(text)
  ),
  INDEX bills_id_ft using fulltext (logs ['bills_id'])
) 

Note that you would have to include bills_id in the definition of your table since adding a column that references an already existing column is not supported.

I would also encourage you to have a look at the fulltext docs, there are options to tweak fuzziness and have different tokenizers which you might be benefit from!

Thanks @surister , I recreate table and specify a fulltext index as your suggestion, but the match function cannot return fuzzy or partial matches. Could you please help me check if something was done incorrectly?

//The original data
cr> select _score,message['billIds'] from sd.sdtest;
+--------+--------------------------------------+
| _score | message['billIds']                   |
+--------+--------------------------------------+
|    1.0 | ["e2f9c9a8bc484575b42b31d87d6ec5db"] |
+--------+--------------------------------------+
SELECT 1 row in set (0.004 sec)

//the match function can return exact matches, but it cannot return fuzzy or partial matches.
cr> select _score,message['billIds'] from sd.sdtest where match(bills_id_ft,'e2f9c9a8bc484575b42b31d87d6ec5db');
+------------+--------------------------------------+
|     _score | message['billIds']                   |
+------------+--------------------------------------+
| 0.13076457 | ["e2f9c9a8bc484575b42b31d87d6ec5db"] |
+------------+--------------------------------------+
SELECT 1 row in set (0.018 sec)
cr> select _score,message['billIds'] from sd.sdtest where match(bills_id_ft,'e2f9c9a8bc484575b42b31d87d6e');
+--------+--------------------+
| _score | message['billIds'] |
+--------+--------------------+
+--------+--------------------+
SELECT 0 rows in set (0.004 sec)
cr>

The table definition

cr> show create table sd.sdtest;
+---------------------------------------------------------------------+
| SHOW CREATE TABLE sd.sdtest                                         |
+---------------------------------------------------------------------+
| CREATE TABLE IF NOT EXISTS "sd"."sdtest" (                          |
|    "message" OBJECT(DYNAMIC) AS (                                   |
|       "zipkinSpanId" TEXT,                                          |
|       "billIds" ARRAY(TEXT),                                        |
|       "optUser" TEXT,                                               |
|       "auditLog" OBJECT(DYNAMIC) AS (                               |
|          "fields" OBJECT(DYNAMIC) AS (                              |
|             "isUpload" TEXT,                                        |
|             "retailPrice" TEXT,                                     |
|             "updTime" TEXT,                                         |
|             "uploadStatusMessage" TEXT                              |
|          ),                                                         |
|          "afterSave" OBJECT(DYNAMIC) AS (                           |
|             "isUpload" TEXT,                                        |
|             "retailPrice" TEXT,                                     |
|             "updTime" TEXT,                                         |
|             "uploadStatusMessage" TEXT                              |
|          ),                                                         |
|          "beforeSave" OBJECT(DYNAMIC) AS (                          |
|             "isUpload" TEXT,                                        |
|             "retailPrice" TEXT,                                     |
|             "updTime" TEXT,                                         |
|             "uploadStatusMessage" TEXT                              |
|          )                                                          |
|       ),                                                            |
|       "level" TEXT,                                                 |
|       "typeName" TEXT,                                              |
|       "ccloudTraceId" TEXT,                                         |
|       "txId" TEXT,                                                  |
|       "zipkinTraceId" TEXT,                                         |
|       "env" TEXT,                                                   |
|       "message" TEXT,                                               |
|       "type" TEXT,                                                  |
|       "esIndex" TEXT,                                               |
|       "prodLine" TEXT,                                              |
|       "application" TEXT,                                           |
|       "id" TEXT,                                                    |
|       "optTime" TEXT,                                               |
|       "ccloudTransactionId" TEXT,                                   |
|       "zipkinParentId" TEXT                                         |
|    ),                                                               |
|    INDEX "bills_id_ft" USING FULLTEXT ("message"['billIds']) WITH ( |
|       analyzer = 'standard'                                         |
|    )                                                                |
| )                                                                   |
| CLUSTERED INTO 8 SHARDS                                             |
| WITH (                                                              |
|    column_policy = 'strict',                                        |
|    number_of_replicas = '0-1',                                      |
|    refresh_interval = 1000                                          |
| )                                                                   |
+---------------------------------------------------------------------+
SHOW 1 row in set (0.194 sec)
cr>

Oh yeah it makes perfect sense that it cannot match, the default analyzer is meant for common languages :person_facepalming:, so an uuid is treated as a single word, as the docs says:

Fulltext indices take the contents of one or more fields and split it up into tokens that are used for fulltext-search. The transformation from a text to separate tokens is done by an analyzer.

We could leverage n_gram or pattern tokenizer:

CREATE ANALYZER uuid_anyzer(
  tokenizer uuid_ngram with (type = 'ngram', min_gram = 4, max_gram = 6)
)

This will create grams of 4-6 length.

CREATE TABLE sdtest (
  "message" OBJECT(dynamic) as ("billIds" ARRAY(TEXT)),
  INDEX "bills_id_ft" USING FULLTEXT ("message" ['billIds']) with (analyzer = 'uuid_anyzer')
)
INSERT INTO
  sdtest
VALUES
  (
    { "billIds" = ['93670e1a798d4670a38a54b878cd57cc'] }
  )
SELECT
  _score,
  message ['billIds']
FROM
  sdtest
WHERE
  match(bills_id_ft, '8cd57cc');

| _score    | message['billIds']                   |
|-----------|--------------------------------------| 
| 1.1768812 | ["93670e1a798d4670a38a54b878cd57cc"] |

Now, as you can see in the docs, there are a lot of analyzers, tokenizers, filters.. so a lot of things can be tweaked to make your search more efficient and accurate. For example if ids are UUIDs separated by a hyphen, like 93670e1a-798d-4670-a38a-54b878cd57cc you could use the pattern tokenizer and split tokens into indexes like: [93670e1a, 798d, a38a, 54b878cd57cc]if that makes sense for how you want to do search.

I will ask some teammates just in case, I’m not the most expert in fs search!

While writing to the team another way of doing this occurred to me:

select * from sdtest where array_to_string(log['billIds'], '') LIKE '%85B%'

This might just work for you, but I’d be careful with performance, LIKE can in some instances do a full table scan, which might be ok or not depending on your use-case and requirements, we’d also have the overhead of calling array_to_string in every row iteration, you could maybe optimize for that by doing:

CREATE TABLE sdtest (
  "message" OBJECT(dynamic) as ("billIds" ARRAY(TEXT)),
  billds_ids_text generated always as array_to_string("message" ['billIds'], '')
)

Without question the fulltext search option will perform the best, but using array_to_string is way simpler, please do a good benchmark and decide what to do depending on the results and your requirements.