Academy course: wrong analyzer?

MirtoBusico · December 15, 2024, 5:34pm

Hi all,
seems that there is an error in defining the analyzer in the tables defined for the Academy CrateDB Fundamentals course.

What have I to do?
Going on with the course accepting different results from the lesson?

What I discovered:

In the Hands-on: Full-Text Search lesson is reported that the community_areas table was created using the “english” analyzer

When the community_areas table was created, we told CrateDB to perform full-text indexing on it:

…
details OBJECT(STRICT) AS (
   description TEXT INDEX USING fulltext WITH (analyzer='english'),
   population BIGINT
), …

Let’s use the MATCH predicate to try some queries against the index that was created. Recall that, when using MATCH, we can select _score to retrieve a score for each result. This score is a measure of the quality of the match compared to other matching rows.

Execute this query to search for community areas containing either the term “railroad” or the term “tracks”:

SELECT name, _score, details['description'] AS description
FROM community_areas
WHERE MATCH(details['description'], 'railroad tracks')
ORDER BY _score DESC;

Take a moment to look at the results returned by this query. Note that Greater Grand Crossing and Burnside are the community areas with the highest scores.

Running the query I see different results: the first two community_areas are West Town and “Burnside” instead of "Greater Grand Crossing"and “Burnside”

Looking at the Hands-on: Your First CrateDB Cluster lesson I see that the table was created using this command

CREATE TABLE community_areas (
   areanumber INTEGER PRIMARY KEY,
   name TEXT,
   details OBJECT(STRICT) AS (
       description TEXT INDEX USING fulltext,
       population BIGINT
   ),
   boundaries GEO_SHAPE INDEX USING geohash WITH (PRECISION='1m', DISTANCE_ERROR_PCT=0.025)
);

But this means to use the “standard” analyzer

cr> SHOW CREATE TABLE community_areas
    ;
+------------------------------------------------------+
| SHOW CREATE TABLE doc.community_areas                |
+------------------------------------------------------+
| CREATE TABLE IF NOT EXISTS "doc"."community_areas" ( |
|    "areanumber" INTEGER NOT NULL,                    |
|    "name" TEXT,                                      |
|    "details" OBJECT(STRICT) AS (                     |
|       "description" TEXT INDEX USING FULLTEXT WITH ( |
|          analyzer = 'standard'                       |
|       ),                                             |
|       "population" BIGINT                            |
|    ),                                                |
|    "boundaries" GEO_SHAPE INDEX USING GEOHASH WITH ( |
|       distance_error_pct = 0.025,                    |
|       precision = '1m'                               |
|    ),                                                |
|    PRIMARY KEY ("areanumber")                        |
| )                                                    |
| CLUSTERED BY ("areanumber") INTO 6 SHARDS            |
| WITH (                                               |
|    "allocation.max_retries" = 5,                     |
|    "blocks.metadata" = false,                        |
|    "blocks.read" = false,                            |
|    "blocks.read_only" = false,                       |
|    "blocks.read_only_allow_delete" = false,          |
|    "blocks.write" = false,                           |
|    codec = 'default',                                |
|    column_policy = 'strict',                         |
|    "mapping.total_fields.limit" = 1000,              |
|    max_ngram_diff = 1,                               |
|    max_shingle_diff = 3,                             |
|    number_of_replicas = '0-1',                       |
|    "routing.allocation.enable" = 'all',              |
|    "routing.allocation.total_shards_per_node" = -1,  |
|    "store.type" = 'fs',                              |
|    "translog.durability" = 'REQUEST',                |
|    "translog.flush_threshold_size" = 536870912,      |
|    "translog.sync_interval" = 5000,                  |
|    "unassigned.node_left.delayed_timeout" = 60000,   |
|    "write.wait_for_active_shards" = '1'              |
| )                                                    |
+------------------------------------------------------+
SHOW 1 row in set (0.009 sec)
cr>

So using different analyzers we get different results.

simon_prickett · December 16, 2024, 10:30am

Hi there - thanks for the detailed post, I’ll take a look at this and get back to you. I’ve started tracking it as an issue on the course’s GitHub repository that can be seen here. I’ll post any updates here in the community forum too.

Thanks,

Simon.

simon_prickett · January 14, 2025, 8:54pm

Hi there - this has now been fixed and the course content updated… you were correct, the table should have been creating using the English analyzer for that column.

Thank you,

Simon.

Topic		Replies	Views
How to search on analyzed fields CrateDB	1	695	February 18, 2019
Crate db full text search phrase_prefix does not return everything need to be returned CrateDB	2	23	March 14, 2025
ANALYZER with two and more TOKENIZER/TOKEN_FILTER CrateDB	3	558	November 23, 2021
Consider the below Scenario CrateDB	0	676	March 27, 2020
Does crate support 3rd party analyzers/tokenizers? CrateDB	2	582	December 5, 2018

Academy course: wrong analyzer?

Related topics