Academy course: wrong analyzer?

Hi all,
seems that there is an error in defining the analyzer in the tables defined for the Academy CrateDB Fundamentals course.

What have I to do?
Going on with the course accepting different results from the lesson?

What I discovered:

In the Hands-on: Full-Text Search lesson is reported that the community_areas table was created using the “english” analyzer

When the community_areas table was created, we told CrateDB to perform full-text indexing on it:

…
details OBJECT(STRICT) AS (
   description TEXT INDEX USING fulltext WITH (analyzer='english'),
   population BIGINT
), …

Let’s use the MATCH predicate to try some queries against the index that was created. Recall that, when using MATCH, we can select _score to retrieve a score for each result. This score is a measure of the quality of the match compared to other matching rows.

Execute this query to search for community areas containing either the term “railroad” or the term “tracks”:

SELECT name, _score, details['description'] AS description
FROM community_areas
WHERE MATCH(details['description'], 'railroad tracks')
ORDER BY _score DESC;

Take a moment to look at the results returned by this query. Note that Greater Grand Crossing and Burnside are the community areas with the highest scores.

Running the query I see different results: the first two community_areas are West Town and “Burnside” instead of "Greater Grand Crossing"and “Burnside

Looking at the Hands-on: Your First CrateDB Cluster lesson I see that the table was created using this command

CREATE TABLE community_areas (
   areanumber INTEGER PRIMARY KEY,
   name TEXT,
   details OBJECT(STRICT) AS (
       description TEXT INDEX USING fulltext,
       population BIGINT
   ),
   boundaries GEO_SHAPE INDEX USING geohash WITH (PRECISION='1m', DISTANCE_ERROR_PCT=0.025)
);

But this means to use the “standard” analyzer

cr> SHOW CREATE TABLE community_areas
    ;
+------------------------------------------------------+
| SHOW CREATE TABLE doc.community_areas                |
+------------------------------------------------------+
| CREATE TABLE IF NOT EXISTS "doc"."community_areas" ( |
|    "areanumber" INTEGER NOT NULL,                    |
|    "name" TEXT,                                      |
|    "details" OBJECT(STRICT) AS (                     |
|       "description" TEXT INDEX USING FULLTEXT WITH ( |
|          analyzer = 'standard'                       |
|       ),                                             |
|       "population" BIGINT                            |
|    ),                                                |
|    "boundaries" GEO_SHAPE INDEX USING GEOHASH WITH ( |
|       distance_error_pct = 0.025,                    |
|       precision = '1m'                               |
|    ),                                                |
|    PRIMARY KEY ("areanumber")                        |
| )                                                    |
| CLUSTERED BY ("areanumber") INTO 6 SHARDS            |
| WITH (                                               |
|    "allocation.max_retries" = 5,                     |
|    "blocks.metadata" = false,                        |
|    "blocks.read" = false,                            |
|    "blocks.read_only" = false,                       |
|    "blocks.read_only_allow_delete" = false,          |
|    "blocks.write" = false,                           |
|    codec = 'default',                                |
|    column_policy = 'strict',                         |
|    "mapping.total_fields.limit" = 1000,              |
|    max_ngram_diff = 1,                               |
|    max_shingle_diff = 3,                             |
|    number_of_replicas = '0-1',                       |
|    "routing.allocation.enable" = 'all',              |
|    "routing.allocation.total_shards_per_node" = -1,  |
|    "store.type" = 'fs',                              |
|    "translog.durability" = 'REQUEST',                |
|    "translog.flush_threshold_size" = 536870912,      |
|    "translog.sync_interval" = 5000,                  |
|    "unassigned.node_left.delayed_timeout" = 60000,   |
|    "write.wait_for_active_shards" = '1'              |
| )                                                    |
+------------------------------------------------------+
SHOW 1 row in set (0.009 sec)
cr> 

So using different analyzers we get different results.

Hi there - thanks for the detailed post, I’ll take a look at this and get back to you. I’ve started tracking it as an issue on the course’s GitHub repository that can be seen here. I’ll post any updates here in the community forum too.

Thanks,

Simon.