I have a table with firstname and lastname columns. I want to find matches that tolerate minor inversions, such as lastname = 'bronw' (actual: brown). I would also like to use a synonym file to find nicknames and diminutive names, such as William = Bill. The columns have been defined as fulltext. Can you provide any assistance or a reference? Thank you!
Hello Todd,
A common approach in information retrieval is to use n-grams for fuzzy search or spelling correction. Each word is split into tokens of between min_gram and max_gram characters.
For example, a 2-2 n-gram tokenizer (min_gram=2, max_gram=2) splits a word into tokens of exactly two characters:
'brown' -> ['br', 'ro', 'ow', 'wn']
The same approach is applied to the query term:
'bronw' -> ['br', 'ro', 'on', 'nw']
Then the overlap is calculated and scored with a heuristic; CrateDB uses Okapi BM25 (https://en.wikipedia.org/wiki/Okapi_BM25). If you increase the length of the n-grams, the approach becomes less fuzzy.
Here is a fully working example:
-- Analyzer that splits words into 2-character n-grams
CREATE ANALYZER a1 (TOKENIZER t1 WITH (type='ngram', min_gram=2, max_gram=2, token_chars=['letter']));
-- Fulltext index on lastname using the n-gram analyzer
CREATE TABLE doc.test (firstname VARCHAR, lastname VARCHAR, INDEX lastname_ft USING FULLTEXT (lastname) WITH (analyzer = 'a1'));
INSERT INTO doc.test (firstname, lastname) VALUES ('charly', 'brown'), ('charly', 'braun'), ('charly', 'browne');
-- _score is the BM25 relevance; the typo 'bronw' still matches via shared 2-grams
SELECT firstname, lastname, _score FROM doc.test WHERE MATCH(lastname_ft, 'bronw') ORDER BY _score DESC;
+-----------+----------+------------+
| firstname | lastname |     _score |
+-----------+----------+------------+
| charly    | brown    | 0.17363958 |
| charly    | browne   |  0.1585405 |
| charly    | braun    | 0.13076457 |
+-----------+----------+------------+
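To see why longer grams are stricter: with a 3-3 tokenizer, 'brown' -> ['bro', 'row', 'own'] and 'bronw' -> ['bro', 'ron', 'onw'] share only a single token, so the typo scores much lower. A 3-gram variant of the analyzer would look like this (the names a3 and t3 are just for illustration):
CREATE ANALYZER a3 (TOKENIZER t3 WITH (type='ngram', min_gram=3, max_gram=3, token_chars=['letter']));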
Another approach to fuzzy matching is fulltext search with the fuzziness parameter, which internally uses the Levenshtein distance (https://en.wikipedia.org/wiki/Levenshtein_distance) to calculate the match. This approach is CPU-intensive and should only be used for smaller datasets. On the other hand, n-grams increase your storage size.
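Here is a minimal sketch of that approach, reusing the doc.test table from above; note that it matches against the plain-indexed lastname column rather than the n-gram index, and the fuzziness value of 2 is my choice for this example:
-- fuzziness = 2 allows up to two single-character edits per term
SELECT firstname, lastname, _score
FROM doc.test
WHERE MATCH(lastname, 'bronw') USING best_fields WITH (fuzziness = 2)
ORDER BY _score DESC;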
CrateDB does support synonym files. The synonym file needs to be placed in the config folder and must be in the Solr or WordNet synonym file format.
Here is a fully working example:
config/synonyms.txt
William => Bill
-- Analyzer with a synonym token filter; synonyms_path is relative to the config folder
CREATE ANALYZER a2 (TOKENIZER lowercase, TOKEN_FILTERS (my_synonyms WITH (type='synonym', synonyms_path='synonyms.txt')));
CREATE TABLE doc.test (name VARCHAR, INDEX synonym_ft USING FULLTEXT (name) WITH (analyzer = 'a2'));
INSERT INTO doc.test (name) VALUES ('Bill');
SELECT name FROM doc.test WHERE MATCH(synonym_ft, 'William');
+------+
| name |
+------+
| Bill |
+------+
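One detail on the Solr format: 'William => Bill' is an explicit one-way mapping that replaces William with Bill wherever the analyzer runs. If you want the terms to be fully interchangeable, use a comma-separated list instead (the extra nicknames here are just illustrative):
William, Bill, Will, Billy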
Best Regards,
Michael
Michael,
Thank you very much. Can you provide an example that combines tokenizers?
Thanks!
Hi Todd,
You cannot combine tokenizers in an analyzer, but you can combine multiple token filters:
CREATE ANALYZER a1 (TOKENIZER standard, TOKEN_FILTERS (lowercase, asciifolding, my_synonyms WITH (type='synonym', synonyms_path='synonyms.txt')));
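For completeness, a quick sketch of that analyzer in use; the doc.people table and the accented sample data are my own illustration, and it assumes synonyms.txt exists in the config folder as above:
-- asciifolding normalizes accented characters, so 'Müller' is indexed as 'muller'
CREATE TABLE doc.people (name VARCHAR, INDEX name_ft USING FULLTEXT (name) WITH (analyzer = 'a1'));
INSERT INTO doc.people (name) VALUES ('Müller');
-- The query term goes through the same analyzer chain, so 'Muller' matches
SELECT name FROM doc.people WHERE MATCH(name_ft, 'Muller');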
Best Regards,
Michael