What are the best practices for managing blob's metadata?

Hi all,
before going too far on the wrong way I need some advices on how to manage metadata related to blobs.

If I understand correctly the documentation, the blob only queryable fields are digest and last_modified columns.

But I need to manage blob’s metadata.

What are the best practices to do this in CrateDB?

TL;DR

What I’ve tried so far.

In my use case I need to manage attachments to documents.

So I created a blob table

create blob table attachments with (blobs_path='/data/data0/myblobs/attachments');

But I need to associate to the blob at least a searchable name, a searchable description and the original file name.

So I created a table to contain the metadata related to the blobs

cr> SHOW CREATE TABLE attachments_meta;
+-------------------------------------------------------+
| SHOW CREATE TABLE doc.attachments_meta                |
+-------------------------------------------------------+
| CREATE TABLE IF NOT EXISTS "doc"."attachments_meta" ( |
|    "digest" TEXT NOT NULL,                            |
|    "name" TEXT INDEX USING FULLTEXT WITH (            |
|       analyzer = 'english'                            |
|    ),                                                 |
|    "description" TEXT INDEX USING FULLTEXT WITH (     |
|       analyzer = 'english'                            |
|    ),                                                 |
|    "filename" TEXT,                                   |
|    PRIMARY KEY ("digest")                             |
| )                                                     |
| CLUSTERED BY ("digest") INTO 6 SHARDS                 |
| WITH (                                                |
|    "allocation.max_retries" = 5,                      |
|    "blocks.metadata" = false,                         |
|    "blocks.read" = false,                             |
|    "blocks.read_only" = false,                        |
|    "blocks.read_only_allow_delete" = false,           |
|    "blocks.write" = false,                            |
|    codec = 'default',                                 |
|    column_policy = 'strict',                          |
|    "mapping.total_fields.limit" = 1000,               |
|    max_ngram_diff = 1,                                |
|    max_shingle_diff = 3,                              |
|    number_of_replicas = '0-1',                        |
|    "routing.allocation.enable" = 'all',               |
|    "routing.allocation.total_shards_per_node" = -1,   |
|    "store.type" = 'fs',                               |
|    "translog.durability" = 'REQUEST',                 |
|    "translog.flush_threshold_size" = 536870912,       |
|    "translog.sync_interval" = 5000,                   |
|    "unassigned.node_left.delayed_timeout" = 60000,    |
|    "write.wait_for_active_shards" = '1'               |
| )                                                     |
+-------------------------------------------------------+
SHOW 1 row in set (0.001 sec)
cr>

The tables associations is done using the digest column.

For testing I loaded the tables with some example data

cr> SELECT * FROM BLOB.attachments;
+------------------------------------------+---------------+
| digest                                   | last_modified |
+------------------------------------------+---------------+
| acfa51b963b5a1fb9eecaf7954ae39c9bea736cc | 1735995025468 |
| 135b6d193c9acba8ea180dea7424b863ce52858b | 1735994937133 |
| f83be20cb0439e5008242ec7a01cf1fe168f413b | 1735994988697 |
| 9a3993e2ecbbb3b1ea1435650cedd7fec2454e73 | 1735995055568 |
+------------------------------------------+---------------+
SELECT 4 rows in set (0.020 sec)
cr> SELECT * FROM attachments_meta;
+------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+----------------------------------------------+
| digest                                   | name                                          | description                                                                             | filename                                     |
+------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+----------------------------------------------+
| 135b6d193c9acba8ea180dea7424b863ce52858b | Insert CPU IN motherboard socket              | Short video ON how TO INSERT a CPU IN the motherboard socket                            | 11537353-hd_1920_1080_30fps.mp4              |
| f83be20cb0439e5008242ec7a01cf1fe168f413b | Impress Guide v. 24.2                         | LibreOffice Impress Guide version 24.2. A manual FOR creating presentations             | IG24-ImpressGuide.pdf                        |
| acfa51b963b5a1fb9eecaf7954ae39c9bea736cc | ROG Maximus Z790 Hero BTF - Quick START guide | The quick START guide FOR the Asus Republic Of Gamers Maximus Z790 Hero BTF motherboard | Q23155_ROG_MAXIMUS_Z790_HERO_BTF_QSG_WEB.pdf |
| 9a3993e2ecbbb3b1ea1435650cedd7fec2454e73 | Top VIEW of ROG CROSSHAIR X670E HERO          | The top VIEW image of the Asus Republic Of Gamers CROSSHAIR X670E HERO motherboard      | ROG_CROSSHAIR_X670E_HERO_h732.png            |
+------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------+----------------------------------------------+
SELECT 4 rows in set (0.020 sec)
cr>

Using these tables I’m able to do a full text search for attachments and download the blobs found restoring the original filename.

1 Like

Yes, using a dedicated table to store some metadata of the blobs is the recommended way.
There is an open feature request to store more information at the blob table but it’s not completely clear what in concrete should be stored and what benefit this brings over using a user-defined table. One of the main purpose of the blob table is the benefit of being able to iterate over all blobs, or filter on them. As one can join the blob table with a user-defined on by the digest, most use-case should be working.
Otherwise, any more input on the mentioned issue would be great.

1 Like

Thanks
I’ll add a comment on the feature request for my use case

1 Like