Hi,
We are building a retrieval platform that mainly serves hybrid search workloads using vector search and BM25/full text search. CrateDB looks like a strong fit for this because hybrid retrieval can stay inside SQL, instead of splitting vector search, lexical search, structured filtering, tenant isolation, collection isolation, and query layer governance across multiple systems.
During implementation, we identified another important use case, some documents are explicitly linked to other documents, and retrieval sometimes needs to traverse those links before ranking or generating an answer. For example, a query may first match one document, then need to follow document to document relationships such as related cases, circulars, orders, attachments, references, or evidence chains. At the moment, we implemented bounded traversal (using document anchoring) up to 3 levels using SQL and application side control. This works, but it becomes harder to express and optimize as the relationship patterns become more dynamic.
Do you have any roadmap or design thoughts around adding recursive traversal support, SQL/PGQ, or a Cypher/GQL-like property graph query capability in CrateDB?
I understand this is a large feature and probably not something that can be answered fully in a forum thread. But since SQL/PGQ is now part of the SQL standard and PostgreSQL is moving in that direction, I think this could add significant value for GenAI and retrieval pipelines where hybrid search, structured filters, and graph-style traversal need to work together.
Even a high level answer would be useful, whether this is aligned with CrateDB’s direction, whether recursive CTEs or SQL/PGQ are being considered, or whether the recommended pattern is to keep bounded traversal in application logic for now.
Regarding graph query capability in CrateDB: supporting some complex queries like shortest path and things like that, likely won’t be part of CrateDB as it would require having graph entities as first class citizens (like Neo4J).
But recursive CTE should be already a good start to unlock some use cases.
Traversing in application can be also an option:
for small graphs you can load entire graph into memory on demand and do pretty much anything with it in your application
for large graphs the simplest way to do traversing is to store graph via adjacency list and represent each
“lookup of all neighbours of a vertex V” with a SELECT, but it might generate many queries. Basically O(N) of DFS will become O(N) SELECTS - maybe not that bad if we mitigate connection creation time with a connection pool and parsing time with prepared statements, but still not as good as having the whole thing baked in.
To somehow balance between number of SELECT-s and application memory usage,
you can go with BFS as it traverses layer by layer. I mean storing queue in memory and using IN to get the whole layer (or part of it) in one go. But this will work only if we know that graph don’t have many nodes on each layer, and we would need to carefully tune IN size as we also need to store unprocessed items in memory while adding new ones.
If graph is large but doesn’t produce many nodes on each iteration
(for example, visually looks like a super long fork), we can even store many layers in the queue and get multiple layers in one go.