Currently the data path must be on a local filesystem. I would like to be able to define the data path to be in S3 (or GCP) storage. This could be difficult to do, depending on the code, but do you think it is possible to add this option (configuring the data path to be stored on S3 storage)?
Any reference to GitHub code where data is written to disk is welcome.
Thanks
I would not recommend doing this at all, for several reasons. One is that CrateDB persists a translog (WAL) for every operation; another is that IOPS, and therefore performance, would most likely be terrible.
However, if you still want to try it, you could use something like s3fs-fuse, or an equivalent for Kubernetes PVCs; see the sketch below.
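A minimal sketch, assuming a hypothetical bucket `my-bucket`, mount point `/mnt/s3-data`, and an s3fs credentials file at `~/.passwd-s3fs`:

```sh
# Mount the bucket with s3fs-fuse (bucket, mount point and credentials file are placeholders)
s3fs my-bucket /mnt/s3-data -o passwd_file=${HOME}/.passwd-s3fs

# Point CrateDB's data path at the mount via a command-line setting
crate -Cpath.data=/mnt/s3-data
```

Again, I would expect the translog fsyncs and random I/O mentioned above to perform very poorly on such a mount.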
Yes, I considered s3fs but was wondering about performance; did you try it (with large data)? Also, since s3fs has some limitations (e.g. random access, …), does CrateDB work without any issues on s3fs?
CrateDB uses memory-mapped files for accessing the filesystem. While you could put the filesystem on a remote system and connect to it over the network, it will suffer severely from latency problems, which is probably not the right thing to do when running a database [1].
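As a minimal sketch of why this hurts (plain Python, not CrateDB code; the file path is hypothetical):

```python
import mmap

# CrateDB (via Lucene) reads its index files through memory mappings like this.
# The first touch of each page triggers a page fault that the underlying storage
# must serve; on a FUSE/S3-backed filesystem every such fault can cost a full
# network round trip instead of a microsecond-scale local SSD read.
with open("/data/crate/some-segment.bin", "rb") as f:  # hypothetical path
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Random accesses like this one are exactly what suffers most from latency.
    chunk = mm[4096:4160]
    mm.close()
```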
Beyond latency and throughput issues [2][3][4], there will probably also be severe concurrency issues, eventually leading to data corruption, because databases need to write data, not just read it.
If you are only looking to optimize the read path to your data, you may want to look at solutions/technologies like a sparse index or Zarr, but both are usually applied only to more specific data domains and are not suitable for general-purpose databases.
In general, we recommend using fast locally attached SSD disks for running CrateDB, to avoid any network round trips; an example configuration is sketched below.
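For example, in crate.yml (mount points are hypothetical; if I remember correctly, path.data also accepts a comma-separated list to spread data over several disks):

```yaml
# crate.yml — point the data path at locally attached NVMe SSDs
path.data: /mnt/nvme0/cratedb,/mnt/nvme1/cratedb
```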