I am looking for a best practice for evaluating CrateDB's disk usage for a single column.
Our system receives various messages that are stored in columns of type OBJECT (JSON).
Given the range of test messages we have already received, it makes sense in our context to estimate how we would provision a larger production system by looking at selected columns in selected tables and extrapolating.
Ideally, I would query a system virtual table, filter for a specific column, and get back the number of bytes it consumes.
What I have found so far:
sys.shards – Great information and very close to what I am looking for, but it does not seem to offer granularity down to a single column (see the first sketch after this list).
pg_stats – This exceeds the needed granularity, breaking the message column down into the sub-columns produced by each JSON field; however, it only provides average widths for each of these columns (second sketch below).
bit_length – More specifically, casting each object in the target column to a string and summing the bit_length (third sketch below).
This method has the benefit of targeting exactly what I am looking for, but I fear it is a somewhat naïve approach: the cast may introduce inaccuracies, and it measures character counts rather than size on disk after compression.
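For completeness, this is the per-table rollup I can already get from sys.shards (a minimal sketch; filtering on primary shards avoids counting replicas twice):

```sql
-- Size on disk per table, primary shards only (replicas excluded
-- so the same data is not counted twice).
SELECT schema_name,
       table_name,
       SUM(size) AS size_bytes
FROM sys.shards
WHERE "primary" = true
GROUP BY schema_name, table_name;
```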
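And this is roughly how I am reading pg_stats (a sketch assuming a hypothetical table `messages` with an OBJECT column `payload`, and that statistics have already been populated, e.g. via ANALYZE):

```sql
-- Average width in bytes per attribute, including the sub-columns
-- CrateDB derives from the JSON fields of the OBJECT.
-- 'messages' and 'payload' are placeholder names.
SELECT attname,
       avg_width
FROM pg_catalog.pg_stats
WHERE tablename = 'messages'
  AND attname LIKE 'payload%';
```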
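The bit_length approach as I am currently running it (same placeholder names; note this measures the serialized string, not the compressed size on disk):

```sql
-- Uncompressed estimate: serialize each OBJECT to a string and
-- sum the bit lengths; divide by 8 to convert bits to bytes.
SELECT SUM(bit_length(CAST(payload AS TEXT))) / 8 AS approx_bytes
FROM messages;
```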
My next effort would be to create a new table with a single column, migrate all the data from the column of interest into it, and then sum the size of that table's shards, filtered by table name.
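Concretely, that fallback would look something like this (all names hypothetical; the OPTIMIZE is there only to force segment merges so the reported size is closer to steady state):

```sql
-- Isolate the column of interest in its own table...
CREATE TABLE payload_only (payload OBJECT);

INSERT INTO payload_only (payload)
SELECT payload FROM messages;

REFRESH TABLE payload_only;
OPTIMIZE TABLE payload_only;

-- ...then measure that table alone via sys.shards.
SELECT SUM(size) AS size_bytes
FROM sys.shards
WHERE table_name = 'payload_only'
  AND "primary" = true;
```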
But before doing this, I felt that perhaps I am missing something much easier.