My question is similar to this one: Table creation considerations
In short, I have to collect data from thousands of devices (potentially in the order of 100k), each of which is identified by a UUID.
Each device is of a given type, and there are a few types (about 50 different types; some types have more devices than others).
What’s a good strategy to create tables for such a setup? I understand that creating 100k tables might not be optimal, but I fear that creating one table per device type would end up degrading performance, first because it could have billions of records for the more popular device types, second because I would have to put the 128-bit device UUID in each row, potentially using up a lot of disk space.
Data would never expire and has a very simple format: for each device, it’s a series of (timestamp, value) tuples.
Sorry for the newbie question but I’m just starting out with cratedb (which looks impressive so far btw).
Hi,
Having billions of records on a single table is not a problem for CrateDB and performance does not degrade because of this.
So I would go for a design with a smallish number of tables which would allow me to easily run queries across devices.
If the amount of different properties returned by the different device types is not too high, we could put different device types together on a single table, you may want to check CrateDB’s OBJECT columns for this.
One possible way to address the issue of the long UUID consuming too much space on each row could be to create a separate mapping table where you could map the long UUIDs to integers perhaps?
I hope this helps.
At my workplace, we had to setup a sort of similar approach, although in our case we only have 10 different types. We have around 30K devices inserting 1 minute records with 512Kb-1Kb of data for each. Our biggest table has around 1.9 billion records for 90 days of data, after which we are (starting to) downsample. The other tables are much smaller, in the millions of records.
Our UUID is also 16 bytes, and there isn’t really a problem as Hernan has said.
I tried to create tables based off the first 8 bytes of the UUID, but this still created over a thousand tables and that is when you hit hard issues with shards/tables.
As a suggestion, you’re better off creating something based on the a combination of other data ,for instance device type and maybe some part of the UUID which is not unique - this is only as guess if the front part of the UUID is common across millions of records (like we have)
To save space on our large record size we did turn off indexing on the numerical columns ( 95% of our columns are numerical) as we don’t perform a where clause on them and we got a 60-70% saving along with the tables having compression codec.
Also you can affect performance with partitioning, which at the moment we’ve got set to 2 weeks as most queries we get are within 7 days , so not all the partitions are queried ( again the partitions are around the 300GB size ).
When running queries we’ve got no more than 2 joins, as that does affect the query time, but with some base calculations of AVG and LAG on some parameters, we’re getting sub 5 seconds on querying the biggest table ( we only have a small cluster due to penny pinching finance managers!) .
This is taking into consideration the cluster is processing 15<> bulk insert queries/sec and each insert is 160-200 rows each.
So really the 16 Byte UUID is not a problem at all (unless you really need sub second response times!)
Sorry for the long winded reply, but I kept thinking of new things. I hope this helps in some way
Thanks for the replies! @djbestenergy , how do you manage to use only 16 bytes for UUIDs? It seems to me that there is no native type and the best one can do is to save them as text, which consumes even more space.
It’s not a normal UUID type we use, but we reference it as such
12 bytes is the device MAC address and we append 4 bytes for references to other things like the “port” for the sensor .
This is stored as a varchar ( just in case we ever expand it)
@hernanc, I was conteplating the mapping table option that you mentioned. However, it seems to me that cratedb has no method for autogenerating sequences or other shorter-than-uuid values, so the creation of new devices in the mapping table would have to be done by reading first to see whether the device exists, and if not creating the new device with (largest id + 1), the largest id also has to be discovered of course. All of this is inherently racy.
Does cratedb provide some tools (locking?) to make the process safe(r)?
Another possibility that occurs to me is delegating the mapping to a “conventional” external service, where coherency and concurrency control can be enforced.
You might be able to use the _id system column in the mapping table you make… ? That way you just insert the UUID and the _id is generated for you.
I might be barking up the wrong tree, but I’m sure Hernan can correct me
We also have to have a mapping table but for different reasons. Rather than keep hitting this table, we load it into a local redis cache intermittently, for faster reading and less stress on the CrateDB cluster
Well yes, but the idea was to have an id that is (possibly much) shorter than the text representation of a uuid, and my understanding is that _id is not that short, although it’s somewhat shorter, granted.