Table creation considerations

djbestenergy · July 14, 2022, 10:38am

Hi,

I’m new to using Crate and currently POC testing to use with IoT data . I come from a relational DB background so I know I need to change my thinking a bit.

I’m having issues with table creation, in that I’m not sure what the best approach might be to the schema.

Summary :
There are 6700(<>) devices each being 1 of 11 device types (between 24 and 146 parameters for each type).
The parameters are 90% double precision with a deviceid and timestamp field.

The input data is in a custom binary format and comes in once every minute, so all the parameters for the deviceid arrive at once, which I am then translating to insert into CrateDB.

At the moment I am creating 7 tables for the device types with fields for all the parameters . I did try creating a table for each deviceid but it seemed to hit a limit that CrateDB did not like.

So:
1: Is it OK to create these “Tall” tables or would it be better to split out the individual parameters so you have deviceid, timestamp, field, value type schema and have far more data.

2: Could Cratedb be configured to allow for having 6700+ deviceid tables, or is it a stability/reliability issue ?

3: Regarding primary keys, does the order of the fields matter ?

I hope this makes some sense and any help would be appreciated, even if pointing to any videos/articles on the subject.

Note:
I’m using a local CrateDB 3-node Docker setup and its running quite slow for only querying 1440 rows (with interpolation for missing minutes) from 141,000 records - 63 seconds !!

Many thanks
David

proddata · July 14, 2022, 11:28am

Yes, CrateDB can handle tables with lots of records depending on the setup 100s of billions with the right partitioning.

2: Could Cratedb be configured to allow for having 6700+ deviceid tables, or is it a stability/reliability issue ?

I don’t think there is a hard limit, however having so many tables is considered bad practice with CrateDB.

3: Regarding primary keys, does the order of the fields matter ?

not, that I know of. a multi column value is hashed anyway.

Note:
I’m using a local CrateDB 3-node Docker setup and its running quite slow for only querying 1440 rows (with interpolation for missing minutes) from 141,000 records - 63 seconds !!

Can you provide the table schema and query used?

djbestenergy · July 14, 2022, 1:29pm

Hi, Thanks for the reply.

This is the table DDL in question

CREATE TABLE doc.safe_1526595589 (
	safeword bigint,
	ts timestamp with time zone,
	id_low bigint,
	id_hi bigint,
	period bigint,
	samples bigint,
	flags bigint,
	meta_8 double precision,
	a12 double precision,
	a13 double precision,
	f double precision,
	"In" double precision,
	t real,
	system_6 real,
	system_7 real,
	system_8 real,
	e double precision,
	ex double precision,
	re double precision,
	rex double precision,
	ae double precision,
	e1 double precision,
	ex1 double precision,
	re1 double precision,
	rex1 double precision,
	ae1 double precision,
	e2 double precision,
	ex2 double precision,
	re2 double precision,
	rex2 double precision,
	ae2 double precision,
	e3 double precision,
	ex3 double precision,
	re3 double precision,
	rex3 double precision,
	ae3 double precision,
	pulse double precision,
	counter_22 real,
	counter_23 real,
	counter_24 real,
	v double precision,
	u double precision,
	i double precision,
	p double precision,
	q double precision,
	s double precision,
	pf double precision,
	d real,
	v1 double precision,
	u1 double precision,
	i1 double precision,
	p1 double precision,
	q1 double precision,
	s1 double precision,
	pf1 double precision,
	d1 real,
	v2 double precision,
	u2 double precision,
	i2 double precision,
	p2 double precision,
	q2 double precision,
	s2 double precision,
	pf2 double precision,
	d2 real,
	v3 double precision,
	u3 double precision,
	i3 double precision,
	p3 double precision,
	q3 double precision,
	s3 double precision,
	pf3 double precision,
	d3 real,
	vpeak double precision,
	upeak double precision,
	ipeak double precision,
	ppeak double precision,
	qpeak double precision,
	speak double precision,
	pfpeak double precision,
	dmax real,
	vpeak1 double precision,
	upeak1 double precision,
	ipeak1 double precision,
	ppeak1 double precision,
	qpeak1 double precision,
	speak1 double precision,
	pfpeak1 double precision,
	dmax1 real,
	vpeak2 double precision,
	upeak2 double precision,
	ipeak2 double precision,
	ppeak2 double precision,
	qpeak2 double precision,
	speak2 double precision,
	pfpeak2 double precision,
	dmax2 real,
	vpeak3 double precision,
	upeak3 double precision,
	ipeak3 double precision,
	ppeak3 double precision,
	qpeak3 double precision,
	speak3 double precision,
	pfpeak3 double precision,
	dmax3 real,
	id text,
	roundts timestamp with time zone,
	CONSTRAINT safe_1526595589_pk PRIMARY KEY (id,roundts)
);

and a query that fetches a days worth of data for one id ( deviceid) with interpolation with nulls for missing data is :-

select s."roundts" as ts,
e,
id
from GENERATE_SERIES( 
1652050800000,
1652137200000,  
CAST('1 minute' as interval)) s("roundts") 
LEFT JOIN safe_1526595589 r ON r.roundts = s.roundts 
and r.id = '801f125d361a0004' limit 1440;

I’ve not used generate series before, but from earlier tests it seem to plug in the missing data gaps for each minute.

Thanks

djbestenergy · July 14, 2022, 1:37pm

So, I’m just trying clarify: - is it better to have a smaller table holding a few field , but hundreds of millions of records or a bigger table with more columns and possibly less data.

The data tends to be queried by day at a minimum timeframe with multiple parameters requested i.e. it could be 10 parameters for a device at a minutes resolution from 1 day up to a few months.

Thanks for your help!

proddata · July 14, 2022, 1:43pm

djbestenergy:

select s."roundts" as ts,
e,
id
from GENERATE_SERIES( 
1652050800000,
1652137200000,  
CAST('1 minute' as interval)) s("roundts") 
LEFT JOIN safe_1526595589 r ON r.roundts = s.roundts 
and r.id = '801f125d361a0004' limit 1440;


Eval[roundts AS ts, e, id]
  └ Fetch[roundts, e, id, roundts]
    └ Limit[1440::bigint;0]
      └ NestedLoopJoin[INNER | ((roundts = roundts) AND (id = '801f125d361a0004'))]
        ├ Rename[roundts] AS s
        │  └ TableFunction[generate_series | [col1] | true]
        └ Rename[r._fetchid, id, roundts] AS r
          └ Collect[doc.safe_1526595589 | [_fetchid, id, roundts] | true]

could you rewrite this to

select s."roundts" as ts,
e,
id
from GENERATE_SERIES( 
1652050800000,
1652137200000,  
CAST('1 minute' as interval)) s("roundts") 
LEFT JOIN safe_1526595589 r ON r.roundts = s.roundts 
WHERE r.id = '801f125d361a0004' limit 1440;

Eval[roundts AS ts, e, id]
  └ Fetch[roundts, e, id, roundts]
    └ Limit[1440::bigint;0]
      └ HashJoin[(roundts = roundts)]
        ├ Rename[roundts] AS s
        │  └ TableFunction[generate_series | [col1] | true]
        └ Rename[r._fetchid, roundts] AS r
          └ Collect[doc.safe_1526595589 | [_fetchid, roundts] | (id = '801f125d361a0004')]

seems like the filter is not propagated down and this leads to a (slower) nested loop join instead of a hash join.

So, I’m just trying clarify: - is it better to have a smaller table holding a few field , but hundreds of millions of records or a bigger table with more columns and possibly less data.

there is no default answer, but having tables with 100 mio of records is nothing special within CrateDB (we have users running 10-100s of billions of records). There is a “soft” limit of max 1000 columns per table (which can be adjusted) but it is good practice to stay below that.

djbestenergy · July 14, 2022, 2:03pm

Thank you for clarifying with regards to the table sizes.

That returned an empty result set for that new query - even though for that time range there are two records in the safe_1526595589 table. ( I was expecting roughly 1440 records with only two that did not have nulls in for e and id.
Is this because of the left join ?

Is there another way to generate minutes for the day with nulls into the result set, where there is no corresponding record in the safe_1526595589 table ?

thanks again.

proddata · July 14, 2022, 2:17pm

Sorry, my mistake

a sub-select should work better:

select s."roundts" as ts,
e,
id
from GENERATE_SERIES( 
1652050800000,
1652137200000,  
CAST('1 minute' as interval)) s("roundts") 
left JOIN (SELECT * FROM safe_1526595589 where id = '801f125d361a0004') r
  ON r.roundts = s.roundts
limit 1440;

djbestenergy · July 14, 2022, 3:37pm

proddata:

select s."roundts" as ts,
e,
id
from GENERATE_SERIES( 
1652050800000,
1652137200000,  
CAST('1 minute' as interval)) s("roundts") 
left JOIN (SELECT * FROM safe_1526595589 where id = '801f125d361a0004') r
  ON r.roundts = s.roundts
limit 1440;

Wow I can’t believe the difference - its running @ about 0.04ms now!!

Thank you so much, that’s been a great help.

Topic		Replies	Views
Table creation strategy for thousands of IoT devices CrateDB	8	252	January 22, 2024
Starting out with cratedb CrateDB	5	1028	May 31, 2022
What can be the root cause of a recent crash we encountered in one of our CrateDB nodes? CrateDB	0	780	January 27, 2022
Couldn't complete updating a table with more than 200 million records CrateDB	3	113	May 28, 2024
Partition requires significantly a lot more space than the others CrateDB	10	1073	October 26, 2021

Table creation considerations

Related topics