I managed to use CrateDB with Tableau using the CrateDB Tableau connector and a fairly large database with about 500 million rows.
I ran into some small problems, all of which I eventually solved.
Installation using Docker:
That went without any problems! I opted for a very basic setup. I assigned 4 cores to CrateDB and used a single container.
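For reference, a minimal single-container setup along those lines could look like this (image tag, heap size, and port mappings are my assumptions, not the exact command I used):

```shell
# Single-node CrateDB container, pinned to 4 CPU cores.
# CRATE_HEAP_SIZE is the environment variable CrateDB uses to size the JVM heap.
docker run -d --name cratedb \
  --cpus=4 \
  -e CRATE_HEAP_SIZE=16g \
  -p 4200:4200 -p 5432:5432 \
  crate
```

Port 4200 is CrateDB's HTTP/admin interface; 5432 is the PostgreSQL wire protocol that JDBC/ODBC clients connect to.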
Importing data:
Importing data was fun: it took almost 6 hours to import a 200GB CSV file.
CrateDB is not particularly fast in this area: other databases took only 3 hours, with a much lower CPU load (80-100%) compared to CrateDB (200-400%).
The table was partitioned on the year and month of the timestamp (time-series data). That seems to work well!
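As a sketch, year/month partitioning in CrateDB can be expressed with a generated column; the table, column names, and file path below are hypothetical, not my actual schema:

```sql
-- Partition column derived from the timestamp, truncated to month.
CREATE TABLE readings (
  ts TIMESTAMP WITH TIME ZONE,
  value DOUBLE PRECISION,
  month TIMESTAMP WITH TIME ZONE GENERATED ALWAYS AS date_trunc('month', ts)
) PARTITIONED BY (month);

-- Bulk import from a local CSV file (path is an example).
COPY readings FROM 'file:///data/readings.csv' WITH (format = 'csv');
```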
If I look at disk space, CrateDB does a fine job:
- Tableau Hyper database: 20GB → single file
- CrateDB data folder: 84GB → partitioning per year/month
- TimescaleDB data folder: 177GB → partitioning per year/month
- QuestDB data/db folder: 594GB → partitioning per day (4,400 folders, 666,000 files)
That is about 4 times more than Tableau, 7 times less than QuestDB, and about half of TimescaleDB, which is what I expected from reading the blogs on this topic.
Issues with Tableau connector:
I could not find the actual connector on GitHub, so I created the .taco file myself by zipping the contents of the GitHub files.
Another point is that the connector is not signed, meaning you have to start Tableau with the option to disable connector signing checks.
I hope the signing issue can be solved soon.
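For anyone repeating this, the steps were roughly as follows; the directory and file names are my own, and the Tableau installation path is just an example:

```shell
# A .taco file is simply a zip archive of the connector directory contents.
cd cratedb-connector
zip -r ../cratedb_jdbc.taco .

# Start Tableau Desktop with connector signature verification disabled
# (flag name taken from Tableau's Connector SDK documentation).
tableau.exe -DDisableVerifyConnectorPluginSignature=true
```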
Issues in combination with Tableau Desktop:
After the issues with the connector were solved, and the data was imported, I could test CrateDB with the existing sheets/dashboards to see how it integrates with Tableau.
- I had to increase the heap size. The default is 1GB, and I had to change that to 15GB or more to get things working…
- CrateDB does not automatically cast an int or bigint to a double in a CASE statement where the expected result is a float. Mixing int/bigint also gives an error in CrateDB. I solved this by explicitly casting these dimensions in Tableau; none of the other tested databases needed these casts.
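To illustrate the casting issue, here is the kind of expression that tripped me up, with hypothetical column names. The commented-out variant mixes integer and double branches; the explicit cast makes both branches the same type:

```sql
-- Fails in CrateDB: CASE branches mix BIGINT and DOUBLE PRECISION.
-- SELECT CASE WHEN flag THEN counter ELSE ratio END FROM t;

-- Works: explicit cast makes both branches DOUBLE PRECISION.
SELECT CASE WHEN flag
            THEN CAST(counter AS DOUBLE PRECISION)
            ELSE ratio
       END AS result
FROM t;
```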
Performance with Tableau Desktop:
CrateDB does a fine job compared to the other databases. Nothing beats a local Tableau hyper database, but thanks to the automatic indexing of every column, and the Lucene full-text search support, CrateDB gives usable dashboard presentation times.
Compared to Tableau…
- Most dashboards are about twice as slow, but because Tableau only needs 1-2 seconds to display a dashboard, the 1-4 seconds for CrateDB is still great!
- Only for the larger dashboards (10 million data points or more) does the wait increase to 40-80 seconds or more for CrateDB. This is partly due to the difference between a local SSD and a 1Gbit LAN connection for getting the data into Tableau.
- I have not tested if assigning more CPU power to CrateDB helps to shorten these waiting times.
Compared to TimescaleDB…
- Many dashboards are about as fast as TimescaleDB if the indexes for TimescaleDB are set correctly for the required columns/fields. If not, CrateDB is much, much, much faster.
- In some dashboards where full-text search is needed, the difference becomes enormous: seconds versus 10, 20, even 30 minutes!
This is because TimescaleDB does a full table scan in those cases, where CrateDB can rely on Lucene!
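A sketch of the kind of query where that difference shows up (column and search terms are hypothetical): CrateDB can answer a MATCH predicate from its Lucene full-text index, while the equivalent pattern filter in a database without a text index scans the whole table.

```sql
-- CrateDB: answered from the Lucene full-text index on 'description'.
SELECT count(*) FROM readings
WHERE MATCH(description, 'pump failure');

-- TimescaleDB/PostgreSQL equivalent without a text index: full table scan.
-- SELECT count(*) FROM readings WHERE description ILIKE '%pump failure%';
```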
Compared to QuestDB…
- QuestDB does not (yet) fully implement all CAST types and SQL92 statements, so a direct comparison is difficult.
- The dashboards that do work show comparable speed.
In all cases, CrateDB needs a lot of RAM (40GB in my case) to do the job. I haven't investigated why, as this is just a test. It might be Lucene, as I did not disable full-text search/indexing on any TEXT field.
Verdict:
CrateDB does a fine job, given the basic installation I used and the amount of data (about 500 million rows) I’m querying from Tableau.
Many dashboards take only a second longer to display compared to Tableau, which is hardly noticeable to the average user.
I will probably test with a few more CPU cores to see if that helps decrease the time needed to present the dashboards. I don't think assigning more memory will help.
Also, the partitioning on year/month seems to work well. I’m afraid that partitioning on year/month/day will make it worse, as many dashboards show data from a year or more. But that is just guesswork from my side.
Keep up the good work!