Data Analysis with Cluvio and CrateDB

Introduction

In this tutorial, we’ll explore how to leverage the power of Cluvio, a modern data analysis platform with CrateDB Cloud as the underlying database.

Prerequisites

Setting up CrateDB

Deploying a CrateDB cloud cluster has never been easier, simply follow our tutorial here and you can have a cluster up and running within minutes. We offer a CRFREE plan which offers up to 2 vCPUs, 2 GiB of memory, and 8 GiB of storage completely for free. Ideal for small-scale testing and evaluation purposes.

Loading Data into CrateDB cluster

In this tutorial we’ll use 2 tables as our datasource. flights and airports from January of 2008.

Create tables

Once you have access to the Admin UI of your cluster, execute these statements in the console:

CREATE TABLE airports (
  code character varying(3) NOT NULL,
  name character varying(100) NOT NULL,
  city character varying(50) NOT NULL,
  country character varying(50) NOT NULL,
  latitude double precision NOT NULL,
  longitude double precision NOT NULL,
  elevation integer NOT NULL
);
CREATE TABLE flights (
  year integer,
  month integer,
  day_of_month integer,
  day_of_week integer,
  dep_time  integer,
  crs_dep_time integer,
  arr_time integer,
  crs_arr_time integer,
  unique_carrier varchar(6),
  flight_num integer,
  tail_num varchar(8),
  actual_elapsed_time integer,
  crs_elapsed_time integer,
  air_time integer,
  arr_delay integer,
  dep_delay integer,
  origin varchar(3),
  dest varchar(3),
  distance integer,
  taxi_in integer,
  taxi_out integer,
  cancelled integer,
  cancellation_code varchar(1),
  diverted varchar(1),
  carrier_delay integer,
  weather_delay integer,
  nas_delay integer,
  security_delay integer,
  late_aircraft_delay integer,
  dep_timestamp timestamp,
  arr_timestamp timestamp
);

This creates 2 empty tables in your database. flights and airports, with the correct data types of the columns.

Import data

Now you should import the data into the tables. We will use Console “Import” feature in this example. Use the following links:

Make sure to use your pre-created tables in the “Table name” field, otherwise the column types may be created incorrectly. Do this for both .csv files:

After this your tables should no longer be empty. airports contains 5876 records, and flights 150 000 records.

Connecting CrateDB to Cluvio

Now that you have the dataset to visualize, you can connect your cluster to Cluvio.

In Cluvio, navigate to SettingsDatasourcesAdd datasource:

Host and Password will understandably differ for you.

After filling out the details for your cluster, press Next: Test Connection. If the credentials are correct, you should see this success message:

Dashboards

A dashboard is the main point of Cluvio. It is a collection of interactive reports, giving great insight into any area of your data. These are the types of charts Cluvio offers:

Now, let’s create some and see how Cluvio works. Head to DashboardsNew Dashboard. After naming your Dashboard, you can create your first report. Click the New report in the upper right.

Number of flights and delays

The first piece information you might be interested in, for a given period, is the number of flights and average delays of departures and arrivals. This is the code for this report:

SELECT
       COUNT(*)       AS "Number of flights",
       AVG(dep_delay) AS "Average Departure Delay",
       AVG(arr_delay) AS "Average Arrival Delay"
FROM   doc.flights
ORDER  BY 1 

This is a pretty simple query that counts the number of rows in the flights as the number of flights, and averages values in the dep_delay and arr_delay for the departure delays and arrival delays respectively.

To see the information displayed this way, you need to switch to “Number” chart after running query.

Country distribution

This query looks at the country distribution in the airports table:

SELECT   country,
         COUNT(1)
FROM     doc.airports
GROUP BY country
ORDER BY 2 DESC

In this one, it’s suitable to use pie chart to better see the distribution. We also used the Value(%) option for the legend, and edited the legend to show up to 25 values (countries).

Filters

Filters offer a great way to quickly specify the condition under which you want to display your data.

In the flights table in day_of_week column 1 represents Monday, 2 means Tuesday, etc. Using that, we can create a filter to display data for a specific day of the week without changing the SQL in our reports.

VALUES 
(1, 'Monday'),
(2, 'Tuesday'),
(3, 'Wednesday'),
(4, 'Thursday'),
(5, 'Friday'),
(6, 'Saturday'),
(7, 'Sunday')
ORDER BY 1

Now we can filter the data by day of the week:

Find out more about filters here.

SQL snippets

SQL snippets are small reusable pieces of code that can make your work easier within larger dataset. They are managed here.

We used them to create JOIN statements:

JOIN doc.airports AS origin_airport ON flights.origin = origin_airport.code
JOIN doc.airports AS dest_airport ON flights.dest = dest_airport.code

This snippet creates two joins between the flights and airports tables, aliasing the airports table as origin_airport and dest_airport for the origin and destination airports, respectively.

Then create a report using the snippet:

SELECT flights.year,
       flights.month,
       origin_airport.city AS origin_city,
       dest_airport.city AS destination_city,
       COUNT(*) AS number_of_flights
FROM doc.flights
[join_airports] -- Reference to the SQL Snippet
WHERE [flight_filters] -- Reference our filters
GROUP BY flights.year, flights.month, origin_airport.city, dest_airport.city
ORDER BY number_of_flights DESC
LIMIT 100;

Using the SQL snippets and filters, we can quickly find out what is the most popular destination departing from Los Angeles (LAX) on a Tuesday. Pretty cool.

Conclusion

That’s it for this tutorial. If using Cluvio could help you make sense of your data, feel free to head to Cloud Console, connect your cluster to Cluvio and get started! Make sure to visit their documentation to explore all the features.

2 Likes