Quickly starting CrateDB with 2.5M records of the NYC Yellowcab dataset

:spiral_notepad: Introduction

This is a little note on how to quickly spawn a single-node CrateDB instance on your workstation using Docker and load it with a subset of the NYC Yellowcab dataset. The intention is to have a single command to give you a fast path to be ready for different explorations, without needing to invoke the corresponding commands interactively.

:battery: Prerequisites

You will only need Bash and Docker to be installed on your workstation, the launcher program has been confirmed to work in Linux, macOS, and WSL environments. When running it, you need to be connected to the internet because it will acquire the dataset from an S3 bucket using the HTTP protocol.

:arrow_forward: Usage

curl https://raw.githubusercontent.com/crate/cratedb-examples/main/operation/testbench-yellowcab/cratedb-import-nyc-yellowcab.sh | bash

Before running the program, you may want to inspect it at cratedb-import-nyc-yellowcab.sh.

:broom: Clean up

The CrateDB database instance running in a container can be terminated by invoking:

docker rm cratedb --force

:hammer_and_wrench: Reference

The program can be used as a blueprint for your own explorations as well. Just reuse the boilerplate code from the top half of the program and exercise your own explorations by adjusting the SQL statements in the bottom half. If you feel they might be interesting to others as well, you are encouraged to share them back.

1 Like