A significant feature in CrateDB is that it can scale horizontally, which means that instead of adding more RAM, CPU, and disk resources to our existing nodes we can add more nodes to our CrateDB cluster.
This allows the handling of volumes of data that simply could not fit on a single node, but it is also very useful in scenarios where hosting everything in a single node, or a small number of nodes, would still be possible, this is because smaller nodes are often easier to manage infrastructure-wise.
More nodes also mean more resiliency to issues, on a scenario where we have for instance 5 nodes, and configure our tables with 2 replicas, we could lose 2 nodes and still serve our production workloads. This means we can carry out maintenance on the nodes one at a time and still be able to withstand an unplanned issue on another node without downtime.
Today we want to review how to add a new node to an existing on-premises cluster.
When a CrateDB node starts it needs a mechanism to get a list of the nodes that make up the cluster, this is called discovery.
At the time of writing, there are 3 ways for a node to get this list:
- The list of nodes can be defined in the
discovery.seed_hostssetting in the configuration file (typically in
- The list can be retrieved with a DNS query, see https://crate.io/docs/crate/reference/en/5.4/config/cluster.html#discovery-via-dns
- In AWS environments, the list of nodes can be looked up via the EC2 API, filtering on specific security groups, availability zones, and tags, see Cluster-wide settings — CrateDB: Reference
For the purpose of this post, we will work with the
If a node is started without specifying the
initial_master_nodes setting (the default configuration), or with
discovery_type set to
single-node, it will be started as a standalone instance and it cannot later be scaled into a cluster. Single-node deployments are great for development and testing, but for production setups we recommend using a cluster with at least 3 nodes.
If you are going for a single-node deployment initially, but plan to scale to a multi-node cluster in the future, there are some settings to configure before the very first run of the CrateDB node so that it bootstraps as a 1-node cluster instead of a standalone instance.
The settings that we need are:
discovery.seed_hostsset to the hostname or the
node.nameof the node
initial_master_nodesset to the hostname or the
node.nameof the node
- optionally we can set a
If you are using containers you would pass these settings with lines in the
args section of your YAML file, otherwise you could create
/etc/crate/crate.yml before deploying the package for your distribution (refer to https://github.com/crate/crate/blob/master/app/src/main/dist/config/crate.yml for the template), or you could prevent the package installation from auto-starting the daemon by using a mechanism such as
policy-rcd-declarative, then edit the configuration file (
crate.yml ), and start the
crate daemon once all settings are ready.
Nodes need to be able to resolve each other’s hostnames at DNS level, and they need to be able to reach each other on a TCP port which is 4300 by default.
For security reasons you should configure your network so that CrateDB cluster nodes are only reachable on port 4300 from other CrateDB nodes in the cluster.
In a Kubernetes environment this can be achieved with a
Service resource with a
In a non-containerized environment one way to do this is to use firewall software directly on each node, for instance:
#Enable ufw - all incoming connections blocked by default sudo ufw enable #Allow SSH if you are using it to manage your server sudo ufw allow 22 #Allow 4200 for clients to connect to CrateDB via the http endpoint sudo ufw allow 4200 #Allow 5432 if you have PostgreSQL clients sudo ufw allow 5432 #Allow 4300 from 192.168.0.202 (another cluster node in this example) sudo ufw allow proto tcp from 192.168.0.202 to any port 4300
You may also want to consider network access control and/or a separate network adapter for intra-cluster communications.
Make sure the new node does not auto-bootstrap as a single-node instance, you may want to either create
/etc/crate/crate.yml in advance or use a mechanism as
policy-rcd-declarative as mentioned earlier.
On the configuration file for the new node:
discovery.seed_hoststo the full list of nodes, including the new one you are adding.
- Optionally set a
node.name, if not done the node get assigned a random name from the
sys.summitstable. You may wonder what those default names are about, they are the names of mountains in the area around our main office, we love mountains at Crate.io.
cluster.nameto a value that matches the other nodes in the cluster, if not specified the default cluster name is
- Consider if you want to set the cluster-wide settings
gateway.recover_after_timeto prevent the unnecessary creation of new replicas and the rebalancing of shards when a node takes a little bit longer to start, or in case of transient issues, when the cluster is starting up from a situation where all nodes are shutdown. Please note these settings are used when the cluster is starting up from being offline, if you want to delay the allocation of replicas when a node becomes unavailable on a cluster that stays online there is a different setting at table level.
Now we can start the
crate daemon, you will see the node joining the cluster and CrateDB will start using it for shards allocation.
Remember to add the new nodes alongside the old ones in any monitoring system and load balancer configuration you may have in your environment.
Now we need to align a number of settings in the other nodes, these are typically in the
discovery.seed_hostsadding the new node
- If you have configured
gateway.settings, update them to have the same values on all nodes
These settings only play a role during restart, not at runtime, so you do not need to restart the nodes after making these changes, but if the
gateway. settings need updating you may see a warning in the Admin UI which can be acknowledged.
Please also note there is no need to update the
initial_master_nodes list, this is only considered during the initial cluster bootstrapping.
And that is it, we have scaled out our cluster and we are ready to work with larger volumes of data. I hope you find this useful and, as usual, please do not hesitate to raise any thoughts or questions in the CrateDB Community.