Crate DB Clustering on EC2

Geoff_Cunningham · February 3, 2020, 5:47pm

I’m trying to test CrateDB’s performance for some telematics data, and am looking to set up a 3-node cluster on AWS EC2. But for some reason, I can’t get the node discovery to work. I’m starting with a 2-node cluster but have hit an issue.

I set up a single node no problem on an instance, loaded it up with data, then made another copy from an image of this working instance. My crate.yml file looks like this:

gateway.expected_nodes: 2
gateway.recover_after_nodes: 2
network.host: _site_
discovery.seed_hosts:
    - <node1_ip>:4300
    - <node2_ip>:4300
cluster.initial_master_nodes:
    - <node1_ip>:4300
    - <node2_ip>:4300
cluster.name: crate

Where node1_ip and node2_ip are the instance private IPs. All other settings are defaults.

I checked the networking and can confirm with telnet and netcat that each instance can connect to the other on port 4300. Is there something else I should be doing?
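The same reachability check can be scripted so it is easy to repeat from each instance. This is only a minimal sketch using the Python standard library; the function name and the timeout are illustrative choices, not anything CrateDB provides. As the rest of the thread shows, a successful TCP connect only proves that the port is open, not that the nodes will agree to form a cluster.

```python
import socket


def port_reachable(host: str, port: int = 4300, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unroutable hosts.
        return False


# Usage (run on node 1, pointing at node 2's private IP, and vice versa):
#     port_reachable("<node2_ip>", 4300)
```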

I also tried the EC2 discovery method using the name of the security group, but that didn’t work either.

Any help would be great - I’m sure I’m doing something dumb. Note, I’ve replaced the IP with crate1, but the true printout contains the actual instance IP.

Both nodes have the following output:

ubuntu@ip-crate1:~/crate-4.0.10$ ./bin/crate
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC
[2020-02-03T17:35:24,121][INFO ][o.e.e.NodeEnvironment
[2020-02-03T17:35:24,125][INFO ][o.e.e.NodeEnvironment
[2020-02-03T17:35:24,280][INFO ][o.e.n.Node
[2020-02-03T17:35:24,289][INFO ][o.e.n.Node
[2020-02-03T17:35:24,584][INFO ][i.c.plugin
SLF4J: Failed to load class “org.slf4j.impl.StaticL
SLF4J: Defaulting to no-operation (NOP) logger
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
] [Pilatus] no modules loaded
] [Pilatus] loaded plugin [crate-azure-discovery]
] [Pilatus] loaded plugin [es-repository-hdfs]
] [Pilatus] loaded plugin [io.crate.plugin.BlobPlugin]
] [Pilatus] loaded plugin [io.crate.plugin.CrateCommonPlugin]
] [Pilatus] loaded plugin [io.crate.plugin.HttpTransportPlugin]
] [Pilatus] loaded plugin [io.crate.plugin.PluginLoaderPlugin]
] [Pilatus] loaded plugin [io.crate.plugin.SrvPlugin]
] [Pilatus] loaded plugin [io.crate.udc.plugin.UDCPlugin]
] [Pilatus] loaded plugin [org.elasticsearch.analysis.common.CommonAnalysisPlugin]
] [Pilatus] loaded plugin [org.elasticsearch.discovery.ec2.Ec2DiscoveryPlugin]
] [Pilatus] loaded plugin [org.elasticsearch.plugin.repository.url.URLRepositoryPlugin]
] [Pilatus] loaded plugin [org.elasticsearch.repositories.azure.AzureRepositoryPlugin]
] [Pilatus] loaded plugin [org.elasticsearch.repositories.s3.S3RepositoryPlugin]
] [Pilatus] loaded plugin [org.elasticsearch.transport.Netty4Plugin]
] [Pilatus] using discovery type [zen] and seed hosts providers [settings]
] [Pilatus] PSQL SSL support is disabled.
] [Pilatus] HTTP SSL support is disabled.
] [Pilatus] initialized
] [Pilatus] starting …
] [Pilatus] publish_address {crate1:5432}, bound_addresses {crate1:5432}
tpServerTransport] [Pilatus] publish_address {crate1:4200}, bound_addresses {crate1:4200}
] [Pilatus] publish_address {crate1:4300}, bound_addresses {crate1:4300}
] [Pilatus] bound or publishing to a non-loopback address, enforcing bootstrap checks
] [Pilatus] elected-as-master ([1] nodes joined)[{Pilatus}{lLw7phjsT1CqEhY-0PT5fg}{_T7XdZlUSAeaCAz66nh8ug}{crate1}{crate1:4300}{http_address=crate1:4200} elect leader, BECOME_MASTER_TASK, FINISH_ELECTION], term: 24, version: 151285, reason: master node changed {previous [], current [{Pilatus}{lLw7phjsT1CqEhY-0PT5fg}{_T7XdZlUSAeaCAz66nh8ug}{crate1}{crate1:4300}{http_address=crate1:4200}]}
[Pilatus] master node changed {previous [], current [{Pilatus}{lLw7phjsT1CqEhY-0PT5fg}{_T7XdZlUSAeaCAz66nh8ug}{crate1}{crate1:4300}{http_address=crate1:4200}]}, term: 24, version: 151285, reason: Publication{term=24, version=151285}
] [Pilatus] started

Walter_Behmann · February 11, 2020, 10:18am

Hi Geoff,
You have probably sorted this out already, so sorry for the late response. Here is an example that should work for a 2-node cluster; please adapt node.name on the 2nd node:

cluster.name: YourClusterName
node.name: vm1
node.sql.read_only: false
path.data: /data1
#bootstrap.memory_lock: true

cluster.initial_master_nodes:
  - vm1
  - vm2

discovery:
  seed_hosts:
- x.x.x.x:4300
- x.x.x.x:4300

gateway:
  expected_nodes: 2
  recover_after_nodes: 2

Regards,
Walter

Geoff_Cunningham · February 11, 2020, 12:39pm

Thanks Walter. I actually gave up after another couple of hours playing with it, so really appreciate the reply. I’ll revisit it now and see if I can get it up and running.

Is it the node.sql.read_only: false part that makes the difference then for EC2? Or the vm1/vm2 naming on the nodes in cluster.initial_master_nodes?

Walter_Behmann · February 11, 2020, 1:14pm

Great!
Ah, I see that node.sql.read_only kind of “slipped” into the post. It shouldn’t make any difference in your case; false is the default setting. The discovery block needs proper indentation, my bad, sorry.

discovery:
  seed_hosts:
    - 10.10.3.5:4300
    - 10.10.3.6:4300

Let me know whether this works for you.
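For readers comparing the two configs in this thread: as far as I know, crate.yml accepts both the dotted spelling from the first post (discovery.seed_hosts: ...) and the nested spelling above, and treats them as the same setting. The sketch below only models that equivalence with plain dictionaries. It does not parse YAML and is not CrateDB's own loader; the flatten helper is hypothetical.

```python
def flatten(settings: dict, prefix: str = "") -> dict:
    """Collapse nested mappings into dotted setting names."""
    flat = {}
    for key, value in settings.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested blocks such as "discovery:" or "gateway:".
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Nested form, as in the corrected example above.
nested = {
    "discovery": {"seed_hosts": ["10.10.3.5:4300", "10.10.3.6:4300"]},
    "gateway": {"expected_nodes": 2, "recover_after_nodes": 2},
}
# Dotted form, as in the first post of the thread.
dotted = {
    "discovery.seed_hosts": ["10.10.3.5:4300", "10.10.3.6:4300"],
    "gateway.expected_nodes": 2,
    "gateway.recover_after_nodes": 2,
}
print(flatten(nested) == dotted)  # True
```

If both spellings resolve to the same settings, switching from one to the other would not be expected to change discovery behaviour, provided the indentation of the nested form is valid.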

Geoff_Cunningham · February 12, 2020, 12:45pm

Still no results, Walter. I also added network.host: _site_, to make sure it publishes beyond the local area.

Could it be the licence? I am using the 3-node limited free licence, maybe that doesn’t support nodes beyond local?

I’ve double checked with telnet I can manually connect to :4300 on each node from the other one, so the EC2 security groups are ok.

This is my yml:

cluster.name: crate
node.name: vm1
#path.data: /data
#bootstrap.memory_lock: true

cluster:
  initial_master_nodes:
    - vm1
    - vm2

discovery:
  seed_hosts:
    - 172.xx.xx.xx:4300
    - 172.xx.xx.xx:4300

gateway:
  expected_nodes: 2
  recover_after_nodes: 2

network.host: _site_
transport.publish_port: 4300

As before, I am just seeing this in the terminal (relevant parts only):

[o.e.n.Node ] [vm1] initialized
[o.e.n.Node ] [vm1] starting …
[psql ] [vm1] publish_address {172.22.33.72:5432}, bound_addresses {172.22.33.72:5432}
[i.c.p.h.CrateNettyHttpServerTransport] [vm1] publish_address {172.22.33.72:4200}, bound_addresses {172.22.33.72:4200}
[o.e.t.TransportService ] [vm1] publish_address {172.22.33.72:4300}, bound_addresses {172.22.33.72:4300}
[o.e.b.BootstrapChecks ] [vm1] bound or publishing to a non-loopback address, enforcing bootstrap checks
[o.e.c.s.MasterService ] [vm1] elected-as-master ([1] nodes joined)[{vm1}{lLw7phjsT1CqEhY-0PT5fg}{uTeBRmjkTWm2IZ0amvmatQ}{172.22.33.72}{172.22.33.72:4300}{http_address=172.22.33.72:4200} elect leader, BECOME_MASTER_TASK, FINISH_ELECTION], term: 30, version: 924427, reason: master node changed {previous [], current [{vm1}{lLw7phjsT1CqEhY-0PT5fg}{uTeBRmjkTWm2IZ0amvmatQ}{172.22.33.72}{172.22.33.72:4300}{http_address=172.22.33.72:4200}]}
[o.e.c.s.ClusterApplierService] [vm1] master node changed {previous [], current [{vm1}{lLw7phjsT1CqEhY-0PT5fg}{uTeBRmjkTWm2IZ0amvmatQ}{172.22.33.72}{172.22.33.72:4300}{http_address=172.22.33.72:4200}]}, term: 30, version: 924427, reason: Publication{term=30, version=924427}
[2020-02-12T12:36:09,455][INFO ][o.e.n.Node ] [vm1] started

Walter_Behmann · February 12, 2020, 1:34pm

Do the logs on the 2nd node look the same? Since I might not be seeing the whole log, did you set the CRATE_HEAP_SIZE variable? See https://crate.io/docs/crate/guide/en/latest/deployment/linux/debian.html?highlight=heap#id7. Please also check the settings described here: https://crate.io/docs/crate/guide/en/latest/admin/bootstrap-checks.html?highlight=bootstrap%20checks

Geoff_Cunningham · February 12, 2020, 2:00pm

Yeah the logs are identical except the IP. I’ve set the CRATE_HEAP_SIZE variable to 16G, which is half the instance total (it’s a t3a.2xlarge). Interestingly, htop only shows 31.2G of RAM total though. I’ve also set the other bootstrap check proc settings as well.

I can access the :4200 admin pages for both nodes, but they each show nodes: 1 in the header.
nc -vz 4300 shows success for each node to access the other node.
This is driving me nuts!
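One way to confirm from a script what each node believes about cluster membership is to ask every node for its own view of the sys.nodes table over the HTTP endpoint on port 4200. The following is a rough sketch, assuming the /_sql endpoint is reachable without authentication; the helper names are made up for illustration.

```python
import json
import urllib.request


def names_from_response(payload: dict) -> set:
    """Extract node names from a /_sql response shaped like {"cols": [...], "rows": [[...]]}."""
    return {row[0] for row in payload["rows"]}


def fetch_node_names(host: str, port: int = 4200, timeout: float = 5.0) -> set:
    """Ask one node which cluster members it currently sees."""
    body = json.dumps({"stmt": "SELECT name FROM sys.nodes"}).encode("utf-8")
    req = urllib.request.Request(
        f"http://{host}:{port}/_sql",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return names_from_response(json.load(resp))


def formed_one_cluster(views: dict) -> bool:
    """True if every queried host reports the same members and one member per host."""
    memberships = list(views.values())
    return (
        len(memberships) > 0
        and all(m == memberships[0] for m in memberships)
        and len(memberships[0]) == len(views)
    )


# Usage against the two instances from this thread:
#     views = {ip: fetch_node_names(ip) for ip in ("172.22.33.72", "172.22.33.97")}
#     formed_one_cluster(views)
```

With the situation described above, each host would report only itself, so the two views differ and the check returns False.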

The full startup status logs are:

Feb 12 13:49:56 ip-172-22-33-97 systemd[1]: Started CrateDB Server.
Feb 12 13:49:58 ip-172-22-33-97 crate[13492]: [2020-02-12T13:49:58,120][INFO ][o.e.e.NodeEnvironment ] [vm2] using [1] data paths, mounts [[/ (/dev/nvme0n1p1)]], net usable_space [455.8gb], net total_space [969.3gb], types [ext4]
Feb 12 13:49:58 ip-172-22-33-97 crate[13492]: [2020-02-12T13:49:58,135][INFO ][o.e.e.NodeEnvironment ] [vm2] heap size [512mb], compressed ordinary object pointers [true]
Feb 12 13:49:58 ip-172-22-33-97 crate[13492]: [2020-02-12T13:49:58,136][INFO ][o.e.n.Node ] [vm2] node name [vm2], node ID [J2DyxtoZTvKG_e3WI-n5eA]
Feb 12 13:49:58 ip-172-22-33-97 crate[13492]: [2020-02-12T13:49:58,137][INFO ][o.e.n.Node ] [vm2] version[4.1.1], pid[13492], build[95e20da/2020-01-30T16:22:05Z], OS[Linux/4.15.0-1058-aws/amd64], JVM[Ubuntu/OpenJDK 64-Bit S
Feb 12 13:49:58 ip-172-22-33-97 crate[13492]: [2020-02-12T13:49:58,429][INFO ][i.c.plugin ] [vm2] plugins loaded: [jmx-monitoring, lang-js, enterpriseFunctions]
Feb 12 13:49:59 ip-172-22-33-97 crate[13492]: SLF4J: Failed to load class “org.slf4j.impl.StaticLoggerBinder”.
Feb 12 13:49:59 ip-172-22-33-97 crate[13492]: SLF4J: Defaulting to no-operation (NOP) logger implementation
Feb 12 13:49:59 ip-172-22-33-97 crate[13492]: SLF4J: See SLF4J Error Codes for further details.
Feb 12 13:49:59 ip-172-22-33-97 crate[13492]: [2020-02-12T13:49:59,588][INFO ][o.e.p.PluginsService ] [vm2] no modules loaded
Feb 12 13:49:59 ip-172-22-33-97 crate[13492]: [2020-02-12T13:49:59,594][INFO ][o.e.p.PluginsService ] [vm2] loaded plugin [crate-azure-discovery]
Feb 12 13:49:59 ip-172-22-33-97 crate[13492]: [2020-02-12T13:49:59,594][INFO ][o.e.p.PluginsService ] [vm2] loaded plugin [es-repository-hdfs]
Feb 12 13:49:59 ip-172-22-33-97 crate[13492]: [2020-02-12T13:49:59,594][INFO ][o.e.p.PluginsService ] [vm2] loaded plugin [io.crate.plugin.BlobPlugin]
Feb 12 13:49:59 ip-172-22-33-97 crate[13492]: [2020-02-12T13:49:59,595][INFO ][o.e.p.PluginsService ] [vm2] loaded plugin [io.crate.plugin.CrateCommonPlugin]
Feb 12 13:49:59 ip-172-22-33-97 crate[13492]: [2020-02-12T13:49:59,595][INFO ][o.e.p.PluginsService ] [vm2] loaded plugin [io.crate.plugin.HttpTransportPlugin]
Feb 12 13:49:59 ip-172-22-33-97 crate[13492]: [2020-02-12T13:49:59,595][INFO ][o.e.p.PluginsService ] [vm2] loaded plugin [io.crate.plugin.PluginLoaderPlugin]
Feb 12 13:49:59 ip-172-22-33-97 crate[13492]: [2020-02-12T13:49:59,595][INFO ][o.e.p.PluginsService ] [vm2] loaded plugin [io.crate.plugin.SrvPlugin]
Feb 12 13:49:59 ip-172-22-33-97 crate[13492]: [2020-02-12T13:49:59,595][INFO ][o.e.p.PluginsService ] [vm2] loaded plugin [io.crate.udc.plugin.UDCPlugin]
Feb 12 13:49:59 ip-172-22-33-97 crate[13492]: [2020-02-12T13:49:59,596][INFO ][o.e.p.PluginsService ] [vm2] loaded plugin [org.elasticsearch.analysis.common.CommonAnalysisPlugin]
Feb 12 13:49:59 ip-172-22-33-97 crate[13492]: [2020-02-12T13:49:59,596][INFO ][o.e.p.PluginsService ] [vm2] loaded plugin [org.elasticsearch.discovery.ec2.Ec2DiscoveryPlugin]
Feb 12 13:49:59 ip-172-22-33-97 crate[13492]: [2020-02-12T13:49:59,596][INFO ][o.e.p.PluginsService ] [vm2] loaded plugin [org.elasticsearch.plugin.analysis.AnalysisPhoneticPlugin]
Feb 12 13:49:59 ip-172-22-33-97 crate[13492]: [2020-02-12T13:49:59,596][INFO ][o.e.p.PluginsService ] [vm2] loaded plugin [org.elasticsearch.plugin.repository.url.URLRepositoryPlugin]
Feb 12 13:49:59 ip-172-22-33-97 crate[13492]: [2020-02-12T13:49:59,596][INFO ][o.e.p.PluginsService ] [vm2] loaded plugin [org.elasticsearch.repositories.azure.AzureRepositoryPlugin]
Feb 12 13:49:59 ip-172-22-33-97 crate[13492]: [2020-02-12T13:49:59,597][INFO ][o.e.p.PluginsService ] [vm2] loaded plugin [org.elasticsearch.repositories.s3.S3RepositoryPlugin]
Feb 12 13:49:59 ip-172-22-33-97 crate[13492]: [2020-02-12T13:49:59,597][INFO ][o.e.p.PluginsService ] [vm2] loaded plugin [org.elasticsearch.transport.Netty4Plugin]
Feb 12 13:50:00 ip-172-22-33-97 crate[13492]: [2020-02-12T13:50:00,363][INFO ][o.e.d.DiscoveryModule ] [vm2] using discovery type [zen] and seed hosts providers [settings, ec2]
Feb 12 13:50:01 ip-172-22-33-97 crate[13492]: [2020-02-12T13:50:01,201][INFO ][psql ] [vm2] PSQL SSL support is disabled.
Feb 12 13:50:01 ip-172-22-33-97 crate[13492]: [2020-02-12T13:50:01,359][INFO ][i.c.p.PipelineRegistry ] [vm2] HTTP SSL support is disabled.
Feb 12 13:50:01 ip-172-22-33-97 crate[13492]: [2020-02-12T13:50:01,423][INFO ][o.e.n.Node ] [vm2] initialized
Feb 12 13:50:01 ip-172-22-33-97 crate[13492]: [2020-02-12T13:50:01,424][INFO ][o.e.n.Node ] [vm2] starting …
Feb 12 13:50:01 ip-172-22-33-97 crate[13492]: [2020-02-12T13:50:01,603][INFO ][psql ] [vm2] publish_address {172.22.33.97:5432}, bound_addresses {172.22.33.97:5432}
Feb 12 13:50:01 ip-172-22-33-97 crate[13492]: [2020-02-12T13:50:01,621][INFO ][i.c.p.h.CrateNettyHttpServerTransport] [vm2] publish_address {172.22.33.97:4200}, bound_addresses {172.22.33.97:4200}
Feb 12 13:50:01 ip-172-22-33-97 crate[13492]: [2020-02-12T13:50:01,656][INFO ][o.e.t.TransportService ] [vm2] publish_address {172.22.33.97:4300}, bound_addresses {172.22.33.97:4300}
Feb 12 13:50:01 ip-172-22-33-97 crate[13492]: [2020-02-12T13:50:01,662][INFO ][o.e.b.BootstrapChecks ] [vm2] bound or publishing to a non-loopback address, enforcing bootstrap checks
Feb 12 13:50:01 ip-172-22-33-97 crate[13492]: [2020-02-12T13:50:01,833][INFO ][i.c.p.h.HttpAuthUpstreamHandler] [vm2] Password authentication failed for user=admin from connection=/10.193.30.53
Feb 12 13:50:01 ip-172-22-33-97 crate[13492]: [2020-02-12T13:50:01,879][INFO ][o.e.c.s.MasterService ] [vm2] elected-as-master ([1] nodes joined)[{vm2}{J2DyxtoZTvKG_e3WI-n5eA}{MTv0nf48R2qCtUYWkCro-w}{172.22.33.97}{172.22.33.97:4300}{h
Feb 12 13:50:01 ip-172-22-33-97 crate[13492]: [2020-02-12T13:50:01,949][INFO ][o.e.c.s.ClusterApplierService] [vm2] master node changed {previous [], current [{vm2}{J2DyxtoZTvKG_e3WI-n5eA}{MTv0nf48R2qCtUYWkCro-w}{172.22.33.97}{172.22.33.
Feb 12 13:50:01 ip-172-22-33-97 crate[13492]: [2020-02-12T13:50:01,968][INFO ][o.e.n.Node ] [vm2] started
Feb 12 13:50:02 ip-172-22-33-97 crate[13492]: [2020-02-12T13:50:02,003][INFO ][i.c.p.h.HttpAuthUpstreamHandler] [vm2] Password authentication failed for user=crate from connection=/10.193.30.53
Feb 12 13:50:02 ip-172-22-33-97 crate[13492]: [2020-02-12T13:50:02,017][INFO ][i.c.l.EnterpriseLicenseService] [vm2] License loaded. issuedTo=Trial-crate maxNodes=3 expiration=None
Feb 12 13:50:02 ip-172-22-33-97 crate[13492]: [2020-02-12T13:50:02,019][INFO ][o.e.g.GatewayService ] [vm2] recovered [0] indices into cluster_state

Walter_Behmann · February 12, 2020, 4:50pm

Hi Geoff, we decided to take a closer look at it. We probably stumbled over the same situation: the cluster did not build a quorum.
It seems that, because the first startup of the nodes didn’t work, they left some sort of stale cluster state in the data directory. After stopping crate on both nodes and deleting the content of /home/ec2-user/crate-4.1.1/data/nodes (please adapt to your directory), everything started to work for us. Could you please try this on your setup? Let us know whether this fixes your problem, and we will escalate it to development.

smu · February 18, 2020, 9:09am

So you used the existing data directory of that working single instance (started with discovery.type: single-node?) for one of the new 2-cluster instances?
If that is the case, an initial cluster state (cluster UUID) was already created for single-node usage, and this node won’t join any other cluster.
This is the expected behaviour, but it is not well documented (it is only noted, somewhat hidden, here: Run CrateDB on Docker — CrateDB: Tutorials). We’ll follow up and improve the documentation.

To force a node to forget its current (single-node-only) cluster state, one can use the crate-node CLI tool. See CLI tools — CrateDB: Reference.
This will throw away the node’s metadata and thus may lose data.
Example usage (node must be stopped):
<PATH-TO>/crate-node detach-cluster -Cpath.conf=<DATA-PATH> -Cpath.home=<CRATE-HOME>

Geoff_Cunningham · February 18, 2020, 10:27am

That makes perfect sense, thanks smu.

I did it because it’s easy to clone an instance on AWS - so I made an image to save inserting around a terabyte of data for each one. Thought I was being clever and saving time and bandwidth! Didn’t realise I was locking down single-node for the cluster.

I will follow your advice and hopefully get it working. If I lose some data, that’s not a big deal: we are currently evaluating CrateDB for our telematics use case and common queries, so it won’t affect our benchmarking much.

Geoff_Cunningham · February 18, 2020, 5:02pm

For future readers, do not do this for all nodes. Just do it for all but one, otherwise you get the following error:

[2020-02-18T16:57:52,598][WARN ][o.e.c.c.ClusterFormationFailureHelper] [vm1] master not discovered yet and this node was detached from its previous cluster, have discovered [{vm2}{lLw7phjsT1CqEhY-0PT5fg}{SbMXIU1VRUGeKYRFCnTjxA}{172.22.33.97}{172.22.33.97:4300}{http_address=172.22.33.97:4200}]; discovery will continue using [172.22.33.97:4300] from hosts providers and [{vm1}{lLw7phjsT1CqEhY-0PT5fg}{vR6tM_BmRFWqEc4cgZ2SBg}{172.22.33.72}{172.22.33.72:4300}{http_address=172.22.33.72:4200}] from last-known cluster state; node term 0, last-accepted version 1160769 in term 0

If you do that, like I did, you need to manually unsafe-bootstrap a node to be a new master, as there are no masters left after detaching them all.
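To make the "all but one" rule harder to get wrong, the detach step can be planned up front. The sketch below only builds the command lines, reusing the flags from the example earlier in this thread; it does not run anything. The location of the tool under bin/ and the paths in the usage note are assumptions, and each command would have to be run locally on its own, stopped, node.

```python
def detach_commands(nodes: dict, keep: str) -> dict:
    """Build one detach-cluster command line per node, skipping the node to keep.

    nodes maps a node name to a (crate_home, path_conf) pair.
    """
    if keep not in nodes:
        raise ValueError(f"node to keep {keep!r} is not in the node list")
    commands = {}
    for name, (crate_home, path_conf) in nodes.items():
        if name == keep:
            # This node keeps its cluster state so a master remains.
            continue
        commands[name] = [
            f"{crate_home}/bin/crate-node",  # assumed location of the tool
            "detach-cluster",
            f"-Cpath.conf={path_conf}",
            f"-Cpath.home={crate_home}",
        ]
    return commands
```

For example, detach_commands({"vm1": (home, conf), "vm2": (home, conf)}, keep="vm1") yields a single command, for vm2 only.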

Geoff_Cunningham · February 18, 2020, 5:15pm

The story continues I’m afraid. After all that, the node still can’t join. The error message is long, but boils down to:

Caused by: java.lang.IllegalArgumentException: can’t add node {vm2}, found existing node {vm1} with the same id but is a different node instance

At this stage, I’m going to start from scratch and create each EC2 node manually. This is a shame, but not the end of the world.

If anyone cracks this problem, please post here so future EC2 users can read it. Using images is really convenient for scaling, but it’s clear there are multiple issues with CrateDB when doing this.
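The "same id" error is consistent with the earlier warning in this thread, where vm1 and vm2 both appear with the ID lLw7phjsT1CqEhY-0PT5fg, presumably because the cloned image carried the node ID along in its data directory. A small sketch for spotting this in a log follows. It assumes the node descriptor fields are ordered as name, node ID, ephemeral ID, host, transport address, which is my reading of the log lines above rather than a documented format.

```python
import re
from collections import defaultdict

# {name}{node_id}{ephemeral_id}{host}{host:port}
NODE = re.compile(r"\{([^{}]+)\}\{([^{}]+)\}\{([^{}]+)\}\{([^{}]+)\}\{([^{}]+:\d+)\}")


def duplicate_node_ids(log_text: str) -> dict:
    """Map each node ID seen under more than one transport address to those addresses."""
    addresses_by_id = defaultdict(set)
    for match in NODE.finditer(log_text):
        _name, node_id, _ephemeral_id, _host, address = match.groups()
        addresses_by_id[node_id].add(address)
    return {
        node_id: sorted(addresses)
        for node_id, addresses in addresses_by_id.items()
        if len(addresses) > 1
    }


# Excerpt of the warning quoted earlier in the thread.
sample = (
    "master not discovered yet and this node was detached from its previous cluster, "
    "have discovered [{vm2}{lLw7phjsT1CqEhY-0PT5fg}{SbMXIU1VRUGeKYRFCnTjxA}"
    "{172.22.33.97}{172.22.33.97:4300}{http_address=172.22.33.97:4200}]; "
    "... [{vm1}{lLw7phjsT1CqEhY-0PT5fg}{vR6tM_BmRFWqEc4cgZ2SBg}"
    "{172.22.33.72}{172.22.33.72:4300}{http_address=172.22.33.72:4200}] "
    "from last-known cluster state"
)
print(duplicate_node_ids(sample))
# {'lLw7phjsT1CqEhY-0PT5fg': ['172.22.33.72:4300', '172.22.33.97:4300']}
```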

miguel.arregui · June 8, 2020, 1:29pm

Hi Geoff,

If I understood correctly, this is what you did:

Installed/started crate with a default config
Loaded a lot of data
Made an image copy in AWS
Stopped all nodes, changed their config to create a cluster, restarted them, expecting them to form the new cluster

If so, first off, a couple of clarifications are in order.

In CrateDB, tables are broken down into shards, and these are automatically distributed across the cluster. You do not need to cleverly distribute the data the way you attempted by cloning the image.
When you start CrateDB with an empty configuration file, or the default one, it forms its own cluster and does not expect to discover further nodes. This setup is meant for development only. It happens because the default configuration does not define the setting cluster.initial_master_nodes; the note at the end of this section hints at it (we will amend the note to make it clearer).

Once a node joins a cluster, it is tied to that cluster by state, which is part of the information stored in the data folder. You can detach the node from any cluster (at the risk of losing some data, namely in-flight data) by stopping it and running the crate-node tool. Then you can reconfigure it and start it.

Perhaps the best approach would be to set up the cluster first and then do the ingestion.

Kind regards,

EYassir · May 14, 2021, 7:29pm

Hi,

I also tried the EC2 discovery feature of CrateDB > 4.x, and it didn’t work well for me, so I built this with Terraform. It creates a full CrateDB cluster with autoscaling features, in case it helps someone:
https://github.com/EYassir/terra-crate
With every scale-up, it updates the CrateDB configuration file crate.yml so that it keeps track of all nodes.
