Hi everyone,
We have a 3-node cluster (8 cores and 64 GB RAM per node) in the AWS cloud. Our use case is an IoT platform, and our data ingestion rate is about 350 insertions/sec, with a data retention window of about 6 months.
CRATE_HEAP_SIZE = 30.5 GB
We are currently seeing node disconnections every 30 minutes or less. The node crashes completely and restarts, and a new master node is elected. This happens all day.
Any advice?
Thanks in advance
Could you share some further information with us?
- Which version of CrateDB are you using?
- Which OS are you running CrateDB on?
In addition, it would help if you could provide:
- DB Schemas
- Config File
- A heap dump
- The crate log of the crashing node
… and …
- Monitoring snapshots (e.g. Grafana) of the exposed JMX metrics:
  - crate_threadpools queueSize/active/rejected
  - GC info:
    - Young + Old Generation avg (jvm_gc_collection_seconds_sum/jvm_gc_collection_seconds_count)
    - Survivor space (jvm_memory_pool_bytes_used)
    - GC rates (jvm_gc_collection_seconds_count)
  - DirectBuffer memory usage (jvm_buffer_pool_used_bytes)
  - Queries per second (crate_query_total_count)
  - Query error rate (crate_query_failed_count)
  - CircuitBreaker memory usages (crate_circuitbreakers)
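If no dashboard is set up yet, the Young/Old Generation average mentioned above can be computed by hand from a scrape of the metrics endpoint. The following is a minimal sketch, assuming the metrics are available in the Prometheus text format and that the collector is identified by a `gc` label; the sample values and the function name `average_gc_pause` are invented for illustration.

```python
import re

# Hypothetical scrape output; the numbers are made up for illustration.
SAMPLE = """\
# TYPE jvm_gc_collection_seconds summary
jvm_gc_collection_seconds_count{gc="G1 Young Generation",} 1200.0
jvm_gc_collection_seconds_sum{gc="G1 Young Generation",} 30.0
jvm_gc_collection_seconds_count{gc="G1 Old Generation",} 4.0
jvm_gc_collection_seconds_sum{gc="G1 Old Generation",} 10.0
"""

# Matches the two GC series and captures: kind (sum|count), gc label, value.
# Assumption: the collector name is exposed in a label called "gc".
LINE_RE = re.compile(
    r'^jvm_gc_collection_seconds_(sum|count)\{[^}]*gc="([^"]*)"[^}]*\}\s+(\S+)$'
)


def average_gc_pause(text):
    """Return {collector: average seconds per collection} from a metrics dump."""
    totals = {}  # collector -> {"sum": seconds, "count": collections}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # comments, blank lines and unrelated metrics
        kind, collector, value = match.groups()
        totals.setdefault(collector, {})[kind] = float(value)
    return {
        collector: parts["sum"] / parts["count"]
        for collector, parts in totals.items()
        if parts.get("count", 0) > 0 and "sum" in parts
    }


averages = average_gc_pause(SAMPLE)
for collector, seconds in sorted(averages.items()):
    print(f"{collector}: {seconds:.3f} s per collection")
```

Both series are cumulative counters, so this gives the average since the node started. To see what happens shortly before a crash, it is probably more useful to take two scrapes a few minutes apart and divide the difference of the sums by the difference of the counts.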
Hey Daniel, any update on this one?
I will be glad to look into this. To reproduce your setup as closely as possible, I would appreciate the following information:
- CrateDB version
- crate.yml / configuration flags
- show create table of the ingest table
- how much data you have in the table/partitions
- OS version/image
- EC2 instance type you use
- Load balancer type
- filesystem type and size where the data is stored
Regards,
Walter
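Several of the items requested above (CrateDB version, OS, amount of data per table/partition) can be read from the system tables. The following is a rough sketch, assuming the HTTP endpoint is reachable at `localhost:4200`; the URL, the helper names and the `STATEMENTS` dictionary are illustrative assumptions and need to be adapted to the actual cluster.

```python
import json
import urllib.request

# Assumption: CrateDB's HTTP endpoint is reachable here; adjust host and port.
CRATE_URL = "http://localhost:4200/_sql"

# Queries against CrateDB system tables for the requested information.
STATEMENTS = {
    # Version and OS of every node in the cluster.
    "versions": (
        "SELECT name, version['number'] AS crate_version, "
        "os_info['name'] AS os_name, os_info['version'] AS os_version "
        "FROM sys.nodes"
    ),
    # Row count and on-disk size per table and partition (primary shards only).
    "table_sizes": (
        'SELECT schema_name, table_name, partition_ident, '
        'sum(num_docs) AS docs, sum(size) AS bytes '
        'FROM sys.shards WHERE "primary" = true '
        'GROUP BY schema_name, table_name, partition_ident'
    ),
}


def build_request(stmt, url=CRATE_URL):
    """Build a POST request carrying one SQL statement as JSON."""
    body = json.dumps({"stmt": stmt}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )


def rows_as_dicts(response):
    """Turn a decoded response with 'cols' and 'rows' into a list of dicts."""
    return [dict(zip(response["cols"], row)) for row in response["rows"]]


def run(stmt, url=CRATE_URL):
    """Send the statement to the cluster and return the result rows."""
    with urllib.request.urlopen(build_request(stmt, url), timeout=10) as resp:
        return rows_as_dicts(json.load(resp))
```

Calling `run(STATEMENTS["versions"])` and `run(STATEMENTS["table_sizes"])` against one of the nodes should return output that can be pasted into the thread. The same statements can also be run directly in the Admin UI console or in crash.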