Increased cost in AWS inter-zone transfers

For several years, our Production cluster (v3.3) has been a 3-node, multi-zone (3 zones) cluster hosted on AWS, so each node was in its own zone. Each partition is set up with 2 primary shards, each with 2 replicas, for a total of 6 shards per partition (so each zone held 2 shards).

To help with storage and load (and to assist in moving to v4.8), we recently added 6 nodes (bringing the total to 9). After this, we began incurring significantly more inter-zone transfer costs. The shards appear to be allocated the same way, in that I still see 2 shards per zone (they are just spread out among the new nodes now). Does anyone have an idea why we would be incurring more transfer costs?
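
For reference, the per-zone shard counts mentioned above can be checked with a query along these lines against sys.shards (just a sketch; it assumes the node object column is available in v3.3, and the zone is visible because our node names embed it):

-- Count shards per node; with allocation awareness, each zone should end up with an equal share
SELECT node['name'] AS node_name, count(*) AS num_shards
FROM sys.shards
GROUP BY node['name']
ORDER BY 1;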

Hi @AncilHumph,

Do you observe the increased inter-zone traffic on an ongoing basis? When adding new nodes, CrateDB starts rebalancing shards. That rebalancing is throttled, so it may take several hours to complete (depending on your data volume).

Are there any rebalancing operations still in progress (SELECT * FROM sys.allocations WHERE current_state <> 'STARTED')?
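
For example, something along these lines (the column list is only a suggestion) should show any shards that are still initializing or relocating:

-- Shards that have not yet reached the STARTED state, i.e. still being recovered or relocated
SELECT table_schema, table_name, shard_id, node_id, current_state
FROM sys.allocations
WHERE current_state <> 'STARTED'
ORDER BY table_schema, table_name, shard_id;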

The rebalancing took about 1-2 days to complete, so we would not have been surprised to see increased charges for that time. However, those charges have continued long after (now for a couple of months).

How do your tools and applications connect to CrateDB? Do you use an AWS Load Balancer? If yes, any chance it could be related to a setting on the load balancer, such as cross-zone load balancing?

Besides spreading replicas across availability zones, CrateDB doesn’t consider the availability zone when executing queries. Maybe for some reason, there is now generally more node-to-node traffic happening. I don’t know if there are any particularities in CrateDB 3 in that regard. Do you plan to continue with the upgrade? It would be interesting to see if the behavior is still present in a recent CrateDB version.

Our applications connect using a JDBC driver, and yes, we use a Load Balancer. I don’t think there would be any settings that factor in, particularly since we had a 3-zone setup both before and after the move to 9 nodes.

You mentioned that CrateDB doesn’t consider availability zones when executing queries, but doesn’t it state here (CrateDB multi-zone setup - CrateDB: Guide) that:

When querying data, all data should only be collected from shards that are inside the same zone as the initial request.

We do plan to upgrade to v4, and that is how this issue came up. We have 21 TB of data on our Production system and needed to add nodes both because the data was getting so large and to accommodate re-creating our largest table (which holds 14 TB of that data). We’ve been trying to upgrade to v4 for several years now, but have struggled to re-create such large (and live) tables.

Hi @AncilHumph,

I checked the paragraph about query execution from our documentation with the team, and you are right, this is indeed how CrateDB processes queries.
The remaining inter-zone traffic is caused by DML operations (INSERT/UPDATE/DELETE), as they are always routed to the primary shard first and then to all replica shards. So there will be traffic from the query handler node (the node that initially receives the query) to the node holding the primary shard, which may be in a different zone. And then all replicas will naturally be in different zones.
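
To see where that write path goes, a query along these lines against sys.shards lists which node holds the primary copy of each shard and which nodes hold the replicas ('my_table' is just a placeholder for one of your tables):

-- For each shard copy of the table, show whether it is the primary and which node holds it
SELECT schema_name, table_name, id AS shard_id, "primary", node['name'] AS node_name
FROM sys.shards
WHERE table_name = 'my_table'
ORDER BY id, "primary" DESC;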

Can you share the complete crate.yml of all nodes, please?

Here are the contents of the crate.yml file:

cluster.name: OE-CRATE-PROD
node.name: "openeye-prod-crate-a | v3.3.5-37 | us-west-2a | i-0305058a36107065c"
node.attr.zone: us-west-2a
license.enterprise: false
cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.awareness.force.zone.values: us-west-2a,us-west-2b,us-west-2c
cluster.routing.rebalance.enable: replicas
gateway.expected_nodes: 9
gateway.recover_after_nodes: 5
discovery.zen.minimum_master_nodes: 5
discovery.zen.hosts_provider: ec2
discovery.zen.ping_timeout: 60s
discovery.ec2.groups: openeye-crate-prod,
udc.enabled: false
network.host: local,site
cluster.routing.allocation.node_concurrent_recoveries: 8
bootstrap.memory_lock: true
indices.memory.index_buffer_size: 20%
indices.recovery.max_bytes_per_sec: 250mb
stats.jobs_log_persistent_filter: classification['type'] in ('INSERT','UPDATE','DELETE') and stmt like '%quota_tracking%' and stmt not like '%%1%'

You’ll notice that discovery.ec2.groups has an unnecessary comma at the end. We noticed that recently and are trying to determine if that is causing the issue.

Hi @AncilHumph,

Your configuration appears to be okay regarding the allocation awareness settings. While trying to reproduce your setup, I noticed that allocation awareness settings currently aren’t exposed in sys.cluster, so it is hard to validate whether they were applied successfully.
I have raised an issue with our development team: Allocation awareness settings not shown in `sys.cluster` · Issue #18047 · crate/crate · GitHub. Maybe it is only a problem with displaying the values, and they are applied correctly. But maybe there is also more to it.
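
For reference, the cluster-wide settings that are exposed can be read from the settings object in sys.cluster; per the issue above, the awareness values currently don’t show up in that object:

-- Dump the exposed cluster-wide runtime settings and look for the routing allocation section
SELECT settings FROM sys.cluster;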

To go further into the details, a more recent CrateDB version will be very helpful. To get a sense of how much CrateDB has changed: between versions 3.3.5 and 5.10.9, there have been 860k lines of code added and 330k lines deleted. That’s a lot, and if there was a bug in 3.3.5, it may have been fixed already. Any potential bug fix would also only be released on the 5.10 branch.

Are you able to proceed with the upgrade and check if the issue persists on later versions? Maybe it is also a temporary issue that disappears after restarting nodes during the upgrade.

We hope to upgrade the cluster to v4.8 soon, but as I think I mentioned, we’ve been trying to do this for several years. The requirement of re-creating tables in order to upgrade has been very problematic. One table in particular has billions of records, and we add about 1,000 more to it every second. So we can’t take the table offline for many hours or days (it would take about 2-3 weeks to re-create the table). Our business only allows for about 15 minutes of downtime.

That being said, we do have some other test environments with those same allocation settings that are on v4.8, and the settings are not exposed there either. We’ve noticed over the years that there are quite a few other settings that aren’t exposed (such as discovery.ec2).

If you can figure out a way for us to verify any of these settings, that would be great.
