Cluster discovery breaks after enabling host based auth in Kubernetes on GCP

Hello, we are deploying a self-managed CrateDB cluster on top of our Kubernetes cluster on GKE/GCP. It works well in general, and now that we’ve finished our proof of concept, we’re moving on to making it production ready.

In setting up security, we’re trying to enable Host Based Auth. We want to lock down the “crate” user so that it is only accessible from a single address: the DNS name of a headless service attached to a Job. That Job creates the service accounts for the various things that need to talk to Crate, as well as a password-protected superuser which can only log in from trusted networks.

We have this all working, with one exception. Once we add a custom config file in order to enable host based auth, each node no longer discovers its cluster peers (so we end up with 8 individual nodes, each in its own cluster). If I disable the config file, by not mounting it and not specifying the config path, the nodes discover the cluster fine.

We have tried to implement node to node auth by adding a “transport” protocol entry, with “trust” method, but that had no effect.
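For illustration, a transport-only trust entry along these lines is the kind of thing meant here. This is a sketch rather than the exact file: the order key 0 is arbitrary, and it assumes the protocol and method keys as described on the HBA documentation page.

```yaml
auth:
  host_based:
    enabled: true
    config:
      # Order key "0" is illustrative; entries are matched in key order.
      0:
        # Apply this entry to node-to-node (transport) connections only.
        protocol: transport
        # Accept the connection without further authentication.
        method: trust
```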

We have tried a variety of options in the HBA config, none of which seem to work. It’s almost as if adding ANY config there breaks cluster discovery. We have literally added a default “trust” method entry and nothing else, as in:

auth:
  host_based:
    enabled: true
    config:
      a:
        method: trust

And it still doesn’t work.

I’m starting crate with the following settings on the CLI (configured in my statefulset):

          args:
            - '-Cnode.name=${POD_NAME}'
            - '-Ccluster.name=${CLUSTER_NAME}'
            - >-
              -Ccluster.initial_master_nodes=${POD_BASENAME}-0,${POD_BASENAME}-1,${POD_BASENAME}-2
            - '-Cdiscovery.seed_providers=srv'
            - >-
              -Cdiscovery.srv.query=_crate-internal._tcp.${INTERNAL_SERVICE_NAME}.${NAMESPACE}.svc.cluster.local
            - '-Cgateway.recover_after_data_nodes=5'
            - '-Cgateway.expected_data_nodes=${EXPECTED_NODES}'
            - '-Cpath.data=/data'
            - '-Cpath.conf=/config'

And yes all those environment variables are set correctly (as I said this config works perfectly if I only change the HBA to be disabled).

It’s worth noting here, that we’re providing the config by specifying the -Cpath.conf=/config as shown above, and then mounting /config via a configmap in kubernetes. The configmap has a log4j config, as well as crate.yml file in it, which currently ONLY specifies the HBA configs (as shown above).

If I remove that flag, so that it uses the default config path, then the cluster discovery works perfectly fine.

I’ve tried changing the order of the CLI flags, so that the config path is specified first, then the other config flags (thinking maybe the config file is overriding the cli flags) but that had no effect (as I would expect).

And it’s also worth noting that during the bootup logs, I see that it’s aware of and acting on the config flags set on the CLI.

The log just ends up spamming:
Using dynamic nodes [ip of this node:4300]
interspersed with periodic:
master not discovered or elected yet

I’ve read through the page on HBA and its linked pages, but found nothing that helps.

Note that the example on the HBA page only covers node-to-node communication for a multi-zonal cluster, which I’m not running in this case. I still tried copying that config and changing it to use the trust method instead of cert, and got nowhere.

What am I missing to get cluster discovery working with HBA enabled?
Is there some additional critical setting that needs to go into the crate.yml file for this to work?

Any suggestions/help would be greatly appreciated.

Thanks!

If you provide a config file in addition to the settings in the arguments, you might override the default configuration values of the Docker container. Ensure that you include the network.host setting.
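As a rough sketch, a crate.yml mounted at /config could carry network.host alongside the HBA block. The value `_local_,_site_` is an assumption on my part (I believe it matches what the Docker image's bundled config sets, but please check the image you run), and the HBA entry is simply the minimal one from the question. Without any network.host setting, CrateDB binds to the loopback address by default, which would be consistent with each node only ever seeing itself.

```yaml
# Assumed value: bind to loopback and site-local addresses so that
# other pods can reach the transport port (4300).
network.host: _local_,_site_

auth:
  host_based:
    enabled: true
    config:
      # Minimal catch-all entry taken from the question.
      a:
        method: trust
```

The same setting could presumably also be passed as an extra -Cnetwork.host=... argument in the StatefulSet instead of going into the file.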

Could you share the complete StatefulSet (sts)? Also, could you copy the logs of one of the nodes?