Unexpected Behavior with Dynamic OBJECT Columns in Large Tables (CrateDB 6.0.3)

I’m running a CrateDB cluster (version 6.0.3) with 4 nodes, hosting several tables, some of which are 25–200 GB in size. The core CREATE TABLE statement for one of these tables is as follows:

CREATE TABLE IF NOT EXISTS "doc"."foobar" (
   "ts" TIMESTAMP WITHOUT TIME ZONE DEFAULT current_timestamp(3) NOT NULL,
   "measurement" TEXT NOT NULL,
   "tags" OBJECT(DYNAMIC),
   "fields" OBJECT(DYNAMIC),
   "partition_field" TIMESTAMP WITHOUT TIME ZONE GENERATED ALWAYS AS date_trunc('month', "ts")
)
CLUSTERED INTO 4 SHARDS
PARTITIONED BY ("partition_field")

Over time, new fields are added to the tags and fields objects via the HTTP API, primarily using bulk requests. However, I’ve encountered three unusual and non-deterministic issues:


1. Random Key Generation in OBJECT Columns

Occasionally, new keys are spontaneously created in the tags or fields objects. For example, if keys like "host" and "port" exist, a new key like "hort" might appear. The value for these keys is either derived from an existing key or empty. This behavior is inconsistent and cannot be reproduced intentionally.


2. Cross-Contamination Between tags and fields Objects

Sometimes, values intended for one object (e.g., tags) end up in the other (fields), or vice versa. This can also result in the key-generation issue described above.


3. Systematic Value Corruption for Specific Keys

The most puzzling issue involves the tags['host'] key, which typically contains an IP address (e.g., 10.2.251.112). After some operations, the value is altered in a specific pattern:

  • Example: 10.2.251.112 becomes 10.51.11.
  • The digit "2" is consistently removed from the value. If a "2" appears alone between two dots (e.g., in 10.2.11), the entire octet is dropped, including the dot.

This behavior is always tied to the digit "2" and occurs only in large tables (>20 GB). Smaller tables do not exhibit these issues.


Insert Query

The bulk-insert query is straightforward:

INSERT INTO "doc"."foobar" ("ts", "measurement", "tags", "fields") VALUES (?, ?, ?, ?)

Observations and Questions

  • The issues occur exclusively in large tables (>20 GB).
  • A PHP API client is used for all operations.
  • The insert logic is unlikely to be the root cause, as 99% of operations succeed without issues.

Has anyone encountered similar behavior? Are there known issues with dynamic OBJECT columns in CrateDB 6.0.3, particularly with large tables? Could this be related to partitioning, sharding, or bulk operations?

Any insights or suggestions would be greatly appreciated!


I’m not aware of any known issue like you describe.

Besides the table definition you shared, can you confirm that you didn't modify any table option defaults?

The host → hort case is suspicious, since the code points are adjacent: r is 114 and s is 115. Could you verify that your app logic cannot in any way create a hort key? E.g., if this data comes from logs, can you grep for the value hort?

If you can, I'd recommend logging your insert statements before sending them to CrateDB; that way we can verify it's not a client-side issue.
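As a rough illustration of that kind of client-side logging (a sketch only; the function name and the sample row values are hypothetical, and the statement is the one from this thread):

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("crate-ingest")

def log_bulk_request(stmt: str, bulk_args: list) -> str:
    """Serialize the exact payload that would be POSTed to /_sql,
    log it, and return it so it can later be compared with what
    CrateDB actually stored."""
    payload = json.dumps({"stmt": stmt, "bulk_args": bulk_args}, sort_keys=True)
    log.info("bulk insert payload: %s", payload)
    return payload

# Usage with one sample row (values are made up for illustration):
payload = log_bulk_request(
    'INSERT INTO "doc"."foobar" ("ts", "measurement", "tags", "fields") '
    'VALUES (?, ?, ?, ?)',
    [["2024-01-01T00:00:00Z", "cpu", {"host": "10.2.251.112"}, {"load": "0.5"}]],
)
```

Comparing these logged payloads against what lands in the table should tell us on which side of the wire the keys and values change.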


Thank you for your response. I will now log all raw queries and then we’ll see if the issue is on the client side. The example with “hort” is really just one case among various other variations. I’ll get back to you as soon as I find something. Thanks for working with me to investigate this bug.
Best regards,
Volker


Additionally, I created a small test to try to reproduce this:

  • 3 x 6.0.3 nodes
  • Your table definition
  • This ingestion script:
import random
import time
import uuid

import requests

query = """
        INSERT INTO "doc"."foobar" ("measurement", "tags", "fields")
        VALUES (?, ?, ?)
"""

def generate_data(n: int):
    data = []
    for _ in range(n):
        # Build a fresh dict per row; reusing one shared dict would make
        # every row in a batch reflect its last mutation at serialization time.
        tags = {'host': '10.2.251.112'}
        if random.randint(0, 1):
            tags['port'] = str(random.randint(0, 1000))
        data.append((str(uuid.uuid4()), tags, {'some_field': '234'}))
    return data


def send_bulk_data(query, data):
    response = requests.post('http://192.168.88.251:4200/_sql',
                             json={'bulk_args': data, 'stmt': query})
    return response.text


c = 0
n = 50_000
now = time.time()

while True:
    try:
        send_bulk_data(query, generate_data(n))
        c += n
    except KeyboardInterrupt:
        print(f'After {time.time() - now}s sent {c} rows')
        break

It's now sitting at 200M records (~24 GiB), and I don't see any schema drift, nor the host value being different.

SHOW CREATE TABLE FOOBAR

CREATE TABLE IF NOT EXISTS "doc"."foobar" (
   "ts" TIMESTAMP WITHOUT TIME ZONE DEFAULT current_timestamp(3) NOT NULL,
   "measurement" TEXT NOT NULL,
   "tags" OBJECT(DYNAMIC) AS (
      "host" TEXT,
      "port" TEXT
   ),
   "fields" OBJECT(DYNAMIC) AS (
      "some_field" TEXT
   ),
   "partition_field" TIMESTAMP WITHOUT TIME ZONE GENERATED ALWAYS AS date_trunc('month', "ts")
)
CLUSTERED INTO 4 SHARDS
PARTITIONED BY ("partition_field")
WITH (
   column_policy = 'strict',
   number_of_replicas = '0-1'
)

Is there anything you'd like me to change about the test I'm running to better match your case?


Additionally, if you find any other schema drift (new columns you were not expecting), please share them.

On another note, to help determine whether it's the client or the database, you could also make the tags/fields objects NOT dynamic, so that any new column raises an error you can log. This might not be feasible if you are in production. Alternatively, you could create a new table with the same schema but with strict objects, point your ingestion code to that new table, and later, if needed, reconcile the data by ingesting from the new table into the old one. But please only do this if you are 100% sure about what you are doing; otherwise, logging on the client side is likely the safest option.
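For illustration, a strict variant of the table might look like the following (a sketch only; the sub-column lists are assumptions based on the keys mentioned in this thread, and the table name foobar_strict is hypothetical):

```sql
-- Sketch: same schema, but with STRICT objects so any unexpected key
-- is rejected with an error instead of being added silently.
CREATE TABLE IF NOT EXISTS "doc"."foobar_strict" (
   "ts" TIMESTAMP WITHOUT TIME ZONE DEFAULT current_timestamp(3) NOT NULL,
   "measurement" TEXT NOT NULL,
   "tags" OBJECT(STRICT) AS (
      "host" TEXT,
      "port" TEXT
   ),
   "fields" OBJECT(STRICT) AS (
      "some_field" TEXT
   ),
   "partition_field" TIMESTAMP WITHOUT TIME ZONE GENERATED ALWAYS AS date_trunc('month', "ts")
)
CLUSTERED INTO 4 SHARDS
PARTITIONED BY ("partition_field");
```

With this in place, any insert carrying an unknown key (a spurious "hort", say) fails loudly and can be logged at the client.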


Hi,

That's very interesting. Your observations sound as if hallucinations were involved, or possibly just faulty memory? Are you using ECC memory in your server to avoid soft errors like single-event upsets?

With kind regards,
Andreas.

The CrateDB nodes (4 in total) are VMs (Proxmox, KVM) running on a server with ECC RAM.


That’s good to hear, thank you for sharing this information. I hope we can figure out the problem on your end, based on logging data like @surister suggested, and possibly the demo program he shared above.

Wow, this behaviour certainly has the potential to play in the same league as the famous 500-mile email story.

https://web.mit.edu/jemorris/humor/500-miles

Disclaimer: Please don't take my messages as too snarky; we are certainly reading your observations seriously. However, please understand that your report is truly unique so far, so I am trying to apply my personal share of pattern matching to get a better understanding, which may occasionally include a share of humor. It is certainly meant in the spirit that you are not alone with such observations on other systems.

Seconding Iván, I’d like to confirm we have never heard about such an issue with CrateDB, not even remotely. While every software system occasionally includes bugs, we never observed any data corruption issues like those you are describing. We are curious to learn what might be the root cause for the problem you are observing.

Could you share more insight into the client-side software stack you are running, or can you come up with a solid reproducer, possibly based on Iván's demo program?

This is my CrateDB PHP implementation for IoT stuff: GitHub - skydiablo/ReactCrate

Maybe you can spot an issue?


I'm not very good at PHP, but from what I understand, whatever gets into a Measurement or MeasurementBulk is what ends up arriving at CrateDB; fields and tags just wrap whatever dynamic object they receive from the clients, so a faulty device could be sending bad data.

I’ve worked in the past with IoT devices and we had instances where a faulty device would send us into a downward spiral making us think that the backend/database didn’t work as expected. In another instance it was a development device in the lab that shouldn’t have been connected to the production backend.

In any case, I doubt this is related to CrateDB, we have tons of users with similar and bigger clusters/data sizes and so far we haven’t got a remotely similar report.

I assume this is your situation:

d1  --->
d2  --->    PHP_BACKEND ---> CrateDB cluster 
dn  --->

Assuming it's a faulty device: what I would do is implement proper validation in your backend. At first it could be something quick and simple, like sanity-checking the keys in fields and tags: if a key from a known list should go to tags but appears in fields, reject and log the request, with the objective of finding who's at fault.
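A minimal sketch of that kind of sanity check (the key sets and the function name are assumptions; fill in your real known keys):

```python
# Hypothetical allow-lists based on the keys mentioned in this thread.
TAG_KEYS = {"host", "port"}
FIELD_KEYS = {"some_field"}

def validate_row(tags: dict, fields: dict) -> list:
    """Return a list of human-readable problems; an empty list means
    the row's keys look plausible and it can be forwarded to CrateDB."""
    problems = []
    for key in tags:
        if key in FIELD_KEYS:
            problems.append(f"key {key!r} belongs in fields but appeared in tags")
        elif key not in TAG_KEYS:
            problems.append(f"unknown tags key {key!r}")
    for key in fields:
        if key in TAG_KEYS:
            problems.append(f"key {key!r} belongs in tags but appeared in fields")
        elif key not in FIELD_KEYS:
            problems.append(f"unknown fields key {key!r}")
    return problems
```

Rows that come back with problems would be logged and dropped (or quarantined) instead of inserted, which pins down the misbehaving sender.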

Let me know if any of this makes sense in your use case.

One bad payload can change the shape of a column (object → scalar, null, etc.), and bulk inserts won't complain. You only notice later, when queries act strangely on big tables or partitions.

Look for:

  • different types under the same object path
  • one bad IoT device sending junk
  • no validation on the backend

Try putting the data back into a STRICT OBJECT column, or running typeof() checks across partitions. Every time I saw this, it was ingestion, not CrateDB.
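A per-partition type check along those lines might look like this (a sketch only; it assumes a typeof()-style scalar is available — CrateDB ships pg_typeof() for PostgreSQL compatibility — and uses the column names from this thread):

```sql
-- Sketch: count the types seen under tags['host'] per partition,
-- to spot partitions where the column's type diverges.
SELECT "partition_field",
       pg_typeof("tags"['host']) AS host_type,
       count(*) AS num_rows
FROM "doc"."foobar"
GROUP BY "partition_field", pg_typeof("tags"['host'])
ORDER BY "partition_field";
```

If every partition reports the same single type, type drift can be ruled out and attention shifts back to the values themselves.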

But even if the data types stay as inconsistent as they like, at worst the INSERT would fail. Keys in the object still shouldn't change or be "made up" in the process, right?

Initial checks for schema drift have already confirmed that the data is being sent to the database correctly. Notably, the data that most often shows errors (like tags['host']) comes from a config value that isn't influenced by the IoT device at all.