Adaptive partitioning

CREATE TABLE projects_posts (
    project_id VARCHAR,
    labels ARRAY (VARCHAR),
    sentiment VARCHAR CHECK (sentiment IN ('negative', 'positive', 'neutral')),
    favorited_at TIMESTAMP WITH TIME ZONE,
    hidden_at TIMESTAMP WITH TIME ZONE,
    post_id VARCHAR,
    channel VARCHAR,
    external_id VARCHAR,
    published_at TIMESTAMP WITH TIME ZONE,
    created_at TIMESTAMP WITH TIME ZONE,
    url VARCHAR NULL,
    title VARCHAR,
    content TEXT,
    transcript ARRAY (
        OBJECT AS (
            start DOUBLE PRECISION,
            text TEXT
        )
    ) NULL,
    media ARRAY (OBJECT (DYNAMIC)) NULL,
    language VARCHAR,
    hashtags ARRAY (VARCHAR) NULL,
    comment_count BIGINT,
    like_count BIGINT,
    view_count BIGINT,
    parent_external_id VARCHAR NULL,
    extras OBJECT(DYNAMIC) NULL,
    source OBJECT AS (
        id VARCHAR,
        description TEXT,
        name VARCHAR,
        url VARCHAR,
        domain VARCHAR,
        favicon VARCHAR,
        score DOUBLE PRECISION,
        monthly_traffic BIGINT,
        country_id VARCHAR,
        state_id VARCHAR,
        city_id VARCHAR,
        location VARCHAR
    ) NULL,
    profile OBJECT AS (
        id VARCHAR,
        avatar VARCHAR,
        name VARCHAR,
        username VARCHAR,
        description TEXT,
        url VARCHAR,
        links ARRAY(VARCHAR),
        location VARCHAR,
        joined_at TIMESTAMP,
        verification_status VARCHAR,
        gender VARCHAR,
        followers BIGINT,
        following BIGINT,
        post_count INTEGER
    ) NULL,
    INDEX keyword_ft USING FULLTEXT(title, content, transcript['text']) 
        WITH (analyzer = 'ilike_analyzer'),
    INDEX search_ft USING FULLTEXT(
        source['name'],
        source['url'],
        profile['name'],
        profile['username'],
        profile['url']
    ) WITH (analyzer = 'ilike_analyzer'),
    CHECK (source['location'] IN ('local', 'foreign')),
    CHECK (profile['verification_status'] IN ('unknown', 'unverified', 'verified')),
    CHECK (profile['gender'] IN ('male', 'female', 'unknown')),
    PRIMARY KEY (project_id, post_id)
) WITH (codec = 'best_compression');

Can we apply partitioning only when the number of posts under a project exceeds a certain threshold? I don’t want to partition all projects, because the number of posts varies drastically between them. Is there an article covering this?

This kind of partitioning would effectively apply multitenancy for projects with a huge number of posts, while leaving the smaller ones together in a single partition.

CrateDB has no direct support for adaptive partitioning, but one could achieve this by calculating the partition key from an incrementing row id.

E.g., let’s assume the given `post_id` is an auto-incremented integer value; the following partition key definition would then create a new partition for every 1000 posts.

CREATE TABLE t1 (project_id int, post_id int, parted_key as ceil(post_id/1000)) PARTITIONED BY (parted_key);

INSERT INTO t1 (project_id, post_id) VALUES (1, 1), (1, 1000), (1, 1001), (1, 2000);

SELECT values FROM information_schema.table_partitions WHERE table_name = 't1' ORDER BY partition_ident;
+-------------------+
| values            |
+-------------------+
| {"parted_key": 0} |
| {"parted_key": 1} |
| {"parted_key": 2} |
+-------------------+
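To see why the four inserted rows end up in three partitions, here is a small sketch (in Python, not CrateDB) that replicates the generated-column expression client-side. Since both operands of `post_id/1000` are integers, CrateDB performs integer division before `ceil` is applied, so for positive ids the expression is equivalent to `post_id // 1000`:

```python
def partition_key(post_id: int, bucket_size: int = 1000) -> int:
    """Mimic ceil(post_id / bucket_size) with SQL integer division.

    Integer division truncates first, so ceil() sees a whole number
    and the result equals floor division for positive post_ids.
    """
    return post_id // bucket_size

# The four inserted post_ids map onto the three partitions shown above:
# 1 -> 0, 1000 -> 1, 1001 -> 1, 2000 -> 2
keys = [partition_key(p) for p in (1, 1000, 1001, 2000)]
print(keys)  # [0, 1, 1, 2]
```

Being able to compute the key client-side can be handy, e.g. to predict which partition a given row will land in, or to target a partition in a query filter.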

Due to its distributed nature, CrateDB does not support auto-incremented numbers itself, but providing one from the client could be acceptable for this use case, even though it may result in collisions.
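As one hypothetical sketch of such a client-side id, a Snowflake-style scheme combines a millisecond timestamp with a per-process counter. Ids from a single process are strictly increasing; ids from independent clients can still collide, which, as noted above, may be acceptable here (everything below is an illustration, not a CrateDB feature):

```python
import itertools
import threading
import time

class PostIdGenerator:
    """Roughly monotonic 64-bit-ish ids: (millis << 20) | counter.

    The low 20 bits give ~1M counter slots per millisecond for one
    process; collisions across separate clients remain possible.
    """

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counter = itertools.count()

    def next_id(self) -> int:
        with self._lock:
            return (int(time.time() * 1000) << 20) | (next(self._counter) & 0xFFFFF)

gen = PostIdGenerator()
a, b = gen.next_id(), gen.next_id()
assert a < b  # strictly increasing within one process
```

Note that ids generated this way are sparse rather than dense, so a `ceil(post_id/1000)` partition key would then bucket posts by time ranges instead of exact 1000-post groups; for time-based bucketing you would size the divisor accordingly.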

This is just an idea; I’m not sure whether it will work for your use case.