Adaptive partitioning

CREATE TABLE projects_posts (
    project_id VARCHAR,
    labels ARRAY (VARCHAR),
    sentiment VARCHAR CHECK (sentiment IN ('negative', 'positive', 'neutral')),
    favorited_at TIMESTAMP WITH TIME ZONE,
    hidden_at TIMESTAMP WITH TIME ZONE,
    post_id VARCHAR,
    channel VARCHAR,
    external_id VARCHAR,
    published_at TIMESTAMP WITH TIME ZONE,
    created_at TIMESTAMP WITH TIME ZONE,
    url VARCHAR NULL,
    title VARCHAR,
    content TEXT,
    transcript ARRAY (
        OBJECT AS (
            start DOUBLE PRECISION,
            text TEXT
        )
    ) NULL,
    media ARRAY (OBJECT (DYNAMIC)) NULL,
    language VARCHAR,
    hashtags ARRAY (VARCHAR) NULL,
    comment_count BIGINT,
    like_count BIGINT,
    view_count BIGINT,
    parent_external_id VARCHAR NULL,
    extras OBJECT(DYNAMIC) NULL,
    source OBJECT AS (
        id VARCHAR,
        description TEXT,
        name VARCHAR,
        url VARCHAR,
        domain VARCHAR,
        favicon VARCHAR,
        score DOUBLE PRECISION,
        monthly_traffic BIGINT,
        country_id VARCHAR,
        state_id VARCHAR,
        city_id VARCHAR,
        location VARCHAR
    ) NULL,
    profile OBJECT AS (
        id VARCHAR,
        avatar VARCHAR,
        name VARCHAR,
        username VARCHAR,
        description TEXT,
        url VARCHAR,
        links ARRAY(VARCHAR),
        location VARCHAR,
        joined_at TIMESTAMP,
        verification_status VARCHAR,
        gender VARCHAR,
        followers BIGINT,
        following BIGINT,
        post_count INTEGER
    ) NULL,
    INDEX keyword_ft USING FULLTEXT(title, content, transcript['text']) 
        WITH (analyzer = 'ilike_analyzer'),
    INDEX search_ft USING FULLTEXT(
        source['name'],
        source['url'],
        profile['name'],
        profile['username'],
        profile['url']
    ) WITH (analyzer = 'ilike_analyzer'),
    CHECK (source['location'] IN ('local', 'foreign')),
    CHECK (profile['verification_status'] IN ('unknown', 'unverified', 'verified')),
    CHECK (profile['gender'] IN ('male', 'female', 'unknown')),
    PRIMARY KEY (project_id, post_id)
) WITH (codec = 'best_compression');

Can we apply partitioning only when the number of posts under a project exceeds a certain threshold? I don’t want to partition all projects, because the number of posts varies drastically between them. Is there an article that covers this?

This kind of partitioning would effectively apply multitenancy for projects with a huge number of posts, while leaving the smaller ones together in the same partition.


CrateDB has no direct support for adaptive partitioning, but one could achieve something similar by calculating the partition key from an incrementing row id.

E.g. assuming the given `post_id` is an auto-incrementing integral value, the following partition key definition would create a new partition for every 1000 post entries:

CREATE TABLE t1 (project_id INT, post_id INT, p_key AS ceil(post_id / 1000)) PARTITIONED BY (p_key);

INSERT INTO t1 (project_id, post_id) VALUES (1, 1), (1, 1000), (1, 1001), (1, 2000);

SELECT values FROM information_schema.table_partitions WHERE table_name = 't1' ORDER BY partition_ident;
+--------------+
| values       |
+--------------+
| {"p_key": 0} |
| {"p_key": 1} |
| {"p_key": 2} |
+--------------+ 
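Building on that, the adaptive part — splitting out only the large projects — would have to happen on the client side, since a generated column cannot depend on aggregate state such as a project's total post count. A minimal client-side sketch of the idea (the function name, bucket size, and threshold are illustrative assumptions, not CrateDB features):

```python
# Route posts of "large" projects into their own rolling partitions,
# while all small projects share a single default partition.

BUCKET_SIZE = 1000         # posts per partition bucket (assumption)
LARGE_THRESHOLD = 100_000  # projects above this get dedicated partitions (assumption)

def partition_key(project_id: str, project_post_count: int, post_seq: int) -> str:
    """Compute the value the client would write into the PARTITIONED BY column."""
    if project_post_count < LARGE_THRESHOLD:
        # small projects all land in one shared partition
        return "shared"
    # large projects: one partition per BUCKET_SIZE posts, namespaced by project,
    # using the same floor-division bucketing as the SQL example above
    return f"{project_id}:{post_seq // BUCKET_SIZE}"
```

The client tracks the per-project post count (e.g. via a `SELECT count(*)` or a cached counter) and passes the computed key along with each insert.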

Due to its distributed nature, CrateDB does not support auto-incremented numbers itself, but providing one from the client, even though this may result in collisions, could be acceptable for this use case.
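A naive client-side sequence could look like the sketch below (hypothetical code; it is only unique within a single process, so multiple clients would still need coordination or collision-tolerant ids):

```python
import threading

class ClientSequence:
    """Thread-safe in-process counter for generating incrementing post ids.

    Note: values are unique only within this process; two independent
    clients can still hand out the same number.
    """

    def __init__(self, start: int = 0) -> None:
        self._lock = threading.Lock()
        self._value = start

    def next(self) -> int:
        with self._lock:
            self._value += 1
            return self._value
```

In practice the `start` value could be seeded from `SELECT max(post_id)` at client startup to continue an existing sequence.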

This is just an idea; I’m not sure whether it will work for your use case.
