Upload of wide CSV (200'000 columns)

Hi,
I’m trying to upload a dummy CSV of size 1000 rows x 200’000 columns.
The process has been stuck for about two hours. Any idea what the issue is?

Hi Pierre,

Unfortunately the process crashed due to the extremely wide file. We will look into whether we can improve this, but for now this is unlikely to work via the web interface. Have you tried a direct COPY FROM statement, which can also import CSVs?
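
Roughly along these lines (an untested sketch; the path is a placeholder, it must be reachable from the CrateDB node, and the target table has to exist beforehand):

COPY wide_table FROM 'file:///path/to/wide_data.csv' WITH (format = 'csv');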

Regards
Roman

Hi Roman,

Thanks for the suggestion. I’m now running the CREATE, COPY, and SELECT queries below.
The process is still stuck after 30 minutes.

Queries

N_COLS = 200000

expr = "(" + ",".join([f"col_{i} int" for i in range(0,N_COLS)]) + ")"
query1 = f""" CREATE TABLE wide_table {expr} WITH ("mapping.total_fields.limit" = 200000)"""
# No error

query2 = "COPY wide_table FROM 'file:///Users/pierre/wide_data.csv' "
# No error

query3 = "SELECT * from wide_table LIMIT 1"
# Keeps running >30'

Execution

from crate import client

# Connection details elided; host is a placeholder
conn = client.connect("https://<instance-host>:4200", username="...", password="...")

with conn:
    cursor = conn.cursor()
    cursor.execute(query1)  # or query2 or query3
    result = cursor.fetchone()  # fetching a row is only meaningful for the SELECT (query3)
    print(result)

Thanks for your support.

Hi @Pierre

It appears that you’re attempting to create a table with 200,000 indexed integer columns. This approach would indeed result in an excessive number of files and performance degradation, especially for a relatively modest dataset size.

While CrateDB is designed to handle tables with many columns, creating tables with 1,000 or more indexed columns can lead to inefficiencies. To improve performance and keep the table manageable, consider storing your data in a nested OBJECT(IGNORED) column, which effectively means storing it as JSON. This lets you sidestep the limitations of having that many indexed columns. Alternatively, you can disable indexing and/or the columnar storage for specific columns using the clause col_1 int INDEX OFF STORAGE WITH (columnstore = false).
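
For illustration, a rough sketch of both variants (table names are placeholders):

-- Variant 1: store the whole row in a single unindexed, schemaless object
CREATE TABLE wide_table_obj (
    payload OBJECT(IGNORED)
);

-- Variant 2: keep explicit columns, but turn off indexing and the column store
CREATE TABLE wide_table_plain (
    col_0 int INDEX OFF STORAGE WITH (columnstore = false),
    col_1 int INDEX OFF STORAGE WITH (columnstore = false)
    -- ...and so on for the remaining columns
);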

Besides storing the row data as-is (and indexing each value by default), each value is also written to a column store by default. The column store greatly speeds up global aggregations and groupings, and enables efficient ordering, because all the data for one column is packed in one place.
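
As a concrete example, the column store is what makes queries like the following efficient (a sketch against your original table):

SELECT col_1, avg(col_2)
FROM wide_table
GROUP BY col_1
ORDER BY col_1;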

Do you see any log messages indicating issues in the CrateDB logs?
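
You can also check from SQL whether the statement is still running, for example via the sys.jobs table, which lists the currently active jobs:

SELECT id, started, stmt FROM sys.jobs;

If needed, a hanging job can be cancelled with KILL '<job_id>', using the id taken from sys.jobs.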

Thanks. I guess we have to discuss the data storage options separately, but my goal is precisely to run aggregation queries on all of those columns, which naturally steers me away from JSON-style storage.

I don’t even see an entry in the Audit Logs of the CrateDB Cloud console. Please let me know where I can find the logs. These are my instance details: https://amber-tarfful.aks1.eastus2.azure.cratedb.net:4200

Now the same query SELECT * FROM wide_table LIMIT 1 returns a result (after 20-30 minutes): None

Was it aborted somehow?

Thanks for your support.