What would be a good practice for requesting large amounts of data from CrateDB? I have a use case where I need to fetch all the data available in a table, and I wonder if there is a way to request all of it without saturating the heap.
What client are you using?
Many PostgreSQL client libraries support a fetch size (e.g. asyncpg or JDBC).
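For example, with asyncpg you can iterate over a server-side cursor and control how many rows are prefetched per round trip. A rough sketch (connection parameters and the table name are placeholders; under the hood this declares a cursor on the server):

import asyncio
import asyncpg

async def fetch_all_rows():
    # connection parameters and table name are placeholders
    conn = await asyncpg.connect(host="localhost", port=5432, user="crate", database="doc")
    try:
        async with conn.transaction():
            # prefetch limits how many rows are pulled from the server per round trip,
            # so only a small window of rows is kept in memory at any time
            async for record in conn.cursor("SELECT * FROM my_table", prefetch=1000):
                print(record)  # process one row at a time
    finally:
        await conn.close()

asyncio.run(fetch_all_rows())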
Do you just want to fetch a complete table, or the result set of a query?
I need to fetch a complete table. I took a look at asyncpg, but I'm getting this error (the same error happened when trying SQLAlchemy and psycopg2):
asyncpg.exceptions.InternalServerError: line 1:1: mismatched input 'WITH' expecting {'SELECT', 'DEALLOCATE', 'CREATE', 'ALTER', 'KILL', 'BEGIN', 'START', 'COMMIT', 'ANALYZE', 'DISCARD', 'EXPLAIN', 'SHOW', 'OPTIMIZE', 'REFRESH', 'RESTORE', 'DROP', 'INSERT', 'VALUES', 'DELETE', 'UPDATE', 'SET', 'RESET', 'COPY', 'GRANT', 'DENY', 'REVOKE'}
or
line 1:1: mismatched input 'DECLARE' expecting {'SELECT', 'DEALLOCATE', 'CREATE', 'ALTER', 'KILL', 'BEGIN', 'START', 'COMMIT', 'ANALYZE', 'DISCARD', 'EXPLAIN', 'SHOW', 'OPTIMIZE', 'REFRESH', 'RESTORE', 'DROP', 'INSERT', 'VALUES', 'DELETE', 'UPDATE', 'SET', 'RESET', 'COPY', 'GRANT', 'DENY', 'REVOKE'}
which is related to https://github.com/crate/crate/issues/10212
When trying to fetch all the data into a cursor with psycopg2, it throws:
BytesStreamOutput cannot hold more than 2GB of data
—Update—
Found a hacky way to do it through pandas: https://stackoverflow.com/questions/18107953/how-to-create-a-large-pandas-dataframe-from-an-sql-query-without-running-out-of
import pandas as pd

chunk_size = 50000   # rows per query; tune to the available memory
offset = 0
dfs = []
while True:
    # {table} is a placeholder for the table name; conn is an open DB-API/SQLAlchemy connection
    sql = "SELECT * FROM {table} LIMIT %d OFFSET %d" % (chunk_size, offset)
    dfs.append(pd.read_sql(sql, conn))
    offset += chunk_size
    if len(dfs[-1]) < chunk_size:   # last chunk was short, so we are done
        break
full_df = pd.concat(dfs)
Another approach could be to dump the table with crash, write it to a file, and read it back later. Not sure if there are alternatives to these two.
Another approach, and I think the best one to retrieve all the data in the table, would be keyset pagination.
For the first iteration do:
SELECT {columns} FROM {table} ORDER BY {time}, {id} LIMIT {size};
And for subsequent queries do:
SELECT {columns} FROM {table} WHERE {time} > {previous last item's time} AND {id} > {previous last item's id} ORDER BY {time}, {id} LIMIT {size};
@cande1gut Is there a specific reason why you order by both time and id, specifically when using the WHERE clause {time} > {previous last item's time}?
I think it's worth considering what happens when you have records with the same time.
Example:
text | time
-----+------
a    |    1
b    |    2
c    |    3
d    |    3
e    |    4
f    |    5
SELECT text FROM table ORDER BY time LIMIT 3;
returns a, b, c
Now, using the time 3 of the last row c for the subsequent query:
SELECT text FROM table WHERE time > 3 ORDER BY time LIMIT 3;
returns e, f
Record d is never returned.
I totally skipped that, thanks for pointing that out. You would need to add an = in the {time} condition to consider the skipped row:
SELECT {columns} FROM {table} WHERE {time} >= {previous last item's time} AND {id} > {previous last item's id} ORDER BY {time}, {id} LIMIT {size};
Considering that the time can be repeated and that you have another column holding a unique value, that other column can be used to help with the ordering (that is why I am ordering by both time and id).
Given the id column is unique, it should be sufficient to order by id only and use the WHERE clause {id} > {previous last item's id}.
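A minimal sketch of that with psycopg2, assuming a unique id column; the table name, columns, and connection parameters are placeholders:

import psycopg2

# table/column names and connection parameters are placeholders
conn = psycopg2.connect(host="localhost", port=5432, user="crate", dbname="doc")
cur = conn.cursor()

page_size = 10000
last_id = None
while True:
    if last_id is None:
        cur.execute("SELECT id, text FROM my_table ORDER BY id LIMIT %s", (page_size,))
    else:
        cur.execute("SELECT id, text FROM my_table WHERE id > %s ORDER BY id LIMIT %s",
                    (last_id, page_size))
    rows = cur.fetchall()
    if not rows:
        break
    for row in rows:
        pass  # process one page of rows at a time instead of materializing the whole table
    last_id = rows[-1][0]

cur.close()
conn.close()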