I have a partitioned table that contains ~2 billion records in total. I need to get every row out and publish it to a queuing system for follow-on processing.
I started with SELECT ... FROM ... ORDER BY created_at ASC LIMIT <num> OFFSET <num>, using a limit of 10,000. I noticed that the further I got into the results (around 2 million rows in), the longer each query for the next 10,000 rows took.
Is there a better way to get this data out? Should I perform an export and read the files from disk instead?
Thanks in advance!
LIMIT/OFFSET gets slower the deeper you page, because the server has to produce and then discard every row before the offset on each query. If you can use a client that supports protocol-level cursors, you can run a plain SELECT * FROM t without ORDER BY, LIMIT, or OFFSET and still fetch only a couple hundred or thousand records at a time.
The JDBC client supports this; another client that does is asyncpg:
#!/usr/bin/env python3
import asyncpg
import asyncio


async def main():
    conn = await asyncpg.connect(host='localhost', user='crate')
    async with conn.transaction():
        # Fetches 50 records at a time by default; pass prefetch=1000
        # to conn.cursor(...) to increase the batch size.
        async for record in conn.cursor('SELECT generate_series(0, 100)'):
            print(record)


if __name__ == "__main__":
    asyncio.run(main())
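Note that asyncpg only allows cursors inside a transaction, which is why the conn.transaction() block is there.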
With JDBC:
// Connection URL mirrors the asyncpg example above; adjust for your cluster.
String url = "jdbc:postgresql://localhost:5432/?user=crate";
try (Connection conn = DriverManager.getConnection(url)) {
    // The driver only honors the fetch size when auto-commit is off.
    conn.setAutoCommit(false);
    try (Statement st = conn.createStatement()) {
        st.setFetchSize(1000); // fetch 1,000 rows per round trip
        try (ResultSet resultSet = st.executeQuery("SELECT x FROM t")) {
            while (resultSet.next()) {
                int x = resultSet.getInt(1); // publish each value downstream
            }
        }
    }
}
Thanks for the suggestion. I am using Go at the moment, but could try Python or JDBC.
It might be that Go supports it as well; we haven't tested it yet. Are you using pgx?
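If it is pgx, a minimal, untested sketch might look like the one below. As far as I know pgx doesn't use portal suspension the way the asyncpg cursor above does, so the server streams the whole result set, but the client reads rows off the wire one at a time as you iterate, so client memory stays bounded. The connection string is just an assumption mirroring the asyncpg example.

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()

	// Example connection string; adjust host/port/user/database
	// for your cluster.
	conn, err := pgx.Connect(ctx, "postgres://crate@localhost:5432/doc")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// pgx reads rows from the socket as you iterate, so the full
	// result set is never held in client memory at once.
	rows, err := conn.Query(ctx, "SELECT generate_series(0, 100)")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var x int64
		if err := rows.Scan(&x); err != nil {
			log.Fatal(err)
		}
		fmt.Println(x)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}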