I have a partitioned table that contains ~2 billion records in total. I need to get every row out and publish it to a queuing system for follow-on processing.
I started with SELECT ... FROM ... ORDER BY created_at ASC LIMIT <num> OFFSET <num>, using a limit of 10,000. I noticed that the further I got into the results (around 2 million rows in), the longer each query for the next 10,000 rows took.
Is there a better way to get this data out? Should I perform an export and read the files from disk instead?
Thanks in advance!
LIMIT/OFFSET gets slower the deeper you page, because the server has to produce and then discard every row before the offset on each query. If you can use a client that supports protocol-level cursors, you can run a plain SELECT * FROM t without ORDER BY, LIMIT, or OFFSET and still fetch only a couple hundred or thousand records at a time.
The JDBC client supports this; another client that does is asyncpg:
#!/usr/bin/env python3
import asyncpg
import asyncio


async def main():
    conn = await asyncpg.connect(host='localhost', user='crate')
    async with conn.transaction():
        # Fetches 50 records at a time by default; pass prefetch=1000
        # to conn.cursor(...) to increase the batch size.
        async for record in conn.cursor('SELECT generate_series(0, 100)'):
            print(record)


if __name__ == "__main__":
    asyncio.run(main())
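Note that asyncpg only allows cursors inside a transaction, which is why the conn.transaction() block is there.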
With JDBC:
// Connection URL mirrors the asyncpg example above; adjust for your cluster.
String url = "jdbc:postgresql://localhost:5432/?user=crate";
try (Connection conn = DriverManager.getConnection(url)) {
    // The driver only honors the fetch size when auto-commit is off.
    conn.setAutoCommit(false);
    try (Statement st = conn.createStatement()) {
        st.setFetchSize(1000); // fetch 1,000 rows per round trip
        try (ResultSet resultSet = st.executeQuery("SELECT x FROM t")) {
            while (resultSet.next()) {
                int x = resultSet.getInt(1); // publish each value downstream
            }
        }
    }
}
Thanks for the suggestion. I am using Go at the moment, but could try Python or JDBC.
It might be that Go supports it as well; we haven't tested it yet. Are you using pgx?
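If it is pgx, a minimal, untested sketch might look like the one below. As far as I know pgx doesn't use portal suspension the way the asyncpg cursor above does, so the server streams the whole result set, but the client reads rows off the wire one at a time as you iterate, so client memory stays bounded. The connection string is just an assumption mirroring the asyncpg example.

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()

	// Example connection string; adjust host/port/user/database
	// for your cluster.
	conn, err := pgx.Connect(ctx, "postgres://crate@localhost:5432/doc")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// pgx reads rows from the socket as you iterate, so the full
	// result set is never held in client memory at once.
	rows, err := conn.Query(ctx, "SELECT generate_series(0, 100)")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var x int64
		if err := rows.Scan(&x); err != nil {
			log.Fatal(err)
		}
		fmt.Println(x)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}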