COPY FROM using the Python http.server module as a CSV source

Hello,

I was trying to get COPY FROM working with http://192.168.252.1:8000/lr-sql-scada/output_sqlt_data_1_2017_11.csv, which points to the IP of my host. CrateDB is running as a multi-node cluster via Docker Compose on the same host.

I cd to the directory of my CSV data and start the Python HTTP server:

cd /my/data/dir
python -m http.server 8000
  • I verified my web server logs GET requests when I trigger the COPY FROM statement.

  • I verified I can manually access the files in my browser and copy the link.
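The statement itself was along these lines (reconstructed here from the URL above and the table that appears later in this thread):

```sql
COPY spot_data_history
FROM 'http://192.168.252.1:8000/lr-sql-scada/output_sqlt_data_1_2017_11.csv';
```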

When I execute the query, it reports success, but no records are inserted.

When I manually download the same CSV file via that link in my browser, it transfers at ~500 MB/s.


Dear Ben,

welcome to the forum, and thank you for reporting your problem.

May I ask you to append the RETURN SUMMARY clause to your COPY FROM statement? It might already give you a clue about what is going wrong.

For more information about that topic, see also the excellent Fundamentals of the COPY FROM statement document by @Marija.
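For illustration, a sketch of what that could look like with the URL from your post (target table name assumed):

```sql
COPY spot_data_history
FROM 'http://192.168.252.1:8000/lr-sql-scada/output_sqlt_data_1_2017_11.csv'
RETURN SUMMARY;
```

The summary reports per-URI success and error counts, so a failed download or a parse problem should show up in its error details rather than being silently swallowed.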

This could actually be the cause of the problem:

With kind regards,
Andreas.

Thanks for the reply. It looks like the connection is refused. Perhaps it is some kind of issue with the Windows firewall.

Hi again,

in your original post, you shared a non-TLS URL with us, starting with http://192.168.252.1/. Now, in your recent example, the URL shows up as https://localhost:4443/.

=> Are you sure this endpoint is serving a sound X.509 certificate for localhost:4443, which the client can successfully verify on all details?

I don’t think COPY FROM accepts any parameters to relax the X.509 verification [1], so it is likely to fail like this when the certificate is wrong. A Connection refused exception, in one form or another, would actually be expected in this case.

With kind regards,
Andreas.


  1. COPY FROM — CrateDB Reference

I may have pulled a fast one on myself with that HTTPS.

This is the simplified version where I just run python -m http.server. When I submit the query in the CrateDB UI with this method, the http.server logs show status code 200.

I tried with all the network adapters. Sorry, I think I need to be posting my issues in a networking forum. In the meantime I got the executemany script to run, and supposedly that is way faster?

Hi again,

that’s weird. Maybe you can share a sample of your data?

In the meantime I got the executemany script to run and supposedly that is way faster?

I am not sure what you are referring to, but it sounds like you have been successful by some other means?

Please let me know if you are still struggling, and maybe share a data sample, so I can have a closer look.

With kind regards,
Andreas.

I did get a solution working with the crate Python client using executemany. Still curious about the loader feature in the web UI; it says new users can't upload attachments.

t_stamp,entity,float_value,quality
1654064489856,site_a/equipment_01,512.000,196
1654064939895,site_a/equipment_01,463.500,196
1654066289969,site_a/equipment_01,586.500,196
1654066740002,site_a/equipment_01,491.167,196
1654068090078,site_a/equipment_01,622.714,196
1654068540111,site_a/equipment_01,701.667,196
1654069890183,site_a/equipment_01,329.200,196
1654070340222,site_a/equipment_01,342.833,196
1654071690296,site_a/equipment_02,309.769,196
1654072140345,site_a/equipment_02,1357.333,196
1654073490413,site_a/equipment_02,359.000,196
1654073940438,site_a/equipment_02,317.091,196
1654075290520,site_a/equipment_02,141.769,196
1654075740547,site_a/equipment_02,408.300,196
1654077090628,site_a/equipment_02,163.250,196
1654077540664,site_a/equipment_02,122.500,196
1654078890748,site_a/equipment_02,209.000,196
1654079340773,site_a/equipment_02,540.167,196
1654080690847,site_a/equipment_02,303.545,196
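For reference, the t_stamp column holds epoch milliseconds; decoding the first value above in Python (a quick sanity check, not part of the ingest):

```python
from datetime import datetime, timedelta, timezone

# first t_stamp from the sample above, interpreted as epoch milliseconds
t_stamp = 1654064489856
dt = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(milliseconds=t_stamp)
print(dt.isoformat())  # → 2022-06-01T06:21:29.856000+00:00
```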

This is my CREATE TABLE DDL:

CREATE TABLE spot_data_history (
    "t_stamp" TIMESTAMP,
    "entity" TEXT,
    "float_value" DOUBLE,
    "quality" INT,
    PRIMARY KEY ("t_stamp", "entity")
);

The Pipfile:

[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
sqlalchemy = "*"
pyodbc = "*"
pandas = "*"
dynaconf = "*"
loguru = "*"
crate = {extras = ["sqlalchemy"], version = "*"}
pendulum = "*"

[dev-packages]
black = "*"

[requires]
python_version = "3.10"

The ingest script:

from pathlib import Path
import csv

from crate import client
from loguru import logger

connection = client.connect("http://localhost:4201")
cursor = connection.cursor()
sql = "INSERT INTO spot_data_history VALUES (?, ?, ?, ?)"

data_dir = Path(__file__).parent / "data"
chunk_size = 10_000  # number of rows per bulk insert

for csv_file in data_dir.glob("*.csv"):

    with open(csv_file, "r", newline="") as file:
        reader = csv.reader(file)
        next(reader)  # skip the header row

        chunk = []
        for row in reader:
            # the CSV may contain empty lines
            if not row:
                continue

            t_stamp, entity, float_value, quality = row
            # convert to native types so CrateDB does not have to guess
            chunk.append([int(t_stamp), entity, float(float_value), int(quality)])

            if len(chunk) >= chunk_size:
                logger.info("inserting...")
                cursor.executemany(sql, chunk)
                chunk = []  # clear the chunk after insertion

        if chunk:  # insert any remaining rows
            cursor.executemany(sql, chunk)

cursor.close()
connection.close()
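One refinement worth considering, sketched under the assumption that cursor.executemany hands back CrateDB's bulk results (one entry per submitted row, with a rowcount of -2 marking a rejected row, e.g. a duplicate primary key): check those results instead of assuming every insert succeeded. The failed_rows helper below is hypothetical, not part of the crate client.

```python
def failed_rows(results):
    """Indexes of rows the bulk insert rejected (rowcount -2 by CrateDB convention)."""
    return [i for i, res in enumerate(results) if res.get("rowcount", -2) < 0]

# shape of a bulk response: one dict per submitted row
sample = [{"rowcount": 1}, {"rowcount": -2}, {"rowcount": 1}]
print(failed_rows(sample))  # → [1]
```

Logging these indexes per chunk would make silent row-level failures visible, which is exactly the kind of problem that made the original COPY FROM look like a success with zero records.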

The Docker Compose file used to spin up the local CrateDB cluster:

version: '3.8'
services:
  cratedb01:
    image: crate:latest
    ports:
      - "4201:4200"
      - "5432:5432"      
    volumes:
      - ./data/crate/01:/data
    command: ["crate",
              "-Ccluster.name=crate-docker-cluster",
              "-Cnode.name=cratedb01",
              "-Cnode.data=true",
              "-Cnetwork.host=_site_",
              "-Cdiscovery.seed_hosts=cratedb02,cratedb03",
              "-Ccluster.initial_master_nodes=cratedb01,cratedb02,cratedb03",
              "-Cgateway.expected_data_nodes=3",
              "-Cgateway.recover_after_data_nodes=2"]
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
    environment:
      - CRATE_HEAP_SIZE=2g

  cratedb02:
    image: crate:latest
    ports:
      - "4202:4200"
    volumes:
      - ./data/crate/02:/data
    command: ["crate",
              "-Ccluster.name=crate-docker-cluster",
              "-Cnode.name=cratedb02",
              "-Cnode.data=true",
              "-Cnetwork.host=_site_",
              "-Cdiscovery.seed_hosts=cratedb01,cratedb03",
              "-Ccluster.initial_master_nodes=cratedb01,cratedb02,cratedb03",
              "-Cgateway.expected_data_nodes=3",
              "-Cgateway.recover_after_data_nodes=2"]
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
    environment:
      - CRATE_HEAP_SIZE=2g

  cratedb03:
    image: crate:latest
    ports:
      - "4203:4200"
    volumes:
      - ./data/crate/03:/data
    command: ["crate",
              "-Ccluster.name=crate-docker-cluster",
              "-Cnode.name=cratedb03",
              "-Cnode.data=true",
              "-Cnetwork.host=_site_",
              "-Cdiscovery.seed_hosts=cratedb01,cratedb02",
              "-Ccluster.initial_master_nodes=cratedb01,cratedb02,cratedb03",
              "-Cgateway.expected_data_nodes=3",
              "-Cgateway.recover_after_data_nodes=2"]
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
    environment:
      - CRATE_HEAP_SIZE=2g