Hello I started optimize query in my cratedb cluster and it seems like this process never ends I do not want to kill job but is there any way to understand what is happening at the background and nothing is blocking this query. this is the query I executed “OPTIMIZE TABLE someable WITH (only_expunge_deletes = true);”. I can see partitions of my table still shrinking but isn’t this ridiculous table totally has 800 - 900 gb of data and it shrieked 200 gb. This table never get queries or anything also cluster was idle and sitting like this for more than a week. that means if I do not execute this query it will be never shrink or merged. If there is any person that can explain what is happening I would be very happy. I am afraid of executing this on my production cluster
.
Hi, I’m sorry that you are experiencing issues with your query.
For how long has this query been running?
–
Deletes expunge removes segments that have ‘delete’ records marked.
OPTIMIZE TABLE someable WITH (only_expunge_deletes = true)
So for example, given a table test_app_simplemodel, you can get how many segments with deletes there are with:
select count(*) from sys.segments where table_name = 'test_app_simplemodel' and deleted_docs > 0
-- 18
This number should decrease until it reaches 0.
I have 1302 in production out of 2224. In test environment it took
OPTIMIZE OK, 13 records affected (13372.002 seconds) 3.7 hours. if there is actually way to track the process for partitioned tables I can execute optimize query for partition and go safe. the tables that I want to execute query has 1 tb data and almost same segment numbers. but without tracking it his really hard for me to execute optimize since it doubles the size of partition or table.
With this query you can obtain how many segments with deleted documents there are per partition given a table and a schema:
select
arbitrary(
values
['part_key']
) as part_value,
s.partition_ident,
sum(s.deleted_docs) as n_segments_with_deleted_docs
from
information_schema.table_partitions tp,
sys.segments s
where
tp.partition_ident = s.partition_ident
and tp.table_name = 'tt'
AND tp.table_schema = 'doc'
group by
s.partition_ident;
With this you should be able to track optimizing partitions.