Re: enhance the efficiency of migrating particularly large tables

From: David Zhang <david(dot)zhang(at)highgo(dot)ca>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Pgsql Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: enhance the efficiency of migrating particularly large tables
Date: 2024-05-02 21:33:41
Message-ID: 42e054f6-c1f5-41fe-8682-c6f1d221f8f8@highgo.ca
Lists: pgsql-hackers

Thanks a lot, David Rowley, for your detailed suggestions.

On 2024-04-08 3:23 p.m., David Rowley wrote:
> On Tue, 9 Apr 2024 at 09:52, David Zhang <david(dot)zhang(at)highgo(dot)ca> wrote:
> Finding the exact ctid seems overkill for what you need. Why couldn't
> you just find the maximum block with:
>
> N = pg_relation_size('name_of_your_table'::regclass) /
> current_setting('block_size')::int;
>
> and do WHERE ctid < '(N,1)';
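
For reference, the bounded scan can also be written as one self-contained
statement (a sketch; the table name t is illustrative):

-- the uncorrelated scalar subquery is evaluated only once, so the
-- size lookup does not repeat per row
SELECT *
FROM t
WHERE ctid < (SELECT ('(' || pg_relation_size('t'::regclass)
                           / current_setting('block_size')::int || ',1)')::tid);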
We experimented with this approach using pg_relation_size and compared
its performance. Below are some simple timing results for a table with
100 million records:

Using the system function max():
SELECT max(ctid) from t;
Time: 2126.680 ms (00:02.127)

Using pg_relation_size and a WHERE condition:
SELECT pg_relation_size('t'::regclass) / current_setting('block_size')::int;
Time: 0.561 ms

Using the experimental function introduced in the previous patch:
SELECT ctid from get_ctid('t', 1);
Time: 0.452 ms

After deleting about 1/3 of the records from the end of the table:
SELECT max(ctid) from t;
Time: 1552.975 ms (00:01.553)

SELECT pg_relation_size('t'::regclass) / current_setting('block_size')::int;
Time: 0.533 ms
But before vacuum, pg_relation_size still returns the same value as
before the delete, so the size-based estimate may not be accurate.
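
To spell out that sequence, a minimal sketch (the id column and cutoff
value are hypothetical; the point is the pg_relation_size and VACUUM
behaviour):

DELETE FROM t WHERE id > 66000000;  -- hypothetical cutoff, last ~1/3 of rows
SELECT pg_relation_size('t');       -- still reports the pre-delete size
VACUUM t;                           -- truncates the now-empty trailing pages
SELECT pg_relation_size('t');       -- now reflects the smaller relation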

SELECT ctid from get_ctid('t', 1);
Time: 251.105 ms

After vacuum:
SELECT ctid from get_ctid('t', 1);
Time: 0.478 ms

Below is a comparison between the system function min() and the
experimental function:

SELECT min(ctid) from t;
Time: 1932.554 ms (00:01.933)

SELECT ctid from get_ctid('t', 0);
Time: 0.478 ms

After deleting about 1/3 of the records from the beginning of the table:
SELECT min(ctid) from t;
Time: 1305.799 ms (00:01.306)

SELECT ctid from get_ctid('t', 0);
Time: 244.336 ms

After vacuum:
SELECT ctid from get_ctid('t', 0);
Time: 0.468 ms

> If we wanted to optimise this in PostgreSQL, the way to do it would
> be, around set_plain_rel_pathlist(), check if the relation's ctid is a
> required PathKey by the same means as create_index_paths() does, then
> if found, create another seqscan path without synchronize_seqscans *
> and tag that with the ctid PathKey sending the scan direction
> according to the PathKey direction. nulls_first does not matter since
> ctid cannot be NULL.
>
> Min(ctid) query should be able to make use of this as the planner
> should rewrite those to subqueries with a ORDER BY ctid LIMIT 1.
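
In SQL terms, the rewrite being described would correspond to forms like
these (a sketch of the equivalent queries; with the proposed path, the
forward scan would serve min and the backward scan max):

SELECT ctid FROM t ORDER BY ctid ASC  LIMIT 1;  -- min(ctid)
SELECT ctid FROM t ORDER BY ctid DESC LIMIT 1;  -- max(ctid)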

Is there a simple way to get the minimum ctid that is faster than min(),
similar to how pg_relation_size can be used to approximate the maximum ctid?

Thank you,

David Zhang
