Re: PoC Refactor AM analyse API

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Denis Smirnov <sd(at)arenadata(dot)io>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PoC Refactor AM analyse API
Date: 2021-01-22 13:12:13
Message-ID: 45641a84-11c8-5173-2f12-4efa48b18811@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 30/12/2020 11:12, Denis Smirnov wrote:
>
>> But why do you pass int natts and VacAttrStats **stats to
>> acquire_sample_rows()? Is it of any use? It seems to break
>> abstraction too.
>
> Yes, it is really a kluge that should be discussed. The main problem
> is that we don’t pass projection information to analyze scan (analyze
> begin scan relise only on relation information during
> initialization). And as a result we can’t benefit from column AMs
> during «analyze t(col)» and consume data only from target columns.
> This parameters were added to solve this problem.

The documentation needs to be updated accordingly, see
AcquireSampleRowsFunc in fdwhandler.sgml.

This part of the patch, adding the list of columns being analyzed, seems
a bit unfinished. I'd suggest to leave that out for now, and add it as
part of the "Table AM modifications to accept column projection lists"
patch that's also in this commitfest [1]

> This suggestion also ... removes PROGRESS_ANALYZE_BLOCKS_TOTAL and
> PROGRESS_ANALYZE_BLOCKS_DONE definitions as not all AMs can be
> block-oriented.

We shouldn't just remove it, a progress indicator is nice. It's true
that not all AMs are block-oriented, but those that are can still use
those. Perhaps we can add ther PROGRESS_ANALYZE_* states for
non-block-oriented AMs, but that can wait until there is a concrete use
for it.

> static int
> acquire_sample_rows(Relation onerel, int elevel,
> HeapTuple *rows, int targrows,
> double *totalrows, double *totaldeadrows)
> {
> int numrows = 0; /* # rows now in reservoir */
> TableScanDesc scan;
>
> Assert(targrows > 0);
>
> scan = table_beginscan_analyze(onerel);
>
> numrows = table_acquire_sample_rows(scan, elevel,
> natts, stats,
> vac_strategy, rows,
> targrows, totalrows,
> totaldeadrows);
>
> table_endscan(scan);
>
> /*
> * If we didn't find as many tuples as we wanted then we're done. No sort
> * is needed, since they're already in order.
> *
> * Otherwise we need to sort the collected tuples by position
> * (itempointer). It's not worth worrying about corner cases where the
> * tuples are already sorted.
> */
> if (numrows == targrows)
> qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);
>
> return numrows;
> }

Perhaps better to move the qsort() into heapam_acquire_sample_rows(),
and document that the acquire_sample_rows() AM function must return the
tuples in 'ctid' order. In a generic API, it seems like a shady
assumption that they must be in order if we didn't find as many rows as
we wanted. Or always call qsort(); if the tuples are already in order,
that should be pretty quick.

The table_beginscan_analyze() call seems a bit pointless now. Let's
remove it, and pass the Relation to table_acquire_sample_rows directly.

> /*
> * This callback needs to fill reservour with sample rows during analyze
> * scan.
> */
> int (*acquire_sample_rows) (TableScanDesc scan,

The "reservoir" is related to the block sampler, but that's just an
implementation detail of the function. Perhaps something like "Acquire a
sample of rows from the table, for ANALYZE". And explain the arguments
here, or in table_acquire_sample_rows().

Overall, I like where this patch is going.

[1] https://commitfest.postgresql.org/31/2922/

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2021-01-22 13:13:04 Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
Previous Message Tomas Vondra 2021-01-22 13:09:04 Re: PoC/WIP: Extended statistics on expressions