[Proposal] Extend TableAM routines for ANALYZE scan

From: Pengzhou Tang <ptang(at)pivotal(dot)io>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: [Proposal] Extend TableAM routines for ANALYZE scan
Date: 2019-12-05 10:14:17
Message-ID: CAG4reARKmeZezUNc8YmSxht9q=FNe6Lw=+f_ui=Bs6a2vpLmHA@mail.gmail.com
Lists: pgsql-hackers

While hacking on Zedstore, we needed to collect more accurate statistics for
zedstore tables, and we ran into some restrictions:
1) acquire_sample_rows() always uses RelationGetNumberOfBlocks() to generate
the sampling block numbers. This is unfriendly to zedstore, which wants to
use logical block numbers, and probably also to non-block-oriented table AMs.
2) The columns of a zedstore table are stored separately, so the columns of a
row have different physical positions. The tid in a tuple is therefore
invalid for zedstore, which means the correlation statistic is incorrect for
zedstore.
3) RelOptInfo->pages is not correct for zedstore when only a subset of the
columns is accessed, which makes the estimated IO cost much higher than the
actual cost.

For 1) and 2), we propose to extend the existing ANALYZE-scan table AM
routines in patch "0001-ANALYZE-tableam-API-change.patch", which adds three
more APIs: scan_analyze_beginscan(), scan_analyze_sample_tuple() and
scan_analyze_endscan(). This is more convenient, and it lets table AMs take
more control of every step of sampling rows. Meanwhile, with the new
structure named "AcquireSampleContext", we can acquire extra info (e.g.
physical position, physical size) in addition to the actual column values.
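
To make the shape of this concrete, here is a minimal standalone sketch of a
begin / sample / end callback set driven by a generic sampling loop. It is
not the code from the patch: SampleContext, AnalyzeRoutine, the toy_*
functions, acquire_samples() and all field names are hypothetical stand-ins,
and the real signatures in 0001 may well differ. The toy AM walks logical
block numbers that it defines itself and reports a physical size next to
each value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_LOGICAL_BLOCKS 1000     /* made-up size of the toy relation */

/* Hypothetical stand-in for the proposed AcquireSampleContext. */
typedef struct SampleContext
{
    uint64_t    next_logical_block; /* AM-defined position, not a heap block */
    uint64_t    total_logical_blocks;
    uint64_t    step;               /* distance between sampled blocks */
    int64_t     value;              /* column value of the current sample */
    uint64_t    disk_size;          /* extra info: physical size in bytes */
} SampleContext;

/* Hypothetical callback table; the AM owns every step of the sampling. */
typedef struct AnalyzeRoutine
{
    void        (*beginscan) (SampleContext *ctx, size_t targrows);
    bool        (*sample_tuple) (SampleContext *ctx);
    void        (*endscan) (SampleContext *ctx);
} AnalyzeRoutine;

static void
toy_beginscan(SampleContext *ctx, size_t targrows)
{
    ctx->total_logical_blocks = TOY_LOGICAL_BLOCKS;
    ctx->step = TOY_LOGICAL_BLOCKS / targrows;
    if (ctx->step == 0)
        ctx->step = 1;
    ctx->next_logical_block = 0;
    ctx->value = 0;
    ctx->disk_size = 0;
}

/* Returns false once the AM has no more positions to sample. */
static bool
toy_sample_tuple(SampleContext *ctx)
{
    if (ctx->next_logical_block >= ctx->total_logical_blocks)
        return false;
    ctx->value = (int64_t) (ctx->next_logical_block * 10);
    ctx->disk_size = 8;             /* pretend every value takes 8 bytes */
    ctx->next_logical_block += ctx->step;
    return true;
}

static void
toy_endscan(SampleContext *ctx)
{
    ctx->next_logical_block = ctx->total_logical_blocks;
}

const AnalyzeRoutine toy_am = {toy_beginscan, toy_sample_tuple, toy_endscan};

/* Generic driver: it never computes block numbers itself. */
size_t
acquire_samples(const AnalyzeRoutine *am, SampleContext *ctx,
                int64_t *rows, size_t targrows)
{
    size_t      n = 0;

    assert(am != NULL && ctx != NULL && rows != NULL && targrows > 0);
    am->beginscan(ctx, targrows);
    while (n < targrows && am->sample_tuple(ctx))
        rows[n++] = ctx->value;
    am->endscan(ctx);
    return n;
}
```

The point of the sketch is only that the driver asks the AM for the next
sample and never assumes physical block numbers, which is the kind of control
restriction 1) takes away today.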

For 3), we hope to have a mechanism similar to the one for RelOptInfo->rows,
which is calculated as (RelOptInfo->tuples * Selectivity): we can calculate
RelOptInfo->pages with a page selectivity that is based on the selected
zedstore columns.
0002-Planner-can-estimate-the-pages-based-on-the-columns-.patch shows one
idea: it adds `stadiskfrac` to pg_statistic, and the planner uses it to
estimate RelOptInfo->pages.
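
As a rough illustration of the arithmetic, the sketch below scales the total
page count by the summed disk fractions of the selected columns. It is a
standalone example under assumptions, not the planner code from 0002: the
function name estimate_pages(), its parameters, and the clamp to 1.0 are all
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: estimate the pages touched when only some columns
 * are read.  diskfrac[i] plays the role of a per-column stadiskfrac value,
 * i.e. the fraction of the table's on-disk size used by column i.
 */
double
estimate_pages(double rel_pages, const double *diskfrac,
               const bool *selected, size_t ncols)
{
    double      page_selectivity = 0.0;
    size_t      i;

    assert(diskfrac != NULL && selected != NULL);
    for (i = 0; i < ncols; i++)
    {
        if (selected[i])
            page_selectivity += diskfrac[i];
    }
    /* Assumed safety clamp: fractions should never add up to more than 1. */
    if (page_selectivity > 1.0)
        page_selectivity = 1.0;
    return rel_pages * page_selectivity;
}
```

For example, with a 1000-page table whose three columns take 50%, 25% and
25% of the disk space, reading only the first column gives an estimate of
500 pages instead of 1000.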

0003-ZedStore-use-extended-ANAlYZE-API.patch is attached only to show how
Zedstore uses the previous patches to:
1. use logical block ids to acquire the sample rows.
2. acquire sample rows only from a specified column c1; this is used when the
user analyzes only specified columns of a table, e.g. "analyze zs (c1)".
3. provide extra disk size info from the zedstore table AM during ANALYZE, so
that ANALYZE can compute the physical fraction statistic of each column and
the planner can use it to estimate the IO cost based on the selected columns.
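
For item 3, one plausible way to turn the per-column disk sizes into a
fraction statistic is sketched below. This is an assumption about the
computation, not what 0003 actually does; compute_disk_fractions() and its
parameters are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: given the bytes attributed to each column in the
 * sample, compute each column's share of the total.  If nothing was
 * sampled, every fraction is reported as 0.
 */
void
compute_disk_fractions(const uint64_t *col_bytes, double *frac_out,
                       size_t ncols)
{
    uint64_t    total = 0;
    size_t      i;

    assert(col_bytes != NULL && frac_out != NULL);
    for (i = 0; i < ncols; i++)
        total += col_bytes[i];

    for (i = 0; i < ncols; i++)
    {
        if (total == 0)
            frac_out[i] = 0.0;
        else
            frac_out[i] = (double) col_bytes[i] / (double) total;
    }
}
```

With sampled sizes of 512, 256 and 256 bytes, the fractions come out as 0.5,
0.25 and 0.25, which is the kind of per-column value the planner would then
sum over the selected columns.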

Thanks,
Pengzhou

Attachment Content-Type Size
0001-ANALYZE-tableam-API-change.patch application/x-patch 34.0 KB
0002-Planner-can-estimate-the-pages-based-on-the-columns-.patch application/x-patch 9.1 KB
0003-ZedStore-use-extended-ANAlYZE-API.patch application/x-patch 8.7 KB
