Re: Extracting only the columns needed for a query

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Ashwin Agrawal <aagrawal(at)pivotal(dot)io>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Extracting only the columns needed for a query
Date: 2019-12-17 10:59:50
Message-ID: 20191217105950.slmha6jrqy5j3qo7@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Tue, Jul 16, 2019 at 06:49:10PM -0700, Melanie Plageman wrote:
>
> We implemented Approach B in the attached patch set (patch 0001) and
> then implemented Approach A (patch 0002) to sanity check the pruned
> list of columns to scan we were getting at plan-time.
> We emit a notice in SeqNext() if the two column sets differ.
> Currently, for all of the queries in the regression suite, no
> differences are found.
> We did notice that none of the current regression tests for triggers
> are showing a difference between columns that can be extracted at plan
> time and those that can be extracted at execution time--though we
> haven't dug into this at all.

Thanks for the patch! If I understand correctly from this thread,
approach B is more preferable, so I've concentrated on the patch 0001
and have a few commentaries/questions:

* The idea is to collect columns that have being used for selects/updates
(where it makes sense for columnar storage to avoid extra work), do I see it
right? If it's the case, then scanCols could be a bit misleading, since it
gives an impression that it's only about reads.

* After a quick experiment, it seems that extract_used_columns is invoked for
updates, but collects too many colums, e.g:

create table test (id int primary key, a text, b text, c text);
update test set a = 'something' where id = 1;

collects into scanCols all columns (varattno from 1 to 4) + again the first
column from baserestrictinfo. Is it correct?

* Not sure if it supposed to be covered by this functionality, but if we do

insert ... on conflict (uniq_col) do update set other_col = 'something'

and actually have to perform an update, extract_used_columns is not called.

* Probably it also makes sense to check IS_DUMMY_REL in extract_used_columns?

> > On Sat, Jun 15, 2019 at 10:02 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >
> > Another reason for having the planner do this is that presumably, in
> > an AM that's excited about this, the set of fetched columns should
> > play into the cost estimates for the scan. I've not been paying
> > enough attention to the tableam work to know if we've got hooks for
> > the AM to affect scan costing ... but if we don't, that seems like
> > a hole that needs plugged.
>
> AM callback relation_estimate_size exists currently which planner leverages.
> Via this callback it fetches tuples, pages, etc.. So, our thought is to extend
> this API if possible to pass down needed column and help perform better costing
> for the query. Though we think if wish to leverage this function, need to know
> list of columns before planning hence might need to use query tree.

I believe it would be beneficial to add this potential API extension patch into
the thread (as an example of an interface defining how scanCols could be used)
and review them together.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2019-12-17 11:17:36 Re: error context for vacuum to include block number
Previous Message Peter Eisentraut 2019-12-17 10:56:17 Re: automating pg_config.h.win32 maintenance