Re: Extracting only the columns needed for a query

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Ashwin Agrawal <aagrawal(at)pivotal(dot)io>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Extracting only the columns needed for a query
Date: 2020-01-03 01:21:55
Message-ID: CAAKRu_Z355CurEzBJ1Mq3RS_dYqx3x0eUHhoOsQa+RQwHrje-w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Dec 17, 2019 at 2:57 AM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:

>
> Thanks for the patch! If I understand correctly from this thread,
> approach B is more preferable, so I've concentrated on the patch 0001
> and have a few commentaries/questions:
>

Thanks so much for the review!

>
> * The idea is to collect columns that have being used for selects/updates
> (where it makes sense for columnar storage to avoid extra work), do I
> see it
> right? If it's the case, then scanCols could be a bit misleading, since
> it
> gives an impression that it's only about reads.
>

The "scanCols" columns are only what will need to be scanned in order
to execute a query, so, even if a column is being "used", it may not
be set in "scanCols" if it is not required to scan it.

For example, a column which does not need to be scanned but is "used"
-- e.g. in UPDATE x SET col = 2; "col" will not be in "scanCols" because
it is known that it will be 2.

That makes me think that maybe the function name,
extract_used_columns() is bad, though. Maybe extract_scan_columns()?
I tried this in the attached, updated patch.

>
> * After a quick experiment, it seems that extract_used_columns is invoked
> for
> updates, but collects too many colums, e.g:
>
> create table test (id int primary key, a text, b text, c text);
> update test set a = 'something' where id = 1;
>
> collects into scanCols all columns (varattno from 1 to 4) + again the
> first
> column from baserestrictinfo. Is it correct?
>

For UPDATE, we need all of the columns in the table because of the
table_lock() API's current expectation that the slot has all of the
columns populated. If we want UPDATE to only need to insert the column
values which have changed, table_tuple_lock() will have to change.

Collecting columns from the baserestrictinfo is important for when
that column isn't present in another part of the query, but it won't
double count it in the bitmap (when it is already present).

>
> * Not sure if it supposed to be covered by this functionality, but if we do
>
> insert ... on conflict (uniq_col) do update set other_col = 'something'
>
> and actually have to perform an update, extract_used_columns is not
> called.
>
>
For UPSERT, you are correct that it will not extract scan columns.
It wasn't by design. It is because that UPDATE is planned as part of
an INSERT.
For an INSERT, in query_planner(), because the jointree has only one
relation and that relation is an RTE_RESULT planner does not continue
on to make_one_rel() and thus doesn't extract scan columns.

This means that for INSERT ON CONFLICT DO UPDATE, "scanCols" is not
populated, however, since UPDATE needs to scan all of the columns
anyway, I don't think populating "scanCols" would have any impact.
This does mean that that bitmap would be different for a regular
UPDATE vs an UPSERT, however, I don't think that doing the extra work
to populate it makes sense if it won't be used. What do you think?

> * Probably it also makes sense to check IS_DUMMY_REL in
> extract_used_columns?
>
>
I am wondering, when IS_DUMMY_REL is true for a relation, do we
reference the associated RTE later? It seems like if it is a dummy
rel, we wouldn't scan it. It still makes sense to add it to
extract_used_columns(), I think, to avoid any wasted loops through the
rel's expressions. Thanks for the idea!

--
Melanie Plageman

Attachment Content-Type Size
v2-0001-Plan-time-extraction-of-scan-cols.patch application/octet-stream 9.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2020-01-03 02:33:59 Re: Removal of support for OpenSSL 0.9.8 and 1.0.0
Previous Message John Naylor 2020-01-02 23:56:28 Re: benchmarking Flex practices