Extracting only the columns needed for a query

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Extracting only the columns needed for a query
Date: 2019-06-15 01:45:51
Message-ID: CAAKRu_Yj=Q_ZxiGX+pgstNWMbUJApEJX-imvAEwryCk5SLUebg@mail.gmail.com
Lists: pgsql-hackers

While hacking on zedstore, we needed to get a list of the columns to
be projected--basically all of the columns needed to satisfy the
query. The two use cases we have for this are:
1) to pass this column list down to the AM layer for the AM to leverage
2) for use during planning to improve costing
In other threads, such as [1], there has been discussion about the
possible benefits for all table types of having access to this set of
columns.

Focusing on how to get this list of used columns (as opposed to how to
pass it down to the AM), we have tried a few approaches to constructing
it and wanted to get some ideas on how best to do it.

We are trying to determine in which phase to get the columns -- after
parsing, after planning, or during execution right before calling the
AM.

Approach A: right before calling AM

Leverage expression_tree_walker() right before calling beginscan(),
collecting the columns into a needed-columns context. This approach
is what is currently in the zedstore patch mentioned in this
thread [2].

The benefit of this approach is that the walk happens right before
the used column set is actually needed, which makes it easy to skip
the walk entirely for queries or AMs that don't benefit from the
column list.
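
For illustration, a minimal sketch of what such a walker might look
like (the NeededColumnContext/needed_columns_walker names are made up
for this email rather than taken from the zedstore patch; it assumes a
single scan relation and stores attribute numbers offset by
FirstLowInvalidHeapAttributeNumber, the usual convention for
per-relation attribute bitmapsets):

#include "postgres.h"
#include "access/sysattr.h"
#include "nodes/bitmapset.h"
#include "nodes/nodeFuncs.h"
#include "nodes/primnodes.h"

/* Hypothetical context: which relation we care about and its used attnos */
typedef struct NeededColumnContext
{
    Index       scanrelid;      /* varno of the scan's relation */
    Bitmapset  *attrs_used;     /* attnos offset by
                                 * FirstLowInvalidHeapAttributeNumber */
} NeededColumnContext;

static bool
needed_columns_walker(Node *node, void *context)
{
    NeededColumnContext *ctx = (NeededColumnContext *) context;

    if (node == NULL)
        return false;

    if (IsA(node, Var))
    {
        Var        *var = (Var *) node;

        if (var->varno == ctx->scanrelid && var->varlevelsup == 0)
            ctx->attrs_used =
                bms_add_member(ctx->attrs_used,
                               var->varattno - FirstLowInvalidHeapAttributeNumber);
        return false;
    }

    return expression_tree_walker(node, needed_columns_walker, context);
}

The scan node would run this over its targetlist and qual just before
calling beginscan() and hand the resulting bitmapset to the AM; for
AMs that don't care, the walk is simply skipped.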

Approach B: after parsing and/or after planning

Add a new member 'used_cols' to PlannedStmt which contains the
attributes for each relation present in the query. Construct
'used_cols' at the end of planning using the PathTargets in the
RelOptInfos in the PlannerInfo->simple_rel_array and the
RangeTblEntries in PlannerInfo->simple_rte_array.
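
A rough sketch of one way to build it (collect_used_cols and the
List-of-Bitmapsets representation are just illustrative assumptions
for this email; it also pulls columns out of baserestrictinfo, roughly
mirroring what check_index_only() does, since the PathTarget alone may
not mention columns used only in quals):

#include "postgres.h"
#include "nodes/bitmapset.h"
#include "nodes/pathnodes.h"
#include "optimizer/optimizer.h"

/*
 * Hypothetical helper: build a List with one attribute Bitmapset per
 * range table index, for stashing in the proposed PlannedStmt->used_cols
 * at the end of planning.
 */
static List *
collect_used_cols(PlannerInfo *root)
{
    List       *used_cols = NIL;

    for (int rti = 1; rti < root->simple_rel_array_size; rti++)
    {
        RelOptInfo     *rel = root->simple_rel_array[rti];
        RangeTblEntry  *rte = root->simple_rte_array[rti];
        Bitmapset      *attrs_used = NULL;
        ListCell       *lc;

        if (rel != NULL && rte->rtekind == RTE_RELATION)
        {
            /* columns the relation must emit */
            pull_varattnos((Node *) rel->reltarget->exprs, rti, &attrs_used);

            /* plus columns referenced only in its restriction quals */
            foreach(lc, rel->baserestrictinfo)
            {
                RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc);

                pull_varattnos((Node *) rinfo->clause, rti, &attrs_used);
            }
        }

        used_cols = lappend(used_cols, attrs_used);
    }

    return used_cols;
}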

The nice thing about this is that it does not require a full walk
of the plan tree. Approach A could be more expensive if the tree
is quite large. I'm not sure, however, if just getting the
PathTargets from the RelOptInfos is sufficient for obtaining the
whole set of columns used in the query.

Approach B, however, does not work for utility statements, which do
not go through planning.

One potential solution we tried was getting the columns from the
query tree after parse analysis and then, in exec_simple_query(),
adding the column list to the PlannedStmt.
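
A minimal sketch of that extraction, for concreteness (the struct and
function names are made up; it handles only a single query level and
skips sub-Queries, which would each need their own pass):

#include "postgres.h"
#include "access/sysattr.h"
#include "nodes/bitmapset.h"
#include "nodes/nodeFuncs.h"
#include "nodes/parsenodes.h"

/*
 * Hypothetical sketch: per-range-table-index attribute sets collected
 * from the analyzed Query, e.g. in exec_simple_query().
 */
typedef struct QueryColumnsContext
{
    int         nrels;          /* list_length(query->rtable) */
    Bitmapset **attrs_used;     /* caller-allocated array, one per RT index */
} QueryColumnsContext;

static bool
query_columns_walker(Node *node, void *context)
{
    QueryColumnsContext *ctx = (QueryColumnsContext *) context;

    if (node == NULL)
        return false;

    if (IsA(node, Var))
    {
        Var        *var = (Var *) node;

        if (var->varlevelsup == 0 &&
            var->varno >= 1 && var->varno <= ctx->nrels)
            ctx->attrs_used[var->varno - 1] =
                bms_add_member(ctx->attrs_used[var->varno - 1],
                               var->varattno - FirstLowInvalidHeapAttributeNumber);
        return false;
    }

    if (IsA(node, Query))
        return false;           /* don't descend into sub-Queries here */

    return expression_tree_walker(node, query_columns_walker, context);
}

static void
collect_query_columns(Query *query, QueryColumnsContext *ctx)
{
    (void) query_tree_walker(query, query_columns_walker, (void *) ctx, 0);
}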

That turned out to be as messy as, or messier than, Approach A
because each kind of utility statement has its own data structure
that is composed of elements taken from the Query tree but does not
directly include the original PlannedStmt created for the query (for
utility statements the PlannedStmt doesn't contain anything except
the query tree, since they do not go through planning). So, for each
type of utility statement, we would have to copy over the column
list from the PlannedStmt in its own way.

It is worth noting that Approach A also requires special handling
for each type of utility statement.

We are wondering about specific benefits of Approach B--that is, is
there some use case (maybe plan re-use) for having the column set
accessible in the PlannedStmt itself?

One potential benefit of Approach B could be for scans of partitioned
tables: collecting the used column list could be done once for the
query instead of once for each partition.

Neither approach, however, addresses our second use case, since we
would not have the column list during planning for non-utility
statements. To satisfy that use case, we would likely have to extract
the columns from the query tree after parse analysis for non-utility
statements as well.

An approach which extracted this list before planning and saved it
somewhere would avoid having to do the same walk during planning and
then again during execution. However, a list constructed just after
parsing may not be ideal if planning is later able to eliminate some
columns.

Melanie & Ashwin

[1]
https://www.postgresql.org/message-id/20190409010440.bqdikgpslh42pqit%40alap3.anarazel.de
[2]
https://www.postgresql.org/message-id/CALfoeiuuLe12PuQ%3DzvM_L7B5qegBqGHYENHUGbLOsjAnG%3Dmi4A%40mail.gmail.com
