Re: Tid scan improvements

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Edmund Horner <ejrh00(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Tid scan improvements
Date: 2018-11-22 07:41:10
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

On Mon, 12 Nov 2018 at 17:35, Edmund Horner <ejrh00(at)gmail(dot)com> wrote:
> Hi, here's the new patch(s).
> Mostly the same, but trying to address your comments from earlier as
> well as clean up a few other things I noticed.

Thanks for making those changes.

I've now had a look over the latest patches and I've found a few more
things. Many of these are a bit nitpicky, but certainly not all. I
also reviewed 0004 this time.


1. The row estimates are not quite right. This cases the row
estimation to go the wrong way for isgt.

For example, the following gets 24 rows instead of 26.

postgres=# create table t (a int);
postgres=# insert into t select generate_Series(1,100);
INSERT 0 100
postgres=# analyze t;
postgres=# explain analyze select * from t where ctid >= '(0,75)';
Seq Scan on t (cost=0.00..2.25 rows=24 width=4) (actual
time=0.046..0.051 rows=26 loops=1)
Filter: (ctid >= '(0,75)'::tid)
Rows Removed by Filter: 74
Planning Time: 0.065 ms
Execution Time: 0.074 ms
(5 rows)

The < and <= case is not quite right either. < should have 1 fewer
tuple than the calculated average tuples per page, and <= should have
the same (assuming no gaps)

I've attached a small delta patch that I think improves things here.


2. You should test for a non-empty List with list != NIL

* If no quals were specified, then a complete scan is assumed. Make a
* TidExpr with an empty list of TidOpExprs.
if (!node->tidquals)

Also, can you not just return after that if test? I think the code
would be easier to read with it like that.

3. I'd rather see EnsureTidRangeSpace() keep doubling the size of the
allocation until it reaches the required size. See how
MakeSharedInvalidMessagesArray() does it. Doing it this way ensures
we always have a power of two sized array which is much nicer if we
ever reach the palloc() limit as if the array is sized at the palloc()
limit / 2 + 1, then if we try to double it'll fail. Of course, it's
unlikely to be a problem here, but... the question would be how to
decide on the initial size.

4. "at" needs shifted left a couple of words

* If the lower bound was already or above at the maximum block
* number, then there is no valid range.

but I don't see how it could be "or above". The ctid type does not
have the room for that. Although, that's not to say you should test if
(block == MaxBlockNumber), the >= seems better for the code. I'm just
complaining about the comment.

5. TidInArrayExprEval() lacks a header comment, and any other comments
to mention what it does. The function args also push over the 80 char
line length. There's also a few other functions in nodeTidscan.c that
are missing a header comment.

6. In MergeTidRanges(), you have:

ItemPointerData a_last = a->last;
ItemPointerData b_last;

if (!ItemPointerIsValid(&a_last))
a_last = a->first;

but I don't see anywhere you're setting ->last to an invalid item
pointer. Is this left over from a previous design of the range scan?
It looks like in TidExprEval() you're setting the upperbound to the
last page on the relation.

7. "fist" -> "first"

* If the tuple is in the fist block of the range and before the first

8. tss_TidRangePtr is a pretty confusingly named field.

if (node->tss_TidRangePtr >= numRanges || node->tss_TidRangePtr < 0)

I'd expect anything with ptr in it to be a pointer, but this seems to
be an array index. Maybe "idx" is better than "ptr", or take note from
nodeAppend.c and have something like "tts_whichRange".

UPDATE: I see you've just altered what's there already. Perhaps it's
okay to leave it as you have it, but it's still not ideal.

9. This comment seems to indicate that a range can only have one
bound, but that does not seem to be the case.

* Ranges with only one item -- including one resulting from a
* CURRENT-OF qual -- are handled by looking up the item directly.

It seems open bounded ranges just have the lowest or highest possible
value for a ctid on the open side.

Perhaps the comment could be written as:

* For ranges containing a single tuple, we can simply make an
* attempt to fetch the tuple directly.

10. In cost_tidscan() I think you should ceil() the following:

double pages = selectivity * baserel->pages;

Otherwise, you'll end up partially charging a seq_page_cost, which
seems pretty invalid since you can't partially read a page.

11. In the comment:

/* TODO decide what the costs should be */

I think you can just explain why you're charging 1 random_page_cost
and the remainder in seq_page_cost. Or is there something left to do
here that I've forgotten about?

12. expected_comparison_operator is a bit long a name:

IsTidComparison(OpExpr *node, int varno, Oid expected_comparison_operator)

How about just expected_opno?

13. !rlst -> rlst != NIL

/* if no range qual was found, look for any other TID qual */
if (!rlst)

(Yeah I know there's various cases where it's done incorrectly there
already :-( )

14. This is not great:

#define IsTidEqualClause(node, varno) IsTidComparison(node, varno,
#define IsTidLTClause(node, varno) IsTidComparison(node, varno, TIDLessOperator)
#define IsTidLEClause(node, varno) IsTidComparison(node, varno,
#define IsTidGTClause(node, varno) IsTidComparison(node, varno,
#define IsTidGEClause(node, varno) IsTidComparison(node, varno,

#define IsTidRangeClause(node, varno) (IsTidLTClause(node, varno) || \
IsTidLEClause(node, varno) || \
IsTidGTClause(node, varno) || \
IsTidGEClause(node, varno))

The 4 macros for >, >=, < and <= are only used by IsTidRangeClause()
which means IsTidComparison() could get called up to 4 times. Most of
the work it does would be redundant in that case. Maybe it's better
to rethink that?

15. There's no field named NumTids:

* TidRanges evaluated item pointers (array of size NumTids)


16. I think the following comment needs to be updated:

/* start from last page of the scan */


/* When scanning the whole relation, start from the last page of the scan */

and drop:

/* Scanning the full relation: start just before start block. */

then maybe change:

/* Scanning a restricted range: start at end of range. */


/* Otherwise, if scanning just a subset of the relation, start at the
final block in the range */


17. Can you make a few changed to build_tidscan_pathkeys():

a. build_index_pathkeys() uses ScanDirectionIsBackward(scandir), can
you set the opno based on that rather than doing "direction ==
b. varexpr can be an Expr and just be named expr. Please move the
declaration and assignment out onto separate lines and wrap the long
c. wrap long line with the call to build_expression_pathkey(). Get rid
of the (Expr *) cast.

18. I'd expect the following not to produce a sort above the Tid Scan.

postgres=# set enable_seqscan=0;
postgres=# explain select * from t inner join t t1 on t.ctid = t1.ctid
where t.ctid < '(0,10)' ;
Merge Join (cost=10000000008.65..10000000009.28 rows=9 width=8)
Merge Cond: (t.ctid = t1.ctid)
-> Sort (cost=3.33..3.35 rows=9 width=10)
Sort Key: t.ctid
-> Tid Scan on t (cost=0.00..3.18 rows=9 width=10)
TID Cond: (ctid < '(0,10)'::tid)
-> Sort (cost=10000000005.32..10000000005.57 rows=100 width=10)
Sort Key: t1.ctid
-> Seq Scan on t t1 (cost=10000000000.00..10000000002.00
rows=100 width=10)
(9 rows)

On looking at why the planner did this, I see it's down to how you've
coded create_tidscan_paths(). You're creating a tidpath if there's any
quals or any useful pathkeys useful to the query's ORDER BY, but only
including the pathkeys if they're useful for the query's ORDER BY. I
think it'll be better to include the forward pathkeys in all cases,
and just make it a backward Tid Scan if backward keys are useful for
the ORDER BY. There's still a problem with this as a Merge Join
would need a Sort if there was an ORDER BY ctid DESC for one relation
even if the other relation had some valid ctid quals since the 2nd
scan would create a forward Tid Scan. Maybe that's not worth worrying
about. The only fix I can imagine is to always create a forward and
backward Tid Scan path, which is pretty bad as it's two more paths
that likely won't get used 99.9% of the time.

This also caused me to notice the costs are pretty broken for this:

postgres=# explain select * from t order by ctid;
Tid Scan on t (cost=0.00..0.00 rows=100 width=10)
(1 row)

19. Looks like the ScanDirection's normally get named "scandir":

static TidScan *make_tidscan(List *qptlist, List *qpqual, Index scanrelid,
List *tidquals, ScanDirection direction);

Likewise for the various .h files you've added that as a new field to
various structs.

Setting back to waiting on author.

David Rowley
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
fixes_for_v4_0001.diff application/octet-stream 1.6 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2018-11-22 07:53:21 Re: [Todo item] Add entry creation timestamp column to pg_stat_replication
Previous Message Pavan Deolasee 2018-11-22 06:44:43 Re: MERGE SQL statement for PG12