Re: Tid scan improvements

From: Edmund Horner <ejrh00(at)gmail(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, tomas(dot)vondra(at)2ndquadrant(dot)com
Cc: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Tid scan improvements
Date: 2018-11-27 08:16:46
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

On Thu, 22 Nov 2018 at 20:41, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> wrote:
> I've now had a look over the latest patches and I've found a few more
> things. Many of these are a bit nitpicky, but certainly not all. I
> also reviewed 0004 this time.

Whew! A lot more things to look at.

I've tried to address most of what you've raised, and attach yet
another set of patches. There are are few things that I'm not settled
on, discussed below under Big Items.

CC'd Tomas, if he wants to check what I've done with the TidRange
array allocation.

***** Big Items *****

> 0001:
> 1. The row estimates are not quite right. This cases the row
> estimation to go the wrong way for isgt.
> For example, the following gets 24 rows instead of 26.
> postgres=# create table t (a int);
> postgres=# insert into t select generate_Series(1,100);
> INSERT 0 100
> postgres=# analyze t;
> postgres=# explain analyze select * from t where ctid >= '(0,75)';
> ---------------------------------------------------------------------------------------------
> Seq Scan on t (cost=0.00..2.25 rows=24 width=4) (actual
> time=0.046..0.051 rows=26 loops=1)
> Filter: (ctid >= '(0,75)'::tid)
> Rows Removed by Filter: 74
> Planning Time: 0.065 ms
> Execution Time: 0.074 ms
> (5 rows)
> The < and <= case is not quite right either. < should have 1 fewer
> tuple than the calculated average tuples per page, and <= should have
> the same (assuming no gaps)
> I've attached a small delta patch that I think improves things here.

Thanks, I've incorporated your patch. I think the logic for iseq and
isgt makes sense now.

Since we only have the total number of tuples and the total number of
pages, and no real statistics, this might be the best we can
reasonably do. There's still a noticable rowcount error for the last
page, and slighter rowcount errors for other pages. We estimate
density = ntuples/npages for all pages; but in a densely populated
table, we'll average only half the number of tuples in the last page
as earlier pages.

I guess we *could* estimate density = ntuples/(npages - 0.5) for all
but the last page; and half that for the last. But that adds
complexity, and you'd still only get a good row count when the last
page was about half full.

I implemented this anyway, and it does improve row counts a bit. I'll
include it in the next patch set and you can take a look.

I also spent some time today agonising over how visiblity would affect
things, but did not come up with anything useful to add to our

> 3. I'd rather see EnsureTidRangeSpace() keep doubling the size of the
> allocation until it reaches the required size. See how
> MakeSharedInvalidMessagesArray() does it. Doing it this way ensures
> we always have a power of two sized array which is much nicer if we
> ever reach the palloc() limit as if the array is sized at the palloc()
> limit / 2 + 1, then if we try to double it'll fail. Of course, it's
> unlikely to be a problem here, but... the question would be how to
> decide on the initial size.

I've tried to change things that way, but we still need to deal with
excessive numbers of items.

I've defined a constant MaxTidRanges = MaxAllocSize/sizeof(TidRange),
and raise an error if the required size exceeds that.

> 4. "at" needs shifted left a couple of words
> /*
> * If the lower bound was already or above at the maximum block
> * number, then there is no valid range.
> */
> but I don't see how it could be "or above". The ctid type does not
> have the room for that. Although, that's not to say you should test if
> (block == MaxBlockNumber), the >= seems better for the code. I'm just
> complaining about the comment.

We have to deal with TIDs entered by the user, which can include
invalid ones like (4294967295,0). MaxBlockNumber is 4294967294.

> 12. expected_comparison_operator is a bit long a name:
> IsTidComparison(OpExpr *node, int varno, Oid expected_comparison_operator)
> How about just expected_opno?
> 14. This is not great:
> [horrible macros in tidpath.c]
> The 4 macros for >, >=, < and <= are only used by IsTidRangeClause()
> which means IsTidComparison() could get called up to 4 times. Most of
> the work it does would be redundant in that case. Maybe it's better
> to rethink that?

Yeah. I've rewritten all this as two functions, IsTidEqualClause and
IsTidRangeClause, which each check the opno, with a helper function
IsTidBinaryExpression that checks everything else.

> 18. I'd expect the following not to produce a sort above the Tid Scan.
> postgres=# set enable_seqscan=0;
> postgres=# explain select * from t inner join t t1 on t.ctid = t1.ctid
> where t.ctid < '(0,10)' ;
> ---------------------------------------------------------------------------------------
> Merge Join (cost=10000000008.65..10000000009.28 rows=9 width=8)
> Merge Cond: (t.ctid = t1.ctid)
> -> Sort (cost=3.33..3.35 rows=9 width=10)
> Sort Key: t.ctid
> -> Tid Scan on t (cost=0.00..3.18 rows=9 width=10)
> TID Cond: (ctid < '(0,10)'::tid)
> -> Sort (cost=10000000005.32..10000000005.57 rows=100 width=10)
> Sort Key: t1.ctid
> -> Seq Scan on t t1 (cost=10000000000.00..10000000002.00
> rows=100 width=10)
> (9 rows)
> On looking at why the planner did this, I see it's down to how you've
> coded create_tidscan_paths(). You're creating a tidpath if there's any
> quals or any useful pathkeys useful to the query's ORDER BY, but only
> including the pathkeys if they're useful for the query's ORDER BY. I
> think it'll be better to include the forward pathkeys in all cases,
> and just make it a backward Tid Scan if backward keys are useful for
> the ORDER BY. There's still a problem with this as a Merge Join
> would need a Sort if there was an ORDER BY ctid DESC for one relation
> even if the other relation had some valid ctid quals since the 2nd
> scan would create a forward Tid Scan. Maybe that's not worth worrying
> about. The only fix I can imagine is to always create a forward and
> backward Tid Scan path, which is pretty bad as it's two more paths
> that likely won't get used 99.9% of the time.

Two paths seems excessive just to cater for these unlikely plans. We
don't provide any other support for joining on CTID.

But setting the path keys doesn't cost much, so we should do that.

> This also caused me to notice the costs are pretty broken for this:
> postgres=# explain select * from t order by ctid;
> ---------------------------------------------------
> Tid Scan on t (cost=0.00..0.00 rows=100 width=10)
> (1 row)

Yeah -- a side effect of treating empty tidquals as a scan over the
whole table. I've added costing code for this case.

***** Smaller items *****

Compacted for brevity (hope you don't mind):

> 2. You should test for a non-empty List with list != NIL [...] Also, can you not just return after that if test? I think the code
> would be easier to read with it like that.
> 5. TidInArrayExprEval() lacks a header comment [...]
> 6. In MergeTidRanges(), you have: [leftover code]
> 7. "fist" -> "first" [...]
> 8. tss_TidRangePtr is a pretty confusingly named field. [...]
> 9. This comment seems to indicate that a range can only have one bound, but that does not seem to be the case. [...]
> 10. In cost_tidscan() I think you should ceil() the following: [...]
> 11. In the comment: /* TODO decide what the costs should be */ [...]
> 13. !rlst -> rlst != NIL
> 15. There's no field named NumTids: [...]
> 16. I think the following comment needs to be updated: [heapam comments]
> 17. Can you make a few changed to build_tidscan_pathkeys(): [...]
> 19. Looks like the ScanDirection's normally get named "scandir": [...]

These are mostly trivial and I've generally gone with your recommendation.

Attachment Content-Type Size
v5-0001-Add-selectivity-and-nullness-estimates-for-CTID-syst.patch application/octet-stream 3.3 KB
v5-0004-Tid-Scan-results-are-ordered.patch application/octet-stream 20.1 KB
v5-0005-TID-selectivity-reduce-the-density-of-the-last-page-.patch application/octet-stream 1.9 KB
v5-0003-Support-backward-scans-over-restricted-ranges-in-hea.patch application/octet-stream 2.2 KB
v5-0002-Support-range-quals-in-Tid-Scan.patch application/octet-stream 62.1 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2018-11-27 08:26:55 Re: pg11.1 jit segv
Previous Message Thomas Munro 2018-11-27 08:02:29 Re: dsa_allocate() faliure