Re: Tid scan improvements

From: Edmund Horner <ejrh00(at)gmail(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Tid scan improvements
Date: 2018-10-05 01:24:50
Message-ID: CAMyN-kBRYy=DV3oQ0TD6pJYvNg1oHSTXdCTe94XJGOpJK1GeAA@mail.gmail.com
Lists: pgsql-hackers

On Wed, 3 Oct 2018 at 17:36, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> wrote:
> I know commit fest is over, but I made a pass of this to hopefully
> provide a bit of guidance so that it's closer for the November 'fest.

Hi David. Thanks for the review. It's fairly thorough and you must
have put some time into it -- I really appreciate it.

> I've only done light testing on the patch and it does seem to work,
> but there are a few things that I think should be changed. Most
> importantly #11 below I think needs to be done. That might obsolete
> some of the items that come before it in the list, as you will likely
> have to pull out some of the code I mention due to changing #11.
> I've kept them around anyway just in case some of it remains.

> 1. Could wrap for tables > 16TB. Please use double. See index_pages_fetched()
> 2. Should multiply by nseqpages, not add.
> 3. Should be double.

I agree with these three.

> 4. Comment needs updated to mention what the new code does in
> heapgettup() and heapgettup_pagemode()
>
> +
> /* start from last page of the scan */
> - if (scan->rs_startblock > 0)
> - page = scan->rs_startblock - 1;
> + if (scan->rs_numblocks == InvalidBlockNumber)
> + {
> + if (scan->rs_startblock > 0)
> + page = scan->rs_startblock - 1;
> + else
> + page = scan->rs_nblocks - 1;
> + }
> else
> - page = scan->rs_nblocks - 1;
> + page = scan->rs_startblock + scan->rs_numblocks - 1;
> +

I'm thinking that, since they don't depend on the other changes, the
heapam.c changes should be a separate preparatory patch?

The heap scan code has to support things like synchronised scans and
parallel scans, but as far as I know, its support for scanning
subranges is currently used only for building BRIN indexes. I found
that although I could specify a subrange with heap_setscanlimits, I
could not scan backward over it, because the original version of the
above code would start at the end of the whole table.

I'm not especially comfortable with this understanding of heapam, so
close review would be appreciated.

I note that there's a lot of common code in heapgettup and
heapgettup_pagemode, which my changes add to. It might be worth
trying to factor that out somehow.
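For concreteness, here's a standalone sketch of the start-page logic
from the diff above (the struct is a cut-down stand-in for the real
HeapScanDescData, not its actual definition):

```c
#include <assert.h>

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Minimal stand-ins for the heap-scan fields referenced in the diff. */
typedef struct
{
    BlockNumber rs_startblock;  /* first block of the (sub)range */
    BlockNumber rs_numblocks;   /* subrange length, or InvalidBlockNumber */
    BlockNumber rs_nblocks;     /* total blocks in the relation */
} ScanDesc;

/*
 * Where a backward scan should begin: the last page of the limited
 * range if heap_setscanlimits() was used, otherwise the page before
 * rs_startblock (wrapping around to the table's last page), mirroring
 * the patched logic in heapgettup()/heapgettup_pagemode().
 */
static BlockNumber
backward_start_page(const ScanDesc *scan)
{
    if (scan->rs_numblocks == InvalidBlockNumber)
    {
        if (scan->rs_startblock > 0)
            return scan->rs_startblock - 1;
        return scan->rs_nblocks - 1;
    }
    return scan->rs_startblock + scan->rs_numblocks - 1;
}
```

The unpatched code only ever took the first branch, which is why a
backward scan over a subrange started at the end of the whole table.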

> 5. Variables should be called "inclusive". We use "strict" to indicate
> an operator comparison cannot match NULL values.
>
> + bool strict; /* Indicates < rather than <=, or > rather */
> + bool strict2; /* than >= */
>
> Don't break the comment like that. If you need more space don't end
> the comment and use a new line and tab the next line out to match the
> * of the first line.

> 8. Many instances of the word "strict" are used to mean "inclusive".
> Can you please change all of them.

I don't mind renaming it. I took "strict" from "strictly greater/less
than" but I knew it was confusable with the other usages of "strict".

> 9. Confusing comment:
>
> + * If the expression was non-strict (<=) and the offset is 0, then just
> + * pretend it was strict, because offset 0 doesn't exist and we may as
> + * well exclude that block.
>
> Shouldn't this be, "If the operator is non-inclusive, then since TID
> offsets are 1-based, for simplicity, we can just class the expression
> as inclusive.", or something along those lines.

Ok, I'll try to reword it along those lines.
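To spell out what that normalisation does (hypothetical names; the
real code works on the qual expression, not a little struct like
this):

```c
#include <assert.h>
#include <stdbool.h>

typedef struct
{
    unsigned int block;
    unsigned int offset;     /* TID offsets are 1-based; 0 never matches */
    bool         inclusive;  /* true for <=, false for < */
} TidUpperBound;

/*
 * An upper bound "<= (b,0)" matches nothing in block b, because
 * offset 0 doesn't exist; so it can be treated as the exclusive
 * bound "< (b,0)", letting the scan stop before block b entirely.
 */
static void
normalize_upper_bound(TidUpperBound *bound)
{
    if (bound->inclusive && bound->offset == 0)
        bound->inclusive = false;
}
```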

> I think I'm going to stop here as changing this is going to cause quite a
> bit of churn.
>
> but one more...

> 12. I think the changes to selfuncs.c to get the selectivity estimate
> is a fairly credible idea, but I think it also needs to account for
> offsets. You should be able to work out the average number of items
> per page with rel->tuples / rel->pages and factor that in to get a
> better estimate for cases like:
>
> postgres=# explain analyze select ctid,* from t1 where ctid <= '(0,200)';
> QUERY PLAN
> -----------------------------------------------------------------------------------------------
> Tid Scan on t1 (cost=0.00..5.00 rows=1 width=10) (actual
> time=0.025..0.065 rows=200 loops=1)
> TID Cond: (ctid <= '(0,200)'::tid)
> Planning Time: 0.081 ms
> Execution Time: 0.088 ms
> (4 rows)
>
> You can likely add on "(offset / avg_tuples_per_page) / rel->pages" to
> the selectivity and get a fairly accurate estimate... at least when
> there are no dead tuples in the heap

I think the changes to selfuncs.c could also be a separate patch?

I'll try to include the offset in the selectivity too.

Related -- what should the selectivity be on an empty table? My code has:

/* If the relation's empty, we're going to read all of it. */
if (vardata->rel->pages == 0)
return 1.0;

(which needs rewording, since selectivity isn't always about reading).
Is 1.0 the right thing to return?
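A standalone sketch of the estimate I have in mind, offsets included
(a hypothetical helper, not the actual selfuncs.c code; tuples/pages
stand in for rel->tuples and rel->pages):

```c
#include <assert.h>
#include <math.h>

/*
 * For "ctid <= '(block,offset)'", estimate the fraction of the heap
 * scanned as the whole blocks below the bound plus a fractional
 * block derived from the offset, as suggested in #12. Accurate only
 * when there are few dead tuples.
 */
static double
tid_upper_bound_selectivity(double tuples, double pages,
                            double block, double offset)
{
    double tuples_per_page;
    double selec;

    if (pages <= 0)
        return 1.0;             /* empty relation: assume we read it all */

    tuples_per_page = tuples / pages;
    selec = (block + offset / tuples_per_page) / pages;

    /* clamp to a sane range */
    if (selec < 0.0)
        selec = 0.0;
    else if (selec > 1.0)
        selec = 1.0;
    return selec;
}
```

For the example above (100 tuples/page), "ctid <= '(0,200)'" would
come out at 2 pages' worth of tuples rather than rows=1.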

> 6. Why not pass the TidExpr into MakeTidOpExprState() and have it set
> the type instead of repeating code
> 7. It's not very obvious why the following Assert() can't fail. [...]
> I had to hunt around quite a bit to see that
> TidQualFromBaseRestrictinfo could only ever make the list have 2
> elements, and we'd not form a BoolExpr with just 1. (but see #11)
> 10. Comment talks about LHS, but the first OpExpr in a list of two
> OpExprs has nothing to do with left hand side. You could use LHS if
> you were talking about the first arg in an OpExpr, but this is not the
> case here.

These three might become non-issues if we change it along the lines of #11:

> 11. I think the qual matching code needs an overhaul. Really you
> should attempt to find the smallest and largest ctid for your
> implicitly ANDed ranges. This would require you getting rid of the
> BETWEEN-type clauses you're trying to build in
> TidQualFromBaseRestrictinfo
> and instead just include all quals, don't ignore other quals when
> you've already found your complete range bounds.
>
> The problem with doing it the way that you're doing it now is in cases like:
>
> create table t1(a int);
> insert into t1 select generate_Series(1,10000000);
> create index on t1 (a);
> select ctid,a from t1 order by a desc limit 1; -- find the max ctid.
> ctid | a
> -------------+----------
> (44247,178) | 10000000
> (1 row)
>
> set max_parallel_workers_per_gather=0;
> explain analyze select ctid,* from t1 where ctid > '(0,0)' and ctid <=
> '(44247,178)' and ctid <= '(0,1)';
> QUERY PLAN
> -----------------------------------------------------------------------------------------------------
> Tid Scan on t1 (cost=0.01..169248.78 rows=1 width=10) (actual
> time=0.042..2123.432 rows=1 loops=1)
> TID Cond: ((ctid > '(0,0)'::tid) AND (ctid <= '(44247,178)'::tid))
> Filter: (ctid <= '(0,1)'::tid)
> Rows Removed by Filter: 9999999
> Planning Time: 4.049 ms
> Execution Time: 2123.464 ms
> (6 rows)
>
> Due to how you've coded TidQualFromBaseRestrictinfo(), the ctid <=
> '(0,1)' qual does not make it into the range. It's left as a filter in
> the Tid Scan.

My first thought was to support the fairly-common use case of the
two-bound range "ctid >= ? AND ctid < ?" (or single-bound variations);
hence my original patch for a "TID Range Scan".

Following the comments made earlier, I tried incorporating this into
the existing TID Scan; but I still had the same use case in mind, so
only the first lower and upper bounds were used. My thoughts were
that, while we need to *correctly* handle more complicated cases like
"ctid > '(0,0)' AND ctid <= '(44247,178)' AND ctid <= '(0,1)'", such
queries will not come up in practice and hence it's OK if those extra
bounds are applied in the filter. For the same reason, I did not
consider it worthwhile trying to pick which bound to use in the scan.

I've since realised that such quals aren't always redundant. At plan
time we might not know which of the bounds is the "best", but we will
after evaluating them in the executor. So I quite like the idea of
keeping all of them.
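As a sketch of that executor-side idea (a hypothetical helper; the
real code would use ItemPointerData and ItemPointerCompare):

```c
#include <assert.h>
#include <stdint.h>

/* A TID reduced to a comparable 64-bit key: block high, offset low. */
typedef struct
{
    uint32_t block;
    uint16_t offset;
} ItemPointerLite;

static uint64_t
tid_key(ItemPointerLite tid)
{
    return ((uint64_t) tid.block << 16) | tid.offset;
}

/*
 * Given all of the ANDed upper bounds (kept in the TID qual rather
 * than pushed down to the filter), take the tightest one once their
 * expressions have been evaluated at runtime.
 */
static ItemPointerLite
tightest_upper_bound(const ItemPointerLite *bounds, int n)
{
    ItemPointerLite best = bounds[0];

    for (int i = 1; i < n; i++)
        if (tid_key(bounds[i]) < tid_key(best))
            best = bounds[i];
    return best;
}
```

In David's example, '(0,1)' rather than '(44247,178)' would be chosen
as the upper bound, so the scan stops after one tuple instead of
filtering out 9999999.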

This means a TID path's qual would be an OR-list of:
- "ctid = ?"
- "ctid = ANY (?)" / "ctid IN (?)"
- "(ctid op ?) AND ..." (where op is one of >,>=,<,<=)
- "CURRENT OF"

I still don't think the scan needs to support quals like "ctid = ? AND
ctid > ?", or "ctid IN (?) AND ctid IN (?)" -- the executor *could*
try to form the intersection but I don't think it's worth the code.
In these cases, picking a simple qual is usually enough for an
efficient scan; the full qual can go into the filter.

I'm part way through implementing this. It looks like it might
actually be less code than what I had before.
