Re: [PATCH] Incremental sort (was: PoC: Partial sort)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Shaun Thomas <shaun(dot)thomas(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andreas Karlsson <andreas(at)proxel(dot)se>
Subject: Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Date: 2020-03-24 23:08:31
Message-ID: 20200324230831.5zvjpgayh35bxedp@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Mar 24, 2020 at 06:26:11PM -0400, James Coleman wrote:
>On Mon, Mar 23, 2020 at 11:44 PM Alvaro Herrera
><alvherre(at)2ndquadrant(dot)com> wrote:
>>
>> On 2020-Mar-23, James Coleman wrote:
>>
>> > 4. nodeIncrementalSort.c ExecReScanIncrementalSort: This whole chunk
>> > is suspect. I've mentioned previously I don't have a great mental
>> > model of how rescan works and its invariants (IIRC someone said it was
>> > about moving around a result set in a cursor). Regardless I'm pretty
>> > sure this code just doesn't work correctly.
>>
>> I don't think that's the whole of it. My own vague understanding of
>> ReScan is that it's there to support running a node again, possibly with
>> different parameters. For example if you have a join of an indexscan
>> on the outer side and an incremental sort on the inner side, and the
>> values from the index are used as parameters to the incremental sort,
>> then the incremental sort is going to receive ReScan calls for each of
>> the values that the index returns. Sometimes the index could give you
>> the same values as before (because there's a dupe in the index), so you
>> can just return the same values from the incremental sort; but other
>> times it's going to return different values so you need to reset the
>> incremental sort to "start from scratch" using the new values as
>> parameters.
>>
>> Now, if you have a cursor reading from the incremental sort and fetch
>> all tuples, then rewind completely and fetch all again, then that's
>> going to be a rescan as well.
>>
>> I agree with you that the code doesn't seem to implement that.
>
>I grepped the codebase for rescan, and noted this relevant info in
>src/backend/executor/README:
>
>* Rescan command to reset a node and make it generate its output sequence
>over again.
>
>* Parameters that can alter a node's results. After adjusting a parameter,
>the rescan command must be applied to that node and all nodes above it.
>There is a moderately intelligent scheme to avoid rescanning nodes
>unnecessarily (for example, Sort does not rescan its input if no parameters
>of the input have changed, since it can just reread its stored sorted data).
>
>That jives pretty well with what you're saying.
>
>The interesting thing with incremental sort, as the comments in the
>patch already note, is that even if the params haven't changed, we
>can't regenerate the same values again *unless* we know that we're
>still in the same batch, or, have only processed a single full batch
>(and the tuples are still in the full sort state) or we've
>transitioned to prefix mode and have only transferred tuples from the
>full sort state for a single prefix key group.
>
>That's a pretty narrow range of applicability of not needing to
>re-execute the entire node, at least based on my assumptions about
>when rescanning will typically happen.
>
>So, two followup questions:
>
>1. Given the narrow applicability, might it make sense to just say
>"we're only going to do a total reset and rescan and not try to
>implement a smart 'don't rescan if we don't have to'"?
>

I think that's a sensible approach.

>2. What would be a typical or good way to test this? Should I
>basically repeat many of the existing implementation tests but with a
>cursor and verify that rescanning produces the same results? That's
>probably the path I'm going to take if there are no objections. Of
>course we would need even more testing if we wanted to have the "smart
>rescan" functionality.
>

I haven't checked, but how are we testing it for the other nodes?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Cary Huang 2020-03-24 23:19:21 Include sequence relation support in logical replication
Previous Message Tom Lane 2020-03-24 23:08:15 Re: Ltree syntax improvement