Re: PATCH: Using BRIN indexes for sorted output

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: PATCH: Using BRIN indexes for sorted output
Date: 2023-07-14 15:51:45
Message-ID: 79133b5e-5244-4059-1d4a-3748ad4a19bc@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 7/14/23 16:42, Matthias van de Meent wrote:
> On Fri, 14 Jul 2023 at 16:21, Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>
>>
>>
>>
>> On 7/11/23 13:20, Matthias van de Meent wrote:
>>> On Mon, 10 Jul 2023 at 22:04, Tomas Vondra
>>> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>>> Approximation in what sense? My understanding was we'd get a range of
>>>> distances that we know covers all rows in that range. So the results
>>>> should be accurate, no?
>>>
>>> The distance is going to be accurate only to the degree that the
>>> summary can produce accurate distances for the datapoints it
>>> represents. That can be quite imprecise due to the nature of the
>>> contained datapoints: a summary of the points (-1, -1) and (1, 1) will
>>> have a minimum distance of 0 to the origin, where the summary (-1, 0)
>>> and (-1, 0.5) would have a much more accurate distance of 1.
>>
>> Ummm, I'm probably missing something, or maybe my mental model of this
>> is just wrong, but why would the distance for the second summary be more
>> accurate? Or what does "more accurate" mean?
>>
>> Is that about the range of distances for the summary? For the first
>> range the summary is a bounding box [(-1,1), (1,1)] so all we know the
>> points may have distance in range [0, sqrt(2)]. While for the second
>> summary it's [1, sqrt(1.25)].
>
> Yes; I was trying to refer to the difference between what results you
> get from the summary vs what results you get from the actual
> datapoints: In this case, for finding points which are closest to the
> origin, the first bounding box has a less accurate estimate than the
> second.
>

OK. I think regular minmax indexes have a similar issue with
non-distance ordering, because we don't know if the min/max values are
still in the page range (or deleted/updated).

>>> The point I was making is that the summary can only approximate the
>>> distance, and that approximation is fine w.r.t. the BRIN sort
>>> algoritm.
>>>
>>
>> I think as long as the approximation (whatever it means) does not cause
>> differences in results (compared to not using an index), it's OK.
>

I haven't written any code yet, but I think if we don't try to find the
exact min/max distances for the summary (e.g. by calculating the closest
point exactly) but rather "estimates" that are guaranteed to bound the
actual min/max, that's good enough for the sorting.

For the max, this probably is not an issue, as we can just calculate
distance for the corners and use a maximum of that. At least with
reasonable euclidean distance ... in 2D I'm imagining the bounding box
summary as a rectangle, with the "max distance" being a minimum radius
of a circle containing it (the rectangle).

For min we're looking for the largest radius not intersecting with the
box, which seems harder to calculate I think.

However, now that I'm thinking about it - don't (SP-)GiST indexes
already do pretty much exactly this?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David G. Johnston 2023-07-14 15:53:29 Re: CommandStatus from insert returning when using a portal.
Previous Message Andres Freund 2023-07-14 15:42:09 XLogSaveBufferForHint() correctness and more