Re: On partitioning

From: "Amit Langote" <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: "'Claudio Freire'" <klaussfreire(at)gmail(dot)com>
Cc: "'Alvaro Herrera'" <alvherre(at)2ndquadrant(dot)com>, "'Josh Berkus'" <josh(at)agliodbs(dot)com>, "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Pg Hackers'" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: On partitioning
Date: 2014-12-15 09:53:50
Message-ID: 007201d0184d$077bd9d0$16738d70$@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Claudio Freire wrote:
> On Sun, Dec 14, 2014 at 11:12 PM, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> >> On egress you need some direct way to compare the scan quals with the
> >> partitioning values. I would imagine this to be similar to how scan
> >> quals are compared to the values stored in a BRIN index: each scan qual
> >> has a corresponding operator strategy and a scan key, and you can say
> >> "aye" or "nay" based on a small set of operations that can be run
> >> cheaply, again without any proof or running arbitrary expressions.
> >>
> >
> > My knowledge of this is far from being perfect, though to clear any
> confusions -
> >
> > As far as planning is concerned, I could not imagine how index access
> method way of pruning partitions could be made to work. Of course, I may
> be missing something.
>
> Let me be overly verbose, don't take it as patronizing, just answering
> in lots of detail why this could be a good idea to try.
>

Thanks for explaining. It helps.

> Normal indexes store a pointer for each key value of sorts. So B-Tree
> gets you a set of tids for each key, and so does GIN and hash.
>
> But BRIN is different in that it does the mapping differently. BRIN
> stores a compact, approximate representation of the set of keys within
> a page range. It can tell with some degree of (in)accuracy whether a
> key or key range could be part of that page range or not. The way it
> does this is abstracted out, but at its core, it stores a "compressed"
> representation of the key set that takes a constant amount of bits to
> store, and no more, no matter how many elements. What changes as the
> element it represents grows, is its accuracy.
>
> Currently, BRIN only supports min-max representations. It will store,
> for each page range, the minimum and maximum of some columns, and
> when
> you query it, you can compare range vs range, and discard whole page
> ranges.
>
> Now, what are partitions, if not page ranges?

Yes, I can see a partition as a page range. The fixed summary info in BRIN's terms would be range bounds in case this is a rang partition, list of values in case this is a list partition and hash value in case this is a hash partition.

There is debate on the topic but each of these partitions also happens to be a separate relation. IIUC, BRIN is an access method for a relation (say, top-level partitioned relation) that comes into play in executor if that access method survives as preferred access method by the planner. I cannot see a way to generalize it further and make it support each block range as a separate relation and then use it for partition pruning in planner. This is assuming a partitioned relation is planned as an Append node which contains a list of plans for surviving partition relations based on, say, restrict quals.

I may be thinking of BRIN as a whole as not being generalized enough but I may be wrong. Could you point out if so?

> A BRIN tuple is a min-max pair. But BRIN in more general, it could use
> other data structures to hold that "compressed representation", if
> someone implemented them. Like bloom filters [0].
>
> A BRIN index is a complex data structure because it has to account for
> physically growing tables, but all the complexities vanish when you
> fix a "block range" (the BR in BRIN) to a partition. Then, a mere
> array of BRIN tuples would suffice.
>
> BRIN already contains the machinery to turn quals into something that
> filters out entire partitions, if you provide the BRIN tuples.
>

IIUC, that machinery comes into play when, say, a Bitmap Heap scan starts, right?

> And you could even effectively matain a BRIN index for the partitions
> (just a BRIN tuple per partition, dynamically updated with every
> insertion).
>
> If you do that, you start with empty partitions, and each insert
> updates the BRIN tuple. Avoiding concurrency loss in this case would
> be tricky, but in theory this could allow very general partition
> exclusion. In fact it could even work with constraint exclusion right
> now: you'd have a single-tuple BRIN index for each partition and
> benefit from it.
>
> But you don't need to pay the price of updating BRIN indexes, as
> min-max tuples for each partition can be produced while creating the
> partitions if the syntax already provides the information. Then, it's
> just a matter of querying this meta-data which just happens to have
> the form of a BRIN tuple for each partition.
>

Thanks,
Amit

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2014-12-15 09:58:37 Re: Escaping from blocked send() reprised.
Previous Message Kyotaro HORIGUCHI 2014-12-15 09:21:16 Re: alter user/role CURRENT_USER