Re: BRIN indexes - TRAP: BadArgument

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, Emanuel Calvo <3manuek(at)esdebian(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BRIN indexes - TRAP: BadArgument
Date: 2014-10-06 22:33:59
Message-ID: 20141006223359.GL7043@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> On 09/23/2014 10:04 PM, Alvaro Herrera wrote:
> >+ <para>
> >+ The <acronym>BRIN</acronym> implementation in <productname>PostgreSQL</productname>
> >+ is primarily maintained by &Aacute;lvaro Herrera.
> >+ </para>
>
> We don't usually have such verbiage in the docs. The GIN and GiST
> pages do, but I think those are a historic exceptions, not something
> we want to do going forward.

Removed.

> >+ <variablelist>
> >+ <varlistentry>
> >+ <term><function>BrinOpcInfo *opcInfo(void)</></term>
> >+ <listitem>
> >+ <para>
> >+ Returns internal information about the indexed columns' summary data.
> >+ </para>
> >+ </listitem>
> >+ </varlistentry>
>
> I think you should explain what that internal information is. The
> minmax-19a.patch adds the type OID argument to this; remember to
> update the docs.

Updated.

> In SP-GiST, the similar function is called "config". It might be
> good to use the same name here, for consistency across indexams
> (although I actually like the "opcInfo" name better than "config")

Well, I'm not sure that there's any value in being consistent if the new
name is better than the old one. Most likely, a person trying to
implement an spgist opclass wouldn't try to do a brin opclass at the
same time, so it's not like there's a lot of value in being consistent
there, anyway.

> The docs for the other support functions need to be updated, now
> that you changed the arguments from DeformedBrTuple to BrinValues.

Updated.

> >+ <!-- this needs improvement ... -->
> >+ To implement these methods in a generic ways, normally the opclass
> >+ defines its own internal support functions. For instance, minmax
> >+ opclasses add the support functions for the four inequality operators
> >+ for the datatype.
> >+ Additionally, the operator class must supply appropriate
> >+ operator entries,
> >+ to enable the optimizer to use the index when those operators are
> >+ used in queries.
>
> The above needs improvement ;-)

I rechecked and while I tweaked it here and there, I wasn't able to add
much more to it.

> >+ BRIN indexes (a shorthand for Block Range indexes)
> >+ store summaries about the values stored in consecutive table physical block ranges.
>
> "consecutive table physical block ranges" is quite a mouthful.

I reworded this introduction. I hope it makes more sense now.

> >+ For datatypes that have a linear sort order, the indexed data
> >+ corresponds to the minimum and maximum values of the
> >+ values in the column for each block range,
> >+ which support indexed queries using these operators:
> >+
> >+ <simplelist>
> >+ <member><literal>&lt;</literal></member>
> >+ <member><literal>&lt;=</literal></member>
> >+ <member><literal>=</literal></member>
> >+ <member><literal>&gt;=</literal></member>
> >+ <member><literal>&gt;</literal></member>
> >+ </simplelist>
>
> That's the built-in minmax indexing strategy, yes, but you could
> have others, even for datatypes with a linear sort order.

I "fixed" this by removing this list. It's not possible to be
comprehensive here, I think, and anyway I don't think there's much
point.

> >+ To find out the index tuple for a particular page range, we have an internal
>
> s/find out/find/
>
> >+ new heap tuple contains null values but the index tuple indicate there are no
>
> s/indicate/indicates/

Both fixed.

> >+ Open questions
> >+ --------------
> >+
> >+ * Same-size page ranges?
> >+ Current related literature seems to consider that each "index entry" in a
> >+ BRIN index must cover the same number of pages. There doesn't seem to be a
>
> What is the related literature? Is there an academic paper or
> something that should be cited as a reference for BRIN?

I the original "minmax-proposal" file, I had these four URLs:

: Other database systems already have similar features. Some examples:
:
: * Oracle Exadata calls this "storage indexes"
: http://richardfoote.wordpress.com/category/storage-indexes/
:
: * Netezza has "zone maps"
: http://nztips.com/2010/11/netezza-integer-join-keys/
:
: * Infobright has this automatically within their "data packs" according to a
: May 3rd, 2009 blog post
: http://www.infobright.org/index.php/organizing_data_and_more_about_rough_data_contest/
:
: * MonetDB also uses this technique, according to a published paper
: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.2662
: "Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS"

I gave them all a quick look and none of them touches the approach in
detail; in fact other than the Oracle Exadata one, they are all talking
about something else and mention the "minmax" stuff only in passing. I
don't think any of them is worth citing.

> >+ * TODO
> >+ * * ScalarArrayOpExpr (amsearcharray -> SK_SEARCHARRAY)
> >+ * * add support for unlogged indexes
> >+ * * ditto expressional indexes
>
> We don't have unlogged indexes in general, so no need to list that
> here. What would be needed to implement ScalarArrayOpExprs?

Well, it requires a different way to handle ScanKeys. Anyway the
queries that it is supposed to serve can already be served in some other
ways for AMs that don't have amsearcharray, so I don't think it's a huge
loss if we don't implement it. We can add that later.

> I didn't realize that expression indexes are still not supported.
> And I see that partial indexes are not supported either. Why not? I
> wouldn't expect BRIN to need to care about those things in
> particular; the expressions for an expressional or partial index are
> handled in the executor, no?

Yeah; those restrictions were leftovers from back when I didn't really
know how they were supposed to be implemented. I took out the
restrictions and there wasn't anything else required to support both
these features.

> >+ /*
> >+ * A tuple in the heap is being inserted. To keep a brin index up to date,
> >+ * we need to obtain the relevant index tuple, compare its stored values with
> >+ * those of the new tuple; if the tuple values are consistent with the summary
> >+ * tuple, there's nothing to do; otherwise we need to update the index.
>
> s/compare/and compare/. Perhaps replace one of the semicolons with a
> full stop.

Fixed.

> >+ * If the range is not currently summarized (i.e. the revmap returns InvalidTid
> >+ * for it), there's nothing to do either.
> >+ */
> >+ Datum
> >+ brininsert(PG_FUNCTION_ARGS)
>
> There is no InvalidTid, as a constant or a #define. Perhaps replace
> with "invalid item pointer".

Fixed -- actually it doesn't return invalid TID anymore, only NULL.

> >+ /*
> >+ * XXX We need to know the size of the table so that we know how long to
> >+ * iterate on the revmap. There's room for improvement here, in that we
> >+ * could have the revmap tell us when to stop iterating.
> >+ */
>
> The revmap doesn't know how large the table is. Remember that you
> have to return all blocks that are not in the revmap, so you can't
> just stop when you reach the end of the revmap. I think the current
> design is fine.

Yeah, I was leaning towards the same conclusion myself. I have removed
the comment. (We could think about having brininsert update the
metapage so that the index keeps track of what's the last heap page,
which could help us support this, but I'm not sure there's much point.
Anyway we can tweak this later.)

> I have to stop now to do some other stuff. Overall, this is in
> pretty good shape. In addition to little cleanup of things I listed
> above, and similar stuff elsewhere that I didn't read through right
> now, there are a few medium-sized items I'd still like to see
> addressed before you commit this:
>
> * expressional/partial index support
> * the difficulty of testing the union support function that we
> discussed earlier

I added an USE_ASSERTION-only block in brininsert that runs the union
support proc and compares the output with the one from regular addValue.
I haven't tested this too much yet.

> * clarify the memory context stuff of support functions that we also
> discussed earlier

I re-checked this stuff. Turns out that the support functions don't
palloc/pfree memory too much, except to update the stuff stored in
BrinValues, by using datumCopy(). This memory is only freed when we
need to update a previous Datum. There's no way for the brin.c code to
know when the Datum is going to be released by the support proc, and
thus no way for a temp context to be used.

The memory context experiments I alluded to earlier are related to
pallocs done in brininsert / bringetbitmap themselves, not in the
opclass-provided support procs. All in all, I don't think there's much
room for improvement, other than perhaps doing so in brininsert/
bringetbitmap. Don't really care too much about this either way.

Once again, many thanks for the review. Here's a new version. I have
added operator classes for int8, text, and actually everything that btree
supports except:
bool
record
oidvector
anyarray
tsvector
tsquery
jsonb
range

since I'm not sure that it makes sense to have opclasses for any of
these -- at least not regular minmax opclasses. There are some
interesting possibilities, for example for range types, whereby we store
in the index tuple the union of all the range in the block range.

(I had an opclass for anyenum too, but on further thought I removed it
because it is going to be pointless in nearly all cases.)

contrib/pageinspect/Makefile | 2 +-
contrib/pageinspect/brinfuncs.c | 410 +++++++++++
contrib/pageinspect/pageinspect--1.2.sql | 37 +
contrib/pg_xlogdump/rmgrdesc.c | 1 +
doc/src/sgml/brin.sgml | 498 +++++++++++++
doc/src/sgml/filelist.sgml | 1 +
doc/src/sgml/indices.sgml | 36 +-
doc/src/sgml/postgres.sgml | 1 +
src/backend/access/Makefile | 2 +-
src/backend/access/brin/Makefile | 18 +
src/backend/access/brin/README | 179 +++++
src/backend/access/brin/brin.c | 1116 ++++++++++++++++++++++++++++++
src/backend/access/brin/brin_minmax.c | 320 +++++++++
src/backend/access/brin/brin_pageops.c | 712 +++++++++++++++++++
src/backend/access/brin/brin_revmap.c | 473 +++++++++++++
src/backend/access/brin/brin_tuple.c | 553 +++++++++++++++
src/backend/access/brin/brin_xlog.c | 319 +++++++++
src/backend/access/common/reloptions.c | 7 +
src/backend/access/heap/heapam.c | 22 +-
src/backend/access/rmgrdesc/Makefile | 3 +-
src/backend/access/rmgrdesc/brindesc.c | 112 +++
src/backend/access/transam/rmgr.c | 1 +
src/backend/catalog/index.c | 24 +
src/backend/replication/logical/decode.c | 1 +
src/backend/storage/page/bufpage.c | 179 ++++-
src/backend/utils/adt/selfuncs.c | 74 +-
src/include/access/brin.h | 52 ++
src/include/access/brin_internal.h | 87 +++
src/include/access/brin_page.h | 70 ++
src/include/access/brin_pageops.h | 36 +
src/include/access/brin_revmap.h | 39 ++
src/include/access/brin_tuple.h | 97 +++
src/include/access/brin_xlog.h | 107 +++
src/include/access/heapam.h | 2 +
src/include/access/reloptions.h | 3 +-
src/include/access/relscan.h | 4 +-
src/include/access/rmgrlist.h | 1 +
src/include/catalog/index.h | 8 +
src/include/catalog/pg_am.h | 2 +
src/include/catalog/pg_amop.h | 164 +++++
src/include/catalog/pg_amproc.h | 245 +++++++
src/include/catalog/pg_opclass.h | 32 +
src/include/catalog/pg_opfamily.h | 28 +
src/include/catalog/pg_proc.h | 38 +
src/include/storage/bufpage.h | 2 +
src/include/utils/selfuncs.h | 1 +
src/test/regress/expected/opr_sanity.out | 14 +-
src/test/regress/sql/opr_sanity.sql | 7 +-
48 files changed, 6122 insertions(+), 18 deletions(-)

(I keep naming the patch file "minmax", but nothing in the code is
actually called that way anymore, except the opclasses).

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
minmax-20.patch text/x-diff 216.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Fetter 2014-10-06 22:55:47 Re: [RFC] Incremental backup v2: add backup profile to base backup
Previous Message Stephen Frost 2014-10-06 21:13:48 Re: copy.c handling for RLS is insecure