Re: [HACKERS] Proposal: Local indexes for partitioned table

From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Maksim Milyutin <milyutinma(at)gmail(dot)com>
Subject: Re: [HACKERS] Proposal: Local indexes for partitioned table
Date: 2018-01-10 15:37:07
Message-ID: 20180110153707.x4uklha5qupwh6qb@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

David Rowley wrote:

> Hi Álvaro,

Hi David,

Thanks for the review. Attached is a delta patch that fixes most
things, except your item 14 below.

Before getting into the details of the items you list, in this version I
also fixed a couple of issues noted by Jaime Casanova; namely

a) ALTER INDEX SET TABLESPACE threw an error for partitioned indexes.
Handled the same was as for partitioned tables: silently do nothing.

b) Expression indexes not considered at all. (You also commented on
this point). Jaime reported it for this case:
create index on t_part USING gin (string_to_array(btrim((t)::text, '-'::text), '--'::text));

I dealt with this by adding code to CompareIndexInfo that runs equal()
on the expression lists, after applying map_variable_attnos() (to
support the case of attaching a table with dropped or reordered
columns). As far as I can see, this works correctly for very simple
expressions, though I didn't test Jaime's case specifically; I suppose I
should add a few more tests.

On to your notes:

> 1. "either partition" -> "either the partition"
> 4. Extra newline
> 8. Missing if (attachInfos) pfree(attachInfos);
> 10. lock -> locks:
> 11. "already the right" -> "already in the right"
> 16. It's not exactly a problem, but I had a quick look and I don't see
> any other uses of list_free() checking for a non-NIL list first:

Fixed.

> 2. In findDependentObjects(), should the following if test be below
> the ReleaseDeletionLock call?
>
> + if (foundDep->deptype == DEPENDENCY_INTERNAL_AUTO)
> + break;
> ReleaseDeletionLock(object);

No, as far as I understand we should keep that lock if we break out of
the loop -- it's the lock on the object being deleted (the index
partition in this case).

I did break the comment in two, so that now the "First," paragraph
appears below the "if () break" test. That seems to make more sense.
It looks like this now:

/*
* 3. Not all the owning objects have been visited, so
* transform this deletion request into a delete of this
* owning object.
*
* For INTERNAL_AUTO dependencies, we don't enforce this;
* in other words, we don't follow the links back to the
* owning object.
*/
if (foundDep->deptype == DEPENDENCY_INTERNAL_AUTO)
break;

/*
* First, release caller's lock on this object and get
* deletion lock on the owning object. (We must release
* caller's lock to avoid deadlock against a concurrent
* deletion of the owning object.)
*/
ReleaseDeletionLock(object);
AcquireDeletionLock(&otherObject, 0);

> 3. The following existing comment indicates that we'll be creating a
> disk file for the index. Should we update this comment to say that
> this is the case only for RELKIND_INDEX?

> Maybe:
>
> /*
> * create the index relation's relcache entry and for non-partitioned
> * indexes, a physical disk file. (If we fail further down, it's the
> * smgr's responsibility to remove the disk file again.)
> */

I used this:
/*
* create the index relation's relcache entry and, if necessary, the
* physical disk file. (If we fail further down, it's the smgr's
* responsibility to remove the disk file again, if any.)
*/

> 5. Do you think it's out of scope to support expression indexes?

Answered above.

> 6. pg_inherits.inhseqno is int, not smallint. I understand you copied
> this from StoreCatalogInheritance1(), so maybe a fix should be put in
> before this patch is?

Hmm, yeah, will fix.

> 7. Is the following a bug fix? If so, should it not be fixed and
> backpatched, rather than sneaked in here?
>
> @@ -10119,6 +10195,7 @@ ATExecChangeOwner(Oid relationOid, Oid
> newOwnerId, bool recursing, LOCKMODE lock
> * relation, as well as its toast table (if it has one).
> */
> if (tuple_class->relkind == RELKIND_RELATION ||
> + tuple_class->relkind == RELKIND_PARTITIONED_TABLE ||

No, it's not a bug fix -- this block is only concerned with searching
index for this relkind, and prior to this patch PARTITIONED_TABLEs do
not have indexes, so it's ok not to consider the case.

> 9. The first OR branch of the Assert will always be false, so might as
> well get rid of it.

True. Removed.

> 12. It seems to be RangeVarGetRelidExtended() that locks the partition
> index, not the callback.
>
> /* no deadlock risk: our callback above already acquired the lock */

Hmm ... yeah, you're right. Fixed.

> 13. We seem to only be holding an AccessShareLock on the partitioned
> table. What happens if someone concurrently does ALTER TABLE
> partitioned_table DETACH our_partition;?
>
> parentTbl = relation_open(parentIdx->rd_index->indrelid, AccessShareLock);
>
> and;
> /* Make sure it indexes a partition of the other index's table */
> partDesc = RelationGetPartitionDesc(parentTbl);
> found = false;
> for (i = 0; i < partDesc->nparts; i++)
> {
> if (partDesc->oids[i] == state.partitionOid)
> {
> found = true;
> break;
> }
> }
>
> Wouldn't you also need an AccessExclusiveLock to prevent another
> session trying to ALTER INDEX part_index ATTACH PARTITION leaf_index;
> at the same time? The concurrent session could be trying to ATTACH it
> to a subpartition and this comment could be attaching to the parent.

But AccessShare does conflict with AccessExclusive. So even with just
AccessShare we've already prevented others from detaching/attaching
partitions.

I think I should spend a bit of time going over the locking
considerations again and make sure it is all sound.

> 14. validatePartitionedIndex does not verify that the index it is
> checking is valid. Consider the following:
>
> create table part (a int, b int) partition by list(a);
> create table part1 partition of part for values in(1) partition by list(b)
> create table part1 partition of part for values in(1) partition by list(b);
> create table part2 partition of part1 for values in (1);
> create index on only part1 (a);
> create index on only part (a);
> alter index part_a_idx attach partition part1_a_idx;
> \d part
> Table "public.part"
> Column | Type | Collation | Nullable | Default
> --------+---------+-----------+----------+---------
> a | integer | | |
> b | integer | | |
> Partition key: LIST (a)
> Indexes:
> "part_a_idx" btree (a) <--- should be INVALID
> Number of partitions: 1 (Use \d+ to list them.)
>
> \d part1
> Table "public.part1"
> Column | Type | Collation | Nullable | Default
> --------+---------+-----------+----------+---------
> a | integer | | |
> b | integer | | |
> Partition of: part FOR VALUES IN (1)
> Partition key: LIST (b)
> Indexes:
> "part1_a_idx" btree (a) INVALID
> Number of partitions: 1 (Use \d+ to list them.)
>
> But that probably means you also need more code to search up to parent
> partitions and try and validate any invalid indexes once a
> sub-partition index is made valid.
>
> As, if the part_a_idx had correctly been marked as INVALID, and then we do:
>
> CREATE INDEX part2_a_idx ON part2 (a);
> ALTER INDEX part1_a_idx ATTACH PARTITION part2_a_idx;
>
> This would validate part1_a_idx, but we'd also then need to attempt
> validation on part_a_idx. (I'm still reading, so perhaps you've
> already thought of that.)

In prior incarnations of the patch, I had an if-test to prevent
attaching invalid indexes, but I decided to remove it at some point --
mainly thinking of attaching a partition for which a CREATE INDEX
CONCURRENTLY was running which already had the index as invalid and was
later expected to become valid. I suppose that doesn't really work
anyway because of locking considerations (you can't attach a partition
in which CIC is concurrently running, can you). I'll think some more
about this case and post an updated version later.

> 15. Can you explain why you do the following for CREATE INDEX, but not
> for CREATE INDEX IF NOT EXISTS?

I cannot -- just an oversight. Fixed. It's true that makeNode() fixes
the hole, but we're supposed to be strict about initializing structures.

> 17. Just a stylistic thing; I think it would be neater to set isindex
> using the existing switch(). It would just need some case shuffling.

> Or maybe it's even better to set relation->rd_amroutine->amoptions
> into a local variable and use it unconditionally in the call to
> extractRelOptions().

Yeah, that last suggestion leads to simpler code, so I did it that way.

> 18. At most it can't do any harm, but is the following still needed? I
> originally thought this would have been for pg_index changes. What's
> changed?
>
> -#define CATALOG_VERSION_NO 201712251
> +#define CATALOG_VERSION_NO 201712291

Yeah, probably not needed anymore. Will lose it.

> The tests seem fine, but I'd like to see a test for #14 once that's fixed.

Sure thing.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
lpi.v12-delta.patch text/plain 14.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2018-01-10 15:49:18 Re: Rangejoin rebased
Previous Message Alexander Korotkov 2018-01-10 15:17:20 Re: CUBE seems a bit confused about ORDER BY