Quick Links

Re: Declarative partitioning - another take

From:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>
Cc:	Amit Langote <amitlangote09(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject:	Re: Declarative partitioning - another take
Date:	2016-12-12 15:17:53
Message-ID:	00449f15-fbb3-aadc-7b30-aaf5034e1515@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 12/12/2016 07:37 AM, Amit Langote wrote:
>
> Hi Tomas,
>
> On 2016/12/12 10:02, Tomas Vondra wrote:
>>
>> 2) I'm wondering whether having 'table' in the catalog name (and also in
>> the new relkind) is too limiting. I assume we'll have partitioned indexes
>> one day, for example - do we expect to use the same catalogs?
>
> I am not sure I understand your idea of partitioned indexes, but I doubt
> it would require entries in the catalog under consideration. Could you
> perhaps elaborate more?
>

OK, let me elaborate. Let's say we have a partitioned table, and I want
to create an index. The index may be either "global" i.e. creating a
single relation for data from all the partitions, or "local" (i.e.
partitioned the same way as the table).

Local indexes are easier to implement (it's essentially what we have
now, except that we need to create the indexes manually for each
partition), and don't work particularly well for some use cases (e.g.
unique constraints). This is what I mean by "partitioned indexes".

If the index is partitioned just like the table, we probably won't need
to copy the partition key info (so, nothing in pg_partitioned_table).
I'm not sure it makes sense to partition the index differently than the
table - I don't see a case where that would be useful.

The global indexes would work better for the unique constraint use case,
but it clearly contradicts our idea of TID (no information about which
partition that references).

So maybe the catalog really only needs to track info about tables? Not
sure. I'm just saying it'd be unfortunate to have _table in the name,
and end up using it for indexes too.

>
>> 4) I see GetIndexOpClass() got renamed to ResolveOpClass(). I find the new
>> name rather strange - all other similar functions start with "Get", so I
>> guess "GetOpClass()" would be better. But I wonder if the rename was even
>> necessary, considering that it still deals with index operator classes
>> (although now also in context of partition keys). If the rename really is
>> needed, isn't that a sign that the function does not belong to indexcmds.c
>> anymore?
>
> The fact that both index keys and partition keys have very similar
> specification syntax means there would be some common code handling the
> same, of which this is one (maybe, only) example. I didn't see much point
> in finding a new place for the function, although I can see why it may be
> a bit confusing to future readers of the code.
>
> About the new name - seeing the top line in the old header comment, what
> the function does is resolve a possibly explicitly specified operator
> class name to an operator class OID. I hadn't changed the name in my
> original patch, but Robert suggested the rename, which I thought made sense.
>

OK, I don't particularly like the name but I can live with that.

>> 5) Half the error messages use 'child table' while the other half uses
>> 'partition'. I think we should be using 'partition' as child tables really
>> have little meaning outside inheritance (which is kinda hidden behind the
>> new partitioning stuff).
>
> One way to go about it may be to check all sites that can possibly report
> an error involving child tables (aka "partitions") whether they know from
> the context which name to use. I think it's possible, because we have
> access to the parent relation in all such sites and looking at the relkind
> can tell whether to call child tables "partitions".
>

Clearly, this is a consequence of building the partitioning on top of
inheritance (not objecting to that approach, merely stating a fact).

I'm fine with whatever makes the error messages more consistent, if it
does not make the code significantly more complex. It's a bit confusing
when some use 'child tables' and others 'partitions'. I suspect even a
single DML command may return a mix of those, depending on where exactly
it fails (old vs. new code).

>> 6) The psql tab-completion seems a bit broken, because it only offers the
>> partitions, not the parent table. Which is usually exactly the opposite of
>> what the user wants.
>
> Actually, I have not implemented any support for tab-completion for the
> new DDL. I will work on that.
>

Aha! So the problem probably is that psql does not recognize the new
'partitioned table' relkind, and only looks at regular tables.

>> testing
>> -------
>
>> 2) When doing a SELECT COUNT(*) from the partitioned table, I get a plan
>> like this:
>>
>> QUERY PLAN
>> -------------------------------------------------------------------
>> Finalize Aggregate (cost=124523.64..124523.65 rows=1 width=8)
>> -> Gather (cost=124523.53..124523.64 rows=1 width=8)
>> Workers Planned: 1
>> -> Partial Aggregate (cost=123523.53..123523.54 rows=1 ...)
>> -> Append (cost=0.00..108823.53 rows=5880001 width=0)
>> -> Parallel Seq Scan on parent ...
>> -> Parallel Seq Scan on partition_1 ...
>> -> Parallel Seq Scan on partition_2 ...
>> -> Parallel Seq Scan on partition_3 ...
>> -> Parallel Seq Scan on partition_4 ...
>> -> Parallel Seq Scan on partition_5 ...
>> -> Parallel Seq Scan on partition_6 ...
>> ... and the rest of the 10k partitions
>>
>> So if I understand the plan correctly, we first do a parallel scan of the
>> parent, then partition_1, partition_2 etc. But AFAIK we scan the tables in
>> Append sequentially, and each partition only has 1000 rows each, making
>> the parallel execution rather inefficient. Unless we scan the partitions
>> concurrently.
>>
>> In any case, as this happens even with plain inheritance, it's probably
>> more about the parallel query than about the new partitioning patch.
>
> Yes, I have seen some discussion [2] about a Parallel Append, which would
> result in plans you're probably thinking of.
>

Actually, I think that thread is about allowing partial paths even if
only some appendrel members support it. My point is that in this case
it's a bit silly to even build the partial paths, when the partitions
only have 1000 rows each.

I kinda suspect we only look at the total appendrel rowcount estimate,
but haven't checked.

>> 3) The last issue I noticed is that while
>>
>> EXPLAIN SELECT * FROM partitioned_table WHERE id = 1292323;
>>
>> works just fine (it takes fair amount of time to plan with 10k partitions,
>> but that's expected), this
>>
>> EXPLAIN UPDATE partitioned_table SET x = 1 WHERE id = 1292323;
>>
>> allocates a lot of memory (~8GB on my laptop, before it gets killed by OOM
>> killer). Again, the same thing happens with plain inheritance-based
>> partitioning, so it's probably not a bug in the partitioning patch.
>>
>> I'm mentioning it here because I think the new partitioning will hopefully
>> get more efficient and handle large partition counts more efficiently (the
>> inheritance only really works for ~100 partitions, which is probably why
>> no one complained about OOM during UPDATEs). Of course, 10k partitions is
>> a bit extreme (good for testing, though).
>
> Plans with the inheritance parents as target tables (update/delete)
> go through inheritance_planner() in the optimizer, which currently
> has a design that is based on certain assumptions about traditional
> inheritance. It's possible hopefully to make it less expensive for
> the partitioned tables.
>

Yes, I know. I wasn't really reporting it as a bug in the partitioning
patch, but more as a rather surprising difference between plain SELECT
and UPDATE.

Am I right that one of the ambitions of the new partitioning is to
improve behavior with large number of partitions?

At first I thought it's somewhat related to the FDW sharding (each node
being a partition and having local subpartitions), but I realize the
planner will only deal with the node partitions I guess.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Re: Declarative partitioning - another take at 2016-12-12 06:37:17 from Amit Langote

Responses

Re: Declarative partitioning - another take at 2016-12-13 08:45:41 from Amit Langote

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Robert Haas	2016-12-12 15:29:27	Re: background sessions
Previous Message	Craig Ringer	2016-12-12 15:02:59	Re: background sessions