Re: Making "COPY partitioned_table FROM" faster

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, khuddleston(at)pivotal(dot)io
Subject: Re: Making "COPY partitioned_table FROM" faster
Date: 2018-07-24 16:37:46
Message-ID: CANP8+jJk4vg1PXXBByX0s+HMQLHxo2ca8rKCwpJ0DaJOZkftDg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 24 July 2018 at 16:32, Melanie Plageman <melanieplageman(at)gmail(dot)com> wrote:
>
> On Fri, Jul 20, 2018 at 7:57 AM, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
> wrote:

>> One final note: I'm not entirely convinced we need this adaptive
>> code, but it seems easy enough to rip it back out if it's more trouble
>> than it's worth. But if the other option is a GUC, then I'd rather
>> stick with the adaptive code, it's likely going to do much better than
>> a GUC since it can change itself during the copy which will be useful
>> when the stream contains a mix small and large sets of consecutive
>> tuples which belong to the same partition.
>>
> Though the v2 numbers do look better, it doesn't complete alleviate the
> slow-down in the worst case. Perhaps the GUC and the adaptive code are not
> alternatives and could instead be used together. You could even make the
> actual RECHECK_MULTI_INSERT_THRESHOLD the GUC.
> However, I think that the decision as to whether or not it makes sense to do
> the adaptive code without a GUC is really based on the average use case, to
> which I can't really speak.
> The counter-argument I see, however, is that it is not acceptable to have
> completely un-optimized insertion of data into partition tables and that an
> overwhelming flood of GUCs is undesirable.

I don't see any need here for another GUC, nor even a command option.
The user has already indicated their use case to us:

We know that the common case for RANGE partitioned tables is to load
into the one current partition. We also know that the partition might
change as we load data, when the data loaded crosses the partition
boundary, so we need this optimization to be adaptive. Not all data
loads follow that rule, so we also need the adpative algorithm to shut
off for those cases.

We also know that the common case for HASH partitions is to load into
all partitions at once, since hash is specifically designed to spread
data out across partitions. It is almost never true that we would want
to load one partition at a time, so it seems easy to turn the
optimization off if we use this type of partitioning. Or better, we
need work done to improve that case also, but that is outside the
current scope of this patch.

If we have multi-level partitions, if any of the levels includes a
Hash, then turn it off.

LIST partitions are less likely to have a clear pattern, so I would
treat them like HASH and assume the data is not sorted by partition.

So for this patch, just add an "if (RANGE)" test.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Vladimir Sitnikov 2018-07-24 17:16:35 Re: Stored procedures and out parameters
Previous Message David G. Johnston 2018-07-24 16:28:01 Re: Stored procedures and out parameters