Re: Making "COPY partitioned_table FROM" faster

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, khuddleston(at)pivotal(dot)io, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Re: Making "COPY partitioned_table FROM" faster
Date: 2018-07-20 14:57:16
Message-ID: CAKJS1f-pVrk+W7brd=L3FTATOkmMwxN_byWC64eUwm0KfH4BMA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

(This email seemed to only come to me and somehow missed the mailing
list. Including the message here in its entirety)

On 20 July 2018 at 13:26, Karen Huddleston <khuddleston(at)pivotal(dot)io> wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world: tested, passed
> Implements feature: tested, passed
> Spec compliant: not tested
> Documentation: not tested
>
> This patch applied cleanly and worked as expected.
>
> Patch description:
> This patch implements the use of heap_multi_insert() for partition tables when using the COPY FROM command, benefiting the performance of COPY FROM in cases in which multiple tuples in the file are destined for the same partition.
>
> Tests:
> All tests passed make check-world and TAP tests
>
> Functionality:
> Specifically, we tried the cases in the bottom section "Exercising the code" and found all behavior as described and expected.
> Note that we did conduct performance testing for our test cases enumerated in this section using COPY FROM PROGRAM comparing the patched and unpatched code using one example with tuples all destined for the same partition and one example with tuples destined for alternating partitions. We saw a similar performance improvement to the one recorded by David in his email, including a decrease in performance for copying data with tuples alternating between destination partitions as compared to the unpatched code.
> However, in our cases below, we focus on exercising the code added as opposed to on performance.
>
> Feature feedback from a user perspective:
> There will be two differences in performance that may be perceptible to the user after the inclusion of this patch:
> 1) the relatively small decrease in performance (as compared to master) for copying data from a file or program in which the destination partition changes frequently
> 2) the stark contrast between the performance of copying data destined for the same partition and data destined for alternating partitions
> Based on the numbers in David's email the performance difference
> 27721.054 ms patched for copying data destined for the same partition vs 46151.844 ms patched for copying data destined for two different partitions alternating each row
> Because both differences could impact users in data loading, it is worth considering making this transparent to the user in some way. Because this will be noticeable to the user, and, furthermore, because it is potentially within the power of the user to make changes to their data copying strategy given this information, it might be prudent to either create a GUC allowing the user to disable heap_multi_insert for COPY FROM (if the user knows the data he/she is copying over is very varied) or to add a comment to the docs that when using COPY FROM for a partition table, it is advisable to sort the data in some manner which would optimize the number of consecutive tuples destined for the same partition.

I'd rather steer clear of any new GUCs. Instead, in the patch I've
attached, I've included some adaptive code which chooses to use or not
use multi-inserts based on the average number of tuples that have been
in the batches. I also did some tests and it does appear that the
benefits of multi-inserts come pretty early. So I ended up coding it
to enabling multi-inserts when the average tuples per batch is at
least 1.3. I originally guessed 1.5, but one of my tests was faster
with multi-inserts and the average batch size there was 1.33.

> Code Review:
> Naming of newly introduced enum:
> The CopyInsertMethod enum value CIM_MULTI_PARTITION is potentially confusing, since, for a partition table with BEFORE or INSTEAD INSERT triggers on child partitions only heap_insert is valid, requiring the additional variable part_can_multi_insert to determine if heap_multi_insert can be used or not.
> It might be more clear to name it something that indicates it is a candidate for multi-insertion, pending further review. This decouples the state of potentially applicable multi-insertion from the table being a partition table. In the future, perhaps, some other feature might make a table conditionally eligible for multi-insertion of tuples, so, it may be desirable to use this intermediate state for other kinds of tables as well.

I've renamed this to CIM_MULTI_CONDITIONAL.

> Alternatively, it might make it more clear if the variable part_can_multi_insert was clearly for a leaf partition table, since this can change from leaf to leaf, maybe named leaf_part_can_multi_insert

Renamed to leafpart_use_multi_insert. I changed from "can" to "use"
due to the adaptive code might just choose not to, even if it actually
"can" use multi-inserts.

> Code comment potential typo corrections:
> In the following comment, in the patched code line 2550
> /* ...
> * Note: It does not matter if any partitions have any volatile default
> * expression as we use the defaults from the target of the COPY command.
> */
> It seems like "volatile default expression" should be "volatile default expressions"

Changed.

> In the following comment, in the patched code starting on line 2582
>
> /* ...
> * the same partition as the previous tuple, and since we flush the
> * previous partitions buffer once the new tuple has already been
> * built, we're unable to reset the estate since we'd free the memory
> * which the new tuple is stored. To work around this we maintain a
> * secondary expression context and alternate between these when the
> ... */
> "previous partitions" should probably be "previous partition's" and
> "since we'd free the memory which the new tuple is stored" should be "since we'd free the memory in which the new tuple is stored"

Changed

> In the following comment, in the patched code starting on line 2769
>
> /*
> * We'd better make the bulk insert mechanism gets a new
> * buffer when the partition being inserted into changes.
> */
> "We'd better make the bulk insert mechanism..." should be "We'd better make sure the bulk insert mechanism..."

I didn't change this, as the code is not really making sure of that.
It's actually doing it. The comment existed like that before the
patch. I just moved it.

> Code comment clarification 1:
> In the patched code, starting on line 2553, the addition of the new OR condition makes this if statement very difficult to parse as a reader. It would be helpful to the reader if the comment above it was partially inlined in the if statement--e.g.
> if (
> /* relation has either a row-level insert trigger or an insert instead of trigger */
> (resultRelInfo->ri_TrigDesc != NULL &&
> (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
> resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
> /* relation is a partition table with a statement-level insert trigger */
> (proute != NULL && resultRelInfo->ri_TrigDesc != NULL &&
> resultRelInfo->ri_TrigDesc->trig_insert_new_table) ||
> /* relation is a foreign table or has volatile default expressions */
> resultRelInfo->ri_FdwRoutine != NULL ||
> cstate->volatile_defexprs
> )

hmm, yeah. I'm not much of a fan of that either. I ended up breaking
it out into multiple if/else if/else, just we set the insertMethod to
CIM_SINGLE in 3 separate places. I think that's okay. I imagine the
compiler will optimise it correctly, but I didn't check.

> Code comment clarification 2:
> Starting with patched code line 2689 and ending on line 2766, there are several comments we found confusing:
>
> In the following comment, in the patched code starting on line 2740,
> /*
> * Save the old ResultRelInfo and switch to the one
> * corresponding to the selected partition.
> */
> upon initially reading this comment, we thought "saving the old ResultRelInfo" meant saving the relinfo of the partition from which we are switching, however, it seems like that is happening above on line 2691, and, that that partition from which we are switching is not the sibling partition but the parent partition and that we do not, in fact, need to "switch" the resultRelInfo from one sibling to another. Related to this, on walking through the code with the partition table described in the DDL below (table foo(a int) in the section "Exercising the code")
> and copying the following data from a file 4 4 7 7 11 11 7 7 4 4, we found that the variable saved_resultRelInfo first set on line 2689
> saved_resultRelInfo = resultRelInfo;
> is always the parent partition rather than a sibling. Perhaps saved_resultRelInfo could be named something like parent_resultRelInfo, if this is a correct understanding of the code.
> Or, if this is an incorrect understanding of the code, perhaps a comment could clarify this.

> In fact, on line 2744, it appears we are "saving" the information for the leaf partition for which the current tuple is destined into the proute->partitions[]. It would be helpful to clarify this, perhaps with a comment to the effect of "it the destined partition has not already had its information initialized and saved into the proute->partitions list, do so now"

Yeah, that's existing code and I agree it's a bit confusing and
actually a bit convoluted. I've simplified it a bit and renamed the
saved_resultRelInfo to target_resultRelInfo, where target means the
target of the COPY command. This stays fixed and the "resultRelInfo"
variable changes to point to the various partitions. This saves
having to perform the restore from the saved_resultRelInfo at the end
of the loop. This means I could also get rid of the goto nexttuple
code. The goto can just continue to the start of the loop, since
there's nothing to do at the end of the loop now.

This also meant that I had to change the if test above the
ExecPartitionCheck() call. Previously the saved_resultRelInfo there
was testing if we're doing tuple routing into a partition. Using
proute instead seems to be more clear, so I change it to that. I also
changed the test for the trigger there to use the bool that I'd set
above.

> So, overall, we feel that the code from lines 2689 until 2691 and from 2740 to 2766 could use further clarification with regard to switching from parent to child partition and sibling to sibling partition as well as regarding saving relinfo for partitions that have not been seen before in the proute->partitions[]

I hope it's more clear now that I've got rid of saved_resultRelInfo.

> Exercising the code:
> -- DDL to create a partition table
> CREATE TABLE foo(a int) partition by range (a);
> CREATE TABLE foo_prt1 partition of foo for values from (1) to (5);
> CREATE TABLE foo_prt2 partition of foo for values from (5) to (10);
> CREATE TABLE foo_prt_default partition of foo DEFAULT;
>
> -- Case 1
> -- a simple partition table
> -- copying a file containing the following values for foo(a) : 4 4 7 7 11 11 7 7 4 4 in file 'f.csv'
> COPY foo FROM 'f.csv';
> -- we see that the insertMethod value is correctly set to CIM_MULTI_PARTITION
> -- the value of part_can_multi_insert is true for each of the leaf partitions
> -- heap_multi_insert is called to insert data into these leaf partitions
>
> -- Case 2
> -- a partition table with a statement level insert trigger
> -- DDL to create the statement level insert trigger
> CREATE OR REPLACE FUNCTION parent_stmt_insert_function() RETURNS TRIGGER AS $$
> BEGIN
> RETURN NEW;
> END;
> $$ LANGUAGE plpgsql;
>
> CREATE TRIGGER parent_stmt_trig
> AFTER INSERT ON foo
> REFERENCING NEW TABLE AS newtable
> FOR EACH STATEMENT execute procedure parent_stmt_insert_function();
>
> -- copying a file containing the following values for foo(a) : 4 4 7 7 11 11 7 7 4 4 in file 'f.csv'
> COPY foo FROM 'f.csv';
> -- we see that the value of insertMethod is correctly set to CIM_SINGLE due to the statement level insert trigger
> -- heap_insert is called to insert data into these leaf partitions
>
> -- Case 3
> -- a partition table with a row-level before insert trigger on exactly one child partition out of 3
> -- DDL to create the row level before insert trigger on one child partition which executes a function that inserts data into default partition
> DROP FUNCTION IF EXISTS parent_stmt_insert_function CASCADE;
> CREATE OR REPLACE FUNCTION child_row_insert_function() RETURNS TRIGGER AS $$
> BEGIN
> RETURN NEW;
> END;
> $$ LANGUAGE plpgsql;
>
> CREATE TRIGGER child_row_trig
> BEFORE INSERT ON foo_prt1
> FOR EACH ROW EXECUTE PROCEDURE child_row_insert_function();
> -- copying a file containing the following values for foo(a) : 4 4 7 7 11 11 7 7 4 4 in file 'f.csv'
> COPY foo FROM 'f.csv';
> -- we see that the insertMethod value is set to CIM_MULTI_PARTITION
> -- the value of part_can_multi_insert is false for foo_prt1, true for foo_prt2, and true for foo_prt_default
> -- heap_multi_insert is called for foo_prt2 and foo_prt_default (for the rows inserted by COPY)
> -- heap_insert is called for foo_prt1
> -- heap_insert is called for the rows inserted by our row-level trigger into foo_prt_default
>
>
> -- Case 4
> -- a partition table with a row-level after insert trigger on the root partition
> -- DDL to create the row-level after insert trigger on the root
> DROP FUNCTION IF EXISTS parent_stmt_insert_function CASCADE;
> DROP FUNCTION IF EXISTS child_row_insert_function CASCADE;
> CREATE OR REPLACE FUNCTION parent_row_insert_function() RETURNS TRIGGER AS $$
> BEGIN
> RETURN NEW;
> END;
> $$ LANGUAGE plpgsql;
>
> CREATE TRIGGER parent_row_trig
> AFTER INSERT ON foo
> FOR EACH ROW execute procedure parent_row_insert_function();
> -- copying a file containing the following values for foo(a) : 4 4 7 7 11 11 7 7 4 4 in file 'f.csv'
> COPY foo FROM 'f.csv';
> -- we see that the insertMethod is CIM_MULTI_PARTITION
> -- foo_prt1, foo_prt2, and foo_prt_default have part_can_multi_insert as true
>
> Thanks,
> Melanie & Karen
>
> The new status of this patch is: Waiting on Author

AWS m5d.large (times in milliseconds)

master v1 master vs v1
Control 26265 25201 96.0%
All 42667 30473 71.4%
Avg 4 42393 36658 86.5%
Avg 2.0 43775 39844 91.0%
Avg 2.0 43493 40305 92.7%
Avg 1.33 43219 43717 101.2%
Avg 1.0 44993 47616 105.8%

master v2 master vs v2
Control 26265 25421 96.8%
All 42667 30164 70.7%
Avg 4 42393 34288 80.9%
Avg 2.0 43775 40441 92.4%
Avg 2.0 43493 40991 94.2%
Avg 1.33 43219 43424 100.5%
Avg 1.0 44993 45204 100.5%

To test this I expanded the two Perl files into 6 files which are
formatted like:

for (my $i=0; $i < 2228222; $i++) {
print "2018-04-25 15:00:00|1|2|3|4|5\n";
print "2018-04-26 15:00:00|1|2|3|4|5\n";
print "2018-04-25 15:00:00|1|2|3|4|5\n";
print "2018-04-26 15:00:00|1|2|3|4|5\n";
print "2018-04-25 15:00:00|1|2|3|4|5\n";
print "2018-04-26 15:00:00|1|2|3|4|5\n";
print "2018-04-25 15:00:00|1|2|3|4|5\n";
print "2018-04-26 15:00:00|1|2|3|4|5\n";
}

The 2228222 is just a number I set so a 1GB table is produced.

To test I did:

\timing on
truncate table partbench_;
copy partbench_ from program $$perl ~/partbench_11111111.pl$$
delimiter '|'; -- control test
truncate table partbench;
copy partbench from program $$perl ~/partbench_11111111.pl$$ delimiter
'|'; -- all tuples in same part
truncate table partbench;
copy partbench from program $$perl ~/partbench_12222222.pl$$ delimiter
'|'; -- avg 4 tuples per part
truncate table partbench;
copy partbench from program $$perl ~/partbench_12221222.pl$$ delimiter
'|'; -- avg 2.0 tuples per part
truncate table partbench;
copy partbench from program $$perl ~/partbench_12211221.pl$$ delimiter
'|'; -- avg 2.0 tuples per part
truncate table partbench;
copy partbench from program $$perl ~/partbench_12212112.pl$$ delimiter
'|'; -- avg 1.33 tuples per part
truncate table partbench;
copy partbench from program $$perl ~/partbench_12121212.pl$$ delimiter
'|'; -- avg 1.0 tuples per part

The files are named according to what each of the 8 print statements
does. Anything with a "1" uses the 2018-04-25 timestamp and anything
with a "2" uses the 26th. So the one above is partbench_12121212.pl

Many thanks to both of you for the thorough review. I hope the
attached addresses all the concerns.

One final note: I'm not entirely convinced we need this adaptive
code, but it seems easy enough to rip it back out if it's more trouble
than it's worth. But if the other option is a GUC, then I'd rather
stick with the adaptive code, it's likely going to do much better than
a GUC since it can change itself during the copy which will be useful
when the stream contains a mix small and large sets of consecutive
tuples which belong to the same partition.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
v2-0001-Allow-multi-inserts-during-COPY-into-a-partitione.patch application/octet-stream 21.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2018-07-20 14:58:36 Re: [HACKERS] logical decoding of two-phase transactions
Previous Message Victor Wagner 2018-07-20 14:55:08 Re: Non-portable shell code in pg_upgrade tap tests