Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From: Damir Belyalov <dam(dot)bel07(at)gmail(dot)com>
To: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, a(dot)lepikhov(at)postgrespro(dot)ru
Subject: Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Date: 2022-08-24 16:54:11
Message-ID: CALH1Lgv5QeYSWYDgMBd=wvKum8egiaMR-_77VGsiHPt7E8dK9Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

>
> > + /* Buffer was filled, commit subtransaction and prepare
> to replay */
> > + ReleaseCurrentSubTransaction();
> What is actually being committed by this ReleaseCurrentSubTransaction()?
> It seems to me that just safecstate->replay_buffer is fulfilled before
> this commit.
>
All tuples are collected in replay_buffer, which is created in
''oldcontext'' of CopyFrom() (not context of a subtransaction). That's why
it didn't clean up when you used RollbackAndReleaseCurrentSubTransaction().
Subtransactions are needed for better safety. There is no error when
copying from a file to the replay_buffer, but an error may occur at the
next stage - when adding a tuple to the table. Also there may be other
errors that are not obvious at first glance.

I feel it might be better to have it as a parameter rather than fixed at
> 1000.
>
Yes, I thought about it too. So I created another version with the GUC
parameter - replay_buffer_size. Attached it. But I think users won't need
to change replay_buffer_size.
Also replay_buffer does the same thing as MultiInsert buffer does and
MultiInsert buffer defined by const = 1000.

NextCopyFrom() is in copyfromparse.c while safeNextCopyFrom() is in
> copyfrom.c.
> Since safeNextCopyFrom() is analog of NextCopyFrom() as commented, would
> it be natural to put them in the same file?
>
Sure, corrected it.

> > 188 + safecstate->errors++;
> > 189 + if (safecstate->errors <= 100)
> > 190 + ereport(WARNING,
> > 191 + (errcode(errdata->sqlerrcode),
> > 192 + errmsg("%s", errdata->context)));
>
> It would be better to state in the manual that errors exceeding 100 are
> not displayed.
> Or, it might be acceptable to output all errors that exceed 100.
>
It'll be too complicated to create a new parameter just for showing the
given number of errors. I think 100 is an optimal size.

> > +typedef struct SafeCopyFromState
> > +{
> > +#define REPLAY_BUFFER_SIZE 1000
> > + HeapTuple replay_buffer[REPLAY_BUFFER_SIZE]; /* accumulates
> > tuples for replaying it after an error */
> > + int saved_tuples; /* # of tuples in
> > replay_buffer */
> > + int replayed_tuples; /* # of tuples was
> > replayed from buffer */
> > + int errors; /* total # of errors */
> > + bool replay_is_active;
> > + bool begin_subtransaction;
> > + bool processed_remaining_tuples; /* for case of
> > replaying last tuples */
> > + bool skip_row;
>
> It would be helpful to add comments about skip_row, etc.
>
Corrected it.

> WARNING: FIND 0 ERRORS
> When there were no errors, this WARNING seems not necessary.
>
Corrected it.

Add to this patch processing other errors and constraints and tests for
them.
I had to create another function safeExecConstraints() only for processing
constraints, because ExecConstraints() is after NextCopyFrom() and is not
in PG_TRY. This thing a little bit complicated the code.
Maybe it is a good approach to create a new function SafeCopyFrom() and do
all ''safe copying'' in PG_TRY, but it will almost duplicate the CopyFrom()
code.
Or maybe create a function only for loop for(;;). But we have the same
thing with duplicating code and passing a lot of variables (which are
created at the beginning of CopyFrom()) to this function.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2022-08-24 16:56:03 Re: [PATCH] Optimize json_lex_string by batching character copying
Previous Message Nathan Bossart 2022-08-24 16:46:24 Re: O(n) tasks cause lengthy startups and checkpoints