Re: Force lookahead in COPY FROM parsing

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Force lookahead in COPY FROM parsing
Date: 2021-04-01 20:47:37
Message-ID: bf73c57e-b2e3-9df6-4412-6ca552bc64a6@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 18/03/2021 17:09, John Naylor wrote:
> The cfbot thinks this patch no longer applies, but it works for me, so
> still set to RFC. It looks like it's because the thread to remove the v2
> FE/BE protocol was still attached to the commitfest entry. I've deleted
> that, so let's see if that helps.

It was now truly bitrotted by commit f82de5c46b (the encoding conversion
changes). Rebase attached.

> To answer the side question of whether it makes any difference in
> performance, I used the blackhole AM [1] to isolate the copy code path
> as much as possible. Forcing lookahead seems to not make a noticeable
> difference (min of 5 runs):
>
> master:
> 306ms
>
> force lookahead:
> 304ms
>
> I forgot to mention the small detail of what the test was:
>
> create extension blackhole_am;
> create table blackhole_tab (a text) using blackhole_am ;
> copy blackhole_tab from program 'for i in {1..100}; do cat /path/to/UTF-8\ Sampler.htm ; done;' ;
>
> ...where the .htm file is at http://kermitproject.org/utf8.html

Ok, I wouldn't expect to see much difference in that test, it gets
drowned in all the other parsing overhead. I tested this now with this:

copy (select repeat('x', 10000) from generate_series(1, 100000)) to
'/tmp/copydata-x.txt'
create table blackhole_tab (a text);

copy blackhole_tab from '/tmp/copydata-x.txt' where false;

I took the min of 5 runs of that COPY FROM statement:

master:
4107 ms

v3-0001-Simplify-COPY-FROM-parsing-by-forcing-lookahead.patch:
3172 ms

I was actually surprised it was so effective on that test, I expected a
small but noticeable gain. But I'll take it.

Replying to your earlier comments (sorry for the delay):

> Looks good to me. Just a couple minor things:
>
> + * Look at the next character. If we're at EOF, c2 will wind
> + * up as '\0' because of the guaranteed pad of raw_buf.
> */
> - IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(0);
> -
> - /* get next char */
> c = copy_raw_buf[raw_buf_ptr];
>
> The new comment seems copy-pasted from the c2 statements further down.

Fixed.

> - if (raw_buf_ptr >= copy_buf_len || need_data)
> +#define COPY_READ_LINE_LOOKAHEAD 4
> + if (raw_buf_ptr + COPY_READ_LINE_LOOKAHEAD >= copy_buf_len)
>
> Is this #define deliberately put down here rather than at the top of the file?

Yeah, it's only used right here locally. Matter of taste, but I'd prefer
to keep it here.

> - * of the buffer and then we load more data after that. This case occurs only
> - * when a multibyte character crosses a bufferload boundary.
> + * of the buffer and then we load more data after that.
>
> Is the removed comment really invalidated by this patch? I figured it was something not affected until the patch to do the encoding conversion in larger chunks.

Not sure anymore, but this is moot now, since the other patch was committed.

Thanks for the review and the testing!

- Heikki

Attachment Content-Type Size
v3-0001-Simplify-COPY-FROM-parsing-by-forcing-lookahead.patch text/x-patch 8.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2021-04-01 21:15:20 Re: OpenBSD versus semaphores
Previous Message Robert Haas 2021-04-01 20:08:56 Re: pg_amcheck contrib application