Re: prevent immature WAL streaming

From: Andres Freund <andres(at)anarazel(dot)de>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "masao(dot)fujii(at)oss(dot)nttdata(dot)com" <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "mengjuan(dot)cmj(at)alibaba-inc(dot)com" <mengjuan(dot)cmj(at)alibaba-inc(dot)com>, "Jakub(dot)Wartak(at)tomtom(dot)com" <Jakub(dot)Wartak(at)tomtom(dot)com>, Ryo Matsumura <matsumura(dot)ryo(at)fujitsu(dot)com>
Subject: Re: prevent immature WAL streaming
Date: 2021-10-13 18:31:29
Message-ID: 20211013183129.jutbvropp7nkg2kk@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2021-10-13 11:03:39 -0700, Andres Freund wrote:
> It seems that 026_overwrite_contrecord.pl test often fails under valgrind. I
> first thought that related failures on skink were due to me migrating the
> animal to a new host (and then running out of space due to a mistake in ccache
> config). But it happened again after I fixed those, and I just reproduced the
> issue locally.
>
> It's a bit odd that we didn't start to see these failures immediately, but
> only in the last few days. I'd blame skink being migrated to a new home,
> except that I can see the issue locally.
>
> FWIW, the way skink runs all postgres instances through valgrind is by
> replacing the postgres binary. Here's my local version of that:
>
> andres(at)awork3:~/build/postgres/dev-assert/vpath$ cat tmp_install/home/andres/build/postgres/dev-assert/install/bin/postgres
> #!/bin/bash
>
> exec /usr/bin/valgrind \
> --quiet \
> --error-exitcode=128 \
> --suppressions=/home/andres/src/postgresql/src/tools/valgrind.supp \
> --trace-children=yes --track-origins=yes --read-var-info=no \
> --leak-check=no \
> --run-libc-freeres=no \
> --vgdb=no \
> --error-markers=VALGRINDERROR-BEGIN,VALGRINDERROR-END \
> /home/andres/build/postgres/dev-assert/vpath/tmp_install/home/andres/build/postgres/dev-assert/install/bin/postgres.orig \
> "$@"
>
> make -C src/test/recovery/ check PROVE_FLAGS='-v' PROVE_TESTS='t/026_overwrite_contrecord.pl' NO_TEMP_INSTALL=1
> ...
>
> not ok 1 - 000000010000000000000002 differs from 000000010000000000000002
>
> # Failed test '000000010000000000000002 differs from 000000010000000000000002'

I added LSNs to the error message:
not ok 1 - 000000010000000000000002 (0/2002350) differs from 000000010000000000000002 (0/2099600)

It appears that the problem is that inbetween the determination of
rows_walsize the insert LSN moved to the next segment separately from the
insertion, presumably due to autovacuum/analayze or such.
<retries, with log_autovacuum_min_duration_statement=0, log_min_duration_statement=0>

2021-10-13 11:23:25.659 PDT [1491455] 026_overwrite_contrecord.pl LOG: statement: insert into filler select * from generate_series(1, 1000)
2021-10-13 11:23:26.467 PDT [1491455] 026_overwrite_contrecord.pl LOG: duration: 861.112 ms
2021-10-13 11:23:27.055 PDT [1491458] 026_overwrite_contrecord.pl LOG: statement: select pg_current_wal_insert_lsn() - '0/0'
2021-10-13 11:23:27.357 PDT [1491458] 026_overwrite_contrecord.pl LOG: duration: 347.888 ms
2021-10-13 11:23:27.980 PDT [1491461] 026_overwrite_contrecord.pl LOG: statement: WITH setting AS (
SELECT setting::int AS wal_segsize
FROM pg_settings WHERE name = 'wal_segment_size'
)
INSERT INTO filler
SELECT g FROM setting,
generate_series(1, 1000 * (wal_segsize - ((pg_current_wal_insert_lsn() - '0/0') % wal_segsize)) / 64232) g
2021-10-13 11:24:25.173 PDT [1491550] LOG: automatic analyze of table "postgres.public.filler"
avg read rate: 3.185 MB/s, avg write rate: 0.039 MB/s
buffer usage: 96 hits, 731 misses, 9 dirtied
system usage: CPU: user: 1.79 s, system: 0.00 s, elapsed: 1.81 s
2021-10-13 11:24:26.255 PDT [1491461] 026_overwrite_contrecord.pl LOG: duration: 58360.811 ms
2021-10-13 11:24:26.857 PDT [1491557] 026_overwrite_contrecord.pl LOG: statement: SELECT pg_current_wal_insert_lsn()
2021-10-13 11:24:27.120 PDT [1491557] 026_overwrite_contrecord.pl LOG: duration: 300.562 ms

Hm. I guess we can disable autovac. But that's not a great solution, there
might be WAL files due to catalog access etc too.

Seems like it might be worth doing the "filling" of the segment with a loop in
a DO block instead, where the end condition is to be within some distance to
the end of the segment? With plenty headroom?

Another thing: filling a segment by inserting lots of very tiny rows is pretty
expensive. Can't we use something a bit wider? Perhaps even emit_message?

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message John Naylor 2021-10-13 18:40:19 Re: [RFC] building postgres with meson
Previous Message bt21masumurak 2021-10-13 18:08:13 Re: Improved tab completion for PostgreSQL parameters in enums