RE: [Patch] Optimize dropping of relation buffers using dlist

From: "k(dot)jamison(at)fujitsu(dot)com" <k(dot)jamison(at)fujitsu(dot)com>
To: "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, 'Kyotaro Horiguchi' <horikyota(dot)ntt(at)gmail(dot)com>
Cc: "amit(dot)kapila16(at)gmail(dot)com" <amit(dot)kapila16(at)gmail(dot)com>, "tgl(at)sss(dot)pgh(dot)pa(dot)us" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "andres(at)anarazel(dot)de" <andres(at)anarazel(dot)de>, "robertmhaas(at)gmail(dot)com" <robertmhaas(at)gmail(dot)com>, "tomas(dot)vondra(at)2ndquadrant(dot)com" <tomas(dot)vondra(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: [Patch] Optimize dropping of relation buffers using dlist
Date: 2020-10-15 03:34:09
Message-ID: OSBPR01MB2341672E9A95E5EC6D2E79B5EF020@OSBPR01MB2341.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

On Tuesday, October 13, 2020 10:09 AM, Tsunakawa-san wrote:
> Why do you do it this way? I think the previous patch was more correct
> (while agreeing with Horiguchi-san that nTotalBlocks may be unnecessary).
> What you want to do is "if the size of any fork could be inaccurate, do the
> traditional full buffer scan without performing any optimization for any
> fork," right? But the above code performs optimization for forks until it
> finds a fork with an inaccurate size.
>
> (2)
> + * Get the total number of cached blocks and to-be-invalidated
> blocks
> + * of the relation. The cached value returned by smgrnblocks could
> be
> + * smaller than the actual number of existing buffers of the file.
>
> As you changed the meaning of the smgrnblocks() argument from cached to
> accurate, and you no longer calculate the total blocks, the comment should
> reflect that.
>
>
> (3)
> In smgrnblocks(), accurate is not set to false when mdnblocks() is called.
> The caller doesn't initialize the value either, so it can see a garbage value.
>
>
> (4)
> + if (nForkBlocks[i] != InvalidBlockNumber &&
> + nBlocksToInvalidate <
> BUF_DROP_FULL_SCAN_THRESHOLD)
> + {
> ...
> + }
> + else
> + break;
> + }
>
> In cases like this, it's better to reverse the if and else; that way you
> can reduce the nesting depth.

Thank you for the review!
1. I have revised the patch to address your comments/feedback. Attached is the latest set of patches.

2. Non-recovery Performance
I also included a debug version of the patch (0004) in which I removed the recovery-related checks
so that I could measure non-recovery performance.
However, I still haven't found the cause of why the non-recovery performance
does not change compared to master. (1 min 15 s for the test case given below)

> - if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
> + if (reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)

Here's how I measured it:
0. postgresql.conf setting
shared_buffers = 100GB
autovacuum = off
full_page_writes = off
checkpoint_timeout = 30min
max_locks_per_transaction = 100
wal_log_hints = on
wal_keep_size = 100
max_wal_size = 20GB

1. createdb test

2. Create tables: SELECT create_tables(1000);

create or replace function create_tables(numtabs int)
returns void as $$
declare
  query_string text;
begin
  for i in 1..numtabs loop
    query_string := 'create table tab_' || i::text || ' (a int);';
    execute query_string;
  end loop;
end;
$$ language plpgsql;

3. Insert rows into the tables (3.5 GB database): SELECT insert_tables(1000);

create or replace function insert_tables(numtabs int)
returns void as $$
declare
  query_string text;
begin
  for i in 1..numtabs loop
    query_string := 'insert into tab_' || i::text || ' SELECT generate_series(1, 100000);';
    execute query_string;
  end loop;
end;
$$ language plpgsql;

4. DELETE FROM tables: SELECT delfrom_tables(1000);

create or replace function delfrom_tables(numtabs int)
returns void as $$
declare
  query_string text;
begin
  for i in 1..numtabs loop
    query_string := 'delete from tab_' || i::text;
    execute query_string;
  end loop;
end;
$$ language plpgsql;

5. Measure VACUUM timing
\timing
VACUUM;

Using the debug version of the patch, I have confirmed that it enters the optimization path
when the conditions are met. Here are some printed logs from 018_wal_optimize_node_replica.log:
> make world -j4 -s && make -C src/test/recovery/ check PROVE_TESTS=t/018_wal_optimize.pl

WARNING: current fork 0, nForkBlocks[i] 1, accurate: 1
CONTEXT: WAL redo at 0/162B4E0 for Storage/TRUNCATE: base/13751/24577 to 0 blocks flags 7
WARNING: Optimization Loop.
buf_id = 41. nforks = 1. current fork = 0. forkNum: 0 == tag's forkNum: 0. curBlock: 0 < nForkBlocks[i] = 1. tag blockNum: 0 >= firstDelBlock[i]: 0. nBlocksToInvalidate = 1 < threshold = 32.

--
3. Recovery Performance (hot standby, failover)
OTOH, when measuring recovery performance (using the 0003 patch), the results were great.

| s_b   | master | patched | %reg   |
|-------|--------|---------|--------|
| 128MB | 3.043  | 2.977   | -2.22% |
| 1GB   | 3.417  | 3.410   | -0.21% |
| 20GB  | 20.597 | 2.409   | -755%  |
| 100GB | 66.862 | 2.409   | -2676% |

To run this on a hot-standby setup (after inserting rows into the tables):
1. [Standby] Pause WAL replay
SELECT pg_wal_replay_pause();

2. [Master] Measure VACUUM timing. Then stop server.
\timing
VACUUM;
\q
pg_ctl stop -mi -w

3. [Standby] Use the attached script to promote standby and measure the performance.
# test.sh recovery

So the remaining issue I'm investigating is why the non-recovery performance
does not improve, while OTOH it is good when InRecovery.

Regards,
Kirk Jamison

Attachment Content-Type Size
v25-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch application/octet-stream 1.1 KB
v25-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch application/octet-stream 8.7 KB
v25-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch application/octet-stream 7.5 KB
v25-0004-V25-with-ereport-for-debug.patch application/octet-stream 2.9 KB
018_wal_optimize_node_replica.log application/octet-stream 138.5 KB
test.sh application/octet-stream 405 bytes
