RE: [Patch] Optimize dropping of relation buffers using dlist

From: "k(dot)jamison(at)fujitsu(dot)com" <k(dot)jamison(at)fujitsu(dot)com>
To: "k(dot)jamison(at)fujitsu(dot)com" <k(dot)jamison(at)fujitsu(dot)com>, 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>
Cc: "tanghy(dot)fnst(at)cn(dot)fujitsu(dot)com" <tanghy(dot)fnst(at)cn(dot)fujitsu(dot)com>, "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, "jankirk(dot)jamison(at)gmail(dot)com" <jankirk(dot)jamison(at)gmail(dot)com>
Subject: RE: [Patch] Optimize dropping of relation buffers using dlist
Date: 2021-01-06 13:13:03
Message-ID: OSBPR01MB2341B75F355B285199AD1BD3EFD00@OSBPR01MB2341.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

On Wed, January 6, 2021 7:04 PM (JST), I wrote:
> I will resume testing similar to Tang's, because she also executed the original
> failover test that I have been running.
> To avoid confusion, and to check whether my results and Tang's are
> consistent, I also ran the recovery/failover test for VACUUM on a single relation,
> which I will send in a separate email after this.

A. Test to find the right THRESHOLD

Below are the procedure and results of the VACUUM recovery performance
test on a single relation.
Following the advice below, I applied the supplementary patch on top of V39:
Test-for-threshold.patch
This ensures that we always enter the optimized path, so the relation size
itself acts as the threshold under test.

> >One idea could be to remove "nBlocksToInvalidate <
> >BUF_DROP_FULL_SCAN_THRESHOLD" part of check "if (cached &&
> >nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it
> >always use optimized path for the tests. Then use the relation size
> >as NBuffers/128, NBuffers/256, NBuffers/512 for different values of
> >shared buffers as 128MB, 1GB, 20GB, 100GB.

Each relation size is NBuffers/XXX, so I used the attached "rel.sh" script
to test relation sizes from NBuffers/512 up to NBuffers/8 for each shared_buffers setting.
I did not go beyond NBuffers/8 because it took too much time, and the results
up to that point were already conclusive.
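For concreteness (assuming the default 8kB block size), shared_buffers = 100GB
corresponds to NBuffers = 13,107,200, so the tested relation sizes at that
setting range from NBuffers/512 = 25,600 blocks (200MB) up to
NBuffers/8 = 1,638,400 blocks (12,800MB).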

[Vacuum Recovery Performance on Single Relation]
1. Set up synchronous streaming replication. I used the configuration
shown at the bottom of this email.
2. [Primary] Create 1 table. (rel.sh create)
3. [Primary] Insert data of NBuffers/XXX size. Make sure to use the correct
size for the configured shared_buffers by commenting out the right size in the
"insert" section of the rel.sh script. (rel.sh insert)
4. [Primary] Delete table. (rel.sh delete)
5. [Standby] Optional: double-check that the DELETE is reflected on the standby:
SELECT count(*) FROM tableXXX;
Make sure it returns 0.
6. [Standby] Pause WAL replay. (rel.sh pause)
(The script executes SELECT pg_wal_replay_pause();)
7. [Primary] VACUUM the single relation. (rel.sh vacuum)
8. [Primary] After the vacuum finishes, stop the server. (rel.sh stop)
(The script will execute pg_ctl stop -D $PGDATA -w -mi)
9. [Standby] Resume WAL replay and promote the standby.
(rel.sh resume)
The script prints a timestamp when WAL replay is resumed and another
when the promotion is done; the difference between the two is the
measured recovery time. (A rough sketch of this step is shown right below.)
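For reference, the measurement in step 9 boils down to something like the
following on the standby. This is an assumed approximation, not the actual
attached rel.sh; $STANDBY_PORT and $STANDBY_PGDATA are placeholders:

T0=$(date +%s.%N)                             # before resuming WAL replay
psql -p $STANDBY_PORT -c "SELECT pg_wal_replay_resume();"
pg_ctl promote -D $STANDBY_PGDATA -w          # returns once promotion completes
T1=$(date +%s.%N)                             # after promotion
echo "failover time: $(echo "$T1 - $T0" | bc) sec"

The promotion only finishes after the standby has replayed the WAL that was
pending while replay was paused, so the reported time is dominated by the
buffer invalidation done while replaying the VACUUM records.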

[Results for VACUUM on single relation]
Average of 5 runs.

1. % REGRESSION
% Regression: (patched - master)/master

| rel_size | 128MB | 1GB | 20GB | 100GB |
|----------|--------|--------|--------|----------|
| NB/512 | 0.000% | 0.000% | 0.000% | -32.680% |
| NB/256 | 0.000% | 0.000% | 0.000% | 0.000% |
| NB/128 | 0.000% | 0.000% | 0.000% | -16.502% |
| NB/64 | 0.000% | 0.000% | 0.000% | -9.841% |
| NB/32 | 0.000% | 0.000% | 0.000% | -6.219% |
| NB/16 | 0.000% | 0.000% | 0.000% | 3.323% |
| NB/8 | 0.000% | 0.000% | 0.000% | 8.178% |

For 100GB shared_buffers, regression appears beyond NBuffers/32: at
NBuffers/16 and larger relation sizes, the patched version performs worse
than master. In other words, beyond that point the cost of finding the
buffers to be invalidated in the optimized path exceeds the cost of the
traditional full buffer-pool scan. From this we can conclude that
NBuffers/32 is the right threshold.

So in the attached V39 patches, I have updated the threshold
BUF_DROP_FULL_SCAN_THRESHOLD to NBuffers/32.
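In absolute terms (again assuming the default 8kB block size), NBuffers/32 is
512 buffers (4MB) at shared_buffers = 128MB, 4,096 buffers (32MB) at 1GB,
81,920 buffers (640MB) at 20GB, and 409,600 buffers (3,200MB) at 100GB;
relations at or above that size fall back to the full buffer-pool scan.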

2. [PATCHED]
Units: Seconds

| rel_size | 128MB | 1GB | 20GB | 100GB |
|----------|-------|-------|-------|-------|
| NB/512 | 0.106 | 0.106 | 0.106 | 0.206 |
| NB/256 | 0.106 | 0.106 | 0.106 | 0.306 |
| NB/128 | 0.106 | 0.106 | 0.206 | 0.506 |
| NB/64 | 0.106 | 0.106 | 0.306 | 0.907 |
| NB/32 | 0.106 | 0.106 | 0.406 | 1.508 |
| NB/16 | 0.106 | 0.106 | 0.706 | 3.109 |
| NB/8 | 0.106 | 0.106 | 1.307 | 6.614 |

3. [MASTER]
Units: Seconds

| rel_size | 128MB | 1GB | 20GB | 100GB |
|----------|-------|-------|-------|-------|
| NB/512 | 0.106 | 0.106 | 0.106 | 0.306 |
| NB/256 | 0.106 | 0.106 | 0.106 | 0.306 |
| NB/128 | 0.106 | 0.106 | 0.206 | 0.606 |
| NB/64 | 0.106 | 0.106 | 0.306 | 1.006 |
| NB/32 | 0.106 | 0.106 | 0.406 | 1.608 |
| NB/16 | 0.106 | 0.106 | 0.706 | 3.009 |
| NB/8 | 0.106 | 0.106 | 1.307 | 6.114 |
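As a quick cross-check of the regression formula against the two tables above:
at 100GB with NBuffers/8, (6.614 - 6.114) / 6.114 = +8.178%, and at 100GB with
NBuffers/512, (0.206 - 0.306) / 0.306 = -32.680%, which matches table 1.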

I used the following configurations:
[postgresql.conf]
shared_buffers = 100GB #20GB,1GB,128MB
autovacuum = off
full_page_writes = off
checkpoint_timeout = 30min
max_locks_per_transaction = 10000
max_wal_size = 20GB

# For streaming replication; set these on the primary only (keep them commented out on the standby).
synchronous_commit = remote_write
synchronous_standby_names = 'walreceiver'

# For the standby only; uncomment these on the standby and keep them commented on the primary.
# hot_standby = on
#primary_conninfo = 'host=... user=... port=... application_name=walreceiver'

----------
B. Regression Test using the NBuffers/32 Threshold (V39 Patches)

For this test, we do NOT need the supplementary Test-for-threshold.patch;
apply only the V39 patches. Instead of the "rel.sh" test script, please use
the attached "test.sh".
Similar to the tests I did before with 1000 relations, I executed the recovery
performance test, now with the NBuffers/32 threshold.
The postgresql.conf settings are similar to those in the test above.

Each relation has 1 block (8kB). There are 1000 relations in total.

The test procedure is almost the same as in A, so I'll just summarize it:
1. Set up synchronous streaming replication and apply the configuration settings.
2. [Primary] test.sh create
(The test.sh script will create 1000 tables)
3. [Primary] test.sh insert
4. [Primary] test.sh delete (Skip steps 4-5 for the TRUNCATE test)
5. [Standby] Optional, for the VACUUM test: double-check that the DELETE
is reflected on the standby: SELECT count(*) FROM tableXXX;
Make sure it returns 0.
6. [Standby] test.sh pause
7. [Primary] "test.sh vacuum" for VACUUM test
"test,sh truncate" for TRUNCATE test
8. [Primary] Once #7 is done, test.sh stop
9. [Standby] Once the primary has fully stopped, run "test.sh resume".
Compute the time difference as in test A.
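For reference, step 7 essentially issues the command against each of the
1000 tables on the primary. A hypothetical approximation of what test.sh does
($PRIMARY_PORT is a placeholder; the actual attached script may batch the
statements differently):

for i in $(seq 1 1000); do
    echo "VACUUM table$i;"    # TRUNCATE test: echo "TRUNCATE table$i;"
done | psql -p $PRIMARY_PORT

The standby then measures how long it takes to replay the WAL generated by
these statements once replay is resumed and the standby is promoted.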

[Results for VACUUM Recovery Performance for 1000 relations]
Unit is in seconds. Average of 5 executions.
% regression = (patched-master)/master

| s_b | Master | Patched | %reg |
|--------|--------|---------|---------|
| 128 MB | 0.306 | 0.306 | 0.00% |
| 1 GB | 0.506 | 0.306 | -39.53% |
| 20 GB | 14.522 | 0.306 | -97.89% |
| 100 GB | 66.564 | 0.306 | -99.54% |

[Results for TRUNCATE Recovery Performance for 1000 relations]
Unit is in seconds. Average of 5 executions.
% regression = (patched-master)/master

| s_b | Master | Patched | %reg |
|--------|--------|---------|---------|
| 128 MB | 0.206 | 0.206 | 0.00% |
| 1 GB | 0.506 | 0.206 | -59.29% |
| 20 GB | 16.476 | 0.206 | -98.75% |
| 100 GB | 88.261 | 0.206 | -99.77% |
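As a cross-check here as well: for VACUUM at 20GB, (0.306 - 14.522) / 14.522
= -97.89%, and for TRUNCATE at 100GB, (0.206 - 88.261) / 88.261 = -99.77%,
matching the tables.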

The patched results were constant across all shared_buffers settings for both
TRUNCATE and VACUUM, whereas master's recovery time grows with the size of
shared_buffers. That means we can gain huge performance benefits with the patch.

The performance benefit itself has been tested extensively, so there is no
question about that. I think the final decision on the threshold value should
be made once other people's results are consistent with mine. For now, based
on my test results, NBuffers/32 is the threshold I concluded on, and it is
what the attached V39 patch set uses.

[Specs Used]
Intel(R) Xeon(R) CPU E5-2637 v4 @ 3.50GHz
8 CPUs, 256GB Memory
XFS, RHEL 7.2, latest PostgreSQL (HEAD)

Feedback is definitely welcome.
If you want to run the tests yourself, the detailed steps and the scripts I
used are all included above. Have fun testing!

Regards,
Kirk Jamison

Attachment Content-Type Size
v39-0001-Optimize-DropRelFileNodeBuffers-for-recovery.patch application/octet-stream 12.2 KB
v39-0002-Optimize-DropRelFileNodesAllBuffers-for-recovery.patch application/octet-stream 6.2 KB
v39-0003-Prevent-invalidating-blocks-in-smgrextend-during.patch application/octet-stream 1.1 KB
Test-for-threshold.patch application/octet-stream 2.0 KB
rel.sh application/octet-stream 8.6 KB
test.sh application/octet-stream 3.9 KB
