Re: Relation extension scalability

From: Andres Freund <andres(at)anarazel(dot)de>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Relation extension scalability
Date: 2015-07-19 13:58:41
Message-ID: 20150719135841.GG25610@awork2.anarazel.de
Lists: pgsql-hackers

Hi,

Every now and then over the last few weeks I've spent a bit of time making
this more efficient.

I had a bit of trouble reproducing the problems I'd seen in production on
physical hardware (I found EC2 too variable to benchmark this), but luckily
2ndQuadrant today gave me access to their four-socket machine[1] from the
AXLE project. Thanks Simon and Tomas!

First, the juicy numbers:

My benchmark was a parallel COPY into a single WAL-logged target
table:
CREATE TABLE data(data text);
The source data was generated with:
narrow:
COPY (select g.i::text FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinary' WITH BINARY;
wide:
COPY (select repeat(random()::text, 10) FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinarywide' WITH BINARY;

Between every test I ran a TRUNCATE data; CHECKPOINT;

For each number of clients I ran pgbench for 70 seconds. I'd previously
determined using -P 1 that the numbers are fairly stable. Longer runs
would have been nice, but then I'd not have finished in time.

shared_buffers = 48GB, narrow table contents:
clients    tps after:    tps before:
      1     180.255577     210.125143
      2     338.231058     391.875088
      4     638.814300     405.243901
      8    1126.852233     370.922271
     16    1242.363623     498.487008
     32    1229.648854     484.477042
     48    1223.288397     468.127943
     64    1198.007422     438.238119
     96    1201.501278     370.556354
    128    1198.554929     288.213032
    196    1189.603398     193.841993
    256    1144.082291     191.293781
    512     643.323675     200.782105

shared_buffers = 1GB, narrow table contents:
clients    tps after:    tps before:
      1     191.137410     210.787214
      2     351.293017     384.086634
      4     649.800991     420.703149
      8    1103.770749     355.947915
     16    1287.192256     489.050768
     32    1226.329585     464.936427
     48    1187.266489     443.386440
     64    1182.698974     402.251258
     96    1208.315983     331.290851
    128    1183.469635     269.250601
    196    1202.847382     202.788617
    256    1177.924515     190.876852
    512     572.457773     192.413191

shared_buffers = 48GB, wide table contents:
clients    tps after:    tps before:
      1      59.685215      68.445331
      2     102.034688     103.210277
      4     179.434065      78.982315
      8     222.613727      76.195353
     16     232.162484      77.520265
     32     231.979136      71.654421
     48     231.981216      64.730114
     64     230.955979      57.444215
     96     228.016910      56.324725
    128     227.693947      45.701038
    196     227.410386      37.138537
    256     224.626948      35.265530
    512     105.356439      34.397636

shared_buffers = 1GB, wide table contents:
(ran out of patience)

Note that the peak performance with the patch is significantly better,
but there's currently a noticeable regression in single-threaded
performance. That undoubtedly needs to be addressed.

So, to get to the actual meat: My goal was to essentially get rid of the
exclusive lock over relation extension altogether. I think I found a
way to do so that addresses the concerns raised in this thread.

The new algorithm basically is:
1) Acquire victim buffer, clean it, and mark it as pinned
2) Get the current size of the relation, save it into blockno
3) Try to insert an entry into the buffer table for blockno
4) If the page is already in the buffer table, increment blockno by 1,
goto 3)
5) Try to read the page. In most cases it'll not yet exist. But the page
might concurrently have been written by another backend and removed
from shared buffers already. If already existing, goto 1)
6) Zero out the page on disk.

I think this does handle the concurrency issues.
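
To make the claim-a-block-number part (steps 2-4) a bit more concrete, here's
a minimal, self-contained C model of the probe-and-retry loop. It's purely an
illustration of the idea, not code from the attached patch: buffer_table,
insert_if_absent() and claim_new_block() are stand-in names I made up for the
shared buffer mapping table and its insert operation, and the toy is
single-threaded, whereas in the real code the buffer mapping table's own
locking makes the insert atomic.

/*
 * Toy model of the lock-free extension loop (steps 2-4 above).
 * NOT patch code: buffer_table[], insert_if_absent() and claim_new_block()
 * are invented stand-ins for the buffer mapping table and its insert.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_BLOCKS 1024

static bool buffer_table[MAX_BLOCKS];   /* stand-in for the buffer mapping table */
static unsigned relation_blocks = 0;    /* stand-in for the relation's current size */

/* Insert an entry for blockno; return false if one already existed. */
static bool
insert_if_absent(unsigned blockno)
{
    if (buffer_table[blockno])
        return false;
    buffer_table[blockno] = true;
    return true;
}

/* Claim a block number at (or past) the current end of the relation. */
static unsigned
claim_new_block(void)
{
    unsigned blockno = relation_blocks;     /* step 2: current relation size */

    /* steps 3-4: probe forward until our insert into the table succeeds */
    while (!insert_if_absent(blockno))
        blockno++;

    /*
     * Steps 5-6 -- reading the page to detect one that was already written
     * and evicted by another backend, and zeroing the new page on disk --
     * are omitted from this toy model.
     */
    return blockno;
}

int
main(void)
{
    /* Two "backends" extending concurrently end up with distinct blocks. */
    printf("first backend got block %u\n", claim_new_block());
    printf("second backend got block %u\n", claim_new_block());
    return 0;
}

The point of the model is just that no backend ever holds a lock across the
whole extension; conflicts are resolved by retrying with the next block number.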

This patch is very clearly at the POC stage, but I do think the approach
is generally sound. I'd like to see some comments before deciding
whether to carry on.

Greetings,

Andres Freund

PS: Yes, I know that this level of precision in the benchmark numbers isn't
warranted, but I'm too lazy to truncate them.

[1]
[10:28:11 PM] Tomas Vondra: 4x Intel Xeon E5-4620 Eight Core 2.2GHz
Processor’s generation Sandy Bridge EP
each core handles 2 threads, so 16 threads total
256GB (16x16GB) ECC REG System Validated Memory (1333 MHz)
2x 250GB SATA 2.5” Enterprise Level HDs (RAID 1, ~250GB)
17x 600GB SATA 2.5” Solid State HDs (RAID 0, ~10TB)
LSI MegaRAID 9271-8iCC controller and Cache Vault Kit (1GB cache)
2 x Nvidia Tesla K20 Active GPU Cards (GK110GL)

Attachment Content-Type Size
0001-WIP-Saner-heap-extension.patch text/x-patch 26.2 KB
