|From:||"Imai, Yoshikazu" <imai(dot)yoshikazu(at)jp(dot)fujitsu(dot)com>|
|To:||'Alexander Korotkov' <a(dot)korotkov(at)postgrespro(dot)ru>|
|Subject:||RE: Locking B-tree leafs immediately in exclusive mode|
On Mon, June 11, 2018 at 4:31 PM, Alexander Korotkov wrote:
> On Mon, Jun 11, 2018 at 1:06 PM Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > On 5 June 2018 at 14:45, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
> > wrote:
> > > Currently _bt_search() always locks the leaf buffer in shared mode (aka
> > > BT_READ), while the caller can relock it later. However, I don't see
> > > what prevents _bt_search() from locking the leaf immediately in
> > > exclusive mode (aka BT_WRITE) when required.
> > > When traversing the downlink from a non-leaf page of level 1 (the
> > > lowest level of non-leaf pages, just above the leaf pages), we know
> > > that the next page is going to be a leaf. In this case, we can
> > > immediately lock the next page in exclusive mode if required.
> > > Of course, it might happen that we don't manage to exclusively lock
> > > the leaf this way, when _bt_getroot() points us directly to a leaf
> > > page. But that case can be handled in _bt_search() by relocking.
> > > Please find an implementation of this approach in the attached patch.
> > It's a good idea. How does it perform with many duplicate entries?
> Our B-tree currently keeps duplicates unordered. So, during insertion
> we may traverse rightlinks in order to find a page that would fit the
> new index tuple.
> However, in that case we're already traversing pages in exclusive lock
> mode after the relock, and _bt_search() doesn't change anything there.
> So, I assume there shouldn't be any degradation in the case of many
> duplicate entries.
> But this case is definitely worth testing, and I'm planning to do it.
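The descent logic discussed above can be sketched as a toy model (this is illustrative Python, not PostgreSQL's actual C implementation; the `Page`, `bt_search`, `BT_READ`, and `BT_WRITE` names only mirror nbtree terminology, and the tree structure is hypothetical):

```python
# Toy model of the proposed descent: hold ordinary (shared) locks while
# descending, but when the current page is at level 1 (its children are
# leaves), take the child's lock in exclusive mode right away for writes.

BT_READ, BT_WRITE = "shared", "exclusive"

class Page:
    def __init__(self, level, children=None):
        self.level = level              # 0 = leaf page
        self.children = children or []

def bt_search(root, access):
    """Descend to a leaf; return (leaf_page, lock_mode_held_on_leaf)."""
    page, mode = root, BT_READ
    while page.level > 0:
        # On a level-1 page the downlink points to a leaf: for a write
        # access, lock that child exclusively immediately instead of
        # taking a shared lock and relocking later.
        child_mode = BT_WRITE if (access == BT_WRITE and page.level == 1) \
                     else BT_READ
        page, mode = page.children[0], child_mode
    # Rare case: the root itself was a leaf, so we only hold a shared
    # lock; the patch handles this in _bt_search() by relocking.
    if access == BT_WRITE and mode != BT_WRITE:
        mode = BT_WRITE
    return page, mode

leaf = Page(0)
root = Page(2, [Page(1, [leaf])])
print(bt_search(root, BT_WRITE)[1])  # exclusive
```

The point of the sketch is only that the extra relock step disappears for the common case where the descent passes through a level-1 page.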
Hi, I'm reviewing this.
Firstly, I ran the same performance tests you did, on a 72-core machine (AWS c5.18xlarge).
> # postgresql.conf
> max_connections = 300
> shared_buffers = 32GB
> fsync = off
> synchronous_commit = off
> # DDL:
> CREATE UNLOGGED TABLE ordered (id serial primary key, value text not null);
> CREATE UNLOGGED TABLE unordered (i integer not null, value text not null);
> # script_ordered.sql
> INSERT INTO ordered (value) VALUES ('abcdefghijklmnoprsqtuvwxyz');
> # script_unordered.sql
> \set i random(1, 1000000)
> INSERT INTO unordered VALUES (:i, 'abcdefghijklmnoprsqtuvwxyz');
> # commands
> pgbench -T 60 -P 1 -M prepared -f script_ordered.sql -c 150 -j 150 postgres
> pgbench -T 60 -P 1 -M prepared -f script_unordered.sql -c 150 -j 150 postgres
> # results
> ordered, master: 157473 TPS
> ordered, patched 231374 TPS
> unordered, master: 232372 TPS
> unordered, patched: 232535 TPS
# my results
ordered, master: 186045 TPS
ordered, patched: 265842 TPS
unordered, master: 313011 TPS
unordered, patched: 309636 TPS
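As a sanity check on the figures above, the relative changes work out as follows (simple arithmetic on the reported TPS values):

```python
def pct(master, patched):
    """Percentage change of patched TPS relative to master TPS."""
    return round((patched - master) / master * 100, 1)

print(pct(157473, 231374))  # Alexander's ordered:  46.9 (% improvement)
print(pct(186045, 265842))  # my ordered:           42.9
print(pct(313011, 309636))  # my unordered:         -1.1 (within noise)
```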
I confirmed that this patch improves ordered insertion.
In addition to your tests, I did the following test with many duplicate entries.
# DDL
CREATE UNLOGGED TABLE duplicated (i integer not null, value text not null);
# script_duplicated.sql
INSERT INTO duplicated VALUES (1, 'abcdefghijklmnoprsqtuvwxyz');
# command
pgbench -T 60 -P 1 -M prepared -f script_duplicated.sql -c 150 -j 150 postgres
duplicated, master: 315450 TPS
duplicated, patched: 311584 TPS
It seems there is almost no performance degradation in the case of many duplicated entries.
I'm planning to do a code review and send it in the next mail.