ExclusiveLock on extension of relation with huge shared_buffers

From: Borodin Vladimir <root(at)simply(dot)name>
To: pgsql-hackers(at)postgresql(dot)org
Subject: ExclusiveLock on extension of relation with huge shared_buffers
Date: 2014-10-23 17:09:15
Message-ID: 0DDFB621-7282-4A2B-8879-A47F7CECBCE4@simply.name
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

Hi all.

Some time ago I wrote [0] on pgsql-performance@ about having problems with ExclusiveLock on extending relation while checkpointing on system with large shared_buffers. That time the problem was fixed with getting shared_buffers much smaller.

Right now I am testing 9.4beta3 with large shared_buffers (100GB of 128 GB RAM) and huge pages under writing (insert only) load and I am having the same issue. The reason I do it is that such configuration shows much better performance (than 4GB shared_buffers and all other RAM for page cache) on our read-only load profile (some kind of 2,5x better).

The symptoms are well described in the first letter but I will repeat shortly. Most of the time processor spends in system, pg_locks shows that many backends are waiting because of ExclusiveLock on extending some relation. In the log I can see lots of such lines:

< 2014-10-23 18:03:39.287 MSK >LOG: process 5426 acquired ExclusiveLock on extension of relation 111669 of database 110866 after 3421.279 ms
< 2014-10-23 18:03:39.501 MSK >LOG: process 2904 still waiting for ExclusiveLock on extension of relation 111714 of database 110866 after 1000.053 ms

I started to dig to find the root of the problem. The CPU spike does not happen on the start of checkpoint or at the end. It happens one or several times during checkpoint is happening. Since it is not production environment I have written SystemTap script that logs all checkpoint events and tracks timings of doing something under ExclusiveLock. The script is attached to the letter (checkpoint_markers.stp). Also in attaches you can find output of this script (stap.out) and output of dstat (dstat.log) to see the system resource consumption.

From stap.out you can see that there are some strange spikes of timings spent between LockRelationForExtension and UnlockRelationForExtension from the same pid and for the same relation. And from dstat.log you can see that at the same time there are spikes of cpu spent in system, "perf top" says that most of the time is spent in _spin_lock_irq at this time.

I have looked at src/backend/access/heap/hio.c:RelationGetBufferForTuple and further until smgr* functions to see what is happening under this lock but haven’t found any potential problems except one. Relation extending is done by backend under exclusive lock and it is extended by only one page. Aren’t there any plans for making yet another background process for extending relations after reaching some threashold?

I’m pretty sure that I have missed something (for example, I still can’t explain why this is not happening on system with small shared_buffers). What kind of debug can I turn on or where else should I look? Any ideas would be really appropriate. Below are some details about the system.

I am running PostgreSQL 9.4beta3 installed from RPM packages (compiled with —enable-debug and —enable-dtrace) on RHEL 6.5. Host is running with the following CPU (32 cores) and memory:

root(at)xdb01d ~ # fgrep -m1 'model name' /proc/cpuinfo
model name : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
root(at)xdb01d ~ # free -m
total used free shared buffers cached
Mem: 129027 128293 734 0 21 17345
-/+ buffers/cache: 110927 18100
Swap: 16378 0 16378
root(at)xdb01d ~ #

PGDATA lives on RAID6 array of 8 ssd-disks with ext4 (pg_xlog is on separate RAID10 array mounted with nobarrier). Right now PGDATA takes a bit more than 100G.

When checkpoint starts I/O load on the system becomes much higher than before checkpoint starts. That is happening because of bgwriter doesn’t really do the job (although bgwriter settings are really aggressive):

xdb01d/postgres # select * from pg_stat_bgwriter;
-[ RECORD 1 ]---------+------------------------------
checkpoints_timed | 0
checkpoints_req | 19
checkpoint_write_time | 22769904
checkpoint_sync_time | 199006
buffers_checkpoint | 85116980
buffers_clean | 92815
maxwritten_clean | 80
buffers_backend | 3743898
buffers_backend_fsync | 0
buffers_alloc | 13366074
stats_reset | 2014-10-23 10:59:17.123677+04

Time: 10.451 ms
xdb01d/postgres #

< 2014-10-23 18:54:13.913 MSK >LOG: checkpoint starting: xlog
< 2014-10-23 19:23:36.432 MSK >LOG: checkpoint complete: wrote 4586752 buffers (35.0%); 0 transaction log file(s) added, 0 removed, 1881 recycled; write=1761.595 s, sync=0.722 s, total=1762.518 s; sync files=172, longest=0.162 s, average=0.004 s

< 2014-10-23 19:28:47.526 MSK >LOG: checkpoint starting: xlog
< 2014-10-23 19:59:29.860 MSK >LOG: checkpoint complete: wrote 4735580 buffers (36.1%); 0 transaction log file(s) added, 0 removed, 2138 recycled; write=1841.799 s, sync=0.328 s, total=1842.334 s; sync files=179, longest=0.124 s, average=0.001 s

The configuration for postgres is also in attached file.

[0] http://www.postgresql.org/message-id/466D72D2-68EC-4FF4-93B8-43B687E7B705@simply.name

P.S. Since this letter is more the question about why relation is extended by one page I’m writing the letter on hackers list but not performance. But of course it can be moved to performance list if someone finds it correct.



Browse pgsql-hackers by date

  From Date Subject
Next Message Merlin Moncure 2014-10-23 18:03:16 Re: idea: allow AS label inside ROW constructor
Previous Message Dag-Erling Smørgrav 2014-10-23 16:40:50 Re: [PATCH] add ssl_protocols configuration option