|From:||"Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>|
|Subject:||Re: MultiXact\SLRU buffers configuration|
> On 8 May 2020, at 21:36, Andrey M. Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
> *** The problem ***
> I'm investigating some cases of reduced database performance due to MultiXactOffsetLock contention (80% MultiXactOffsetLock, 20% IO DataFileRead).
> The problem manifested itself during index repack and constraint validation. Both being effectively full table scans.
> The database workload contains a lot of select for share\select for update queries. I've tried to construct a synthetic workload generator but could not reproduce a similar lock profile: I see a lot of different locks in wait events, in particular many more MultiXactMemberLocks. Still, in my experiments with the synthetic workload, MultiXactOffsetLock contention can be reduced by increasing NUM_MXACTOFFSET_BUFFERS from its default of 8 to bigger numbers.
> *** Question 1 ***
> Is it safe to increase the number of buffers of the MultiXact (or all) SLRUs, recompile, and run the database as usual?
> I cannot experiment much with production. But I'm mostly sure that bigger buffers will solve the problem.
> *** Question 2 ***
> Probably we could add GUCs for the SLRU sizes? Are there any reasons not to make them configurable? I think multixacts, clog, subtransactions and others would benefit from bigger buffers. But, probably, too many knobs can be confusing.
> *** Question 3 ***
> The MultiXact offset lock is always taken as an exclusive lock, which makes the MultiXact offset subsystem effectively single-threaded. If someone has a good idea how to make it more concurrency-friendly, I'm willing to put some effort into this.
> Probably I could just add an LWLock for each offset buffer page. Is it something worth doing? Or are there any hidden caveats and difficulties?
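The shared-read idea behind Question 3 can be sketched as a toy readers-writer lock. This is an illustration only, not PostgreSQL's LWLock implementation (which is in C): many holders may share the lock for reading, while a writer waits for exclusivity.

```python
import threading

class ShareLock:
    """Toy readers-writer lock: many concurrent shared holders,
    one exclusive holder. Illustration of the shared-read idea only."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0        # current shared holders
        self._writer = False     # is an exclusive holder active?

    def acquire_shared(self):
        with self._cond:
            while self._writer:            # readers wait only for a writer
                self._cond.wait()
            self._readers += 1

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()    # wake a waiting writer

    def acquire_exclusive(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()          # writer waits for everyone
            self._writer = True

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

With an always-exclusive lock, every offset lookup serializes; with this scheme, read-only lookups can proceed concurrently and only page replacement or writes need the exclusive path.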
I've created a benchmark imitating MultiXact pressure on my laptop: 7 clients concurrently run "select * from table where primary_key = ANY ($1) for share", where $1 is an array of identifiers chosen so that each tuple in the table is locked by a different set of XIDs. During this benchmark I observe MultiXactOffsetControlLock contention in pg_stat_activity:
Friday, 8 May 2020, 15:08:37 (every 1s)
pid | wait_event | wait_event_type | state | query
41344 | ClientRead | Client | idle | insert into t1 select generate_series(1,1000000,1)
41375 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share
41377 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share
41378 | | | active | select * from t1 where i = ANY ($1) for share
41379 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share
41381 | | | active | select * from t1 where i = ANY ($1) for share
41383 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share
41385 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share
Finally, the benchmark measures the time to execute the select for update 42 times.
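One way to build such a workload is to derive each client's ID array from the tuple IDs themselves, so that overlapping FOR SHARE lockers force a distinct locker set (and hence a distinct MultiXact) per tuple. A minimal sketch, assuming hypothetical names (`t1`, column `i`, 7 clients as in the run above):

```python
# Sketch: client c locks every tuple whose id has bit c set, so tuple i
# ends up share-locked by exactly the set of clients encoded in its own id.
# With 7 clients, ids 1..127 cover all 127 distinct non-empty locker sets.

N_CLIENTS = 7   # assumption: matches the 7 concurrent clients above

def ids_for_client(c, n_tuples=127):
    """IDs that client c passes as $1 to '... = ANY ($1) FOR SHARE'."""
    return [i for i in range(1, n_tuples + 1) if (i >> c) & 1]

def locker_set(i):
    """Which clients end up sharing a lock on tuple i."""
    return {c for c in range(N_CLIENTS) if (i >> c) & 1}

# Parameterized query text per client (placeholder style is illustrative).
queries = {
    c: ("select * from t1 where i = ANY (%s) for share", ids_for_client(c))
    for c in range(N_CLIENTS)
}
```

Because no two tuples share a locker set, almost every lock acquisition creates or extends a MultiXact instead of reusing a cached one.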
I went ahead and created 3 patches:
1. Configurable SLRU buffer sizes for MultiXactOffsets and MultiXactMembers
2. Reduce locking level to shared on read of MultiXactId members
3. Configurable cache size
I've found out that:
1. When the MultiXact working set does not fit into the buffers, benchmark run times grow dramatically. Yet very big buffers slow the benchmark down too. For this benchmark the optimal SLRU size is 32 pages for offsets and 64 pages for members (defaults are 8 and 16 respectively).
2. The lock optimisation improves performance by 5% at default SLRU sizes. The benchmark does not explicitly read MultiXactId members, but when it replaces one member set with another it has to read the previous set. I understand that one can construct a benchmark to demonstrate the dominance of any algorithm, and 5% on a synthetic workload is not a very big number. But it just makes sense to take a shared lock for reading.
3. Manipulating the cache size does not affect the benchmark at all. This is somewhat expected: the benchmark is designed to defeat the cache; either way, OffsetControlLock would not be stressed.
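Finding 1 can be reproduced with a toy model (not PostgreSQL's actual SLRU code): an LRU page buffer cycling over a working set slightly larger than the buffer count thrashes completely, while a buffer pool that fits the working set misses only during warm-up.

```python
from collections import OrderedDict

def slru_miss_rate(n_buffers, accesses):
    """Toy LRU page buffer: fraction of accesses that need a page read."""
    buffers = OrderedDict()   # page -> None, ordered by recency
    misses = 0
    for page in accesses:
        if page in buffers:
            buffers.move_to_end(page)          # mark as most recently used
        else:
            misses += 1
            if len(buffers) >= n_buffers:
                buffers.popitem(last=False)    # evict least recently used
            buffers[page] = None
    return misses / len(accesses)

# Round-robin over a 32-page working set: the default-sized pool of 8
# thrashes (every access evicts the page needed 24 accesses later),
# while 32 buffers hold the whole working set after warm-up.
stream = [i % 32 for i in range(10_000)]
thrash = slru_miss_rate(8, stream)
fits = slru_miss_rate(32, stream)
```

The cyclic access pattern is the worst case for LRU, which is exactly the behaviour a MultiXact working set larger than the SLRU exhibits.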
For our workload, I think we will just increase the SLRU sizes. But the patchset may be useful for tuning and as a performance optimisation of MultiXact.
Also, MultiXacts do not seem to be a very good fit for the SLRU design. I think it would be better to use a B-tree as the container, or at least make MultiXact members extendable in place (reserve some space when a multixact is created).
Currently, when we want to extend the set of locks on a tuple, we:
1. Iterate through all SLRU buffers for offsets to read the current offset (holding the offsets lock exclusively)
2. Iterate through all buffers for members to find the current members (holding the members lock exclusively)
3. Create a new members array with one more xid
4. Iterate through all cache members to check whether an identical member set already exists
5. Iterate over the offset buffers again for the write
6. Iterate over the member buffers again for the write
Obviously this does not scale well; we cannot keep increasing the SLRU sizes forever.
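The steps above can be summarized with a toy cost model (an assumption-laden sketch, not PostgreSQL code): every step that touches an SLRU scans all of its buffers under one exclusive lock acquisition, so the per-extension work grows linearly with the very buffer counts we would like to raise.

```python
# Toy cost model for the six steps above. Assumes each SLRU step does a
# linear scan of all that SLRU's buffers under one exclusive lock.

def extend_tuple_lock_cost(offset_buffers, member_buffers, cache_size):
    """Return (entries scanned, exclusive lock acquisitions) for one
    lock-set extension."""
    scans, excl_locks = 0, 0
    scans += offset_buffers; excl_locks += 1   # 1. read current offset
    scans += member_buffers; excl_locks += 1   # 2. read current members
    #                                            3. build array with +1 xid
    scans += cache_size                        # 4. probe cache for a match
    scans += offset_buffers; excl_locks += 1   # 5. write new offset
    scans += member_buffers; excl_locks += 1   # 6. write new members

    return scans, excl_locks
```

With the defaults (8 offset pages, 16 member pages) and a hypothetical 256-entry cache, one extension scans 304 entries and takes 4 exclusive locks; doubling the SLRU sizes to cure thrashing doubles the scan cost of every extension, which is why larger buffers eventually slow the benchmark down again.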
Thanks! I'd be happy to hear any feedback.
Best regards, Andrey Borodin.