From: Floris Van Nee <florisvannee(at)Optiver(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: non-HOT update not looking at FSM for large tuple update
Date: 2021-02-24 14:44:30
Message-ID: 6e263217180649339720afe2176c50aa@opammb0562.comp.optiver.com
Lists: pgsql-hackers

Hi hackers,

Recently we found a table that was slowly but consistently increasing in size. The table has a low fillfactor set and is updated very frequently. As expected, almost all updates are HOT updates, but for some of the non-HOT updates it always chose a new page rather than reusing an existing empty page. This led to steady growth in the table size (and a steady growth in the number of empty pages in the table).

I've managed to create a very simple example that reproduces the problem (the original problem occurred on 12.4, but I've tested this example on latest master). It only occurs for updates where the new tuple is larger than the free space that the fillfactor setting would normally leave on a page. In real life, only a small fraction of the updates to a table would be this large, of course, but in this example every update will be.

Create a table with a low fillfactor and insert one row into it. Note that, in this case, the row we're inserting is by itself larger than the maximum space the fillfactor allows to be used on a page.

create table t1 (a int primary key, b text) with (fillfactor=10);
insert into t1 select 1, (select string_agg('1', '') from generate_series(1,1000)); -- 1000 byte text field
vacuum t1;

postgres=# select * from pg_freespace('t1');
 blkno | avail
-------+-------
     0 |  7104
(1 row)

This looks alright - there's 1 page, and the available space is indeed roughly 1000 bytes less than the 8192-byte page size, because of our tuple and the page header.
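
For what it's worth, the exact number checks out if I did the accounting right (assuming the default 8 kB block size): the actual free space is 8192 minus the 24-byte page header, one 4-byte line pointer and the ~1032-byte tuple (24-byte tuple header, 4-byte int, ~1004-byte text datum), and the FSM then rounds that down to its 32-byte granularity:

select 8192 - 24 - 4 - 1032 as actual_free,              -- 7132
       (8192 - 24 - 4 - 1032) / 32 * 32 as fsm_reported; -- 7104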

Now, in a different backend, initiate a longer query.

select pg_sleep(600); -- just sleep 600 seconds so that we have enough time to do some updates during this
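
As an optional sanity check from a third session, you can verify that the sleeping backend pins an xmin; that is what prevents the dead row versions from being pruned away during the updates:

select pid, backend_xmin
from pg_stat_activity
where query like '%pg_sleep%' and pid <> pg_backend_pid();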

Then, in the original backend, update the tuple 7 times.

-- execute this 7 times
update t1 set b=(select string_agg((random()*9)::int::text, '') from generate_series(1,1000)) where a=1;
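
For convenience, the seven updates can also be wrapped in a single DO loop; for this repro the end result should be the same, since the sleeping snapshot keeps all intermediate versions alive either way:

do $$
begin
  for i in 1..7 loop
    update t1
    set b = (select string_agg((random()*9)::int::text, '')
             from generate_series(1,1000))
    where a = 1;
  end loop;
end $$;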

Cancel the pg_sleep call, then execute:

vacuum t1; -- removes the dead rows and updates the FSM

postgres=# select * from pg_freespace('t1');
 blkno | avail
-------+-------
     0 |  8128
     1 |  7104
(2 rows)

This still looks OK. There's an extra page, because a total of 8 tuples needed to be kept alive for the pg_sleep query. These didn't fit on one page, so a new page was created.
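
The arithmetic backs that up (again assuming 8 kB blocks): eight ~1032-byte tuple versions plus their 4-byte line pointers need about 8 * (1032 + 4) = 8288 bytes, more than the ~8168 bytes usable on a single page:

select 8 * (1032 + 4) as bytes_needed, -- 8288
       8192 - 24      as bytes_usable; -- 8168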

Now, repeat it (the pg_sleep, update 7 times, cancel the pg_sleep and vacuum).

postgres=# select * from pg_freespace('t1');
 blkno | avail
-------+-------
     0 |  8128
     1 |  8128
     2 |  7104
(3 rows)

This does not look good anymore. The tuple was on page 1, so at first there were several HOT updates on page 1. Then, when page 1 was full, it needed to find another page to put the new tuple version on. It did not consider page 0, but instead created a new page 2.

Repeating this process would create a new page each time, never reusing the empty old pages.
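
A convenient way to watch the growth between iterations:

select pg_relation_size('t1') / current_setting('block_size')::int as pages;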

The reason it does not consider page 0 is because of this piece of code in function RelationGetBufferForTuple in hio.c:

    /* Compute desired extra freespace due to fillfactor option */
    saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                   HEAP_DEFAULT_FILLFACTOR);
    ...
    if (len + saveFreeSpace > MaxHeapTupleSize)
    {
        /* can't fit, don't bother asking FSM */
        targetBlock = InvalidBlockNumber;
        use_fsm = false;
    }

The problem here is twofold. First, for any non-HOT update where the new tuple is larger than the free space reserved by the fillfactor, the FSM will not be used; a new page will be chosen instead. This seems to rely on the false assumption that every existing page has at least one tuple on it.
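
Plugging in this example's numbers makes the condition concrete (assuming 8 kB blocks: MaxHeapTupleSize = 8192 - 24 = 8168, and saveFreeSpace = 8192 * 90 / 100 = 7372 for fillfactor=10):

select 1032 + 7372 as len_plus_savefreespace, -- 8404
       8192 - 24   as max_heap_tuple_size;    -- 8404 > 8168, so use_fsm = false
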
Second, and this is a bit trickier: even if we did ask the FSM for a page with MaxHeapTupleSize of free space, it wouldn't find anything. That is because the FSM tracks free space excluding any unused line pointers. In this example, look at block 0:

postgres=# select * from page_header(get_raw_page('t1', 0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/16D75A0 |        0 |     5 |    52 |  8192 |    8192 |     8192 |       4 |         0
(1 row)

postgres=# select * from heap_page_items(get_raw_page('t1', 0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+--------
  1 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  2 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  3 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  4 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  5 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  6 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  7 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
(7 rows)

There are 7 line pointers on this page, consuming 28 bytes. Together with the 24-byte page header, that makes lower = 52. However, all line pointers are unused, so the page really is empty. The FSM does not see the page as empty though, as it only looks at upper - lower.
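
This can be verified directly (if I read freespace.c correctly): the raw free span is upper - lower = 8192 - 52 = 8140 bytes, PageGetFreeSpace additionally reserves room for one line pointer (8140 - 4 = 8136), and the FSM rounds down to its 32-byte granularity, which gives the 8128 we saw earlier:

select "upper" - "lower" as free_span -- 8140
from page_header(get_raw_page('t1', 0));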

When asking the FSM for slightly less space (MaxHeapTupleSize - 50, for example), it does find the free pages. I've confirmed that with such a hack the table no longer grows indefinitely. However, the number 50 is obviously rather arbitrary, as the required slack depends on the number of unused line pointers on a page, so that's not a proper way to fix things.

In any case, the behavior feels like a bug to me, but I don't know what the best way would be to fix it. Thoughts?

-Floris
