Re: vacuum on table1 skips rows because of a query on table2

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Virender Singla <virender(dot)cse(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum on table1 skips rows because of a query on table2
Date: 2019-10-28 17:00:47
Message-ID: CA+Tgmobgc3o+emMH3MXGxxg884j_V+kTwQ9sbJBRb-5d2OSyLw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Oct 26, 2019 at 1:44 PM Virender Singla <virender(dot)cse(at)gmail(dot)com> wrote:
> If long-running transaction is "read committed", then we are sure that any new query coming
> (even on same table1 as vacuum table) will need snapshot on point of time query start and not the time transaction
> starts (but still why read committed transaction on table2 cause vacuum on table1 to skip rows).

I wish that this argument were completely correct, but it isn't,
because the current query could involve a function written in some
procedural language (or in C) which could do anything, including
accessing tables that the query hasn't previously touched. It could be
that the function will only be called towards the end of the current
query's execution, or it could be that it's going to be called
multiple times and does different things each time.

Now, this is pretty unlikely and most queries don't behave anything
like that. They do things like "+" or "coalesce" which don't open new
tables. There are contrary examples, though, even among functions
built into core, like "table_to_xmlschema", which takes a relation OID
as an argument and thus may open a new relation each time it's called.
If we had some way of analyzing a query and determining whether it
uses any functions or operators that open new tables, then this kind
of optimization might be possible, but we don't.

However, even if we did have such infrastructure, it wouldn't solve
all of our problems, because vacuum would have to know which sessions
were running queries that might open new tables and which were running
queries that won't open new tables -- and among the latter, it would
need to know which tables those sessions already have open. We could
make the former available via a new shared memory flag and the latter
could, perhaps, be deduced from the lock table, which is already
shared. However, if we did all that, VACUUM would potentially have to
do significantly more work to deduce the xmin horizon for each table
that it wanted to process.

Even given all that, I'm moderately confident that something like this
would benefit a lot of people. However, it would probably hurt some
people too, either because the overhead of figuring out that the
current query won't lock any more relations, so that we can advertise
that fact in shared memory, or because of the increased overhead of
figuring out the xmin horizon for a table to be vacuumed. Users with
short-running queries and small tables would be the most likely to be
harmed. On the other hand, for users with giant tables, even more
aggressive approaches might pay off - e.g. recompute the xmin horizon
every 1GB or so, because it might have advanced, and the effort to
recheck that might pay off by allowing us to vacuum more stuff sooner.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Juan José Santamaría Flecha 2019-10-28 17:13:58 Re: BUG #15858: could not stat file - over 4GB
Previous Message Robert Haas 2019-10-28 16:40:46 Re: [Proposal] Global temporary tables