Re: Nasty VACUUM/bgwriter/segmentation bug

From: Russell Smith <mr-russ(at)pws(dot)com(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Nasty VACUUM/bgwriter/segmentation bug
Date: 2006-11-19 01:18:29
Message-ID: 455FB0E5.7020203@pws.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Tom Lane wrote:
> While working on fixing the recently reported hash-index problem,
> I was using a test build with a very small RELSEG_SIZE (128K),
> so that I could trigger the reported bug with a reasonably small
> amount of data. And I started finding some unexpected data corruption.
> I eventually reduced it to this test case:
>
> checkpoint;
> create table foo (f1 int);
> insert into foo select x from generate_series(1,100000) x;
> -- wait 30 seconds
> delete from foo;
> vacuum verbose foo;
> insert into foo select x from generate_series(1,100000) x;
> \q
> stop and restart postmaster, then
> vacuum verbose foo;
>
> This vacuum will generate a whole lot of
> WARNING: relation "foo" page 16 is uninitialized --- fixing
> WARNING: relation "foo" page 17 is uninitialized --- fixing
> ...
> and when the dust settles, most of the second batch of 100000 rows
> is gone.
>
> What is happening is that during that 30-second wait, the bgwriter is
> dumping out all the dirty pages, and acquiring open file references
> to each segment of table "foo" as it does so. The VACUUM then truncates
> "foo" back to zero size, since it contains no data after the DELETE,
> and then the second INSERT bulks it up again. The problem is that the
> bgwriter still has open file references to the deleted segments after
> the first one, and that's where it'll write the data if given a chance.
> So the updates disappear into the ether as far as any other process is
> concerned, for each segment except the first.
>
Does TRUNCATE suffer from the same issue?
> There's a rather indirect mechanism that's supposed to prevent similar
> problems between two backends: a file truncation is supposed to be
> associated with a forced relcache flush, and that causes an smgrclose(),
> so other backends will be forced to reopen the file(s) before they can
> do any more work with the truncated relation. On reflection I think
> I don't trust this though, because in the case where a backend writes a
> dirty buffer because it needs to reclaim a buffer, it doesn't try to
> open the relation at the relcache level (it cannot, because the rel
> might be in a different database). So it could use a stale smgr
> relation, same as the bgwriter. The bgwriter does not participate
> in shared cache inval, not having a relcache in the first place, and
> so this mechanism doesn't help it anyway.
>
> This is a fairly low-probability bug in real-world cases, because it
> could only happen when a relation is truncated and then re-expanded
> across a 1GB segment boundary. Moreover, because the bgwriter flushes
> all its open files after each checkpoint, the window for trouble only
> extends to the next checkpoint. But it definitely could happen, and
> it might explain some irreproducible corruption reports.
>
Regular imports that delete data or truncate relations would increase
this probability wouldn't they?
Autovac is also likely to run on that relation in the "wait" phase,
which other relations are being truncated by an import process.

> I think that the easiest fix might be to not remove no-longer-used
> segment files during a truncate, but simply reduce them to zero size
> rather than delete them. Then any open file pointers aren't
> invalidated. The only alternative I can see is to invent some new
> signaling mechanism to force closure of open files, but that seems
> ugly, complex, and perhaps subject to race conditions.
>
> Thoughts?
>
Seems reasonable from my lowly user point of view. Would there be a
requirement to remove the extra segments at any point in the future or
would they hang around on the disk forever?

Russell Smith
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>
>
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mike Rylander 2006-11-19 01:37:44 Re: Ontology on PostgreSQL - is there something?
Previous Message Tom Lane 2006-11-19 00:34:51 Nasty VACUUM/bgwriter/segmentation bug