RE: Plans for solving the VACUUM problem

From: "Mikheev, Vadim" <vmikheev(at)SECTORBASE(dot)COM>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: RE: Plans for solving the VACUUM problem
Date: 2001-05-19 00:08:07
Message-ID: 3705826352029646A3E91C53F7189E3201662C@sectorbase2.sectorbase.com
Lists: pgsql-hackers

> I have been thinking about the problem of VACUUM and how we
> might fix it for 7.2. Vadim has suggested that we should
> attack this by implementing an overwriting storage manager
> and transaction UNDO, but I'm not totally comfortable with
> that approach: it seems to me that it's an awfully large
> change in the way Postgres works.

I'm not sure we should implement an overwriting smgr at all.
I was, and still am, going to solve the space-reuse problem with the
non-overwriting one, though I'm sure we'll have to reimplement it
(> 1 table per data file, an on-disk FSM, etc).
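
To make the on-disk FSM idea concrete, here is a minimal sketch of
what a per-page entry might record - the structure and names are
hypothetical, just an illustration, not an actual design:

    /* Hypothetical on-disk FSM entry -- illustration only. */
    #include <stdint.h>

    typedef struct FSMEntry
    {
        uint32_t    relfilenode;    /* which relation */
        uint32_t    blocknum;       /* which page of that relation */
        uint16_t    freebytes;      /* usable free space on that page */
    } FSMEntry;

A vacuum pass would create or update such entries, and an inserting
backend would consult them before extending the file.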

> Second: if VACUUM can run in the background, then there's no
> reason not to run it fairly frequently. In fact, it could become
> an automatically scheduled activity like CHECKPOINT is now,
> or perhaps even a continuously running daemon (which was the
> original conception of it at Berkeley, BTW).

And the original authors concluded that the daemon was very slow at
reclaiming dead space, BTW.

> 3. Lazy VACUUM processes a table in five stages:
> A. Scan relation looking for dead tuples;...
> B. Remove index entries for the dead tuples...
> C. Physically delete dead tuples and compact free space...
> D. Truncate any completely-empty pages at relation's end.
> E. Create/update FSM entry for the table.
...
> If a tuple is dead, we care not whether its index entries are still
> around or not; so there's no risk to logical consistency.

What does this sentence mean? We canNOT remove a dead heap tuple until
we know that there are no index tuples referencing it, and your A, B, C
reflect this, so ..?

> Another place where lazy VACUUM may be unable to do its job completely
> is in compaction of space on individual disk pages. It can physically
> move tuples to perform compaction only if there are not currently any
> other backends with pointers into that page (which can be tested by
> looking to see if the buffer reference count is one). Again, we punt
> and leave the space to be compacted next time if we can't do it right
> away.

We could keep the share buffer lock (or add some other kind of lock)
until the tuple is projected - after projection we no longer need to
read the fetched tuple's data from the shared buffer, and the time
between fetching the tuple and projecting it is very short, so keeping
a lock on the buffer will not impact concurrency significantly.

Or we could register a cleanup callback function with the buffer, so
that bufmgr would call it when the refcount drops to 0.
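
A toy sketch of that second idea, assuming a made-up buffer structure
(none of these names are real bufmgr calls):

    /* Toy illustration of "run a cleanup callback when the buffer
     * refcount reaches zero" -- all names hypothetical. */
    #include <stdio.h>
    #include <stddef.h>

    typedef void (*CleanupFunc) (void *arg);

    typedef struct Buffer
    {
        int         refcount;
        CleanupFunc cleanup;        /* deferred work, e.g. compaction */
        void       *cleanup_arg;
    } Buffer;

    static void
    unpin(Buffer *buf)
    {
        /* last reference gone: now it's safe to move tuples around */
        if (--buf->refcount == 0 && buf->cleanup != NULL)
        {
            buf->cleanup(buf->cleanup_arg);
            buf->cleanup = NULL;
        }
    }

    static void
    compact_page(void *arg)
    {
        printf("compacting block %d\n", *(int *) arg);
    }

    int
    main(void)
    {
        int     blkno = 42;
        Buffer  buf = { 1, NULL, NULL };    /* pinned by some backend */

        /* VACUUM finds the page still pinned, so it defers the work: */
        buf.cleanup = compact_page;
        buf.cleanup_arg = &blkno;

        unpin(&buf);    /* the last unpin triggers the compaction */
        return 0;
    }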

> Presently, VACUUM deletes index tuples by doing a standard index
> scan and checking each returned index tuple to see if it points
> at any of the tuples to be deleted. If so, the index AM is called
> back to delete the tested index tuple. This is horribly inefficient:
...
> This is mainly a problem of a poorly chosen API. The index AMs
> should offer a "bulk delete" call, which is passed a sorted array
> of main-table TIDs. The loop over the index tuples should happen
> internally to the index AM.

I agree with others who think that the main problem of index cleanup
is reading all index data pages just to remove some index tuples. You
yourself mentioned partial heap scanning - so for each scanned part of
the table you'll have to read all index pages again and again - a very
good way to trash the buffer pool with big indices.
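
(The "bulk delete" loop itself would be simple enough: scan the index
once and, for each index tuple, binary-search the sorted dead-TID
array. A toy sketch, with stand-in types and hypothetical names:)

    /* Toy sketch of the per-index-tuple test inside a "bulk delete"
     * index scan -- not actual index AM code. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    typedef struct TID              /* stand-in for a heap ctid */
    {
        uint32_t block;
        uint16_t offset;
    } TID;

    static int
    tid_cmp(const void *a, const void *b)
    {
        const TID *x = a, *y = b;

        if (x->block != y->block)
            return x->block < y->block ? -1 : 1;
        if (x->offset != y->offset)
            return x->offset < y->offset ? -1 : 1;
        return 0;
    }

    /* true if this index tuple's heap TID is in the sorted dead array */
    static int
    tid_is_dead(TID tid, const TID *dead, size_t ndead)
    {
        return bsearch(&tid, dead, ndead, sizeof(TID), tid_cmp) != NULL;
    }

    int
    main(void)
    {
        TID dead[] = { {1, 3}, {2, 1}, {7, 5} };    /* sorted by VACUUM */
        TID probe = { 2, 1 };

        printf("%d\n", tid_is_dead(probe, dead, 3));    /* prints 1 */
        return 0;
    }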

Well, it's probably ok for a first implementation, and you'll win some
CPU with "bulk delete" - I'm not sure how much, though - but there is a
more significant issue with index cleanup if the table is not locked
exclusively: a concurrent index scan returns a tuple (and unlocks the
index page), heap_fetch reads the table row and finds that it's dead;
now the index scan *must* find its current index tuple to continue, but
a background vacuum could already have removed that index tuple =>
elog(FATAL, "_bt_restscan: my bits moved...");

Two ways: hold the index page lock until the heap tuple is checked, or
(rough schema) store info in shmem (just IndexTupleData.t_tid and a
flag) saying that an index tuple is in use by some scan, so that the
cleaner could change the stored TID (taking the one from the previous
index tuple) and set the flag to help the scan restore its current
position on return.
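
A minimal sketch of such a shmem entry - hypothetical, not an actual
PostgreSQL structure:

    /* Hypothetical per-scan entry in shared memory -- illustration. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct TID              /* stand-in for IndexTupleData.t_tid */
    {
        uint32_t block;
        uint16_t offset;
    } TID;

    typedef struct ScanPosition
    {
        TID     t_tid;      /* heap TID of the scan's current index tuple */
        bool    moved;      /* set by the cleaner when it removes that
                             * index tuple and substitutes the previous
                             * tuple's TID, so the scan knows to re-find
                             * its position on return */
    } ScanPosition;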

I'm particularly interested in discussing this issue because it must
be resolved for UNDO, and the chosen way will affect to what extent
we'll be able to implement dirty reads (the first way doesn't allow
implementing them in full - i.e. selects with joins - but is good
enough to resolve the RI constraints concurrency issue).

> There you have it. If people like this, I'm prepared to commit to
> making it happen for 7.2. Comments, objections, better ideas?

Well, my current TODO looks like this (ORDER BY PRIORITY DESC):

1. UNDO;
2. New SMGR;
3. Space reuse.

and at this point I cannot commit to anything about 3. So, why not
refine vacuum if you want it? I, personally, was never able to
convince myself to spend time on this.

Vadim
