Re: Savepoints

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: "Mikheev, Vadim" <vmikheev(at)SECTORBASE(dot)COM>
Cc: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Savepoints
Date: 2002-01-24 19:22:19
Message-ID: 200201241922.g0OJMJf13378@candle.pha.pa.us
Lists: pgsql-hackers


OK, I have had time to think about this, and I think I can put the two
proposals into perspective. I will use Vadim's terminology.

In our current setup, rollback/undo data is kept in the same file as
our live data. This data serves two purposes: first, rollback of
transactions (and perhaps subtransactions in the future), and second,
MVCC visibility for other backends while changes are being made.
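
Roughly, what I mean is something like this (a simplified sketch, not
the real HeapTupleHeaderData layout):

    #include <stdint.h>

    typedef uint32_t TransactionId;

    /*
     * Simplified sketch of a heap tuple header; the real
     * HeapTupleHeaderData has more fields.
     */
    typedef struct
    {
        TransactionId t_xmin;   /* transaction that inserted this version */
        TransactionId t_xmax;   /* transaction that deleted/updated it,
                                 * or 0 if it is still live */
        /* ... user data follows ... */
    } TupleVersion;

    /*
     * UPDATE stamps t_xmax on the old version and inserts a new version
     * into the same heap file; DELETE just stamps t_xmax.  Those
     * t_xmin/t_xmax fields are what give us both rollback (an aborted
     * xid's changes are simply never considered committed) and MVCC
     * visibility; nothing is removed until VACUUM does it.
     */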

So, it seems the real question is whether a database modification should
write the old data into a separate rollback segment and modify the heap
data, or just create a new row and require the old row to be removed
later by vacuum.

Let's look at this behavior without MVCC. In that case, if someone
tries to read a modified row, the reader blocks and waits for the
modifying backend to commit or roll back, and only then continues.
There is no reason for the waiting transaction to read the old data
in the rollback segment because it can't continue anyway.

Now, with MVCC, the reading backend has to go to the rollback segment
to get the original data value for that row.
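
Just to spell out what that test looks like in either scheme, the
visibility check is roughly this (a hugely simplified sketch that
ignores aborted and in-progress corner cases; none of these names are
real functions):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t TransactionId;

    typedef struct
    {
        TransactionId t_xmin;   /* inserting transaction */
        TransactionId t_xmax;   /* deleting/updating transaction, 0 if none */
    } TupleVersion;

    /*
     * Pretend every transaction older than our snapshot committed; a
     * real test also has to handle aborted and still-running xids.
     */
    static TransactionId snapshot_xmin = 100;

    static bool
    committed_before_snapshot(TransactionId xid)
    {
        return xid < snapshot_xmin;
    }

    /*
     * A version is visible if its inserter committed before our snapshot
     * and it was not deleted by a transaction that also committed before
     * our snapshot.  Under an overwriting smgr this same test is what
     * decides whether we must fetch the older image from the rollback
     * segment instead of using the overwritten heap row.
     */
    static bool
    version_visible(const TupleVersion *tup)
    {
        if (!committed_before_snapshot(tup->t_xmin))
            return false;
        if (tup->t_xmax != 0 && committed_before_snapshot(tup->t_xmax))
            return false;
        return true;
    }

    int
    main(void)
    {
        TupleVersion old_row = {42, 120};   /* updated by xid 120, after our snapshot */
        TupleVersion new_row = {120, 0};

        printf("old version visible: %d\n", version_visible(&old_row));  /* 1 */
        printf("new version visible: %d\n", version_visible(&new_row));  /* 0 */
        return 0;
    }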

Now, while rollback segments do help with cleaning out old UPDATE
rows, how do they improve DELETE performance? It seems a delete would
still just mark the row as expired, like we do now.

One objection I always had to rollback segments was that if I start a
transaction in the morning and walk away, none of the rollback
segments can be recycled. I was going to ask if we could force some
type of rollback segment compaction to keep old active rows and
delete rows no longer visible to any transaction. However, I now
realize that our VACUUM has the same problem: tuples with XID >=
GetOldestXmin() are not recycled, meaning we have this problem in our
current implementation too. (I wonder if our vacuum could be smarter
about knowing which rows are visible, perhaps by creating a sorted
list of XIDs and doing a binary search on the list to determine
visibility.)
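
For what it's worth, the binary-search part of that idea would be
trivial; something like this (just the lookup, not a complete
visibility rule, and none of these functions exist in the code today):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef uint32_t TransactionId;

    /* comparison callback for bsearch() over TransactionIds */
    static int
    xid_cmp(const void *a, const void *b)
    {
        TransactionId xa = *(const TransactionId *) a;
        TransactionId xb = *(const TransactionId *) b;

        return (xa > xb) - (xa < xb);
    }

    /* Is xid in the sorted list of currently running transactions? */
    static bool
    xid_is_running(TransactionId xid, const TransactionId *running, size_t n)
    {
        return bsearch(&xid, running, n, sizeof(TransactionId), xid_cmp) != NULL;
    }

    int
    main(void)
    {
        /* xid 90 is the morning transaction somebody walked away from */
        TransactionId running[] = {90, 150, 153};
        size_t        n = sizeof(running) / sizeof(running[0]);

        /*
         * Today vacuum only compares tuple xids against the single
         * GetOldestXmin() horizon (90 here).  With the sorted list it can
         * at least see that xid 120, although newer than 90, is no longer
         * running; deciding whether 90's snapshot still needs the tuple
         * would take more bookkeeping than this sketch shows.
         */
        printf("xid 120 running? %d\n", xid_is_running(120, running, n));  /* 0 */
        printf("xid 150 running? %d\n", xid_is_running(150, running, n));  /* 1 */
        return 0;
    }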

So, I guess the issue is: do we want to keep undo information in the
main table, or split it out into rollback segments? Certainly we
would have to eliminate the Oracle restriction that rollback segment
size is fixed at install time.

The advantage of a rollback segment is that hopefully we don't have
transactions reading through irrelevant undo information. The
disadvantage is that right now the undo information is grouped into
the table files, where a sequential scan can pick it up. (Index scans
of undo info are a performance problem currently.) We would have to
somehow efficiently access undo information once it is grouped into
the rollback segments; perhaps a hash based on relid would help here.
Another disadvantage is concurrency: when we start modifying heap
data in place, we have to prevent other backends from seeing that
modification until we have moved the old data to the rollback
segment.
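
To make the hash-on-relid idea concrete, I am picturing something
like this (purely hypothetical structures and names; nothing like
this exists in the code today):

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t Oid;

    #define RS_HASH_BUCKETS 1024

    /* Hypothetical: where a table's old row image went in the rollback segment. */
    typedef struct RollbackRecord
    {
        Oid                    relid;   /* table the old version belongs to */
        uint32_t               block;   /* heap block it came from */
        uint16_t               offset;  /* line pointer within that block */
        struct RollbackRecord *next;    /* bucket chain */
        /* ... old tuple image, or its location in the RS file ... */
    } RollbackRecord;

    static RollbackRecord *rs_hash[RS_HASH_BUCKETS];

    static unsigned int
    rs_bucket(Oid relid)
    {
        return relid % RS_HASH_BUCKETS;
    }

    /* Remember a new rollback record for a table. */
    static void
    rs_insert(RollbackRecord *rec)
    {
        unsigned int b = rs_bucket(rec->relid);

        rec->next = rs_hash[b];
        rs_hash[b] = rec;
    }

    /* Skip chain entries that hashed into the same bucket for other tables. */
    static RollbackRecord *
    rs_skip_to_rel(RollbackRecord *rec, Oid relid)
    {
        while (rec != NULL && rec->relid != relid)
            rec = rec->next;
        return rec;
    }

    /* First rollback record for a table; follow ->next via rs_skip_to_rel. */
    static RollbackRecord *
    rs_first_for_rel(Oid relid)
    {
        return rs_skip_to_rel(rs_hash[rs_bucket(relid)], relid);
    }

    int
    main(void)
    {
        RollbackRecord a = {16384, 7, 2, NULL};
        RollbackRecord b = {16385, 1, 5, NULL};
        RollbackRecord *rec;
        int            count = 0;

        rs_insert(&a);
        rs_insert(&b);

        /* a seqscan of relation 16384 walks just its own undo records */
        for (rec = rs_first_for_rel(16384); rec != NULL;
             rec = rs_skip_to_rel(rec->next, 16384))
            count++;

        printf("undo records for relid 16384: %d\n", count);   /* 1 */
        return 0;
    }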

I guess my feeling is that if we can get vacuum to happen
automatically, how is our current non-overwriting storage manager
really different from rollback segments?

One big advantage of rollback segments would be with repeated
updates: right now, if someone updates a row repeatedly, there are
lots of heap versions of the row that are difficult to shrink out of
the table, while if those versions were in the rollback segments we
could remove them more efficiently, and there would be only one heap
row.
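
In other words, the picture for a heavily updated row would be
roughly this (a hypothetical layout, just to show why removal gets
cheaper):

    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* Hypothetical: a prior version of the row, living in the rollback segment. */
    typedef struct OldVersion
    {
        TransactionId      xmin;    /* transaction that created this version */
        struct OldVersion *older;   /* next older version, further into the RS */
        /* ... old tuple data ... */
    } OldVersion;

    /*
     * With rollback segments the heap keeps exactly one copy of the row
     * no matter how many times it is updated; the superseded versions
     * sit in the RS chain and can be thrown away in bulk once no
     * snapshot needs them.  Today every one of those versions is a
     * separate heap tuple (plus index entries) that VACUUM has to find
     * and remove individually.
     */
    typedef struct HeapRow
    {
        TransactionId xmin;         /* creator of the current version */
        OldVersion   *undo_chain;   /* prior versions, newest first */
        /* ... current tuple data ... */
    } HeapRow;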

How is recovery handled with rollback segments? Do we write old and new
data to WAL? We just write new data to WAL now, right? Do we fsync
rollback segments?

Have I outlined this accurately?

---------------------------------------------------------------------------

Mikheev, Vadim wrote:
> > > How about: use overwriting smgr + put old records into rollback
> > > segments - RS - (you have to keep them somewhere till TX's running
> > > anyway) + use WAL only as REDO log (RS will be used to rollback TX'
> > > changes and WAL will be used for RS/data files recovery).
> > > Something like what Oracle does.
> >
> > I am sorry. I see what you are saying now. I missed the words
>
> And I'm sorry for missing your notes about storing relid+tid only.
>
> > "overwriting smgr". You are suggesting going to an overwriting
> > storage manager. Is this to be done only because of savepoints.
>
> No. One point I made a few months ago (and never got objections)
> is - why to keep old data in data files sooooo long?
> Imagine long running TX (eg pg_dump). Why other TX-s must read
> again and again completely useless (for them) old data we keep
> for pg_dump?
>
> > Doesn't seem worth it when I have a possible solution without
> > such a drastic change.
> > Also, overwriting storage manager will require MVCC to read
> > through there to get accurate MVCC visibility, right?
>
> Right... just like now non-overwriting smgr requires *ALL*
> TX-s to read old data in data files. But with overwriting smgr
> TX will read RS only when it is required and as far (much) as
> it is required.
>
> Simple solutions are not always the best ones.
> Compare Oracle and InterBase. Both have MVCC.
> Smgr-s are different. What RDBMS is more cool?
> Why doesn't Oracle use more simple non-overwriting smgr
> (as InterBase... and we do)?
>
> Vadim
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
