Re: Contrib -- PostgreSQL shared variables

From: pgsql(at)mohawksoft(dot)com
To: "Heikki Linnakangas" <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Contrib -- PostgreSQL shared variables
Date: 2004-08-29 13:24:59
Message-ID: 17189.24.91.171.78.1093785899.squirrel@mail.mohawksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-patches

> On Sat, 28 Aug 2004 pgsql(at)mohawksoft(dot)com wrote:
>
>>
>>> I don't see how this is different from "CREATE TABLE shared_variables
>>> (name
>>> VARCHAR PRIMARY KEY, value VARCHAR)" and
>>> inserting/updating/deleting/selecting from that. Perhaps these are
>>> per-session shared variables? IN which case, what is the utility if
>>> sharing
>>> them across shared memory?
>>>
>>> - --
>>> Jonathan Gardner
>>
>> Well, the issues you don't see is this:
>>
>> What if you have to update the variables [n] times a second?
>>
>> You have to vacuum very frequently. If you update a variable a hundred
>> times a second, and vacuum only once every minute, the time it takes to
>> update ranges from reading one row from the database to reading 5999
>> dead
>> rows to get to the live one. Then you vacuum, then you are back to one
>> row
>> again.
>
> I think the right approach is to tackle that problem instead of working
> around it with a completely new variable mechanism.

You assume there is a problem with the existing mechanism that needs
fixing. There isn't. PostgreSQL does the right thing already. The issue
is that the class of problem we are trying to solve does not fit the MVCC
methodology very well.

I've given this example a number of times:

Consider this query:

select sum(col) from table;

If this is an active system, with many inserts and updates a second, and
table has a couple million rows, there is NO way to get any meaningful
answer out of this query. It is only a snapshot, not much better than an
estimate.

The shared variables are a way of expressing, "I know this is just an
estimate, but I need to do it FAST."

A LOT of websites can benefit from this, like a quick "Number of people
online"

>
> I've been playing with the idea of a quick vacuum that runs through the
> shmem buffers. The idea is that since the pages are already in memory,
> the vacuum runs very quickly. Vacuuming the hot pages frequently
> enough should avoid the problem you describe. It also saves I/O in the
> long run since dirty pages are vacuumed before they are written to
> disk, eliminating the need to read in, vacuum and write the same pages
> again later.

That's fine, but how does it affect performance?

>
> The problem is of course that to vacuum the heap pages, you have to make
> sure that there is no references to the dead tuples from any indexes.

Exactly.

>
> The trivial case is that the table has no indexes. But I believe that
> even if the table has ONE index, it's very probable that the corresponding
> index pages of the dead tuples are also in memory, since the tuple was
> probably accessed through the index.
>
> As the number of indexes gets bigger, the chances of all corresponding
> index pages being in memory gets smaller.
>
> If the "quick" vacuum or opportunistic vacuum as I call it is clever
> enough to recognize that there is a dead tuple in memory, and all the
> index pages that references are in memory too, it could reliably vacuum
> just those tuples without scanning through the whole relation and without
> doing any extra I/O.
>
> I've written some code that implements the trivial case of no indexes. I'm
> hoping to extend it to handle the indexes too if I have time. Then we'll
> see if it's any good. I've attached a patch with my current ugly
> implementation if you want to give it a try.

This is cool, but is it actually "more" work over the long run than
vacuum? I mean, a wile ago, vacuum had to lock the tables. This
non-locking vacuum is pretty cool and takes away almost all objections
that I can think about with PostgreSQL.

>
>> On top of that, all the WAL logging that has to take place for each
>> "transaction."
>
> How is that a bad thing? You don't want to give up ACID do you?

Actually, yes, yes I do. Not for everything, but for classes of problems
that do not need it.

I've some experience with this. I have a session manager for PHP. I wrote
it because databases fail miserably when what you want to do is have
SQL-esque information that neeeds to be FAST and can go away without
worrying too much about it.

Think about a website shopping cart. The session and the things in the
cart don't matter after 10 or 15 minutes. Most of the time, it will just
be abandoned. The only time it is important is at checkout. At which point
it is transfered to an ACID compliant system.

My session manager on a modern dual PIII can handle 8000 full "get vars,
set vars" cycles a second. This is more or less impossible with a SQL
database without spending a LOT of money on clustering. Mine can do it
with a single standard x86 Linux server box.

Some information can go away without worry. Paying for protecting it with
CPU and disk I/O can be VERY expensive. PostgreSQL does what it does very
well. Sometimes what it does can get in the way of using it.

>
> - Heikki
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2004-08-29 14:27:25 Re: psql questions: SQL, progname, copyright dates
Previous Message pgsql 2004-08-29 13:06:03 Re: Contrib -- PostgreSQL shared variables

Browse pgsql-patches by date

  From Date Subject
Next Message Peter Eisentraut 2004-08-29 16:29:51 Re: New Language Translation
Previous Message Jan Wieck 2004-08-29 12:38:36 Re: log_filename_prefix --> log_filename + strftime()