Re: Bidirectional hard joins (fwd)

From: Hannu Krosing <hannu(at)tm(dot)ee>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ivan Panchenko <ivan(at)xray(dot)sai(dot)msu(dot)ru>, Teodor Sigaev <teodor(at)stack(dot)net>
Subject: Re: Bidirectional hard joins (fwd)
Date: 2002-04-04 15:31:47
Message-ID: 1017934312.12446.31.camel@taru.tm.ee
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 2002-04-04 at 14:17, Oleg Bartunov wrote:
Subject: Bidirectional hard joins
>
> Hi,
>
>
> Here we propose some essential improvement of postgreSQL functionality,
> which may provide a great perfomance increase.
>
> 1. Problem
>
> The fastest way to find and fetch a record from a table is to
> perform a SELECT ... WHERE record.id = value.
> Probably, an index scan would be performed for this SELECT.
>
> Such index scan seems to be fast, but there are some cases where it may
> appear too slow. The most evident case is the case of a sub-query, which
> can arise as a result of a join or a nested select statement.
>
> If it were possible to store direct references to database
> records in the tables, joins could be implemented in much more effective
> way.
>
> 2. Possible soultion
>
> Creating a datatype which stores direct reference
> to the record (i.e., physical location of the tuple) is only a part of the
> solution.

The tid type does exaclty what is needed.

>
> When a record that is referenced is updated its physical location can be
> changed, so the references to it should be updated. To make this
> possible, the referenced record should remember all the references to
> itself. Thus, we consider the direct tuple references as bidirectional
> links, or "bidirectional hard joins".
>
> These "hard joins" are in some sense similar to hard links in a
> filesystem (in this analogy, classic joins are like symbolic links).
>
> Philosophically, this means a convergence between indexes and tables: a
> table behaives like an index for an other table.
>
> Obviously, this is is a nonrelational feature, and it requires some
> special SQL syntax. Below we provide some examples for clarification of
> the use of the proposed feature.

Or we can just use tid's and ordinary joins to make it a relational
feature.

IIRC this has been discussed on this list a few months ago. I'm not sure
if bi-directional tid usage was discussed, but I can't see how to use
them efficiently in a non-overwrite storage manager.

...

> 4. Performance
>
> When direct joins are used in select statements, they can strongly
> increase performance.
>
> Let us examine the query plan of the request ("Find all Bob's
> children") from the example above in the present day postgres.
> create table man (id SERIAL,name text);
> create table child (id SERIAL,name text, parent_id int4 references man(id));
> .. populate the tables ... and create indexes...
> explain select child.name from child, man
> where child.parent_id = man.id
> and man.name = 'Bob Scott';
>
> Nested Loop
> -> Index Scan using man_n on man
> -> Index Scan using child_par on child
>
>
>
> In a hypotetical postgres with hard joins it could be:
>
> Nested Loop
> -> Index Scan using man_n on man
> -> Direct retrieval on child
>
> I.e., the for each retrieved "man" record we retrieve the "child" records
> directly using hard join. The real overhead for this operation should be
> neglible in comparison with index scan.

OTOH, if index is in memory and the retrieved tuple is not then the
_speed_difference_ could be neglible.

> Using the hard joins require some additional overhead in updates. In fact,
> after updating the record which takes part in such join, the references
> to this record in the other records should be also updated.

And this should be in a non-overwriting way. If we just do a standard
UPDATE, causing a new heap record to be added this will result in a
circle as then the original records references are not valid anymore and
so also need to be updated and so on ...

> This operation
> is not essentially new for postgres as similar things are done with
> indexes when an indexed record is updated. Hence, the overhead for updates
> is not greater than the overhead for updating indexes.

AFAIK indexes are not "updated" but a new index entry is added as the
old tuple may be still visible to some other transaction.

> 5. Implementation and conclusion
>
> Effective implementing of hard joins requires hard changes to postgres,
> most serious of them probably in the executor, where a new method "fetch
> record by reference" should be added in addition to "index scan" and "seq
> scan". Also the optimizer should be taught to deal with this.
>
> The update support is not so hard as it is similar to the updating of
> indexes.
>
> Though the implementation of such hard joins is really a complicated task,
> the performance it brings should be tremendous, so we consider discussing
> this important.

Depending on usage the performance degradation can also be tremendous,
as a simple update can trigger an avalance of referencing tid updates
...

--------------
Hannu

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2002-04-04 15:34:42 Re: Changing column types...
Previous Message Tom Lane 2002-04-04 15:25:10 Re: timeout implementation issues