Implementing and Experimenting with a Full Disjunctions Operator.

From: Tzahi Fadida <tzahi_ml(at)myrealbox(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Implementing and Experimenting with a Full Disjunctions Operator.
Date: 2005-01-03 20:20:33
Message-ID: 002101c4f1d1$b0d27d20$0b00a8c0@llord
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi All,
As part of my thesis I need to implement a new algorithm for
Full-Disjunctions and 2 older ones.
A short explanation of what Full-Disjunction is, is that it comes to
solve what A natural outer join
usually can't do when you have more than 2 relations.
The main goal is to retrieve maximal answers from a set of relations.
The natural representation of the operator that is usually depicted in
the literature is FD(r1,...,rN).
(r=relation/table)
Ullman's algorithm uses several natural outer joins so there is no
problem there but our
algorithm must be run internally at the server since it uses no existing
operators but it is
also not limited to the gamma-acyclic restriction of Ullman's algorithm.

I have already read most of the development documentations, faqs,
presentations, listened on
this mailing list, etc... I already compiled a dynamically loaded
library with a function and ran it successfully.
The research part of implementing the algorithm is theoretical and
experimentation.
After looking around in the code and seeing how the SPI generally works
I have several concerns
(the first one is the most crucial to me):
1) As part of the experimentation I need to know exactly how many blocks
have been read when
the algorithm ran. I need complete control over the process to run my
simulations.
I see that there are functions like heap heap_getnext() heap_fetch()
SearchSysCache().
Our algorithm usually read sequentially the relations and I don't see
how to read complete blocks and
count these blocks. In addition temporary queues that must be held in
memory will be needed to be dumped to
disk at various times (because of their size) and fetched. Is there a
way to control this process
with accuracy and calculate the exact disk writes?
2) As part of the theoretical work and experimentation we want to load
blocks of relation rows to the
main memory and cache them using our techniques. Is there a way to
control the memory blocks so
they won't be swapped. In addition, is there a way to get a specific
size of memory so we can
plan our operator running path. I see that palloc return's to me a
chunck of memory but I don't know
in advance how much is available to me (aside from polling for it).
3) When outputting the results as a set of records, I cannot know in
advance the type of temporary
table that will come out just like a subquery like (select * from
relationA,relationB); Is there a problem
outputting this kind of table?
4) When inputting the various tables to the operator I understand that
the function is limited to a fixed
number of arguments. Is there a way to circumvent that or would I need
to use an ugly ARRAY construct.

Obviously there are better ways than a dynamically loaded library
function, so after we study the algorithm
I don't think there should be any problem integrating it to postgreSQL,
of course if it will be good enough :)

Thank you.

Regards,
tzahi.

* - * - *
Itzhak Fadida
MSc Student
Information System Engineering Area
Faculty of Industrial Engineering & Management
Technion - Israel Institute of Technology
Technion City, Haifa, Israel 32000
Technion Email: Tzahi(at)TX(dot)Technion(dot)ac(dot)il
Alternative Email: TzahiFadida(at)MyRealBox(dot)com
* - * - * - * - * - * - * - * - * - * - * - * - * - * - *

WARNING TO SPAMMERS: see at
http://members.lycos.co.uk/my2nis/spamwarning.html

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2005-01-03 20:29:36 Re: [HACKERS] Bgwriter behavior
Previous Message Andrew Dunstan 2005-01-03 20:13:36 [Fwd: Re: postgres+tcl on cygwin]