Re: Dreaming About Redesigning SQL

From: "Bob Badour" <bbadour(at)golden(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Dreaming About Redesigning SQL
Date: 2003-10-21 22:52:18
Message-ID: 0J-dncfSRf9EJQiiU-KYuA@golden.net
Lists: pgsql-hackers

"Lauri Pietarinen" <lauri(dot)pietarinen(at)atbusiness(dot)com> wrote in message
news:bn4cca$dj0$1(at)nyytiset(dot)pp(dot)htv(dot)fi(dot)(dot)(dot)
> Anthony W. Youngman wrote:
>
> >In article <3F94BCBB(dot)7030001(at)atbusiness(dot)com>, Lauri Pietarinen
> ><lauri(dot)pietarinen(at)atbusiness(dot)com> writes
> >
> >
> >>>Okay. Give me a FORMULA that returns a time in seconds for your query.
> >>>
> >>>Let's assume I want to print a statement of how many invoices were sent
> >>>to a customer, along with various details of those invoices. My invoice
> >>>file is indexed by company/month, and we can reasonably assume that the
> >>>time taken to produce the statement is infinitesimal compared to the
> >>>time taken to retrieve the invoice data from disk. For MV
> >>>
> >>>T = (2 + N) * ST * 1.05
> >>>
> >>>Where T is the time taken to produce the report, N is the number of
> >>>invoices, and ST is the hard disk seek time.
> >>>
> >>>
> >>>
> >>First of all it is important to note that an important component of all
> >>modern SQL-DBMS's is the buffer pool (or cache), meaning that in a
> >>reasonably well tuned database you get very few disk I/O's, even when
> >>*writing* data into tables.
> >>
> >>
> >
> >Fine. But MV *doesn't* *need* much of a cache. Let's assume both SQL and
> >MV have the same amount of RAM to cache in - i.e. *not* *much*. I did
> >say the spec said "extract maximum performance from the hardware
> >available".
> >
> So what's wrong with getting a machine with lots of memory? How much
> does 2G of memory for an Intel-box cost nowadays? Is this some kind of
> new ultimate sport, trying to get along with as little memory as
> possible?
>
> >You're assuming that you can throw hardware at the problem - fine, but
> >that's not always possible. You might have already maxed out the ram,
> >you might have a "huge" database, you might be sharing your db server
> >with other programs (BIND really likes to chew up every available drop
> >of ram, doesn't it :-).
> >
> >I'm not saying that you shouldn't throw hardware at it, but what if you
> >can't?
> >
> >
> >>SQL-DBMS's also are very clever at using indexes, i.e. if they can find
> >>all necessary data from an index they will not even look at the table, so
> >>to speak.
> >>
> >>
> >
> >Same with MV
> >
> >
> >>And, even when presuming conservatively that there is no data in cache,
> >>depending on how the data is clustered, you will get more than one row
> >>per disk read (= 8K in most(?) systems).
> >>
> >>
> >
> >Same with MV
> >
> >
> >>I could now denormalise OrderDetail so that it contains cust_id also
> >>and cluster by cust_id (might cause you trouble down the road, if you
> >>can change the customer of an order), in which case, with 3 I/O's I
> >>would get
> >>- 8 customer rows
> >>- 16 order rows
> >>- 24 order detail rows (which would all apply to one customer)
> >>
> >>Now the amount of I/O's would depend on how many detail rows
> >>we have per customer.
> >>
> >>And, of course, because we are using sequential prefetch, we would be
> >>getting more than one I/O block (8?, 16?) per disk seek, so it's a hard
> >>comparison to make, but I suspect it would be about equal to your
> >>example.
> >>
> >>
> >
> >Except my example was an *average* case, and yours is a *best* case. Oh,
> >and my data is still normalised - I haven't had to denormalise it! AND I
> >haven't run an optimiser over it :-)
> >
> Are you hiding your optimiser behind the curtain? ;-)
>
> Well, if it is normalised, how easy is it for you to change the
> customer_id of an order? Anyway, if we stick to your example and even if
> we don't normalise using e.g. clustering features of Oracle, as Bob
> pointed out, we are getting at most the same number of I/O's. So, answer
> to your question: our formula is at least as good as yours.

Actually, Bob pointed out that we are getting at most 12.5% as many disk
head movements or I/O's. I'll take an 87.5% improvement any day.
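
To make the arithmetic explicit, here is a minimal sketch of the two
estimates as quoted above. The 8 ms seek time and the invoice count are
assumptions chosen so the numbers line up with the example, not
measurements:

# Minimal sketch comparing the two estimates quoted above (Python).
# The seek time and row counts are assumptions, not measurements.

SEEK_TIME = 0.008   # assumed average seek time: 8 ms
OVERHEAD = 1.05     # the 5% overhead factor from Wol's formula

def mv_report_time(n_invoices, seek_time=SEEK_TIME):
    """Wol's MV estimate: T = (2 + N) * ST * 1.05."""
    return (2 + n_invoices) * seek_time * OVERHEAD

def clustered_report_time(n_ios, seek_time=SEEK_TIME):
    """Lauri's clustered estimate: a fixed handful of I/O's (3 in his
    example) pulls in the customer, order and order-detail rows."""
    return n_ios * seek_time * OVERHEAD

n_invoices = 22                     # assumed so that 2 + N = 24 seeks
mv = mv_report_time(n_invoices)     # 24 head movements
rel = clustered_report_time(3)      # 3 head movements
print("MV:  %5.1f ms (%d head movements)" % (mv * 1000, 2 + n_invoices))
print("SQL: %5.1f ms (%d head movements)" % (rel * 1000, 3))
print("ratio: %.1fx, i.e. %.1f%% as many seeks" % (mv / rel, 100 * rel / mv))

Under those assumptions the clustered case needs 3 head movements against
24, which is where the 12.5% figure comes from.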

> >>Now, that was a *conservative* estimate, and we assumed that we did not
> >>have any rows lying around in the (global!) cache. As the size of the
> >>cache grows in proportion to the size of the total database we can
> >>assume less and less disk I/O.
> >
> >You're relying on the hardware to bale you out :-) We can do the same!
> >
> Well why don't you?

We achieved 8 times the performance with exactly the same hardware. What the
hell is this idiot talking about us relying on hardware? He is a moron. You
will do everyone a favour if you just bounce him off the bottom of your
killfile.
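
As an aside, the buffer-pool effect Lauri keeps appealing to is easy to
model. A back-of-the-envelope sketch in Python, with hit ratios and a
request count that are purely assumed for illustration:

# Expected physical reads for a given buffer-pool hit ratio.
# The hit ratios and request count below are illustrative assumptions.

def physical_reads(logical_reads, hit_ratio):
    """Expected number of reads that actually go to disk."""
    return logical_reads * (1.0 - hit_ratio)

logical = 24                        # assumed logical block requests
for hit in (0.0, 0.5, 0.9, 0.99):   # assumed buffer-pool hit ratios
    print("hit ratio %5.1f%%: ~%5.1f physical reads"
          % (100 * hit, physical_reads(logical, hit)))

The argument above is precisely about whether such a hit ratio may be
assumed at all, since it depends on RAM that may not be available.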

> >>Note also that the cache can be configured many ways: you can put
> >>different tables (or indexes) in different caches, and even change the
> >>size of the cache on the fly (you might want a bigger cache during
> >>evening and night when your batch programs are running), so you can rig
> >>your system to favour certain types of queries.
> >>
> >>I haven't even gone into the topic of using thick indexes so that table
> >>access can be totally avoided (= we are reading into memory only the
> >>interesting columns).
> >>
> >>Now, in your example, what if the product department comes along and
> >>wants to make a report with sales / product? What would be your formula
> >>in that case?
> >
> >I'm not quite sure what you're trying to do. I'll assume you want a
> >report of all invoices which refer to a given product. Assuming I've got
> >the relevant indices defined, I can simply read a list of invoices from
> >the product code index, a second list of invoices from the month index,
> >and do an intersect of the two lists.
> >
> I want a list with all products with corresponding total sales, read
> from order detail e.g.
>
> Hammer 10000$
> Nail 5000$
> Screw 1200$
>
> How many disk reads (or head movements)?
>
> >So again, T = (2+N) * ST * 1.05 where N is the number of invoices that
> >reference that product. And now ALL the invoice data has been retrieved
> >from disk to ram ...
> >
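
For reference, the cost of that product-sales report on the relational side
can be estimated from the figures already quoted (8K blocks, a couple of
dozen rows per block, sequential prefetch). A sketch in Python with assumed
row counts and prefetch depths:

# Rough estimate of head movements for a sequential scan of OrderDetail.
# Every figure here is an assumption used only for illustration.
import math

def scan_head_movements(n_rows, rows_per_block, prefetch_blocks=1):
    """Blocks needed for the scan, counting one head movement per
    prefetch group of consecutive blocks."""
    blocks = math.ceil(n_rows / rows_per_block)
    return math.ceil(blocks / prefetch_blocks)

n_detail_rows = 100000    # assumed size of the OrderDetail table
rows_per_block = 24       # assumed rows per 8K block, as in the example
for prefetch in (1, 8, 16):   # prefetch depths floated above (8?, 16?)
    print("prefetch %2d blocks: ~%d head movements"
          % (prefetch, scan_head_movements(n_detail_rows, rows_per_block,
                                           prefetch)))
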
> >>And: what if I was just reading customer-data. Would the same formula
> >>apply (= (2+N)*ST*1.05)?
> >
> >Nope. If I understand you correctly, you want attributes that belong to
> >the entity "customer", not the entity "invoice". T = ST * 1.05. (By the
> >way, billing and/or invoice address (for example) are invoice
> >attributes, not company attributes.)
> >
> No, I want you to give me a list of all your customers. How many disk
> reads?
>
> >>>But as I understand relational theory, such a question is completely
> >>>outside the scope of the theory. Seeing as it tries to divorce the
> >>>database logic from the practical implementation ...
> >>>
> >>The theory, indeed, does not say anything about buffer pools, but by
> >>decoupling logic from implementation we leave the implementor (DBMS) to
> >>do as it sees fit.
> >
> >>As DBMS technology advances, we get faster systems without having to
> >>change our programs.
> >
> >But with MV, if our database is too large for current technology, we
> >kick the shit out of relational for speed ...

This idiot is a fucking fool. He makes untrue assertions and actually
believes them.

> >Don't forget. You've already said that, if nothing is cached, my average
> >case exceeds your best.

No, actually, we did not. We already said that, assuming identical hardware
and caching, our average case exceeds his best case by a factor of eight.

If you are going to engage these ridiculously ignorant and stupid Pick
zealots, you must do a better job of identifying the horseshit they spout
with just about every word.

> >And my case is *already* assuming that the
> >system is seriously stressed and struggling ...
> >
> >
> >>When we design databases we can decouple logical planning from
> >>performance considerations, which, you must agree, are two separate
> >>issues.
> >>
> >>
> >>
> >>>And you know it's been proven that Huffman coding is the most efficient
> >>>compression algorithm? (Actually, it isn't - it's been proven it can't
> >>>be improved upon, which isn't the same thing...). Can you improve on
> >>>the formula I've just given you?

Yes, by a factor of eight as already demonstrated.

> >>>Given that if we could change the 1.05 to 1
> >>>then we can prove it can't be improved upon ...

Okay, let him prove it in spite of the factor of 8 improvement we already
achieved.

[remainder of Wol's unthinking, blind horseshit snipped]
