Re: experiments in query optimization

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Faheem Mitha <faheem(at)email(dot)unc(dot)edu>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-performance(at)postgresql(dot)org
Subject: Re: experiments in query optimization
Date: 2010-03-31 15:04:13
Message-ID: y2r603c8f071003310804g31081375l5a61624927a4bfb@mail.gmail.com
Lists: pgsql-performance

On Wed, Mar 31, 2010 at 6:10 AM, Faheem Mitha <faheem(at)email(dot)unc(dot)edu> wrote:
>
> [If Kevin Grittner reads this, please fix your email address. I am
> getting bounces from it.]
>
> On Tue, 30 Mar 2010, Robert Haas wrote:
>
>> On Tue, Mar 30, 2010 at 12:30 PM, Faheem Mitha <faheem(at)email(dot)unc(dot)edu>
>> wrote:
>>>
>>> Sure, but define sane setting, please. I guess part of the point is
>>> that I'm trying to keep memory low, and it seems this is not part of
>>> the planner's priorities. That is, it does not take memory usage into
>>> consideration when choosing a plan. If that is wrong, let me know, but
>>> that is my understanding.
>>
>> I don't quite understand why you're confused here.  We've already
>> explained to you that the planner will not employ a plan that uses
>> more than the amount of memory defined by work_mem for each sort or
>> hash.
>
>> Typical settings for work_mem are between 1MB and 64MB.  1GB is enormous.
>
> I don't think I am confused. To be clear, when I said "it does not take
> memory usage into consideration" I was talking about overall memory usage.
> Let me summarize:
>
> The planner will choose the plan with the minimum total cost, with the
> constraint that the amount of memory used for each of certain steps is
> less than work_mem. In other words, with k such steps it can use at most
>
> k(plan)*work_mem
>
> memory, where k(plan) denotes that k is a function of the plan. (I'm
> assuming here that memory is not shared between the different steps.)
> However, k(plan)*work_mem is not itself bounded. I fail to see how
> reducing work_mem significantly would help me. This would mean that the
> current plans I am using would likely be ruled out, and I would be left
> with plans which, by definition, would have larger cost and so longer
> run times. The current runtimes are already quite long - for the PED
> query, the best I can do with work_mem = 1GB is 2 1/2 hrs, and that is
> after splitting the query into two pieces.
>
> I might actually be better off *increasing* the memory, since then the
> planner would have more flexibility to choose plans where the individual
> steps might require more memory, but the overall memory sum might be lower.

OK, your understanding is correct.
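
The per-node cap is easy to see in EXPLAIN ANALYZE output. A minimal
sketch (the table big_t and its column c are made up for illustration,
not taken from your schema):

    -- hypothetical table/column names: big_t, c
    SET work_mem = '1MB';
    EXPLAIN ANALYZE SELECT * FROM big_t ORDER BY c;
    -- the Sort node reports e.g. "Sort Method: external merge  Disk: ...",
    -- i.e. the sort spilled to disk rather than exceed work_mem

    SET work_mem = '256MB';
    EXPLAIN ANALYZE SELECT * FROM big_t ORDER BY c;
    -- with enough work_mem the same sort stays in memory:
    -- "Sort Method: quicksort  Memory: ..."

Each sort or hash node is bounded this way, but, as you say, nothing
bounds the sum over all such nodes in a plan.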

>>>>>> You might need to create some indices, too.
>>>>>
>>>>> Ok. To what purpose? This query picks up everything from the
>>>>> tables and the planner does table scans, so conventional wisdom,
>>>>> and indeed my experience, says that indexes are not going to be
>>>>> so useful.
>>>>
>>>> There are situations where scanning the entire table to build up a
>>>> hash table is more expensive than using an index.  Why not test it?
>>>
>>> Certainly, but I don't know what you and Robert have in mind, and I'm not
>>> experienced enough to make an educated guess. I'm open to specific
>>> suggestions.
>>
>> Try creating an index on geno on the columns that are being used for the
>> join.
>
> Ok, I'll try that. I guess the cols in question on geno are idlink_id and
> anno_id. I thought that I already had indexes on them, but no. Maybe I had
> indexes, but removed them.
>
> If I understand the way this works, if you request, say, an INNER JOIN,
> the planner can choose different ways/algorithms to do this, as in
> http://en.wikipedia.org/wiki/Join_(SQL)#Nested_loops . It may choose a
> hash join, or a nested loop join, or something else, based on cost. If
> the indexes don't exist, that may make the nested loop join more
> expensive, and so tip the balance in favor of using a hash join.
> However, I have no way to control which option it chooses, short of
> disabling, e.g., the hash join option, which is not an option for
> production usage anyway. Correct?

Yep.
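
To make both ideas concrete, here is a minimal sketch using the names
mentioned upthread (geno, idlink_id, anno_id); the idlink table and its
id column are guesses on my part, so adjust everything to the real
schema:

    -- index the geno join columns discussed above
    CREATE INDEX geno_idlink_id_idx ON geno (idlink_id);
    CREATE INDEX geno_anno_id_idx ON geno (anno_id);
    -- or a single composite index, if the join always uses both columns:
    -- CREATE INDEX geno_idlink_anno_idx ON geno (idlink_id, anno_id);

    -- for testing only, never in production: turn hash joins off for
    -- this session and see which plan the planner falls back to
    SET enable_hashjoin = off;
    EXPLAIN ANALYZE
        SELECT *
        FROM geno g
        JOIN idlink i ON g.idlink_id = i.id;  -- hypothetical join
    RESET enable_hashjoin;

If the fallback plan's estimated cost is close to the hash join's, the
new indexes are worth a second look; RESET puts the session back to
normal afterwards.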

...Robert
