Re: experiments in query optimization

From: Faheem Mitha <faheem(at)email(dot)unc(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: experiments in query optimization
Date: 2010-03-29 20:22:05
Message-ID: alpine.DEB.2.00.1003300141290.13883@orwell.homelinux.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Mon, 29 Mar 2010, Robert Haas wrote:

> On Mon, Mar 29, 2010 at 2:31 PM, Faheem Mitha <faheem(at)email(dot)unc(dot)edu> wrote:
>>> It's not really too clear to me from reading this what specific
>>> questions you're trying to answer.
>>
>> Quote from opt.{tex/pdf}, Section 1:
>>
>> "If I have to I can use Section~\ref{ped_hybrid} and
>> Section~\ref{tped_hybrid}, but I am left wondering why I get the performance
>> I do out of the earlier versions. Specifically, why is
>> Section~\ref{ped_bigjoin} so much slower than Section~\ref{ped_trunc}, and
>> why does the memory usage in Section~\ref{ped_phenoout} blow up relative to
>> Section~\ref{ped_bigjoin} and Section~\ref{ped_trunc}?"
>
> Here and in the document, you refer to section numbers for the
> "hybrid" version but I don't see where you define what the "hybrid"
> version actually is.

It is defined later in the file. I don't know if you are looking at the
pdf, but if so, it is Section 2.4 (for the hybrid PED query). In the text
file, I guess the easist way would be to grep for the label ped_hybrid.

> And the differences between your queries are not real clear either -
> first you say you took out pheno and sex because they weren't necessary,
> but then you decide to put them back. I don't know what that means.
> If they're not necessary, leave them out.

I don't see where I say that pheno and sex weren't necessary. In fact, the
word 'necessary' does not appear in the opt document. I took them out to
see how it would affect performance. Which is does, dramatically. I say

"So, I decided to remove the joins to tables corresponding to the patient
data, namely pheno and sex, and the runtime dropped to 150 min, while the
memory stayed around 5G."

Maybe I wasn't being sufficiently explicit here. Perhaps

"So, I decided to remove the joins to tables corresponding to the patient
data, namely pheno and sex, to see how it would affect performance..."

would have been better.

>>> One random thought: WHERE row_number() = 1 is not too efficient.
>>> Try using LIMIT or DISTINCT ON instead.
>>
>> Possibly. However, the CTE that uses
>>
>> WHERE row_number() = 1
>>
>> doesn't dominate the runtime or memory usage, so I'm not too concerned
>> about it.
>
> Hmm, you might be right.
>
>>> If you're concerned about memory usage, try reducing work_mem; you've
>>> probably got it set to something huge.
>>
>> work_mem = 1 GB (see diag.{tex/pdf}).
>>
>> The point isn't that I'm using so much memory. Again, my question is, why
>> are these changes affecting memory usage so drastically?
>
> Well each sort or hash can use an amount of memory that is limited
> from above by work_mem. So if you write the query in a way that
> involves more sorts or hashes, each one can add up to 1GB to your
> memory usage, plus overhead. However, it doesn't look like any of
> your queries including 30 sorts or hashes, so I'm thinking that the
> RSS number probably also includes some of the shared memory that has
> been mapped into each backend's address space. RSS is not a terribly
> reliable number when dealing with shared memory; it's hard to say what
> that really means.

>>> You might need to create some indices, too.

>> Ok. To what purpose? This query picks up everything from the tables and the
>> planner does table scans, so conventional wisdom and indeed my experience,
>> says that indexes are not going to be so useful.

> Well, a hash join is not usually the first thing that pops to mind when
> dealing with a table that has 825 million rows (geno). I don't know if
> a nested loop with inner-indexscan would be faster, but it would almost
> certainly use less memory.

Can you provide an illustration of what you mean? I don't know what a
"nested loop with inner-indexscan" is in this context.

Regards, Faheem.

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Josh Berkus 2010-03-30 06:18:27 Re: why does swap not recover?
Previous Message Robert Haas 2010-03-29 19:55:46 Re: experiments in query optimization