Skip site navigation (1) Skip section navigation (2)

Re: experiments in query optimization

From: Faheem Mitha <faheem(at)email(dot)unc(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: experiments in query optimization
Date: 2010-03-29 20:22:05
Message-ID: (view raw, whole thread or download thread mbox)
Lists: pgsql-performance

On Mon, 29 Mar 2010, Robert Haas wrote:

> On Mon, Mar 29, 2010 at 2:31 PM, Faheem Mitha <faheem(at)email(dot)unc(dot)edu> wrote:
>>> It's not really too clear to me from reading this what specific
>>> questions you're trying to answer.
>> Quote from opt.{tex/pdf}, Section 1:
>> "If I have to I can use Section~\ref{ped_hybrid} and
>> Section~\ref{tped_hybrid}, but I am left wondering why I get the performance
>> I do out of the earlier versions. Specifically, why is
>> Section~\ref{ped_bigjoin} so much slower than Section~\ref{ped_trunc}, and
>> why does the memory usage in Section~\ref{ped_phenoout} blow up relative to
>> Section~\ref{ped_bigjoin} and Section~\ref{ped_trunc}?"
> Here and in the document, you refer to section numbers for the
> "hybrid" version but I don't see where you define what the "hybrid"
> version actually is.

It is defined later in the file. I don't know if you are looking at the 
pdf, but if so, it is Section 2.4 (for the hybrid PED query). In the text 
file, I guess the easist way would be to grep for the label ped_hybrid.

> And the differences between your queries are not real clear either - 
> first you say you took out pheno and sex because they weren't necessary, 
> but then you decide to put them back.  I don't know what that means. 
> If they're not necessary, leave them out.

I don't see where I say that pheno and sex weren't necessary. In fact, the 
word 'necessary' does not appear in the opt document. I took them out to 
see how it would affect performance. Which is does, dramatically. I say

"So, I decided to remove the joins to tables corresponding to the patient 
data, namely pheno and sex, and the runtime dropped to 150 min, while the 
memory stayed around 5G."

Maybe I wasn't being sufficiently explicit here. Perhaps

"So, I decided to remove the joins to tables corresponding to the patient
data, namely pheno and sex, to see how it would affect performance..."

would have been better.

>>> One random thought: WHERE row_number() = 1 is not too efficient.
>>> Try using LIMIT or DISTINCT ON instead.
>> Possibly. However, the CTE that uses
>> WHERE row_number() = 1
>> doesn't dominate the runtime or memory usage, so I'm not too concerned
>> about it.
> Hmm, you might be right.
>>> If you're concerned about memory usage, try reducing work_mem; you've
>>> probably got it set to something huge.
>> work_mem = 1 GB (see diag.{tex/pdf}).
>> The point isn't that I'm using so much memory. Again, my question is, why
>> are these changes affecting memory usage so drastically?
> Well each sort or hash can use an amount of memory that is limited
> from above by work_mem.  So if you write the query in a way that
> involves more sorts or hashes, each one can add up to 1GB to your
> memory usage, plus overhead.  However, it doesn't look like any of
> your queries including 30 sorts or hashes, so I'm thinking that the
> RSS number probably also includes some of the shared memory that has
> been mapped into each backend's address space.  RSS is not a terribly
> reliable number when dealing with shared memory; it's hard to say what
> that really means.

>>> You might need to create some indices, too.

>> Ok. To what purpose? This query picks up everything from the tables and the
>> planner does table scans, so conventional wisdom and indeed my experience,
>> says that indexes are not going to be so useful.

> Well, a hash join is not usually the first thing that pops to mind when 
> dealing with a table that has 825 million rows (geno).  I don't know if 
> a nested loop with inner-indexscan would be faster, but it would almost 
> certainly use less memory.

Can you provide an illustration of what you mean? I don't know what a 
"nested loop with inner-indexscan" is in this context.

                                                        Regards, Faheem.

In response to

pgsql-performance by date

Next:From: Josh BerkusDate: 2010-03-30 06:18:27
Subject: Re: why does swap not recover?
Previous:From: Robert HaasDate: 2010-03-29 19:55:46
Subject: Re: experiments in query optimization

Privacy Policy | About PostgreSQL
Copyright © 1996-2017 The PostgreSQL Global Development Group