Re: experiments in query optimization

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Faheem Mitha <faheem(at)email(dot)unc(dot)edu>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: experiments in query optimization
Date: 2010-03-29 19:55:46
Message-ID: 603c8f071003291255x151746dif0e33d18124c8bd2@mail.gmail.com
Lists: pgsql-performance

On Mon, Mar 29, 2010 at 2:31 PM, Faheem Mitha <faheem(at)email(dot)unc(dot)edu> wrote:
>> It's not really too clear to me from reading this what specific
>> questions you're trying to answer.
>
> Quote from opt.{tex/pdf}, Section 1:
>
> "If I have to I can use Section~\ref{ped_hybrid} and
> Section~\ref{tped_hybrid}, but I am left wondering why I get the performance
> I do out of the earlier versions. Specifically, why is
> Section~\ref{ped_bigjoin} so much slower than Section~\ref{ped_trunc}, and
> why does the memory usage in Section~\ref{ped_phenoout} blow up relative to
> Section~\ref{ped_bigjoin} and Section~\ref{ped_trunc}?"

Here and in the document, you refer to section numbers for the
"hybrid" version, but I don't see where you define what the "hybrid"
version actually is. And the differences between your queries are not
really clear either - first you say you took out pheno and sex because
they weren't necessary, but then you decide to put them back. I don't
know what that means. If they're not necessary, leave them out.

>> One random thought: WHERE row_number() = 1 is not too efficient.
>> Try using LIMIT or DISTINCT ON instead.
>
> Possibly. However, the CTE that uses
>
> WHERE row_number() = 1
>
> doesn't dominate the runtime or memory usage, so I'm not too concerned
> about it.

Hmm, you might be right.
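
For the archives, here is the sort of rewrite I had in mind, in case that
CTE ever does become the bottleneck. This is only a sketch; the table and
column names are made up for illustration, not taken from your schema:

    -- the pattern I was reacting to: rank every row, then filter
    WITH ranked AS (
        SELECT t.*,
               row_number() OVER (PARTITION BY grp_id
                                  ORDER BY created_at) AS rn
        FROM some_table t
    )
    SELECT * FROM ranked WHERE rn = 1;

    -- equivalent form with DISTINCT ON, which keeps just the first row
    -- per grp_id according to the ORDER BY
    SELECT DISTINCT ON (grp_id) *
    FROM some_table
    ORDER BY grp_id, created_at;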

>> If you're concerned about memory usage, try reducing work_mem; you've
>> probably got it set to something huge.
>
> work_mem = 1 GB (see diag.{tex/pdf}).
>
> The point isn't that I'm using so much memory. Again, my question is, why
> are these changes affecting memory usage so drastically?

Well, each sort or hash can use an amount of memory that is limited
from above by work_mem. So if you write the query in a way that
involves more sorts or hashes, each one can add up to 1GB to your
memory usage, plus overhead. However, it doesn't look like any of
your queries includes 30 sorts or hashes, so I'm thinking that the
RSS number probably also includes some of the shared memory that has
been mapped into each backend's address space. RSS is not a terribly
reliable number when dealing with shared memory; it's hard to say what
that really means.
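
One way to see where the memory is actually going: EXPLAIN ANALYZE shows
how each sort was done ("Sort Method: ... Memory: ...kB") and, for hash
joins, the size of the hash table ("Buckets: ... Batches: ... Memory
Usage: ...kB"), and you can lower work_mem for just one session and watch
how the plan and the footprint change. A rough sketch, with an arbitrary
value:

    SET work_mem = '256MB';   -- session-local; postgresql.conf is untouched
    SHOW work_mem;            -- confirm it took effect
    -- now re-run EXPLAIN ANALYZE on the query in question and compare the
    -- per-node memory figures, and the process RSS, against the 1GB run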

>> You might need to create some indices, too.
>
> Ok. To what purpose? This query picks up everything from the tables and the
> planner does table scans, so conventional wisdom, and indeed my experience,
> says that indexes are not going to be so useful.

Well, a hash join is not usually the first thing that pops to mind
when dealing with a table that has 825 million rows (geno). I don't
know if a nested loop with inner-indexscan would be faster, but it
would almost certainly use less memory.
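
If you want to test that cheaply, something along these lines would do it.
The column name is only a guess at whatever the join uses to probe geno,
and turning off hash joins is a session-level experiment, not something to
leave set in production:

    -- index on the column the big join uses to look rows up in geno
    -- (idlink_id is a guess, not taken from your actual schema)
    CREATE INDEX geno_idlink_id_idx ON geno (idlink_id);
    ANALYZE geno;

    -- in one session, discourage the hash join, re-run EXPLAIN ANALYZE on
    -- the query, and compare runtime and memory; then set it back on
    SET enable_hashjoin = off;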

...Robert
