Re: Out of Memory errors are frustrating as heck!

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Gunther <raj(at)gusw(dot)net>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-performance(at)lists(dot)postgresql(dot)org, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Subject: Re: Out of Memory errors are frustrating as heck!
Date: 2019-04-24 00:36:33
Message-ID: 20190424003633.ruvhbv5ro3fawo67@development
Lists: pgsql-performance

On Tue, Apr 23, 2019 at 07:09:00PM -0400, Gunther wrote:
> On 4/23/2019 16:43, Justin Pryzby wrote:
>
>> It wrote 40GB tempfiles - perhaps you can increase work_mem now to improve the
>> query time.
>
> I now upped my shared_buffers back from 1 to 2GB and work_mem from 4 to
> 16MB. I need to set vm.overcommit_ratio from 50 to 75 (percent, with
> vm.overcommit_memory = 2 as it is).
>
>> We didn't address it yet, but your issue was partially caused by a misestimate.
>> It's almost certainly because these conditions are correlated, or maybe
>> redundant.
>
> That may be so, but mis-estimates happen. And I am sure I can still
> massively improve this query logically. In fact it sticks out like a sore
> thumb, logically it makes no sense to churn over 100 million rows here,
> but the point is that hopefully PostgreSQL runs stable in such outlier
> situations, comes back and presents you with 2 hours of work time, 40 GB
> temp space, or whatever, and then we users can figure out how to make it
> work better. The frustrating thing is to run out of memory without us
> knowing what we can possibly do about it.
>

Sure. And I think the memory balancing algorithm implemented in the v2
patch is a step in that direction. I think we can do better in terms of
memory consumption (essentially keeping it closer to work_mem) but it's
unlikely to be any faster.

In a way this is similar to underestimates in hash aggregate, except
that in that case we don't have any spill-to-disk fallback at all.

> From my previous attempt with these tmp_r and tmp_q tables, I also know that
> the Sort/Unique step is taking a lot of extra time. I can cut that out too
> by addressing the causes of the "repeated result" rows. But again, that is
> all secondary optimizations.
>
>>> Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))
>
>> If they're completely redundant and you can get the same result after dropping
>> one or two of those conditions, then you should.
>
> I understand. You are saying that by reducing the number of columns in the
> join condition, somehow you might be able to reduce the size of the hash
> temporary table?
>

No. When estimating the join result size with multiple join clauses, the
optimizer essentially has to compute

1: P((x1 = y1) && (x2 = y2) && (x3 = y3))

so it assumes statistical independence of those conditions and splits
that into

2: P(x1 = y1) * P(x2 = y2) * P(x3 = y3)

But when those conditions are dependent - for example when (x1=y1) implies
((x2=y2) && (x3=y3)) - this results in a significant underestimate.
E.g. let's assume that each of those conditions matches 1/100 rows, but
that essentially x1=x2=x3 and y1=y2=y3. Then (1) is 1/100 but (2) ends
up being 1/1000000, so 10000x off.

Chances are this is what's happening with the inner side of the hash
join, which is estimated to return 14425 rows but ends up returning 37826042.
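
If you want to check how redundant those three clauses actually are, a sanity
check along these lines should do (untested, and it does scan the table; the
column names are taken from the merge condition above). If nd_all comes out
close to the largest of the per-column counts rather than their product, the
clauses are effectively redundant:

select count(distinct documentinternalid) as nd_1,
       count(distinct documentid) as nd_2,
       count(distinct actinternalid) as nd_3,
       count(distinct (documentinternalid, documentid, actinternalid)) as nd_all
from documentinformationsubject;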

There's one trick you might try, though - using indexes on composite types:

create table t1 (a int, b int);
create table t2 (a int, b int);
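-- the INSERTs below give a and b the same value, i.e. the columns are
-- perfectly correlated - exactly the situation that breaks the independence
-- assumption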


insert into t1 select mod(i,1000), mod(i,1000)
from generate_series(1,100000) s(i);

insert into t2 select mod(i,1000), mod(i,1000)
from generate_series(1,100000) s(i);

analyze t1;
analyze t2;

explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

QUERY PLAN
--------------------------------------------------------------------
Merge Join (cost=19495.72..21095.56 rows=9999 width=16)
(actual time=183.043..10360.276 rows=10000000 loops=1)
Merge Cond: ((t1.a = t2.a) AND (t1.b = t2.b))
...

create type composite_id as (a int, b int);

create index on t1 (((a,b)::composite_id));
create index on t2 (((a,b)::composite_id));
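-- thanks to the expression indexes, ANALYZE will now also gather statistics
-- on (a,b) as a whole, which is what fixes the join estimate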

analyze t1;
analyze t2;

explain analyze select * from t1 join t2
on ((t1.a,t1.b)::composite_id = (t2.a,t2.b)::composite_id);
QUERY PLAN
--------------------------------------------------------------------------
Merge Join (cost=0.83..161674.40 rows=9999460 width=16)
(actual time=0.020..12726.767 rows=10000000 loops=1)
Merge Cond: (ROW(t1.a, t1.b)::composite_id = ROW(t2.a, t2.b)::composite_id)

Obviously, that's not exactly free - you have to pay the price for index
creation, maintenance and storage.
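
Those expression statistics are what make the difference - after ANALYZE they
show up in pg_stats under the index's auto-generated name (I'm assuming it
ended up as t1_row_idx here, check \d t1 for the actual name):

select tablename, attname, n_distinct
from pg_stats
where tablename = 't1_row_idx';   -- assumed auto-generated index name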

> ...
>
> Unique/Sort actual time   6,150,303.060 ms = 6,150 s <~ 2 h.
> Hash Right Join actual time 325,240.679 ms.
>
> So really all the time is wasted in that sort, no need for you guys to worry
> about anything else with these 2 hours. Tomas just stated the same thing.
>
>> Right. Chances are that with a better estimate the optimizer would pick
>> merge join instead. I wonder if that would be significantly faster.
>
> The prospect of a merge join is interesting to consider here: with the
> Sort/Unique step taking so long, it seems the Merge Join might also take a
> lot of time? I see my disks are churning most of the time in this way:
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 7.50 0.00 2.50 89.50 0.00 0.50
>
> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> nvme1n1 0.00 0.00 253.00 131.00 30.15 32.20 332.50 2.01 8.40 8.41 8.37 2.59 99.60
> nvme1n1p24 0.00 0.00 253.00 131.00 30.15 32.20 332.50 2.01 8.40 8.41 8.37 2.59 99.60
>
> I.e. 400 IOPS at 60 MB/s, half of it read, half of it write. During the
> previous steps, presumably the hash join, throughput was a lot higher,
> like 2000 IOPS with 120 MB/s read or write.
>
> But even if the Merge Join would have taken about the same or a little
> more time than the Hash Join, I wonder if one could not use it to
> collapse the Sort/Unique step into the join? It seems that after the
> Sort/Merge has completed, one should be able to read Unique records without
> any further sorting? In that case the Merge would be a great advantage.
>

Probably not, because there are far more columns in the Unique step. We
might be able to do something with "incremental sort" but we don't have
that capability yet.
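
As for whether the merge join itself would be any faster - a cheap way to
check is to disable hash joins in a single session and compare the plans.
Just an illustration here, re-using the toy t1/t2 tables from above; the same
trick applies to the real query:

begin;
set local enable_hashjoin = off;   -- only affects this transaction
explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);
rollback;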

> What I like about the situation now is that with that 4x bigger work_mem,
> the overall memory situation remains the same. I.e., we are scraping just
> below 1GB for this process and we see oscillation, growth and shrinkage
> occurring. So I consider this case closed for me. That doesn't mean I
> wouldn't be available if you guys want to try anything else about it.
>
> OK, now here is the result with the 16 MB work_mem:
>
> Unique (cost=5462874.86..5465557.83 rows=34619 width=1197) (actual time=6283539.282..7003311.451 rows=435274 loops=1)
> -> Sort (cost=5462874.86..5462961.41 rows=34619 width=1197) (actual time=6283539.280..6908879.456 rows=113478386 loops=1)
> ...
> Planning Time: 2.953 ms
> Execution Time: 7004340.091 ms
> (70 rows)
>
> There isn't really any big news here. But what matters is that it works.
>

Yeah. Once the hash join outgrows work_mem, the fallback logic starts
ignoring that limit in an effort to keep the memory usage as low as possible.

I still think the idea with an "overflow batch" is worth considering,
because it'd allow us to keep the memory usage within work_mem. And
after getting familiar with the hash join code again (haven't messed
with it since 9.5 or so) I think it should not be all that difficult.
I'll give it a try over the weekend if I get bored for a while.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
