Skip site navigation (1) Skip section navigation (2)

Re: Analyzing foreign tables & memory problems

From: "Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To: "Noah Misch *EXTERN*" <noah(at)leadboat(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Analyzing foreign tables & memory problems
Date: 2012-04-30 19:01:54
Message-ID: D960CB61B694CF459DCFB4B0128514C2049FCE85@exadv11.host.magwien.gv.at (view raw or flat)
Thread:
Lists: pgsql-hackers
Noah Misch wrote:
>> During ANALYZE, in analyze.c, functions compute_minimal_stats
>> and compute_scalar_stats, values whose length exceed
>> WIDTH_THRESHOLD (= 1024) are not used for calculating statistics
>> other than that they are counted as "too wide rows" and assumed
>> to be all different.
>> 
>> This works fine with regular tables;

>> With foreign tables the situation is different.  Even though
>> values exceeding WIDTH_THRESHOLD won't get used, the complete
>> rows will be fetched from the foreign table.  This can easily
>> exhaust maintenance_work_mem.

>> I can think of two remedies:
>> 1) Expose WIDTH_THRESHOLD in commands/vacuum.h and add documentation
>>    so that the authors of foreign data wrappers are aware of the
>>    problem and can avoid it on their side.
>>    This would be quite simple.

> Seems reasonable.  How would the FDW return an indication that a value was
> non-NULL but removed due to excess width?

The FDW would return a value of length WIDTH_THRESHOLD+1 that is
long enough to be recognized as too long, but not long enough to
cause a problem.

> Not all databases can cheaply filter out wide column values; by the time the
> remote side has an exact width, the remote I/O damage may already be done.  To
> dodge that problem, when a column has "SET STATISTICS 0", the FDW should be
> able to completely omit reading it.  (I haven't studied the API needs, if any,
> to make that possible.)

Depending on the capabilities of the remote side, a FDW can
do more or less intelligent things to avoid the problem.
But it must know WIDTH_THRESHOLD.

Disabling statistics for a column as a workaround is an
interesting idea, but would be more work for the FDW writer
and the user.

>> 2) Instead of one callback that returns all sample rows, have
>>    a callback that just returns the next table row (or the next
>>    table row from a subset of the pages of the table in the
>>    internal case).  This function could be called in a short-lived
>>    memory context.  Vitter's algorithm for selecting a sample
>>    and the truncation of excessively long values would then be
>>    handled in analyze.c.
>>    This would avoid the problem completely and make it easier
>>    to write a foreign data wrapper.

> This solves the (in your downthread example) 30 GiB of memory consumption, but
> you'll still read 30 GiB on the remote side and ship it all over the network.
> To call this fixed, we'll need something like (1) that lets the FDW limit
> volume at the remote side.

You are right.  I guess the first idea is the more promising one.

Yours,
Laurenz Albe

In response to

Responses

pgsql-hackers by date

Next:From: David JohnstonDate: 2012-04-30 19:33:48
Subject: Re: precision and scale functions for numeric
Previous:From: Simon RiggsDate: 2012-04-30 18:55:00
Subject: Re: Future In-Core Replication

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group