Re: Analyzing foreign tables & memory problems

From: "Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To: "Noah Misch *EXTERN*" <noah(at)leadboat(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Analyzing foreign tables & memory problems
Date: 2012-04-30 19:01:54
Message-ID: D960CB61B694CF459DCFB4B0128514C2049FCE85@exadv11.host.magwien.gv.at
Lists: pgsql-hackers

Noah Misch wrote:
>> During ANALYZE, in the functions compute_minimal_stats
>> and compute_scalar_stats in analyze.c, values whose length
>> exceeds WIDTH_THRESHOLD (= 1024) are not used for calculating
>> statistics, other than being counted as "too wide rows" and
>> assumed to be all different.
>>
>> This works fine with regular tables;

>> With foreign tables the situation is different. Even though
>> values exceeding WIDTH_THRESHOLD won't get used, the complete
>> rows will be fetched from the foreign table. This can easily
>> exhaust maintenance_work_mem.
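
For reference, the check in question looks roughly like this
(simplified from compute_scalar_stats in analyze.c):

  if (is_varlena)
  {
      total_width += VARSIZE_ANY(DatumGetPointer(value));

      /*
       * Values wider than WIDTH_THRESHOLD are only counted as
       * "too wide"; they are never detoasted or compared.
       */
      if (toast_raw_datum_size(value) > WIDTH_THRESHOLD)
      {
          toowide_cnt++;
          continue;
      }
      value = PointerGetDatum(PG_DETOAST_DATUM(value));
  }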

>> I can think of two remedies:
>> 1) Expose WIDTH_THRESHOLD in commands/vacuum.h and add documentation
>> so that the authors of foreign data wrappers are aware of the
>> problem and can avoid it on their side.
>> This would be quite simple.

> Seems reasonable. How would the FDW return an indication that a value was
> non-NULL but removed due to excess width?

The FDW would return a value of length WIDTH_THRESHOLD+1:
long enough to be recognized as too wide, but short enough
not to cause any memory problem.
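
A rough sketch of the FDW side, assuming WIDTH_THRESHOLD were
exported in commands/vacuum.h as proposed (the helper name is
made up):

  /*
   * Clamp a fetched string so that analyze.c still classifies
   * it as too wide (raw size > WIDTH_THRESHOLD) without keeping
   * the whole value in memory.
   */
  static Datum
  clamp_wide_value(const char *raw, size_t rawlen)
  {
      if (rawlen > WIDTH_THRESHOLD)
          rawlen = WIDTH_THRESHOLD + 1;   /* recognizably too wide */

      /* real code should respect multibyte boundaries,
       * e.g. with pg_mbcliplen() */
      return PointerGetDatum(cstring_to_text_with_len(raw, rawlen));
  }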

> Not all databases can cheaply filter out wide column values; by the time the
> remote side has an exact width, the remote I/O damage may already be done. To
> dodge that problem, when a column has "SET STATISTICS 0", the FDW should be
> able to completely omit reading it. (I haven't studied the API needs, if any,
> to make that possible.)

Depending on the capabilities of the remote side, an FDW can
do more or less intelligent things to avoid the problem,
but in any case it must know WIDTH_THRESHOLD.

Disabling statistics for a column as a workaround is an
interesting idea, but it would mean more work for both the
FDW writer and the user.
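
If we went that route, the FDW would have to look at each
column's statistics target when building its remote query
for ANALYZE; a minimal sketch (error handling omitted):

  Relation    rel = heap_open(foreigntableid, AccessShareLock);
  TupleDesc   tupdesc = RelationGetDescr(rel);
  int         i;

  for (i = 0; i < tupdesc->natts; i++)
  {
      Form_pg_attribute att = tupdesc->attrs[i];

      /* skip dropped columns and those with SET STATISTICS 0 */
      if (att->attisdropped || att->attstattarget == 0)
          continue;

      /* ... add att->attname to the remote SELECT list ... */
  }

  heap_close(rel, AccessShareLock);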

>> 2) Instead of one callback that returns all sample rows, have
>> a callback that just returns the next table row (or the next
>> table row from a subset of the pages of the table in the
>> internal case). This function could be called in a short-lived
>> memory context. Vitter's algorithm for selecting a sample
>> and the truncation of excessively long values would then be
>> handled in analyze.c.
>> This would avoid the problem completely and make it easier
>> to write a foreign data wrapper.
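
The sampling loop that would move into analyze.c could look
like this; I show plain reservoir sampling (Algorithm R) for
brevity, while Vitter's Algorithm Z additionally computes
skip distances.  AcquireNextRow is a hypothetical callback;
targrows and rows[] would come from do_analyze_rel:

  HeapTuple   tup;
  int         numrows = 0;
  double      samplerows = 0;

  while ((tup = AcquireNextRow(fdw_state)) != NULL)
  {
      samplerows += 1;

      if (numrows < targrows)
          rows[numrows++] = heap_copytuple(tup);
      else
      {
          /* anl_random_fract() is exported in commands/vacuum.h */
          int     k = (int) (samplerows * anl_random_fract());

          if (k < targrows)
          {
              heap_freetuple(rows[k]);
              rows[k] = heap_copytuple(tup);
          }
      }
      /* reset the short-lived per-row memory context before
       * fetching the next row */
  }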

> This solves the (in your downthread example) 30 GiB of memory consumption, but
> you'll still read 30 GiB on the remote side and ship it all over the network.
> To call this fixed, we'll need something like (1) that lets the FDW limit
> volume at the remote side.

You are right. I guess the first idea is the more promising one.

Yours,
Laurenz Albe
