Re: Rapid disk usage spikes when updating large tables with GIN indexes

From: Jonathan Marks <jonathanaverymarks(at)gmail(dot)com>
To: Nikolay Samokhvalov <samokhvalov(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Rapid disk usage spikes when updating large tables with GIN indexes
Date: 2018-05-21 23:41:33
Message-ID: 086267F7-EDBC-4B1A-9C73-2CDC07C27691@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

I believe that we have figured it out. It indeed was a WAL issue — the WAL wasn’t getting measured because it had been moved into an archived folder.

We resolved this in a two main ways:

1. By dramatically increasing max_wal_size to decrease the frequency of commits
2. By turning on wal_compression

We’re going to continue looking into whether we can reduce wal generation / write volume further by turning back on fast_update on our GIN indexes.

> On May 16, 2018, at 4:31 PM, Nikolay Samokhvalov <samokhvalov(at)gmail(dot)com> wrote:
>
> Why not set up a spot EC2 instance with Postgres 10.1, load database from a dump (yes you’ll need to create one from RDS because they don’t provide direct access to dumps/backups; probably you need to get only specific tables) and repeat your actions, closely looking at filesystem.
>
> ср, 16 мая 2018 г. в 13:10, Jonathan Marks <jonathanaverymarks(at)gmail(dot)com <mailto:jonathanaverymarks(at)gmail(dot)com>>:
> Hi Tom —
>
> We turned on log_temp_files and since the last stats reset (about a week ago) we’re seeing 0 temp files altogether (grabbing that info from pg_stat_database).
>
> So, as far as we know:
>
> 1) It’s not WAL
> 2) It’s not tempfiles
> 3) It’s not the size of the error logs
> 4) It’s not the size of the actual rows in the database or the indexes
>
> Another thread we found suggested pg_subtrans — this seems less likely because we’ve been able to replicate this across many different types of connections etc. but thought it might be a potential source.
>
> Any other system-monitoring queries that we can run that might further illuminate the issue?
>
> Thank you!
>
> > On May 14, 2018, at 3:31 PM, Jonathan Marks <jonathanaverymarks(at)gmail(dot)com <mailto:jonathanaverymarks(at)gmail(dot)com>> wrote:
> >
> > We’ll turn on log_temp_files and get back to you to see if that’s the cause. Re: the exact queries — these are just normal INSERTs and UPDATEs. This occurs as part of normal database operations — i.e., we are processing 10% of a table and marking changes to a particular row, or happen to be inserting 5-10% of the table volume with new rows. Whenever we bulk load we have to drop the indexes because the disk space loss just isn’t tenable.
> >
> > Re: extra disk space consumption not within PG — the AWS folks can’t tell me what the problem is because it’s all internal to the PG part of the instance they can’t access. Doesn’t mean your last suggestion can’t be the case but makes it slightly less likely.
> >
> > Any chance that GIN indexes are double-logging? I.e. with fastupdate off they are still trying to keep track of the changes in the pending list or something?
> >
> > Our thought has been temp files for a while, but we’re not sure what we should do if that turns out to be the case.
> >
> >> On May 14, 2018, at 3:08 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us <mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us>> wrote:
> >>
> >> [ please keep the list cc'd ]
> >>
> >> Jonathan Marks <jonathanaverymarks(at)gmail(dot)com <mailto:jonathanaverymarks(at)gmail(dot)com>> writes:
> >>> Thanks for your quick reply. Here’s a bit more information:
> >>> 1) to measure the “size of the database” we run something like `select datname, pg_size_pretty(pg_database_size(datname)) from pg_database;` I’m not sure if this includes WAL size.
> >>> 2) I’ve tried measuring WAL size with `select sum(size) from pg_ls_waldir();` — this also doesn’t budge.
> >>> 3) Our current checkpoint_timeout is 600s with a checkpoint_completion_target of 0.9 — what does that suggest?
> >>
> >> Hmph. Your WAL-size query seems on point, and that pretty much destroys
> >> my idea about a WAL emission spike.
> >>
> >> pg_database_size() should include all regular and temporary tables/indexes
> >> in the named DB. It doesn't include WAL (but we've eliminated that), nor
> >> cluster-wide tables such as pg_database (but those seem pretty unlikely
> >> to be at issue), nor non-relation temporary files such as sort/hash temp
> >> space. At this point I think we have to focus our attention on what might
> >> be creating large temp files. I do not see anything in the GIN index code
> >> that could do that, especially not if you have fastupdate off. I wonder
> >> whether there is something about the particular bulk-insertion queries
> >> you're using that could result in large temp files --- which'd make the
> >> apparent correlation with GIN index use a mirage, but we're running out
> >> of other ideas. You could try enabling log_temp_files to see if there's
> >> anything to that.
> >>
> >> In the grasping-at-straws department: are you quite sure that the extra
> >> disk space consumption is PG's to begin with, rather than something
> >> outside the database entirely?
> >>
> >> regards, tom lane
> >
>
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Stuart McGraw 2018-05-22 05:48:20 source of connection fails at pg startup?
Previous Message Ron 2018-05-21 23:33:06 Re: posgresql.log