Re: UPDATE many records

From: Justin <zzzzz(dot)graf(at)gmail(dot)com>
To: Israel Brewster <ijbrewster(at)alaska(dot)edu>
Cc: Michael Lewis <mlewis(at)entrata(dot)com>, Rob Sargent <robjsargent(at)gmail(dot)com>, Alban Hertroys <haramrae(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)gmail(dot)com>, "pgsql-generallists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: UPDATE many records
Date: 2020-01-07 02:28:08
Message-ID: CALL-XeNRB5_mnC3646x09DZUc7svfoNb5Moof-6F5crNRk7A0Q@mail.gmail.com
Lists: pgsql-general

A system monitor I like is Glances; it needs Python installed to run, and it
can run in a command console or present a web interface. It's a very nice way
to get a quick, detailed view of what is going on:
https://nicolargo.github.io/glances/

Just monitoring the system CPU utilization does not give enough information
to say whether the system is CPU bound, disk I/O bound, or network I/O bound.
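
One quick way to check from inside Postgres (assuming 9.6 or later, where wait
events are reported) is to sample pg_stat_activity while the update runs; an
active backend with no wait event is burning CPU, while IO-type wait events
point at the disk:

    SELECT pid, state, wait_event_type, wait_event, query
    FROM pg_stat_activity
    WHERE state = 'active';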

A simple SELECT is a trivial event assuming the data is already in shared
buffers; if it has to go to disk to get the data, it is not trivial, especially
on a big table. One thing that can happen is that the update and the SELECTs
in the trigger keep pushing data in and out of shared buffers, causing the
disk to thrash, which EXPLAIN does not report.
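
For the test batch, adding the BUFFERS option to EXPLAIN shows shared-buffer
hits versus blocks actually read from disk, which gives a rough idea of
whether the data fit in shared buffers (table and column names below are
placeholders):

    EXPLAIN (ANALYZE, BUFFERS)
    UPDATE mytable SET somecol = myfunction(somecol) WHERE id < 1000;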

EXPLAIN does not peer into functions, triggers, etc. For something like
SELECT myfunction(), column_list FROM mytable, or
UPDATE mytable SET somecol = myfunction(), it will not report what happened
inside the function, just the total time it took to run. Looking at the
EXPLAIN output, the UPDATE itself only took 0.6 seconds while the trigger
took 6.5:

rotate_tilt_data: time=6560.401 calls=9299
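
If you want to see what the time inside the trigger is actually spent on, one
option (a sketch, not something already discussed in this thread) is the
auto_explain module with nested statements enabled, so the plans of statements
run inside the trigger function show up in the server log:

    LOAD 'auto_explain';  -- usually needs superuser, or use session_preload_libraries
    SET auto_explain.log_min_duration = 0;        -- log everything for the test
    SET auto_explain.log_analyze = on;            -- include actual timings
    SET auto_explain.log_nested_statements = on;  -- statements inside functions/triggers
    -- then run the test UPDATE and read the PostgreSQL log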

I'm a little confused: is this trigger function going to be around after
this update? If yes, it would be worth looking into.

If it's a one-time run, or something that happens every 12 months, who cares:
start the update on Friday night and go in on Saturday to check on it.

On Mon, Jan 6, 2020 at 5:38 PM Israel Brewster <ijbrewster(at)alaska(dot)edu>
wrote:

>
> On Jan 6, 2020, at 12:49 PM, Justin <zzzzz(dot)graf(at)gmail(dot)com> wrote:
>
> What was the HD wait time? What tool is being used to monitor the server
> resources?
>
>
> No idea on the HD wait time - how would I best monitor that? That said,
> this machine does have NVMe drives, so the speed should be fairly high/wait
> time fairly low. It’s also running as a VM, which could affect things, but
> is a bit of a moot point as far as this update goes. As far as monitoring
> server resources, I was just using top.
>
> It appears, based on this information, that there is a lot more going on
> than a simple UPDATE command
>
>
> Depending on your definition of “simple update” of course, very true. As I
> stated in the original message, the actual update value is the result of a
> function. The psql command is a simple update, but the function does a bit
> of stuff (primarily trigonometry). According to the EXPLAIN ANALYZE, about
> .7 ms of stuff per record, which of course is most of the runtime. It is
> entirely possible that the function could be optimized to run more quickly.
>
>
> Moving code out of the trigger is probably not going to improve
> performance, unless there is a lot of code that does not need to be
> processed for this update, or code touching other tables
>
>
> One SELECT query on another table to get some values I need to use for the
> calculation. No code that is not needed for the update. Given the nature of
> this bulk update, I *could* make a separate function that simply takes
> those values as parameters, since the same value will be applied to a lot
> of records. I’d just have to be careful about how I applied the update, so
> rows get processed with the correct values. I’m not convinced it would be
> worth it though - might shave a few hours off the total execution time
> (assuming that SELECT is expensive - EXPLAIN ANALYZE shows an index scan, on
> a table with only 12,761 rows, which seems to be about as simple as it
> gets), but I doubt it would be enough for me to feel comfortable simply
> running the update as one monolithic unit.
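>
> Roughly what I have in mind, with every name below invented purely for
> illustration (the real math lives in the existing trigger function):
>
>     CREATE FUNCTION rotate_tilt_with(raw_value double precision,
>                                      coef_a double precision,
>                                      coef_b double precision)
>     RETURNS double precision LANGUAGE sql IMMUTABLE AS
>     $$ SELECT raw_value * coef_a + coef_b $$;  -- placeholder math only
>
>     UPDATE readings r
>        SET corrected = rotate_tilt_with(r.raw_value, c.coef_a, c.coef_b)
>       FROM calibration c
>      WHERE c.station_id = r.station_id;  -- each row gets its station's values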
>
> Study the trigger, identify what has to run, and pull that code out; then
> disable the trigger. Move the necessary code to a new function for
> updating.
>
>
> Sure. But I feel we are getting a bit off track. Optimizing the runtime of
> the update is great, but this is a one-off (hopefully) event. I want to
> accomplish it as quickly as possible, of course, but at the same time it
> doesn’t make sense to spend a lot of time optimizing every component of the
> query. The main purpose of the question was honestly for my sanity, to
> reduce the likelihood of having it run for several hours only to error out
> due to bad data or whatever and have to start over from the top. Running in
> parallel simply seemed to be a no-brainer option to make it go quicker,
> assuming CPU bound updating. Optimizations that are going to take work are
> probably not worth it. We can wait for the data to be updated.
>
> Thanks again!
>
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory
> Geophysical Institute - UAF
> 2156 Koyukuk Drive
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell: 907-328-9145
>
>
>
> On Mon, Jan 6, 2020 at 4:24 PM Israel Brewster <ijbrewster(at)alaska(dot)edu>
> wrote:
>
>>
>> On Jan 6, 2020, at 11:54 AM, Michael Lewis <mlewis(at)entrata(dot)com> wrote:
>>
>> I’m thinking it might be worth it to do a “quick” test on 1,000 or so
>>> records (or whatever number can run in a minute or so), watching the
>>> processor utilization as it runs. That should give me a better feel for
>>> where the bottlenecks may be, and how long the entire update process would
>>> take. I’m assuming, of course, that the total time would scale more or less
>>> linearly with the number of records.
>>>
>>
>> I think that depends on how you identify and limit the update to those
>> 1000 records. If it is using a primary key with specific keys in an array,
>> probably close to linear increase because the where clause isn't impactful
>> to the overall execution time. If you write a sub-query that is slow, then
>> you would need to exclude that from the time. You can always run explain
>> analyze on the update and rollback rather than commit.
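>>
>> For example (table and column names here are just placeholders), wrapping
>> the test in a transaction gives real timings without keeping the changes:
>>
>>     BEGIN;
>>     EXPLAIN ANALYZE
>>     UPDATE mytable SET somecol = myfunction(somecol)
>>     WHERE id BETWEEN 1 AND 1000;  -- restrict to the test batch
>>     ROLLBACK;  -- the work is done (so timings are real) but not committed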
>>
>> So a test run on 9,299 records took about 7 seconds to complete (EXPLAIN
>> ANALYZE output at https://explain.depesz.com/s/lIYn if it matters),
>> during which time I did see a postmaster process consuming 100% CPU. Upping
>> the test to 20,819 records took about 16.5 seconds, so that looks
>> relatively linear to me. Also, CPU bound. So by my calculations, doing all
>> 50M records would take around 10 hours.
>>
>> One potentially significant note: most of the execution time is spent in
>> a trigger. This trigger is actually what’s doing the REAL update that I
>> need to happen. If it would make a difference, I could easily pull the
>> trigger code out to a separate function that I just call directly (with
>> triggers temporarily disabled). My thinking is that calling a function is
>> calling a function, and the fact that it is currently called via a trigger
>> rather than direct is of little consequence, but I’m willing to be
>> corrected on that :-)
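>>
>> For the record, the sort of thing I mean (all names here are made up):
>>
>>     -- note: disabling the trigger also affects any other writes in this window
>>     ALTER TABLE readings DISABLE TRIGGER rotate_tilt_trg;
>>     UPDATE readings SET corrected = rotate_tilt_calc(raw_value);
>>     ALTER TABLE readings ENABLE TRIGGER rotate_tilt_trg;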
>>
>> ---
>> Israel Brewster
>> Software Engineer
>> Alaska Volcano Observatory
>> Geophysical Institute - UAF
>> 2156 Koyukuk Drive
>> Fairbanks AK 99775-7320
>> Work: 907-474-5172
>> cell: 907-328-9145
>>
>>
>
