Re: Best way to delete big amount of records from big table

From: Ekaterina Amez <ekaterina(dot)amez(at)zunibal(dot)com>
To: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
Cc: postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Best way to delete big amount of records from big table
Date: 2020-03-27 15:15:04
Message-ID: CAFijohjCdcu7w0Oi2ppEc5_mbR3WisQ6kvRyF=3SwntAGr3Qcw@mail.gmail.com
Lists: pgsql-performance

Hi Laurenz,

On Fri, 27 Mar 2020 at 15:46, Laurenz Albe (<laurenz(dot)albe(at)cybertec(dot)at>)
wrote:

> On Fri, 2020-03-27 at 15:13 +0100, Ekaterina Amez wrote:
> > I'm trying to clean up a database with millions of records of
> > useless-but-don't-remove-just-in-case data. [...]
> >
> > But I'm also cleaning tables with 150 million records where I'm going
> > to remove 60% of the existing data, and after a few tests I'm not sure
> > what the best approach is, as all of them seem to take a similar time
> > to run. These tables come in groups of four (master, detail,
> > master_history, detail_history). None of the tables have a primary
> > key, foreign key or any other constraint; the only related object is
> > the sequence used for what should be the PK column, though that column
> > is not actually defined as a PK.
>
> You should define primary and foreign keys if you can, but I guess
> I don't have to tell you that.
>

I know about DB design ;)
This master-detail-master_hist-detail_hist structure is repeated all over
the DB, and the other groups of tables are properly created with their
PK-FK-UQ-IX objects... I don't know why these haven't been created the
same way.
Excuse me if this is a silly question, but I've read (or understood) that
it's better to remove constraints to improve delete performance... does
that apply only to indexes, or also to PK/FK constraints?
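
For when I eventually add the missing constraints back after the cleanup,
I assume it would be roughly this (table and column names are made up, just
to match the structure I described):

    -- Declare the missing PK/FK once the cleanup is done.
    ALTER TABLE master_hist ADD PRIMARY KEY (id);
    ALTER TABLE detail_hist ADD PRIMARY KEY (id);
    ALTER TABLE detail_hist
        ADD FOREIGN KEY (master_hist_id) REFERENCES master_hist (id);
    -- Without an index on the referencing column, deleting rows from
    -- master_hist makes PostgreSQL scan detail_hist to verify the FK.
    CREATE INDEX detail_hist_master_hist_id_idx
        ON detail_hist (master_hist_id);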

> > I've decided to delete from the last one in chunks (10 days of data
> > per chunk, but it could be any other quantity), so I've created a
> > function. I've tested it with indexes (on master_hist for filtering
> > data and on detail_hist for the FK and PK), without indexes, and after
> > analyzing the table, and no matter what I always end up with more or
> > less the same execution time. I can afford the time it takes to run,
> > but I'd like to know if there's a better way to do this.
>
> There is no need to delete in batches unless you have a need to keep
> transactions short (danger of deadlock because the data are still
> modified, or you cannot afford to block autovacuum that long).
>

I prefer doing it in batches because I know there are other processes
accessing this table and I can't be sure they won't change any data.
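
In case it helps to see what I mean, here is a minimal sketch of the kind
of chunked-delete function I'm describing (table and column names are made
up, e.g. detail_hist.hist_date):

    -- Deletes 10-day chunks between two dates.  Note that everything inside
    -- one function call still runs in a single transaction, so to really
    -- keep transactions short each chunk would have to be its own call.
    CREATE OR REPLACE FUNCTION purge_detail_hist(p_from date, p_to date)
    RETURNS void AS $$
    DECLARE
        v_start date := p_from;
        v_rows  bigint;
    BEGIN
        WHILE v_start < p_to LOOP
            DELETE FROM detail_hist
             WHERE hist_date >= v_start
               AND hist_date <  v_start + 10;   -- 10-day chunk
            GET DIAGNOSTICS v_rows = ROW_COUNT;
            RAISE NOTICE 'deleted % rows starting %', v_rows, v_start;
            v_start := v_start + 10;
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;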

> If you can drop the indexes while you do it (downtime), go for it.
> Perhaps there is a way to use partial indexes that exclude all the
> data that you have to delete, then work could go on as normal.
>

As I said, these particular tables don't have any indexes at all. I'll
give the partial index suggestion a try, thanks.
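
If I understand the idea correctly, it would be something like this
(hypothetical column name and cutoff date):

    -- Partial index that only covers the rows we intend to keep, so normal
    -- queries can keep using it while the old rows are being deleted.
    CREATE INDEX master_hist_keep_idx
        ON master_hist (hist_date)
     WHERE hist_date >= DATE '2018-01-01';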

> > I'm testing on version 9.2 BUT the production server is 8.4 (legacy
> > application, supposed to be on at least 9.2, but I recently discovered
> > it was 8.4; an upgrade is planned, but not right now). Config
> > parameters are the defaults.
>
> Now that is a seriously bad idea. You should test on the same version
> as you have running in production. And you should insist on an upgrade.
> People who insist on running ancient software often insist on ancient
> hardware as well, and both are a good way to get data corruption.
> If the system blows up, they are going to blame you.
>

Believe me, I'm totally aware of all of this. The upgrade is planned to
happen after I clean up the database. I'm the one who discovered that the
production server is this old; it seems no one knew it before. In the time
I've been working here I've upgraded 2 servers.

> Yours,
> Laurenz Albe
> --
> Cybertec | https://www.cybertec-postgresql.com
>
>
Regards,
Ekaterina
