From: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
To: "Timur Birsh" <taem(at)linukz(dot)org>
Cc: "Michael Paquier" <michael(at)paquier(dot)xyz>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] vacuumlo: print the number of large objects going to be removed
Date: 2019-07-17 11:31:05
Message-ID: ffd69b41-660b-4d1a-8305-d0913aaceffa@manitou-mail.org
Lists: pgsql-hackers

Timur Birsh wrote:

> Please find attached patch v2.
> I fixed some indentation in the variable declaration blocks.

The tab width should be 4. Please have a look at
https://www.postgresql.org/docs/current/source-format.html
It also explains why opportunistic reformatting is futile anyway:

"Your code will get run through pgindent before the next release, so
there's no point in making it look nice under some other set of
formatting conventions. A good rule of thumb for patches is “make
the new code look like the existing code around it”."

> An application that uses this database from time to time deletes and
> adds a lot of rows, it happens that more than 10,000,000 orphaned
> LOs remain in the database. Removing such a number of items takes a
> long time.

It might be useful to display a progress report in the loop, but
it appears that even when there's nothing to remove, vacuumlo is
likely to take a long time, because of the method it uses:

#1. it builds a temp table with the OIDs of all large objects.

#2. for each non-system OID column in the database, it deletes from
    the temp table every value found in that column, on the
    assumption that it refers to a large object. (Incidentally, if
    your schemas have OID columns that don't refer to large objects,
    they get dragged into this. Also, given OID reuse and bad luck,
    they may permanently block the removal of some orphaned large
    objects.)

#3. it creates a holdable cursor to iterate on the temp table.

#4. finally it calls lo_unlink() on each remaining OID in batched
transactions.
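
As a rough illustration, steps #1 and #2 boil down to SQL along these
lines (simplified; the real vacuumlo discovers the OID columns from
the catalogs at run time, and the temp table and column names here are
only placeholders):

    -- step #1: snapshot the OIDs of all large objects
    CREATE TEMP TABLE vacuum_l AS
      SELECT oid AS lo FROM pg_largeobject_metadata;

    -- step #2: repeated once per user table with an oid-typed column;
    -- whatever OID value appears in the column is assumed to be a
    -- live large-object reference and is kept
    DELETE FROM vacuum_l
      WHERE lo IN (SELECT oid_column1 FROM schema1.tablename1);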

The design with #1 and #2 dates back to the very first version,
in 1999.
Nowadays, maybe we could skip these steps by creating a cursor
directly for a generated query that would look like this:

SELECT oid FROM pg_largeobject_metadata lo WHERE NOT EXISTS (
  SELECT 1 FROM schema1.tablename1 WHERE oid_column1 = lo.oid
  UNION ALL
  SELECT 1 FROM schema2.tablename2 WHERE oid_column2 = lo.oid
  UNION ALL
  ...
);

That might be much faster than #1 and #2, especially in the case when
there's only one SELECT in that subquery and no UNION ALL is even
necessary.

For #4, a more modern approach could be to move that step into a
server-side DO block or a procedure, since transaction control has
been available in them since version 11. This would avoid one
client-server round-trip per LO to delete, plus the round trips for
the cursor fetches. With millions of objects to unlink, as in the
case mentioned above, the savings might be significant. Progress
reporting would then have to be done with RAISE NOTICE or some such.
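
A minimal sketch of that server-side step, combining the generated
query above with batched commits and RAISE NOTICE progress (the
schema/table/column names are placeholders, the batch size is
arbitrary, and this relies on transaction control in DO blocks, so it
must be run outside an explicit transaction block on version 11 or
later):

    DO $$
    DECLARE
        lo      oid;
        removed bigint := 0;
    BEGIN
        FOR lo IN
            SELECT m.oid FROM pg_largeobject_metadata m
            WHERE NOT EXISTS (
                SELECT 1 FROM schema1.tablename1 t
                WHERE t.oid_column1 = m.oid)
        LOOP
            PERFORM lo_unlink(lo);
            removed := removed + 1;
            IF removed % 1000 = 0 THEN
                COMMIT;  -- batch the unlinks, as vacuumlo does client-side
                RAISE NOTICE '% large objects removed so far', removed;
            END IF;
        END LOOP;
        RAISE NOTICE 'done: % large objects removed', removed;
    END $$;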

In fact, this leads to another idea: vacuumlo as a client-side app
could be obsoleted and replaced by a paragraph in the docs with a
skeleton implementation in a DO block, in which users could replace
the blind search across all OID columns with a custom subquery
targeting their specific schema. As a code block, it would be
directly embeddable in a psql script or in a procedure called by
pg_cron or any equivalent tool.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
