| From: | Antonin Houska <ah(at)cybertec(dot)at> | 
|---|---|
| To: | "Euler Taveira" <euler(at)eulerto(dot)com> | 
| Cc: | "Alvaro Herrera" <alvherre(at)alvh(dot)no-ip(dot)org>, "Marcos Pegoraro" <marcos(at)f10(dot)com(dot)br>, "Michael Banck" <mbanck(at)gmx(dot)net>, "Junwang Zhao" <zhjwpku(at)gmail(dot)com>, "Kirill Reshke" <reshkekirill(at)gmail(dot)com>, "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>, "Michael Paquier" <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Re: why there is not VACUUM FULL CONCURRENTLY? | 
| Date: | 2025-04-04 16:38:54 | 
| Message-ID: | 55563.1743784734@localhost | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
Euler Taveira <euler(at)eulerto(dot)com> wrote:
> I started reviewing this patch. It was in my radar to review it but I didn't
> have spare time until now.
Thanks!
> + <refnamediv>
> +  <refname>REPACK</refname>
> +  <refpurpose>cluster a table according to an index</refpurpose>
> + </refnamediv>
> 
> This description is not accurate because the index is optional. It means if the
> index is not specified, it is a "rewrite" instead of a "cluster". One
> suggestion is to use "rewrite" because it says nothing about the tuple order. A
> "rewrite a table" or "rewrite a table to reclaim space" are good candidates. On
> the other hand, the command is called "repack" and it should be a strong
> candidate for the verb in this description. (I'm surprised that repack is not a
> recent term [1]). It seems a natural choice.
This reveals that I used the documentation of CLUSTER for "inspiration" :-) I
prefer "rewrite" because it can help users which don't know what REPACK means
in this context.
> +<synopsis>
> +REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_name</replaceable>
> [ USING INDEX<replaceable class="parameter">index_name</replaceable> ] ]
ok, fixed
> You missed a space after INDEX.
> 
> +   <para>                                                                        
> +    Because the planner records statistics about the ordering of tables, it is 
> +    advisable to                                                               
> +    run <link linkend="sql-analyze"><command>ANALYZE</command></link> on the   
> +    newly repacked table.  Otherwise, the planner might make poor choices of   
> +    query plans.                                                                   
> +   </para>
>
> If we decide for another term (other than "repacked") then it should reflect
> here and in some other parts ("repacking" is also used) too.
I think it's ok to say "repack" as long as we explained it above. The
documentation of CLUSTER command also uses the verb "cluster".
> +   <para>                                                                      
> +    When an index scan or a sequential scan without sort is used, a temporary  
> +    copy of the table is created that contains the table data in the index     
> +    order.
> 
> That's true for CLUSTER but not for REPACK. Index is optional.
ok, removed the mention of index.
> +      Prints a progress report as each table is clustered                      
> +      at <literal>INFO</literal> level.
> 
> s/clustered/repacked/?
right
> +    Repacking a partitioned table repacks each of its partitions. If an index  
> +    is specified, each partition is clustered using the partition of that      
> +    index. <command>REPACK</command> on a partitioned table cannot be executed 
> +    inside a transaction block.
> 
> Ditto.
fixed
> +  <para>                                                                       
> +   Cluster the table <literal>employees</literal> on the basis of its          
> +   index <literal>employees_ind</literal>:                                     
> +<programlisting>                                                               
> +REPACK employees USING INDEX employees_ind;                                    
> +</programlisting>                                                              
> +  </para>
> 
> It sounds strange to use "Repack" in the other examples but this one it says
> "Cluster". Let's use the same terminology.
It's explained above on the page that if index is specified, it's
clustering. I changed it to "Repack", but added a note that this is
effectively clustering.
> +
> +     <warning>
> +      <para>
> +       The <command>FULL</command> parameter is deprecated in favor of
> +       <xref linkend="sql-repack"/>.
> +      </para>
> +     </warning>
> +
> 
> The warnings, notes, and tips are usually placed *after* the description.
You probably mean the subsecions "Notes on Clustering" and "Notes on
Resources". I moved them into the "Notes" section.
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -741,13 +741,13 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
> 
> Is it worth to rename table_relation_copy_for_cluster() and
> heapam_relation_copy_for_cluster() to replace cluster with repack?
I had thought about it and concluded that it'd make the patch too
invasive. Note that the CLUSTER still uses these functions. We can do the
renaming when removing the CLUSTER command someday.
> +    SELECT
> +        S.pid AS pid,
> +        S.datid AS datid,
> +        D.datname AS datname,
> +        S.relid AS relid,
> +        CASE S.param1 WHEN 1 THEN 'REPACK'
> +                      END AS command,
> 
> Do you really need command? IIUC REPACK is the only command that will used by
> this view. There is no need to differentiate commands here.
REPACK is a regular command, so why shouldn't it have its view? Just like
CLUSTER has one (pg_stat_progress_cluster).
> + *
> + * 'cmd' indicates which commands is being executed. REPACK should be the only
> + * caller of this function in the future.
> 
> command.
Not sure I understand this comment.
> +    *
> +    * REPACK does not set indisclustered. XXX Not sure I understand the
> +    * comment above: how can an attribute be set "only in the current
> +    * database"?
>      */   
> 
> pg_index is a local catalog. To be consistent while clustering a shared
> catalog, it should set indisclustered in all existent databases because in each
> pg_index table there is a tuple for the referred index. As the comment says it
> is not possible.
Yes, Alvaro already explained this to me [1] :-)
> -   if (OidIsValid(indexOid) && OldHeap->rd_rel->relisshared)
> +   if (cmd == CLUSTER_COMMAND_CLUSTER && OldHeap->rd_rel->relisshared)
>         ereport(ERROR,
>                 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> -                errmsg("cannot cluster a shared catalog")));
> +                errmsg("cannot %s a shared catalog", cmd_str)));
> 
> I'm confused about this change. Why is it required?
> 
> If it prints this message only for CLUSTER command, you don't need to have a
> generic message. This kind of message is not good for translation. If you need
> multiple verbs here, I advise you to break it into multiple messages.
Good point, I didn't think about translation. Fixed.
> -   {
> -       if (OidIsValid(indexOid))
> -           ereport(ERROR,
> -                   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> -                    errmsg("cannot cluster temporary tables of other sessions")));
> -       else
> -           ereport(ERROR,
> -                   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> -                    errmsg("cannot vacuum temporary tables of other sessions")));
> -   }
> +       ereport(ERROR,
> +               (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                errmsg("cannot %s temporary tables of other sessions",
> +                       cmd_str)));
> 
> Ditto.
Fixed.
> -   CheckTableNotInUse(OldHeap, OidIsValid(indexOid) ? "CLUSTER" : "VACUUM");
> +   CheckTableNotInUse(OldHeap, asc_toupper(cmd_str, strlen(cmd_str)));
> 
> If the idea is to remove CLUSTER and VACUUM from this routine in the future, I
> wouldn't include formatting.h just for asc_toupper(). Instead, I would use an
> if condition. I think it will be easy to remove this code path when the time
> comes.
Fixed.
> -                errmsg("cannot cluster on index \"%s\" because access method does not support clustering",
> -                       RelationGetRelationName(OldIndex))));
> +                errmsg("cannot %s on index \"%s\" because access method does not support clustering",
> +                       cmd_str, RelationGetRelationName(OldIndex))));
> 
> Ditto. I don't think check_index_is_clusterable() should be changed. The action
> is "cluster" independently of the command. You can keep "cluster" until we
> completely remove CLUSTER command and then we can replace this term with
> "repack". It also applies to cluster_is_permitted_for_relation().
> 
> -                errmsg("cannot cluster on partial index \"%s\"",
> +                errmsg("cannot %s on partial index \"%s\"",
> +                       cmd_str,
>                         RelationGetRelationName(OldIndex))));
> 
> Ditto.
> 
> -                errmsg("cannot cluster on invalid index \"%s\"",
> -                       RelationGetRelationName(OldIndex))));
> +                errmsg("cannot %s on invalid index \"%s\"",
> +                       cmd_str, RelationGetRelationName(OldIndex))));
> 
> Ditto.
> 
> -               (errmsg("clustering \"%s.%s\" using index scan on \"%s\"",
> +               (errmsg("%sing \"%s.%s\" using index scan on \"%s\"",
> +                       cmd_str,
>                         nspname,
>                         RelationGetRelationName(OldHeap),
>                         RelationGetRelationName(OldIndex))));
> 
> This is bad for translation. Use complete sentences.
> 
> -               (errmsg("clustering \"%s.%s\" using sequential scan and sort",
> +               (errmsg("%sing \"%s.%s\" using sequential scan and sort",
> +                       cmd_str,
>                         nspname,
>                         RelationGetRelationName(OldHeap))));
> 
> Ditto.
> 
> -               (errmsg("vacuuming \"%s.%s\"",
> +               (errmsg("%sing \"%s.%s\"",
> +                       cmd_str,
>                         nspname,
>                         RelationGetRelationName(OldHeap))));
> 
> Ditto.
fixed
> /*
> - * Given an index on a partitioned table, return a list of RelToCluster for
> + * Like get_tables_to_cluster(), but do not care about indexes.
> + */
> Since the goal is to remove CLUSTER in the future, provide a comment that
> doesn't mention routines that will certainly be removed. Hence, there is no
> need to fix them in the future.
It'd be almost duplicate of the header comment of get_tables_to_cluster() and
I don't like duplication. Let's do that at removal time.
> +   /*
> +    * Get all indexes that have indisclustered set and that the current user
> +    * has the appropriate privileges for.
> +    */
> 
> This comment is not true.
Fixed.
>     ereport(WARNING,
> -           (errmsg("permission denied to cluster \"%s\", skipping it",
> +           (errmsg("permission denied to %s \"%s\", skipping it",
> +                   CLUSTER_COMMAND_STR(cmd),
>                     get_rel_name(relid))));
> 
> Fix for translation.
Fixed.
> +   if (stmt->relation != NULL)
> +   {
> +       rel = process_single_relation(stmt->relation, stmt->indexname,
> +                                     CLUSTER_COMMAND_REPACK, ¶ms,
> +                                     &indexOid);
> +       if (rel == NULL)
> +           return;
> +   }
> 
> This code path is confusing. It took me some time (after reading
> process_single_relation() that could have a better name) to understand it. I
> don't have a good suggestion but it should have at least one comment explaining
> what the purpose is.
ok, added the comment that I lost when moving the code from cluster() to
process_single_relation().
> +/*
> + * REPACK a single relation.
> + *
> + * Return NULL if done, relation reference if the caller needs to process it
> + * (because the relation is partitioned).
> + */
> 
> This comment should be expanded. As I said in the previous hunk, there isn't
> sufficient information to understand how process_single_relation() works.
This function only contains code that I moved from cluster_rel(). The header
comment is and additional information. I tried to rephrase it a bit anyway.
> +           | REPACK
> +               {
> +                   RepackStmt *n = makeNode(RepackStmt);
> +
> +                   n->relation = NULL;
> +                   n->indexname = NULL;
> +                   n->params = NIL;
> +                   $$ = (Node *) n;
> +               }
> +
> +           | REPACK '(' utility_option_list ')'
> +               {
> +                   RepackStmt *n = makeNode(RepackStmt);
> +
> +                   n->relation = NULL;
> +                   n->indexname = NULL;
> +                   n->params = $3;
> +                   $$ = (Node *) n;
> +               }
Maybe, will think about it.
> I'm wondering if there is an easy way to avoid these rules.
> 
>     PROGRESS_COMMAND_VACUUM,
>     PROGRESS_COMMAND_ANALYZE,
>     PROGRESS_COMMAND_CLUSTER,
> +   PROGRESS_COMMAND_REPACK,
>     PROGRESS_COMMAND_CREATE_INDEX,
>     PROGRESS_COMMAND_BASEBACKUP,
>     PROGRESS_COMMAND_COPY,
> 
> It is just a matter of style but I have the habit to include new stuff at the
> end.
Yes, it seems so. Fixed.
> +-- Yet another code path: REPACK w/o index.
> +REPACK clstr_tst USING INDEX clstr_tst_c;
> +-- Verify that inheritance link still works
> 
> You forgot to remove the USING INDEX here.
Good catch. I removed the test because w/o index the output order can be
unstable. (Whether index is used or not should not affect the catalog changes
related to inheritance or FKs anyway.)
> I'm still review the other patches (that is basically the implementation of
> CONCURRENTLY) and to avoid a long review, I'm sending the 0001 review. Anyway,
> 0001 is independent of the other patches and should be applied separately.
Attached is a new version of 0001.
As for the other patches, please skip the parts > 0004 - most of this code
will be removed [2]. I'll try to post the next version of the patch set next
week.
(Regarding next reviews, please try to keep hunk headers in the text.)
[1] https://www.postgresql.org/message-id/202503031807.dnacvpgnjkz7@alvherre.pgsql
[2] https://www.postgresql.org/message-id/13028.1743762516%40localhost
-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com
| Attachment | Content-Type | Size | 
|---|---|---|
| 0001-Add-REPACK-command.patch | text/x-diff | 78.0 KB | 
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Masahiko Sawada | 2025-04-04 16:39:25 | Re: Restrict copying of invalidated replication slots | 
| Previous Message | Robert Haas | 2025-04-04 16:36:33 | Re: autoprewarm_dump_now |