| From: | Rustam ALLAKOV <rustamallakov(at)gmail(dot)com> |
|---|---|
| To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Cc: | Jian He <hejian(dot)mark(at)gmail(dot)com> |
| Subject: | Re: virtual generated column as partition key |
| Date: | 2026-03-14 16:54:00 |
| Message-ID: | 177350724019.946.14013342581843838650.pgcf@coridan.postgresql.org |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, failed
Spec compliant: tested, failed
Documentation: tested, failed
Hi Jian,
thank you for your patch.
Myself and Haibo reviewed this patch. Here are our comments:
At the beginning of generated_stored.sql and generated_virtual.sql files we
can observe the following comments respectively:
> -- keep these tests aligned with generated_virtual.sql
> -- keep these tests aligned with generated_stored.sql
Although deviation between two happened before your patch, your changes to
generated_virtual.sql deviates them even more. We guess, it is required either
to update the comment on top of files or align both files.
Next, we though about whether this feature is mainly a syntactic sugar, or
there are significant benefits in terms of space and time? We already support
expressions as partition keys. We can see some usability benefits in allowing
virtual columns in partition keys, but it doesn’t seem to bring any
performance advantage.
Based on our basic benchmarking which can be found at [1].
➜ pgbench -d benchdb -f bench_scripts/bench_physical.sql -T 120 -c 10 -j 2
pgbench (19devel)
starting vacuum...end.
transaction type: bench_scripts/bench_physical.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 2
maximum number of tries: 1
duration: 120 s
number of transactions actually processed: 2847270
number of failed transactions: 0 (0.000%)
latency average = 0.421 ms
initial connection time = 27.392 ms
tps = 23731.744001 (without initial connection time)
➜ pgbench -d benchdb -f bench_scripts/bench_virtual.sql -T 120 -c 10 -j 2
pgbench (19devel)
starting vacuum...end.
transaction type: bench_scripts/bench_virtual.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 2
maximum number of tries: 1
duration: 120 s
number of transactions actually processed: 3035490
number of failed transactions: 0 (0.000%)
latency average = 0.395 ms
initial connection time = 33.673 ms
tps = 25301.897096 (without initial connection time)
➜ psql -d benchdb -f bench_scripts/bench_summary.sql
method | row_count | total_disk_size | bytes_per_row
------------------------+------------+------------------+---------------------
Physical Partitioning | 2847270 | 121 MB | 44.6388575723412251
Virtual Partitioning | 3035490 | 106 MB | 36.5652359256660374
(2 rows)
we see some gain on the total_disk_size 121 mb vs 106 mb and slightly higher
tps.
The patch weakens a long-standing internal invariant around partattrs and
partexprs. As we understand it, the current patch represents a virtual
generated partition key with both:
- a nonzero entry in partattrs, and
- a matching expression stored in partexprs.
This is the key design choice in the patch, but it also seems to be the main
source of risk. Historically, a lot of code assumes something close to:
partattrs[i] != 0 means this is a simple column partition key
partattrs[i] == 0 means the key comes from partexprs
With this patch, that is no longer strictly true, because a virtual generated
column behaves like a hybrid case: it still has an attribute number, but it
also consumes an expression entry. The current patch does update several
important consumers, but we are still a bit worried about how easy it would
be to miss some corner of the code that still relies on the old assumption.
Rather than sprinkling attrIsVirtualGenerated() checks in many places,
we wonder if it would be better to use a clearer internal representation,
or at least make the invariant more explicit.
For example:
either treat a virtual generated partition key uniformly as an expression key,
with partattrs[i] = 0; or add an explicit flag in PartitionKeyData,
something like partisgeneratedexpr[i]; at minimum, the new invariant should be
documented much more clearly in comments, and all consumers of
partattrs and partexprs should be audited carefully. Even if the current
representation is kept, we think this deserves stronger documentation near the
relevant data structures and relcache build path.
We suggest you also adjust comments at pg_partioned_table.h file as that file
contain pg_partitioned_table definition. You should keep comments in sync with
your comment updates at partcache.h
The comment for PartitionKeyData is no longer clear enough after this
patch. Virtually generated column used as a partition key has both a nonzero
partattrs[i] and a corresponding expression in partexprs.
Given that, the comment should be updated to describe the new invariant
much more explicitly, like
```
partattrs[i]` identifies the source attribute for ordinary column partition
keys. For ordinary expression partition keys, `partattrs[i]` is 0 and the
corresponding expression is stored in `partexprs
```
or if we add an explicit flag such as partisgeneratedexpr[], the comment
should describe the representation in terms of whether a key has a source
attribute, and whether it needs expression evaluation, rather than continuing
to imply that partattrs and partexprs are mutually exclusive, like
```
- partattrs[i] != 0 and partisgeneratedexpr[i] == false: ordinary column
partition key
- partattrs[i] == 0: ordinary expression partition key, with the corresponding
expression stored in partexprs
- partattrs[i] != 0 and partisgeneratedexpr[i] == true: virtual generated
column partition key; partattrs[i] identifies the generated column, while the
corresponding generation expression is also stored in partexprs
```
At src/backend/catalog/partition.c
if (partattno != 0)
{
+ if (attrIsVirtualGenerated(rel, partattno))
+ partexprs_item = lnext(partexprs, partexprs_item);
+
if (bms_is_member(partattno - FirstLowInvalidHeapAttributeNumber,
attnums))
{
this check can be moved after bms_is_member check, advancing partexprs_item
before knowing whether we are going to continue scanning makes the control
flow harder to follow, because it mutates iterator state even on the path that
immediately returns.
if (partattno != 0)
{
if (bms_is_member(partattno - FirstLowInvalidHeapAttributeNumber,
attnums))
{
if (used_in_expr)
*used_in_expr = false;
return true;
}
if (attrIsVirtualGenerated(rel, partattno))
partexprs_item = lnext(partexprs, partexprs_item);
attrIsVirtualGenerated() is fine as a helper, but its placement in relcache
code felt a little surprising to me. we can move to partcache.[h/c]
We thought might be good to improve these comments.
> errmsg("unsupported %s constraint with partition key definition",
> constraint_type),
> errdetail("%s constraints cannot be used when partition keys include virtual
> generated column.",
> constraint_type));
You can add more info on which column, name of the column, its type, etc...
We also noticed this, which is not part of your current patch,
> elog(ERROR, "wrong number of partition key expressions")
`elog` is an old style error reporting. Changing these to `ereport` could be
a good first patch for new comers.
In your last email you mentioning:
> I have double checked all occurrences of `->partattrs` in the codebase
made us wonder, why only `partattrs` and not `partexprs`? We decided to check
all occurences of `partexprs` and this lead us to discovery of couple of small
bugs related to how postgres logs errors.
with your patch applied, we saw this on psql console:
CREATE TABLE bug_test (
f1 int,
f2 int GENERATED ALWAYS AS (f1 * 2) VIRTUAL
) PARTITION BY RANGE (f2);
CREATE TABLE bug_test_p1 PARTITION OF bug_test FOR VALUES FROM (0) TO (100);
ALTER TABLE bug_test DROP COLUMN f1;
ERROR: cannot drop column f1 of table bug_test because table bug_test
requires it
HINT: You can drop table bug_test instead.
which is a bit misleading, instead expected is:
ERROR: cannot drop column "f1" because it is part of the partition key of
relation "bug_test"
another misleading error message:
CREATE TABLE deparse_test (
a int,
b int GENERATED ALWAYS AS (a * 2) VIRTUAL
) PARTITION BY RANGE (b, (a + 1));
CREATE TABLE deparse_test_p1
PARTITION OF deparse_test
FOR VALUES FROM (10, 1) TO (20, box '(1,1),(2,2)');
ERROR: specified value cannot be cast to type integer for column "(a * 2)"
LINE 3: FOR VALUES FROM (10, 1) TO (20, box '(1,1),(2,2)');
The error should say: ...cannot be cast to type integer for column "(a + 1)"
(because the second key is a + 1).
Finally,
out of curiosity we decided to ask you
```
bool
attrIsVirtualGenerated(Relation rel, AttrNumber attnum)
{
Form_pg_attribute attr;
TupleDesc tupdesc = RelationGetDescr(rel);
Assert(attnum > 0);
attr = TupleDescAttr(tupdesc, attnum - 1);
return (attr->attgenerated == ATTRIBUTE_GENERATED_VIRTUAL);
}
```
why would you not put Assert at the beginning of the function. Before
> Form_pg_attribute attr;
?
Kindest regards.
[1]: https://github.com/IamMashed/benchmarks/tree/main/virt-gen-col-part-key
The new status of this patch is: Waiting on Author
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Tom Lane | 2026-03-14 18:13:22 | Re: Areas for Solaris support modernization |
| Previous Message | Viktor Holmberg | 2026-03-14 16:32:07 | Re: [PATCH] no table rewrite when set column type to constrained domain |