In PostgreSQL this is typically simpler and faster (more performance optimization below):
SELECT DISTINCT ON (customer)
id, customer, total
FROM purchases
ORDER BY customer, total DESC, id;
Or shorter (if not as clear) with ordinal numbers of output columns:
SELECT DISTINCT ON (2)
id, customer, total
FROM purchases
ORDER BY 2, 3 DESC, 1;
If total can be NULL (won't hurt either way, but you'll want to match existing indexes):
...
ORDER BY customer, total DESC NULLS LAST, id;
Major points
DISTINCT ON is a PostgreSQL extension of the standard (where only DISTINCT on the whole SELECT list is defined).
List any number of expressions in the DISTINCT ON clause; the combined row value defines duplicates.
The manual:
Obviously, two rows are considered distinct if they differ in at least one column value. Null values are considered equal in this comparison.
Bold emphasis mine.
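A minimal sketch of what "Null values are considered equal" means for DISTINCT ON. It uses an inline VALUES list instead of the purchases table; the alias t and the sample rows are made up for illustration.

```sql
-- Hypothetical sample rows: two of them have a NULL customer.
-- For DISTINCT ON, both NULLs count as the same group.
SELECT DISTINCT ON (customer)
       customer, total
FROM  (VALUES (NULL::text, 10), (NULL, 20), ('a', 5)) AS t(customer, total)
ORDER  BY customer, total DESC;
```

This should return two rows rather than three: one for 'a' and a single row for the NULL group, which is the one with total 20 because of total DESC.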
DISTINCT ON can be combined with ORDER BY. Leading expressions in ORDER BY have to match the DISTINCT ON expressions. You can add additional expressions to ORDER BY to pick a particular row from each group of peers. I added id as last item to break ties:
"Pick the row with the smallest id from each group sharing the highest total."
To order results in a way that disagrees with the sort order determining the first row per group, you can nest the above query in an outer query with another ORDER BY.
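A sketch of such a nested query, assuming you want the final result sorted by total rather than by customer. The alias sub and the outer sort order are arbitrary choices for illustration.

```sql
SELECT *
FROM  (
   SELECT DISTINCT ON (customer)
          id, customer, total
   FROM   purchases
   ORDER  BY customer, total DESC, id   -- decides which row wins per customer
   ) sub                                -- arbitrary alias
ORDER  BY total DESC, id;               -- only affects the order of the final result
```

The inner ORDER BY still has to start with the DISTINCT ON expression; the outer one is free.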
If total can be NULL, you most probably want the row with the greatest non-null value. Add NULLS LAST as demonstrated.
The SELECT list is not constrained by expressions in DISTINCT ON or ORDER BY in any way. (Not needed in the simple case above):
You don't have to include any of the expressions in DISTINCT ON or ORDER BY.
You can include any other expression in the SELECT list.
This is instrumental for replacing much more complex queries that rely on subqueries and aggregate / window functions.
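To illustrate both points, here is a variant of the query that leaves customer out of the SELECT list and adds a computed column. The column alias double_total is hypothetical, and the sketch assumes total is a numeric type.

```sql
SELECT DISTINCT ON (customer)
       id,                           -- customer itself is not selected
       total,
       total * 2 AS double_total     -- hypothetical extra expression
FROM   purchases
ORDER  BY customer, total DESC, id;
```

The row picked per customer is the same as before; only the returned columns differ.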
I tested with Postgres versions 8.3 – 12. But the feature has been there at least since version 7.1, so basically always.
Index
The perfect index for the above query would be a multi-column index spanning all three columns in matching sequence and with matching sort order:
CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);
That may be too specialized. But use it if read performance for the particular query is crucial. If you have DESC NULLS LAST in the query, use the same in the index so that the sort order matches and the index is applicable.
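For the NULLS LAST variant, a matching index could look like this. The index name is made up; the point is only that the sort modifiers are the same in the index and in the query.

```sql
-- Hypothetical index name; sort order mirrors the ORDER BY below.
CREATE INDEX purchases_3c_nulls_last_idx
ON purchases (customer, total DESC NULLS LAST, id);

SELECT DISTINCT ON (customer)
       id, customer, total
FROM   purchases
ORDER  BY customer, total DESC NULLS LAST, id;
```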
Effectiveness / Performance optimization
Weigh cost and benefit before creating tailored indexes for each query. The potential of the above index largely depends on data distribution.
The index is used because it delivers pre-sorted data. In Postgres 9.2 or later the query can also benefit from an index-only scan if the index is smaller than the underlying table. The index has to be scanned in its entirety, though.
For few rows per customer (high cardinality in column customer), this is very efficient. Even more so if you need sorted output anyway. The benefit shrinks with a growing number of rows per customer.
Ideally, you have enough work_mem to process the involved sort step in RAM and not spill to disk. But setting work_mem too high in general can have adverse effects. Consider SET LOCAL for exceptionally big queries. Find how much you need with EXPLAIN ANALYZE. Mention of "Disk:" in the sort step indicates the need for more.
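A sketch of how that check might look. The value 256MB is an arbitrary assumption, not a recommendation; SET LOCAL only lasts until the end of the transaction, so the session default stays untouched.

```sql
BEGIN;
SET LOCAL work_mem = '256MB';   -- assumed value, applies to this transaction only

EXPLAIN ANALYZE
SELECT DISTINCT ON (customer)
       id, customer, total
FROM   purchases
ORDER  BY customer, total DESC, id;
-- In the plan, look at the "Sort Method" line of the Sort node:
-- "external merge  Disk: ..." means the sort spilled to disk,
-- "quicksort  Memory: ..." means it fit into work_mem.

ROLLBACK;                        -- nothing to persist; ends the SET LOCAL scope
```

If the plan shows no Sort node at all, the planner read pre-sorted data from an index and work_mem is not the limiting factor for this query.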
For many rows per customer (low cardinality in column customer
), a loose index scan (aka "skip scan") would be (much) more efficient, but that's not implemented up to Postgres 12. (An implementation for index-only scans is in development for Postgres 13. See <a href="https://stackoom.