[doc][2025.1] auto analyze service #25188

Open

wants to merge 11 commits into base: master
@@ -18,10 +18,10 @@ ANALYZE collects statistics about the contents of tables in the database, and st

The statistics are also used by the YugabyteDB [cost-based optimizer](../../../../../reference/configuration/yb-tserver/#yb-enable-base-scans-cost-model) (CBO) to create optimal execution plans for queries. When run on up-to-date statistics, CBO provides performance improvements and can reduce or eliminate the need to use hints or modify queries to optimize query execution.

{{< warning title="Run ANALYZE regularly" >}}
If you have enabled CBO, you must run ANALYZE on user tables after data load for the CBO to create optimal execution plans.

You can automate running ANALYZE using the [Auto Analyze service](../../../../../explore/query-1-performance/auto-analyze/).
{{< /warning >}}

The YugabyteDB implementation is based on the framework provided by PostgreSQL, which requires the storage layer to provide a random sample of rows of a predefined size. The size is calculated based on a number of factors, such as the included columns' data types.
22 changes: 12 additions & 10 deletions docs/content/preview/explore/query-1-performance/_index.md
@@ -1,6 +1,6 @@
---
title: Query Tuning
headerTitle: Query tuning
linkTitle: Query tuning
description: Tuning and optimizing query performance
headcontent: Optimize query performance
@@ -17,19 +17,19 @@ showRightNav: true

Query tuning is the art and science of improving the performance of SQL queries. It involves understanding the database's architecture, query execution plans, and performance metrics. By identifying and addressing performance bottlenecks, you can significantly enhance the responsiveness of your applications and reduce the load on your database infrastructure.

This guide provides an overview of query tuning techniques for distributed SQL databases, including strategies, best practices, and tools to help you optimize queries and achieve optimal performance.

## Identify slow queries

The pg_stat_statements extension provides a comprehensive view of query performance, and is essential for database administrators and developers aiming to enhance database efficiency. You can use the pg_stat_statements extension to get statistics on past queries. It collects detailed statistics on query execution, including the number of executions, total execution time, and resource usage metrics like block hits and reads. This data can help you identify performance bottlenecks and optimize query performance.

{{<lead link="./pg-stat-statements/">}}
Learn how to fetch query statistics and improve performance using [pg_stat_statements](./pg-stat-statements/).
{{</lead>}}
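
As a sketch of what this looks like in practice, the following query lists the most expensive statements by cumulative execution time. Note that the timing column name varies across PostgreSQL versions (`total_time` in older versions, `total_exec_time` in newer ones), so check your version before running it:

```sql
-- Top 5 statements by cumulative execution time; assumes the
-- pg_stat_statements extension has been created in this database
SELECT query, calls, total_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;
```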

## Column statistics

The pg_stats view provides a user-friendly display of the column-level data distribution of tables. This view includes information about table columns, such as the fraction of null entries, average width, number of distinct values, and most common values. These statistics are crucial for the query planner to make informed decisions about the most efficient way to execute queries. By regularly analyzing the statistics in pg_stats, you can identify opportunities for optimization (such as creating or dropping indexes), and fine-tune your database configuration for optimal performance.
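
For instance, a quick way to inspect these statistics for a single table (using a hypothetical table name `users`) is:

```sql
-- Column-level statistics for one table; null_frac, avg_width, and
-- n_distinct feed directly into the planner's selectivity estimates
SELECT attname, null_frac, avg_width, n_distinct
FROM pg_stats
WHERE tablename = 'users';
```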

{{<lead link="./pg-stats/">}}
Learn how to understand column level statistics and improve query performance using [pg_stats](./pg-stats/).
@@ -62,14 +62,16 @@ $ ./bin/yb-tserver --ysql_log_min_duration_statement 1000

Results are written to the current `postgres*log` file.

(Depending on the database and the work being performed, long-running queries don't necessarily need to be optimized. Ensure that the threshold is high enough so that you don't flood the `postgres*log` log files.)

{{<lead link="/preview/troubleshoot/nodes/check-logs/#yb-tserver-logs">}}
Learn more about [YB-TServer logs](/preview/troubleshoot/nodes/check-logs/#yb-tserver-logs).
{{</lead>}}

## Auto Analyze

To create optimal plans for queries, the query planner needs accurate and up-to-date statistics related to tables and their columns. ANALYZE collects statistics about the contents of tables in the database, and stores the results in the `pg_statistic` system catalog. Similar to [PostgreSQL autovacuum](https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM), the YugabyteDB Auto Analyze service automates the execution of ANALYZE commands for any table where rows have changed more than a configurable threshold for the table. This ensures table statistics are always up to date.

{{<lead link="./auto-analyze/">}}
To learn more, see [Auto Analyze service](./auto-analyze/).
{{</lead>}}
89 changes: 89 additions & 0 deletions docs/content/preview/explore/query-1-performance/auto-analyze.md
@@ -0,0 +1,89 @@
---
title: Auto Analyze service
headerTitle: Auto Analyze service
linkTitle: Auto Analyze
description: Use the Auto Analyze service to keep table statistics up to date
headcontent: Keep table statistics up to date automatically
tags:
  feature: tech-preview
menu:
  preview:
    identifier: auto_analyze
    parent: query-tuning
weight: 700
type: docs
---

To create optimal plans for queries, the query planner needs accurate and up-to-date statistics related to tables and their columns. These statistics are also used by the YugabyteDB [cost-based optimizer](../../../reference/configuration/yb-tserver/#yb-enable-base-scans-cost-model) (CBO) to create optimal execution plans for queries. To generate the statistics, you run the [ANALYZE](../../../api/ysql/the-sql-language/statements/cmd_analyze/) command. ANALYZE collects statistics about the contents of tables in the database, and stores the results in the `pg_statistic` system catalog.

Similar to [PostgreSQL autovacuum](https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM), the YugabyteDB Auto Analyze service automates the execution of ANALYZE commands for any table where rows have changed more than a configurable threshold for the table. This ensures table statistics are always up to date.

## Enable Auto Analyze

The Auto Analyze service is {{<tags/feature/tp>}}. Before you can use the feature, you must enable it by setting `ysql_enable_auto_analyze_service` to true on all YB-Masters, and both `ysql_enable_auto_analyze_service` and `ysql_enable_table_mutation_counter` to true on all YB-TServers.

For example, to create a single-node [yugabyted](../../../reference/configuration/yugabyted/) cluster with Auto Analyze enabled, use the following command:

```sh
./bin/yugabyted start --master_flags "ysql_enable_auto_analyze_service=true" --tserver_flags "ysql_enable_auto_analyze_service=true,ysql_enable_table_mutation_counter=true"
```

To enable Auto Analyze on an existing cluster, a rolling restart is required to set `ysql_enable_auto_analyze_service` and `ysql_enable_table_mutation_counter` to true.

## Configure Auto Analyze

You can control how frequently the service updates table statistics using the following YB-TServer flags:

- `ysql_auto_analyze_threshold` - the minimum number of mutations (INSERT, UPDATE, and DELETE) needed to run ANALYZE on a table. Default is 50.
- `ysql_auto_analyze_scale_factor` - a fraction that determines when enough mutations have been accumulated to run ANALYZE for a table. Default is 0.1.

Increasing either of these flags reduces the frequency of statistics updates.

If the total number of mutations for a table is greater than its analyze threshold, then the service runs ANALYZE on the table. The analyze threshold of a table is calculated as follows:

```sh
analyze_threshold = ysql_auto_analyze_threshold + (ysql_auto_analyze_scale_factor * <table_size>)
```

where `<table_size>` is the current `reltuples` column value stored in the `pg_class` catalog.

`ysql_auto_analyze_threshold` is important for small tables. With default settings, if a table has 100 rows and 20 are mutated, ANALYZE won't run as the threshold is not met, even though 20% of the rows are mutated.

On the other hand, `ysql_auto_analyze_scale_factor` is especially important for big tables. If a table has 1,000,000,000 rows, 10% (100,000,000 rows) would have to be mutated before ANALYZE runs. Set the scale factor to a lower value to allow for more frequent statistics collection for such large tables.
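
The interplay of the two flags can be sketched with a little arithmetic. The following Python snippet is only an illustration of the formula above (not part of the service itself), using the default flag values:

```python
def analyze_threshold(table_size, threshold=50, scale_factor=0.1):
    """Mutations needed before Auto Analyze runs ANALYZE, per the formula
    analyze_threshold = ysql_auto_analyze_threshold
                        + ysql_auto_analyze_scale_factor * table_size."""
    return threshold + scale_factor * table_size

# Small table: 20 of 100 rows mutated is still below the threshold of 60,
# so ANALYZE does not run even though 20% of the rows changed.
print(analyze_threshold(100))            # 60.0

# Large table: with defaults, roughly 100 million rows must change first.
print(analyze_threshold(1_000_000_000))  # 100000050.0

# Lowering the scale factor makes large tables re-analyze far sooner.
print(analyze_threshold(1_000_000_000, scale_factor=0.001))  # 1000050.0
```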

In addition, `ysql_auto_analyze_batch_size` controls the maximum number of tables the Auto Analyze service tries to analyze in a single ANALYZE statement. The default is 10. Setting this flag to a larger value can potentially reduce the number of YSQL catalog cache refreshes if Auto Analyze decides to ANALYZE many tables in the same database at the same time.

For more information on flags used to configure the Auto Analyze service, refer to [Auto Analyze service flags](../../../reference/configuration/yb-tserver#auto-analyze-service-flags).

## Example

With Auto Analyze enabled, try the following SQL statements.

```sql
CREATE TABLE test (k INT PRIMARY KEY, v INT);
SELECT reltuples FROM pg_class WHERE relname = 'test';
```

```output
reltuples
-----------
-1
(1 row)
```

```sql
INSERT INTO test SELECT i, i FROM generate_series(1, 100) i;
-- Wait a few seconds
SELECT reltuples FROM pg_class WHERE relname = 'test';
```

```output
reltuples
-----------
100
(1 row)
```

## Limitations

Because ANALYZE is a DDL statement, it can cause DDL conflicts when run concurrently with other DDL statements. As Auto Analyze runs ANALYZE in the background, you should turn off Auto Analyze if you want to execute DDL statements. You can do this by setting `ysql_enable_auto_analyze_service` to false on all YB-TServers at runtime.
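
One way to change the flag at runtime is with `yb-ts-cli`. This is a sketch only: the server address is a placeholder for each node in your cluster, you must repeat the command for every YB-TServer, and runtime flag changes do not persist across restarts.

```shell
# Disable the Auto Analyze service on one YB-TServer at runtime;
# repeat for every YB-TServer in the cluster (add --force if the
# flag is not tagged as runtime-settable in your version)
./bin/yb-ts-cli --server_address=127.0.0.1:9100 set_flag ysql_enable_auto_analyze_service false
```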
6 changes: 3 additions & 3 deletions docs/content/stable/explore/query-1-performance/_index.md
@@ -15,19 +15,19 @@ showRightNav: true

Query tuning is the art and science of improving the performance of SQL queries. It involves understanding the database's architecture, query execution plans, and performance metrics. By identifying and addressing performance bottlenecks, you can significantly enhance the responsiveness of your applications and reduce the load on your database infrastructure.

This guide provides an overview of query tuning techniques for distributed SQL databases, including strategies, best practices, and tools to help you optimize queries and achieve optimal performance.

## Identify slow queries

The pg_stat_statements extension provides a comprehensive view of query performance, and is essential for database administrators and developers aiming to enhance database efficiency. You can use the pg_stat_statements extension to get statistics on past queries. It collects detailed statistics on query execution, including the number of executions, total execution time, and resource usage metrics like block hits and reads. This data can help you identify performance bottlenecks and optimize query performance.

{{<lead link="./pg-stat-statements/">}}
Learn how to fetch query statistics and improve performance using [pg_stat_statements](./pg-stat-statements/).
{{</lead>}}

## Column statistics

The pg_stats view provides a user-friendly display of the column-level data distribution of tables. This view includes information about table columns, such as the fraction of null entries, average width, number of distinct values, and most common values. These statistics are crucial for the query planner to make informed decisions about the most efficient way to execute queries. By regularly analyzing the statistics in pg_stats, you can identify opportunities for optimization (such as creating or dropping indexes), and fine-tune your database configuration for optimal performance.

{{<lead link="./pg-stats/">}}
Learn how to understand column level statistics and improve query performance using [pg_stats](./pg-stats/).