<section class="blog-post-content"><p><a href="https://aws.amazon.com/rds/postgresql/what-is-postgresql/" target="_blank" rel="noopener noreferrer">PostgreSQL</a> has become the preferred <a href="https://aws.amazon.com/products/databases/open-source-databases/" target="_blank" rel="noopener noreferrer">open-source</a> relational database for many enterprise developers and startups, and powers leading business and mobile applications. AWS provides two managed PostgreSQL options: <a href="https://aws.amazon.com/rds/postgresql/" target="_blank" rel="noopener noreferrer">Amazon Relational Database Service (Amazon RDS) for PostgreSQL</a> and <a href="https://aws.amazon.com/rds/aurora/" target="_blank" rel="noopener noreferrer">Amazon Aurora PostgreSQL-Compatible Edition</a>.</p><p>Database statistics play a key role in improving the performance of the database. The query planner uses the statistical data to generate efficient query plans. The purpose of this post is to explain the types of statistics in PostgreSQL and how to read and understand them. This applies to both Amazon RDS for PostgreSQL and Aurora PostgreSQL.</p><p>The following types of statistics are collected and available in PostgreSQL:</p><ul><li>Data distribution statistics</li><li>Extended statistics</li><li>Monitoring statistics</li></ul><p>We explain each type in more detail in this post.</p><h2>Data distribution statistics</h2><p>These statistics describe the data distribution of each relation. They provide information about the most common values in each column of a relation, the average width of the column, the number of distinct values in the column, and more. They’re collected when we run <a href="https://www.postgresql.org/docs/current/sql-analyze.html" target="_blank" rel="noopener noreferrer">ANALYZE</a> or when analyze is triggered by <a href="https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM" target="_blank" rel="noopener noreferrer">autovacuum</a>, and are stored in the <a href="https://www.postgresql.org/docs/current/catalog-pg-statistic.html" target="_blank" rel="noopener noreferrer">pg_statistic</a> system catalog (whose publicly readable view is <a href="https://www.postgresql.org/docs/current/view-pg-stats.html" target="_blank" rel="noopener noreferrer">pg_stats</a>).</p><p>The following is an example of how you can generate and see these stats:</p><ol><li>Create a table and insert some dummy data (see the sketch after this list).</li><li>Analyze the table to generate its stats (also shown in the sketch).</li><li>Check the stats of the table in the <code>pg_stats</code> view:<div class="hide-language"><pre class="lang-sql">postgres=> SELECT * FROM pg_stats WHERE tablename = 'test_stats';
-[ RECORD 1 ]----------+------------------------------------------------
schemaname             | public
tablename              | test_stats
attname                | id
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | -0.5
most_common_vals       | {1,2,3,4,5,6,7,8,9,10}
most_common_freqs      | {0.075,0.075,0.075,0.075,0.075,0.075,0.075,0.075,0.075,0.075}
histogram_bounds       | {11,12,13,14,15,16,17,18,19,20}
correlation            | 0.7551595
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |
-[ RECORD 2 ]----------+------------------------------------------------
schemaname             | public
tablename              | test_stats
attname                | name
inherited              | f
null_frac              | 0
avg_width              | 6
n_distinct             | -0.5
most_common_vals       | {test1,test10,test2,test3,test4,test5,test6,test7,test8,test9}
most_common_freqs      | {0.075,0.075,0.075,0.075,0.075,0.075,0.075,0.075,0.075,0.075}
histogram_bounds       | {test11,test12,test13,test14,test15,test16,test17,test18,test19,test20}
correlation            | -0.19043152
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |
postgres=></pre></div></li></ol>
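<p>The following is a minimal sketch of steps 1 and 2. The table name <code>test_stats</code> and its columns are taken from the output above; the exact <code>INSERT</code> pattern is an assumption, chosen so that values 1–10 each account for 3 of 40 rows (the 0.075 frequency shown in <code>most_common_freqs</code>):</p><div class="hide-language"><pre class="lang-sql">-- Step 1: create the table and insert some dummy data.
CREATE TABLE test_stats (id int, name varchar);

-- Insert ids 1-10 three times each (30 rows) and 11-20 once each (10 rows),
-- so each of the values 1-10 has frequency 3/40 = 0.075.
INSERT INTO test_stats SELECT g, 'test' || g FROM generate_series(1, 10) AS g;
INSERT INTO test_stats SELECT g, 'test' || g FROM generate_series(1, 10) AS g;
INSERT INTO test_stats SELECT g, 'test' || g FROM generate_series(1, 20) AS g;

-- Step 2: analyze the table to generate its statistics.
ANALYZE test_stats;</pre></div>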
<p>As shown in the <code>pg_stats</code> output, the view has the data distribution stats for each column of the <code>test_stats</code> table. For example, there are 20 unique values for the column <code>id</code>; however, only 10 values appear in <code>most_common_vals</code>, because only those values are duplicated, and similarly <code>test1</code> through <code>test10</code> appear as the most common values for the <code>name</code> column. (A negative <code>n_distinct</code> means the number of distinct values is estimated as a fraction of the row count; -0.5 here means the distinct values are half the number of rows.) The most common value list is used to help the planner predict the selectivity of equality expressions, such as <code>where name = 'test5'</code> or <code>where state = 'TX'</code>. The histogram bounds are used to help the planner predict the selectivity of inequality or range expressions, such as <code>where id between 5000 and 10000</code>.</p><p>If you look at the <code>correlation</code> column for the <code>name</code> column, its value is -0.19 (near 0). When the value is near -1 or +1, an index scan on the column is estimated to be cheaper than when it’s near 0, due to the reduction of random access to the disk. Because there are only 40 rows, a value near 0 indicates that an index isn’t required for this column. There are no null values in the table, so <code>null_frac</code> is 0.</p><p>The query planner needs to estimate the number of rows retrieved by a query in order to make good choices of query plans, and it uses these data distribution statistics to make those estimates.</p><p>The following are a few key things to note about the statistics:</p><ul><li>For large tables, <code>ANALYZE</code> takes a random sample of the table contents, rather than examining every row. This allows even very large tables to be analyzed in a small amount of time.</li><li>The number of rows sampled by <code>ANALYZE</code> depends on the <a href="https://www.postgresql.org/docs/current/runtime-config-query.html#GUC-DEFAULT-STATISTICS-TARGET" target="_blank" rel="noopener noreferrer">default_statistics_target</a> parameter. Larger values increase the time needed to do <code>ANALYZE</code>, but might improve the quality of the planner’s estimates. The default value of 100 is sufficient for accurate plans in most cases; however, <code>default_statistics_target</code> is the global default. If a single column needs more detailed statistics, you can use <code>ALTER TABLE table_name ALTER COLUMN column_name SET STATISTICS integer</code> (see the sketch after this list), at the cost of more CPU, memory, and time during <code>ANALYZE</code>. If the value of this parameter is 100, then 300 × 100 = 30,000 rows are sampled from each table. This sample is used to determine up to 100 most common values to store in the <code>most_common_vals</code> array column, and up to 100 histogram bounds to store in the <code>histogram_bounds</code> array, plus a few other scalar statistics, like the number of distinct values.</li><li>There is no command or function for resetting these statistics (as in resetting to 0).</li><li>After you complete an engine major version upgrade, you should run the <code>ANALYZE</code> operation to refresh the <code>pg_statistic</code> table (to have the statistics updated for the planner’s use).</li><li>For a read replica in Amazon RDS for PostgreSQL and for a reader node in Aurora PostgreSQL, these stats are the same as for the primary or writer node. This is because they are stored in a relation (<code>pg_statistic</code>) on disk: the physical blocks are the same on the replica in Amazon RDS for PostgreSQL, and in the case of Aurora, the reader reads from the same storage. This is also why you can’t (and it wouldn’t be logical to) run <code>ANALYZE</code> on a replica or a reader node; both can read from the <code>pg_statistic</code> relation, but can’t update it.</li></ul>
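<p>As an example of the per-column statistics target mentioned in the list above, the following minimal sketch raises the target for one column of the earlier <code>test_stats</code> table (the target value 500 is arbitrary, for illustration):</p><div class="hide-language"><pre class="lang-sql">-- Gather finer-grained statistics for a single column only.
ALTER TABLE test_stats ALTER COLUMN name SET STATISTICS 500;

-- Re-gather the stats using the new per-column target.
ANALYZE test_stats;</pre></div>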
<h2>Extended statistics</h2><p>By default, the statistics from <code>ANALYZE</code> are stored on a per-column per-table basis, and therefore can’t capture any knowledge about cross-column correlation. It’s common to see slow queries running with bad query plans because multiple columns used in the query clauses are correlated. However, with the <a href="https://www.postgresql.org/docs/current/sql-createstatistics.html" target="_blank" rel="noopener noreferrer">CREATE STATISTICS</a> command, you can create extended statistics for correlated columns.</p><p>You can use <code>CREATE STATISTICS</code> for a single expression (which is called <em>univariate statistics</em>), which provides benefits similar to an expression index without the overhead of index maintenance. You can also use <code>CREATE STATISTICS</code> on multiple columns or expressions (known as <em>multivariate statistics</em>).</p><p>Consider a common table with a relationship between a city and state in the US. For example, say we have a query to find the rows from the city of Springfield in the state of Illinois. There is a city named Springfield in most states, and the state of Illinois has many cities. Without knowledge of the relationship, the planner estimates the number of rows for that combination by simply multiplying the two individual probabilities. Extended statistics gather stats on the two columns together, so the estimate of the number of rows from Springfield, Illinois, is a lot closer.</p><p>Let’s use a table with two integer columns to demonstrate the usage of extended stats. Complete the following steps (a sketch of the statements follows this list):</p><ol><li>Create a table with columns <code>a</code> and <code>b</code> and insert some data.<p>In the data inserted, the columns are functionally dependent: knowledge of the value in the first column is sufficient to determine the value in the other column.</p></li><li>Look at the <code>pg_stats</code> values for these columns.<p>You can see the number of distinct values for column <code>a</code> as 1000 and for <code>b</code> as 100.</p></li><li>Without extended statistics, selecting on one column looks fine, with the estimated rows as 999, which is almost the same as the actual number of rows (1,000).</li><li>However, if you add the second column to the predicate, the planner estimates the rows very differently, because no dependency statistics are gathered.<p>You can see the estimated rows as 1, which is far from the actual number of rows (1,000).</p></li><li>Let’s gather extended statistics.</li></ol><p>As you can see, the estimated row count (1,017) is very close to the actual number.</p><p>All the queries with filters on both columns will have much better estimates using extended statistics.</p>
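<p>The following is a minimal sketch of these steps. The table name <code>ext_stats</code> and the statistics name <code>ext_stats_dep</code> are hypothetical, and the data-generation pattern (where <code>b</code> is derived from <code>a</code>) is an assumption; the exact estimates in your <code>EXPLAIN</code> output will differ:</p><div class="hide-language"><pre class="lang-sql">-- Step 1: b is functionally dependent on a (b = a / 10 in integer division).
CREATE TABLE ext_stats (a int, b int);
INSERT INTO ext_stats
SELECT i / 1000, i / 10000 FROM generate_series(1, 1000000) AS s(i);
ANALYZE ext_stats;

-- Step 2: per-column distinct counts (roughly 1000 for a, 100 for b).
SELECT attname, n_distinct FROM pg_stats WHERE tablename = 'ext_stats';

-- Step 3: single-column predicate; the estimate is close to the actual rows.
EXPLAIN ANALYZE SELECT * FROM ext_stats WHERE a = 1;

-- Step 4: both columns; assuming independence, the planner multiplies the
-- selectivities and badly underestimates the row count.
EXPLAIN ANALYZE SELECT * FROM ext_stats WHERE a = 1 AND b = 0;

-- Step 5: gather dependency statistics on (a, b), re-analyze, and re-check;
-- the estimate is now much closer to the actual rows.
CREATE STATISTICS ext_stats_dep (dependencies) ON a, b FROM ext_stats;
ANALYZE ext_stats;
EXPLAIN ANALYZE SELECT * FROM ext_stats WHERE a = 1 AND b = 0;</pre></div>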
<h2>Monitoring statistics</h2><p>In addition to data distribution and extended statistics, there are monitoring statistics as well. These types of stats are collected by the <a href="https://www.postgresql.org/docs/current/monitoring-stats.html" target="_blank" rel="noopener noreferrer">statistics collector</a>.</p><p>These statistics track accesses to tables and indexes, in both disk-block and individual-row terms. They also track the total number of rows in each table, and information about vacuum and analyze actions for each table (when they were last run on the table).</p><p><a href="https://www.postgresql.org/docs/15/release-15.html" target="_blank" rel="noopener noreferrer">PostgreSQL 15</a> introduces a new mechanism that stores <a href="https://www.postgresql.org/docs/15/monitoring-stats.html" target="_blank" rel="noopener noreferrer">server-level statistics</a> in shared memory. Previously, this information was updated via UDP packets, stored in the file system, and read by sessions; in PostgreSQL 15, there is no longer a statistics collector process.</p><p>Querying a view such as <code>pg_stat_user_tables</code> for a table (for example, <code>monitor_stats</code>) shows information like the number of live rows, total inserts, total dead rows, vacuum information, and more; the sketch at the end of this section includes a similar query.</p><p>For PostgreSQL 14 or older, the statistics collector transmits the collected information to other PostgreSQL processes through temporary files. These files are stored in the directory named by the <a href="https://www.postgresql.org/docs/current/runtime-config-statistics.html#GUC-STATS-TEMP-DIRECTORY" target="_blank" rel="noopener noreferrer">stats_temp_directory</a> parameter, <code>pg_stat_tmp</code> by default.</p><p>When the server shuts down cleanly, a permanent copy of the statistics data is stored in the <code>pg_stat</code> subdirectory, so that statistics can be retained across server restarts. When recovery is performed at server start (such as after an immediate shutdown, a server crash, or point-in-time recovery), all statistics counters are reset.</p><p>In PostgreSQL 14 and older, the collector itself emits a new report at most once per <code>PGSTAT_STAT_INTERVAL</code> milliseconds (500 milliseconds unless altered while building the server).</p><p><code>pg_stat_activity</code>, <code>pg_stat_replication</code>, <code>pg_stat_all_tables</code>, <code>pg_stat_user_indexes</code>, and <code>pg_statio_all_tables</code> are some examples of the views that are refreshed from these reports. For more information on statistics views, refer to Viewing Statistics.</p><p>You can use the <code>pg_stat_reset()</code> function to reset all statistics counters for the current database to zero.</p><p>In the following example (the statements appear in the sketch after this section), <code>seq_scan</code> is 0 at first on the table <code>reset_stats</code>.</p><p>An <code>explain analyze</code> command actually runs the query, so it increments the <code>seq_scan</code> counter: after running <code>explain analyze</code> once on <code>reset_stats</code> for a <code>select *</code> query, <code>seq_scan</code> is 1, and it increments again after running the same query a second time.</p><p>The <code>pg_stat_reset()</code> function is then used to reset all statistics for the database to 0.</p>
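<p>A minimal sketch of this walkthrough follows. The table name <code>reset_stats</code> comes from the text above; its definition and row count are assumptions:</p><div class="hide-language"><pre class="lang-sql">CREATE TABLE reset_stats (id int, name varchar);
INSERT INTO reset_stats SELECT g, 'name' || g FROM generate_series(1, 1000) AS g;

-- seq_scan starts at 0; this query also shows live/dead rows and vacuum info.
SELECT relname, seq_scan, n_live_tup, n_dead_tup, n_tup_ins,
       last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'reset_stats';

-- EXPLAIN ANALYZE actually runs the query, incrementing seq_scan to 1.
EXPLAIN ANALYZE SELECT * FROM reset_stats;
SELECT relname, seq_scan FROM pg_stat_user_tables WHERE relname = 'reset_stats';

-- Running the same query again increments seq_scan to 2.
EXPLAIN ANALYZE SELECT * FROM reset_stats;
SELECT relname, seq_scan FROM pg_stat_user_tables WHERE relname = 'reset_stats';

-- Reset all statistics counters for the current database to zero.
SELECT pg_stat_reset();
SELECT relname, seq_scan FROM pg_stat_user_tables WHERE relname = 'reset_stats';</pre></div>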
<p>Because collecting statistics adds some overhead to query runs, the system can be configured to collect or not collect this information. This is controlled by configuration parameters that are normally set in <code>postgresql.conf</code>: <a href="https://www.postgresql.org/docs/current/runtime-config-statistics.html#GUC-TRACK-ACTIVITIES" target="_blank" rel="noopener noreferrer">track_activities</a>, <a href="https://www.postgresql.org/docs/current/runtime-config-statistics.html#GUC-TRACK-COUNTS" target="_blank" rel="noopener noreferrer">track_counts</a>, <a href="https://www.postgresql.org/docs/current/runtime-config-statistics.html#GUC-TRACK-FUNCTIONS" target="_blank" rel="noopener noreferrer">track_functions</a>, and <a href="https://www.postgresql.org/docs/current/runtime-config-statistics.html#GUC-TRACK-IO-TIMING" target="_blank" rel="noopener noreferrer">track_io_timing</a>.</p><p>Because these statistics are collected independently on each PostgreSQL server, and access patterns to the database relations differ on each server, these stats are different for each instance. This means that RDS for PostgreSQL replicas and Aurora PostgreSQL readers have different monitoring stats than the primary or writer node.</p><h2>Summary</h2><p>In this post, we learned about the different types of statistics available in PostgreSQL. The data distribution statistics are collected whenever <code>analyze</code> is run (either manually or by autovacuum), and these stats are used by the query planner. The extended statistics are useful when there is a correlation between multiple columns, so that the planner can estimate based on the dependencies between the columns. You can use the monitoring statistics to see the number of sequential scans, the number of index scans, live and dead tuples in each relation, I/O stats for each relation, information about vacuum and analyze on each relation, and more. These stats are refreshed every 500 milliseconds by default.</p><p>If you have any questions or suggestions about this post, leave a comment.</p><h3>About the authors</h3><p class="c4"><img class="size-full wp-image-13524 alignleft" src="https://d2908q01vomqb2.cloudfront.net/887309d048beef83ad3eabf2a79a64a389ab1c9f/2021/02/24/Divya-Sharma-p.png" alt="Divya Sharma" width="100" height="133" /><strong>Divya Sharma</strong> is a Database Specialist Solutions Architect at AWS, focusing on RDS/Aurora PostgreSQL. She has helped multiple enterprise customers move their databases to AWS, providing assistance on PostgreSQL performance and best practices.</p><p class="c4"><strong><a href="https://d2908q01vomqb2.cloudfront.net/887309d048beef83ad3eabf2a79a64a389ab1c9f/2021/03/16/hyperbadge_shbaji.png" target="_blank" rel="noopener noreferrer"><img class="size-full wp-image-13779 alignleft" src="https://d2908q01vomqb2.cloudfront.net/887309d048beef83ad3eabf2a79a64a389ab1c9f/2021/03/16/hyperbadge_shbaji.png" alt="Baji Shaik" width="100" height="101" /></a>Baji Shaik</strong> is a Sr. Database Consultant with AWS ProServe, GCC AMER. His background spans a wide depth and breadth of expertise and experience in SQL and NoSQL database technologies. He is a database migration expert and has developed many successful database solutions addressing challenging business requirements for moving databases from on premises to Amazon RDS and Aurora PostgreSQL/MySQL. He is an eminent author, having written several books on PostgreSQL.
A few of his recent works include “<a class="c-link" href="https://www.apress.com/gp/book/9781484256626" target="_blank" rel="noopener noreferrer">PostgreSQL Configuration</a>”, “<a class="c-link" href="https://www.apress.com/gp/book/9781484234464" target="_blank" rel="noopener noreferrer">Beginning PostgreSQL on the Cloud</a>”, and “<a class="c-link" href="https://www.packtpub.com/in/big-data-and-business-intelligence/postgresql-development-essentials" target="_blank" rel="noopener noreferrer">PostgreSQL Development Essentials</a>”. Furthermore, he has delivered several conference and workshop sessions.</p></section>