Cloudera Impala provides an interface for executing SQL queries on data(Big Data) stored in HDFS or HBase in a fast and interactive way. Impala improves the performance of an SQL query by applying various optimization techniques. “Compute Stats” is one of these optimization techniques.
“Compute Stats” collects the details of the volume and distribution of data in a table and all associated columns and partitions. The information is stored in the metastore database and used by Impala to help optimize queries. This would help in preparing the efficient query plan before executing a query on a large table.
Syntax:
1 2 3 4 |
COMPUTE STATS [schema_name.]table_name COMPUTE INCREMENTAL STATS [schema_name.]table_name [PARTITION (partition_spec)] partition_spec ::= partition_col=constant_value |
The PARTITION clause is only allowed in combination with the INCREMENTAL clause. It is optional for COMPUTE INCREMENTAL STATS, and required for DROP INCREMENTAL STATS. Whenever you specify partitions through the PARTITION (partition_spec) clause in a COMPUTE INCREMENTAL STATS or DROP INCREMENTAL STATSstatement, you must include all the partitioning columns in the specification, and specify constant values for all the partition key columns.
COMPUTE STATS will prepare the stats of entire table whereas COMPUTE INCREMENTAL STATS will work only on few of the partitions rather than the whole table. It will be helpful if the table is very large and takes a lot of time in performing COMPUTE STATS for the entire table each time a partition added or dropped.
We can see the stats of a table using the SHOW TABLE STATS command.
1 2 3 4 5 6 7 8 9 10 |
Query: show table stats clearmydoubt_db.test_table_1 +----------+-------+--------+------+--------------+-------------------+---------+-------------------+------------------------------------------------------------+ | part | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location | +----------+-------+--------+------+--------------+-------------------+---------+-------------------+------------------------------------------------------------+ | 20180101 | -1 | 0 | 0B | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://myworkstation.admin:8020/test_table_1/part=20180101 | | 20180102 | -1 | 0 | 0B | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://myworkstation.admin:8020/test_table_1/part=20180102 | | 20180103 | -1 | 0 | 0B | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://myworkstation.admin:8020/test_table_1/part=20180103 | | 20180104 | -1 | 0 | 0B | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://myworkstation.admin:8020/test_table_1/part=20180104 | | Total | -1 | 0 | 0B | 0B | | | | | +----------+-------+--------+------+--------------+-------------------+---------+-------------------+------------------------------------------------------------+ |
#Rows column displays -1 for all the partitions as the stats have not been created yet.
Let’s see the usage of Compute Stats:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 |
> COMPUTE INCREMENTAL STATS clearmydoubt_db.test_table_1 PARTITION(part='20180104'); Query: compute INCREMENTAL STATS clearmydoubt_db.test_table_1 PARTITION(part='20180104') +-----------------------------------------+ | summary | +-----------------------------------------+ | Updated 1 partition(s) and 2 column(s). | +-----------------------------------------+ Fetched 1 row(s) in 2.60s > show table stats clearmydoubt_db.test_table_1; Query: show table stats clearmydoubt_db.test_table_1 +----------+-------+--------+------+--------------+-------------------+---------+-------------------+------------------------------------------------------------+ | part | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location | +----------+-------+--------+------+--------------+-------------------+---------+-------------------+------------------------------------------------------------+ | 20180101 | -1 | 0 | 0B | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://myworkstation.admin:8020/test_table_1/part=20180101 | | 20180102 | -1 | 0 | 0B | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://myworkstation.admin:8020/test_table_1/part=20180102 | | 20180103 | -1 | 0 | 0B | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://myworkstation.admin:8020/test_table_1/part=20180103 | | 20180104 | 0 | 0 | 0B | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myworkstation.admin:8020/test_table_1/part=20180104 | | Total | 0 | 0 | 0B | 0B | | | | | +----------+-------+--------+------+--------------+-------------------+---------+-------------------+------------------------------------------------------------+ Fetched 5 row(s) in 0.12s > COMPUTE INCREMENTAL STATS clearmydoubt_db.test_table_1; Query: compute INCREMENTAL STATS clearmydoubt_db.test_table_1 +-----------------------------------------+ | summary | +-----------------------------------------+ | Updated 3 partition(s) and 2 column(s). | +-----------------------------------------+ Fetched 1 row(s) in 0.31s > show table stats clearmydoubt_db.test_table_1; Query: show table stats clearmydoubt_db.test_table_1 +----------+-------+--------+------+--------------+-------------------+---------+-------------------+------------------------------------------------------------+ | part | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location | +----------+-------+--------+------+--------------+-------------------+---------+-------------------+------------------------------------------------------------+ | 20180101 | 0 | 0 | 0B | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myworkstation.admin:8020/test_table_1/part=20180101 | | 20180102 | 0 | 0 | 0B | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myworkstation.admin:8020/test_table_1/part=20180102 | | 20180103 | 0 | 0 | 0B | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myworkstation.admin:8020/test_table_1/part=20180103 | | 20180104 | 0 | 0 | 0B | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myworkstation.admin:8020/test_table_1/part=20180104 | | Total | 0 | 0 | 0B | 0B | | | | | +----------+-------+--------+------+--------------+-------------------+---------+-------------------+------------------------------------------------------------+ Fetched 5 row(s) in 0.11s |
Once we perform compute [incremental] stats on a table, the #Rows details get updated with the actual table records in those respective partitions.
Impala uses these details in preparing best query plan for executing a user query.
Source: https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_compute_stats.html