Aggregate (data warehouse): Difference between revisions

Content deleted Content added

Inline

Latest revision as of 09:57, 1 February 2024

An aggregate is a type of summary used in dimensional models of data warehouses to shorten the time it takes to provide answers to typical queries on large sets of data. The reason why aggregates can make such a dramatic increase in the performance of a data warehouse is the reduction of the number of rows to be accessed when responding to a query.^[1]

Use

In its simplest form, an aggregate is a simple summary table that can be derived by performing a Group by SQL query. A more common use of aggregates is to take a dimension and change its granularity. When changing the granularity of the dimension the fact table has to be partially summarized to fit the new grain of the new dimension, thus creating new dimensional and fact tables, to fit this new level of grain.

Design

Aggregates are sometimes referred to as pre-calculated summary data, since aggregations are usually precomputed, partially summarized data, that are stored in new aggregated tables. When facts are aggregated, it is either done by eliminating dimensionality or by associating the facts with a rolled up dimension. Rolled up dimensions should be shrunken versions of the dimensions associated with the granular base facts. This way, the aggregated dimension tables should conform to the base dimension tables.^[2]

Performance

In 1996, Ralph Kimball, who is widely regarded as one of the original architects of data warehousing, stated:^[3]

The single most dramatic way to affect performance in a large data warehouse is to provide a proper set of aggregate (summary) records that coexist with the primary base records. Aggregates can have a very significant effect on performance, in some cases speeding queries by a factor of one hundred or even one thousand. No other means exist to harvest such spectacular gains.

Complexity

Having aggregates and atomic data increases the complexity of the dimensional model. This complexity should be transparent to the users of the data warehouse, thus when a request is made, the data warehouse should return data from the table with the correct grain. So when requests to the data warehouse are made, aggregate navigator functionality should be implemented, to help determine the correct table with the correct grain. The number of possible aggregations is determined by every possible combination of dimension granularities. Since it would produce a lot of overhead to build all possible aggregations, it is a good idea to choose a subset of tables on which to make aggregations. The best way to choose this subset and decide which aggregations to build is to monitor queries and design aggregations to match query patterns.^[4]

Aggregate navigation

Having aggregate data in the dimensional model makes the environment more complex. To make this extra complexity transparent to the user, functionality known as aggregate navigation is used to query the dimensional and fact tables with the correct grain level. The aggregate navigation essentially examines the query to see if it can be answered using a smaller, aggregate table.^[5]

Implementations of aggregate navigators can be found in a range of technologies:

OLAP engines
Materialized views
Relational OLAP (ROLAP) services
BI application servers or query tools

It is generally recommended to use either of the first three technologies, since the benefits in the latter case is restricted to a single front end BI tool^[6]

Challenges

Since dimensional models only gain from aggregates on large data sets, it should be considered at what size of the data sets one should start using aggregates
One can also ask oneself if a data warehouse always handles data sets that are too large for direct queries, or if it sometimes is a good idea to omit the aggregate tables when starting a new data warehouse project
An open question is whether omitting aggregates in the first iteration of building a new data warehouse will make the structure of the dimensional model simpler

References

^ Christopher Adamson, Mastering Data Warehouse Aggregates: Solutions for Star Schema Performance, Wiley Publishing, Inc., 2006 ISBN 978-0-471-77709-0, Page 23
^ Ralph Kimball; Margy Ross (2002). The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second ed.). Wiley Computer Publishing. p. 356. ISBN 0-471-20024-7.
^ "Aggregate Navigation With (Almost) No Metadata". 1995-08-15. Archived from the original on 2010-12-11. Retrieved 2010-11-22.
^ Kimball & Data Warehouse Toolkit, p. 355.
^ Kimball & Data Warehouse Toolkit, p. 137.
^ Kimball & Data Warehouse Toolkit, p. 354.

[1] Christopher Adamson, Mastering Data Warehouse Aggregates: Solutions for Star Schema Performance, Wiley Publishing, Inc., 2006 ISBN 978-0-471-77709-0, Page 23

[2] Ralph Kimball; Margy Ross (2002). The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second ed.). Wiley Computer Publishing. p. 356. ISBN 0-471-20024-7.

[3] "Aggregate Navigation With (Almost) No Metadata". 1995-08-15. Archived from the original on 2010-12-11. Retrieved 2010-11-22.

[FOOTNOTEKimballData_Warehouse_Toolkit355-4] Kimball & Data Warehouse Toolkit, p. 355.

[FOOTNOTEKimballData_Warehouse_Toolkit137-5] Kimball & Data Warehouse Toolkit, p. 137.

[FOOTNOTEKimballData_Warehouse_Toolkit354-6] Kimball & Data Warehouse Toolkit, p. 354.

[1]

[2]

[3]

[4]

[5]

[6]

@@ Line 1: / Line 1: @@
+{{Short description|Cached summaries to speed up queries}}
-Aggregates are used in [[Dimensional_modeling|dimensional models]] of the [[Data_warehouse|data warehouse]] to produce dramatic positive effects on the time it takes to query large sets of data. At the simplest form an aggregate is a simple summary table that can be derived by performing a ''Group by'' SQL query. A more common use of aggregates is to take a [[Dimension_(data_warehouse)|dimension]] and change the granularity of this dimension. When changing the granularity of the dimension the [[Fact_(data_warehouse)|fact]] table has to be partially summarized to fit the new [[Granularity|grain]] of the new [[Dimension_(data_warehouse)|dimension]], thus creating new [[Dimension_(data_warehouse)|dimensional]] and [[Fact_(data_warehouse)|fact]] tables, fitting this new level of [[Granularity|grain]]. Aggregates are sometimes referred to as pre-calculated summary data, since aggregations are usually precomputed, partially summarized data, that are stored in new aggregated tables. So the reason why aggregates can make such an dramatic increase in the performance of the [[Data_warehouse|data warehouse]] is the reduction of the number of rows to be accessed when responding to a query.
+[[File:Data_warehouse_architecture.jpg|thumb|360px|Example of a basic [[Data architecture|architecture]] of a [[data warehouse]]]]
+An '''aggregate''' is a type of summary used in [[Dimensional modeling|dimensional models]] of [[data warehouse]]s to shorten the time it takes to provide answers to typical queries on large sets of [[data (computing)|data]]. The reason why aggregates can make such a dramatic increase in the performance of a data warehouse is the reduction of the number of rows to be accessed when responding to a query.<ref>Christopher Adamson, ''Mastering Data Warehouse Aggregates: Solutions for Star Schema Performance'', Wiley Publishing, Inc., 2006 {{ISBN|978-0-471-77709-0}}, Page 23</ref>
+== Use ==
-[[Ralph Kimball|Kimball]] which is widely regarded as one of the original architects of data warehousing says<ref>{{Cite web|url=http://www.rkimball.com/html/articles_search/articles1996/9608d54.html|title=Aggregate Navigation With (Almost) No Metadata|date=1995-08-15|accessdate=2010-11-22}}</ref>:
+In its simplest form, an '''aggregate''' is a simple summary table that can be derived by performing a ''[[Group by]]'' SQL query. A more common use of aggregates is to take a [[Dimension (data warehouse)|dimension]] and change its granularity. When changing the [[granularity]] of the dimension the [[Fact (data warehouse)|fact]] table has to be partially summarized to fit the new [[Data grain|grain]] of the new [[Dimension (data warehouse)|dimension]], thus creating new [[Dimension (data warehouse)|dimensional]] and fact tables, to fit this new level of grain.
-<blockquote>''The single most dramatic way to affect performance in a large [[Data_warehouse|data warehouse]] is to provide a proper set of aggregate (summary) records that coexist with the primary base records. Aggregates can have a very significant effect on performance, in some cases speeding queries by a factor of one hundred or even one thousand. No other means exist to harvest such spectacular gains.''</blockquote>
+== Design ==
-Having aggregates and atomic data increases the complexity of the [[Dimensional_modeling|dimensional model]]. This complexity should be transparent to the users of the [[Data_warehouse|data warehouse]], thus when a request is made, the [[Data_warehouse|data warehouse]] should return data from the table with the correct [[Granularity|grain]]. So when requests to the [[Data_warehouse|data warehouse]] are made, aggregate navigator functionality should be implemented, to help determine the correct table with the correct [[Granularity|grain]].
+Aggregates are sometimes referred to as pre-calculated summary data, since aggregations are usually precomputed, partially summarized data, that are stored in new aggregated tables. When facts are aggregated, it is either done by eliminating dimensionality or by associating the facts with a rolled up dimension. Rolled up dimensions should be shrunken versions of the dimensions associated with the granular base facts. This way, the aggregated dimension tables should conform to the base dimension tables.<ref>{{Cite book |author=Ralph Kimball |author2= Margy Ross |title=The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling |edition= Second |publisher=Wiley Computer Publishing |year=2002| isbn=0-471-20024-7|ref={{harvid|Kimball|Data Warehouse Toolkit}} |p=356}}</ref>
-The number of possible aggregations is determined by every possible combination of [[Dimension_(data_warehouse)|dimension]] granularities. Since it would produce a lot of overhead to build all possible aggregations, it is a good idea to choose a subset of tables on which to make aggregations. The best way to choose this subset and decide which aggregations to build is to monitor queries and design aggregations to match query patterns.<ref>Kimball  2008, pg. 355</ref>
+== Performance ==
-===Aggregate navigator===
+In 1996, [[Ralph Kimball]], who is widely regarded as one of the original architects of data warehousing, stated:<ref>{{Cite web|url=http://www.rkimball.com/html/articles_search/articles1996/9608d54.html |title=Aggregate Navigation With (Almost) No Metadata |date=1995-08-15 |accessdate=2010-11-22 |url-status=dead |archiveurl=https://web.archive.org/web/20101211114831/http://www.rkimball.com/html/articles_search/articles1996/9608d54.html |archivedate=2010-12-11 }}</ref>
-Having aggregate data in the [[Dimensional_modeling|dimensional model]] makes the environment more complex. To make this extra complexity transparent to the user, functionality known as aggregate navigation is used to query the [[Dimension_(data_warehouse)|dimensional]] and [[Fact_(data_warehouse)|fact]] tables with the correct [[Granularity|grain]] level. The aggregate navigation essentially examines the query to see if it can be answered using a smaller, aggregate table.<ref name="kimball137">Kimball  2008, pg. 137</ref>
+<blockquote>''The single most dramatic way to affect performance in a large data warehouse is to provide a proper set of aggregate (summary) records that coexist with the primary base records. Aggregates can have a very significant effect on performance, in some cases speeding queries by a factor of one hundred or even one thousand. No other means exist to harvest such spectacular gains.''</blockquote>
+== Complexity ==
+Having aggregates and atomic data increases the complexity of the dimensional model. This complexity should be transparent to the users of the data warehouse, thus when a request is made, the data warehouse should return data from the table with the correct grain. So when requests to the data warehouse are made, aggregate navigator functionality should be implemented, to help determine the correct table with the correct grain.
+The number of possible aggregations is determined by every possible combination of dimension granularities. Since it would produce a lot of overhead to build all possible aggregations, it is a good idea to choose a subset of tables on which to make aggregations. The best way to choose this subset and decide which aggregations to build is to monitor queries and design aggregations to match query patterns.{{sfn|Kimball|Data Warehouse Toolkit |p=355}}
+== Aggregate navigation ==
+Having aggregate data in the dimensional model makes the environment more complex. To make this extra complexity transparent to the user, functionality known as aggregate navigation is used to query the dimensional and fact tables with the correct grain level. The aggregate navigation essentially examines the query to see if it can be answered using a smaller, aggregate table.{{sfn|Kimball|Data Warehouse Toolkit |p=137}}
 Implementations of aggregate navigators can be found in a range of technologies:
-*OLAP engines
+*[[Online analytical processing|OLAP]] engines
-*Materialized views
+*[[Materialized view]]s
-*Relational OLAP (ROLAP) services
+*Relational OLAP ([[ROLAP]]) services
-*BI application servers or query tools
+*[[Business intelligence|BI]] application servers or query tools
-It is generally recommended to use either of the first three technologies, since the benefits in the latter case is restricted to a single front end BI tool<ref name="kimball354">Kimball  2008, pg. 354</ref>
+It is generally recommended to use either of the first three technologies, since the benefits in the latter case is restricted to a single front end [[Business intelligence|BI]] tool{{sfn|Kimball|Data Warehouse Toolkit |p=354}}
+==Challenges==
+*Since dimensional models only gain from aggregates on large data sets, it should be considered at what size of the data sets one should start using aggregates
+*One can also ask oneself if a data warehouse always handles data sets that are too large for direct queries, or if it sometimes is a good idea to omit the aggregate tables when starting a new data warehouse project
+* An open question is whether omitting aggregates in the first iteration of building a new data warehouse will make the structure of the dimensional model simpler
 ==References==
-<!--<nowiki>
-See http://en.wikipedia.org/wiki/Wikipedia:Footnotes for an explanation of how to generate footnotes using the <ref> and </ref> tags, and the template below.
-</nowiki>-->
 {{reflist}}
-{{DEFAULTSORT:Aggregates (data warehouse)}}
-[[Category:Data management]]
-[[Category:Data warehousing]]
 {{Data warehouse}}
+[[Category:Data warehousing]]
-{{database-stub}}