Relational algebra: Difference between revisions

Content deleted Content added

Inline

Revision as of 10:20, 5 November 2005

The relational algebra is a set of operations that manipulate relations as they are defined in the relational model and as such describes part of the data manipulation aspect of this data model. Because of their algebraic properties these operations are often used in database query optimization as an intermediate representation of a query to which certain rewrite rules can be applied to obtain a more efficient version of the query.

The exact set of operations may differ per definition and also depends on whether the unlabeled relational model (that uses mathematical relations) or the labeled relational model (that uses the labeled generalization of mathematical relations) is used. We will assume the labeled case here as this is the most common way the relational model is defined. That means that we assume that tuples are partial functions from attribute names to values. The attribute a of a tuple t is denoted in this article as t(a).

Operations

The six basic operations of the algebra are the selection, the projection the Cartesian product (also called the cross product or cross join), the set union, the set difference, and the rename. These six operations are fundamental in the sense that all other operations can be expressed as combinations of these operations and that none of these six operations can be omitted without losing expressive power. Other operations can be defined in terms of the six basic operations, among the most important are set intersection, division, and the natural join. In some cases these three operations can be considered to be a "basic" operation either because of unique algebraic properties, ease of implementation in a practical environment, or the frequency of their use.

Altogether, the operations of relational algebra have identical expressive power to that of domain relational calculus or tuple relational calculus, and similar expressive power to that of first-order predicate calculus.

Set operations

Although three of the six basic operations are taken from set theory, there are additional constraints that are present in their relational algebra counterparts: for set union and set difference, the two relations involved must be union-compatible - that is, the two relations must have the same set of attributes. As set intersection can be defined in terms of set difference, the two relations involved in set intersection must also be union-compatible.

The Cartesian product differs from the one defined in set theory in the sense that tuples are considered to be shallow tuples for this purpose - this distinction is necessary to ensure that the Cartesian product of two relations remains a relation. Unlike in set theory, where the Cartesian product of a n-tuple by an m-tuple is a set of 2-tuples, the Cartesian product in relational algebra has the 2-tuple "flattened" into an n+m-tuple. More formally, R × S is defined as follows:

R × S = {r ∪ s| r ∈ R, s ∈ S}

In addition, for the Cartesian product to be defined, the two relations involved must have disjoint headers - that is, they must not have a common attribute name.

The natural join

The natural join is a binary operation that is written as R ⋈ S where R and S are relations. The result of the natural join is the set of all combinations of tuples in R and S that are equal on their common attribute names. For an example consider the tables Employee and Dept and their natural join:

*Employee*
Name	EmpId	DeptName
Harry	3415	Finance
Sally	2241	Sales
George	3401	Finance
Harriet	2202	Sales

*Dept*
DeptName	Manager
Finance	George
Sales	Harriet
Production	Charles

*Employee* ⋈ *Dept*
Name	EmpId	DeptName	Manager
Harry	3415	Finance	George
Sally	2241	Sales	Harriet
George	3401	Finance	George
Harriet	2202	Sales	Harriet

The natural join is arguably one of the most important operators since it allows the combination of relations that are associated by a foreign key. For example, in the above example there holds probably a foreign key from Employee.DeptName to Dept.DeptName and then the natural join of Employee and Dept combines all employees with their departments. Note that this works because the foreign key holds between columns with the same name. If this is not the case such as in the foreign key from Dept.manager to Emp.emp-number then we have to rename these columns before we take the natural join. Such a join is sometimes also referred to as an equijoin (see θ-join).

More formally the semantics of the natural join is defined as follows:

R ⋈ S = { t ∪ s : t ∈ R, s ∈ S, fun(t ∪ s) }

where fun(r) is a predicate that is true for a binary relation r iff r is a functional binary relation. It is usually required that R and S must have at least one common attribute, but if this constraint is omitted then in that special case the natural join becomes exactly the Cartesian product as defined above.

The natural join can be simulated with the basic operations as follows. Assume that a₁,...,a_n are the attribute names unique to R, b₁,...,b_m are the attribute names common to R and S and c₁,...,c_m are the attribute unique to S. Furthermore assume that the attribute names d₁,...,d_m are neither in R nor in S. In a first step we can now rename the common attribute names in S: : S' := ρ_d₁/b₁(...ρ_{d_m/b_m}( S)...) Then we take the Cartesian product and select the tuples that are to be joined: : T := σ_b₁=d₁(...σ_{b_m=d_m}(R × S')...) Finally we take a projection to get rid of the renamed attributes: : U := π_{a₁,...,a_n,b₁,...,b_m,c₁,...,c_m}(T)

The division

The div $Insertformulahere$ unique to R, i.e., in the header of R but not in the header of S, for which it holds that all their combinations with tuples in S are present in R. For an example see the tables CompletedTask, DBProject and their division:

*Completed*
Student	Task
Fred	Database1
Fred	Database2
Fred	Compiler1
Eugene	Database1
Eugene	Compiler1
Sara	Database1
Sara	Database2

*DBProject*
Task
Database1
Database2

*Completed* ÷ *DBProject*
Student
Fred
Sara

If DBProject contains all the tasks of the Database project then the result of the division above contains exactly all the students that have completed the Database project.

More formally the semantics of the division is defined as follows:

R ÷ S = { t[a₁,...,a_n] : ∀ s ∈ S ( (t[a₁,...,a_n] ∪ s) ∈ R) }

where {a₁,...,a_n} is the set of attribute names unique to R and t[a₁,...,a_n] is the restriction of t to this set. It is usually required that the attribute names in the header of S are a subset of those of R because otherwise the result of the operation will always be empty.

The simulation of the division with the basic operations is as follows. We assume that a₁,...,a_n are the attribute names unique to R and b₁,...,b_m are the attribute names of S. In the first step we project R on its unique attribute names and construction all combinations with tuples in S:

T := π_{a₁,...,a_n}(R) × S

In the next step we subtract R from this relation:

U := T - R

Note that in U we have the combinations that "should have been in R but weren't". So if we now take the projection on the attribute names unique to R then we have the restrictions of the tuples in R for which not all combinations with tuples in S were present in R:

V := π_{a₁,...,a_n}(U)

So what remains to be done is take the projection of R on its unique attribute names and subtract those in V:

W := π_{a₁,...,a_n}(R) - V

The generalized selection

The θ-join and equijoin

If we want to combine tuples from two relations where the combination condition is not simply the equality of shared columns then it is convenient to have a more general form of join operator, which is the θ-join. The θ-join is a binary operation that is written as R ⋈_aθb S or R ⋈_aθv S where a and b are attribute names, θ a binary operation in the set {<, ≤, =, >, ≥}, v is a value constant and R and S are relations. The result of this operation consists of all combinations of tuples in R and S that satisfy the equation. The result of the θ-join is only defined if the headers of S and R are disjoint, i.e., do not contain a common attribute.

The simulation of this operation in the fundamental operations is therefore as follows:

R ⋈_φ S = σ_φ(R × S)

In case the operator θ is the equality operator (=) then this join is also called an equijoin.

The semijoin

The semijoin is joining similar to the natural join and written as R ⋉ S where R and S are relations. Whereas the result of the semijoin is only the set of all tuples in R for which there is a tuple in S that is equal on their common attribute names. For an example consider the tables Employee and Dept and their semi join:

*Employee*
Name	EmpId	DeptName
Harry	3415	Finance
Sally	2241	Sales
George	3401	Finance
Harriet	2202	Sales

*Dept*
DeptName	Manager
Sales	Harriet
Production	Charles

*Employee* ⋉ *Dept*
Name	EmpId	DeptName
Sally	2241	Sales
Harriet	2202	Sales

More formally the semantics of the semijoin is defined as follows:

R ⋉ S = { t : t ∈ R, s ∈ S, fun(t ∪ s) }

where fun(r) is as in the definition of natural join.

The semijoin can be simulated using the natural join as follows. Assume that a₁,...,a_n are the attribute names of R, then it holds that:

R ⋉ S = π_{a₁,..,a_n}(R ⋈ S)

Since we can simulate the natural join with the basic operators it follows that this also holds for the semijoin.

Operations for null values

The outer join is a binary operation that combines tuples from two relations. There are three types of outer joins.

Left outer join

The left outer join is written as R =X S where R and S are relations. The result of the left outer join is the set of all combinations of tuples in R and S that are equal on their common attribute names, in addition to tuples in R that have no matching tuples in S.

For an consider the tables Employee and Dept and their left outer join:

*Employee*
Name	EmpId	DeptName
Harry	3415	Finance
Sally	2241	Sales
George	3401	Finance
Harriet	2202	Sales
Tim	1123	Executive

*Dept*
DeptName	Manager
Sales	Harriet
Production	Charles

*Employee* =X *Dept*
Name	EmpId	DeptName	Manager
Harry	3415	Finance	ω
Sally	2241	Sales	Harriet
George	3401	Finance	ω
Harriet	2202	Sales	Harriet
Tim	1123	Executive	ω

In the resulting relation, tuples in S which have no common values in common attribute names with tuples in R take a null value, ω.

Since there are no tuples in Dept with a DeptName of Finance or Executive, ωs occur in the resulting relation where tuples in DeptName have tuples of Finance or Executive.

...(maths)...

The left outer join can be simulated using the natural join and set union as follows:

R =X S = R ∪ (R ⋈ S)

Right outer join

The right outer join behaves almost identically to the left outer join, above, with the exception that all the values from the "other" relation appear in the resulting relation.

The right outer join is written as R X= S where R and S are relations. The result of the right outer join is the set of all combinations of tuples in R and S that are equal on their common attribute names, in addition to tuples in S that have no matching tuples in R.

For an consider the tables Employee and Dept and their right outer join:

*Employee*
Name	EmpId	DeptName
Harry	3415	Finance
Sally	2241	Sales
George	3401	Finance
Harriet	2202	Sales
Tim	1123	Executive

*Dept*
DeptName	Manager
Sales	Harriet
Production	Charles

*Employee* X= *Dept*
Name	EmpId	DeptName	Manager
Sally	2241	Sales	Harriet
Harriet	2202	Sales	Harriet
ω	ω	Production	Charles

In the resulting relation, tuples in R which have no common values in common attribute names with tuples in S take a null value, ω.

Since there are no tuples in Employee with a DeptName of Production, ωs occur in the Name attribute of the resulting relation where tuples in DeptName had tuples of Production.

...(maths)...

The right outer join can be simulated using the natural join and set union as follows:

RX=S = S ∪ (R ⋈ S)

Outer join

The outer join or full outer join in effect combines the results of the left and right outer joins.

The full outer join is written as R =X= S where R and S are relations. The result of the full outer join is the set of all combinations of tuples in R and S that are equal on their common attribute names, in addition to tuples in S that have no matching tuples in R and tuples in R that have no matching tuples in S in their common attribute names.

For an consider the tables Employee and Dept and their full outer join:

*Employee*
Name	EmpId	DeptName
Harry	3415	Finance
Sally	2241	Sales
George	3401	Finance
Harriet	2202	Sales
Tim	1123	Executive

*Dept*
DeptName	Manager
Sales	Harriet
Production	Charles

*Employee* =X= *Dept*
Name	EmpId	DeptName	Manager
Harry	3415	Finance	ω
Sally	2241	Sales	Harriet
George	3401	Finance	ω
Harriet	2202	Sales	Harriet
Tim	1123	Executive	ω
ω	ω	Production	Charles

In the resulting relation, tuples in R which have no common values in common attribute names with tuples in S take a null value, ω. Tuples in S which have no common values in common attribute names with tuples in R also take a null value, ω.

The full outer join can be simulated using the left and right outer joins (and hence the natural join and set union) as follows:

R=X=S = (R=XS) ∪ (RX=S) or R=X=S = R ∪ S ∪ (R ⋈ S)

Operations for domain computations

The aggregation operation

There are five aggregation functions that are included with most databases. These operations are Sum, Count, Average, Maximum and Minimum. In relational algebra, it is written as _{Exp1, Exp2, Exp3...}G_{func1, func2, func3...}(Relation). While one must specify the function to use, the expressions, however, are optional. Lets assume that we have a table named Account with two columns, namely Branch_Name and Balance defined, and we wish to find the branch name with the highest balance, we would write _{Branch_Name}G_Max(Balance)(Account). To find the highest balance, we could simply write G_Max(Balance)(Account).

The extend operation

Algebraic properties

Use of algebraic properties for query optimization

Queries can be represented as a tree, where

the internal nodes are operators,
leaves are relations,
subexpressions are subtrees.

Our primary goal is to transform expression trees into equivalent expression trees, where the average size of the relations yielded by subexpressions in the tree are smaller than they were before the optimization. Our secondary goal is to try to form common subexpressions within a single query, or if there are more than one queries being evaluated at the same time, in all of those queries. The rationale behind that second goal is that it is enough to compute common subexpressions once, and the results can be used in all queries that contain that subexpression.

Here we present a set of rules, that can be used in such transformations.

Selection

Rules about selection operators play the most important role in query optimization. Selection is an operator that very effectively decreases the number of rows in its operand, so if we manage to move the selections in an expression tree towards the leaves, the internal relations (yielded by subexpressions) will likely shrink.

Basic selection properties

$\sigma _{A}(R)=\sigma _{A}\sigma _{A}(R)$
$\sigma _{A}\sigma _{B}(R)=\sigma _{B}\sigma _{A}(R)$

Breaking up selections with complex conditions

The first two rules are used to split/merge consecutive selection operators. It is useful to merge in some cases, because only one operator needs to be evaluated instead of two. It also makes sense to split selections with complex conditions, because it may be possible to move the individual selection components where the whole selection can not be moved.

$\sigma _{A\land B}(R)=\sigma _{A}(\sigma _{B}(R))=\sigma _{B}(\sigma _{A}(R))$
$\sigma _{A\lor B}(R)=\sigma _{A}(R)\cup \sigma _{B}(R)$

Selection and cross product

Cross product is the costliest operator to evaluate. If the input relations have $N$ and $M$ rows, the result will contain $NM$ rows. Therefore it is very important to do our best to decrease the size of both operands before applying the crossproduct operator.

This can be effectively done, if the cross product is followed by a selection operator, e.g. $\sigma _{A}(R\times P)$ . Considering the definition of join, this is the most likely case. If the cross product is not followed by a selection operator, we can try to push down a selection from higher levels of the expression tree using the other selection rules.

In the above case we break up condition $A$ into conditions $B$ , $C$ and $D$ using the split rules about complex selection conditions, so that $A=B\land C\land D$ and $B$ only contains attributes from $R$ , $C$ contains attributes only from $P$ and $D$ contains the part of $A$ that contains attributes from both $R$ and $P$ . Note, that $B$ , $C$ or $D$ are possibly empty. Then the following holds:

$\sigma _{A}(R\times P)=\sigma _{B\land C\land D}(R\times P)=\sigma _{D}(\sigma _{B}(R)\times \sigma _{C}(P))$ .

Selection and set operators

The following three rules are used to push selection below set operations in the expression tree. Note, that in the setminus and the intersection operators it is possible to apply the selection operator to only one of the operands after the transformation. This can make sense in cases, where one of the operands is small, and the overhead of evaluating the selection operator outweighs the benefits of using a smaller relation as an operand.

$\sigma _{A}(R\setminus P)=\sigma _{A}(R)\setminus \sigma _{A}(P)=\sigma _{A}(R)\setminus P$
$\sigma _{A}(R\cup P)=\sigma _{A}(R)\cup \sigma _{A}(P)$
$\sigma _{A}(R\cap P)=\sigma _{A}(R)\cap \sigma _{A}(P)=\sigma _{A}(R)\cap P=R\cap \sigma _{A}(P)$

@@ Line 64: / Line 64: @@
 |}
-The natural join is arguably one of the most important operators since it allows the combination of relations that are associated by a [[foreign key]]. For example, in the above example there holds probably a foreign key from ''Employee''.''DeptName'' to ''Dept''.''DeptName'' and then the natural join of ''Employee'' and ''Dept'' combines all employees with their departments. Note that this works because the foreign key holds between columns with the same name. If this is not the case such as in the foreign key from ''Dept''.''manager'' to ''Emp''.''emp-number'' then we have to rename these columns before we take the natural join. Such a join is sometimes also referred to as an '''equijoin''' (see &theta;-join).
+The natural join is arguably one of the most important [[operator]]s since it allows the combination of relations that are associated by a [[foreign key]]. For example, in the above example there holds probably a foreign key from ''Employee''.''DeptName'' to ''Dept''.''DeptName'' and then the natural join of ''Employee'' and ''Dept'' combines all employees with their departments. Note that this works because the foreign key holds between columns with the same name. If this is not the case such as in the foreign key from ''Dept''.''manager'' to ''Emp''.''emp-number'' then we have to rename these columns before we take the natural join. Such a join is sometimes also referred to as an '''equijoin''' (see &theta;-join).
 More formally the semantics of the natural join is defined as follows:
@@ Line 70: / Line 70: @@
 : ''R'' ⋈ ''S'' = { ''t'' &cup; ''s'' : ''t'' &isin; ''R'', ''s'' &isin; ''S'', fun(''t'' &cup; ''s'') }
-where fun(''r'') is a predicate that is true for a [[binary relation]] ''r'' iff ''r'' is a functional binary relation. It is usually required that ''R'' and ''S'' must have at least one common attribute, but if this constraint is omitted then in that special case the natural join becomes exactly the Cartesian product as defined above.
+where fun(''r'') is a [[predicate]] that is [[true]] for a [[binary relation]] ''r'' iff ''r'' is a functional binary relation. It is usually required that ''R'' and ''S'' must have at least one common attribute, but if this constraint is omitted then in that special case the natural join becomes exactly the Cartesian product as defined above.
 The natural join can be simulated with the basic operations as follows. Assume that ''a<sub>1</sub>,...,a<sub>n</sub>'' are the