Relational algebra

If you worked with SQL in different databases, you might’ve noticed slight differences or even large, annoying discrepancies. The reason is that SQL isn’t a formal, strict language, but rather an approach based on a general formal language called Relational algebra. Developers of different databases design their own versions of SQL and implement and modify its features at their discretion.

Relational algebra was created by Edgar F. Codd in 1960s-1970s, who worked at IBM at the time. Codd proposed this formal language as a basis for database querying.

The good news is that knowing set theory and relational algebra reduces potential SQL-related problems to basically documentation lookup. If you know how unions work, what’s the idea behind a Cartesian product and what are expressions of relational algebra, then it comes down to finding the correct SQL syntax to solve the problem at hand. We will finish up the section on set theory with a short overview of different aspects of relational algebra and their relation to SQL.

Relation

Relation is essentially a table, with columns (called “attributes”) and rows. For example, a relation User may be used to describe users in a web service.

User(id, username, email)
id username email
1 jenn jnc@hotmail.com
2 pax paxro@aol.com
91 thankso bl@wh.gov

For demonstrating purposes, let’s define another table:

Admin(id, department, level)
id department level
1 engineering 3
2 medicine 2
91 mathematics 3

The name of a relation can play a role of the simplest query:

User

which returns the whole table.

Projection

A projection is an operation of extracting records with specific columns (attributes). It is denoted by Greek letter \Pi (uppercase pi), with a list of columns as subscript, followed by relation’s name. For example, here we extract a sub-relation of User with ids and emails only:

\Pi_{\textrm{id, email}} \textrm{ User}

id email
1 jnc@hotmail.com
2 paxro@aol.com
91 bl@wh.gov

Select

A selection is an operation of filtering records by values. It is denoted by Greek letter \sigma (sigma) with a condition, followed by relation’s name. For example, here we get Admins with level higher than 2:

\sigma_{\textrm{level} > 2} \textrm{ Admin}

id department level
1 engineering 3
91 mathematics 3

Operating on expressions

Select and project operators work on any expression, not only on whole relations. This means we can combine them arbitrarily. For example, we can select (essentially, filter) first, and then extract certain columns.

\Pi_{\textrm{id, level}} (\sigma_{\textrm{level} > 2} \textrm{ Admin})

The result is a newly constructed relation consisting of only two columns, which, in turn, can be used for other operations:

id level
1 3
91 3

Cross product (cartesian product)

Recall the idea behind Cartesian product in set theory: given two sets A = \{a, b\} and B = \{1, 2\}, the Cartesian product is a set of pairs of all combinations of elements:

(a, 1), (a, 2), (b, 1), (b, 2)

Much of the power of relational algebra comes from applying this idea to relations. It is also often called cross product. Here’s the cross-product of User and Admin:

\textrm{User} \times \textrm{Admin}

User.id username email Admin.id department level
1 jenn jnc@hotmail.com 1 engineering 3
1 jenn jnc@hotmail.com 2 medicine 2
1 jenn jnc@hotmail.com 91 mathematics 3
2 pax paxro@aol.com 1 engineering 3
2 pax paxro@aol.com 2 medicine 2
2 pax paxro@aol.com 91 mathematics 3
91 thankso bl@wh.gov 1 engineering 3
91 thankso bl@wh.gov 2 medicine 2
91 thankso bl@wh.gov 91 mathematics 3

Since both relations contain a column named id, the resulting relation contains two distinct columns with tagged names User.id and Admin.id. This is similar to the way we “tagged” items in set union. This might not seem too useful: we just combined all records, and many of the new rows don’t make sense. Cross-product is rarely useful as is. Instead, the goal is often to generate raw data for the operations. For example, with this long table at hand, we can first eliminate the nonsense rows by applying select:

\sigma_{\textrm{User.id} = \textrm{Admin.id}} (\textrm{User} \times \textrm{Admin})

User.id username email Admin.id department level
1 jenn jnc@hotmail.com 1 engineering 3
2 pax paxro@aol.com 2 medicine 2
91 thankso bl@wh.gov 91 mathematics 3

Assuming user ids in the system are used to identify Admins as well, we now have a sensible relation of Admins with their complete user info intact. Now we can filter Admins by level and get rid of unneeded columns:

\Pi_{\textrm{User.id, email, department}} \big( \sigma_{\textrm{User.id} = \textrm{Admin.id}, \textrm{level} > 2} (\textrm{User} \times \textrm{Admin}) \big)

User.id email department
1 jnc@hotmail.com engineering
91 bl@wh.gov mathematics

Natural join

The process of performing a cross-product and then filtering out rows that “make sense” by comparing the attributes of the same name is common enough so that relational algebra has a special operator called natural join. It is denoted by \bowtie (bow tie). The result is a relation with matching rows and no “tagged” attributes:

\textrm{User} \bowtie \textrm{Admin}

id username email department level
1 jenn jnc@hotmail.com engineering 3
2 pax paxro@aol.com medicine 2
91 thankso bl@wh.gov mathematics 3

As you see, natural join is basically a syntactic sugar on top of existing features of relational algebra.

Union, intersection, difference

Relational algebra includes three operators straight from set theory: union, intersection and difference. They work just like you might expect, although there are some caveats. We will not focus on these topics at the moment.