The Significance of Cardinality and Selectivity in Databases

Indexes are an important way of improving database performance, and virtually every database has them. Tempting as it is, it's usually not as simple as looking at a query and slapping an index on it. More often than not, you also have to tune existing queries and indices to squeeze out further performance. The benefit you get from an index depends on multiple factors, but two of the most crucial are cardinality and selectivity. These metrics largely determine how efficiently the database can retrieve and process data. In this blog post, we'll dive into the concepts of cardinality and selectivity, and explore how they can be used to make more informed decisions on indices.

Understanding Cardinality

First, cardinality. You might already be familiar with the word from set theory--and it certainly shares some characteristics--but in database theory, cardinality refers to the uniqueness of values in a column of a table. In simpler terms, it's the number of distinct values in a column. For example, in a table containing employee information, the "Department" column might have a low cardinality (e.g., only a few distinct department names), while the "Employee ID" column would have a high cardinality (each employee has a unique ID).
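
To make this concrete, here is a minimal sketch that measures cardinality with Python's built-in sqlite3 module. The employees table, its column names, the sample data, and the cardinality helper are all made up for illustration; on a real database you would run the same COUNT(DISTINCT ...) query against your own tables.

```python
import sqlite3

# Hypothetical in-memory table: 1,000 employees spread over 4 departments.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INTEGER, department TEXT)")
departments = ["Engineering", "Sales", "Support", "Finance"]
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(i, departments[i % len(departments)]) for i in range(1, 1001)],
)


def cardinality(conn, table, column):
    """Return the number of distinct values in a column."""
    # Identifiers are interpolated directly, so pass trusted names only.
    query = f"SELECT COUNT(DISTINCT {column}) FROM {table}"
    (count,) = conn.execute(query).fetchone()
    return count


dept_cardinality = cardinality(conn, "employees", "department")
id_cardinality = cardinality(conn, "employees", "employee_id")
print("department:", dept_cardinality)   # low cardinality
print("employee_id:", id_cardinality)    # high cardinality
```

With this sample data, the department column has 4 distinct values while employee_id has 1,000, one per row.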

Importance of Cardinality in Indices

Cardinality is a critical factor when deciding which columns to include in an index. Columns with high cardinality are excellent choices for indexing because a filter on them is more selective: each value matches only a few rows, so the index eliminates most of the table up front. For instance, if you frequently query employees by their unique ID, indexing the "Employee ID" column can significantly improve query performance.
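
As a rough illustration, the sketch below asks SQLite for its query plan before and after indexing a high-cardinality column. It assumes a made-up employees table and a hypothetical index name (idx_employees_id), and the query_plan helper is mine, not part of any library. The exact wording of the plan differs between SQLite versions, and other databases have their own EXPLAIN variants.

```python
import sqlite3

# Hypothetical in-memory table with 10,000 employees.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INTEGER, department TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(i, f"dept_{i % 4}") for i in range(1, 10001)],
)


def query_plan(conn, sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The detail text is the last column of every plan row.
    return [row[-1] for row in rows]


lookup = "SELECT * FROM employees WHERE employee_id = ?"

# Without an index, SQLite has to scan the whole table.
plan_before = query_plan(conn, lookup, (4242,))

# Index the high-cardinality column and ask again.
conn.execute("CREATE INDEX idx_employees_id ON employees (employee_id)")
plan_after = query_plan(conn, lookup, (4242,))

print("before:", plan_before)
print("after: ", plan_after)
```

After the index is created, the plan should mention idx_employees_id, which means the lookup no longer walks through every row.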

Selectivity

Selectivity, on the other hand, measures how well an index narrows down the rows of a table. Strictly speaking, it can be argued that it's more a property of the predicate supplied than of the index itself. Regardless, it's a vital criterion when deciding which columns to index. In its most common form, it is calculated as the ratio of the number of distinct values to the total number of rows in the table. In a sense, you can say that there are two selectivities:

Average selectivity: This is related to the cardinality. Let's say there are 20,000 distinct values for the column in question, with the total number of rows being 1,000,000. Then, the average selectivity of an index on that column is 20,000/1,000,000 = 0.02.
Selectivity of a specific value: This, as the name says, relates to that value itself. By definition, it is the number of rows with that value divided by the total number of rows. So, if a value occurs 1,000 times in the same 1,000,000-row table, its selectivity would be 1,000/1,000,000 = 0.001.
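
The two definitions above can be sketched in a few lines of plain Python. The function names and the synthetic column are assumptions made for illustration; the data is built to match the numbers in the text (1,000,000 rows, 20,000 distinct values, one value occurring 1,000 times).

```python
from collections import Counter


def average_selectivity(values):
    """Distinct values divided by total rows."""
    return len(set(values)) / len(values)


def value_selectivity(values, target):
    """Rows equal to `target` divided by total rows."""
    return Counter(values)[target] / len(values)


# Synthetic column: value 0 occurs 1,000 times, and the remaining
# 999,000 rows are spread over the 19,999 other values (1..19,999).
column = [0] * 1_000 + [1 + (i % 19_999) for i in range(999_000)]

avg = average_selectivity(column)        # 20,000 / 1,000,000
specific = value_selectivity(column, 0)  # 1,000 / 1,000,000
print(f"average selectivity: {avg:.3f}")
print(f"selectivity of value 0: {specific:.3f}")
```

Note that the two numbers read in opposite directions: an average selectivity close to 1 means the values are nearly unique, while a specific value's selectivity close to 0 means that value matches very few rows.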

Importance of Selectivity in Index Choice

As you can see, selectivity is a powerful metric. Average selectivity indicates the overall uniqueness of a column's values: the closer it is to 1, the more an index on that column helps the average (surprise, surprise) lookup. If instead you want to optimize a single query (say, one that fetches only Premium+ users), look at the selectivity of that specific value: the smaller the fraction of rows it matches, the more an index helps that query.
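
Here is a small sketch of why the two views can disagree. It assumes a hypothetical "plan" column with made-up tier names and row counts, heavily skewed toward one value. The average selectivity looks poor, yet one specific value still matches only a small slice of the table.

```python
from collections import Counter

# Hypothetical "plan" column for 10,000 users, skewed toward Free.
plans = ["Free"] * 9_000 + ["Premium"] * 900 + ["Premium+"] * 100

total_rows = len(plans)
counts = Counter(plans)

# Average selectivity: distinct values / total rows.
average_selectivity = len(counts) / total_rows

# Selectivity of each specific value: rows with that value / total rows.
per_value_selectivity = {plan: n / total_rows for plan, n in counts.items()}

print(f"average selectivity: {average_selectivity:.4f}")
for plan, fraction in per_value_selectivity.items():
    print(f"{plan}: matches {fraction:.2%} of rows")
```

With only three distinct values, the average says little in favour of an index, but a query for Premium+ touches just 1% of the rows, whereas a query for Free touches 90% of them.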

Conclusion

Cardinality and selectivity are key metrics for optimizing indices in your database. By understanding these concepts and applying them judiciously, you can significantly improve the performance of your queries. With this, hopefully, you should be able to make more informed decisions on which columns to index. Good luck!