Become a SQLite expert

I learned that not all indexes are good for speeding up database searches, especially when a column has few unique values like names that are all the same. The concept of selectivity, which is the percentage of unique values compared to the total rows, and cardinality, which is the count of unique values, are important in deciding if a column is fit for an index. For the most effective indexes, I should focus on columns with high cardinality and selectivity, as these will help databases like SQLite work faster.

Unfortunately, not all indexes are created equal. Some columns are just not good candidates for indexing. Even if you are using a strict column. So when that it's a bad idea is that the index doesn't help you narrow anything down. Imagine that this users table was full of people named Aaron.

The every single person that is 1 of our users is named Aaron. I would fit in well there but that would be a weird group of people. Now, if you put an index on first name and then you say show me people where first name equals Aaron, who is that helping? It's the entire table. There's only 1 value in the table, and it's Aaron.

The index doesn't help. So what we need to talk about now is cardinality and selectivity. So cardinality refers to the number of unique values in that table as an absolute. Selectivity refers to that as a ratio. So if you have 2 unique values, that that might be good if you have 2 rows.

But if you have 2 unique values and 1, 000, 000 rows, cardinality is 2, but the selectivity is nearing 0. So let me show you how we can figure this out. We're gonna start by calculating the cardinality of the birthday column. Now, cardinality, it's a good word to know. It comes up in the documentation.

It comes up on Q and A. You can use it to talk to chat gpt about. But, it's okay if you don't remember cardinality. What we're trying to figure out here is is this column a good candidate for indexing and it it it's not a good candidate if it doesn't have unique values. So cardinality.

So let's count distinct birthday from users. That's the cardinality. There are 10, 950 unique birthdays in this table. That's honestly not very many. I guess that's the whole birthday problem.

Right? Once you have a bunch of people, the likelihood of a birthday collision goes way up. This is the cardinality. So let's do another 1. Let's do select count distinct, is pro from users.

There are 2 distinct values there. Yes and no. It's a boolean column so there are only 2 distinct values. So the cardinality here is very very low. Now, how do we figure out selectivity?

Or rather, what even is selectivity? Selectivity is the number of distinct values over the total number of rows which turns it into a percentage which is a lot more helpful. So if we do select countdistinctbirthday and then we we're gonna coerce that into a float. So then, we will wrap this whole guy up. So we'll do select count distinct birthday divided by count star.

And then, let's go ahead and just wrap this whole guy up too. We'll wrap that guy up and then we'll say, as selectivity from users. So there we go. Let's turn the widths off because the the word wrap is messing me up here. There we go.

Selectivity is point 1. Is that good or is that bad? I could tell you. I could tell you, but I don't like memorizing things. I don't like memorizing things.

So if we were to use our powerful brains and think for a second, what is the most selective column possible? Well, that would be the ID because the ID is a unique key. It's the primary key. We could use email because email is unique, I think, on this table too. But instead of teaching you or instead of making you memorize if a low selectivity or high selectivity is good, Let's just see if we can intuit it together.

So if we come all the way down here and we'll say, alright. I know that ID is the most selective thing possible, so it's gonna be very good. What is ID? ID is a selectivity of 1.0. That makes perfect sense to me because ID is distinct across the entire table.

So when you put a 1000000 ID's over a 1000000 rows, you get 1. So that tells me that point 1 is honestly not that good. So this is a good way to look for candidates for indexing. I don't like memorizing things. I like intuiting things because I'm more likely to actually remember them if if I intuit it.

So we know that this birthday column is worse selectivity than the ID column. Let's run a few experiments with some of these other columns and see if SQLite agrees with this, with this theory that we're working with here. So let us create an index. We'll do create index deleted at on users deleted at. And, yep, that should be good.

And let's do a quick, select count select count. Yeah. We'll do select count star from users where deleted at is not null. Let's do that. So we see that out of a 1000000 rows, 10, 000 of them have been deleted.

So these are, we can call these tombstoned. We can call them soft deleted. We can call them, awaiting clean up. That's insane. Let's just call them soft deleted.

So we have 10, 000 rows that are soft deleted. So if we were to do something like, what do we have now? We've got 2 indexes. We have birthday and deleted at users, email. Yeah.

That's fine. We're not touching that right now. So if we do select star from users, yep. That's fine. Select star from users where birthday equals 19890214 and deleted at is not null.

This is an exercise for you. So just listen to my voice and think with your brain what is going to happen here. We're selecting from the users table. Some we're selecting all the people that were born on February 14, 1989 and they have not been soft deleted. We'll go ahead and limit this down to 5.

We have indexes on both birthday and deleted at. And so SQLite is now given a choice, which index do you want to use. In most cases it's gonna use 1 index only. There are some optimizations that allow like some sort of index merge to happen which is pretty cool but usually it's just gonna pick 1 index and usually we want to create indexes that are pickable. So having listened to all of that, is it gonna choose birthday or delete it at?

Let's take a look. It is using birthday. So because, presumably, we don't know we don't exactly know but because birthday is more selective than deleted at is not null, it's saying, alright. I'm gonna choose all of the people that were born on that day using the birthday index. I'm gonna take those row ids back to the main table and then grab all of those row I d's and then further filter out anyone who has been soft deleted rather than going to the deleted at index, grabbing fully 99.9% of the table and then going back and saying, alright.

Well now I gotta filter out people who are not born on the right day. So whichever whichever index it can use to filter out, most or to rather select most of the correct rows, it's gonna choose that. Go back to the main table and then weed out any that don't match any other conditions. So this is a way that selectivity, cardinality all play in together. Sometimes it can be great to put an index on a boolean column even though there are only 2 values and it's very very low selectivity.

If if the query that you're using is searching for those very, very few people, right. So if if you have 99.9% of the table is, is guest users or whatever and like 0.1% is let's say admin users, putting an index on that column can be great if you're searching for the little bitty segment of the table. But if you're searching for the giant segment of the table, SQLite's gonna look at it and go man it's probably faster if I just scan the table and it might ignore the index altogether. So rule of thumb, put indexes on columns that have, a lot of distinct values, have a high cardinality and a high selectivity and that is going to make SQLite more efficient and more effective when it uses that index.

Full Course

SQLite for Production

Summary

Video Transcript