Primary, secondary, and clustered indexes

I learned about primary, secondary, and clustered indexes in MySQL and how they affect data storage. In MySQL, the primary key is the same as the clustered index, organizing how data is stored on the disk, while secondary indexes point back to this main structure. In contrast, SQLite handles indexes differently, using a secret row ID as the clustered index, which changes how primary and secondary keys function, impacting performance depending on the setup.

Now that we know a little bit about B trees, we gotta talk about primary, secondary, and clustered indexes. Three types of indexes. Usually, we only hear about primary and secondary, but clustered index does play an important role here. I think it's gonna be most helpful if we talk about the MySQL model first, because surprisingly it is simpler than the SQLite model. And so until I tell you otherwise, what you're about to hear applies to MySQL and Postgres, and not SQLite.

So when you create a primary key, when you create a primary key, you're declaring this is the identity of this table. You're saying every row in this table must have a singular unique identity declared by that primary key. Normally, that is something like an ID column. That makes enough sense. Puts a unique constraint over it.

We're used to that. In MySQL, when you declare the primary key, you are simultaneously declaring what the clustered index is. Primary key and clustered index are the same in MySQL. So what is the clustered index? Well, if we take a look at our b tree again, you'll see that we have all of these I d's.

So we have I d's 1 through 10 down here at the bottom in the leaf nodes. The clustered index is the way that the data on the disk is actually arranged. So in this B tree, this is the clustered index. When you get down to the leaf nodes, the data that you find in those leaf nodes, that is the entire row. So this is how this is how your table actually exists.

So everything is an index. Indexes are indexes. Tables are indexes. It's all indexes. Your table is a b tree that is arranged by the clustered index which is historically or rather traditionally the primary key as well.

So when you look up a value by primary key it's going to come to this clustered index and it's going to traverse looking for that ID. And when it gets to the leaf node, all of the row data is contained there. That's the clustered index. The clustered index defines how your data is stored on disk. In MySQL, the primary key and the clustered index are the same thing.

So what does a secondary index do? If we were to put a secondary index on name, you'll see that this b tree exists. So we've seen this b tree before. It it arranges the names in order. So in the leaf nodes we've got Erin all the way through Virginia at the bottom.

So it arranges those leaf nodes in order and then we traverse the tree. But what do we find when we get to the bottom? What do we find when we get to a leaf node of a secondary index? What do you think we find? We don't find the row data.

We find a pointer back to the row in the clustered index. So if we were to traverse this tree and get all the way down and find the value or find the node Aaron, in this leaf node at the bottom we would find one thing. Well, we'd find 2 things. We'd find the name, Aaron, and then we would find the primary key is ID number 1. So you look up in a secondary index, you find you go all the way down to the leaf node, you find Erin with an ID of 1.

Then you have to jump back to the clustered index, which is your entire table, You jump back to the clustered index and then you go look up the rest of the data for that row. So then you start that tree traversal starting at the root node looking for ID number 1. You finally end up down in the leaf node. That is the entire row's data And so then, it grabs all of that and sends it back to the client. That's how it normally works.

That is how a primary key and a secondary key normally work. The primary key is the clustered index and it defines how the data is stored on disk. It is your table. The clustered index is your table. Now, in SQLite, the primary key functions as a secondary key because the clustered index is the secret row ID.

So, when you create a table in SQLite, it uses it uses that secret row ID as the clustering index. Now, if you declare an ID that becomes an alias for that row ID, then the 2 are the same. But, if you declare a primary key that is not an alias for that row id it functions it functions as a secondary index. It still has a unique constraint on it, but it functions as a secondary index in that if you look up by primary key, it traverses that secondary tree, finds the row ID and then takes the row ID back to the clustered index. You may be thinking, wow, Aaron, that was a fantastic explanation about something that I do not care about at all one bit.

That that thing. That that's too bad. I think there are some real world implications though, so let's talk about them. In MySQL in MySQL, when you declare a primary key, you're declaring the clustered index. Right?

So if you declare a, auto incrementing big integer then new rows are always going to go at the end of your table. And if you turn that into a b tree structure new rows are always going to expand to the right and the tree will rebalance itself pretty pretty nice and pretty naturally, right? In MySQL, if you declared a UUID as your primary key and it's one of the random variants of UUID, I think there are maybe 7 variants, and it's one of the ones that is random, then what you're doing is functionally on on a on a table level, you're inserting rows kind of randomly in the middle, beginning and all over the place. Right? Because your primary key is kind of random.

That means you're also inserting rows randomly in that b tree. And so the b tree is constantly having to break itself and rebalance and maybe potentially move a bunch of your data around. Now regular b trees do this all the time, but the clustered index is a lot heavier because it contains all of that row data. So in MySQL, if you are declaring a primary key as some sort of u uid variant, you could have really slow inserts as you're breaking that b tree apart over and over and over. Jumping back to SQLite, you could have a completely random primary key and never break that b tree apart because the actual clustered index is the row ID.

So any performance penalty that exists for using a uuid as a primary key in my SQL does not exist for using a u uid as a primary key in SQLite because the real clustered index is that secret row ID. Now that's one benefit. Right? That's one benefit is you're functionally using auto incrementing big integers because it is that row ID. What are the drawbacks?

The drawbacks would be if you have a primary key that is not an alias of the row ID, you have to do 2 lookups just to look up by primary key, which seems like a bummer. Right? Because a lot of times you're accessing stuff by primary key and you want it to be as fast as possible. Now, it will still be fast, but if you have that primary key that's not an alias, it does have to do 2 lookups. It's gotta look up through the primary key b tree, find the row ID, and then go back to the clustered index and find the row ID over there to grab the row information.

So those are some real world impacts. Once we start doing some performance testing, we'll see if they actually matter, or to what extent they actually matter. But in the next video, let's talk about what a without row ID table looks like as it relates to clustered indexes primary and secondary keys.

Full Course

SQLite for Production

Summary

Video Transcript