(brought to you by boringcactus)

UUID versions through the ages (12 Feb 2023)

UUIDs are neat. y'know, cfbff0d1-9375-5685-968c-48ce8b15ae17 type of shit. if you're like me until a few days ago, all you know about the types of UUID is that v4 is the good one. but why are there other ones? is there a secret better one? why are the dashes asymmetrical? let's take a (roughly paraphrased from wikipedia and probably not quite accurate) look.

wait why even

sometimes you need an ID for something you are putting in the computer, so that you have a stable way to refer to it even if all the editable fields on it change. the simplest possible approach is to give the first thing ID 1, the second thing ID 2, and so on. cohost works this way right now - as i'm editing it, this draft post has ID 1009270, meaning this is the just-over-a-millionth thing in the posts table.

your database sits there going "the next post has ID 8. oh, new post? it has ID 8, okay the next post has ID 9." and all is well. except a year later you have a million posts and a bunch of people posting all at once, and every new post needs a new ID but they have to get created one at a time in the database so that they all get the right ID. If You're In Line (To Get The Next Post ID), Stay In Line. and the only way to know what the next post ID is is to check with the database, so you can't do things like save drafts offline with proper IDs. (staff probably doesn't want that anyway, but we need something vaguely similar at work, which is how i got here.) if you need to work at, say, Twitter's scale, or you need to be able to generate IDs without checking with the database, you need something more involved than just sequential IDs.

version 1

in the early 90s, some UNIX people ran into this problem when drawing up their Distributed Computing Environment. they called their solution "Universal Unique Identifiers", and defined one as "an identifier that is unique across both space and time". it's written as a hexadecimal string, but it can be stored as just the 16 bytes that are represented by that hexadecimal string. the way they make sure it's unique across both space and time is actually pretty straightforward: part of the UUID encodes the space where it was generated, and part of the UUID encodes the time when it was generated.

the UUID format has two control fields and three data fields. the version field is pretty straightforward - it's 1 for UUIDv1, 2 for UUIDv2, etc. at this point, they only had 1 and 2, but they left room in the spec for up to 15 just in case. there's also a variant field, which says whether it's a normal UUID (binary 10, so the first hex digit of the fourth group is 8 through b) or some other bullshit that may or may not adhere to any of the rest of this spec.

other bullshit

if the variant field is 0 then it's a UUID from Apollo Computer's Network Computing System, which had UUIDs before DCE but defined them in a slightly different way. if it's 110 then it's a UUID but the wrong endian, which Microsoft does sometimes when it makes UUIDs (it calls them GUIDs, because they're thinking too small, merely Global rather than Universal). if it's 111 then you're living in the future where they assigned a meaning to variant 111. what's it like? how's the whole climate change thing going?
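if you want to poke at the two control fields yourself, here's a minimal python sketch. it leans on the standard uuid module, and also pulls the bits out by hand to show where they live. the helper name control_fields is made up for this example.

```python
import uuid

def control_fields(u: uuid.UUID) -> tuple[int, int]:
    """Return (version, top two variant bits) pulled straight out of the 128-bit value."""
    version = (u.int >> 76) & 0xF        # top 4 bits of the third group
    variant_bits = (u.int >> 62) & 0b11  # top 2 bits of the fourth group
    return version, variant_bits

example = uuid.UUID("1a8188ce-aa78-11ed-afa1-0242ac120002")
print(control_fields(example))            # the two fields, extracted by hand
print(example.version, example.variant)   # the same fields, as the stdlib reports them
```

a variant of binary 10 is what the stdlib calls uuid.RFC_4122.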

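the data fields are a 60-bit timestamp (counting 100-nanosecond ticks since 15 October 1582, when the gregorian calendar kicked in), a 14-bit clock sequence (which gets changed whenever the clock might have gone backwards, so that timestamps don't get reused), and a 48-bit node (normally the MAC address of the machine that generated the UUID).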

this is how we get that weird hyphen asymmetry, the groups come directly from the UUID data fields:

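1a8188ce-aa78-11ed-afa1-0242ac120002
1a8188ce is time_low
aa78 is time_mid
11ed is the version (the first 1) followed by time_hi (1ed)
afa1 is the variant (the top two bits of the a) followed by the clock sequence
0242ac120002 is the node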

you may have noticed that the time is split into three pieces, for the low 32 bits, the middle 16 bits, and the high 12 bits. why split it up like that? well, i don't know, but i suspect it makes a lot of things easier to split the 60 bit timestamp into at least 32 and 28 (and maybe splitting the 28 into 16 and 12 makes something easier in a way i'm not seeing?). for one, 64-bit CPUs weren't mainstream yet, and for two, they had creative alternate uses for that time_low field.
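putting the three time pieces back together is just shifting and OR-ing. here's a rough python sketch, assuming a v1 UUID whose timestamp counts 100-nanosecond ticks from 15 October 1582. the function names are made up for this example, and the stdlib's .time attribute should give the same number for a v1 UUID.

```python
import uuid
from datetime import datetime, timedelta, timezone

# v1 timestamps count 100-nanosecond ticks from the start of the gregorian calendar
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)

def v1_timestamp(u: uuid.UUID) -> int:
    """Glue the three time fields back into one 60-bit tick count."""
    time_hi = u.time_hi_version & 0x0FFF  # drop the 4 version bits
    return (time_hi << 48) | (u.time_mid << 32) | u.time_low

def v1_datetime(u: uuid.UUID) -> datetime:
    # timedelta only goes down to microseconds, so the last digit of the tick count is dropped
    return GREGORIAN_EPOCH + timedelta(microseconds=v1_timestamp(u) // 10)

example = uuid.UUID("1a8188ce-aa78-11ed-afa1-0242ac120002")
print(hex(v1_timestamp(example)))
print(v1_datetime(example))
```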

version 2

like a lot of multi-user operating systems, UNIX has users and groups, and allows for permissions management based on those users and groups. users and groups both have textual names and numeric IDs, so if you want to stably refer to a specific user or group, you can use its user ID or group ID. however, different computers can have different sets of users and groups, so if you're making the Distributed Computing Environment, you need a way to refer to a specific user or group on a specific machine. you're building on top of UNIX, because you're the Open Software Foundation (later The Open Group), so you have user and group IDs locally already. and you already made this UUID format, which has fields that refer to a specific machine. the other fields are already taken for time and also-time, but you didn't promise they'd always be time, right?

in UUIDv2 (which the DCE spec calls the "security version"), the time_low field is literally just a UNIX user/group ID. the low byte of the clock sequence field is repurposed to specify whether it's a user or a group (or a secret third thing, an organization).

i have several questions. for one, what about the other time fields? time_mid ticks up once every 7 minutes, if you construct your UUIDv2 out of a UUIDv1. do you just leave it at zero and let time_hi tick up every 325 days? do you leave mid and hi both at zero and party like it's 1582? for two, had they not invented MAC address spoofing yet? these days you can usually change your network card's MAC address to something else, so using that for anything security-related strikes me as highly dubious. for three, what? just in general? why would you do this? this is some 5 Minute Crafts tier lifehackery. please refrain.
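the 7 minutes and 325 days figures fall out of the field widths, if you assume each timestamp tick is 100 nanoseconds. a quick back-of-the-envelope check in python (the variable names are just for this example):

```python
TICK_SECONDS = 100e-9  # one timestamp tick is 100 nanoseconds

# time_mid goes up by one each time the 32-bit time_low field overflows
time_mid_step_minutes = (2**32 * TICK_SECONDS) / 60

# time_hi goes up by one each time time_low and time_mid (48 bits together) overflow
time_hi_step_days = (2**48 * TICK_SECONDS) / 86_400

print(f"time_mid ticks every {time_mid_step_minutes:.2f} minutes")
print(f"time_hi ticks every {time_hi_step_days:.1f} days")
```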

presumably this worked well enough for DCE, but it has not withstood the test of time. i don't know that UUIDv2 even counts as a UUID, but it follows the UUID format and put a 2 in the version number slot, and so it lives on solely as negative space in the UUID version number range. (this is also apparently the deal with IPv5.)

UUIDv2 may or may not have been a good idea, but the concept of "what if you had a UUID based on some specific value other than the current time" had legs.

version 3

DCE was done being written, and then it kinda died, but people kept using UUIDs. DCE was a legacy-style Proper Goddamn Specification, written by the consortium that had since become The Open Group, who also run POSIX and the Single UNIX Specification and all that jazz (when the posix is sus!), but that sort of doorstopper spec was overkill for the humble UUID. what it needed, as a piece of computer bullshit, was an RFC. and so in 2005 the UUID was defined again in RFC 4122, which kept v1, reduced v2 to one sentence, and added some new versions.

one way to think of the goal of UUIDv2 is that it's about referring to an object that already has a contextually unique ID. in v2, that object is either a user or a group, and that context is a machine. v3 is a little more flexible, but one of the contexts mentioned in the RFC is domain names, so let's look at that.

say i want something in the format of a UUID that refers to the domain name example.com. one option would be to take the MD5 hash of "example.com", which is conveniently exactly 16 bytes long, line that up with the UUID format definition, and set the version and variant to the right values. this is cool, and it basically already works for domain names, but we want flexibility. if you and i both want to do the UUIDv2 thing of referring to users on a machine, and my context is my machine and your context is your machine, and both of us have a user named cactus, oops, we have the same UUID, that's hardly Universally Unique. we need to include the context in what we're MD5ing, and we need to guarantee that different contexts have different values. and there's nothing computer people love more than recursion, so let's give the context a damn UUID.

to make a UUIDv3, you need a name (which is just some text) and a UUID for your "name space" (which is the context in which your name is unique). take the binary representation of the namespace UUID, append the name, MD5 it, copy that into your UUID structure, set the version and variant, and you are done.

the RFC defines some name space UUIDs already, like 6ba7b810-9dad-11d1-80b4-00c04fd430c8 for domain names, so we can check this ourself:

>>> import hashlib
>>> import uuid
>>> dns_namespace = uuid.UUID("6ba7b810-9dad-11d1-80b4-00c04fd430c8")
>>> hashlib.md5(dns_namespace.bytes + b"example.com").hexdigest()
'9073926b929fd1c26bc9fad77ae3e8eb'
>>> uuid.uuid3(dns_namespace, "example.com")
UUID('9073926b-929f-31c2-abc9-fad77ae3e8eb')
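the only difference between the raw MD5 and the UUIDv3 is six bits getting stomped on. here's a minimal sketch of doing that stomping by hand, assuming the name gets encoded as UTF-8 (which is what python's uuid3 does with a str). uuid3_by_hand is a made-up name, not a stdlib function.

```python
import hashlib
import uuid

def uuid3_by_hand(namespace: uuid.UUID, name: str) -> uuid.UUID:
    """Hash namespace bytes + name with MD5, then overwrite the control bits."""
    digest = bytearray(hashlib.md5(namespace.bytes + name.encode("utf-8")).digest())
    digest[6] = (digest[6] & 0x0F) | 0x30  # version nibble becomes 3
    digest[8] = (digest[8] & 0x3F) | 0x80  # variant bits become 10
    return uuid.UUID(bytes=bytes(digest))

dns_namespace = uuid.UUID("6ba7b810-9dad-11d1-80b4-00c04fd430c8")
print(uuid3_by_hand(dns_namespace, "example.com"))
```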

this is pretty damn neat. if you have something that's contextually unique and you want to turn it into something that's globally unique, this is a really cool way to do that. (spoilers, except for one thing, which you may have noticed if you know your hash functions.) but this is only sometimes a problem you have; other times, you don't have anything unique yet, and you want something ex nihilo. v1 is still good, if you've got a timestamp and a MAC address, but what if you're doing something like JS development where you can't exactly check the MAC address you're running on? well, satan help you. or what if that 100ns resolution isn't good enough like it was in the 90s?

version 4

set the version and variant. fill the rest with random bits. there is no step 3.

if you had asked me to guess last week what a UUIDv4 was, i'd have just guessed it was 128 random bits (if you made me count how long it was). and i'd have been wrong, but only by six bits. which is neat, but also a little bit bullshit, because like. me from last week wants those bits back!

those six bits are for compatibility with the rest of the UUID universe, but if you're just looking for some random bytes to throw in your id column, you don't need compatibility with UUIDv1, you could just make some random bytes! and 16 of them is probably overkill for your use case anyway!
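for the curious, here's roughly what that looks like in python. this is a sketch rather than how any particular library does it, uuid4_by_hand is a made-up name, and it assumes the secrets module is a good enough source of randomness for your purposes.

```python
import secrets
import uuid

def uuid4_by_hand() -> uuid.UUID:
    """Start from 16 random bytes and give up six bits to version and variant."""
    raw = bytearray(secrets.token_bytes(16))  # 128 random bits to start with
    raw[6] = (raw[6] & 0x0F) | 0x40  # spend 4 bits on the version
    raw[8] = (raw[8] & 0x3F) | 0x80  # spend 2 bits on the variant
    return uuid.UUID(bytes=bytes(raw))

print(uuid4_by_hand())        # 122 random bits in UUID clothing
print(secrets.token_hex(16))  # 128 random bits, no UUID format at all
```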

wait hang on a minute, did that say MD5 earlier?

version 5

turns out MD5 sucks. you know what's really cool? SHA-1.

UUIDv5 is just UUIDv3 again, but with SHA-1 instead of MD5.

>>> hashlib.sha1(dns_namespace.bytes + b"example.com").hexdigest()
'cfbff0d193753685568c48ce8b15ae17d93cc34c'
>>> uuid.uuid5(dns_namespace, "example.com")
UUID('cfbff0d1-9375-5685-968c-48ce8b15ae17')
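one wrinkle compared to v3: a SHA-1 digest is 20 bytes, and a UUID only has room for 16, so the tail end of the hash just gets thrown away. a minimal sketch of the whole thing by hand, again assuming UTF-8 names, with uuid5_by_hand being a made-up name:

```python
import hashlib
import uuid

def uuid5_by_hand(namespace: uuid.UUID, name: str) -> uuid.UUID:
    """Hash namespace bytes + name with SHA-1, truncate, then overwrite the control bits."""
    digest = hashlib.sha1(namespace.bytes + name.encode("utf-8")).digest()
    raw = bytearray(digest[:16])  # keep the first 16 of SHA-1's 20 bytes
    raw[6] = (raw[6] & 0x0F) | 0x50  # version nibble becomes 5
    raw[8] = (raw[8] & 0x3F) | 0x80  # variant bits become 10
    return uuid.UUID(bytes=bytes(raw))

print(uuid5_by_hand(uuid.NAMESPACE_DNS, "example.com"))
```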

thankfully, SHA-1 is the last word in hashing algorithms, it's never had problems and it's known to be very good. now to take a big sip of my coffee and check the NIST website.

The venerable cryptographic hash function has vulnerabilities that make its further use inadvisable.

(NIST, nist.gov, Dec 15, 2022)

oh. that seems bad. when is UUIDv6?

version 6

well. there's a draft RFC making updates to the UUID RFC, but it doesn't solve that problem, it solves different problems.

one of the cool things about UUIDv1 is that you can decode the timestamp back out of it, and you don't need a separate field for the time when your object was created, because its ID tells you when it was created. however, the weird slicing and dicing that UUIDv1 does to the timestamp field means sorting by time is complicated, since the low 32 bits come first and the high 12 bits come last.

UUIDv6 puts the whole timestamp in order, so that the most significant bit is first.

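1edaa9b4-e919-6172-a0d0-6721ef312724
1edaa9b4 is time_high
e919 is time_mid
6172 is the version (6) followed by time_low (172)
a0d0 is the variant (the top two bits of the a) followed by the clock sequence
6721ef312724 is the node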

it keeps the clock sequence as-is from v1, but it explicitly recommends using random data instead of the MAC address for the node field, which is good.
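since v6 is the same data as v1 in a different order, you can convert one into the other. here's a sketch of that, assuming the layout described in the draft (32 high bits of the timestamp, then 16, then the version, then the low 12). v1_to_v6 is a made-up helper, and the second UUID is invented for this example to be a few minutes newer than the first.

```python
import uuid

def v1_to_v6(u: uuid.UUID) -> uuid.UUID:
    """Reorder a v1 UUID's timestamp so the most significant bits come first."""
    if u.version != 1:
        raise ValueError("expected a version 1 UUID")
    ts = u.time                     # the 60-bit timestamp, reassembled by the stdlib
    time_high = ts >> 28            # top 32 bits
    time_mid = (ts >> 12) & 0xFFFF  # next 16 bits
    time_low = ts & 0x0FFF          # bottom 12 bits
    rest = u.int & (2**64 - 1)      # variant, clock sequence and node, untouched
    value = (time_high << 96) | (time_mid << 80) | (6 << 76) | (time_low << 64) | rest
    return uuid.UUID(int=value)

older = uuid.UUID("1a8188ce-aa78-11ed-afa1-0242ac120002")
newer = uuid.UUID("00000000-aa79-11ed-afa1-0242ac120002")  # hypothetical, slightly later

print(sorted([str(older), str(newer)]))                      # v1 strings: newer sorts first
print(sorted([str(v1_to_v6(older)), str(v1_to_v6(newer))]))  # v6 strings: in time order
```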

hang on, while we're messing with silly things from v1, what the hell is up with time since the gregorian calendar?

version 7

what if we just did unix timestamp and randomness, so that we had easy sorting and decoding but also uniqueness, without any bullshit.

well, v7 is that. 48 bits of unix timestamp in milliseconds (rolls over every 8920 years), 74 random bits, 6 control bits for version and variant.

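0186443f-2c00-75fb-800a-7ec0f02a852d
0186443f-2c00 is time
75fb is the version (7) followed by rand_a (5fb)
800a-7ec0f02a852d is the variant (the top two bits of the 8) followed by rand_b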

this is the real good one. the only reason not to use it is that the RFC isn't approved so it isn't quite official yet, but if you don't need to care, find an implementation in your language of choice and go to fuckin town.
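if you're curious what an implementation has to do, it's not much. here's a minimal sketch following the layout above, not a replacement for a real library: it doesn't try to keep IDs ordered within the same millisecond, and the function names are made up for this example.

```python
import secrets
import time
import uuid
from datetime import datetime, timezone
from typing import Optional

def uuid7_by_hand(unix_ms: Optional[int] = None) -> uuid.UUID:
    """Pack a millisecond unix timestamp and 74 random bits into the v7 layout."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand_a = secrets.randbits(12)
    rand_b = secrets.randbits(62)
    value = (
        ((unix_ms & (2**48 - 1)) << 80)  # 48 bits of timestamp up front
        | (7 << 76)                      # version
        | (rand_a << 64)
        | (0b10 << 62)                   # variant
        | rand_b
    )
    return uuid.UUID(int=value)

def uuid7_datetime(u: uuid.UUID) -> datetime:
    """Read the creation time back out of the top 48 bits."""
    return datetime.fromtimestamp((u.int >> 80) / 1000, tz=timezone.utc)

example = uuid.UUID("0186443f-2c00-75fb-800a-7ec0f02a852d")
print(uuid7_datetime(example))
print(uuid7_by_hand())
```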

but what if yolo?

version 8

if yolo, then you can use version 8. all of the version-specific fields are defined to be custom, so you can put whatever the hell nonsense you need to there.

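b00b5101-6969-8420-9a55-676179736578
b00b5101-6969 is custom_a
8420 is the version (8) followed by custom_b (420)
9a55-676179736578 is the variant (the top two bits of the 9) followed by custom_c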
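packing your own nonsense into those three fields is a few shifts. a rough sketch, assuming custom_a, custom_b and custom_c are 48, 12 and 62 bits wide as in the draft, with uuid8_by_hand being a made-up name:

```python
import uuid

def uuid8_by_hand(custom_a: int, custom_b: int, custom_c: int) -> uuid.UUID:
    """Pack three caller-chosen values around the version and variant bits."""
    if custom_a >> 48 or custom_b >> 12 or custom_c >> 62:
        raise ValueError("custom_a, custom_b, custom_c must fit in 48, 12 and 62 bits")
    value = (custom_a << 80) | (8 << 76) | (custom_b << 64) | (0b10 << 62) | custom_c
    return uuid.UUID(int=value)

# the field values that reproduce the example UUID above
print(uuid8_by_hand(0xB00B51016969, 0x420, 0x1A55676179736578))
```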

so what did we learn

UUIDs are pretty neat. if you need a database identifier and you get to pick something from scratch, use v7, it gives you timestamps for free. if you have things that are contextually unique and you want to turn them into universally unique IDs with a standard format, v5 is for you. if all you need is some random shit, v4 is that in the UUID format, but if you don't need the UUID format, you can also just use random bytes directly. if what you need is very specific shit nobody's thought of before, but in the UUID format, that's v8. v6 is the worse version of v7, v3 is the worse version of v5, v1 is the worse version of v6, and v2 was a mistake.

see also: creative use of v1's MAC address field, current RFC draft status and other notes, bluetooth doing crimes

in conclusion,

UUIDs nuts.