I have recently learned that YouTube was using random unique IDs for videos using base64 encoding of 64 bit integers. This leads to more readable URLs than using 128 bit UUIDs for the same purpose. Let’s try to generate these identifiers with our favorite prototyping language: Python!
The case for unique random IDs
Whenever you are building a service or database that is facing public access, you should not expose the internal auto-incrementing numbers directly. This makes your software vulnerable to prediction attacks.
Say that you are numbering your customers from 1, 2, … and that number is visible in an URL:
Malicious (or just curious) actors can try to change this number to another one, in order to access another user’s data. Even if the data is not displayed because of proper rights management, one can try to count the number of total users: increase that number until the HTTP 403 error code becomes a 404, and you can deduce the number of users in the system.
Same principle applies for YouTube videos. Some of them are unlisted, so their URL should not be guessed easily by using the same kind of increment deduction. This is why numbering them sequentially is a security risk: one can try random numbers between 0 and N existing videos to access unlisted ones.
The solution to this problem is to associate a random identifier to each and every entry that can be addressed by the public-facing software. Keep the internal ID for server-side manipulations like SQL JOIN requests, but never expose this ID to the public and use the random identifier instead.
An obvious answer: 128 bits UUIDs
The most popular unique random identifier is UUID (Universally unique identifier). It is a big-endian, hex-encoded 128 bits number that generates strings like this:
$ for i in $(seq 5); do uuidgen -r; done
196458f5-9e40-4408-b8a9-e026d97aad5c
d388af1e-4696-4f42-b567-22e29ce895f1
ab7a74a0-c13f-4459-a49d-e602419946c6
08004758-9ea9-4091-ab6a-5316ca3c7a20
66deec01-4b94-4405-9b87-c5c1e63e82ef
You may have encountered this style of random numbers inside URLs before. Using 128 bits guarantees a huge number of identifiers can be used. Generating an already-used random identifier should be extremely rare, but should still be accounted for. This is why the generated number should be verified against existing ones in the database. In practice this check should always pass.
Considering a bigger ASCII subset
One problem of UUIDs is that they are lengthy. They take a significant amount of space inside of an URL:
https://example.com/user/ab7a74a0-c13f-4459-a49d-e602419946c6
This is because UUIDs are not very efficient at using the available ASCII character set, by only using the [0-9a-f] characters and interleaving them with dashes for clarity.
This is where it becomes interesting: how does YouTube handles this? The requirement is basically the same: have one unique identifier for every video. However something shorter than UUIDs is used:
https://www.youtube.com/watch?v=dQw4w9WgXcQ
We can see that the unique identifier is using both upper and lowercase characters. The dash (-) and underscore (_) characters may also appear from time to time. Where UUIDs need 36 characters, this notation only uses 11 (31% the size of UUIDs).
Generating the identifiers
So how are these identifiers generated? They are based on a 64 bit random number, that is then encoded to this ASCII format. While 64 bits brings down the total number of possible identifiers compared to UUIDs, it is still acceptable. 2^64 remains huge for “common” usage.
Then, this number is encoded using a base64 variant, that replaces characters that are problematic for URLs (plus + and slash /) by URL-friendly symbols (respectivelly dash – and underscore _). This encoding is named “base64url”.
It is very easy to use the Python standard library to generate such identifiers:
import base64
import random
value = random.getrandbits(64)
by = value.to_bytes(8, byteorder='little')
# Equivalent to b64 = base64.b64encode(by, altchars=b'-_')
b64 = base64.urlsafe_b64encode(by)
if b64[-1] == ord('='):
b64 = b64[:-1]
print(f"{value=}")
print(f"{by.hex()=}")
print(f"{b64=}")
print(f"{len(b64)=}")
Note that we remove the final equal sign (=) from the result after base64 encoding, as it is considered useless in our usecase. We are using little endian encoding for the integer to bytes conversion, because in 2021 all major CPUs are using little endian. This may save some cycles, but Python overhead is probably heavier.
Example output:
value=16312555207629389598 by.hex()='1ebfb2a188d661e2' b64=b'Hr-yoYjWYeI' len(b64)=11
This looks much like YouTube identifiers!
In order to keep efficiency, the random 64 bits unsigned integer should be stored with the associated item, as it will take less memory space than storing the encoded string directly. It will also be faster to compare integers rather than strings for looking up the item.
Decoding the identifiers
The reverse process can be applied in order to retrieve the 64 bit integer value from an identifier:
import sys
import base64
import binascii
import random
b64 = sys.argv[1].encode('utf-8')
try:
by = base64.urlsafe_b64decode(b64)
except binascii.Error:
by = base64.urlsafe_b64decode(b64 + b'=')
value = int.from_bytes(by, byteorder='little')
print(f"{b64=}")
print(f"{len(by)=}")
print(f"{by.hex()=}")
print(f"{value=}")
And here is the result:
$ python3 decode64.py Hr-yoYjWYeI
b64=b'Hr-yoYjWYeI'
len(by)=8
by.hex()='1ebfb2a188d661e2'
value=16312555207629389598
Given that we have removed the padding equal = symbol, we first try to decode without it, or add it and retry if failure occurs. This is not elegant nor efficient. But the Python library does not seem to provide an option for ignoring incorrect padding directly.
The important part is that we get our 64 bits integer value back, and can use it for querying a database or any other storage system.
Getting back to UUIDs
The amount of code required to handle the 64 bit URLs is small, but not trivial. I am curious how the Python code would look like for encoding and decoding 128 bit integers using UUIDs.
UUID are backed by RFC 4122, and like a lot of other RFCs big endian is used for integer to bytes encoding. Bye bye our premature optimization of using little endian for modern CPUs. But at least UUID are specified, not like our previous hack.
Generating a random UUID:
import uuid
# Equivalent to:
# value = random.getrandbits(128)
# uu = uuid.UUID(int=value)
uu = uuid.uuid4()
value = uu.int
print(f'{value=}')
print(f'{uu.hex=}')
print(f'{str(uu)=}')
Output:
value=141823391717206992948815466394279191934 uu.hex='6ab23112beab459cab30c9ef1597a57e' str(uu)='6ab23112-beab-459c-ab30-c9ef1597a57e'
Parsing the UUID:
import sys
import uuid
uu = uuid.UUID(sys.argv[1])
value = uu.int
print(f'{str(uu)=}')
print(f'{uu.hex=}')
print(f'{value=}')
Output:
$ python3 uuiddecode.py 6ab23112-beab-459c-ab30-c9ef1597a57e str(uu)='6ab23112-beab-459c-ab30-c9ef1597a57e' uu.hex='6ab23112beab459cab30c9ef1597a57e' value=141823391717206992948815466394279191934
We also manage to get back to our integer value, but with a lot less of code. This is because UUIDs are exactly fit for this purpose and specified. It means that we do not have to manually choose the endianness, nor to remove the padding character of base64 for aesthetics.
Conclusion
While looking more user-friendly and less computer-ish, 64 bit ASCII identifiers are not as easy to use as 128 bit UUIDs from the programmer side. This is because UUIDs are perfectly fit for this purpose, specified, and software libraries exists for almost every programming language.
If you are doing a quick prototype, go with UUIDs. If you do bother to display shorter URLs for your users, then you can use 64 bit (or even 128 bit) base64 identifiers like described in this post.
Sources: