Design a URL Shortening Service like TinyURL, Bitly or shorturl

This is a very common interview questions for product and FANG companies. Even though the problem statement appears to be too simple the actual implementation has some architectural challenges.

Why do we need URL shortening?

URL shortening service or a webservice is used to create shorter/terse aliases for long URLs. We call these shortened aliases as “short links.” Users are redirected to the original URL when they hit these short links.

Short links save a lot of space when displayed, messaged, printed or tweeted or for any other purpose.

Additionally, users are less likely to mistype shorter URLs and thereby avoiding lot of 404’s.

For example, if we shorten the below page through TinyURL:

https://teachyourselfcoding.com/articles/full-stack-master-class-vanilla-javascript-react-angular/

we get

https://tinyurl.com/y5vrhsln

A shortened URL is a combination of

Numbers (0-9)
Characters (a-Z)
And usually length around 7 to 8 characters

Functional Requirements

Our URL shortening system should meet the following functional requirements.

Given a URL, the service should generate a shorter and unique URL from it. This is called a short link.
When users access a short link, the service should redirect them to the original link.
Links can expire after a standard default expiry timespan. Users may optionally be able to specify expiry time.
Users can optionally be able to create a custom short link for their URL (Note: The custom short link should be unique).

Non-Functional Requirements

Our URL shortening service should also adhere to the below NFR’s.

URL redirection should happen with minimal latency.
The system should be highly available. If the service is down, all the URL redirections will fail.
URL redirection should happen in real time.

Analytics requirements

Apart from the functional and NFR’s the below analytical requirements are nice to have features.

How many times a redirection happened?
REST API end point.
How many failed redirection happened?

Capacity Planning

The critical factor to consider in this kind of application/service is that the reads will be more than the writes.

We can assume the ratio of read to write as 100: 1. (Or depending on your context you can change this)

Let’s start small. Assume we will have 500 million new URL shortening request per month, with 100:1 read/write ratio, we can expect 50B redirections.

100 * 500M => 50B

Let’s also calculate the Queries Per Second (QPS) for our new system.New URL Shortening requests per second

New URL Shortening requests per second

500M / (30 days * 24 hours * 3600 seconds) = ~200 URLs/second

So, URL redirections per second will be (100:1 read/write ratio)

100 * 200 URLs/s = 20K/s

Storage Estimates

Let’s assume we store every URL shortening request for 10 years. Since we have 500M new URLs every month, the total number of objects we expect to store will be

500 million * 10 years * 12 months = 60 billion

Let’s assume that each stored object will be approximately 500 bytes (just a rough estimate).

We will need 30TB of total storage.

60 billion * 500 bytes = 30 TB

Bandwidth Estimates

Bandwidth is (network bandwidth) is the term used to denote the amount of data transfer(data size) in a period of 1 second.

Data transfer will included, incoming as well as outgoing data.

Incoming data is the amount of data transferred to the server (same as file upload type)
Outgoing data is the amount of data returned from the server to the client (like downloading).

We expect 200 new URLs per second. So a total incoming data for our service will be 100KB per second.

200 * 500 bytes = 100 KB/s

For read requests, since we expect ~20K URL redirections every second, total outgoing data for our service will be 10MB per second.

20K * 500 bytes = ~10 MB/s

Cache (Memory Estimation)

To optimize performance and reduce the DB round trip caching URL’s will be an efficient solution.

Here’s let’s follow the, Pareto principle, aka 80:20 rule, i.e. 20% of the short links generate 80% of the system traffic.

So, since we have a total of 20K URLs / s (or 20K requests/ s), then , 1 day will have:

20K * 3600 seconds * 24 hours = 1.7B request / day

To cache 20% of these requests, we will need

0.2 * 1.7B * 500 bytes = 170GB memory

NOTE: Due to duplicate requests, our actual memory usage will be lesser thatn 170GB.

Caching considerations

Built in cache of the backend server vs Redis or Memcached
Cache invalidation strategy

System Size (ballpark)

Since our system has 500M urls in a month and has a read:write ratio of 100:1, the below will be the system specifications.

200 URLs generated every second
20K requests / per second
100 KB/s incoming data
10MB/s outgoing data
Disk capacity for 10 years: 30 TB
Memory Capacity for cache: 170 GB

Database Design

Based on the analysis done so far the below will be our database requirements.

Billions of records needs to be saved
The system has a high read rate

Database Schema:

Either SQL or NoSQL will work for this solution (But a nosql will be much better)

You can also use the auto increment ID feature of database (for base62 encoding approach as shown in the algorithms section)

Load Balancer (LB)

As scale is important adding load balancer will significantly improve the end user experience. Though load balancing can be applied at may places, here starting to load balance between the clients and application servers will be sufficient. Then based on estimates of scale additionally load balancing can be applied at

Security and Permissions

If the system has the requirements to create private URL’s then these factors need to be taken into consideration to compute storage.

Basic System Design and Algorithm

Shortening Techniques/Algorithms

There are multiple approaches to shorten the URL’s and few are outlined below.

Base 62 encoded string
Hashing function
Integer based sequence

Base62Encoding or use a Bijective function.

In mathematics, a bijection, bijective function, one-to-one correspondence, or invertible function, is a function between the elements of two sets, where each element of one set is paired with exactly one element of the other set, and each element of the other set is paired with exactly one element of the first set. There are no unpaired elements. In mathematical terms, a bijective function f: X → Y is a one-to-one (injective) and onto (surjective) mapping of a set X to a set Y.^[1][2] The term one-to-one correspondence must not be confused with one-to-one function (an injective function; see figures).

Here is is a one to one mapping between base 10 (decimal) to base 62.

Base62 uses 62 possible ASCII letters, 0 – 9, a – z and A – Z, therefore it is often used to represent large numbers in short length of string. Mainly it has two advantages:

A shorter number representation in base62 yields a smaller risk of error entered by human and the number can be typed in faster.
The second advantage is that it can be used in a more restricted application where the length for URL or name, identify is limited.

Again why 62 possible characters? Check the calculations below for base 62

A-Z = 26 characters
a-z = 26 characters
0-9 = 10 characters

Which totals to 62 characters.

The JavaScript code to convert to and from decimal(1) to base 62 and vice versa.

const BASE_62_VALUES = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

// convert from decimal to base (default 62)
function to_base(num, base = 62) {
  if (num === 0 ) return BASE_62_VALUES[0]
  result = []
  while (num > 0) {
    r = num % base
    console.log(r);
    result.unshift(BASE_62_VALUES[r])
    num = Math.floor(num / base)
  }
  return result.join("");
}

// Convert to decimal from base (default 62) encoded string
function from_base(value, base = 62) {
  base_10_num = 0
  digit = 0
  value.split('').reverse().forEach(char => { 
    num = BASE_62_VALUES.indexOf(char)
    base_10_num += num * Math.pow(base, digit) 
    digit += 1
  })
 
  return base_10_num
}


console.log(`1:  ${to_base(1)}`);  // => 1
console.log(`decode:  ${from_base(to_base(1))}`);  // => 1
  
console.log(`98:  ${to_base(98)}`); // => 1A
console.log(`decode:  ${from_base(to_base(98))}`); // => 98

Code can be tested here https://jsbin.com/nohujes/edit?js,console,output

Now once we have this basic conversion function we can use the DB auto increment entity ID for the long URL and save the shortened form. (You can take a different approach as well)

How Base 62 encoding works?

More details here https://en.wikipedia.org/wiki/Base62

Setup up the encoding characters
1. “0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ”;
2. NOTE: Here character at index 0 is ‘0’ and character at index 61 is ‘Z’
Create empty array or hash
While the input number is greater than 0
1. Take the base 10 input and modulus it with 62 and store the result into a variable. The resultant value will always be between 0 and 61 (inclusive)
2. Get the character at that position using the variable defined above.
3. Reset the input with the floor value of the input divided by the base.
Convert the resultant array/hash as string

Assume our short service URL is https://shrt.ly

Steps to create short links

User visits https://shrt.ly
The user enters the long URL and submit the form
In the backend the service will create a db entry for this long url
It then uses the entity ID and crates a short url using the base 62 function.
The shortened URL will be returned back to the user.

Steps to navigate the original URl using short link

User visits a short url page for e.g. https://shrt.ly/1a
The server decodes the short url to get the id and uses the ID to fetch the long URL.
The server redirects the user using the long URL.

NOTE: The URL routing/matching part of your service should be optimized as that can result in slower user responses.

Hash function (like MD5 or SHA-2 etc)

<TODO>

Assignment

Create a website/webservice for URL shortening using the above algorithm.

Blog