Today, there already exist numerous URL shortening services, such as tinyurl, bit.ly, qr.cx and many more. Although they are very handy when it comes to shortening URLs in blog posts and articles oftentimes they can’t be used in proprietary software due to certain copyright and usage restrictions. Thus, a vast range of projects have evolved aiming to implement custom string (URL) shortening solutions.
This article presents a simple but effective template for creating a custom string shortening service, based a on proven and widely used algorithm – Base 62 encoding. Prior to discussing the source code let’s first break down the theory behind the shortening algorithm using Base 62 encoding.
Theory
What we want to achieve is a unique bi-directional mapping of a given string (e.g. URL) and a hash:
string <=> hash
Furthermore, we want this unique hash to be (much) shorter than the original string. Let’s have a look at the following example string and the (possible) shortened outcome:
This is an examplary long string to be shortened <=> T4Xza
Of course to achieve this bi-directional property, we need to be able to retrieve the original string based on the hash calculated and additionally, we must be given the same hash for the same string on every attempt. Thus, there must not exist more than one hash for the same original string and the encoding operation must be deterministic. This leads us to certain filtering operations for the input string prior to calculating the hash, but that’s just a matter of programmatic effort that we can handle later on.
Base 62
The encoding schema to be used by this shortening algorithm will be Base 62 encoding, as we are going to use 62 letters: [a-zA-Z0-9]. As always, the indexing begins at 0.
Database
In order to manage our list of shortened strings and their corresponding hashes we are going to use a database. In this example I’ve decided to use a MySQL database with the following schema: id : INT (primary key, auto increment) hash: VARCHAR (unique) url: VARCHAR Each encoded url (i.e. hash) entry will be identified by a unique id. Of course, whenever an entry should be added it needs to be checked if the same hash already exists, i.e. if the same url already exists in our database. Once this check is done and no same url exists we have to add a new entry and based on the last id generated we are going to calculate our hash. Following you find the corresponding SQL code for the schema:
CREATE TABLE IF NOT EXISTS `phpss` ( `id` int(11) NOT NULL AUTO_INCREMENT, `hash` varchar(100) COLLATE utf8_unicode_ci NOT NULL, `url` varchar(255) COLLATE utf8_unicode_ci NOT NULL, PRIMARY KEY (`id`), KEY `hash` (`hash`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Algorithm
Obviously, the underlying algorithm represents the main part of this solution. As previously mentioned, each shortened string will be represented by a unique id (and hash) in the database. We will make use of this unique ID to create hashes, using Base62 encoding. So, let’s say our next unique ID for the string to be shortened is 100. In this case 100 is Base10 encoded (100×10^0). Consequently, the next step would be to convert it to Base62 encoding. This can be done quite easily using modulo operation:
hashDigits = [] dividend = ID remainder = 0 while(dividend > 0) remainder = modulo(dividend, 62) dividend = divide(dividend, 62) hashDigits.prepend(remainder) endwhile
Thus, 100 would lead to 1×62^1 + 38×62^0 using Base62 encoding (hashDigits = [1, 38]). The next step is to convert these hashDigits to their corresponding Base62 representation, resulting in a unique hash string:
base62Alphabet = [a,b,c,...,A,B,C,...,0,1,2,...] hashDigitsCount = hashDigits.count() hashString = "" i = 0 while(hashDigitsCount > i) hashString += base62Alphabet[hashDigits[i]] i++ endwhile
This would lead to the hash bM: b = base62Alphabet[1] M = base62Alphabet[38]
Improvements
This implementation is for reference only, since it (currently) lacks important checking functionality. At the moment it only checks the database for identical strings. While this is perfectly fine for normal text, when using URLs there are oftentimes multiple versions possible, e.g. http://google.com vs. http://www.google.com vs. http://www.google.com/. The algorithm itself seems pretty stable. If you find any bugs and have suggestions for improving the code please let me know.
I will add the code to GitHub in the near future, so please stand by for updates.
Source Code
Below you find the source code of the php string shortener.
phpStringShortener.php
/** * PHPStringShortener * * A simple string shortener class to demonstrate how to shorten strings, such * as URLs using Base62 encoding. * * @author Matthias Kerstner <matthias@kerstner.at> * @uses PDO * @link https://www.kerstner.at/phpStringShortener */ require_once './config.php'; class PhpStringShortener { /** * @var PDO */ private $DB_HANDLE = null; /** * The Base62 alphabet to be used. * @var array */ private $BASE62_ALPHABET = array( 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'); //potential additional characters allowed unencoded in URLs include: //'$', '-', '_', '.', '+', '!', '*', '(', ')', ',' //@see http://www.faqs.org/rfcs/rfc1738.html /** * Returns DB handle, implemented as singleton. * @return PDO * @see config.php */ private function getDbHandle() { if ($this->DB_HANDLE !== null) { return $this->DB_HANDLE; } try { $this->DB_HANDLE = new PDO('mysql:host=' . PHPSS_DBHOST . ';dbname=' . PHPSS_DBDB, PHPSS_DBUSER, PHPSS_DBPASS, array( PDO::ATTR_PERSISTENT => true )); $this->DB_HANDLE->exec("SET CHARACTER SET utf8"); return $this->DB_HANDLE; } catch (PDOException $e) { throw new Exception('Error: ' . $e->getMessage()); } } /** * Closes the DB handle. */ private function closeDbHandle() { if ($this->DB_HANDLE !== null) $this->DB_HANDLE = null; } /** * Generates hash for $id specified using Base62 encoding. * @param int $id * @return string|null */ private function generateHashForId($id) { $hash = ''; $hashDigits = array(); $dividend = (int) $id; $remainder = 0; while ($dividend > 0) { $remainder = floor($dividend % 62); $dividend = floor($dividend / 62); array_unshift($hashDigits, $remainder); } foreach ($hashDigits as $v) { $hash .= $this->BASE62_ALPHABET[$v]; } return $hash; } /** * Closes DB handle. */ public function __destruct() { $this->closeDbHandle(); } /** * Returns string identified by $hash. * @param string $hash * @return string|null */ public function getStringByHash($hash) { $dbHandle = $this->getDbHandle(); $sql = 'SELECT id, hash, url FROM phpss WHERE hash = :hash'; $sth = $dbHandle->prepare($sql, array(PDO::ATTR_CURSOR => PDO::CURSOR_FWDONLY)); $sth->execute(array(':hash' => $hash)); $entry = $sth->fetch(PDO::FETCH_ASSOC); if (!$entry || count($entry) < 1 || !isset($entry['url'])) { return null; } return $entry['url']; } /** * Returns hash identified by $string. * @param string $string * @return string|null */ public function getHashByString($string) { $dbHandle = $this->getDbHandle(); $sql = 'SELECT hash, url FROM phpss WHERE url = :url'; $sth = $dbHandle->prepare($sql, array(PDO::ATTR_CURSOR => PDO::CURSOR_FWDONLY)); $sth->execute(array(':url' => $string)); $entry = $sth->fetch(PDO::FETCH_ASSOC); if (count($entry) > 0 && isset($entry['hash'])) { //hash already exists return $entry['hash']; } return null; } /** * Adds hash identified by $string if it does not already exist. * @param string $string * @return string */ public function addHashByString($string) { $hash = $this->getHashByString($string); if ($hash !== null) { //hash already exists return $hash; } $dbHandle = $this->getDbHandle(); $sql = 'insert into phpss (id, hash, url) values(0, :hash, :url)'; $sth = $dbHandle->prepare($sql, array(PDO::ATTR_CURSOR => PDO::CURSOR_FWDONLY)); if (!$sth->execute(array(':hash' => '', ':url' => $string))) { throw new Exception('Error: failed to add entry (1)'); } $lastInsertId = $dbHandle->lastInsertId(); if (!$lastInsertId) { throw new Exception('Error: failed to add entry (2)'); } $hash = $this->generateHashForId($lastInsertId); $sql = 'update phpss set hash = :hash where id = :id'; $sth = $dbHandle->prepare($sql, array(PDO::ATTR_CURSOR => PDO::CURSOR_FWDONLY)); if (!$sth->execute(array(':hash' => $hash, ':id' => $lastInsertId))) { throw new Exception('Error: failed to add entry (3)'); } if ($hash !== null) { //hash already exists return $hash; } } }
config.php
/** * PHPStringShortener * * A simple string shortener class to demonstrate how to shorten strings, such * as URLs using Base62 encoding. * * @author Matthias Kerstner <matthias@kerstner.at> * @uses PDO * @link https://www.kerstner.at/phpStringShortener */ define('PHPSS_DBHOST', 'localhost'); define('PHPSS_DBUSER', 'root'); define('PHPSS_DBPASS', ''); define('PHPSS_DBDB', 'phpss');
index.php
/** * PHPStringShortener * * A simple string shortener class to demonstrate how to shorten strings, such * as URLs using Base62 encoding. * * @author Matthias Kerstner <matthias@kerstner.at> * @uses PDO * @link https://www.kerstner.at/phpStringShortener */ require_once './phpStringShortener.php'; $cmd = isset($_GET['cmd']) ? $_GET['cmd'] : null; if ($cmd !== null) { try { if ($cmd == 'get') { $hash = isset($_GET['hash']) ? $_GET['hash'] : null; if (!$hash) { die('No hash specified'); } $phpSS = new PhpStringShortener(); $string = $phpSS->getStringByHash($hash); //we expect strings to be URLs, so redirect now echo 'Redirecting to ' . $string; header('refresh:5;url=' . $string); exit; } else if ($cmd == 'add') { $string = isset($_GET['string']) ? $_GET['string'] : null; if (!$string) { die('No string specified'); } $phpSS = new PhpStringShortener(); $hash = $phpSS->addHashByString($string); die('Your hash for ' . $string . ' is: ' . $hash); } } catch (Exception $e) { die('Error: ' . $e->getMessage()); } }
It’s difficult to find educated people for this subject, however, you sound like you know what you’re talking about!
Thanks
Thanks, you’re welcome 😛
This is awesome!
Hi, Reading the above it makes me wonder how url shortners work with strings or parameters like “%” or “?” inside a the url? Do they just not work?
Hi Matthias,
you have been talking about using unique ids from the database for generating a hash for a requested url.
Could you please explain how would you handle concurrency issues?
The use case is the following:
– there is a requester 1 from ip “IP1” who wants to encode http://www.first.example.com
– there is another requester 2 from ip “IP2” who wants to encode http://www.second.example.com
– the requesters issue the requests at the exactly same moment
– the last id is 101
Hi Timofey,
thanks for your feedback. The DB connection example provided in the post is a rather naive implementation, thus simultaneous requests have not been taken into special consideration (ACID principle). To make sure the code works in multi-threaded setups you will need to use transactions. If you are using a database that does not support transactions (e.g. MyISAM) you will most likely need to LOCK the table to avoid race conditions, or use INSERT IGNORE and check if the row was inserted successfully. Hope that helps, cheers.
Hi Matthias,
thanks for the answer!
Don’t you think that locking a table for every hash generation request would be a bottleneck?
I think it is possible to hash based on the url itself. One possible solution may look like this:
urlHash = base62(md5(originalUrl + predefinedSalt)).substring(0, 6)
– for the same url it produces the same hash.
– it depends only on the input url itself and some predefinedSalt which you control
What do you think about this approach?
Hey,
sorry for the late response. Sure, I was just mentioning the possibility of locking the table when using databases that do not support transactions. Your solution is correct too, although I don’t quite get the substring() on the hash? Cheers
substring is intended to make the url short, in this case 6 characters.
Lets take an example of http://www.google.com
md5_res = md5(“www.google.com”) = 0a137b375cc3881a70e186ce2172c8d1
base62_res = base62(md5_res) = 5pQ6eyBeZ0XkVbn0XO8O
base62_res.substring(0, 6) = 5pQ6ey
Hey, I’m aware of what substr() does 🙂 But this way you are losing information and chances are much higher that you get duplicate substr’ed hashes.
Hi, the number of urls which can be represented by base62 encoded string of length 20 is 62**20 = 704423425546998022968330264616370176L which is incredibly large number.
Taking 6 chars we have 62**6 = 56800235584 = ~56G. The number of urls (or web pages) are approximately estimated to be 1T=1000G, therefore, if all such urls are shortened, some of them would get the same shortened hash.
However, taking 8 chars = 62**8 = 218T would be enough to shorten all the urls on the web.
Any way, we can shorten strings without using the database?
Hello,
@Nick, I think they will work. A hash is a number that represents a data stream in a unique way. So, the stream of url characters (including % and other special characters) will be transformed into one large number, this number is then converted into base62 to shorten the number of characters you need to represent this hash, because you represent more information per char/byte using base62 than using base10 so you need fewer char/bytes.
@Pulkit Sharma. Of course you can shorten strings without a database. Database here is just for the purpose of the example.
@matthias.kerstner. What is the bigest number a integer can represent in PHP? can your hashes be securely represented in this way?
@matthias.kerstner. The timestamps in comments are showing format instead of the value :p
Hey Gilberto,
thanks for your feedback. I will answer your questions in an additional comment a bit later. Yes, the date-format is not correct at the moment, sorry 😉
Hi Gilberto, sorry for the late reply. As always the size of integers is platform-dependent. On 32 bit system the maximum value is about two billion, on 64-bit platforms we usually have a maximum value of about 9E18 (except on Windows prior to PHP 7, where it was always 32 bit). Also note that PHP does not support unsigned integers. To get the maximum integer size on your system use the constant PHP_INT_SIZE and for the maximum value use the constant PHP_INT_MAX. Hope that helps. And yes, thanks again for the heads up on the date formatting issue, which should be fixed now. Cheers
Pingback: Design a URL shortner – Lets Code
How do you do lookup for previous/existing URL’s? For example, you have “www.google.com” already in the table, now someone requests shortening of this same URL, how do you do lookup to return the existing shortened link?
Why not use a unique random number generator that generates the shortcode for a given url which has 6 or 7 chars? and then index the shortcode in the db , every time a new short code is being inserted , just do a check if the code exisits to ensure its uniqueness .. ?
Pingback: Building URL shortener using React, Apollo and GraphQL — Part II - Blog Pays | A truth of your life
Suggest “utf8_bin” collation over “utf8_unicode_ci”. Using a case-insensitive collation otherwise ends up being equivalent to using Base 36.
Wow! very valuable information. Thanks a lot for the write up.
its is a very helpful article..Thanks !