Today, there already exist numerous URL shortening services, such as tinyurl, bit.ly, qr.cx and many more. Although they are very handy when it comes to shortening URLs in blog posts and articles oftentimes they can’t be used in proprietary software due to certain copyright and usage restrictions. Thus, a vast range of projects have evolved aiming to implement custom string (URL) shortening solutions.
This article presents a simple but effective template for creating a custom string shortening service, based a on proven and widely used algorithm – Base 62 encoding. Prior to discussing the source code let’s first break down the theory behind the shortening algorithm using Base 62 encoding.
Theory
What we want to achieve is a unique bi-directional mapping of a given string (e.g. URL) and a hash:
string <=> hash
Furthermore, we want this unique hash to be (much) shorter than the original string. Let’s have a look at the following example string and the (possible) shortened outcome:
This is an examplary long string to be shortened <=> T4Xza
Of course to achieve this bi-directional property, we need to be able to retrieve the original string based on the hash calculated and additionally, we must be given the same hash for the same string on every attempt. Thus, there must not exist more than one hash for the same original string and the encoding operation must be deterministic. This leads us to certain filtering operations for the input string prior to calculating the hash, but that’s just a matter of programmatic effort that we can handle later on.
Base 62
The encoding schema to be used by this shortening algorithm will be Base 62 encoding, as we are going to use 62 letters: [a-zA-Z0-9]. As always, the indexing begins at 0.
Database
In order to manage our list of shortened strings and their corresponding hashes we are going to use a database. In this example I’ve decided to use a MySQL database with the following schema: id : INT (primary key, auto increment) hash: VARCHAR (unique) url: VARCHAR Each encoded url (i.e. hash) entry will be identified by a unique id. Of course, whenever an entry should be added it needs to be checked if the same hash already exists, i.e. if the same url already exists in our database. Once this check is done and no same url exists we have to add a new entry and based on the last id generated we are going to calculate our hash. Following you find the corresponding SQL code for the schema:
CREATE TABLE IF NOT EXISTS `phpss` ( `id` int(11) NOT NULL AUTO_INCREMENT, `hash` varchar(100) COLLATE utf8_unicode_ci NOT NULL, `url` varchar(255) COLLATE utf8_unicode_ci NOT NULL, PRIMARY KEY (`id`), KEY `hash` (`hash`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Algorithm
Obviously, the underlying algorithm represents the main part of this solution. As previously mentioned, each shortened string will be represented by a unique id (and hash) in the database. We will make use of this unique ID to create hashes, using Base62 encoding. So, let’s say our next unique ID for the string to be shortened is 100. In this case 100 is Base10 encoded (100×10^0). Consequently, the next step would be to convert it to Base62 encoding. This can be done quite easily using modulo operation:
hashDigits = [] dividend = ID remainder = 0 while(dividend > 0) remainder = modulo(dividend, 62) dividend = divide(dividend, 62) hashDigits.prepend(remainder) endwhile
Thus, 100 would lead to 1×62^1 + 38×62^0 using Base62 encoding (hashDigits = [1, 38]). The next step is to convert these hashDigits to their corresponding Base62 representation, resulting in a unique hash string:
base62Alphabet = [a,b,c,...,A,B,C,...,0,1,2,...] hashDigitsCount = hashDigits.count() hashString = "" i = 0 while(hashDigitsCount > i) hashString += base62Alphabet[hashDigits[i]] i++ endwhile
This would lead to the hash bM: b = base62Alphabet[1] M = base62Alphabet[38]
Improvements
This implementation is for reference only, since it (currently) lacks important checking functionality. At the moment it only checks the database for identical strings. While this is perfectly fine for normal text, when using URLs there are oftentimes multiple versions possible, e.g. http://google.com vs. http://www.google.com vs. http://www.google.com/. The algorithm itself seems pretty stable. If you find any bugs and have suggestions for improving the code please let me know.
I will add the code to GitHub in the near future, so please stand by for updates.
Source Code
Below you find the source code of the php string shortener.
phpStringShortener.php
/** * PHPStringShortener * * A simple string shortener class to demonstrate how to shorten strings, such * as URLs using Base62 encoding. * * @author Matthias Kerstner <matthias@kerstner.at> * @uses PDO * @link https://www.kerstner.at/phpStringShortener */ require_once './config.php'; class PhpStringShortener { /** * @var PDO */ private $DB_HANDLE = null; /** * The Base62 alphabet to be used. * @var array */ private $BASE62_ALPHABET = array( 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'); //potential additional characters allowed unencoded in URLs include: //'$', '-', '_', '.', '+', '!', '*', '(', ')', ',' //@see http://www.faqs.org/rfcs/rfc1738.html /** * Returns DB handle, implemented as singleton. * @return PDO * @see config.php */ private function getDbHandle() { if ($this->DB_HANDLE !== null) { return $this->DB_HANDLE; } try { $this->DB_HANDLE = new PDO('mysql:host=' . PHPSS_DBHOST . ';dbname=' . PHPSS_DBDB, PHPSS_DBUSER, PHPSS_DBPASS, array( PDO::ATTR_PERSISTENT => true )); $this->DB_HANDLE->exec("SET CHARACTER SET utf8"); return $this->DB_HANDLE; } catch (PDOException $e) { throw new Exception('Error: ' . $e->getMessage()); } } /** * Closes the DB handle. */ private function closeDbHandle() { if ($this->DB_HANDLE !== null) $this->DB_HANDLE = null; } /** * Generates hash for $id specified using Base62 encoding. * @param int $id * @return string|null */ private function generateHashForId($id) { $hash = ''; $hashDigits = array(); $dividend = (int) $id; $remainder = 0; while ($dividend > 0) { $remainder = floor($dividend % 62); $dividend = floor($dividend / 62); array_unshift($hashDigits, $remainder); } foreach ($hashDigits as $v) { $hash .= $this->BASE62_ALPHABET[$v]; } return $hash; } /** * Closes DB handle. */ public function __destruct() { $this->closeDbHandle(); } /** * Returns string identified by $hash. * @param string $hash * @return string|null */ public function getStringByHash($hash) { $dbHandle = $this->getDbHandle(); $sql = 'SELECT id, hash, url FROM phpss WHERE hash = :hash'; $sth = $dbHandle->prepare($sql, array(PDO::ATTR_CURSOR => PDO::CURSOR_FWDONLY)); $sth->execute(array(':hash' => $hash)); $entry = $sth->fetch(PDO::FETCH_ASSOC); if (!$entry || count($entry) < 1 || !isset($entry['url'])) { return null; } return $entry['url']; } /** * Returns hash identified by $string. * @param string $string * @return string|null */ public function getHashByString($string) { $dbHandle = $this->getDbHandle(); $sql = 'SELECT hash, url FROM phpss WHERE url = :url'; $sth = $dbHandle->prepare($sql, array(PDO::ATTR_CURSOR => PDO::CURSOR_FWDONLY)); $sth->execute(array(':url' => $string)); $entry = $sth->fetch(PDO::FETCH_ASSOC); if (count($entry) > 0 && isset($entry['hash'])) { //hash already exists return $entry['hash']; } return null; } /** * Adds hash identified by $string if it does not already exist. * @param string $string * @return string */ public function addHashByString($string) { $hash = $this->getHashByString($string); if ($hash !== null) { //hash already exists return $hash; } $dbHandle = $this->getDbHandle(); $sql = 'insert into phpss (id, hash, url) values(0, :hash, :url)'; $sth = $dbHandle->prepare($sql, array(PDO::ATTR_CURSOR => PDO::CURSOR_FWDONLY)); if (!$sth->execute(array(':hash' => '', ':url' => $string))) { throw new Exception('Error: failed to add entry (1)'); } $lastInsertId = $dbHandle->lastInsertId(); if (!$lastInsertId) { throw new Exception('Error: failed to add entry (2)'); } $hash = $this->generateHashForId($lastInsertId); $sql = 'update phpss set hash = :hash where id = :id'; $sth = $dbHandle->prepare($sql, array(PDO::ATTR_CURSOR => PDO::CURSOR_FWDONLY)); if (!$sth->execute(array(':hash' => $hash, ':id' => $lastInsertId))) { throw new Exception('Error: failed to add entry (3)'); } if ($hash !== null) { //hash already exists return $hash; } } }
config.php
/** * PHPStringShortener * * A simple string shortener class to demonstrate how to shorten strings, such * as URLs using Base62 encoding. * * @author Matthias Kerstner <matthias@kerstner.at> * @uses PDO * @link https://www.kerstner.at/phpStringShortener */ define('PHPSS_DBHOST', 'localhost'); define('PHPSS_DBUSER', 'root'); define('PHPSS_DBPASS', ''); define('PHPSS_DBDB', 'phpss');
index.php
/** * PHPStringShortener * * A simple string shortener class to demonstrate how to shorten strings, such * as URLs using Base62 encoding. * * @author Matthias Kerstner <matthias@kerstner.at> * @uses PDO * @link https://www.kerstner.at/phpStringShortener */ require_once './phpStringShortener.php'; $cmd = isset($_GET['cmd']) ? $_GET['cmd'] : null; if ($cmd !== null) { try { if ($cmd == 'get') { $hash = isset($_GET['hash']) ? $_GET['hash'] : null; if (!$hash) { die('No hash specified'); } $phpSS = new PhpStringShortener(); $string = $phpSS->getStringByHash($hash); //we expect strings to be URLs, so redirect now echo 'Redirecting to ' . $string; header('refresh:5;url=' . $string); exit; } else if ($cmd == 'add') { $string = isset($_GET['string']) ? $_GET['string'] : null; if (!$string) { die('No string specified'); } $phpSS = new PhpStringShortener(); $hash = $phpSS->addHashByString($string); die('Your hash for ' . $string . ' is: ' . $hash); } } catch (Exception $e) { die('Error: ' . $e->getMessage()); } }
Leave a Reply