Avoid Duplicate Content by enforcing trailing slash in URLs

Most of today’s popular Content Management Systems (CMS) support the option to use Search Engine Friendly URLs (SEF URLs). This option is either provided by using permalink structures or simply by deploying dynamic URL rewrites based on pre-defined URL schemas (which the aforementioned permalink structures basically are anyway).

Let’s take WordPress or TYPO3 for instance. Both systems ship with a SEF URL feature that can be easily customized to your needs. In WordPress you can set your required URL schemas based on the Permalink Settings as shown below:

WordPress Permalinks Settings

In TYPO3 you will want to use the popular realurl extension to set up your URL structures and various i18n settings.

Mind Duplicate Content based on multiple URLs

So you’ve set up your CMS to use pretty SEF URLs instead of parameterized ones. Nice! A common oversight when using these URL rewrite mechanisms is that each resource remains accessible, by default, through at least 3 different URL paths:

  1. URL reference by ID
  2. SEF-URL without trailing slash
  3. SEF-URL with trailing slash

In order to uniquely identify pages, posts, or resources in general, CMSs assign unique identifiers. Thus, by default your resources will be accessible by using the respective unique identifier, e.g.

http://www.yourdomain.com/?p=123

Next, when enabling and configuring SEF-URLs you also need to keep in mind that there are always two variants, the URL with and without a trailing slash:

http://www.yourdomain.com/some-page vs. http://www.yourdomain.com/some-page/

As you can imagine, having a resource accessible through multiple URLs makes your site vulnerable to the issue of Duplicate Content (DC). Thus, you should make sure that each resource is only accessible through a single URL based on a schema of your choice (with/without trailing slash, etc.).

Enforce Trailing slash URLs using .htaccess

My recommendation is to always use SEF URLs with a trailing slash and redirect all other variants to that version to avoid duplicate content and keep your URL space clean. Below you will find a snippet to enforce trailing slash URLs using .htaccess for Apache. Of course you can achieve the same behavior with any other web server too, such as nginx.

Basically, we run some preliminary checks on the current request and then redirect it to the trailing slash version when needed.

First, let’s only check GET requests here:

RewriteCond %{REQUEST_METHOD} ^GET$

Second, ignore rewrites for existing files:

RewriteCond %{REQUEST_FILENAME} !-f

Third, in case we want to exclude certain paths from rewriting:

RewriteCond %{REQUEST_URI} !^/exclude-me.*$

Fourth, check if we actually need to do a rewrite (hint: you might want to check here for your root page too):

RewriteCond %{REQUEST_URI} !^(.+)/$

Finally, do the rewrite by redirecting to the trailing slash version using an HTTP 301 redirect:

RewriteRule ^(.*[^/])$ /$1/ [L,R=301]
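
Put together, the conditions and the rule above could look roughly like the following sketch. A few things here are assumptions and not part of the steps above: the RewriteEngine On line and the IfModule wrapper are added so the block works on its own, the file is assumed to live in the document root of a site served from the domain root, and the block is assumed to sit before the rewrite rules of your CMS. /exclude-me is just the placeholder path from step three.

```apache
# Sketch only: enforce trailing slash URLs from a .htaccess file in the document root.
# Assumption: mod_rewrite is enabled; the wrapper keeps Apache from failing if it is not.
<IfModule mod_rewrite.c>
    RewriteEngine On

    # 1. Only look at GET requests (a redirected POST would typically lose its body)
    RewriteCond %{REQUEST_METHOD} ^GET$
    # 2. Skip requests that map to an existing file (images, CSS, scripts, ...)
    RewriteCond %{REQUEST_FILENAME} !-f
    # 3. Skip paths that should not be touched (placeholder path)
    RewriteCond %{REQUEST_URI} !^/exclude-me.*$
    # 4. Only continue if the URI does not already end in a slash
    RewriteCond %{REQUEST_URI} !^(.+)/$
    # 5. Redirect permanently to the trailing slash version
    RewriteRule ^(.*[^/])$ /$1/ [L,R=301]
</IfModule>
```

With these rules a GET request for /some-page should be answered with a 301 pointing to /some-page/, while /some-page/ and existing files are left alone. All RewriteCond lines are combined with AND by default, so every check has to pass before the rule fires.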

Since there is no all-in-one solution you might need to customize the snippet to your needs but I believe that you get the gist.

Final remarks: Make sure that your generated sitemaps also only use the actual URL variant that you’ve decided to use, for instance the trailing slash version. Otherwise the various search engine crawlers will not be amused to be redirected on every entry in your sitemap. Also, make sure that you don’t use !^(.*)/$ to check for existing trailing slash URLs, as this expression also matches the root directory (use + instead of * in the expression so that it only matches one or more characters in front of the trailing slash).
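
To make the difference between the two expressions a bit more tangible, here is a small side-by-side sketch. The second condition is commented out on purpose; it only illustrates the variant that the remark above advises against.

```apache
# Pattern ^(.+)/$ needs at least one character before the final slash:
# it matches /some-page/ but not the root path /
RewriteCond %{REQUEST_URI} !^(.+)/$

# Pattern ^(.*)/$ allows zero characters before the final slash:
# it matches /some-page/ and also the root path /
# RewriteCond %{REQUEST_URI} !^(.*)/$
```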

Enjoy!
