
Avoid Duplicate Content by enforcing trailing slash in URLs


Most of today’s popular Content Management Systems (CMS) support the option to use Search Engine Friendly URLs (SEF URLs). This option is provided either via permalink structures or simply by deploying dynamic URL rewrites based on pre-defined URL schemas (which is what permalink structures basically are anyway).

Let’s take WordPress or TYPO3 for instance. Both systems ship with a SEF URL feature that can be easily customized to your needs. In WordPress you can set your required URL schemas based on the Permalink Settings as shown below:

WordPress Permalinks Settings

In TYPO3 you will want to use the popular realurl extension to set up your URL structures and various i18n settings.

Mind Duplicate Content based on multiple URLs

So you’ve set up your CMS to use pretty SEF URLs instead of parameterized ones. Nice! A common mistake when using these URL rewrite mechanisms is that each resource is by default accessible through at least three different URL paths:

  1. URL reference by ID
  2. SEF-URL without trailing slash
  3. SEF-URL with trailing slash

In order to uniquely identify pages, posts, or resources in general, CMSs use unique identifiers. Thus, by default your resources will be accessible via their respective unique identifier, e.g.

http://www.yourdomain.com/?p=123

Next, when enabling and configuring SEF-URLs you also need to keep in mind that there are always two variants, the URL with and without a trailing slash:

http://www.yourdomain.com/some-page vs. http://www.yourdomain.com/some-page/

As you can imagine, having a resource accessible through multiple URLs makes your site vulnerable to the issue of Duplicate Content (DC). Thus, you should make sure that each resource is only accessible through a single URL based on a schema of your choice (with/without trailing slash, etc.).

Enforce Trailing slash URLs using .htaccess

My recommendation is to always use SEF URLs with a trailing slash and to block access to all other variants in order to avoid duplicate content and keep your URL space clean. Below you find a snippet to enforce trailing slash URLs using .htaccess for Apache. Of course, you can achieve the same behavior with any other web server, like nginx.

Basically, we perform some preliminary checks on the current request and finally redirect to the trailing slash version when needed.

First, let’s only check GET requests here:

RewriteCond %{REQUEST_METHOD} ^GET$

Second, ignore rewrites for existing files:

RewriteCond %{REQUEST_FILENAME} !-f

Third, in case we want to exclude certain paths from rewriting:

RewriteCond %{REQUEST_URI} !^/exclude-me.*$

Fourth, check if we actually need to do a rewrite (hint: you might want to check here for your root page too):

RewriteCond %{REQUEST_URI} !^(.+)/$

Finally, do the rewrite by redirecting to the trailing slash version using an HTTP 301 redirect:

RewriteRule ^(.*[^/])$ /$1/ [L,R=301]
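
Putting the pieces together, the complete snippet could look like this (a sketch only, assuming mod_rewrite is enabled; /exclude-me is just a placeholder for your excluded paths):

RewriteEngine On

# only handle GET requests
RewriteCond %{REQUEST_METHOD} ^GET$
# ignore existing files
RewriteCond %{REQUEST_FILENAME} !-f
# exclude certain paths from rewriting
RewriteCond %{REQUEST_URI} !^/exclude-me.*$
# only rewrite if there is no trailing slash yet
RewriteCond %{REQUEST_URI} !^(.+)/$
# redirect to the trailing slash version
RewriteRule ^(.*[^/])$ /$1/ [L,R=301]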

Since there is no all-in-one solution, you might need to customize the snippet to your needs, but I believe you get the gist.

Final remarks: Make sure that your generated sitemaps also contain only the URL variant you’ve decided to use, for instance the trailing slash version. Otherwise, the various search engine crawlers will not be amused to be redirected on every entry in your sitemap. Also, make sure that you don’t use !^(.*)/$ to check for existing trailing slash URLs, as this expression also matches the root directory (use + instead of * so that the expression only matches one or more characters in front of the trailing slash).

Enjoy!


Sample SEO Magento robots.txt file


Since I get a lot of requests for a robots.txt file designed for Magento SEO, here is a sample to get you started. This Magento robots.txt makes the following assumptions:

  • We don’t differentiate between search engines, hence User-agent: *
  • We allow assets to be crawled
    • i.e. images, CSS and JavaScript files
  • We only allow SEF URLs set in Magento
    • e.g. no direct access to the front controller index.php, no viewing of categories and products by ID, etc.
  • We don’t allow filter URLs
    • Please note: The list provided is not complete. In case you have custom extensions that use filtering, make sure to include their filter URLs and parameters in the filter URLs section.
  • We don’t allow session related URL segments
    • e.g. product comparison, customer, etc.
  • We don’t allow specific files to be crawled
    • e.g. READMEs, cron related files, etc.

Magento robots.txt

Enough of the talking, here comes your SEO Magento robots.txt:

# Crawlers Setup
User-agent: *

# Directories
Disallow: /app/
Disallow: /cgi-bin/
Disallow: /downloader/
Disallow: /includes/
Disallow: /lib/
Disallow: /pkginfo/
Disallow: /report/
Disallow: /shell/
Disallow: /var/

# Paths (clean URLs)
Disallow: /index.php/
Disallow: /catalog/product_compare/
Disallow: /catalog/category/view/
Disallow: /catalog/product/view/
Disallow: /catalogsearch/
#Disallow: /checkout/
Disallow: /control/
Disallow: /contacts/
Disallow: /customer/
Disallow: /customize/
Disallow: /newsletter/
Disallow: /poll/
Disallow: /review/
Disallow: /sendfriend/
Disallow: /tag/
Disallow: /wishlist/
Disallow: /catalog/product/gallery/

# Misc. files you don’t want search engines to crawl
Disallow: /cron.php
Disallow: /cron.sh
Disallow: /composer.json
Disallow: /LICENSE.html
Disallow: /LICENSE.txt
Disallow: /LICENSE_AFL.txt
Disallow: /STATUS.txt
Disallow: /mage
#Disallow: /modman
#Disallow: /n98-magerun.phar
Disallow: /scheduler_cron.sh
Disallow: /*.php$

# Disallow filter urls
Disallow: /*?min*
Disallow: /*?max*
Disallow: /*?q*
Disallow: /*?cat*
Disallow: /*?manufacturer_list*
Disallow: /*?tx_indexedsearch

Feel free to leave comments below for additional remarks and suggestions for improvement.


SEO Key Metrics and Factors for implementing an SEO quick check tool


In case you are searching for ways to implement a simple but useful SEO quick check tool, here are some tips on the key metrics and factors required to get you started. The SEO quick check tool itself is based on PHP to provide a proof of concept that can easily be adapted and extended. First, let’s have a quick review of the SEO key metrics and factors.

SEO Key Metrics and Factors

First, most of the work related to SEO happens at the HTML code level. Thus, inspecting the HTML code via its DOM tree plays an important part when conducting and evaluating SEO checks. A few of the key metrics and factors relevant to SEO are:

  1. Title
  2. Meta-description
  3. Meta-keywords (not really anymore but for the sake of completeness)
  4. OpenGraph Meta-tags (as alternative or addition to traditional meta-tags)
  5. Additional general Meta-tags (locale, Google webmaster tools verification, etc.)
  6. Headers <h*> and their ordering
  7. Alternate text attributes for images
  8. Microdata

Based on the underlying HTML code, the following metrics can be calculated:

  1. Length and “quality” of data provided
  2. Data volume
  3. Text to HTML ratio (see the sketch after this list)
  4. Loading time
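
For instance, a crude text-to-HTML ratio and the data volume could be computed like this (a simplified sketch using plain PHP string functions; $html is assumed to hold the page’s HTML code, and the loading time can later be taken from the measured request duration):

// crude text-to-HTML ratio: length of the visible text vs. length of the full HTML code
function text_to_html_ratio($html) {
  $htmlLength = strlen($html);
  if ($htmlLength === 0) {
    return 0.0;
  }

  // drop script and style blocks, then strip the remaining tags to approximate the visible text
  $text = preg_replace('#<(script|style)[^>]*>.*?</\1>#si', '', $html);
  $text = trim(strip_tags($text));

  return round(strlen($text) / $htmlLength, 2);
}

// data volume in kilobytes
function data_volume_kb($html) {
  return round(strlen($html) / 1024, 2);
}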

Apart from these core metrics make sure that the general syntax is correct and matches the W3C standards:

  1. W3C validation

You should even go one step further and validate against the Web Content Accessibility Guidelines (WCAG):

  1. WCAG validation (level A-AAA)

In addition to the generated HTML, make sure to provide search engines with enough information on the pages available to be indexed and those that should be left aside, i.e. by providing an XML sitemap and a robots.txt file:

  1. XML sitemap
  2. robots.txt

The XML sitemap can either be a sitemap index consisting of multiple sitemaps, where each sitemap refers, for instance, to a specific page type (posts vs. pages), or a simple list of URLs. Link metrics, in turn, can be differentiated into site-internal and external links:

  1. internal links
  2. external links

When it comes to linking and SEO, acquiring link juice is the ultimate goal you should be going for. By getting backlinks, preferably from established websites, link juice is transferred back to your site, thus strengthening it. This list is not complete and there are loads of details you need to keep in mind when dealing with SEO. Nevertheless, this post is about implementing an SEO quick check tool, right?

Implementing an SEO quick check tool

The following presents a proof of concept for implementing a SEO quick check tool written in PHP. Feel free to use it as a foundation. First of all, let’s assemble our toolset to save us a lot of trouble parsing and evaluating the DOM tree.

cURL

Of course, there also exists a PHP extension for cURL. Make sure that the corresponding extension is activated in your php.ini. We will be using cURL to fetch various remote assets, starting with the website’s HTML code itself:

 
function curl_get($url) {
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_TIMEOUT, 30);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

  $start = microtime(true);
  $result = curl_exec($ch);

  if ($result === false) {
    $error = curl_error($ch);
    curl_close($ch);
    return array('error' => $error);
  }

  curl_close($ch);
  $end = microtime(true);

  return array('data' => $result, 'duration' => round($end - $start, 2));
}

This function will be used throughout the SEO quick check tool and returns an array containing the fields

  1. data: data received based on $url
  2. duration: for benchmarking loading duration
  3. error: set in case something went wrong, otherwise absent

In case you need additional headers, etc. feel free to adjust this function to your needs.
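
For example, fetching a page and guarding against errors could look like this (the URL and variable names are just placeholders):

$url = 'http://www.yourdomain.com/'; // placeholder URL

$response = curl_get($url);

if (isset($response['error'])) {
  die('Request failed: ' . $response['error']);
}

$html = $response['data'];            // raw HTML code of the page
$loadingTime = $response['duration']; // loading time in seconds, used as a metric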

Simple HTML DOM

Once we have the HTML code, we need to parse it into a DOM tree that we can evaluate. For PHP there exists a handy tool called Simple HTML DOM Parser that does a nice job parsing HTML code into a DOM:

$htmlDOM = str_get_html($html);

Yes, that’s all you need to parse the HTML code into a DOM object, which we will use to evaluate various tags. Please refer to the Simple HTML DOM Parser Manual for more information on how to use this tool.
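
As a minimal sketch (assuming the usual Simple HTML DOM selectors and attribute accessors behave as documented), extracting some of the key factors listed earlier could look like this:

// title tag
$titleTag = $htmlDOM->find('title', 0);
$title = $titleTag ? trim($titleTag->plaintext) : '';

// meta description
$metaDescriptionTag = $htmlDOM->find('meta[name=description]', 0);
$metaDescription = $metaDescriptionTag ? trim($metaDescriptionTag->content) : '';

// headers <h1> to <h6>, grouped by level (checking their ordering is left to you)
$headers = array();
foreach (array('h1', 'h2', 'h3', 'h4', 'h5', 'h6') as $level) {
  foreach ($htmlDOM->find($level) as $header) {
    $headers[$level][] = trim($header->plaintext);
  }
}

// images missing an alternate text attribute
$imagesWithoutAlt = array();
foreach ($htmlDOM->find('img') as $img) {
  if (!$img->hasAttribute('alt') || trim($img->getAttribute('alt')) === '') {
    $imagesWithoutAlt[] = $img->src;
  }
}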

SimpleXML

When dealing with XML in PHP, SimpleXML is definitely the way to go. We will be using SimpleXML for parsing XML sitemaps. First, we need to check if an XML sitemap is present by inspecting the robots.txt, and then we will use the cURL function defined above to retrieve the sitemap for further inspection.

Check robots.txt

 
$robotsTxtResponse = curl_get($robotsUrl); // $robotsUrl = $url . "/robots.txt", use e.g. parse_url() to assemble the URL correctly
$robotsTxt = $robotsTxtResponse['data']; // make sure to check if 'error' is not set in the response

So, let’s assume that robots.txt exists and the content is available through $robotsTxt.

Load XML Sitemap

Based on the contents of robots.txt, we can check if an XML sitemap is present:

 
$siteMapUrl = null;
$siteMapMatches = array();

// look for a "Sitemap:" directive in robots.txt
if (preg_match('#^Sitemap:\s*(.+)$#mi', $robotsTxt, $siteMapMatches)) {
  // we got ourselves a sitemap URL in $siteMapMatches[1]
  $siteMapUrl = trim($siteMapMatches[1]);
}

Let’s assume we have a sitemap URL determined above in $siteMapUrl. Our next step is to check whether it’s a plain sitemap or a sitemap index, i.e. a list of sitemaps for various content types such as pages, posts, categories, etc.

 
// load sitemap
$siteMapResponse = curl_get($siteMapUrl);
$siteMapData = $siteMapResponse['data']; // make sure to check if 'error' is not set in the response

$isSitemapIndex = false;
$sitemaps = array();
$sitemapUrls = array();

if (preg_match('/<urlset/', $siteMapData)) { // plain sitemap
  $sitemapUrlIndex = new SimpleXMLElement($siteMapData);

  if (isset($sitemapUrlIndex->url)) {
    foreach ($sitemapUrlIndex->url as $v) {
      $sitemapUrls[] = (string) $v->loc;
    }
  }
} else if (preg_match('/<sitemapindex/', $siteMapData)) { // sitemap index
  $sitemapIndex = new SimpleXMLElement($siteMapData);

  if (isset($sitemapIndex->sitemap)) {
    $isSitemapIndex = true;
    foreach ($sitemapIndex->sitemap as $v) {
      $sitemaps[] = (string) $v->loc;
    }
  }
}

Depending on the contents of the original sitemap, this snippet either collects the URLs of a plain sitemap or the nested sitemaps referenced by a sitemap index.
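
In case of a sitemap index, the nested sitemaps can then be fetched and parsed the same way, for instance like this (a sketch reusing the curl_get() function and SimpleXML from above):

if ($isSitemapIndex) {
  foreach ($sitemaps as $nestedSitemapUrl) {
    $nestedResponse = curl_get($nestedSitemapUrl);

    if (isset($nestedResponse['error'])) {
      continue; // skip sitemaps that could not be loaded
    }

    $nestedSitemap = new SimpleXMLElement($nestedResponse['data']);

    if (isset($nestedSitemap->url)) {
      foreach ($nestedSitemap->url as $v) {
        $sitemapUrls[] = (string) $v->loc;
      }
    }
  }
}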

W3C Validator

In order to validate a URL for W3C conformity, you can use the handy w3c-validator by micheh. The code required to run this validator is pretty simple:

 
$validator = new \W3C\HtmlValidator();
$result = $validator->validateInput($html); // $html from above

if ($result->isValid()) {
  // Hurray! no errors found :)
} else {
  // Hmm... check failed
  //$result->getErrorCount()
  //$result->getWarningCount()
}

Again, please refer to the w3c-validator documentation for more information.

Google Web Search API

Although technically deprecated, the Google Web Search API is still handy for quickly generating the search preview:

 
// use the user's IP to reduce server-to-server requests
$googleWebSearchApiUrl = "https://ajax.googleapis.com/ajax/services/search/web?v=1.0&"
  . "q=site:" . urlencode($url) . "&userip=" . $_SERVER['REMOTE_ADDR'];

$googleWebSearchApiResponse = curl_get($googleWebSearchApiUrl);
$searchResultData = json_decode($googleWebSearchApiResponse['data'], true); // do some checks here if the request succeeded

// access data from the response
$searchResults = $searchResultData['responseData']['cursor']['resultCount'];
$searchResultAdditionalData = $searchResultData['responseData']['results'];

Conclusion

As you can see, implementing a basic SEO quick check tool can be achieved with a small set of tools and libraries. Furthermore, based on the key metrics determined, you are able to quickly identify potential SEO problems.

Live Demo

Enough of the theoretical information? Ok! Head over to the Both Interact SEO Quick Check Tool for a live demonstration. In case you like it, feel free to drop a comment.