Sample SEO Magento robots.txt file

Since I get a lot of requests for a robots.txt file designed for Magento SEO, here is a sample to get you started. This Magento robots.txt makes the following assumptions:

  • We don’t differentiate between search engines, hence User-agent: *
  • We allow assets to be crawled
    • i.e. images, CSS and JavaScript files
  • We only allow SEF (search engine friendly) URLs set in Magento
    • e.g. no direct access to the front controller index.php, view categories and products by ID, etc.
  • We don’t allow filter URLs
    • Please note: The list provided is not complete. In case you have custom extensions that use filtering, make sure to include their filter URLs and parameters in the filter URLs section.
  • We don’t allow session related URL segments
    • e.g. product comparison, customer, etc.
  • We don’t allow specific files to be crawled
    • e.g. READMEs, cron related files, etc.

Magento robots.txt

Enough of the talking, here is your Magento SEO robots.txt:

# Crawlers Setup
User-agent: *

# Directories
Disallow: /app/
Disallow: /cgi-bin/
Disallow: /downloader/
Disallow: /includes/
Disallow: /lib/
Disallow: /pkginfo/
Disallow: /report/
Disallow: /shell/
Disallow: /var/

# Paths (clean URLs)
Disallow: /index.php/
Disallow: /catalog/product_compare/
Disallow: /catalog/category/view/
Disallow: /catalog/product/view/
Disallow: /catalogsearch/
#Disallow: /checkout/
Disallow: /control/
Disallow: /contacts/
Disallow: /customer/
Disallow: /customize/
Disallow: /newsletter/
Disallow: /poll/
Disallow: /review/
Disallow: /sendfriend/
Disallow: /tag/
Disallow: /wishlist/
Disallow: /catalog/product/gallery/

# Misc. files you don’t want search engines to crawl
Disallow: /cron.php
Disallow: /composer.json
Disallow: /LICENSE.html
Disallow: /LICENSE.txt
Disallow: /LICENSE_AFL.txt
Disallow: /STATUS.txt
Disallow: /mage
#Disallow: /modman
#Disallow: /n98-magerun.phar
Disallow: /*.php$

# Disallow filter urls
Disallow: /*?min*
Disallow: /*?max*
Disallow: /*?q*
Disallow: /*?cat*
Disallow: /*?manufacturer_list*
Disallow: /*?tx_indexedsearch
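
If you want to sanity-check wildcard rules like the filter patterns above, a few lines of PHP can emulate the matching most crawlers apply. This is a rough sketch: robots_blocks() is a hypothetical helper, not part of any library, and real crawlers add more nuances such as longest-match precedence between Allow and Disallow rules.

```php
<?php
// Hypothetical helper: convert a robots.txt pattern (supporting * and a
// trailing $ anchor) into a regex and test whether it blocks a given path.
function robots_blocks($pattern, $path) {
    $anchored = substr($pattern, -1) === '$';
    if ($anchored) {
        $pattern = substr($pattern, 0, -1);
    }
    // escape regex metacharacters, then turn the robots.txt * back into .*
    $regex = '#^' . str_replace('\*', '.*', preg_quote($pattern, '#')) . ($anchored ? '$#' : '#');
    return (bool) preg_match($regex, $path);
}

var_dump(robots_blocks('/*?min*', '/shoes?min=10'));  // bool(true)
var_dump(robots_blocks('/*.php$', '/index.php'));     // bool(true)
var_dump(robots_blocks('/catalog/', '/category/'));   // bool(false)
```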

Feel free to leave comments below for additional remarks and suggestions for improvement.

SEO Key Metrics and Factors for implementing an SEO quick check tool

In case you are searching for ways to implement a simple but useful SEO quick check tool, here are some tips on the key metrics and factors required to get you started. The SEO quick check tool itself is based on PHP, providing a proof of concept that can easily be adapted and extended. First, let’s have a quick review of the SEO key metrics and factors.

SEO Key Metrics and Factors

Most of the work related to SEO happens at the HTML code level. Thus, inspecting the HTML code via its DOM tree plays an important part when conducting and evaluating SEO checks. A few of the key metrics and factors relevant to SEO are:

  1. Title
  2. Meta-description
  3. Meta-keywords (not really relevant anymore, but included for the sake of completeness)
  4. OpenGraph Meta-tags (as an alternative or addition to traditional meta-tags)
  5. Additional general Meta-tags (locale, Google webmaster tools verification, etc.)
  6. Heading tags <h*> and their ordering
  7. Alternate text attributes for images
  8. Microdata

Based on the underlying HTML code the following metrics can be calculated:

  1. Length and “quality” of data provided
  2. Data volume
  3. Text to HTML ratio
  4. Loading time
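
The text-to-HTML ratio can be made concrete with a few lines of PHP. This is a simplified sketch (text_to_html_ratio() is a hypothetical helper); note that strip_tags() keeps the contents of <script> and <style> blocks, so strip those first for serious use.

```php
<?php
// Simplified sketch of the text-to-HTML ratio metric: visible text length
// divided by total markup length. A low ratio hints at markup-heavy pages.
function text_to_html_ratio($html) {
    if (strlen($html) === 0) {
        return 0.0;
    }
    $text = trim(strip_tags($html));
    return round(strlen($text) / strlen($html), 2);
}

echo text_to_html_ratio('<html><body><p>Hello World</p></body></html>'); // 0.25
```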

Apart from these core metrics make sure that the general syntax is correct and matches the W3C standards:

  1. W3C validation

You can even go one step further and validate against the Web Content Accessibility Guidelines (WCAG):

  1. WCAG validation (level A-AAA)

In addition to the HTML generated, make sure to provide search engines enough information on the pages available to be indexed and those that should be left aside, e.g. by providing an XML sitemap and a robots.txt file:

  1. XML sitemap
  2. robots.txt

The XML sitemap can either be a sitemap index consisting of multiple sitemaps, where each refers, for instance, to a specific page type (posts vs. pages), or a simple list of URLs. Link metrics in turn can be differentiated into site-internal and external links:

  1. internal links
  2. external links

When it comes to linking and SEO, acquiring link juice is the ultimate goal you should be going for. By getting backlinks from preferably established websites, link juice is transferred back to your site, thus strengthening it. This list is not complete, and there are loads of details you need to keep in mind when dealing with SEO. Nevertheless, this post is about implementing an SEO quick check tool, right?
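
The internal/external link split above can be sketched with PHP’s built-in DOM extension. classify_links() is a hypothetical helper; the sketch assumes that links without a host part, i.e. relative URLs, count as internal.

```php
<?php
// Hypothetical helper: split all <a href> links in an HTML document into
// internal and external ones by comparing each link host to the site host.
function classify_links($html, $host) {
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings on sloppy real-world markup
    $internal = array();
    $external = array();
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        $linkHost = parse_url($href, PHP_URL_HOST);
        if ($linkHost === null || $linkHost === $host) {
            $internal[] = $href; // relative or same-host link
        } else {
            $external[] = $href;
        }
    }
    return array('internal' => $internal, 'external' => $external);
}
```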

Implementing an SEO quick check tool

The following presents a proof of concept for implementing an SEO quick check tool written in PHP. Feel free to use it as a foundation. First of all, let’s assemble our toolset to save us a lot of trouble parsing and evaluating the DOM tree.


cURL

Of course there also exists a PHP extension for cURL. Make sure that the corresponding extension is activated in your php.ini. We will be using cURL for getting various remote assets, starting with the website HTML code itself:

function curl_get($url) {
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_TIMEOUT, 30);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

  $start = microtime(true);
  $result = curl_exec($ch);
  $end = microtime(true);

  if ($result === false) {
    $error = curl_error($ch);
    curl_close($ch);
    return array('error' => $error);
  }

  curl_close($ch);

  return array('data' => $result, 'duration' => round($end - $start, 2));
}

This function will be used throughout the SEO quick check tool and returns an array containing the fields

  1. data: data received based on $url
  2. duration: for benchmarking loading duration
  3. error: set in case something went wrong, otherwise not present

In case you need additional headers, etc. feel free to adjust this function to your needs.


Simple HTML DOM Parser

Once we have the HTML code, we need to parse it into a DOM tree that we can evaluate. For PHP there exists a handy tool called Simple HTML DOM Parser that does a nice job parsing HTML code into a DOM:

$htmlDOM = str_get_html($html);

Yes, that’s all you need to parse the HTML code into a DOM object, which we will use to evaluate various tags. Please refer to the Simple HTML DOM Parser Manual for more information on how to use this tool.
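
In case you would rather avoid the extra dependency, PHP’s bundled DOM and XPath extensions can extract the most important tags as well. This is just a sketch: extract_seo_tags() is a hypothetical helper, and the XPath expressions cover only the title and meta description.

```php
<?php
// Sketch using PHP's bundled DOM extension instead of Simple HTML DOM
// Parser: pull the title and meta description out of the parsed tree.
function extract_seo_tags($html) {
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings on sloppy real-world markup
    $xpath = new DOMXPath($dom);

    return array(
        'title'       => $xpath->evaluate('string(//title)'),
        'description' => $xpath->evaluate('string(//meta[@name="description"]/@content)'),
    );
}
```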


SimpleXML

When dealing with XML in PHP, SimpleXML is definitely the way to go. We will be using SimpleXML for parsing XML sitemaps. First, we need to check if an XML sitemap is present by inspecting the robots.txt, and then we will be using the cURL function defined above to retrieve the sitemap for further inspection.

Check robots.txt

$robotsTxtResponse = curl_get($robotsUrl); //$url + "/robots.txt", use e.g. parse_url() to assemble URL correctly
$robotsTxt = $robotsTxtResponse['data']; //make sure to check if 'error' is not set in the response

So, let’s assume that robots.txt exists and the content is available through $robotsTxt.

Load XML Sitemap

Based on the contents of robots.txt we can check if an XML sitemap is present:

$siteMapUrl = null;
$siteMapMatches = array();

if (preg_match('#Sitemap:\s*(.+)$#m', $robotsTxt, $siteMapMatches)) {
  // we got ourselves a sitemap URL in $siteMapMatches[1]
  $siteMapUrl = trim($siteMapMatches[1]);
}

Assuming we have determined a sitemap URL in $siteMapUrl above, our next step is to check whether it’s a plain sitemap or a sitemap index, i.e. a list of sitemaps for various content types such as pages, posts, categories, etc.

// load sitemap
$siteMapResponse = curl_get($siteMapUrl);
$siteMapData = $siteMapResponse['data']; //again, check if 'error' is set in the response

$isSitemapIndex = false;
$sitemaps = array();
$sitemapUrls = array();

if (preg_match('/<urlset/', $siteMapData)) { // plain sitemap
  $sitemapUrlIndex = new SimpleXMLElement($siteMapData);
  if (isset($sitemapUrlIndex->url)) {
    foreach ($sitemapUrlIndex->url as $v) {
      $sitemapUrls[] = (string) $v->loc;
    }
  }
} else if (preg_match('/<sitemapindex/', $siteMapData)) { // sitemap index
  $sitemapIndex = new SimpleXMLElement($siteMapData);

  if (isset($sitemapIndex->sitemap)) {
    $isSitemapIndex = true;
    foreach ($sitemapIndex->sitemap as $v) {
      $sitemaps[] = (string) $v->loc;
    }
  }
}
Depending on the contents of the original sitemap, this snippet collects either the URLs of a plain sitemap or the nested sitemaps of a sitemap index.
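
For the sitemap index case the nested sitemaps still need to be fetched (again via curl_get()) and parsed; that per-sitemap step can be factored out like this (sitemap_urls() is a hypothetical helper):

```php
<?php
// Sketch: given one fetched sitemap's XML, collect the <loc> entries.
// For a sitemap index, call curl_get() on every URL in $sitemaps first
// and feed each response's 'data' field to this function.
function sitemap_urls($xmlData) {
    $urls = array();
    $urlset = new SimpleXMLElement($xmlData);
    foreach ($urlset->url as $url) {
        $urls[] = (string) $url->loc;
    }
    return $urls;
}
```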

W3C Validator

In order to validate a URL for W3C conformity you can use the handy w3c-validator by micheh. The code required to run this validator is pretty simple:

$validator = new \W3C\HtmlValidator();
$result = $validator->validateInput($html); // $html from above

if ($result->isValid()) {
  // Hurray! no errors found :)
} else {
  // Hmm... check failed
}
Again, please refer to the w3c-validator documentation for more information.

Google Web Search API

Although technically deprecated, the Google Web Search API is still handy to quickly generate the search preview:

// use the user's IP to reduce server-to-server requests
// note: the endpoint below is the (deprecated) Google Web Search API
$googleWebSearchApiUrl = "https://ajax.googleapis.com/ajax/services/search/web?v=1.0&"
 . "q=site:" . urlencode($url) . "&userip=" . $_SERVER['REMOTE_ADDR'];

$googleWebSearchApiResponse = curl_get($googleWebSearchApiUrl);
// do some checks here whether the request succeeded
$searchResultData = json_decode($googleWebSearchApiResponse['data'], true);

// access data from response
$searchResultCount = $searchResultData['responseData']['cursor']['resultCount'];
$searchResults = $searchResultData['responseData']['results'];


As you can see, implementing a basic SEO quick check tool can be achieved with a small set of tools and libraries. Furthermore, based on the key metrics determined, you are able to quickly identify potential SEO problems.

Live Demo

Enough of the theoretical information? Ok! Head over to the Both Interact SEO Quick Check Tool for a live demonstration of this SEO Quick Check Tool. In case you like it, feel free to drop a comment.

Configure robots.txt for Realurl in TYPO3

In order to configure robots.txt for the Realurl extension in TYPO3, you need to do two things:

  1. Add filename for page type 201 in realurl_config.php
  2. Add some TypoScript to process robots.txt generation

Add filename for page type in realurl_config.php

$TYPO3_CONF_VARS['EXTCONF']['realurl']['_DEFAULT'] = array(
    // configure filenames for different page types
    'fileName' => array(
        'defaultToHTMLsuffixOnPrev' => 0,
        'index' => array(
            'print.html' => array(
                'keyValues' => array(
                    'type' => 98,
                ),
            ),
            // add robots.txt page type
            'robots.txt' => array(
                'keyValues' => array(
                    'type' => 201,
                ),
            ),
        ),
    ),
);
Add TypoScript to process robots.txt generation

robots = PAGE
robots {
  typeNum = 201
  config {
    disableAllHeaderCode = 1
    additionalHeaders = Content-type:text/plain
  }
  10 = TEXT
  10 {
    value (
User-Agent: *
Allow: /                #allow everything by default
Disallow: /fileadmin/templates     #then add restrictions
Disallow: /typo3/
Disallow: /t3lib/
Disallow: /typo3conf/
Disallow: /typo3temp/
Disallow: /*?id=*

User-agent: googlebot  # Google specific settings
Disallow: /*?tx_indexedsearch

Sitemap: /?eID=dd_googlesitemap # finally add some sitemap
    )
  }
}

Be sure to flush the cache and you are all set!