Here are a few tips for keeping search engines happy with your Big Medium-powered site and, along the way, improving security to boot. These tips involve tweaking your site's robots.txt and .htaccess files.
What's robots.txt?
The robots.txt file is a plain-text file that you add to your web root directory to tell search engines which files and directories you do not want them to index. Well-behaved search engines check this file before crawling your site. For all the juicy details, check out the robots.txt site.
Two important rules to remember:
Always keep your robots.txt file in your domain's top-level directory:
http://www.example.com/robots.txt
Inside the file, you never list the full URL of a resource, except for the sitemap, which we’ll get to in a moment; other items should use only the address that comes after the domain name, including the slash. For example, if you're referring to http://www.example.com/page.html, you would use:
/page.html
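If you have a batch of full URLs and want the form that robots.txt expects, a few lines of Python can do the trimming. This is only a minimal sketch; the helper name robots_path is made up for illustration and is not part of Big Medium.

```python
from urllib.parse import urlsplit

def robots_path(url):
    """Return the part of a URL that a robots.txt rule should use."""
    parts = urlsplit(url)
    # Keep everything after the domain name, starting with the slash.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return path

print(robots_path("http://www.example.com/page.html"))  # prints /page.html
```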
Okay, on to business...
Big Medium and robots.txt
You can use robots.txt to ask search engines not to index some of Big Medium's support files and, even better, to point them to the location of your site's sitemap.
Follow the instructions embedded in the comments to make the entries match the locations on your site. If you already have a robots.txt file for your site, you can add this to the existing file.
User-agent: *
# SEARCH ENGINE SITEMAP
# Big Medium builds a sitemap that tells search engines
# the location of all of the pages on your site.
# If your site's homepage directory is in your web root:
# http://www.example.com
#
# ...then your sitemap is located here:
# http://www.example.com/bm~sitemap_index.xml
#
# Unlike other items in robots.txt, the sitemap uses the full URL:
Sitemap: http://www.example.com/bm~sitemap_index.xml
#block big medium directories ------------------
Disallow: /cgi-bin/moxiebin/
Disallow: /bmadmin/
#block page-directory support files ------------
#these examples assume that your page directory
#is located at: http://www.example.com/bm
Disallow: /bm/bm~comments/
Disallow: /bm/bm~theme/
Disallow: /bm/bm~assets/
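Before uploading, you may want to check that the rules block what you expect. Here is a rough sketch using Python's standard urllib.robotparser module against a trimmed copy of the rules above. The page and stylesheet names being tested (index.shtml, style.css) are hypothetical, and site_maps() assumes Python 3.8 or later. Keep in mind that real crawlers may interpret rules a little differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# A trimmed copy of the rules above, using the same example locations.
RULES = """\
User-agent: *
Sitemap: http://www.example.com/bm~sitemap_index.xml
Disallow: /cgi-bin/moxiebin/
Disallow: /bmadmin/
Disallow: /bm/bm~comments/
Disallow: /bm/bm~theme/
Disallow: /bm/bm~assets/
"""

def load_rules(text):
    """Parse robots.txt text and return the ready-to-query parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = load_rules(RULES)
base = "http://www.example.com"

# Hypothetical paths: one regular page and two blocked locations.
for path in ("/bm/index.shtml", "/bm/bm~theme/style.css", "/bmadmin/"):
    verdict = "allowed" if parser.can_fetch("*", base + path) else "blocked"
    print(path, verdict)

# The sitemap entry should come back as a full URL.
print(parser.site_maps())
```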
If you don't want search engines to index images or documents on your site, you can also add this:
#block indexing of documents and images --------
Disallow: /bm/bm~doc/
Disallow: /bm/bm~pix/
Some folks prefer not to have their printer-friendly or email-this-page pages indexed. If that's you, you can add this:
#block indexing of email and print pages -------
Disallow: /*~email.shtml
Disallow: /*~print.shtml
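The * in those two rules is a wildcard that stands for any run of characters, so the rules catch email and print pages in any directory. To see which addresses a wildcard rule would cover, here is a small sketch of that style of matching. It is an approximation of how wildcard-aware crawlers read such rules, not any search engine's actual code; the function name and the story page names are made up for the example.

```python
import re

def wildcard_rule_matches(rule, path):
    """Check a URL path against a robots.txt rule that may contain '*'.

    '*' stands for any run of characters, and a trailing '$' anchors
    the end of the path; otherwise the rule is a prefix match.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape each literal chunk, then join the chunks with '.*'.
    pattern = ".*".join(re.escape(chunk) for chunk in rule.split("*"))
    if anchored:
        pattern += "$"
    return re.match(pattern, path) is not None

RULES = ("/*~email.shtml", "/*~print.shtml")

# Hypothetical page addresses for illustration.
for path in ("/bm/news/story~email.shtml",
             "/bm/news/story~print.shtml",
             "/bm/news/story.shtml"):
    blocked = any(wildcard_rule_matches(rule, path) for rule in RULES)
    print(path, "blocked" if blocked else "allowed")
```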
Big Medium and .htaccess
If your site is hosted on an Apache server (almost certainly the case if it's a Unix-flavored server), you can usually specify configuration preferences with a file named .htaccess. Just add a file to this location:
http://www.example.com/.htaccess
Here are some recommended directives to use (if a file already exists at that location, you can add this to the existing file):
#improve security by preventing server-side
#includes from executing code
Options +IncludesNOEXEC
#don't let people peek into directories without an index file
Options -Indexes
#Custom error document for page not found;
#the following displays a page at http://example.com/error.html
ErrorDocument 404 /error.html
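If you already have an .htaccess file, you can paste these lines in by hand. For those who would rather script it, here is a hedged sketch that appends only the directives that are not already present. The merge_htaccess helper is hypothetical, it compares whole lines only, and it does not understand Apache syntax beyond that, so review the result before uploading it.

```python
# The directives recommended above.
RECOMMENDED = [
    "Options +IncludesNOEXEC",
    "Options -Indexes",
    "ErrorDocument 404 /error.html",
]

def merge_htaccess(existing, recommended=RECOMMENDED):
    """Return .htaccess text with any missing recommended lines appended."""
    lines = existing.splitlines()
    # Compare case-insensitively and ignore surrounding whitespace.
    present = {line.strip().lower() for line in lines}
    for directive in recommended:
        if directive.lower() not in present:
            lines.append(directive)
    return "\n".join(lines) + "\n"

# Example: an existing file that already turns off directory listings.
print(merge_htaccess("Options -Indexes\n"), end="")
```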
Tags: hosting, htaccess, search, seo, server
Comments
3 comment(s) on this page (times are local Paris time).
Josh,
Set up my robots.txt file as you recommended above but when I looked at Google webmaster tools, it pops this up as an issue:
Line 3 : Sitemap: /bm~sitemap_index.xml Invalid sitemap URL detected; syntax not understood
Does it require full URL or is this something specific to Google?
thnx! Jim
Josh, after looking into the subject on Google's Webmaster Central blog, it looks like a recent change (by them) might require absolute, as opposed to relative, URLs for sitemaps.
See discussion here: http://googlewebmastercentral.blogspot.com/2007/08/new-robotstxt-feature-and-rep-meta-tags.html
Good catch, Jim, thanks. I've updated the example above to use the full URL for the sitemap in robots.txt.