Here are a few tips for keeping search engines happy with your Big Medium-powered site and, along the way, improving security to boot. These tips involve tweaking your site's robots.txt and .htaccess files.
What's robots.txt?
The robots.txt file is a plain-text file that you can add to your web root directory to tell search engines which files and directories you do not want them to index. Well-behaved search engines check this file before crawling your site. For all the juicy details, check out the robots.txt site.
Two important rules to remember:
Always keep your robots.txt file in your domain's top-level directory:
http://www.example.com/robots.txt
Inside the file, never list the full URL of a resource. The one exception is the sitemap, which we’ll get to in a moment. Every other entry should use only the part of the address that comes after the domain name, including the leading slash. For example, if you're referring to http://www.example.com/page.html, you would use:
/page.html
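If you'd rather not work out those two addresses by hand, here is a minimal Python sketch of both rules. The function names robots_path and robots_txt_location are made up for this illustration, and the example URLs are placeholders.

```python
from urllib.parse import urlsplit


def robots_path(url: str) -> str:
    """Return the part of a URL that a robots.txt rule refers to:
    everything after the domain name, including the leading slash."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return path


def robots_txt_location(url: str) -> str:
    """Return where robots.txt must live for the site serving this URL:
    always the top-level directory of the domain."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


print(robots_path("http://www.example.com/page.html"))
# prints: /page.html
print(robots_txt_location("http://www.example.com/bm/index.shtml"))
# prints: http://www.example.com/robots.txt
```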
Okay, on to business...
Big Medium and robots.txt
You can use robots.txt to ask search engines not to index some of Big Medium's support files and, even better, to point them to the location of your site's sitemap.
Follow the instructions embedded in the comments to make the entries match the locations on your site. If you already have a robots.txt file for your site, you can add this to the existing file.
User-agent: *
# SEARCH ENGINE SITEMAP
# Big Medium builds a sitemap that tells search engines
# the location of all of the pages on your site.
# If your site's homepage directory is in your web root:
# http://www.example.com
#
# ...then your sitemap is located here:
# http://www.example.com/bm~sitemap_index.xml
#
# Unlike other items in robots.txt, use the full URL:
Sitemap: http://www.example.com/bm~sitemap_index.xml
#block big medium directories ------------------
Disallow: /cgi-bin/moxiebin/
Disallow: /bmadmin/
#block page-directory support files ------------
#these examples assume that your page directory
#is located at: http://www.example.com/bm
Disallow: /bm/bm~comments/
Disallow: /bm/bm~theme/
Disallow: /bm/bm~assets/
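Before you upload the file, you may want to check that the rules block what you expect. The following is a rough sketch using Python's standard urllib.robotparser module, which does simple prefix matching on paths. The rules are the example entries from above with the comments stripped out; the URLs being tested (such as /bm/index.shtml) are hypothetical, so substitute pages from your own site.

```python
from urllib.robotparser import RobotFileParser

# The example entries from this article; adjust the paths to match your site.
# Keep the group free of blank lines: this parser treats a blank line
# as the end of a User-agent group.
ROBOTS_TXT = """\
User-agent: *
Sitemap: http://www.example.com/bm~sitemap_index.xml
Disallow: /cgi-bin/moxiebin/
Disallow: /bmadmin/
Disallow: /bm/bm~comments/
Disallow: /bm/bm~theme/
Disallow: /bm/bm~assets/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical URLs: two support files that should be blocked,
# and one ordinary page that should stay crawlable.
for url in (
    "http://www.example.com/bmadmin/",
    "http://www.example.com/bm/bm~theme/style.css",
    "http://www.example.com/bm/index.shtml",
):
    verdict = "allowed" if parser.can_fetch("*", url) else "blocked"
    print(f"{verdict}: {url}")
```

This only tells you how one particular parser reads the file. Each search engine has its own implementation, so treat the result as a sanity check, not a guarantee.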
If you don't want search engines to index images or documents on your site, you can also add this:
#block indexing of documents and images --------
Disallow: /bm/bm~doc/
Disallow: /bm/bm~pix/
Some folks prefer not to have their printer-friendly or email-this-page pages indexed. If that's you, you can add this:
#block indexing of email and print pages -------
Disallow: /*~email.shtml
Disallow: /*~print.shtml
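The * in those two rules is a wildcard. Here is a small Python sketch of how a wildcard-aware crawler might read them, assuming the convention that Google documents: * matches any run of characters, a trailing $ anchors the end of the path, and a rule always matches from the start of the path. The helper names and the page paths are hypothetical, and crawlers that don't support wildcards may ignore these rules.

```python
import re

# The two wildcard rules from the example above.
PATTERNS = ["/*~email.shtml", "/*~print.shtml"]


def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a regular expression.

    Assumed semantics: '*' matches any sequence of characters and a
    trailing '$' means the path must end there.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path: str, patterns=PATTERNS) -> bool:
    """True if any pattern matches the path, starting at its first character."""
    return any(robots_pattern_to_regex(p).match(path) for p in patterns)


# Hypothetical page paths, for illustration only.
print(is_blocked("/bm/news/story~print.shtml"))  # prints: True
print(is_blocked("/bm/news/story~email.shtml"))  # prints: True
print(is_blocked("/bm/news/story.shtml"))        # prints: False
```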
Big Medium and .htaccess
If your site is hosted on an Apache server (almost certainly the case if it's a Unix-flavored server), you can usually specify configuration preferences with a file named .htaccess. Just add a file to this location:
http://www.example.com/.htaccess
Here are some recommended directives to use (if a file already exists at that location, you can add this to the existing file):
#improve security by allowing server-side includes
#but not letting them execute code
Options +IncludesNOEXEC
#don't let people peek into directories without an index file
Options -Indexes
#Custom error document for page not found;
#the following displays a page at http://example.com/error.html
ErrorDocument 404 /error.html
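If a .htaccess file already exists, you'll want to add only the directives it doesn't have yet. Here's one way that could look in Python: a naive sketch that compares whole lines, ignoring case and extra spaces. The merge_directives helper is invented for this example, and it makes no attempt to understand Apache's syntax, so review the result by hand before uploading it.

```python
# The directives recommended above, without the comments.
WANTED = [
    "Options +IncludesNOEXEC",
    "Options -Indexes",
    "ErrorDocument 404 /error.html",
]


def _normalize(line: str) -> str:
    """Collapse whitespace and ignore case for a rough line comparison.

    Naive on purpose: it also lowercases paths, so two lines that differ
    only in the case of a file name are treated as the same directive.
    """
    return " ".join(line.split()).lower()


def merge_directives(existing: str, wanted=WANTED) -> str:
    """Append each wanted directive that is not already in the file text."""
    present = {_normalize(line) for line in existing.splitlines()}
    missing = [d for d in wanted if _normalize(d) not in present]
    if not missing:
        return existing
    # Make sure the existing content ends with a newline before appending.
    prefix = existing if existing.endswith("\n") or not existing else existing + "\n"
    return prefix + "\n".join(missing) + "\n"


# Example: a file that already turns off directory listings.
print(merge_directives("Options -Indexes\n"), end="")
```

Running the function a second time on its own output returns the text unchanged, so it should be safe to apply more than once.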
Comments
Josh,
Set up my robots.txt file as you recommended above but when I looked at Google webmaster tools, it pops this up as an issue:
Line 3 : Sitemap: /bm~sitemap_index.xml Invalid sitemap URL detected; syntax not understood
Does it require full URL or is this something specific to Google?
thnx! Jim
Josh, after looking into the subject on Google's Webmaster Central blog, it looks like a recent change (by them) might require absolute as opposed to relative URLs for sitemaps:
See discussion here: http://googlewebmastercentral.blogspot.com/2007/08/new-robotstxt-feature-and-rep-meta-tags.html
Good catch, Jim, thanks. I've updated the example above to use the full URL for the sitemap in robots.txt.
In robots.txt I have the following line: "Disallow: /scripts/", but Google still indexed some files in this folder :( what am I doing wrong?
Your robots.txt file looks fine, and it should work as you expect. If you've added the file only recently, though, it may take some time for Google to update its index. If you continue to have trouble, you might put the question to Google's webmaster help forum.
great pics of robots
Is the robots.txt file used to filter search engines?
Finally found what I needed! I wanted to .htaccess password protect all files except for the robots.txt file!
Solid guide, :). What is the point of adding a sitemap to htaccess though if we already have our site added to Google Webmaster Tools?
@raptorak: because only Google can access it then. If you list it in your .htaccess file then other crawlers can also find and download it. The impact (ie. increase in hits) might not be huge, seeing as Google dominates the search engine market, but it's there.