What's Global Moxie?

Global Moxie is the hypertext laboratory of Josh Clark, whose projects include the Big Medium web content management system. Josh creates web applications and websites from his multimedia studio in Paris, France.

What's Big Medium?

Big Medium is flexible, easy-to-use server software for creating and editing websites directly from your browser. Check out the features or download now.

Fun with robots.txt and .htaccess

Posted Aug 3, 2007 (updated Aug 29, 2007)

Here are a few tips for keeping search engines happy with your Big Medium-powered site and, along the way, improving security to boot. These tips involve tweaking your site's robots.txt and .htaccess files.

What's robots.txt?

The robots.txt file is a plain-text file that you can add to your web root directory to tell search engines which files and directories you do not want them to index. Well-behaved search engines check this file before crawling your site. For all the juicy details, check out the robots.txt site.

Two important rules to remember:

  • Always keep your robots.txt file in your domain's top-level directory:

    http://www.example.com/robots.txt
    
  • Inside the file, never list the full URL of a resource. The one exception is the sitemap, which we'll get to in a moment. Every other entry should use only the part of the address that comes after the domain name, including the leading slash. For example, if you're referring to http://www.example.com/page.html, you would use:

    /page.html
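
If you have a list of full URLs and want the form that robots.txt expects, the conversion can be sketched in a few lines of Python. This is only an illustration of the rule above; the function name robots_path is made up for this example and is not part of Big Medium.

```python
from urllib.parse import urlsplit


def robots_path(url: str) -> str:
    """Return the part of a URL that a robots.txt rule refers to.

    Hypothetical helper for illustration: it drops the scheme and the
    domain name and keeps the path (with its leading slash) plus any
    query string.
    """
    parts = urlsplit(url)
    path = parts.path or "/"  # a bare domain maps to the root path
    if parts.query:
        path += "?" + parts.query
    return path


print(robots_path("http://www.example.com/page.html"))  # /page.html
```

The sitemap entry is the exception: it keeps its full URL, so it should not be passed through a helper like this.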
    

Okay, on to business...

Big Medium and robots.txt

You can use robots.txt to ask search engines not to index some of Big Medium's support files and, even better, to point them to the location of your site's sitemap.

Follow the instructions embedded in the comments to make the entries match the locations on your site. If you already have a robots.txt file for your site, you can add this to the existing file.

# SEARCH ENGINE SITEMAP
# Big Medium builds a sitemap that tells search engines
# the location of all of the pages on your site.
# If your site's homepage directory is in your web root:
# http://www.example.com
#
# ...then your sitemap is located here:
# http://www.example.com/bm~sitemap_index.xml
#
# Unlike other items in robots.txt, use the full URL:

Sitemap: http://www.example.com/bm~sitemap_index.xml


# Keep the User-agent line and its Disallow rules together
# with no blank lines in between; some crawlers treat a
# blank line as the end of the group.
User-agent: *
#
#block big medium directories ------------------
Disallow: /cgi-bin/moxiebin/
Disallow: /bmadmin/
#
#block page-directory support files ------------
#these examples assume that your page directory
#is located at: http://www.example.com/bm
Disallow: /bm/bm~comments/
Disallow: /bm/bm~theme/
Disallow: /bm/bm~assets/
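
Before uploading the file, you may want to sanity-check it. Here is a minimal sketch using Python's standard urllib.robotparser module (site_maps() needs Python 3.8 or later). The crawler name ExampleBot and the helper names are assumptions made for this example, and the rules are the example.com entries from above. One caveat: Python's parser treats a blank line as the end of a group, so the sketch keeps the User-agent line directly above its Disallow rules.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Example rules from this article, with the group kept contiguous.
ROBOTS_TXT = """\
Sitemap: http://www.example.com/bm~sitemap_index.xml

User-agent: *
Disallow: /cgi-bin/moxiebin/
Disallow: /bmadmin/
Disallow: /bm/bm~comments/
Disallow: /bm/bm~theme/
Disallow: /bm/bm~assets/
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def is_absolute(url: str) -> bool:
    """True if the URL has a scheme and a host, as a Sitemap line requires."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


rules = load_rules(ROBOTS_TXT)

# "ExampleBot" is a made-up crawler name; it falls under "User-agent: *".
for url in ("http://www.example.com/bmadmin/",
            "http://www.example.com/bm/index.shtml"):
    verdict = "allowed" if rules.can_fetch("ExampleBot", url) else "blocked"
    print(url, verdict)

# site_maps() returns None when the file has no Sitemap lines.
for sitemap in rules.site_maps() or []:
    print(sitemap, "absolute" if is_absolute(sitemap) else "relative")
```

A relative value such as /bm~sitemap_index.xml would be reported as relative here, which is the kind of entry that search engines reject.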

If you don't want search engines to index images or documents on your site, you can also add these lines to the same group of Disallow rules:

#block indexing of documents and images --------
Disallow: /bm/bm~doc/
Disallow: /bm/bm~pix/

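Some folks prefer not to have their printer-friendly or email-this-page pages indexed. If that's you, you can add this (the * wildcard is an extension that the major search engines understand, but not every crawler does):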

#block indexing of email and print pages -------
Disallow: /*~email.shtml
Disallow: /*~print.shtml
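
To see which addresses a wildcard rule would catch, here is a rough Python sketch. It assumes the wildcard behavior that the major search engines document (* matches any run of characters, and a trailing $ anchors the end of the address). The function name rule_matches is invented for this example, and Python's own urllib.robotparser does not handle wildcards, so the pattern is translated to a regular expression by hand.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Check a path against a wildcard-style Disallow pattern.

    Assumed semantics: "*" matches any sequence of characters, a
    trailing "$" anchors the end, and otherwise the pattern only has
    to match the beginning of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" for each "*".
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None


print(rule_matches("/*~email.shtml", "/bm/news/story~email.shtml"))  # True
print(rule_matches("/*~email.shtml", "/bm/news/story.shtml"))        # False
```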

Big Medium and .htaccess

If your site is hosted on an Apache server (almost certainly the case if it's a Unix-flavored server), you can usually specify configuration preferences with a file named .htaccess. Just add the file to your web root directory, so that it sits at this location:

http://www.example.com/.htaccess

Here are some recommended directives to use (if a file already exists at that location, you can add this to the existing file):

#improve security by preventing server-side includes
#from executing code
Options +IncludesNOEXEC

#don't let people peek into directories without an index file
Options -Indexes

#Custom error document for page not found;
#the following displays the page at http://www.example.com/error.html
ErrorDocument 404 /error.html
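
If a .htaccess file already exists, you only want to add the directives that are missing. The following Python sketch shows one way to do that merge on the file's text. The function name merge_directives and the RECOMMENDED list are made up for this example, and the comparison is deliberately simple (whole lines, ignoring extra whitespace), so treat it as a starting point rather than a full .htaccess parser.

```python
from typing import List

# The directives recommended in this article.
RECOMMENDED = [
    "Options +IncludesNOEXEC",
    "Options -Indexes",
    "ErrorDocument 404 /error.html",
]


def normalize(line: str) -> str:
    """Collapse runs of whitespace so equivalent lines compare equal."""
    return " ".join(line.split())


def merge_directives(existing: str, recommended: List[str]) -> str:
    """Append the recommended lines that the existing text lacks."""
    present = {normalize(line) for line in existing.splitlines()}
    missing = [d for d in recommended if normalize(d) not in present]
    if not missing:
        return existing  # nothing to add
    base = existing.rstrip("\n")
    prefix = base + "\n\n" if base else ""
    return prefix + "\n".join(missing) + "\n"


# Example: a file that already disables directory listings.
print(merge_directives("Options -Indexes\n", RECOMMENDED))
```

Running the merge a second time on its own output leaves the text unchanged, so it is safe to repeat.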

Comments

3 comments on this page (times are local Paris time).

Jim
Aug 28, 2007 11:01pm [ 1 ]

Josh,

Set up my robots.txt file as you recommended above but when I looked at Google webmaster tools, it pops this up as an issue:

Line 3 : Sitemap: /bm~sitemap_index.xml Invalid sitemap URL detected; syntax not understood

Does it require full URL or is this something specific to Google?

thnx! Jim

Jim
Aug 29, 2007 12:09am [ 2 ]

Josh, after looking into the subject on Google's Webmaster Central blog, it looks like a recent change (by them) might require absolute as opposed to relative URLs for sitemaps:

See discussion here: http://googlewebmastercentral.blogspot.com/2007/08/new-robotstxt-feature-and-rep-meta-tags.html

Josh
Aug 29, 2007 8:34am [ 3 ]

Good catch, Jim, thanks. I've updated the example above to use the full URL for the sitemap in robots.txt:

Sitemap: http://www.example.com/bm~sitemap_index.xml
