Wednesday, October 31, 2007

Search Engine Tips & Tricks: Create a Robots Text File for Your Web Site

Written by Sandra Waggett

Search engines index millions of web sites to generate the search results they return for key words. They do this using “spiders”.

Most search engines have their own spider that crawls around the web looking for web pages. Spiders are also known as “robots” because they are simply small programs that run automatically, finding web pages and recursively following the text links embedded in them to index further pages.
Most robots look for a robots.txt file in the top-level directory of your web site, also known as the “root”, where your home page is located on the web server.
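Because the file always sits in the root, its address can be worked out from the address of any page on the site. Here is a minimal Python sketch of that idea. The function name robots_txt_url and the www.example.com addresses are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the address where a robot would look for robots.txt.

    Keeps the scheme and host of page_url and replaces everything
    after the host with /robots.txt (the site root).
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# A page deep inside the site still maps to the file in the root.
print(robots_txt_url("http://www.example.com/portfolio/wedding.htm"))
# http://www.example.com/robots.txt
```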

The robots.txt file is a simple text file created in a basic text editor, like Notepad. It allows you to control what the spider is allowed to access and what it is not allowed to access or index.

The format of the basic robots.txt file is pretty simple:

User-agent: [Spider Name]
Disallow: [File Name]

For example, to allow ALL robots complete access to your web site, your robots.txt file will look like this:

User-agent: *
Disallow:

The asterisk is a “wild card” character that represents ALL robots. Leaving the Disallow line blank tells the robots that nothing on the site is disallowed.
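If you want to see how a robot might read these two lines, Python ships with a robots.txt parser in its standard library (urllib.robotparser). The sketch below feeds it the allow-all rules and asks whether a page may be fetched. The robot name ExampleBot, the helper is_allowed, and the www.example.com address are made up for illustration. Note that the two rule lines are given back to back: this parser treats a blank line as the end of a record.

```python
from urllib.robotparser import RobotFileParser

# The allow-all rules, one directive per line, with no blank line between them.
ALLOW_ALL_RULES = [
    "User-agent: *",
    "Disallow:",
]


def is_allowed(rules, robot_name, url):
    """Return True if robot_name may fetch url under the given rule lines."""
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch(robot_name, url)


# With a blank Disallow line, every page is open to every robot.
print(is_allowed(ALLOW_ALL_RULES, "ExampleBot",
                 "http://www.example.com/portfolio/index.htm"))  # True
```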

The next example bars all robots from the cgi-bin directory (where your scripts are typically located), the images directory, and the portfolio directory:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /portfolio/

Note: You should use a separate Disallow line for each directory or individual file.
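To check which pages a set of Disallow lines actually blocks, you can run them through Python's standard robots.txt parser. This is only a sketch: the robot name ExampleBot, the helper blocked_paths, and the sample file names are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# The three-directory example, written as it would appear in the file.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /portfolio/
"""


def blocked_paths(rules_text, robot_name, paths):
    """Return the subset of paths that robot_name is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return [path for path in paths if not parser.can_fetch(robot_name, path)]


sample_paths = [
    "/index.htm",
    "/cgi-bin/form.cgi",
    "/images/logo.gif",
    "/portfolio/wedding01.htm",
]
# Only the home page stays open; each Disallow line blocks its own directory.
print(blocked_paths(RULES, "ExampleBot", sample_paths))
```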

In this example, you may wonder why you would want to disallow a robot from indexing your portfolio directory.

If you are a photographer and you have thumbnail images on a portfolio page that link to enlargement pages launched in a pop-up window, you may not want those pop-up pages indexed. These are called “dead-end” or “orphaned” pages because only the enlarged image appears on the page with no contact info or menu links back to the main site. If the visitor entered your site on one of these pages, they would have nowhere to go and no way to contact you.

For a live example, check out www.AnJPhotography.com and look at the wedding portfolio. When you click on an image, it opens in a new window. The page in the new window is a “dead-end” page. A robots.txt file can keep search engines from indexing these “dead-end” pages so you don’t leave site visitors stranded.

This example keeps googlebot (the Google spider) from getting at the private.htm file:

User-agent: googlebot
Disallow: /private.htm
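A rule aimed at one robot leaves every other robot alone, and the path needs its leading slash to match. The sketch below shows both points using Python's standard robots.txt parser. The helper can_fetch and the robot name OtherBot are made up for illustration, and other parsers may treat a path without a slash differently.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(rules, robot_name, path):
    """Return True if robot_name may fetch path under the given rule lines."""
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch(robot_name, path)


WITH_SLASH = ["User-agent: googlebot", "Disallow: /private.htm"]
WITHOUT_SLASH = ["User-agent: googlebot", "Disallow: private.htm"]

# The rule names googlebot, so googlebot is kept out of the file...
print(can_fetch(WITH_SLASH, "googlebot", "/private.htm"))     # False
# ...while a robot with a different name is not affected.
print(can_fetch(WITH_SLASH, "OtherBot", "/private.htm"))      # True
# Without the leading slash, this parser never matches the file.
print(can_fetch(WITHOUT_SLASH, "googlebot", "/private.htm"))  # True
```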

When you create your robots.txt file, it is extremely important that you use a basic text editor (like Notepad) and NOT a word processing application like Microsoft Word. Applications like Microsoft Word can insert hidden characters that may make your robots.txt file unreadable.

After you post your robots.txt file to the web server, you can validate it to make sure it is properly formatted. There are several free validators on the web. Here is one:
http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
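Before you upload the file, you can also run a quick check of your own for the kind of hidden characters a word processor leaves behind. The Python sketch below is a rough illustration, not a replacement for a real validator: the function find_problems is made up for this example, and it only knows the two directives covered in this article (User-agent and Disallow).

```python
# Directives described in this article; a real validator knows more.
KNOWN_FIELDS = {"user-agent", "disallow"}


def find_problems(text):
    """Return (line_number, message) pairs for suspicious lines in text."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Curly quotes, byte-order marks, and control codes are not plain text.
        if any(ord(ch) > 126 or (ord(ch) < 32 and ch != "\t") for ch in line):
            problems.append((number, "non-plain-text character"))
            continue
        # Ignore comments (everything after #) and blank lines.
        content = line.split("#", 1)[0].strip()
        if not content:
            continue
        # Every remaining line should look like "Field: value".
        field, separator, _value = content.partition(":")
        if not separator or field.strip().lower() not in KNOWN_FIELDS:
            problems.append((number, "unrecognized line"))
    return problems


good = "User-agent: *\nDisallow: /cgi-bin/\n"
bad = "User-agent: *\nDisallow: \u201c/images/\u201d\nBlock /portfolio/\n"
print(find_problems(good))  # []
print(find_problems(bad))
```

In the second sample, line 2 contains curly quotes of the sort a word processor substitutes automatically, and line 3 is missing the "Field:" part.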

There are several advantages and some disadvantages to having a robots.txt file in your root directory. Under the Standard for Robots Exclusion, a robot requests the robots.txt file before it indexes anything else on your web site, so the file is the default entry point for robots when it is present. The standard is voluntary, but the major search engines follow it. This is the primary reason the file should be there. Beyond that, it can help with your search engine rankings when used correctly, and it can keep dead pages on your web site from being indexed. The primary disadvantage is that anyone on the web can view the robots.txt file, including nefarious individuals, and every path you list in it tells them where to look. Never use the robots.txt file to try to hide sensitive pages or directories on your web site (such as those containing passwords or private information).

For more information about the robots.txt file and a complete list of robots, visit the following web site: http://www.robotstxt.org/wc/robots.html

Sandra Waggett is the founder of MSW Interactive Designs LLC (MSW-ID) and the principal designer of its major products and websites. MSW-ID provides custom website design, hosting, ecommerce and online marketing solutions to nearly 400 small business clients nationwide. MSW-ID helps small business professionals achieve an effective Internet presence.

Prior to founding MSW Interactive Designs LLC, she spent nearly 5 years working as a Senior Engineer for BAE Systems on the Lockheed Martin Mission Systems Team in Colorado Springs, CO.

While with BAE, she was the training lead for the proposal phase of the Integrated Space Command and Control (ISC2) program. In this role, she authored the 10-year training plan for the proposal and developed web-based training prototypes for presentation to the Government decision makers.
Sandy earned her Master of Arts degree from the University of Colorado, Colorado Springs, in Curriculum and Instruction, Corporate Track. Her specialties include web design, interface design, instructional design, and computer-based training development.
