Spider Bytes

asp:Q&A

LANGUAGES: C#

ASP.NET VERSIONS: 1.x

Prepare Your Web Site for Search Engines

By Josef Finsel

Q. What do I need to do to prepare my Web site for search engines?

A. Search engines use a combination of technologies to crawl through Web sites and collect information about them. To prepare your Web site for search engine spiders (as the programs that do the actual crawling are called), you need a couple of things, beginning with making sure your Web site doesn't have any broken links. To do that, I'm going to update a program I wrote a couple of years ago to search for broken links on your site (you can download CheckWebLinks; see end of article for details). The biggest change I made was to implement a class from a Microsoft sample that makes gleaning links from a Web page simpler. It's a simple program that takes a link and checks all of the links below it to find any that are broken. I'll look at the code in more detail in a minute because I'm going to be enhancing this code at the end of the article.

Once you're sure that you don't have any broken links, the next thing to look at is a robots.txt file. This is a text file that contains instructions for spiders. It should be placed at the root of your Web site (http://mydomain.com/robots.txt). A well-designed Web crawling engine will first look for this file to see if it contains any directives. I should note that these directives are merely suggestions. The CheckWebLinks program in this article, for instance, has no provision for paying attention to a robots.txt file, but the spiders used by major search engines will pay attention to the directives in this file.

As the extension implies, it's a simple text file. Lines that begin with # are comments and are ignored. The other two types of lines define user-agents and what you don't want the spider to search. Let's take a look at a sample robots.txt file:

# A standard robots.txt file that allows
# all crawlers access to everything.
User-agent: *
Disallow:

The first two lines are comments for humans reading the file. The third line defines the user-agent that the directives target, with * meaning all of them. If you look in your Web site log files, you'll find a column of information for user-agent data. A normal Web browser might show up as MSIE 6.0, while a spider might show up as Yahoo! Slurp. Finally, there is a Disallow statement that has no value, which is a roundabout way of telling the crawlers they can search your entire site. If you wanted to request that no crawling be done, you could use Disallow: /.

A slightly more advanced file might request that a specific search engine not crawl your site, but let everyone else do so. A robot is supposed to pay attention to all of the Disallow statements that follow the line containing its own user-agent name or, if no line names it, those that follow the * line:

# Request Yahoo not to crawl the site
User-agent: Yahoo Slurp
Disallow: /

# Let everyone else search
User-agent: *
Disallow:
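
To make the matching rule concrete, here is a minimal sketch of how a crawler might apply a robots.txt file. It is not part of CheckWebLinks; the class and method names (RobotsTxtSketch, IsAllowed) are hypothetical, the user-agent comparison is a simple case-insensitive substring match, and the code uses generic collections, which assumes a newer compiler than the 1.x framework this column targets. It handles only the User-agent and Disallow lines discussed here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of CheckWebLinks): applies the simple
// User-agent/Disallow rules described in the text to one request path.
public static class RobotsTxtSketch
{
    public static bool IsAllowed(string robotsTxt, string userAgent, string path)
    {
        List<string> ownRules = new List<string>();      // rules naming this robot
        List<string> wildcardRules = new List<string>(); // rules under "*"
        bool foundOwnRecord = false;
        bool recordIsOwn = false;
        bool recordIsWildcard = false;
        bool previousWasAgent = false;

        foreach (string rawLine in robotsTxt.Split('\n'))
        {
            // Drop comments, then surrounding whitespace (including '\r').
            string line = rawLine;
            int hash = line.IndexOf('#');
            if (hash >= 0) line = line.Substring(0, hash);
            line = line.Trim();

            int colon = line.IndexOf(':');
            if (colon < 0) continue; // blank or unrecognized line

            string field = line.Substring(0, colon).Trim();
            string value = line.Substring(colon + 1).Trim();

            if (string.Equals(field, "User-agent", StringComparison.OrdinalIgnoreCase))
            {
                // A User-agent line after a Disallow line starts a new record.
                if (!previousWasAgent)
                {
                    recordIsOwn = false;
                    recordIsWildcard = false;
                }
                previousWasAgent = true;

                if (value == "*")
                {
                    recordIsWildcard = true;
                }
                else if (value.Length > 0 &&
                    userAgent.IndexOf(value, StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    recordIsOwn = true;
                    foundOwnRecord = true;
                }
            }
            else if (string.Equals(field, "Disallow", StringComparison.OrdinalIgnoreCase))
            {
                previousWasAgent = false;
                if (value.Length == 0) continue; // empty Disallow blocks nothing
                if (recordIsOwn) ownRules.Add(value);
                if (recordIsWildcard) wildcardRules.Add(value);
            }
        }

        // A record that names the robot takes the place of the "*" record.
        List<string> rules = foundOwnRecord ? ownRules : wildcardRules;
        foreach (string prefix in rules)
        {
            if (path.StartsWith(prefix, StringComparison.Ordinal)) return false;
        }
        return true;
    }
}
```

With the second sample file above, IsAllowed(sample, "Yahoo Slurp", "/index.html") returns false, while the same call for a user-agent of "MSIE 6.0" returns true.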

Now that we have informed the spiders what we would like them to look at, let's take a look at what they are going to do. Much like the CheckWebLinks program, they are going to load a page and take a look at what links it has; but they are also going to parse the content of the page and store that in a database. One way that you can help is to use meta tags. Meta is short for meta-data: data that describes the rest of the data in your page. When you build pages with Visual Studio, a meta tag is generally inserted that looks something like this:

<meta name="generator" content="Microsoft Visual Studio 6.0" />

All meta tags have two properties: a name and the content. In this case, the name of the meta-data is generator and the content is Microsoft Visual Studio 6.0. Meta tags go in the head section of a Web page, and the two most common name properties are description and keywords. For the crossword dictionary site I put up, those meta tags are shown here:

<meta name="description" content="... dictionary of crossword clues and answers in an easy to access format." />
<meta name="keywords" content="... clues, crossword dictionary" />

It's important to note that search engine spiders are smart enough to ignore the meta information if it doesn't match the actual content of the page.
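
For a rough idea of what a spider sees, the following sketch pulls the name and content pairs out of a page's meta tags. It is only an illustration: MetaTagSketch and Read are hypothetical names, and the regular expression assumes double-quoted attributes with name written before content, which real pages do not always follow.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: collects <meta name="..." content="..."> pairs.
public static class MetaTagSketch
{
    // Assumes double-quoted attributes with name appearing before content.
    private static readonly Regex MetaTag = new Regex(
        @"<meta\s+name\s*=\s*""(?<name>[^""]*)""\s+content\s*=\s*""(?<content>[^""]*)""",
        RegexOptions.IgnoreCase);

    public static Dictionary<string, string> Read(string html)
    {
        // Meta names are compared without regard to case.
        Dictionary<string, string> tags =
            new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        foreach (Match match in MetaTag.Matches(html))
        {
            tags[match.Groups["name"].Value] = match.Groups["content"].Value;
        }
        return tags;
    }
}
```

Calling MetaTagSketch.Read on a page and looking up "description" and "keywords" in the result is a quick way to check what your own pages advertise.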

The last step to prepare your site for search engines is a new technology that Google is launching called Google Sitemaps. Sitemaps are XML documents that are designed to take the robots.txt file to the next level by providing more information to the spider about your site. The schema that Google's spider is looking for provides several pieces of information about a URL: the location, when it was last modified, how often it changes, and what priority it should be given in being searched relative to other pages on your site. With those in mind, let's dig into the code that creates our Sitemap XML.
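
As a small illustration of those four pieces of information, the sketch below builds a single url element as a string. The names SitemapEntrySketch and BuildUrlElement are hypothetical, and two details are assumptions on my part rather than anything taken from the download: the last-modified date is written as a UTC timestamp in the yyyy-MM-ddTHH:mm:ssZ form, and the priority is kept between 0.0 and 1.0.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Xml;

// Hypothetical helper: writes one Sitemap <url> entry to a string.
public static class SitemapEntrySketch
{
    public static string BuildUrlElement(
        string loc, DateTime lastModifiedUtc, string changeFreq, double priority)
    {
        // Assumption: priority is a relative weight from 0.0 to 1.0.
        if (priority < 0.0 || priority > 1.0)
            throw new ArgumentOutOfRangeException("priority");

        StringWriter text = new StringWriter(CultureInfo.InvariantCulture);
        XmlTextWriter writer = new XmlTextWriter(text);

        writer.WriteStartElement("url");
        writer.WriteElementString("loc", loc);
        // Caller supplies a UTC time; the quoted T and Z are written literally.
        writer.WriteElementString("lastmod", lastModifiedUtc.ToString(
            "yyyy-MM-dd'T'HH:mm:ss'Z'", CultureInfo.InvariantCulture));
        writer.WriteElementString("changefreq", changeFreq);
        writer.WriteElementString("priority",
            priority.ToString("0.0", CultureInfo.InvariantCulture));
        writer.WriteEndElement();
        writer.Close();

        return text.ToString();
    }
}
```

In a complete file, entries like this one sit inside the urlset root element that carries the Sitemap namespace.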

If you used the old CheckWebLinks program from a couple of years ago, you'll find some significant improvements in the version that I'm using as the base for creating this Sitemap. But the first thing I am doing is using an XmlTextWriter (to create the XML file) and writing the urlset element. Then I can start diving into the Web pages. Listing One shows all the key elements for creating the Sitemap file; again, the code is available for download.

The first improvement is in using HttpWebRequest and HttpWebResponse objects to put the Web addresses in the correct format. This also makes it much simpler to get the base URL, because I only want to check pages that are below the link I started with. With all of that out of the way, I clear the two dictionary objects I am going to use (one to store links I have processed, one to store links queued up for processing) and then start processing Web pages. If I have a valid URL, I'll go ahead and write the information to the XmlTextWriter. After that is the next big change: using the GetPageLinks function from the Microsoft sample code. This does a much better job of catching all the anchor tags than the code in the original sample, and I use its output to add each link to the dictionary of links to process, provided the link doesn't already exist in the processed dictionary and the Web page is below the original link. When all the links are processed, I close the XmlTextWriter and we're done. Well, we're mostly done.
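
Stripped of the network and XML details, the queued/processed bookkeeping described above can be sketched as follows. This is not the code from the download: CrawlSketch and Crawl are hypothetical names, the fetchLinks delegate stands in for downloading a page and calling GetPageLinks, and a Queue plus a HashSet replace the two dictionaries.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the crawl loop: visit every page below baseUrl once.
public static class CrawlSketch
{
    // fetchLinks stands in for loading a page and extracting its links.
    public static List<string> Crawl(
        string startUrl, string baseUrl,
        Func<string, IEnumerable<string>> fetchLinks)
    {
        Queue<string> queued = new Queue<string>();
        // Every link ever queued, so no page is processed twice.
        HashSet<string> seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        List<string> visited = new List<string>();

        queued.Enqueue(startUrl);
        seen.Add(startUrl);

        // As long as there are links to process
        while (queued.Count > 0)
        {
            string current = queued.Dequeue();
            visited.Add(current);

            foreach (string link in fetchLinks(current))
            {
                // Only follow links below the starting point.
                bool below = link.StartsWith(baseUrl, StringComparison.OrdinalIgnoreCase);

                // HashSet.Add returns false when the link was already seen.
                if (below && seen.Add(link))
                {
                    queued.Enqueue(link);
                }
            }
        }
        return visited;
    }
}
```

Because the page source is passed in as a delegate, the loop can be exercised against an in-memory table of links before pointing it at a live site.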

We now have the data, but it is going to require some tweaking. You can load this file into Excel or your favorite database and tweak it by changing the values for changefreq and priority. These are key reasons for having a Sitemap file in the first place. A spider may not crawl your entire site at one time. If you have static content that doesn't change, you don't want the spider re-crawling those pages. By defining the frequency that the page changes as never, you lessen the chance of the spider crawling it. And a page that changes hourly is more likely to get crawled more frequently. The same goes for the priority element. If you have a set of pages that are more important, you'll want to give them a higher priority. Remember, this priority is relative only within your site, so setting all your pages to 1 is ineffective. Setting some to 1 and the rest to 0.5 or less will tell the spider which pages it may want to look at first.
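
If you would rather script the tweaking than do it by hand, something along these lines could work. It is a sketch under assumptions: SitemapTweakSketch and SetHints are hypothetical names, the file is assumed to use the same namespace as the generated Sitemap, and pages are selected by a simple URL prefix.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper: rewrites changefreq and priority for matching URLs.
public static class SitemapTweakSketch
{
    private const string SitemapNs = "http://www.google.com/schemas/sitemap/0.84";

    public static string SetHints(
        string sitemapXml, string urlPrefix, string changeFreq, string priority)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(sitemapXml);

        // The elements live in the Sitemap namespace, so XPath needs a prefix.
        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("sm", SitemapNs);

        // Copy the matches first so the document is not edited mid-enumeration.
        List<XmlNode> urls = new List<XmlNode>();
        foreach (XmlNode node in doc.SelectNodes("/sm:urlset/sm:url", ns))
        {
            urls.Add(node);
        }

        foreach (XmlNode url in urls)
        {
            XmlNode loc = url.SelectSingleNode("sm:loc", ns);
            if (loc == null) continue;
            if (!loc.InnerText.StartsWith(urlPrefix, StringComparison.OrdinalIgnoreCase))
                continue;

            SetChild(doc, url, ns, "changefreq", changeFreq);
            SetChild(doc, url, ns, "priority", priority);
        }
        return doc.OuterXml;
    }

    // Updates the named child element, creating it if the entry lacks one.
    private static void SetChild(
        XmlDocument doc, XmlNode url, XmlNamespaceManager ns, string name, string value)
    {
        XmlNode child = url.SelectSingleNode("sm:" + name, ns);
        if (child == null)
        {
            child = doc.CreateElement(name, SitemapNs);
            url.AppendChild(child);
        }
        child.InnerText = value;
    }
}
```

For example, SetHints(xml, "http://example.com/news/", "hourly", "1.0") would mark everything under /news/ as frequently changing and high priority while leaving the other entries alone.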

As with meta tags, these are hints for the spider, not orders; but making things easier for the spider makes it more likely that your site will get visited and more likely that it will appear in the search engines. Plus, if you have a highly dynamic site, you can keep a list of the most recently changed URLs in the Sitemap for the spider to reference.

The final step for a Sitemap is to place it on your Web site and tell Google it's there. There's a link at the end of this article that will take you to the Google Sitemap page where you can register your Sitemap and get more information.

To briefly recap, if you want your site to be friendly for search engine spiders, make sure you don't have any broken links, add meta tags relevant to your Web site's content, and create a Sitemap to help the spiders do their job.

The original Microsoft WebCrawler code can be found at http://winfx.msdn.microsoft.com/library/default.asp?url=/library/en-us/dv_fxsamples/html/12ce37e0-64a2-405a-8fd4-4848f254424f.asp. For more on Google Sitemaps, visit http://www.google.com/webmasters/sitemaps/docs/en/overview.html.

That wraps up this month's column. Remember, this is your column; it runs on the questions you ask, so send your ASP.NET questions to [email protected] and I'll help you get the answers you need.

C# code examples are available for download to asp.netPRO.

Josef Finsel is a software consultant with Strategic Data Solutions (http://www.sds-consulting.com). He has published a number of VB and SQL Server-related articles and is currently working with VS 2005 to build a crossword dictionary (http://crosswords.reluctantdba.com). He's also the author of The Handbook for Reluctant Database Administrators (Apress, 2001).

Begin Listing One: Processing a Web Site

// Create the Sitemap file and write the root urlset element.
XmlTextWriter writer = new XmlTextWriter("sitemap.xml", null);
writer.WriteStartElement("urlset");
writer.WriteAttributeString("xmlns",
    "http://www.google.com/schemas/sitemap/0.84");

string strWebpage = "";
string strCurrentLink;
Boolean bValidURL = false;
string strBaseURL = txtURL.Text;
int iValid = 0;
int i404 = 0;

// Strings are immutable, so the result of Replace must be assigned.
strBaseURL = strBaseURL.Replace("\\", "/");

// Request the starting page so the framework normalizes the address.
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(strBaseURL);
HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
Uri baseUri = resp.ResponseUri;

// The base URL is everything up to and including the last slash.
if (baseUri.AbsolutePath.EndsWith("/"))
    strBaseURL = baseUri.AbsoluteUri;
else
    strBaseURL = baseUri.AbsoluteUri.Substring(
        0, baseUri.AbsoluteUri.LastIndexOf("/") + 1);
strBaseURL = strBaseURL.ToUpper();
if (!strBaseURL.EndsWith("/"))
    strBaseURL = strBaseURL + "/";

dctQueued.Clear();
dctProcessed.Clear();
dctQueued.Add(txtURL.Text, txtURL.Text);

// As long as I have links to process
while (dctQueued.Keys.Count > 0) {
    lblCounter.Text = "Valid count: " + iValid.ToString() +
        " / 404 Errors: " + i404.ToString();
    bValidURL = true;
    strCurrentLink = dctQueued.GetKey(0);
    if (!strCurrentLink.ToUpper().StartsWith(strBaseURL)) {
        // Not below the starting link: mark it processed without loading it.
        dctQueued.Remove(strCurrentLink);
        dctProcessed.Add(strCurrentLink, strCurrentLink);
    } else {
        try {
            strWebpage = GetWebPage(strCurrentLink);
        }
        catch (Exception) {
            // Any failure to load the page is counted as a 404 error.
            bValidURL = false;
        }
        dctQueued.Remove(strCurrentLink);
        if (!bValidURL) {
            dctProcessed.Add(strCurrentLink, "-1");
            i404++;
        } else {
            dctProcessed.Add(strCurrentLink, "1");
            iValid++;
            writer.WriteStartElement("url");
            writer.WriteElementString("loc", strCurrentLink);
            // resp is the response for the starting page, so every entry
            // gets that page's date. Written as a UTC timestamp with a T
            // separator instead of the "u" format's space.
            writer.WriteElementString("lastmod",
                resp.LastModified.ToUniversalTime().ToString(
                    "yyyy-MM-dd'T'HH:mm:ss'Z'",
                    System.Globalization.CultureInfo.InvariantCulture));
            writer.WriteElementString("changefreq", "never");
            writer.WriteElementString("priority", "0.5");
            writer.WriteEndElement();

            System.Collections.Hashtable found =
                new System.Collections.Hashtable();
            Uri CurrentPage = new Uri(strCurrentLink);
            // <a href=
            GetPageLinks(CurrentPage, strWebpage, "a", "href", found);
            // <frame src=
            GetPageLinks(CurrentPage, strWebpage, "frame", "src", found);
            // <area href=
            GetPageLinks(CurrentPage, strWebpage, "area", "href", found);
            // <link href=
            GetPageLinks(CurrentPage, strWebpage, "link", "href", found);
            // The links in found are then added to dctQueued, as described
            // in the article (that step is not shown in this listing).
            System.Windows.Forms.Application.DoEvents();
        }
    }
}

// Close the urlset element and the file.
writer.WriteEndElement();
writer.Close();
resp.Close();
sbStatus.Text = "Ready";

End Listing One
