Friday, June 8, 2007

Getting Indexed?

As a search engine optimization guy, I always get asked the same question: when will Google list my site?

There is a lot of speculation about how search engines index websites. The exact workings of the indexing process are shrouded in mystery, since most search engines offer limited information about how they architect it. The truth is that no one really knows. The only thing experts can do is draw conclusions based on data from log files. Here is some known information. Google runs from about 10 Internet Data Centers (IDCs), each having 1000 to 2000 Pentium-3 or Pentium-4 servers running Linux.

Google has over 200 (some think over 1000) crawlers/bots scanning the web each day. These do not necessarily follow an exclusive pattern, which means different crawlers may visit the same site on the same day, not knowing other crawlers have been there before. This makes webmasters very happy.

Some crawlers' only job is to grab new URLs (let us call them URL Grabbers for convenience). The URL Grabbers grab the links and URLs they detect on various websites (including links pointing to your site) and the old and new URLs they detect on your site. They also capture the date stamp of files when they visit your website, so that they can identify new or updated content.

The URL Grabbers write the captured URLs with their date stamps and other stats in a Master URL List so that these can be deep-indexed by other special crawlers.
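To make the date stamp idea concrete, here is a minimal sketch of what a URL Grabber might record on each visit. It assumes the date stamp is the HTTP Last-Modified header and that the master list is a plain dictionary; both are my assumptions for illustration, and the function name record_visit is hypothetical. Nobody outside Google knows how the real thing is stored.

```python
from email.utils import parsedate_to_datetime


def record_visit(master_list, url, last_modified):
    """Store the URL with its date stamp in the master list.

    master_list is a plain dict (an assumption for this sketch) mapping
    URL -> last seen date stamp. Returns True when the URL is new or its
    date stamp is newer than the one recorded on the previous visit.
    """
    stamp = parsedate_to_datetime(last_modified)  # parse the HTTP date format
    previous = master_list.get(url)
    master_list[url] = stamp
    return previous is None or stamp > previous


master = {}
first = record_visit(master, "http://example.com/", "Fri, 08 Jun 2007 10:00:00 GMT")
again = record_visit(master, "http://example.com/", "Fri, 08 Jun 2007 10:00:00 GMT")
later = record_visit(master, "http://example.com/", "Sat, 09 Jun 2007 10:00:00 GMT")
```

The first visit counts as new, the second as unchanged, and the third as updated because the date stamp moved forward.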

The master list is then processed and classified into the following categories:

a) New URLs detected
b) Old URLs with new date stamp
c) 301 & 302 redirected URLs
d) Old URLs with old date stamp
e) 404 error URLs
f) Other URLs
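The categories above can be sketched as a small classifier. This is only an illustration of the idea, not Google's actual logic: the UrlRecord fields, the order of the checks, and the rule that a query string or a document extension puts a URL in the "Other" bucket are all assumptions of mine.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional
from urllib.parse import urlsplit

# Assumed markers for "Other URLs": document types that need special handling.
OTHER_EXTENSIONS = (".pdf", ".doc", ".ppt")


@dataclass
class UrlRecord:
    """One hypothetical entry in the master URL list."""
    url: str
    status: int                                # HTTP status seen by the grabber
    date_stamp: Optional[datetime] = None      # date stamp seen on this visit
    previous_stamp: Optional[datetime] = None  # date stamp from the last visit


def classify(record: UrlRecord) -> str:
    """Return the category letter (a to f) for one master list entry."""
    if record.status == 404:
        return "e"
    if record.status in (301, 302):
        return "c"
    parts = urlsplit(record.url)
    if parts.query or parts.path.lower().endswith(OTHER_EXTENSIONS):
        return "f"  # dynamic URL or non-HTML document
    if record.previous_stamp is None:
        return "a"  # never seen before
    if record.date_stamp is not None and record.date_stamp > record.previous_stamp:
        return "b"  # known URL, newer date stamp
    return "d"      # known URL, nothing changed
```

For example, classify(UrlRecord("http://example.com/page", 200, datetime(2007, 6, 1), datetime(2007, 5, 1))) lands in category "b", because the page is already known and its date stamp has moved forward.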

The real indexing is done by what we're calling Deep Crawlers. A Deep Crawler's job is to pick up URLs from the master list, deep crawl each one, and capture all the content: text, HTML, images, Flash, etc.

Priority is given to existing URLs with a new date stamp, as they relate to already indexed but updated content. 301 and 302 redirected URLs come next in priority, followed by New URLs detected. High priority is given to URLs whose links appear on several other sites; these are classified as Important URLs. Sites and URLs whose date stamp and content change on a daily or hourly basis are stamped as News sites, which are indexed hourly or even on a minute-by-minute basis.

Old URLs with old date stamps and 404 error URLs are ignored altogether. There is no point wasting resources indexing Old URLs with an old date stamp, since the search engine already has that content indexed and it has not been updated.
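The priority order described above can be sketched with a simple priority queue. Treat this as a rough model under stated assumptions: the category letters follow the classification list, the rank numbers are mine, and using the inbound link count as a tie-breaker inside each category is my guess at how "Important URLs" might be favored. The "Other" category is left out here because it would go to separate crawlers.

```python
import heapq

# Lower rank is crawled first: updated pages, then redirects, then new URLs.
CATEGORY_RANK = {"b": 0, "c": 1, "a": 2}


def crawl_order(entries):
    """Return URLs in the order a deep crawler might visit them.

    entries is an iterable of (url, category, inbound_links) tuples.
    Categories without a rank (old unchanged, 404, other) are skipped.
    """
    heap = []
    for url, category, inbound_links in entries:
        if category not in CATEGORY_RANK:
            continue
        # Negate the link count so that more inbound links sorts first.
        heapq.heappush(heap, (CATEGORY_RANK[category], -inbound_links, url))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


order = crawl_order([
    ("http://example.com/new", "a", 5),
    ("http://example.com/updated", "b", 1),
    ("http://example.com/moved", "c", 0),
    ("http://example.com/stale", "d", 9),
    ("http://example.com/gone", "e", 0),
    ("http://example.com/popular-update", "b", 7),
])
```

In this sample, the stale and 404 URLs never enter the queue, no matter how many inbound links they have.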

The Other URLs may include dynamic URLs, URLs with session IDs, PDF documents, Word documents, PowerPoint presentations, multimedia files, etc. Google needs to process these further and assess which ones are worth indexing and to what depth. It perhaps allocates the indexing of these to Special Crawlers.

When Google schedules the Deep Crawlers to index New URLs and 301 and 302 redirected URLs, just the URLs (not the descriptions) start appearing in search engine results pages when you run a search.

Since Deep Crawlers need to crawl billions of web pages each month, they take as many as 4 to 8 weeks to index even updated content. New URLs may take longer to index.

Once the Deep Crawlers index the content, it goes into their originating IDCs. The content is then processed, sorted and replicated (synchronized) to the rest of the IDCs. A few years back, when the data size was manageable, this data synchronization used to happen once a month and last for 5 days, a period nicknamed the Google Dance. Nowadays, the data synchronization happens constantly, which some people call Everflux.

When you hit www.google.com from your browser, you can land at any of their 10 IDCs depending upon their speed and availability. Since the data at any given time is slightly different at each IDC, you may get different results at different times or on repeated searches of the same term, thus the name Google Dance.
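Here is a toy model of why the same search can return different results. It assumes a query goes to the fastest data center that is currently available and that each data center holds its own slightly different copy of the index. The DataCenter class, its fields, and the route_query function are hypothetical names for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class DataCenter:
    """A hypothetical IDC holding its own copy of the index."""
    name: str
    latency_ms: float
    available: bool = True
    index: dict = field(default_factory=dict)  # search term -> ranked URLs


def route_query(term, data_centers):
    """Answer a query from the fastest available data center."""
    candidates = [dc for dc in data_centers if dc.available]
    if not candidates:
        raise RuntimeError("no data center available")
    chosen = min(candidates, key=lambda dc: dc.latency_ms)
    return chosen.name, chosen.index.get(term, [])


# Two data centers that have not finished synchronizing yet.
east = DataCenter("east", 20.0, index={"seo": ["a.example", "b.example"]})
west = DataCenter("west", 35.0, index={"seo": ["b.example", "a.example"]})

first_hit = route_query("seo", [east, west])
east.available = False  # the faster data center drops out
second_hit = route_query("seo", [east, west])
```

The two lookups use the same term but are answered by different data centers, so the ranking order differs until the copies are synchronized.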

The bottom line is that one needs to wait as long as 8 to 20 weeks to see full indexing in Google. Unless you can increase the importance of your web pages by getting several high quality incoming links from good sites, there is no way to speed up the indexing process.

Dynamic URLs may take longer to index (sometimes they do not get indexed at all) since even a small data change can create unlimited URLs, which can clutter Google's index with duplicate content.
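To see why session IDs cause duplicate content, consider this small sketch that collapses dynamic URLs to one canonical form. The list of session parameter names is my own assumption, and this is not a description of what Google does. It only shows that two different-looking URLs can point at the same page.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names of query parameters that carry a session ID.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def canonicalize(url: str) -> str:
    """Drop session parameters and sort the rest so duplicates compare equal."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    kept.sort()
    # Lowercase the host and drop the fragment; neither changes the page served.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


visit_one = canonicalize("http://Example.com/item?id=7&PHPSESSID=abc123")
visit_two = canonicalize("http://example.com/item?PHPSESSID=xyz789&id=7")
```

Both visits reduce to the same canonical URL, so a crawler that compared raw URLs would have stored the same page twice.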

Conclusion:

First of all, most of this will be Greek to most people, but for those who have an idea and attempt SEO on their own, it should be a wake-up call. Not only do you need to know what to do and how to do it, you also need patience, lots of patience. Too much tweaking or too much subversive SEO, and you'll end up getting banned.

Patiently wait 4 to 20 weeks for the indexing to happen. And then comes the big work: getting page ranking!
