
Process of website indexing by Google & other Search Engines

There is a lot of speculation about how search engines index websites. The exact workings of the indexing process remain something of a mystery, since most search engines offer limited information about how they architect it. Webmasters get some clues from the crawler visits recorded in their log reports, but they do not know how the indexing happens or which pages of their website were really crawled.

While the speculation about the search engine indexing process may continue, here is a theory, based on experience, research and clues, about how they may be going about indexing 8 to 10 billion web pages every so often, and why there is a delay before newly added pages show up in their index. This discussion is centered around Google, but we believe that most popular search engines like Yahoo and MSN follow a similar pattern.

* Google runs from about 10 Internet Data Centers (IDCs), each having 1000 to 2000 Pentium-3 or
Pentium-4 servers running Linux OS.

* Google has over 200 (some think 'over 1000') crawlers / bots scanning the web each day. These
do not necessarily follow an exclusive pattern, which means different crawlers may visit the
same site on the same day, not knowing that other crawlers have been there before. This is
probably what produces the 'daily visit' record in your traffic log reports, keeping webmasters
very happy about the frequent visits.

* Some crawlers' jobs are only to grab new URLs (let's call them 'URL Grabbers' for convenience)
- The URL grabbers grab the links & URLs they detect on various websites (including links
pointing to your site) and the old/new URLs they detect on your site. They also capture the
'date stamp' of files when they visit your website, so that they can identify 'new content' or
'updated content' pages. The URL grabbers respect your robots.txt file & Robots Meta Tags so
that they can include / exclude URLs you want / do not want indexed. (Note: the same URL with
different session IDs is recorded as different 'unique' URLs. For this reason, session IDs
are best avoided, otherwise the pages can be mistaken for duplicate content.) The URL grabbers
spend very little time & bandwidth on your website, since their job is rather simple. However,
just so you know, they need to scan 8 to 10 billion URLs on the web each month. Not a petty job
in itself, even for 1000 crawlers.
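
To illustrate the session ID note above, here is a minimal sketch of how the same page can be counted as two 'unique' URLs, and how stripping the session parameter collapses them into one. The parameter names in SESSION_PARAMS and the example URLs are our own assumptions for illustration; this is not a description of what Google's crawlers actually do.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of query parameters that commonly carry session IDs.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def strip_session_id(url: str) -> str:
    """Return the URL with known session-ID query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


# The same page requested in two different sessions (made-up URLs).
urls = [
    "http://www.domain.com/page.php?id=7&PHPSESSID=abc123",
    "http://www.domain.com/page.php?id=7&PHPSESSID=xyz789",
]

unique_raw = set(urls)                                # 2 distinct strings
unique_clean = {strip_session_id(u) for u in urls}    # 1 distinct page
```

Taken as raw strings, the two URLs look like two pages with identical content, which is exactly the duplicate-content risk described above.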

* The URL grabbers write the captured URLs with their date stamps and other status in a 'Master
URL List' so that these can be deep-indexed by other special crawlers.

* The master list is then processed and classified somewhat like -
a) New URLs detected
b) Old URLs with new date stamp
c) 301 & 302 redirected URLs
d) Old URLs with old date stamp
e) 404 error URLs
f) Other URLs
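
The classification above could be sketched roughly as follows. The UrlRecord fields, the order of the checks and the file extensions treated as 'Other URLs' are all our own assumptions; the article's theory only names the six categories.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical extensions for documents that need further processing.
DOCUMENT_EXTENSIONS = (".pdf", ".doc", ".ppt")


@dataclass
class UrlRecord:
    url: str
    status: int                           # HTTP status seen by the URL grabber
    date_stamp: str                       # date stamp seen on this visit (ISO date)
    previous_stamp: Optional[str] = None  # None if the URL was never seen before


def classify(rec: UrlRecord) -> str:
    """Assign a master-list record to one of the six categories."""
    if rec.status == 404:
        return "404 error URLs"
    if rec.status in (301, 302):
        return "301 & 302 redirected URLs"
    # Dynamic URLs and non-HTML documents go to 'Other URLs' (assumption).
    if "?" in rec.url or rec.url.lower().endswith(DOCUMENT_EXTENSIONS):
        return "Other URLs"
    if rec.previous_stamp is None:
        return "New URLs detected"
    # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
    if rec.date_stamp > rec.previous_stamp:
        return "Old URLs with new date stamp"
    return "Old URLs with old date stamp"
```

For example, a record for a page first seen today would come back as 'New URLs detected', while the same page revisited with an unchanged date stamp would come back as 'Old URLs with old date stamp'.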

* The real indexing is done by (what we're calling) 'Deep Crawlers'. A deep crawler’s job is to
pick up URLs from the master list and deep crawl each URL and capture all the content - text,
HTML, images, flash etc.

* Priority is given to ‘Old URLs with new date stamp’ as they relate to already indexed but
updated content. ‘301 & 302 redirected URLs’ come next in priority, followed by ‘New URLs
detected’. High priority is given to URLs whose links appear on several other sites. These are
classified as 'important' URLs. Sites and URLs whose date stamp and content change on a
daily or hourly basis are 'stamped' as 'News' sites, which are indexed hourly or even on a
minute-by-minute basis.
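
One way to picture this ordering is a priority queue. The sketch below ranks by category first and breaks ties by the number of incoming links; how the two criteria are actually combined is not known, so that tie-breaking rule, the schedule function and the sample entries are assumptions made for illustration.

```python
import heapq

# Lower rank is crawled first, following the order given in the text.
CATEGORY_RANK = {
    "Old URLs with new date stamp": 0,
    "301 & 302 redirected URLs": 1,
    "New URLs detected": 2,
}


def schedule(entries):
    """Return URLs in deep-crawl order.

    entries: iterable of (url, category, inbound_links) tuples.
    Categories without a rank are left out of the schedule.
    """
    heap = []
    for url, category, inbound_links in entries:
        if category not in CATEGORY_RANK:
            continue
        # Negate the link count so that more links means earlier in the heap.
        heapq.heappush(heap, (CATEGORY_RANK[category], -inbound_links, url))
    order = []
    while heap:
        order.append(heapq.heappop(heap)[2])
    return order


# Made-up master-list entries.
entries = [
    ("http://a/new", "New URLs detected", 50),
    ("http://a/updated", "Old URLs with new date stamp", 1),
    ("http://a/moved", "301 & 302 redirected URLs", 3),
    ("http://a/stale", "Old URLs with old date stamp", 9),
    ("http://a/updated-popular", "Old URLs with new date stamp", 12),
]
crawl_order = schedule(entries)
```

Here the updated page with 12 incoming links is crawled first, the new page is crawled last despite its 50 links, and the stale page is not scheduled at all.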

* ‘Old URLs with old date stamp’ and ‘404 error URLs’ are ignored altogether. There is no
point wasting resources indexing ‘Old URLs with old date stamp’, since the search engine
already has the content indexed and it has not been updated. ‘404 error URLs’ are URLs collected
from various sites that turn out to be broken links or error pages. These URLs do not show any
content.

* The 'Other URLs' may include dynamic URLs, URLs with session IDs, PDF documents,
Word documents, PowerPoint presentations, multimedia files etc. Google needs to further
process these and assess which ones are worth indexing and to what depth. It perhaps allocates
the task of indexing these to 'Special Crawlers'.

* When Google 'schedules' the 'Deep Crawlers' to index 'New URLs' and '301 & 302 redirected
URLs', just the URLs (not the descriptions) start appearing in the search engine result pages
when you run the search "site:www.domain.com" in Google. These are called 'supplemental
results', which means that Deep Crawlers will index the content 'soon', when the crawlers get
the time to do so.

* Since Deep Crawlers need to crawl 'Billions' of web pages each month, they take as many as 4
to 8 weeks to index even updated content. New URLs may take longer to index.
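
A quick back-of-the-envelope calculation gives a feel for the scale. It uses the figures quoted earlier in this article (8 to 10 billion URLs each month, up to 1000 crawlers) and assumes a 30-day month with the load spread evenly; the function name is ours, and the real numbers may differ.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds, assuming a 30-day month


def urls_per_second_per_crawler(urls_per_month: float, crawlers: int) -> float:
    """Average fetch rate each crawler must sustain, with the load spread evenly."""
    return urls_per_month / (crawlers * SECONDS_PER_MONTH)


low = urls_per_second_per_crawler(8e9, 1000)    # about 3.09 URLs per second
high = urls_per_second_per_crawler(10e9, 1000)  # about 3.86 URLs per second
```

Under these assumptions each crawler has to handle roughly 3 to 4 URLs every second, around the clock, for the whole month. With only 200 crawlers the rate would be five times higher.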

* Once the Deep Crawlers index the content, it goes into their originating IDCs. Content is then
processed, sorted and replicated (synchronized) to the rest of the IDCs. A few years back,
when the data size was manageable, this data synchronization used to happen once a month,
lasting for 5 days, and was called the 'Google Dance'. Nowadays, the data synchronization
happens constantly, which some people call 'Everflux'.

* When you hit www.google.com from your browser, you can land at any of their 10 IDCs depending
upon their speed and availability. Since the data at any given time is slightly different at
each IDC, you may get different results at different times or on repeated searches of the same
term (Google Dance).

* Bottom line is that one needs to wait for as long as 8 to 12 weeks to see full indexing in
Google. One should consider this as 'cooking time' in 'Google's kitchen'. Other than increasing
the 'importance' of your web pages by getting several incoming links from good sites,
there is no way to speed up the indexing process, unless you personally know Sergey Brin &
Larry Page, and have a significant influence over them.

* Dynamic URLs may take longer to index (sometimes they do not get indexed at all) since even a
small amount of data can generate an unlimited number of URLs, which can clutter the Google
index with duplicate content.
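
To see how quickly dynamic URLs multiply, consider a single listing script with three query parameters. The script name, parameter names and values below are made up for illustration.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical query parameters for one listing page.
params = {
    "category": ["books", "music", "video"],
    "sort": ["price", "name", "date"],
    "page": [1, 2, 3, 4],
}


def dynamic_urls(base: str, params: dict) -> list:
    """Build every URL that the parameter combinations can produce."""
    keys = list(params)
    return [
        f"{base}?{urlencode(dict(zip(keys, combo)))}"
        for combo in product(*(params[k] for k in keys))
    ]


urls = dynamic_urls("http://www.domain.com/list.php", params)
# 3 categories x 3 sort orders x 4 pages = 36 URLs from a single script
```

Ten values in total already produce 36 crawlable URLs, many of which show the same items in a different order. Add one more parameter, or a session ID, and the count grows again.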

Summary & Advice:

1. Ensure that you have cleared all roadblocks for crawlers and they can freely visit your site
and capture all URLs. Help crawlers by creating good interlinking and sitemaps on your
website.
2. Get lots of good incoming links to your pages from other websites to improve the 'importance'
of your web pages. There is no special need to submit your website to search engines. Links to
your website on other websites are sufficient.
3. Patiently wait for 4 to 12 weeks for the indexing to happen.
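
As a small aid for the first point, here is a minimal sketch that checks a list of URLs against robots.txt rules using Python's standard urllib.robotparser module. The robots.txt content, the URLs and the blocked_urls helper are illustrative assumptions; in practice you would load your own site's file.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (made up for illustration).
robots_txt = """\
User-agent: *
Disallow: /private/
"""


def blocked_urls(robots_txt: str, user_agent: str, urls: list) -> list:
    """Return the URLs that the given crawler is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(user_agent, u)]


urls = [
    "http://www.domain.com/index.html",
    "http://www.domain.com/private/report.html",
]
blocked = blocked_urls(robots_txt, "Googlebot", urls)
```

Any page you want indexed that shows up in the blocked list is a roadblock worth clearing.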

Disclaimer: The actual functioning and exact architecture of the search engines may vary but in essence, this is what we believe they do.


© Copyright 2006, RedAlkemi





Process of website indexing by Google other Search Engines - To learn more about this author, visit Atul Gupta's Website.
