Googlebot

Googlebot and Crawling

The idea that Googlebot crawls a URL the moment it encounters a link is wrong. In fact, URLs are first collected and only visited later. The process by which Google crawls websites is more complex than many people think. Gary Illyes (Analyst at Google) explained this in detail in an episode of Google's SEO podcast.

This behavior can be observed, for example, in a website's server log files. There is more going on than simply requesting the URLs found in link elements: mechanisms such as prioritization and deduplication play an important role.

Although in some situations it may make sense to simply state that Google “follows” the links, in other cases it is better to describe the procedure in more detail.

The following illustration shows how Google actually proceeds when discovering and retrieving new URLs: a downloader fetches the content. Besides text and metadata, it also finds further URLs, which end up in a waiting list. A control component regulates when which URLs are crawled.

Illustration: asynchronous crawling
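
Put into greatly simplified code, the interplay of these three components could look roughly like the sketch below. The names UrlFrontier, download and crawl are my own illustrative choices, not Google's actual implementation, which is far more complex.

```python
from collections import deque

class UrlFrontier:
    """Waiting list: discovered URLs are collected here before they are crawled."""
    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def add(self, url):
        if url not in self.seen:        # simple deduplication at discovery time
            self.seen.add(url)
            self.queue.append(url)

    def next_url(self):
        return self.queue.popleft() if self.queue else None

def download(url):
    """Placeholder downloader: returns page content and the URLs found on the page."""
    # A real crawler would perform an HTTP request and parse the HTML here.
    return f"content of {url}", []

def crawl(seed_urls, max_pages=100):
    """Control component: decides when which URL from the waiting list is fetched."""
    frontier = UrlFrontier()
    for url in seed_urls:
        frontier.add(url)

    crawled = {}
    while len(crawled) < max_pages:
        url = frontier.next_url()
        if url is None:
            break
        content, found_urls = download(url)   # downloader fetches content and metadata
        crawled[url] = content
        for new_url in found_urls:            # new URLs go back into the waiting list
            frontier.add(new_url)
    return crawled

if __name__ == "__main__":
    print(crawl(["https://example.com/"]))
```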

This procedure explains why the order of the links on a page does not determine when the URLs are crawled. John Mueller had already explained this once in 2022.

Anyone who thinks it is enough to create new content once and expect it to be visible “tomorrow” is likely to be proven wrong. As described in other articles, SEO is a medium- to long-term strategy and requires regularly updated content, ideally on a monthly basis. I would be happy to advise you on this in a free initial consultation.

I would be happy to explain in more detail how search engines such as Google crawl websites:

1. Discovery of URLs

The crawling process begins with the discovery of new URLs. Google uses various methods to find new pages:

  • Links from other websites: Googlebot follows links from already crawled pages to new URLs.
  • Sitemaps: Website operators can submit XML sitemaps that contain a list of all the pages on their website.
  • Manual submissions: Site owners can submit individual URLs directly via tools such as Google Search Console.
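
As a rough illustration of the discovery step, here is a small Python sketch that pulls URLs out of an already downloaded page and out of an XML sitemap using only the standard library. The example URLs and markup are invented.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> elements of an already downloaded page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def urls_from_sitemap(sitemap_xml):
    """Reads the <loc> entries of an XML sitemap."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(sitemap_xml)
    return [loc.text for loc in root.findall(".//sm:loc", ns)]

# Illustrative input data
html = '<a href="/blog/">Blog</a> <a href="https://example.org/">External</a>'
extractor = LinkExtractor("https://example.com/")
extractor.feed(html)
print(extractor.links)   # ['https://example.com/blog/', 'https://example.org/']

sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/kontakt/</loc></url>
</urlset>"""
print(urls_from_sitemap(sitemap))
```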

2. Collection and waiting list

As soon as URLs are discovered, they are first collected in an internal database. These URLs are not crawled immediately, but are added to a waiting list. The URLs in this list are prioritized and sorted.
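
A minimal sketch of this collection step might look like this; the waiting_list structure and the example URLs are purely illustrative.

```python
import time

# Discovered URLs are first only recorded, together with where and when they
# were found. Nothing is fetched at this point.
waiting_list = {}

def record_discovery(url, source):
    """Collects a URL in the waiting list instead of crawling it immediately."""
    if url not in waiting_list:
        waiting_list[url] = {"discovered_at": time.time(), "source": source}

record_discovery("https://example.com/neuer-artikel/", source="link")
record_discovery("https://example.com/neuer-artikel/", source="sitemap")  # already known, ignored
record_discovery("https://example.com/kontakt/", source="sitemap")

print(len(waiting_list))  # 2 entries, fetched later in prioritized order
```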

3. Prioritization and deduplication

Google uses various mechanisms to decide when and which URLs should be crawled:

  • Prioritization: URLs are ranked according to relevance, popularity and freshness.
  • Deduplication: Google recognizes duplicate content and ensures that it is not crawled more than once.
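
How prioritization and deduplication could work in principle is shown in the following sketch. The scoring from popularity and freshness values is a made-up placeholder, not Google's actual formula, and the URL normalization is deliberately simplistic.

```python
import heapq
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Very simplified deduplication: treat trivially different URLs as one."""
    parts = urlsplit(url.lower())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))

def priority(url_info):
    """Placeholder score from popularity and freshness signals."""
    return -(url_info["popularity"] + url_info["freshness"])   # heapq pops the smallest value first

queue = []
seen = set()

def schedule(url, url_info):
    key = normalize(url)
    if key in seen:                      # deduplication: crawl each URL only once
        return
    seen.add(key)
    heapq.heappush(queue, (priority(url_info), key))

schedule("https://example.com/Blog/", {"popularity": 8, "freshness": 5})
schedule("https://example.com/blog",  {"popularity": 3, "freshness": 1})  # duplicate, skipped
schedule("https://example.com/news/", {"popularity": 9, "freshness": 9})

while queue:
    _, url = heapq.heappop(queue)
    print("crawl next:", url)            # most important URL first
```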

4. Downloader

The next step in the process is to actually download the content:

  • Texts and metadata: The Googlebot reads and saves the complete content of the page, including text, images and metadata.
  • Other URLs: During crawling, the bot may discover additional URLs, which are then also added to the waiting list.
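
The downloader step can be sketched with Python's standard library as follows; download_page and PageParser are my own illustrative names, and a real crawler would of course also respect robots.txt, handle errors, render JavaScript and much more.

```python
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class PageParser(HTMLParser):
    """Extracts title, meta description and outgoing links from downloaded HTML."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.description = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content", "")
        elif tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def download_page(url):
    """Downloader: fetches the page and returns text, metadata and newly found URLs."""
    with urllib.request.urlopen(url, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    parser = PageParser(url)
    parser.feed(html)
    return {"url": url, "title": parser.title, "description": parser.description,
            "html": html, "found_urls": parser.links}

page = download_page("https://example.com/")
print(page["title"], len(page["found_urls"]))
```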

5. Control component

A dedicated control component regulates access to the URLs waiting to be crawled. Based on various factors, this component decides when which URLs should be crawled or recrawled:

  • Timeliness of content: Websites that are updated frequently are visited more often.
  • Server resources: Google tries to avoid overloading the servers of websites by controlling the frequency of visits.
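
A very simplified control component could look like the following sketch; the per-host delay and the revisit intervals are invented example values, not Google's real crawl scheduling.

```python
import time
from collections import defaultdict
from urllib.parse import urlsplit

class CrawlController:
    """Decides when a URL may be fetched (again) without overloading the host."""
    def __init__(self, min_delay_per_host=5.0):
        self.min_delay = min_delay_per_host          # seconds between requests to one host
        self.last_fetch_per_host = defaultdict(float)
        self.next_due = {}                           # url -> timestamp of next scheduled crawl

    def allowed_now(self, url):
        host = urlsplit(url).netloc
        now = time.time()
        host_ok = now - self.last_fetch_per_host[host] >= self.min_delay
        due = self.next_due.get(url, 0) <= now
        return host_ok and due

    def register_fetch(self, url, content_changed):
        host = urlsplit(url).netloc
        self.last_fetch_per_host[host] = time.time()
        # Frequently updated pages are revisited sooner, stable pages later.
        interval = 3600 if content_changed else 86400
        self.next_due[url] = time.time() + interval

controller = CrawlController()
url = "https://example.com/blog/"
if controller.allowed_now(url):
    controller.register_fetch(url, content_changed=True)
print(controller.next_due)
```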

6. Indexing

The crawling process is followed by indexing and ranking of the content in the search results:

  • Indexing: The crawled content is analyzed and included in the Google index. Important information such as keywords, page titles and metadata is stored in the process.
  • Ranking algorithms: The indexed pages are evaluated and ranked according to numerous criteria in order to display them in the search results.
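
To illustrate the idea of indexing, here is a tiny inverted index with a naive term-frequency “ranking”. Google's actual ranking uses a large number of signals and has nothing to do with this simple counting; the example pages and queries are invented.

```python
import re
from collections import defaultdict

inverted_index = defaultdict(dict)   # term -> {url: term frequency}

def index_page(url, text):
    """Indexing: stores which terms occur on which page."""
    for term in re.findall(r"[a-zäöüß]+", text.lower()):
        inverted_index[term][url] = inverted_index[term].get(url, 0) + 1

def search(query):
    """Very naive ranking: pages with more occurrences of the query terms rank higher."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for url, freq in inverted_index.get(term, {}).items():
            scores[url] += freq
    return sorted(scores, key=scores.get, reverse=True)

index_page("https://example.com/seo/", "SEO ist eine mittel- bis langfristige Strategie")
index_page("https://example.com/ads/", "Anzeigen liefern kurzfristige Sichtbarkeit")
print(search("seo strategie"))   # ['https://example.com/seo/']
```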

7. Monitoring and feedback

Google continuously monitors the status and performance of the crawled pages. Information from this monitoring flows back into the crawling process in order to adjust the priorities and frequencies.
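
The feedback loop can be sketched as follows: the result of each fetch (status code, changed or unchanged content) adjusts how soon the URL is visited again. The doubling and halving of intervals is an invented heuristic for illustration only.

```python
import hashlib

crawl_state = {}   # url -> {"hash": ..., "interval": seconds until the next visit}

def feed_back(url, status_code, html):
    """Feeds monitoring results back into the crawl frequency of a URL."""
    state = crawl_state.setdefault(url, {"hash": None, "interval": 86400})
    if status_code >= 400:
        state["interval"] = min(state["interval"] * 2, 30 * 86400)   # back off on errors
        return
    content_hash = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if content_hash != state["hash"]:
        state["interval"] = max(state["interval"] // 2, 3600)        # changed: visit more often
    else:
        state["interval"] = min(state["interval"] * 2, 30 * 86400)   # unchanged: visit less often
    state["hash"] = content_hash

feed_back("https://example.com/blog/", 200, "<html>neuer Inhalt</html>")
feed_back("https://example.com/blog/", 200, "<html>neuer Inhalt</html>")
print(crawl_state["https://example.com/blog/"]["interval"])   # grows because nothing changed
```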

Conclusion

Google’s crawling process is a complex interplay of discovering, collecting, prioritizing, downloading and indexing websites. Through this structured approach, Google ensures efficient and comprehensive crawling of the internet to display the most relevant content in search results. If you have any questions about specific aspects of this process, I will be happy to provide you with a free initial consultation.
