Every time someone types a question into Google, a result appears within a second. Behind that instant result is a three-stage process running continuously, across billions of web pages, every day.
Understanding how search engines work is the most foundational concept in all of SEO. Every technical decision you make, every piece of content you publish, and every link you earn either helps or hinders one of three stages: crawling, indexing, and ranking.
This guide walks through each stage in plain language, identifies the specific failure points at each one, and explains how Google’s AI systems have changed the process. Start here before anything else in SEO.
Not sure what SEO is yet? Read the beginner's guide first.
Key Terms
Crawler (or Spider): Automated software that browses the web by following links from page to page to discover content.
Googlebot: Google's own web crawler, the program that visits and reads your web pages on Google's behalf. For more details, see the official Google Search documentation.
Index: Google’s database of every page it has assessed and approved to appear in search results.
Algorithm: The rules Google uses to decide which indexed pages appear in which order for any given search.
SERP: Search Engine Results Page — the page you see after typing any query into Google.
Stage 1 — Crawling
Crawling is how Google discovers web pages. Google operates a program called Googlebot — an automated web crawler — that moves continuously across the internet, following links from one page to the next.
Googlebot starts from a seed list: URLs it already knows about, plus new ones submitted by website owners through Google Search Console. It visits each URL, downloads the page content, and scans it for links to other pages. Those linked pages join the crawl queue. The cycle continues, indefinitely, at scale.
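To make the mechanics concrete, here is a minimal sketch of that discovery loop in Python. It illustrates the general breadth-first pattern only, not Google's implementation: the seed list, queue, and page cap are simplified stand-ins, and a real crawler adds robots.txt checks, politeness delays, and scheduling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag found on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, max_pages=50):
    """Breadth-first discovery: fetch a URL, extract its links, queue the new ones."""
    queue = deque(seed_urls)   # the crawl queue, seeded with URLs already known
    seen = set(seed_urls)      # every URL discovered so far
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
        except Exception:
            continue           # unreachable pages are simply skipped in this sketch
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)     # a page nothing links to never reaches this line
                queue.append(absolute)
    return seen


# Usage: discovered = crawl(["https://example.com/"])
```

The detail that matters for SEO sits in the inner loop: a URL only joins the queue because some already-discovered page links to it.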
This is why internal links are non-negotiable. A page with no links pointing to it — from within your own site or from any external site — may never be found by Googlebot. It can be live, fully written, and technically functional. To Google, it may as well not exist.
How Googlebot Decides What to Crawl First
When it comes to prioritisation, Google does not crawl every page on every website every day. Each site receives a crawl budget — a ceiling on how many pages Googlebot will visit in a given period. Large, established, frequently updated sites are allocated more crawl budget and are visited more often. A brand-new website may wait days or weeks between crawl visits.
Crawl budget is not a ranking signal, but it is a practical constraint. A site with 10,000 pages but a tight crawl budget may have important pages sitting undiscovered for weeks. Efficient site architecture and strong internal linking are the primary levers for improving crawl coverage.
One critical clarification: being crawled is not the same as appearing in search results. Crawling is only the first gate.
Stage 2 — Indexing
Once Googlebot has crawled a page, Google decides whether to add it to the index.
The index is Google’s database — a record of every page it has processed and approved for search results. When you search Google, you are not searching the live web. You are searching this database of pre-evaluated pages.
Crawling and indexing are two separate decisions. A page can be crawled and still not be indexed. This happens when Google determines the content is too thin, too similar to pages already in the index, blocked by a technical directive, or inaccessible at the time of the crawl.
What Prevents a Crawled Page from Being Indexed
Google doesn’t simply record raw page text at indexing time. It processes and understands the content — the topic, intent, entities mentioned, links contained, and how the page relates to everything else on the web. Pages that don’t meet Google’s quality threshold at this stage do not make it into the database.
An unindexed page will never appear in search results, regardless of how well-written or well-optimised it is. This makes the indexing decision the most critical checkpoint in the entire pipeline.
To verify whether Google has indexed a specific page, open Google Search Console and use the URL Inspection tool. Enter the full page URL and Google will return its current crawl status, index status, and any detected issues — including the exact reason a page is not indexed, if applicable.
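If you need to check index status for many URLs, the same lookup is available programmatically through the Search Console URL Inspection API. The sketch below is an illustration built on assumptions: it presumes you already hold an OAuth 2.0 access token with the Search Console scope, that SITE_URL matches the property exactly as it is registered in Search Console, and that the response fields shown are still current. Verify the endpoint and field names against Google's API reference before relying on this.

```python
import json
from urllib.request import Request, urlopen

# Assumptions: ACCESS_TOKEN is a valid OAuth 2.0 token with Search Console scope,
# and SITE_URL matches the property exactly as registered in Search Console.
ACCESS_TOKEN = "ya29.your-oauth-token"
SITE_URL = "https://www.example.com/"
PAGE_URL = "https://www.example.com/some-article/"

ENDPOINT = "https://searchconsole.googleapis.com/v1/urlInspection/index:inspect"

payload = json.dumps({"inspectionUrl": PAGE_URL, "siteUrl": SITE_URL}).encode("utf-8")
request = Request(
    ENDPOINT,
    data=payload,
    headers={
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        "Content-Type": "application/json",
    },
)

with urlopen(request) as response:
    result = json.loads(response.read())

# The indexStatusResult block reports coverage, last crawl time, and robots.txt state.
index_status = result["inspectionResult"]["indexStatusResult"]
print(index_status.get("coverageState"))   # e.g. "Submitted and indexed"
print(index_status.get("lastCrawlTime"))
```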
Stage 3 — Ranking
A page can be crawled, indexed, and still rank nowhere near the top of search results. Indexing puts a page in Google’s database. Ranking determines where it appears when someone actually searches.
Google uses a collection of algorithms to order results for every query. These evaluate hundreds of signals simultaneously, but the logic reduces to two questions: Is this page relevant to what the searcher needs? And is this page trustworthy?
The Two Forces Behind Every Ranking Decision
Relevance is determined by how well your content matches what the searcher actually needs — not just the keywords they typed, but the intent behind them. A search for “how does Google rank websites” signals a need for a clear explanation of a process. A product comparison page or a list of paid SEO tools would not satisfy that intent, regardless of how many times the phrase appears on the page.
Authority is built through external validation over time. When credible websites link to your page, Google treats those links as endorsements — signals that your content is worth recommending to searchers. A page with no inbound links from other websites starts with no established authority in Google’s eyes.
Relevance and authority are the two forces at the core of how search engines rank pages. Every other SEO practice either supports one, the other, or both.
Google’s official documentation on how Search works
What Can Block Each Stage
Each stage has specific failure points. Knowing them is how you diagnose any ranking problem — and the correct diagnostic sequence always starts by identifying which stage has broken down.
Crawl Blocks
A file called robots.txt sits at the root of your website and gives direct instructions to web crawlers. A misconfigured Disallow: / directive tells Googlebot to stay off your entire site — a catastrophic outcome that is surprisingly common during website migrations and platform changes. Used correctly, robots.txt blocks only what you intend: duplicate parameter-based URLs, staging environments, admin directories, and internal search result pages that serve no purpose in Google’s index.
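To make the difference concrete, here are two short robots.txt sketches. The blocked paths are hypothetical placeholders; substitute the sections of your own site that genuinely should stay out of the crawl.

The accidental, catastrophic version:

```
# robots.txt that takes the whole site out of the crawl
User-agent: *
Disallow: /
```

And a deliberate version that blocks only what you intend:

```
# robots.txt with example paths for admin pages, internal search, and parameter URLs
User-agent: *
Disallow: /admin/
Disallow: /search/
Disallow: /*?sort=

Sitemap: https://www.example.com/sitemap.xml
```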
Index Blocks
A noindex directive — delivered via a meta robots tag in the page <head> or via an HTTP response header — tells Google not to store a page in its index. Applied intentionally, this is correct practice: paginated filter pages, order confirmation pages, and tag archive pages generally should not appear in search results. Applied to the wrong pages by accident, it removes those pages from Google search for as long as the directive remains in place, with no warning in standard analytics reporting.
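For reference, both delivery methods are a single line. These are standard directives; the pages and file types mentioned are illustrative.

```html
<!-- Meta robots tag, placed inside the page <head> -->
<meta name="robots" content="noindex">
```

Or the equivalent HTTP response header, which also works for non-HTML resources such as PDFs:

```
X-Robots-Tag: noindex
```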
Ranking Blocks
A page that is crawled and indexed but ranks on page 5 or lower has one of three problems: the content does not match the searcher’s intent (a relevance failure), no other credible sites link to it (an authority gap), or Google’s quality classifiers have deprioritised it for thin or duplicated content. Each problem has a distinct fix. Ranking troubleshooting only works if you know which of the three applies.
How Modern AI Changes All Three
Google has incorporated machine learning into its core systems for over a decade. The scope of that integration has accelerated substantially, and it now shapes all three stages of how search engines work.
AI’s Role in Crawling
AI helps Google prioritise the crawl queue — predicting which pages are likely to contain fresh, valuable content worth visiting first. A new page published on an authoritative, frequently updated site may be crawled within hours of going live. The same page on an unknown domain could sit in the queue for weeks. Domain authority, update frequency, and content signals all feed into crawl prioritisation.
The Helpful Content System and Indexing
Google’s Helpful Content System is a machine learning classifier that evaluates whether content is written to genuinely help people or primarily to achieve rankings. Pages that score poorly on this assessment may be indexed but ranked poorly, or their weak signals may suppress the performance of the entire site’s content. It is a site-wide classifier, not a page-level penalty — meaning a large volume of low-quality content affects how Google treats your stronger pages too.
AI Overviews and Ranking
AI Overviews — Google’s generative summaries that appear above standard results for many queries — draw from indexed pages but apply a separate evaluation layer on top of traditional ranking signals. A page sitting in position 3 may be cited in an AI Overview, while the page in position 1 is not. Traditional rank position does not predict AI Overview inclusion. The signals that drive inclusion are different, and optimising for them requires a different content approach.
Frequently Asked Questions
What is the difference between crawling and indexing?
Crawling is the discovery stage — Googlebot visits your page and reads its content. Indexing is the storage stage — Google decides to add that page to its search database. A page can be crawled but not indexed if Google considers the content low quality, duplicate, technically blocked, or inaccessible during the crawl.
How long does it take for Google to index a new page?
Typically between a few hours and several weeks, depending on your site’s crawl frequency, how deep the page sits in your site structure, and whether you have submitted the URL via Google Search Console. High-authority sites with active crawl cycles will see new pages indexed significantly faster than new or low-traffic domains.
Can I check whether my page is indexed by Google?
Yes. Type site:yourdomain.com/your-page-url into Google’s search bar — a result appearing confirms the page is indexed. For full detail including crawl status, detected issues, and mobile usability, use the URL Inspection tool inside Google Search Console. It will also show when Google last crawled the page.
Why does my page appear in Google but rank on page 5?
Being indexed means Google has stored your page — not that it considers it the best result for a query. Low rankings typically indicate a relevance problem (the content does not fully match search intent), an authority gap (few or no quality inbound links), or a content quality issue flagged by Google’s ranking classifiers. Confirming which of the three applies is the starting point for any fix.
What to Read Next
Once you understand how search engines work, the natural next step is learning how to structure your pages so Googlebot can find, crawl, and index them without friction: How to Get Your Pages Crawled and Indexed by Google.
For the complete foundation before going any further, read The Skill Journey Beginner's Complete Guide to SEO.
