Google Indexing: What It Is and How It Works
Google crawls the web and makes a copy of it. This is called an index.
Think of an index you might find at the end of a book. Traditional information retrieval systems (search engines) work similarly when they look up web documents.
But the web is ever-changing and full of duplication, so size isn’t everything, as Google’s Pandu Nayak explained. Google’s goal is to create a “comprehensive index.”
In 2020, the index was “maybe” about 400 billion documents, Nayak said.
“I don’t know in the past three years if there’s been a specific change in the size of the index.”
“Bigger is not necessarily better, because you might fill it with junk.”
“You can keep the size of the index the same if you decrease the amount of junk in it,” Nayak said. “Removing stuff that is not good information” is one way to “improve the quality of the index.”
Nayak also explained the role of the index in information retrieval:
“So when you have a query, you need to go and retrieve documents from the index that match the query. The core of that is the index itself. Remember, the index is for every word, what are the pages on which that word occurs.”
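To make the "for every word, what are the pages on which that word occurs" idea concrete, here is a minimal Python sketch of such a word-to-pages mapping. It is an illustration only, not Google's implementation: the function name `build_inverted_index`, the three toy documents, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization (lowercase + split on whitespace), assumed for brevity.
        for word in text.lower().split():
            index[word].add(doc_id)
    # Store each word's document IDs as a sorted list.
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical toy "web" of three documents.
docs = {
    1: "google crawls the web",
    2: "the index is a copy of the web",
    3: "an inverted index maps words to pages",
}

index = build_inverted_index(docs)
print(index["index"])  # [2, 3]
print(index["web"])    # [1, 2]
```

Looking up a word is then a single dictionary access, which is why the index, rather than the raw pages, sits at the core of retrieval.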
“And so — this is called an inverted index for various reasons. And so the core of the retrieval mechanism is looking at the words in the query, walking down the list — it’s called the postings list — and intersecting the postings list.”
“This is the core retrieval mechanism. And because you can’t walk the lists all the way to the end because it will be too long, you sort the index in such a way that the likely good pages, which are high quality — so sometimes these are sorted by page rank, for example, that’s been done in the past, are sort of earlier in the thing. And once you’ve retrieved enough documents to get it down to tens of thousands, you hope that you have enough documents. So this is the core of the retrieval mechanism, is using the index to walk down these postings lists and intersect them so that all the words in the query are retrieved.”
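The mechanism Nayak describes, walking postings lists that are ordered so the likely good pages come first, intersecting them, and stopping once enough documents are found, can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Google's code: `quality_rank` is a hypothetical table giving every page a unique rank (0 is best), every postings list is assumed to be pre-sorted by that rank, and `limit` stands in for the "tens of thousands" cutoff.

```python
def intersect_postings(postings_lists, quality_rank, limit):
    """Intersect postings lists sorted by ascending quality_rank (0 = best).

    Walks all lists in parallel and stops as soon as `limit` matching
    documents are found, so the lists are never read to the end
    unless that is necessary.
    """
    if not postings_lists:
        return []
    positions = [0] * len(postings_lists)
    results = []
    while len(results) < limit:
        # Stop when any list is exhausted: no further matches are possible.
        if any(pos >= len(lst) for pos, lst in zip(positions, postings_lists)):
            break
        heads = [lst[pos] for pos, lst in zip(positions, postings_lists)]
        ranks = [quality_rank[doc] for doc in heads]
        furthest = max(ranks)
        if all(rank == furthest for rank in ranks):
            # Every list points at the same document: it matches all query words.
            results.append(heads[0])
            positions = [pos + 1 for pos in positions]
        else:
            # Advance each list that is behind the furthest-along list.
            positions = [
                pos + 1 if rank < furthest else pos
                for pos, rank in zip(positions, ranks)
            ]
    return results


# Hypothetical quality ordering: page "a" is the best, "e" the worst.
quality_rank = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}

# Hypothetical postings lists for a two-word query, already sorted by quality.
postings = {
    "google": ["a", "b", "d", "e"],
    "index": ["a", "c", "d", "e"],
}

lists = [postings["google"], postings["index"]]
top_two = intersect_postings(lists, quality_rank, limit=2)
all_matches = intersect_postings(lists, quality_rank, limit=10)
print(top_two)      # ['a', 'd']
print(all_matches)  # ['a', 'd', 'e']
```

Because the lists are ordered by quality, the first matches found are also the highest-ranked ones under this assumed ordering, which is what makes stopping early reasonable.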