Google Search Hack for SEO: Internal Search Engineering Documentation Leak
Internal documentation for Google Search's Content Warehouse API has leaked. An internal version of the outdated Document AI Warehouse documentation was mistakenly made public in a code repository for the client library, and an external automated documentation service also picked it up (link here, until the site gets taken down!).
While there is plenty to discuss here, such as how Google's previous statements to the public and to the DOJ contradict the leaked documentation, we will dive directly into the inferences from the docs and how we can best use this knowledge to improve our sites' rankings.
There are 14,000+ features (not all are ranking factors)
The modules relate to parts of YouTube, Assistant, Books, video search, links, web documents, crawl infrastructure, an internal calendar system, and the People API, so many of these features are not used for ranking at all.
How Panda Works
Google Panda, also known as the Farmer update, aimed to boost high-quality websites and reduce the visibility of low-quality websites in Google's search results. Panda was released under the direction of Amit Singhal, who opposed using machine learning because of its limited observability. Here is what the update focused on:
- Thin content: Pages with very little relevant or substantial text, such as brief descriptions of health conditions.
- Content mismatching search query: Pages that claim to provide relevant answers but fail to deliver, leading to user disappointment, such as a page titled “Coupons for Whole Foods” that doesn’t actually provide coupons.
- Duplicate content: Identical or nearly identical content appearing on multiple pages, either on the same site or across different sites. For example, a business creating similar pages for different cities by just changing the city name.
- Lack of authority: Content from sources not considered reliable or verified. Websites should aim to be authoritative and trustworthy.
- Content farming: Large numbers of low-quality pages created for the sole purpose of ranking in search engines, often aggregated from other sites.
- Low-quality user-generated content (UGC): Poor-quality content from users, such as short guest blog posts with errors and lacking authoritative information.
- High ad-to-content ratio: Pages dominated by ads rather than original content.
- Low-quality content surrounding affiliate links: Thin, unoriginal content wrapped around affiliate links.
- Websites blocked by users: Sites blocked by users in search results or through browser extensions, indicating low quality.
To maintain and improve your rankings, you need to generate more successful clicks across a wider range of queries and earn greater link diversity. The key to achieving this is producing well-researched, quality content.
Key points
Twiddlers are re-ranking functions that run after the primary ranking pass, before results are served. A twiddler can adjust the information retrieval (IR) score of a document or change its ranking directly (a minimal sketch of such a pass follows the list of boosts below).
Here are some of the Boosts identified in the docs:
- NavBoost: Ranking system based on click logs of user behavior.
- QualityBoost: Ranking system based on the quality of the content, determined by a ton of factors.
- RealTimeBoost: Ranking system based on the freshness of the content, mostly determined by the publishing/update date and other dates mentioned in the content.
- WebImageBoost: Ranking system based on images present in the document.
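To make the twiddler idea concrete, here is a minimal sketch of a click-driven re-ranking pass. Everything in it is hypothetical: ScoredDoc, nav_boost_twiddler, and the multipliers are invented for illustration; the leak confirms only that twiddlers like NavBoost exist, not how they are implemented.

```python
from dataclasses import dataclass

@dataclass
class ScoredDoc:
    url: str
    ir_score: float  # score assigned by the primary ranking pass

def nav_boost_twiddler(docs, click_rates):
    """Hypothetical twiddler: nudge IR scores using per-URL click data,
    then re-rank. Thresholds and multipliers are invented."""
    for doc in docs:
        rate = click_rates.get(doc.url, 0.0)
        if rate > 0.30:    # strong positive click signal: boost
            doc.ir_score *= 1.2
        elif rate < 0.05:  # users consistently skip this result: demote
            doc.ir_score *= 0.8
    return sorted(docs, key=lambda d: d.ir_score, reverse=True)

# A result with strong clicks can overtake one with a higher raw IR score.
results = [ScoredDoc("a.com/p", 10.0), ScoredDoc("b.com/q", 9.0)]
print(nav_boost_twiddler(results, {"b.com/q": 0.45}))
```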
While there is no information on the weights of these parameters to tell how much a particular feature impacts ranking, here are some things we can infer:
Authors are an explicit feature
Google explicitly stores the authors of a document as text. It also checks whether an entity mentioned on the page is the author of the page. The attention given to this makes it clear that author information plays a part in rankings.
Backlinks ARE STILL IMPORTANT
The leak suggests that link value depends on which tier of Google's index the linking page sits in, with the freshest and most frequently served content living in the fastest storage tier. Effectively, this is saying the higher the tier, the more valuable the link.
Pages that are considered “fresh” are also considered high quality.
Anchors created during a spike are tagged with LINK_SPAM_PHRASE_SPIKE. So you should not be acquiring hundreds of backlinks in a very short time, or the weight of those links could be reduced, or even set to zero, creating little to no impact.
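Here is a toy sketch of how a phrase-spike detector could flag suspicious days of link acquisition. The thresholds (multiplier, floor) and the helper name are invented; the leak only tells us that anchors from such spikes get tagged and likely devalued.

```python
from collections import Counter

def spike_days(anchor_dates, multiplier=10, floor=20):
    """Return days whose new-anchor volume looks like a spam spike.

    anchor_dates: iterable of 'YYYY-MM-DD' strings, one per new anchor.
    A day is flagged when its count dwarfs the average daily volume.
    """
    if not anchor_dates:
        return []
    per_day = Counter(anchor_dates)
    avg = sum(per_day.values()) / len(per_day)
    return [day for day, n in per_day.items()
            if n > max(floor, multiplier * avg)]
```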
While the disavow data could be stored somewhere else, there is no mention of disavow in the API. There is a possibility that disavow has been a crowd-sourced engineering effort to train Google's spam classifiers.
Homepage PageRank is considered for all pages
Every document has its homepage PageRank associated with it. This HomePageRank and siteAuthority are likely used as a proxy for new pages until they capture their own PageRank.
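As a sketch of that proxy idea, assuming a simple linear fallback (the weights and the function name effective_pagerank are invented, not from the leak):

```python
def effective_pagerank(page_pr, homepage_pr, site_authority):
    """Hypothetical fallback: a new page with no PageRank of its own
    inherits a proxy score from its homepage PageRank and site-level
    authority until it accumulates its own signal."""
    if page_pr is not None:
        return page_pr
    # The 0.7 / 0.3 blend is an assumption for illustration only.
    return 0.7 * homepage_pr + 0.3 * site_authority
```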
Font Size of Terms and Links Matters
Formatting matters. All the italicizing, bolding, and font-size variation we do has an overall impact on ranking. Google tracks the average weighted font size of terms in documents, and of anchor text in links.
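A toy illustration of what an "average weighted font size" measure could look like; the actual formula is not in the leak:

```python
def avg_weighted_font_size(term_runs):
    """Average font size across a document's terms, weighted by how many
    terms appear at each size. term_runs: list of (term_count, font_size_px).
    Purely illustrative bookkeeping, not Google's formula."""
    total_terms = sum(count for count, _ in term_runs)
    return sum(count * size for count, size in term_runs) / total_terms

# e.g. 900 body terms at 16px plus a 100-term heading block at 24px
print(avg_weighted_font_size([(900, 16), (100, 24)]))  # -> 16.8
```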
Documents Get Truncated
The documentation indicates that there is a maximum number of tokens considered for a document in the Mustang system, emphasizing that authors should place their most important content early.
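A minimal sketch of the practical effect, assuming a placeholder token limit (the real Mustang cutoff is not public):

```python
def mustang_view(tokens, max_tokens=100_000):
    """Illustrative truncation: only the first max_tokens tokens of a
    document are considered; 100,000 is a made-up placeholder. Anything
    past the cutoff is invisible to scoring, which is why the most
    important content should appear early."""
    return tokens[:max_tokens]
```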
Page Titles Are Still Measured Against Queries
Placing your target keywords in the Title is still the move.
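The leak confirms a title-to-query measure exists, but not how it is computed. A naive term-overlap score stands in for it here, purely for illustration:

```python
def title_match_score(title: str, query: str) -> float:
    """Toy stand-in for a title-vs-query measure: the fraction of query
    terms that appear verbatim in the title. Not Google's formula."""
    title_terms = set(title.lower().split())
    query_terms = set(query.lower().split())
    if not query_terms:
        return 0.0
    return len(title_terms & query_terms) / len(query_terms)

print(title_match_score("Best Coffee Grinders of 2024", "best coffee grinder"))
```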
There Are No Character Counting Measures
There is no metric in this dataset that counts the length of page titles. The only character-counting measure is snippetPrefixCharCount, which Google needs in order to work out what fits in a featured snippet!
Dates are Very Important
Google tries to store various dates from a document to determine the freshness of the content (a toy extraction sketch follows the list):
- bylineDate: the date explicitly set on the page, shown in the search results.
- syntacticDate: a date extracted from the URL or the title.
- semanticDate: a date derived from the content of the page.
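A toy sketch of how a syntacticDate might be pulled out of a URL (the regex and helper name are assumptions, not Google's code). The practical takeaway is to keep the byline date, URL date, and in-content dates consistent:

```python
import re
from datetime import datetime

def syntactic_date(url: str):
    """Extract a date from a URL path like /2024/05/28/post-title/.
    A rough stand-in for syntacticDate extraction."""
    m = re.search(r"/(\d{4})/(\d{1,2})/(\d{1,2})/", url)
    return datetime(*map(int, m.groups())) if m else None

print(syntactic_date("https://example.com/2024/05/28/leak-analysis/"))
```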
Domain Registration Info is Stored About the Pages
While there have been a lot of arguments about whether domain age plays any role in rankings, we now know that Google stores the latest registration information at the composite-document level.
It may also be used to sandbox a previously registered domain that has changed ownership.
A lot of grey-hat strategies that involve buying reputable expired domains should be a thing of the past, with expired-domain-abuse policies taking this data into consideration.
There are Gold Standard Documents
The description mentions "human-labeled documents" versus "automatically labeled annotations." Human-labeled documents might be used to up-rank or de-rank certain pages or sites. Maybe this is why some sites always rank so easily (Google could, of course, give numerous explanations as to why a particular page needed manual labeling).
Site Embeddings Are Used to Measure How On-Topic a Page is
Google is specifically vectorizing pages and sites and comparing the page embeddings to the site embeddings to see how off-topic the page is. This is where a site's topical authority and a page's relevance to that topic come into the picture.
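A minimal sketch of the comparison, assuming cosine similarity between precomputed vectors (how Google actually builds and compares these embeddings is not in the leak):

```python
import numpy as np

def on_topic_score(page_embedding, site_embedding):
    """Cosine similarity between a page vector and its site vector.
    Scores near 1.0 mean the page is on-topic for the site; low scores
    mean it drifts off-topic. Where Google thresholds this is unknown."""
    page = np.asarray(page_embedding, dtype=float)
    site = np.asarray(site_embedding, dtype=float)
    return float(page @ site / (np.linalg.norm(page) * np.linalg.norm(site)))
```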
Google Identifies and Tags Small Sites
While we do not know how this flag is used, Google has a specific flag that indicates whether a site is a "small personal site."
Demotions
There is a series of algorithmic demotions that Google has implemented to control spam:
Anchor Mismatch:
When a link does not match the target site it’s linking to, it is demoted in the calculations. Google looks for relevance on both sides of a link.
SERP Demotion:
A demotion based on signals observed from the SERP that suggest potential user dissatisfaction with the page.
Nav Demotion:
Demotion applied to pages exhibiting poor navigation practices or user experience issues.
Exact Match Domains Demotion:
A demotion to ensure that exact match domains would not get as much value as they did historically.
Special categorization:
Video Focused Sites are Treated Differently
If more than 50% of the pages on a site have video on them (video hosted natively, not embedded from other sites), the site is considered video-focused and will be treated differently.
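The stated rule reduces to a simple threshold check, sketched here (the "different treatment" itself is not specified in the leak):

```python
def is_video_focused(pages):
    """pages: list of booleans, True if a page carries a natively hosted
    video. Mirrors the >50% threshold described above."""
    if not pages:
        return False
    return sum(pages) / len(pages) > 0.5
```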
Your Money Your Life is specifically scored
Google has classifiers that generate scores for YMYL Health and for YMYL News. Since these topics impact the lives of people significantly, they have special categorization.
Google Lies (Probably?)
While we know that these metrics are collected, we cannot say with any certainty what weights they carry (which could even be zero!)
WE DON’T USE CLICKS FOR RANKINGS
NavBoost is a system that employs click-driven measures to boost, demote, or otherwise reinforce a ranking in Web Search.
WE DON’T USE ANYTHING FROM CHROME FOR RANKING
One of the modules related to page quality scores features a site-level measure of views from Chrome.
WE DON’T HAVE ANYTHING LIKE DOMAIN AUTHORITY
We have seen people say that DA and DR are metrics made up by popular SEO tools to sell data. While the calculations might be very different, we now know that Google uses something similar (siteAuthority), which might be taken into consideration for pages that do not yet have a PageRank of their own.
THERE IS NO SANDBOX
The documentation indicates an attribute called hostAge that is used specifically “to sandbox fresh spam in serving time.”
Notes
I was reading about the topic and wrote this as an exercise.
I broke some things down to explain them in a simpler way.