Sat Dec 24 2022 · 1,505 words
Indexing and crawling can be one of the most overlooked areas of SEO. On large websites with thousands or millions of pages, indexing issues can cause existing pages to rank poorly and new pages to go undiscovered.
Good indexing is about getting all of your unique pages crawled as often and as effectively as possible, using crawler resources efficiently, maximizing the distribution of page rank, and excluding any duplicate pages.
An easy way to look at duplicate pages is to think of them as draining page rank, which in turn reduces the position of your unique pages in search results. The following diagram gives a basic example of this.
There are usually more factors in play, but the concept is the same. Duplicate pages waste crawler resources, slow the crawling of unique pages, dilute relevance both per page and site-wide, and in turn reduce ranking and organic traffic from search engines.
Some websites suffer from extreme cases of duplicate content where 95% or more of their pages are duplicates. Instead of an average position of, let's say, 10 in search results, the average position could be 100+.
There are two approaches to dealing with the indexing of duplicate pages: a proactive approach and a reactive approach. The proactive approach is the best strategy to follow.
If you index all your pages by default, any duplicate pages that arise will be indexed. This is the reactive approach. On large sites you can find yourself constantly chasing down duplicate pages, wasting time and resources, and if you don't stay on top of it the ranking of your whole site can suffer. Duplicate pages can appear and go unnoticed for a very long time, even if you check tools like Search Console frequently.
<meta name="robots" content="index,follow">
Noindex your pages by default. If you default all your pages to noindex and whitelist only the pages you want indexed, any duplicate pages that arise will not be indexed. The only caveat is that when new pages are added to your site and you want them indexed, you must remember to apply index rules to them. It's easy to add this check / task to your workflow.
<meta name="robots" content="noindex,nofollow">
The complexity comes when you have to take URL parameters into account.
URL parameters are one of the biggest causes of duplicate content. They can balloon the number of pages on your site and, if left unchecked, destroy your ranking.
Search engines treat URLs with different params as separate pages, even if only one character is different, e.g.
/articles?page=1
/articles?page=2
URL params can cause huge issues for a number of reasons.
Avoid setting any sort of random or incremental URL params (pagination is fine), or, if you absolutely have to, make sure those pages are not indexed. Take the following random URL param example:
/cart?id=543548374276
/cart?id=763271638212
Technically the number of cart pages here is infinite when there should only be one page. You might think a search engine cannot reach and crawl these random URLs, but in fact it can. Links posted in public forums or on other sites can lead search engines to these pages, causing them to be crawled and indexed. The same goes for incremental URL params.
/cart?id=1
/cart?id=2
Using values like this can get completely out of control leading to hundreds, thousands, or even millions of duplicate pages.
Setting a canonical tag can point search engines to the proper version of the page, but it still does not solve the issue of duplicate pages being crawled and indexed. A robots meta tag should always be used to de-index these duplicate pages.
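As a rough sketch, using the cart example above, a hypothetical helper could render both tags into the head of a duplicate URL (the domain and function name are illustrative):

// Hypothetical helper: a duplicate URL gets a canonical tag pointing at the
// proper version of the page, plus a robots meta tag so it is not indexed.
function duplicatePageHeadTags(canonicalUrl: string): string {
  return [
    `<link rel="canonical" href="${canonicalUrl}">`,
    `<meta name="robots" content="noindex,nofollow">`,
  ].join("\n");
}

// e.g. for /cart?id=543548374276 the canonical target is the single cart page:
// duplicatePageHeadTags("https://example.com/cart")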
A lot of web pages have filter logic. Take an articles landing page, for example, that allows you to filter and sort a list of articles. There may be a number of filters, and they can be combined in different ways, which can cause duplicate issues. Take the following example: the params and page content are the same but the order is different, so these will be treated as separate pages.
/articles?category=web&sort=1
/articles?sort=1&category=web
The same goes for the order of values within a URL param:
/articles?category=web|mobile
/articles?category=mobile|web
You should enforce a consistent order for params and their values; if you don't, it's extremely easy to introduce duplicate pages. A good strategy is to always order params and their values alphabetically. If for some reason a user visits a page where the alphabetical order is not followed, noindex that page.
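A minimal sketch of that normalization, assuming multi-value params are joined with | as in the examples above (the function name is illustrative):

// Sort param names alphabetically, and sort the values inside each
// multi-value param, so every combination maps to one canonical query string.
function normalizeSearchParams(search: string): string {
  const params = new URLSearchParams(search);
  const entries: [string, string][] = [...params.entries()].map(([key, value]) => {
    const sortedValue = value.split("|").sort().join("|");
    return [key, sortedValue] as [string, string];
  });
  entries.sort(([a], [b]) => a.localeCompare(b));
  return new URLSearchParams(entries).toString();
}

// normalizeSearchParams("sort=1&category=web|mobile")
//   -> "category=mobile%7Cweb&sort=1" (%7C is the encoded |)

If a request arrives with a query string that doesn't match its normalized form, that page can be noindexed as described above.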
Any URL params should be validated to ensure the values they contain are valid. If a value is not valid and a 404 response is not returned, more than likely a soft 404 will occur. Take for instance a pagination param with an invalid value:
/articles?page=invalid
Or a pagination param that is outside the max page range. For example if a pagination set has 100 pages but you go beyond this:
/articles?page=101
These soft 404 issues can also occur on pages that use dynamic URL slugs, if the slug is not validated and a 404 response returned when it doesn't match an existing resource.
/article/non-existing-article
Always validate params and slugs, otherwise you end up in a similar situation to using random values in params: an essentially infinite number of duplicate pages that can be indexed, which will heavily impact your site ranking.
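A minimal sketch of that validation, assuming the page count comes from your data layer and slugs can be looked up; the names here are illustrative:

// Validate a pagination param against the real page count. Anything invalid
// should lead to a real 404 response rather than a rendered empty page (a soft 404).
function validatePageParam(raw: string | null, totalPages: number): number | null {
  const page = Number(raw ?? "1");
  if (!Number.isInteger(page) || page < 1 || page > totalPages) {
    return null; // caller responds with 404
  }
  return page;
}

// validatePageParam("invalid", 100) -> null (404)
// validatePageParam("101", 100)     -> null (404)
// validatePageParam("7", 100)       -> 7

// Same idea for dynamic slugs: look the slug up and 404 when nothing matches.
function articleExists(slug: string, knownSlugs: Set<string>): boolean {
  return knownSlugs.has(slug);
}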
A common pitfall when dealing with the crawling of pages you don't want indexed is to block these pages in robots.txt and add a noindex tag after they have already been crawled and indexed by search engines. This creates an issue: once a page is blocked in robots.txt, search engines can no longer crawl it, so they never see the noindex tag and the page can stay in the index.
Before adding any pages to robots.txt, make sure they are not already indexed. Most of the time you don't need robots.txt entries for specific pages at all; you can just add a robots meta tag with noindex,nofollow to stop the pages being indexed in search, or to de-index them if they already are.
Always be careful when adding rules to robots.txt so you don't block pages from being crawled that need to be.
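For example, if you wanted to stop the cart URLs from earlier being crawled at all, a rule like the one below should only be added after those pages have dropped out of the index; until then the noindex meta tag needs to stay reachable by the crawler:

User-agent: *
Disallow: /cart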