Very easy to create, not so easy to get rid of, and quite tangibly harmful to your website—that’s how we can describe duplicate pages. Why is duplicate content bad, and how can it appear on a website without you even knowing? We’ll cover these questions in this article.
Why are duplicate pages bad?
There are five significant reasons why duplicate pages damage your website.
Any search engine is, essentially, a business, and like any business, it doesn’t want to waste effort for nothing. That’s why each website gets a crawling budget: a limit on the number of pages search robots will crawl and index.
With that said, the first reason to avoid duplicate pages is crawling budget efficiency: while crawling duplicates, search bots can miss some important pages.
Crawling issues, in turn, lead to indexing issues—and this is your second reason to get rid of duplicates. If an important web page doesn’t get crawled, it won’t make it into the index. The only remedy is to fix the problem and wait for reindexing, which can take a while, especially if you’re working on a new site.
The third reason is the risk of keyword cannibalization: different pages competing against each other for the same target keywords. Imagine walking into a new supermarket and seeing a bread section sign in both the right corner and the left corner. You’d probably wonder where the bread actually is, and why the supermarket put up two signs to confuse you. The same thing happens when a search engine spots several identical pages: it doesn’t know which one to rank. So, you’d better not confuse the search crawlers.
Note that duplicate pages are not the only cause of keyword cannibalization. Duplicate titles or H1 headings, the same keywords repeated heavily in the content, and external links with a keyword in the anchor text pointing to a non-target page can all lead to the same problem.
The fourth reason is that external links may end up pointing to duplicate pages instead of the main page versions. This can also aggravate the cannibalization problem; we’ll return to this later.
Finally, the fifth reason is Google’s Panda algorithm that can penalize a website for duplicate content.
Types of duplicates
There are complete and partial page duplicates. The former refers to several absolutely identical web pages. For example:
- The main site mirror isn’t specified, and the main page is accessible both as https://site.com/ and https://www.site.com/
- The page is accessible both with a trailing slash (/) and without it: site.com/page/ and site.com/page
- The page is accessible with both upper-case and lower-case characters: site.com/PAGE and site.com/page
- The page is accessible with both the category specified in the URL and without it: site.com/phone/iphone/ and site.com/iphone/
- The page is accessible both with an extension like .html, .htm, .php, .aspx and without one: https://site.com/page and https://site.com/page.php
- The page is accessible with a different number of slash symbols in its URL: https://site.com/page/, https://site.com/page/////////////, or https://site.com///page/
- The page is accessible with additional symbols in its URL: https://site.com/page/, https://site.com/page/cat/, or https://site.com/page/*
- Several of the above mentioned options combined
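Most of these technical variants can be collapsed with server-side 301 redirects. Below is a minimal sketch for Apache’s .htaccess, assuming mod_rewrite is enabled and the non-www version is your main mirror; adapt the rules to your own setup:

```apache
# Sketch only: assumes Apache with mod_rewrite and a non-www main mirror
RewriteEngine On

# Redirect www.site.com/* to site.com/*
RewriteCond %{HTTP_HOST} ^www\.(.+)$ [NC]
RewriteRule ^ https://%1%{REQUEST_URI} [R=301,L]

# Add a trailing slash to URLs that don't point to a real file
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.*[^/])$ /$1/ [R=301,L]
```

The same normalization can usually be configured in nginx or directly in your CMS.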
Partial duplicates are similar pages designed for the same user intent and with the same goal in mind. They share semantics and compete against each other, which results in keyword cannibalization. For example:
- The URL site.com/phone/?price=min, set for sorting items by price, leads to partial content duplication.
- The print version of a page is, essentially, its duplicate.
- Identical content blocks, such as comments displayed across multiple pages, also lead to partial duplication.
The reasons why websites get duplicate pages
Content manager fault
The same content can simply be put on a website twice just by mistake. Fortunately, it’s easy to avoid such situations.
First of all, it always makes sense to develop and follow a content plan so that you can track the content that gets published on the website. It’s also important to regularly re-check your content to make sure there are no duplicates or cannibalization issues.
If you do end up with duplicate content being added on the website and indexed by search engines, you need to set the main version of the page and remove all the others. We’ll explain this process in more detail later.
Parameters in the URL
Most often, it’s URL parameters that cause content duplication and an ineffective crawling budget expenditure. The following contribute to duplication:
- Filtering options for displaying content (list view, grid view, etc.)
- UTM tags
- Filtering options for items presented on a website
- Parameters for several sorting options (by price, by date, etc.)
- Incorrect pagination
- Other technical information included in URL parameters
One common solution is canonicalizing a page. This way, all pages containing parameters will point to a page that doesn’t have any as the canonical one. For example, https://seranking.com/?sort=desc contains <link rel="canonical" href="https://seranking.com/"/>.
Another common solution is using the robots meta tag or X-Robots-Tag with the noindex attribute to prevent unnecessary pages from being indexed.
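The meta tag version looks like this on any page you want kept out of the index (the "follow" value, which lets crawlers still follow the page’s links, is a common but optional addition):

```html
<!-- Placed in the <head> of a parameterized page you don't want indexed -->
<meta name="robots" content="noindex, follow">
```

The equivalent HTTP response header is X-Robots-Tag: noindex, which also works for non-HTML resources such as PDFs.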
Personally, I prefer canonicalizing a page when it comes to URL parameters. But keep in mind that canonical is only a recommendation for search engines.
To resolve filtering issues, I suggest replacing <a> tags with <span> tags for those filters that invariably create duplicate pages. This is a more complex solution that saves the crawling budget but requires assigning the task to a developer.
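A simplified before/after sketch of that idea (the class name, data attribute, and the client-side script that actually applies the filter are hypothetical):

```html
<!-- Before: each filter value is a crawlable link producing a parameterized URL -->
<a href="/t-shirts/?color=red">Red</a>

<!-- After: no URL for crawlers to follow; JavaScript applies the filter on click -->
<span class="js-filter" data-color="red">Red</span>
```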
When it comes to similar products, like one t-shirt model in different colors, I recommend using a single product page with color choice options on it. This way, you’ll minimize the number of duplicates and help users find the product they need without visiting several similar pages. It will also save the website’s crawling budget and help you avoid keyword cannibalization.
Localized site versions
Sometimes, websites use folders instead of subdomains for their regional versions, and these folders contain the same content. I think it’s best to use subdomains instead, but there are also other ways to avoid page duplicates when you have several site versions.
If you do use folders on an e-commerce site, you should at least create unique title meta tags and H1 headings, and display products differently on each site version. There’s no guarantee that search robots will crawl the pages correctly, so consider putting additional effort into making the content more unique.
It’s easier with a website that doesn’t sell products but promotes certain services. If you have pages targeted at different countries or cities, just write unique content for each specific location.
Using hreflang helps eliminate partial duplicates, but it’s crucial to apply the attribute carefully and monitor the results.
Some websites have local domains with the same or similar content. In this case, it’s reasonable to rewrite the content taking into account each location’s specifics and also set the hreflang attributes.
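As a reminder, hreflang annotations must be reciprocal: every version of the page lists all versions, including itself. A sketch with hypothetical URLs:

```html
<link rel="alternate" hreflang="en-us" href="https://site.com/us/page/">
<link rel="alternate" hreflang="en-gb" href="https://site.com/uk/page/">
<link rel="alternate" hreflang="x-default" href="https://site.com/page/">
```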
Product accessibility across different categories
Usually, products are added in several categories on e-commerce websites. This can cause duplication if the URL contains a complete path to a product: for example, https://site.com/t-shirt/nike/t-shirt-best.html and https://site.com/t-shirt/red/t-shirt-best.html.
You can solve this issue in a CMS, by setting a single URL for each product available under different categories. Or, by using the canonical tag.
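With the canonical approach, both category paths from the example above would point to a single main version; the category-free URL shown here is hypothetical:

```html
<!-- Placed on both /t-shirt/nike/t-shirt-best.html and /t-shirt/red/t-shirt-best.html -->
<link rel="canonical" href="https://site.com/t-shirt-best.html"/>
```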
Technical problems
Technical problems are one of the most common reasons for duplication. They often occur in less popular or custom-built CMSs. An SEO specialist should keep track of the parameters that might lead to duplication (website mirrors, slash symbols in URLs, etc.). I’ve mentioned all the potential issues to track earlier.
How to avoid duplicate pages
When creating a website, you can prevent unnecessary URLs from being crawled with the help of the robots.txt file. Note, however, that you should always check it with the robots.txt Tester in Google Search Console to make sure it doesn’t block important pages from search crawlers.
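A minimal robots.txt sketch that blocks parameterized and print URLs from crawling (the parameter and folder names here are hypothetical; match them to your own site):

```
User-agent: *
Disallow: /*?sort=
Disallow: /*?price=
Disallow: /print/
```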
Also, you should block unnecessary pages from indexing with the help of <meta name="robots" content="noindex"> or the X-Robots-Tag: noindex header in the server response. These are the easiest and most common ways to avoid problems with duplicate page indexing.
Important! If search engines have already seen the pages you don’t need them to see and you’re using the canonical tag or the noindex directive to solve the problem, wait until search robots crawl the pages and only then block them in the robots.txt file. Otherwise, the crawler won’t see the canonical tag or the noindex directive.
How to find duplicate pages using SE Ranking
SE Ranking’s Website Audit tool can help you identify all page duplicates. In the Duplicate Content section, you’ll find a list of pages that contain the same content, as well as a list of technical duplicates: pages accessible with and without www, with and without slash symbols, etc. If you’ve used canonical to solve the duplication problem but happen to specify several canonical URLs, the audit will highlight this mistake as well.
Choosing the main version of a page
If you’ve detected duplicate content on your website, don’t rush into just removing it. First, check the following:
- Define which page is ranking better for a target keyword. You can do so in several ways:
- By using the site:your-url keyword search operator. If a page other than one of the duplicates ranks first, it means the duplicate pages need to be better optimized. If you don’t know the target keyword for a given page, you can find it in the title meta tag.
- By using Google Search Console. Go to the Performance report that allows filtering by query.
- By using SE Ranking, in the Rankings module, you can specify a target URL for each added keyword. If the URL ranking for the keyword and the specified target URL are two different pages, the link icon will be marked red.
You’ll also see if URLs shown in search results constantly replace each other: the system will show a number next to the link icon indicating how many web pages compete against each other. By clicking on the icon, you’ll see these pages and their title texts.
You can track how URLs change their rankings under the SERP Competitors section of the My Competitors tool. This way, you’ll easily spot a cannibalization problem if there is one.
- Define the number of external links pointing to each of the duplicate pages.
- Define the number of keywords those pages rank for. Use Google Search Console and filter the data by page.
- Define how much traffic those pages receive and what their bounce and conversion rates are. To do so, use Google Analytics or SE Ranking’s Analytics & Traffic.
- Decide which page to leave on a website based on all the mentioned factors.
After removing a duplicate page, you should set a 301 redirect from it to the main version of the page. Then, audit your website again to find internal links pointing to the deleted page, and replace them with the main page’s URL. In SE Ranking’s Website Audit, you’ll find this information under the HTTP Status Code section.
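On Apache, such a redirect can be as simple as one mod_alias line in .htaccess (the URLs are placeholders):

```apache
Redirect 301 /duplicate-page/ https://site.com/main-page/
```

nginx and most CMSs offer equivalent redirect settings.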
Avoiding duplicates is non-negotiable
It’s evident that duplicate pages can damage a website, so you shouldn’t underestimate their impact. Once you understand where the problem can come from, you’ll be able to control your web pages and avoid duplicates with ease. If you do end up with duplicate pages, it’s crucial to take action in time. Website audits, setting target URLs for keywords, and regular ranking checks will help you spot the issue as soon as it appears.
Have you ever struggled with duplicate pages? How have you managed the situation? Share your experience with us in the comment section!