

Duplicate Content: How to Find and Fix It For Your SEO

Make your content unique to your site. Duplicate information can have a big impact on your search rankings. Here's your guide on how to find duplicate content and fix it for your SEO strategy.

If there's one thing keeping website owners up at night, it's how to deal with duplicate content. There are lots of causes of duplicate content - some you can avoid, and some you can't.

But the impact of duplicate content on your SEO is no secret: Google doesn't like it.

And when Google doesn't like something, it hurts your search rankings.

The reality is there's a lot of duplicate content out there. According to Matt Cutts, 25 to 30% of the web is duplicate content.

To back that up, a recent study by Raven Tools found that 29% of pages using their site audit tool had duplicate content.

In other words, it's highly likely your site has duplicate content which is impacting your search performance.

But how do you know where your duplicate content is? And most importantly, what steps can you take to fix it?

This article will help you to understand the many causes of duplicate content, find duplicate content on your website or external websites, and take action.

 

What is duplicate content?

Duplicate content refers to content that’s exactly the same or very similar to content on other websites or on different pages on the same website.

Sometimes the content is word-for-word the same, and may have been copied and pasted from one page to the next. It is exactly the same text.

Other times, duplicate content is very similar to content on another page. It could be content that has been rewritten and reworded slightly - think of this as near duplicate content.

 

Why is duplicate content bad for SEO?

Duplicate content is bad for several important reasons - mostly because Google doesn't like it.

Google ranks down pages with duplicate content.

Google is all about providing unique, valuable content to its users. In fact, Google states that:

“Google tries hard to index and show pages with distinct information”.

So naturally, it doesn't want to rank pages with duplicate content. That wouldn't be providing Google's users with the best user experience.

Also, when there are several versions of the same content available, it's hard for search engines to determine which version they should actually index and show in their search results.

This lowers performance for all versions of that content, because each duplicate content version is competing with the next.

This means your organic traffic drops for each page - which can be extremely detrimental to conversions on the pages you want to rank highly.

You will have fewer indexed pages

This is especially important for websites with lots of pages, like large brand sites or eCommerce sites.

Sometimes Google doesn’t just rank down duplicate content - the search engines will not rank it, period.

Google and other search engines have crawlers that go through websites and gather data to build an index. Crawling and indexing is everything in search engine optimization. If the crawlers can’t crawl your website, they cannot index or rank it, which means it won’t be shown to searchers. 

Here's the thing:

Google only allocates a certain amount of crawling per website, aka the crawl budget.

So, if your site has duplicate pages and you are wasting your valuable crawl budget on crawling and indexing duplicate content, it means other important web pages won't get crawled, indexed and ranked.

Your link juice is diluted

Let's say you have two versions of the same content. External websites are linking back to that content, giving you all-important backlink authority. But because you have two versions of the same page, some websites link to one page, and others link to the duplicate page.

The authority that backlinks give your content is therefore diluted across the multiple pages. Search engines won't know which page is the one with the highest authority, and therefore which one page to rank.

Take a look at this example by Ahrefs. It shows two locations of the same content on Buffer.com, and the difference in backlinks. 

One article has more backlinks than the other - but if there were only one version, the link juice would not be diluted at all:

[Screenshots from Ahrefs: backlink counts for the two Buffer.com URLs hosting the same article]

 

You could receive a penalty

Google has previously stated that duplicate content can lead to a penalty or, worse, the complete deindexing of an entire site.

In reality, this is said to be rare and is really only done by Google when a website scrapes content from another site.

While duplicate content can hurt your SEO performance, sometimes significantly, it won't get you a penalty from Google so long as it was not intentional.

Google recognises that websites are often complex beasts, and it can be tough for website owners to deal with technical website challenges.

This is what Google says about it:

"Duplicate content on a site is not grounds for action on that site unless it appears that the intent of the duplicate content is to be deceptive and manipulate search engine results. If your site suffers from duplicate content issues, and you don't follow the advice listed in this document, we do a good job of choosing a version of the content to show in our search results."

 

What is the most common fix for duplicate content?

There are lots of ways to fix duplicate content, which we'll go through later, but if you just want to try the most common fix, here it is:

Choose your preferred version and implement 301 redirects from the non-preferred versions of URLs to the preferred versions.

The next most common fix is to use canonical tags. 

The rel=canonical element is a snippet of HTML code that tells Google that the content is yours and that it is the main version of the page, even when the content can be found elsewhere or there are multiple versions on your site.

The canonical tag can be used for:

  • Print vs. web versions of content

  • Mobile and desktop page versions

  • Multiple location targeting pages

  • And more...

There are two types of canonical tags: self-referencing tags that point to the page they are on, and tags that point away from a page - the latter tell search engines that another version of the page is the main version.

Here’s what it looks like:

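A minimal sketch (the domain and path are hypothetical placeholders):

```html
<!-- Placed in the <head> of a duplicate or alternate page,
     pointing at the version you want indexed -->
<link rel="canonical" href="https://www.mysite.com/blue-dress" />
```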

Ultimately, to get the best results, you need to include several proactive steps in your digital content strategy.

The best way to work out how to fix your duplicate content is to first understand the most common causes.

 

Common causes of duplicate content

Duplicate content due to technical reasons

Let's face it, websites are complicated. The bigger your site and the more pages you have, the more complex it gets and the more chance you will make mistakes.

That's why duplicate content is often simply the result of an incorrectly set up web server or website. You might have recently migrated your website, changed its structure, or updated the on-page optimization. It happens.

The good news is that these causes are technical in nature, which means they will likely never result in a Google duplicate content penalty.

The bad news?

They can still seriously harm your rankings, so it's important to fix them fast.

Non-WWW vs WWW and HTTP vs HTTPS

If you are using HTTPS and the WWW subdomain, your preferred way of serving content is via https://www.mysite.com.

This is your canonical domain.

However, if your web server has been poorly configured, your content may also be accessible through:

  • https://mysite.com

  • http://mysite.com

  • http://www.mysite.com

This is one of the oldest problems in the book and means that multiple versions of your site are accessible.

How to fix it:

Choose which way you want to serve your content. Then, implement 301 redirects to send non-preferred pages to the preferred version.
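
For example, here's a minimal .htaccess sketch, assuming an Apache server with mod_rewrite enabled and https://www.mysite.com as the preferred version:

```apache
# 301-redirect any non-HTTPS or non-www request to the
# preferred https://www version in a single hop.
RewriteEngine On
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteRule ^(.*)$ https://www.mysite.com/$1 [R=301,L]
```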

URL structure

It may surprise you, but URLs are case-sensitive for Google.

This means these two versions are seen as different URLs:

https://mysite.com/blue-dress

https://mysite.com/blue-DRESS

So, if you have both upper- and lowercase versions of your site's URLs, you could be making it difficult for search engines to index your site, and hurting your site's performance.

This topic is addressed by Google’s John Mueller in Ask Googlebot on the Google Search Central YouTube channel.

Just to confuse things further, URLs are not case-sensitive for Bing.

Another thing that causes issues is the 'trailing slash'.

That's where you have a forward slash (/) at the end of a URL (think of it as trailing behind).

https://mysite.com/blue-dress/

https://mysite.com/blue-dress

As Mueller says:

"By definition, URLs are case sensitive, and also things like slashes at the end do matter. So, technically, yes — these things matter. They make URLs different.”

Google recognizes there are multiple versions of the same URL and will try to crawl all of them and figure out which one to show in search results.

How to fix it:

Again, choose a preferred structure for your URLs, and for non-preferred versions, use a 301 redirect. If you have multiple pages that need to be redirected though, you’ll probably find it much easier to implement a site-wide fix.

For example, you can enforce lowercase URLs with rewrite rules in your server configuration files. If you don't have access to your httpd.conf file, your hosting company may be able to turn the feature on for you.
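
Here's a sketch of such a rule for Apache; note that the RewriteMap directive must be defined in the main server or virtual host configuration (httpd.conf), not in .htaccess:

```apache
# Define an internal map that lowercases its input, then 301-redirect
# any request whose path contains an uppercase letter.
RewriteEngine On
RewriteMap lowercase int:tolower
RewriteCond %{REQUEST_URI} [A-Z]
RewriteRule ^/?(.*)$ /${lowercase:$1} [R=301,L]
```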

Access through different index pages

If your web server is misconfigured, there's a chance your homepage may be accessible via multiple URLs.

For example, along with https://www.mysite.com, your homepage may be accessible through:

  • https://www.mysite.com/index.html

  • https://www.mysite.com/index.php

  • https://www.mysite.com/index.asp

  • https://www.mysite.com/index.aspx

How to fix it:

Choose a preferred way to serve your homepage and use 301 redirects.
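
For instance, a .htaccess sketch (again assuming Apache with mod_rewrite) that collapses the index variants onto the root URL:

```apache
# 301-redirect /index.html, /index.php, /index.asp and /index.aspx
# to the bare homepage. Checking THE_REQUEST (the original request
# line) prevents the rule from looping on internal rewrites.
RewriteEngine On
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /index\.(html|php|aspx?)\  [NC]
RewriteRule ^index\.(html|php|aspx?)$ / [R=301,L]
```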

Parameters for filtering

Does your site use URL parameters for page variations?

Websites often use parameters in URLs so they can offer filtering functionality. Take this URL for example:

https://www.mysite.com/clothing/shoes?colour=red

This page would show all the red shoes.

This is great for your users, as they can find the products they want quickly and easily.

BUT it can cause major issues for search engines.

The problem with filter options is that they can generate a never-ending number of combinations, especially when there is more than one filter option available. For clothing, you might filter by type, style, size, price, colour, brand, new release, on sale - the options are endless.

The parameters can also be rearranged, so even though the URLs below are different, they would show the same content:

  • https://www.mysite.com/clothing/shoes?colour=red&type=long

  • https://www.mysite.com/clothing/shoes?type=long&colour=red

Here is another example of URLs that lead to essentially duplicate content, distinguished only by different parameters:

[Example from Google: a set of URLs serving essentially the same content, distinguished only by their parameters]

How to fix it:

You can prevent Google from crawling URLs that contain specific parameters or parameters with specific values, to stop it from crawling duplicate pages.

If you have lots of URL parameters on your site, it's worth using the URL Parameters tool to reduce crawling of duplicate URLs.

Google recommends you only use the URL Parameters tool if your site fulfils ALL of the following requirements:

  • Your site has more than 1,000 pages

  • In your logs, you see a significant number of duplicate pages being indexed by Googlebot and all duplicate pages vary only by URL parameters

  • You are an experienced SEO.

If you get it wrong, Google could end up ignoring important pages on your site, which means they won't rank.

Another thing you can do is implement canonical URLs to prevent duplicate content - you'd need to do this for each main, unfiltered page.
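
For instance, reusing the hypothetical shoes URL from above, every filtered variation declares the unfiltered category page as its canonical:

```html
<!-- Served on /clothing/shoes?colour=red and every other
     filtered combination of this category -->
<link rel="canonical" href="https://www.mysite.com/clothing/shoes" />
```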

However, this doesn't prevent crawl budget issues.

Taxonomies

In content management systems, it might be that your posts are available in more than one category. This is to do with taxonomies.

A taxonomy is a grouping mechanism to classify content. You see them in your content management system (like WordPress) to support categories and tags.

However, if you don't appoint a primary category, all versions will be considered duplicates.

Let's explain -

Imagine you have a blog post about electric cars - it's in three categories and is accessible through all three:

  • https://www.mysite.com/cars/topic/

  • https://www.mysite.com/travel/topic/

  • https://www.mysite.com/technology/topic/

So, the search engine will see it as multiple versions of the same content - i.e. duplicate content.

How to fix it:

Choose one of these categories as the primary one and use the canonical tag to tell search engines it is the master version.

Comment pages

If you have comments enabled on your website, after a certain number of comments, they might automatically roll over to more pages. This is the case with WordPress.

The problem is, the next page will show the same content with only the comments at the bottom being different.

This means for every page of new comments, you have a duplicate page of content.

How to fix it:

Use pagination link relationships, which signal to the search engine that these are a series of paginated pages, and not duplicate content.
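
A minimal sketch using rel="prev" and rel="next", assuming WordPress-style comment pagination (the URLs are hypothetical):

```html
<!-- In the <head> of comment page 2 of a post -->
<link rel="prev" href="https://www.mysite.com/electric-cars/comment-page-1/" />
<link rel="next" href="https://www.mysite.com/electric-cars/comment-page-3/" />
```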

Localization

If you're trying to target people in different regions who speak the same language, such as different states across the U.S. or Australia and Canada, you can end up with duplicate content.

After all, the brand is the same, so it’s natural that the content will overlap on both sites. 

Luckily, Google usually works this out and it won't impact your results.

However, you can take proactive steps just in case.

How to fix it:

Use the hreflang attribute to help prevent duplicate content.

The hreflang attribute is used to indicate to the crawlers what language your content is in and what geographical region it is meant for.
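
Here's a sketch, assuming hypothetical Australian and Canadian versions of the same English-language site:

```html
<!-- In the <head> of every regional version, list all alternates -->
<link rel="alternate" hreflang="en-au" href="https://www.mysite.com/au/" />
<link rel="alternate" hreflang="en-ca" href="https://www.mysite.com/ca/" />
<link rel="alternate" hreflang="x-default" href="https://www.mysite.com/" />
```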

Indexable search result pages

If you offer a search functionality for visitors to find content on your site, this can be another cause of duplicate content.

Often the search results pages on your site are very similar.

At the same time, they don't provide much value to search engines and you don't want them to rank. They're really just there to give your visitors a better user experience.

How to fix it:

Use the noindex attribute to stop a search engine from indexing the search result pages. You can also stop search engines from accessing them in the first place using the robots.txt file - this is recommended if you have lots of results pages.
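
For example, assuming your results pages live under a hypothetical /search path, the noindex attribute is a meta tag in each results page's head, while the robots.txt alternative blocks crawling of the whole path:

```html
<!-- In the <head> of each internal search results page -->
<meta name="robots" content="noindex" />
```

```
# robots.txt - the alternative: stop crawlers fetching results pages at all
User-agent: *
Disallow: /search
```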

Indexable staging or testing environment

If you or your web developers are about to launch some new features, such as a new eCommerce platform within your site, it's best practice to use staging environments. That way you're not testing new features on a live website.

However, sometimes these can be left indexable for search engines, and that means search engines are coming across identical content in two places.

This can cause duplicate content issues that impact the ranking of the live site, while also meaning that the public could potentially access the test site rather than the live site.

How to fix it:

Use HTTP authentication to prevent search engines (and the public) from accessing staging and testing environments. Enable HTTP authentication on the test domain so it is blocked for search engines, while the live site remains indexable.
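
A minimal sketch for an Apache staging server, assuming basic authentication and a hypothetical password file path:

```apache
# In the staging site's .htaccess or vhost config: require a login.
# Crawlers can't authenticate, so staging pages stay out of the index.
AuthType Basic
AuthName "Staging environment"
AuthUserFile /path/to/.htpasswd  # hypothetical path
Require valid-user
```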

If you've found that something has been indexed that shouldn't have, you can use the URL removal tool to remove the indexed URL from the search engine and from cache. This can be done through Google Search Console.

Parameters for tracking

Parameters can also be used for tracking. A tracking parameter is a piece of code that's added to the end of a URL. It can then be parsed by a system to share information contained by that URL.

For instance, when a user shares a URL on Facebook, the source is added to the URL.

However, this is actually another cause of duplicate content.

How to fix it:

Use self-referencing canonical URLs on pages. Self-referencing canonical tags tell search engines which version is the main version. So, any URLs with these tracking parameters are canonicalized to the version without the parameters. Problem solved.
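
To illustrate (utm_source is just one common tracking parameter; the URLs are hypothetical):

```html
<!-- Served on https://www.mysite.com/blue-dress, including when it is
     reached via /blue-dress?utm_source=facebook: the canonical always
     points at the clean, parameter-free URL -->
<link rel="canonical" href="https://www.mysite.com/blue-dress" />
```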

Session IDs

Session IDs are appended to URLs to manage or handle user sessions.

How does duplicate content happen then?

The duplicate content issue occurs when these session IDs are used in internal links such as sitemaps, or are shared socially.

In other words, if every URL a visitor requests gets a session ID appended, this creates a lot of duplicate content!

How to fix it:

It's best practice to use self-referencing canonical URLs here too. That way, all URLs with appended session IDs are canonicalized to the version without them.

Print-friendly version

Sometimes a web page is set up to provide a print-friendly version of the content at a separate URL.

This means there are two versions of the same content with different URLs - one is just more print-friendly than the other!

How to fix it:

Avoid duplicate content issues by using a canonical tag pointing from the print-friendly version to the main version of the page.

Dedicated pages for images

Another issue with some content management systems is that they create a separate, dedicated page for each image.

This separate page just shows the image on a blank page. It looks similar to all the other image pages, and can be seen as duplicate content.

How to fix it:

Simply disable the feature that gives images dedicated pages. You don't need it! If you can't do this on your CMS, add a meta robots noindex attribute to the page.

Duplicate content caused by copied content

The other cause of duplicate content is when people have copied and published the content in multiple places. It's that simple.

Sometimes this is malicious - and can attract penalties. But other times, there is a very good reason for the duplicate content.

Take a look at the causes below:

Landing pages for paid search

Any seasoned digital marketer will tell you that the best paid search campaigns require specially designed landing pages that target specific keywords and drive conversions.

Here's the problem:

To save time and budget, sometimes marketers use copies of an original page and make some small changes to target the keywords, based on their keyword research.

If you have lots of paid search campaigns, this results in tonnes of pages that have largely identical content. And you know what that means - duplicate content.

How to fix it:

One way to fix it is to prevent search engines from indexing the landing pages using the noindex attribute.

Better still, create completely unique landing pages for your paid search campaigns - this will also ensure you provide more relevant, conversion-driven content for your audience.

Syndicating content

Syndication is when you take content that is already published on your own site, and you give other publishers permission to post the same content on their site.

Sometimes the syndicated content can be an exact duplicate of the content on your site, or it might only be a part of it.

For marketers, syndicated content is a great way to get eyes on your content and send readers to your site. It might get you in front of more of your target audience, and help you build authority in a certain niche.

In short, syndicated content is a good thing.

We know what you're thinking - isn't it duplicate content?

Unfortunately, syndicated content does create duplicate content. BUT there are ways to make sure the search engine rankings of your original content are not impacted.

How to fix it

Here are the three best ways to deal with the duplicate content problem:

  1. rel=canonical. The best solution to try is to ask the web owners syndicating your content to place a rel=canonical tag on the page with your article. This tag should point back to the original article on your site to tell the search engines that the syndicated content is a copy, and that you are the real publisher.

  2. NoIndex. You could ask the web owners to NoIndex their article copy. This tells search engines not to index the syndicated copy. The benefit of this is that links from the syndicated article back to your site will pass on PageRank.

  3. Direct attribution link. If you can't get the web owner to do options 1 or 2, make sure you get a link directly from the syndicated content to the original article - and not to your home page. This should be enough to tell the search engine that your version is the original.

[Image: a syndicated article on Fast Company, crediting the original source]

Other websites copying your content

We've all seen it - sometimes websites blatantly copy and paste content and put it on their site. Often very few words (if any) are changed.

This is a big problem and is a main cause of duplicate content.

The real problem happens when your website has a lower domain authority than the one copying your content (yes, it happens!). Websites with a higher domain authority tend to get crawled more often than those with a lower domain authority, which means the duplicate content (not your original) may be indexed first and attributed to the wrong site.

This could mean they rank above you for your own content.

How to fix it:

You can try to make sure that other websites credit you for your content, as with syndicated content above. However, if they've taken your content without permission, chances are this will be hard to enforce.

You can request that Google removes the "wrong" page from its search results by filing a request under the Digital Millennium Copyright Act (DMCA), and/or consider taking legal action (depending on the seriousness of the infringement).

You copy content from other websites

If you copy content from other websites, just like when others do it to you, this is a form of duplicate content too.

This doesn't mean you are acting maliciously. You might just be copying the supplier's description of a new range of products to save time when publishing them to your eCommerce site.

But it all counts as duplicate content.

How to fix it:

Google has suggested linking to the original source, and using either a canonical URL or a robots noindex tag to show which is the original content.

But the better solution is to always create your own unique content. Even when you are dealing with thousands of products, it pays to create a unique version of product descriptions on your site. Your shoppers will thank you for it - and your search engine rankings will reward you for it.

 

Finding duplicate content

Finding duplicate content in your website

The most common way to find duplicate content on your own website is to use Google Search Console.

Go to Google Search Console and navigate to the Index Coverage report.

The Google Search Console Index Coverage report is an invaluable tool for understanding which URLs have been crawled and indexed by Google, and which have not. It also tells you why the search engine has made that choice about a URL - and that's the really important part.

Your goal is to get the canonical version of every important page indexed by Google, and for any duplicate or alternate pages to be labeled "Excluded" in the Index Coverage report.

Here are a few things to look out for:

  • Duplicate, Google chose different canonical than user: This means the page is marked as canonical for a set of pages, but Google chose to ignore your request because it thinks another URL is a better canonical. To fix this, inspect the URL to see what Google has selected as the canonical URL. If you agree, change the rel=canonical link. Otherwise, work on your website architecture to reduce the duplicate content and send stronger ranking signals to the page you prefer as the canonical, so that Google changes its mind.

  • Duplicate without user-selected canonical: Google has found multiple URLs that aren't canonicalized to a preferred version, and none of them are marked as canonical - so the search engine does not think this page is the canonical version. To fix this, explicitly mark the canonical for this page, using rel=canonical links on every crawlable URL on your website.

  • Duplicate, submitted URL not selected as canonical: You asked for this URL to be indexed, but because it is duplicate content, and Google thinks that another URL is a better candidate for canonical, Google did not index it and instead indexed the canonical that it chose.

Another great tool to use to find duplicate content issues on your own site is SiteLiner.

SiteLiner is designed to identify internal duplicate content. Do a search of your chosen URL and you’ll see an overview page.

This gives you the percentage of internal duplicate content, then you can also click on the results to see more details of the duplicate content.

The free version is great, but it is limited to 250 pages and one scan every 30 days.

Finding duplicate content outside your own website

The best tool to find duplicate content on external websites is CopyScape.

CopyScape is essentially a duplicate content checker and is free to use.

This tool is easy to use: simply insert a link in the box on the homepage, and CopyScape will crawl the web looking for pages with similar or identical content.

You will then see a number of results where similar or duplicate content appears.

Then, you can click on each of the results to see specifically which parts of your text are duplicate content.

Keep in mind, you won’t get unlimited scans for one website on the free version. If you want to really manage your duplicate content issues, CopyScape offers a premium version for more insights.

Another way to identify duplicate content is to directly search for your page title or blog post title in the search engines.

If you have a certain page you’d like to check, you can also go to that page and copy and paste a text snippet into Google search.

Here’s one Ahrefs did on their own blog article:

[Screenshot from Ahrefs: Google search results for a text snippet copied from one of their blog articles]

 

Steps you can take to proactively fix duplicate content issues

Sure, you can do nothing and hope Google works out what is and is not duplicate content. But we've already shown above that sometimes Google does not get it right. So, if you want your pages to rank, it's better to take some proactive steps.

We've outlined some ways to fix your duplicate content above - here's a full list of the steps you can take to manage duplicate content:

  • Use 301s: Use 301 redirects in your .htaccess file to smartly redirect users, Googlebot, and other spiders. This helps resolve most duplication issues by preventing alternate versions from being displayed.

  • Tell Google how to handle URL parameters: Tell Google what the parameters are doing instead of letting the search engine try to figure it out.

  • Be consistent: Keep your internal linking consistent.

  • Use top-level domains: This helps Google to serve the most appropriate version of a document for country-specific content. For example, http://www.example.de will be identified as containing Germany-focused content, whereas http://www.example.com/de or http://de.example.com are less obvious to the search engine.

  • Manage syndicated content: Google may or may not choose the correct version of the content to index. Therefore, tell Google which version is the preferred version using a link back to your original article and the noindex tag.

  • Rel="alternate": Consolidate alternate versions of a page, such as mobile or country/language pages. Use hreflang to show the correct country/language page in the search results.

  • Minimize boilerplate repetition: Best practice is to include a very brief summary of the text and link to a page with more detailed content. Also, use the Parameter Handling tool to tell Google how to treat URL parameters.

  • Minimize similar content: Wherever you can, try not to create duplicate content in the first place. If you have many similar pages, consider consolidating the pages into one. Create your own unique content for products and services, rather than relying on supplier content.

  • Avoid blocking crawler access: Google does not recommend blocking crawler access to duplicate content pages. Rather than blocking Google crawlers from accessing duplicate content on your website, such as with a robots.txt file, it's better to allow search engines to crawl these URLs, but identify them as duplicates by using the rel canonical link, URL parameter handling tool, or 301 redirects. Where duplicate content means Google is crawling too much of your website and using up crawl budget, you can adjust the crawl rate setting in Google Search Console, rather than blocking pages.

 

Frequently asked questions about duplicate content

Can I get a duplicate content penalty from search engines?

It's very unlikely you will get a duplicate content penalty from Google if you did not intentionally copy content with malicious intent.

However, if you have large amounts of copy that is the same or very similar to another site, it's possible that Google will see this duplicate content as intentional and malicious, which means you're at risk of a penalty.

This is what Google says about the duplicate content penalty:

"Duplicate content on a site is not grounds for action on that site unless it appears that the intent of the duplicate content is to be deceptive and manipulate search engine results. If your site suffers from duplicate content issues, and you don't follow the advice listed above, we do a good job of choosing a version of the content to show in our search results."

 

If I fix duplicate content issues, will it increase my rankings?

Absolutely.

When you fix duplicate content issues, it means the correct content - that is, the content you prefer to be ranked - will be crawled and indexed by Google and other search engines.

By taking proactive steps to fix duplicate content, you will prevent Google from wasting crawl budget on crawling and indexing the duplicate pages that you don't want to show up in search. Instead, crawlers spend time indexing your preferred pages and your target audience is shown more relevant pages in the search results.

How much original text is required for a page to be considered “unique”?

That's the million dollar question!

At the end of the day, if you want to rank well with a page, it's best to focus on creating an original page that is valuable to your visitors and has unique, relevant content. There's no "correct percentage" of unique content that needs to be on the page or in the blog post. But if you focus on creating unique content first and foremost, search engines will recognise that it is an original page and will reward it in the rankings.   

Fix your duplicate content now

We've covered a lot in this article - and it can be difficult to know where to begin. That’s where we can help. Claim your FREE website audit today and find out where your duplicate content issues are and how to fix them. You'll get a full audit of your digital marketing assets AND we'll help you understand the opportunities for revenue-busting growth with a 6-month multichannel game plan. Get your FREE audit today!

