‘Duplicate content’ is one of the best-known SEO issues: webmasters keep asking about it and discussing it at length. Based on blog posts by well-known SEO practitioners, published interviews with Googlers, and discussions on popular webmaster forums, I have tried to summarize a few important facts behind the nuances of duplicate content.
Defining duplicate content
Content is considered duplicate when a considerable part of it is identical or closely similar to another web page on the same or a different website. Duplicate content can arise on the same site for many reasons that are likely to be considered non-malicious:
- Blog content accessible directly from the post and also via category, author and archive pages and RSS/Atom feeds.
- Forum replies creating multiple URLs pointing to the same page.
- Common sales elements or lengthy copywriting text repeated on many pages of a site.
- Print-only versions of articles not pointing to the original.
- The same store items linked via multiple distinct URLs.
- www and non-www versions of the same page, etc.
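Several of the cases above come down to one page being reachable under more than one URL. As a rough illustration only, here is a minimal Python sketch that maps such variants onto a single canonical URL. The host `www.example.com` and the parameter names in `IGNORED_PARAMS` are assumptions made up for the example, not something a search engine or CMS prescribes.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameters that change presentation or tracking, not content.
IGNORED_PARAMS = {"print", "utm_source", "utm_medium", "sessionid"}


def _strip_www(host: str) -> str:
    """Return the host without a leading 'www.' label."""
    return host[4:] if host.startswith("www.") else host


def canonical_url(url: str, preferred_host: str = "www.example.com") -> str:
    """Map URL variants of the same page onto one canonical form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Treat the www and non-www hosts as the same site and pick the preferred one.
    if _strip_www(host) == _strip_www(preferred_host):
        host = preferred_host
    # Drop ignored parameters and sort the rest so their order does not matter.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    )
    path = parts.path or "/"
    # The fragment is discarded because it never reaches the server.
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(query), ""))
```

With these assumptions, `canonical_url("http://example.com/item?id=7&print=1")` and `canonical_url("http://www.example.com/item?id=7")` both give `http://www.example.com/item?id=7`, so the print view and the non-www variant collapse into one address.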
Different sites with similar content pages
Many webmasters assume that creating multiple identical copies of the same content or page will either increase their chances of ranking for many long-tail keywords or help them get multiple listings and win more traffic. There is nothing wrong with syndicating your content across different sites, as long as you take due care to establish authentic ownership of the content.
However, in some cases it is seen as a malicious practice: content is duplicated across different sites to deliberately trick the search engines into returning irrelevant, poor-quality search results with the same content repeated within a set of results.
Duplicate content penalty
Actually, there is no such thing as a ‘duplicate content penalty’. Penalties apply when Google perceives that you are trying to manipulate the search results with unscrupulous techniques such as paid links, excessive link building with the same anchor text, doorway sites, etc. A penalty prevents your site from showing up in top positions regardless of how many backlinks it has or how low the competition is. Penalties reportedly push a search result down by 30, 60, 90, 350 or 950 positions depending on the nature of the penalty, even if the page previously ranked number one.
Duplicate content is treated very differently. Here’s an excerpt from Google’s official blog about handling such practices:
In the rare cases in which we perceive that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we’ll also make appropriate adjustments in the indexing and ranking of the sites involved. However, we prefer to focus on filtering rather than ranking adjustments … so in the vast majority of cases, the worst thing that’ll befall webmasters is to see the “less desired” version of a page shown in our index.
This suggests that a page is unlikely to be dropped from Google’s index entirely just because of content duplication (unless other factors are involved, such as an MFA (made-for-AdSense) or doorway site, a spammy-looking domain, or no or very few inbound links). To make search results more relevant to the user, pages with too much duplicate content are simply filtered and sent to the ‘supplemental index’ (the place where a less trusted result is sent when Google is not sure what to do with it but doesn’t want to throw it away).
Pages in the supplemental index rarely rank, are not crawled often and don’t pass any backlink value. Thus the worst thing that may happen is to have the ‘more desired’ version of a page thrown into the supplemental index. It’s similar to using a sieve to remove unwanted particles: good particles sometimes get filtered out accidentally.
Basically, when a search engine spider crawls a site, it reads the HTML pages and stores the content in its database. Then it compares its findings with other information in its database and works out the percentage of duplication. A page with more duplicate content simply scores low on content-based algorithmic factors. However, Google search rankings are based on 200+ factors, of which only a very small set are actually based on the content in the page body. Thus, depending on other vital components such as relevancy score, inbound links, trust rank and so on, it scales the final rankings and filters out the pages that are both heavily duplicated and of low quality.
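How search engines actually measure duplication is not public. To give a feel for what a ‘percentage of duplication’ can mean, here is a minimal Python sketch of one common textbook approach: split each page into overlapping word sequences (shingles) and compare the two sets with Jaccard similarity. The function names and the choice of three-word shingles are assumptions for illustration, not Google’s method.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (overlapping word sequences) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short texts are treated as a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def duplication_percent(page_a: str, page_b: str, k: int = 3) -> float:
    """Jaccard similarity of the two shingle sets, expressed as a percentage."""
    a, b = shingles(page_a, k), shingles(page_b, k)
    if not a and not b:
        return 0.0
    return 100.0 * len(a & b) / len(a | b)
```

Two identical texts score 100 and two texts sharing no three-word sequence score 0. Everything in between depends heavily on the shingle size, which is one reason a single ‘exact’ percentage should be read with some caution.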
How to know if a site is in the Supplemental Index
Well, I am not sure there is a better way to check whether a site is in the supplemental index. I usually compare the Google results returned by:
site:www.yourdomain.com (returns every page indexed)
and
site:/www.yourdomain.com (returns every page in the main index)
The difference between the above two searches reveals the pages from your site that are in the supplemental index.
As for duplicate content on the same site, don’t be confused if you see supplemental results even though none of the conditions mentioned earlier apply. There can be other reasons as well, such as old pages going supplemental after you have recently 301-redirected them to new ones. You may just need to give it time, but also make sure that there are no CMS errors delivering the same content at multiple URLs and no canonicalization errors such as www vs non-www. Try tweaking the robots.txt file to stop Google from indexing junk pages, use sitemap files to point to preferred URLs, use webmaster tools to fine-tune parameter handling and, wherever needed, implement noindex meta tags, e.g. on archive pages in blogs, print-only pages, etc.
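Before relying on a robots.txt tweak, it helps to check that the rules block what you intend and nothing more. The following Python sketch uses the standard library’s `urllib.robotparser` to test a set of URLs against a draft robots.txt. The rules in `ROBOTS_TXT`, the example URLs and the helper name `blocked_urls` are hypothetical placeholders, so adapt the paths to your own site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft rules; the paths stand in for a site's junk or duplicate pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /archive/
"""


def blocked_urls(robots_txt: str, urls: list, agent: str = "Googlebot") -> list:
    """Return the URLs that the given robots.txt text disallows for the agent."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    candidates = [
        "http://www.example.com/print/post-1",
        "http://www.example.com/post-1",
        "http://www.example.com/archive/2010/05/",
    ]
    for url in blocked_urls(ROBOTS_TXT, candidates):
        print("blocked:", url)
```

Under these draft rules the print and archive URLs are reported as blocked, while the post itself stays crawlable.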
If you syndicate your content to other sites, make sure that all the copies link back to the original version. To further ensure that your site is the one served in the SERPs, you may want to ask those sites to block their version through robots.txt.
The duplicate content filter sometimes turns out to be harsh even on sites that don’t mean to manipulate the search results at all, e.g. when a story submitted to a highly credible site like Digg ranks better than the original site, or when someone with higher authority republishes scraped content and is considered the original, which may even keep a new site out of the index. In such cases, don’t hesitate to point some link popularity at that part of your site instead of just letting it flow to the home page.
Finally, it’s in your hands to make the search engines trust your site as unique. I hope the points discussed in this article help you understand the duplicate content filter and keep your site unique and fresh.

May 25th, 2010 at 7:53 am
Duplicate content penalty applies only when you post the same content many times in the same website. It does not apply to posting the same article to different websites. You can use RSS feeds, provided that you keep all the links to the original source.
June 10th, 2010 at 10:22 am
I have heard that Duplicate Content can affect SERP rankings! In my opinion, unique content is the king!
July 20th, 2010 at 10:30 pm
yes content is the king…..