How To Make A FREE Duplicate Content Checking Tool for SEO Easily? (Video Inside)
Duplicate content detection can tank your SEO faster than you can say “manual penalty.” From hidden menu items and footers to copy-pasted region-specific pages, duplicate content splits your link equity, confuses Google’s crawler, and invites manual actions potentially resulting in lost leads OR even revenue. Let’s go over the exact, straightforward steps to build a powerful, Python-driven duplicate content checker leveraging AI embeddings.
Small business websites can’t afford wasted traffic or penalties. Whether you run a local store, consultancy, or online boutique, here’s your cost‑effective solution to keep your content unique.
Step 1 – Prepare Your URL List (Tab-Delimited)
Step 2 – Set Up Your Automation Environment (Google Colab)
Step 3 – Install HTML-Clean Support
Step 4 – Load URLs with a Custom Python Script
Step 5 – Export & Automate
In our latest video walkthrough, we demonstrated how to scan 5,000+ URLs in seconds, flag genuine duplicates via AI embeddings, and export a clean report—all at zero cost on Google Colab.
How? By building a free, scalable web page duplicate content checker in Python, powered by AI embeddings, custom scripting, and automation — no paid APIs required.
Why Duplicate Content Checking Matters for SEO
- Splits your link equity: Multiple pages fight for the same keywords.
- Confuses Google’s crawler: Boilerplate navbars get indexed over real content.
- Invites manual actions: Google may penalize or devalue your site.

Step-by-Step: Build Your Python AI Duplicate Checker
Step 1 – Prepare Your URL List (Tab-Delimited)
- Export the URls you want to check for duplication (we selected Nord VPN – a major SaaS VPN provider for the experiment)
Here’s the list of URLs we used: NordVPN URLs
2. Ensure that the URLs begin with the first row (matching the loader’s check in code).
3. Name your file exactly URLs.txt and ensure it’s tab-delimited (.txt).
Tip: Keep it clean — no extra columns or spaces.
Step 2 – Set Up Your Automation Environment (Google Colab)`
- Go to Google Colab, open —> New Notebook.

2. Open the terminal and install the required libraries:
pip install newspaper3k sentence-transformers scikit-learn numpy lxml_html_clean
Why?
- Newspaper3k for content extraction.
- SentenceTransformers & scikit-learn for AI embeddings and cosine similarity.
- NumPy for numerical operations.
- lxml_html_clean to avoid HTML-clean import errors.
Step 3 – Install HTML-Clean Support
!pip install lxml_html_clean
With
lxml_html_cleanin place, your content extraction step can confidently clean up page HTML—ensuring you analyze only the meaningful text, not boilerplate or broken markup.You’re adding a small, dedicated library that supplies the missing html.clean module for lxml.
Why it’s needed: The newspaper3k package relies on lxml.html.clean to strip out unwanted tags (scripts, ads, etc.) when parsing pages.
What happens under the hood:
- Checks your existing
lxmlinstall (you already have it). - Downloads
lxml_html_clean-0.4.2, which injects the standalone clean-HTML functionality. - Installs it so that
import lxml.html.cleanin your script now works without errors.
Step 4 – Load URLs with a Custom Python Script
import os
import csv
import re
import numpy as np
from newspaper import Article, Config
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
def load_urls(filename="URLs.txt"):
"""
Load URLs from a tab-delimited file.
If the file contains a header with 'URL', that column is used;
otherwise, the first column is assumed to contain URLs.
"""
urls = []
try:
with open(filename, "r", encoding="utf-8") as f:
first_line = f.readline()
f.seek(0) # Reset pointer
if "URL" in first_line:
reader = csv.DictReader(f, delimiter='\t')
for row in reader:
url = row.get("URL")
if url:
urls.append(url.strip())
else:
reader = csv.reader(f, delimiter='\t')
for row in reader:
if row:
urls.append(row[0].strip())
except Exception as e:
print(f"Error reading {filename}: {e}")
return urls
def extract_text(url):
"""Extract the main article text from a URL using Newspaper3k with custom configuration."""
try:
config = Config()
config.request_timeout = 20 # Increased timeout
config.browser_user_agent = (
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
'AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/117.0.0.0 Safari/537.36'
)
article = Article(url, config=config)
article.download()
article.parse()
return article.text
except Exception as e:
print(f"Error extracting from {url}: {e}")
return ""
def tokenize_text(text):
"""
Tokenize text by lowercasing and extracting word tokens.
Punctuation and non-word characters are removed.
"""
text = text.lower()
# Extract word tokens using regex (letters, numbers, underscores)
tokens = re.findall(r'\b\w+\b', text)
return tokens
def longest_common_substring_length(tokens1, tokens2):
"""
Compute the length (in words) of the longest common contiguous substring between two lists of tokens.
Uses dynamic programming.
"""
m, n = len(tokens1), len(tokens2)
dp = [[0] * (n + 1) for _ in range(m + 1)]
longest = 0
for i in range(1, m + 1):
for j in range(1, n + 1):
if tokens1[i - 1] == tokens2[j - 1]:
dp[i][j] = dp[i - 1][j - 1] + 1
if dp[i][j] > longest:
longest = dp[i][j]
else:
dp[i][j] = 0
return longest
def export_duplicates(duplicates, output_filename="duplicates_embeddings.csv"):
"""Export duplicate pairs to a CSV file."""
try:
with open(output_filename, "w", newline='', encoding="utf-8") as f:
writer = csv.writer(f)
writer.writerow(["URL1", "URL2", "Cosine_Similarity", "Longest_Common_Block_Words"])
for url1, url2, cosine_sim, common_words in duplicates:
writer.writerow([url1, url2, f"{cosine_sim:.2f}", common_words])
print(f"Exported duplicates to {output_filename}")
except Exception as e:
print(f"Error writing to {output_filename}: {e}")
def main():
# Load URLs from file
urls = load_urls("URLs.txt")
if not urls:
print("No URLs loaded. Please check your file.")
return
print(f"Found {len(urls)} URLs.")
# Extract article text for each URL
print("Extracting article text from URLs...")
contents = [extract_text(url) for url in urls]
# Load SentenceTransformer model and compute embeddings
print("Loading SentenceTransformer model and computing embeddings...")
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = [model.encode(text, convert_to_numpy=True) if text else np.zeros(384)
for text in contents]
embeddings = np.array(embeddings)
# Compute cosine similarity matrix between embeddings
similarity_matrix = cosine_similarity(embeddings)
threshold_embed = 0.9 # Embedding similarity threshold
contiguous_threshold = 50 # Require at least 50 contiguous matching words
duplicates = []
n = len(urls)
# For candidate pairs with high cosine similarity, check contiguous word match
for i in range(n):
for j in range(i + 1, n):
cosine_sim = similarity_matrix[i][j]
if cosine_sim > threshold_embed:
# Tokenize the texts
tokens1 = tokenize_text(contents[i])
tokens2 = tokenize_text(contents[j])
lcs_length = longest_common_substring_length(tokens1, tokens2)
if lcs_length >= contiguous_threshold:
duplicates.append((urls[i], urls[j], cosine_sim, lcs_length))
# Export duplicate pairs to CSV
export_duplicates(duplicates)
# Output results to console
print("\nDuplicate detection using SentenceTransformer embeddings (with contiguous block check):")
if duplicates:
for url1, url2, cosine_sim, common_words in duplicates:
print(f"{url1} and {url2} (Cosine similarity: {cosine_sim:.2f}, "
f"Longest common block: {common_words} words)")
else:
print("No duplicates found.")
if __name__ == "__main__":
main()
Step 5 – Export & Automate
Remember: Siteliner and Gemini Pro often only flag headers and footers — not actual body text. Our custom script digs deeper.
📈 Case Study: SaaS Recovery from Duplication
A Digitalonaian client from the SaaS industry client saw their daily clicks plunge from 1,400 to 350 (–75%) because their region-specific pages were copying the same content. You can automate that same process at zero cost.

- Problem: 1,000+ word articles copy-pasted across region pages dropped clicks from 1,400 → 350.
- Solution: Ran our script, cleaned duplicates, implemented 301 redirects + schema-rich FAQs.
- Outcome: Daily clicks rebounded to 1,200 in 3 months (+240%) — see the graph!
If this sounds like magic, it’s just automation and AI in action.
Wrap Up
No more paid tools, no more guesswork—just a free, scalable, AI-driven solution to duplicate content detection. Want Digitalonian to handle it? Slide into our DMs or drop a comment below. Let’s keep your SEO rankings bulletproof!
