How to Build a Python AI Agent for Sitemap URL Scraping for FREE?

Sitemap URL scraping isn’t just a technical task; it’s a powerhouse tool for SEO. Extracting URLs from your competitors’ sitemaps can help you learn their publishing frequency, discover the topics they focus on, and uncover gaps in your own content strategy. Here I have outlined three efficient methods for scraping competitor sitemaps.

1- Using Google Sheets (Quick and Easy)

2- Using Screaming Frog (A Bit More Complex)

3- Using Python & Custom Script (For Large Sitemaps)

Watch the video or keep on reading for the overview!

How to Automate Scraping Data from a List of URLs

Why Not Just Copy-Paste?

If you’re dealing with small sitemaps (40-50 URLs), manually copying the URLs might work. But once you’re dealing with hundreds or thousands of URLs, this method becomes inefficient, error-prone, and slow. Additionally, sitemaps often have URLs nested under others—something you can’t capture by simply copy-pasting. This is where automated methods come into play.

Method 1: Using Google Sheets (Quick and Easy)

Step-by-Step:

  • Open a Google Sheet and select an empty cell

  • Paste the following formula to quickly pull URLs from the sitemap
  • Result: Google Sheets will pull all URLs from the sitemap. Notice that the result might include extra URLs (like media files) that are nested under each URL, which is why you can’t get them just by copy-pasting.

For instance, you may notice 1608 URLs being pulled, though the sitemap itself has 795 URLs—this happens because Sheets also includes images, PDFs, and other assets linked under each page.
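To see why the two counts can differ, here is a minimal Python sketch. It assumes the Sheets formula matches every `loc` element regardless of namespace (the formula itself is not shown here, so that is an assumption), and it uses a made-up sample sitemap with hypothetical example.com URLs. Page URLs are the `loc` entries directly under each `url` element, while image entries carry their own `image:loc` elements nested under the same page.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample sitemap: two pages, one of which lists two images.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://example.com/blog/post-1</loc>
    <image:image><image:loc>https://example.com/img/a.png</image:loc></image:image>
    <image:image><image:loc>https://example.com/img/b.png</image:loc></image:image>
  </url>
  <url>
    <loc>https://example.com/blog/post-2</loc>
  </url>
</urlset>"""

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def count_locs(xml_text):
    """Return (page_count, total_loc_count) for one sitemap document."""
    root = ET.fromstring(xml_text)
    # Page URLs: <loc> elements that are direct children of <url>.
    pages = root.findall(f"{SITEMAP_NS}url/{SITEMAP_NS}loc")
    # Every element whose local name is "loc", whatever its namespace.
    all_locs = [el for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "loc"]
    return len(pages), len(all_locs)


pages, total = count_locs(SAMPLE_SITEMAP)
print(f"page URLs: {pages}, all loc entries: {total}")
```

In this sample the sitemap has 2 pages but 4 `loc` entries in total, which is the same kind of gap as the 795 versus 1608 figures above.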

Method 2: Using Screaming Frog (A Bit More Complex)

Step-by-Step:

1- Download and Install Screaming Frog from the official website.

2- Go to Configuration > Spider > Crawl and select “Crawl Linked XML Sitemaps.”

3- Manual or Auto Discovery: You can enter the sitemap URL manually or let Screaming Frog auto-discover it.

4- Start the Crawl: Open the spider, enter the website URL, and click “Start”.

5- Wait for Completion: Wait until it reaches 100% and your data will be ready.

6- Filters: Use Screaming Frog’s 7 filters in the Sitemaps tab to group and analyze data by type.

With Screaming Frog, you can deeply analyze the content structure, filter out duplicate pages, and detect issues within the sitemap.
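If you only have an exported list of URLs and want a rough duplicate check outside Screaming Frog, the idea can be sketched in Python. This is not how Screaming Frog works internally; the normalization rules (lowercasing the host, ignoring a trailing slash and the fragment) and the function names are assumptions chosen for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase the scheme and host, drop the fragment and a trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def find_duplicates(urls):
    """Return normalized URLs that appear more than once, in first-seen order."""
    counts = Counter(normalize(u) for u in urls)
    return [u for u, n in counts.items() if n > 1]


# Hypothetical input: the first two entries differ only by case and a trailing slash.
sample = [
    "https://Example.com/a/",
    "https://example.com/a",
    "https://example.com/b",
]
print(find_duplicates(sample))
```

Whether a trailing slash marks a genuinely different page depends on the site, so treat the output as a list of candidates to review.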

Method 3: Using Python & Custom Script (For Large Sitemaps)

Perfect for:

  • Large sitemaps (1000+ URLs).

  • Working without a local Python installation: Google Colab runs the script directly in the cloud.

Step-by-Step:

1- Go to Google Colab: Open Google Colab, which lets you run Python scripts in the cloud without installing anything, and create a new notebook.

2- Copy-paste the custom Python script (provided below).

3- Run the Script: Paste in your sitemap URL (we used conductor.com in this example), hit Play, and the script will pull all URLs within seconds (or an hour for large sitemaps).

4- Save the Output

  • The script saves the URLs to a plain-text file that you can open in Notepad.
  • It also saves a file named “MySitemapURLs” in your Google Drive once you grant the necessary permissions.

Output Data: https://docs.google.com/spreadsheets/d/1m_4Zfpz_YVAHy0MRCBe7wIkVgyVnFWPaTdAeNJ2u34o/edit?usp=sharing

Customization: The script is flexible, allowing you to adjust things like file formats or the user-agent string if needed.
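For readers who want to see the general shape of such a script, here is a minimal sketch using only the Python standard library. It is not the exact script from the video: the function names, the user-agent string, and the `MySitemapURLs.txt` file name are assumptions. It follows sitemap index files to their child sitemaps, collects page URLs, and writes them to a text file. It does not handle gzipped sitemaps.

```python
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
USER_AGENT = "Mozilla/5.0 (compatible; sitemap-sketch/0.1)"  # hypothetical value


def fetch(url):
    """Download a sitemap and return its raw bytes."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()


def parse_sitemap(xml_bytes):
    """Return (page_urls, child_sitemap_urls) found in one sitemap document."""
    root = ET.fromstring(xml_bytes)
    pages = [el.text.strip() for el in root.findall("sm:url/sm:loc", NS) if el.text]
    children = [el.text.strip() for el in root.findall("sm:sitemap/sm:loc", NS) if el.text]
    return pages, children


def collect_urls(sitemap_url, fetcher=fetch):
    """Walk a sitemap (or sitemap index) and return page URLs without repeats."""
    seen_sitemaps, urls, queue = set(), [], [sitemap_url]
    while queue:
        current = queue.pop(0)
        if current in seen_sitemaps:
            continue  # avoid fetching the same sitemap twice
        seen_sitemaps.add(current)
        pages, children = parse_sitemap(fetcher(current))
        urls.extend(pages)
        queue.extend(children)
    return list(dict.fromkeys(urls))  # de-duplicate, keep first-seen order


def save_urls(urls, path="MySitemapURLs.txt"):
    """Write one URL per line to a text file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(urls))


# Usage (needs network access), with a placeholder address:
# urls = collect_urls("https://example.com/sitemap.xml")
# save_urls(urls)
```

In Colab, writing to Google Drive would additionally require mounting Drive first (for example with `google.colab.drive.mount`) and pointing `path` at the mounted folder.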

Competitor Research (AI-driven)

After gathering the URLs, run a ChatGPT competitor analysis (as seen in our previous tutorial) to evaluate the SERP for each URL and gain valuable SEO insights. This AI-driven analysis with ChatGPT can quickly give you a competitive edge in search rankings.

Here’s a competitor research prompt to kick things off:

  • Analyze the Search Engine Results Page (SERP) for the 1223 URLs to understand their ranking.
  • Investigate “People Also Asked” sections and featured snippets.
  • Review Ranking Strategies: Focus on each competitor’s presentation, narrative style, and unique insights.
  • Identify Key NLP Keywords: Discover the most relevant keywords and related terms from each competitor.

Create a competitor research table outlining all findings to refine your SEO strategy and out-rank your competitors.
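If the URL list is too long to paste comfortably into a single prompt, one option is to split it into batches and reuse the same instructions for each. The sketch below is an illustration only: the batch size of 50, the function names, and the condensed prompt wording are assumptions, not part of the original workflow.

```python
def batch(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_prompts(urls, size=50):
    """Return one prompt string per batch of URLs."""
    header = (
        "Analyze the SERP for the URLs below. Investigate People Also Asked "
        "sections and featured snippets, review ranking strategies, identify "
        "key NLP keywords, and summarize the findings in a table.\n"
    )
    return [header + "\n".join(chunk) for chunk in batch(urls, size)]


# Hypothetical list of 120 URLs, split into batches of 50, 50, and 20.
sample_urls = [f"https://example.com/page-{i}" for i in range(120)]
prompts = build_prompts(sample_urls, size=50)
print(len(prompts))
```

Each resulting string can be pasted into ChatGPT as its own message, and the per-batch tables merged afterwards.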

Wrapping Up

These three methods are perfect for automating sitemap scraping and analyzing competitor sites at scale. Whether you’re a marketer or an SEO professional, they can save you time, uncover hidden opportunities, and help craft a content strategy that positions you for success.

Got Questions?


Comment below or reach out to us at info@digitalonian.com. We’re happy to help!
