XML Sitemaps and GSC: Best Practices for Large Websites

Kong Metrics Team · 3 min read

Sitemaps are essentially a map for Googlebot. They tell the crawler which URLs you consider canonical and want crawled. Google treats a sitemap as a hint, not a command: it helps discovery, but it does not guarantee indexing or dictate the order in which pages are indexed.

For a site with 20 pages, a sitemap is trivial. For a site with 200,000 pages, the sitemap strategy has a real effect on how your crawl budget gets spent. If you treat your sitemap as a "set and forget" task, it drifts out of sync with the site, and Google may spend its crawl on thousands of junk URLs instead of the pages you care about.

Best Practices for Sitemap Optimization

A well-optimized sitemap is essential for ensuring Google spends its crawl budget effectively. If your sitemap contains low-quality or irrelevant URLs, you are directly sabotaging your own indexing efficiency.

The 50MB / 50K URL Limit

The sitemap protocol, which Google follows, sets hard limits: 50,000 URLs per file and 50MB (uncompressed) in size. If you exceed either one, you must split the file and use a Sitemap Index File: a master XML file that points to smaller, segmented sitemaps (e.g., sitemap-products-1.xml, sitemap-blog-1.xml).

If your CMS just dumps every single URL into one massive file, you throw away the topical hierarchy of your site, and GSC can only report on that file as a whole rather than section by section.
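As a rough illustration of the split, here is a minimal Python sketch that chunks a URL list into files of at most 50,000 entries and builds the index that points to them. The function names and the example.com URLs are placeholders for illustration, not part of any Kong Metrics or Google tooling, and the output omits the XML declaration for brevity.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000              # URL limit per sitemap file
MAX_BYTES_PER_FILE = 50 * 1024 * 1024   # 50MB limit, measured uncompressed


def chunk_urls(urls, size=MAX_URLS_PER_FILE):
    """Split a list of URLs into consecutive chunks of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_urlset(urls):
    """Return the XML text of a single <urlset> sitemap."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(root, "url"), "loc").text = url
    return ET.tostring(root, encoding="unicode")


def build_index(sitemap_urls):
    """Return the XML text of a <sitemapindex> pointing at child sitemaps."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_urls:
        ET.SubElement(ET.SubElement(root, "sitemap"), "loc").text = loc
    return ET.tostring(root, encoding="unicode")


def fits_size_limit(xml_text):
    """Check the uncompressed UTF-8 size against the 50MB limit."""
    return len(xml_text.encode("utf-8")) <= MAX_BYTES_PER_FILE


if __name__ == "__main__":
    # Hypothetical product URLs; a real list would come from your CMS or database.
    urls = [f"https://example.com/product/{i}" for i in range(120_001)]
    chunks = chunk_urls(urls)
    names = [
        f"https://example.com/sitemap-products-{n}.xml"
        for n in range(1, len(chunks) + 1)
    ]
    print(len(chunks), "child sitemaps")
    print(build_index(names)[:200])
```

Chunking by count alone is usually enough, but files with very long URLs can hit the size limit first, which is why the sketch checks both.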

Index Sitemaps for E-commerce

Large e-commerce stores should never use a single flat sitemap. Googlebot is resource-constrained. If it encounters a sitemap containing 50,000 product pages, it might not reach your high-value category pages for weeks.

Segment your sitemaps logically:

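  • Create a sitemap specifically for your top-level category pages. Google ignores the <priority> and <changefreq> tags, so signal importance by keeping this file small and clean and by supplying accurate <lastmod> dates.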
  • Create separate sitemaps for product pages, partitioned by date or category.
  • Only include "canonical" URLs in your sitemaps. Including parameterized, duplicate, or redirected URLs is a massive waste of crawl budget.
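The segmentation rules above can be sketched in a few lines of Python. This assumes a hypothetical URL layout in which the first path segment names the section (for example /category/... and /product/...), and it treats any URL with a query string or fragment as non-canonical. Detecting redirects would need an HTTP check, which is left out here.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def is_clean(url):
    """True if the URL has no query string and no fragment."""
    parts = urlsplit(url)
    return not parts.query and not parts.fragment


def segment(urls):
    """Group clean, de-duplicated URLs by their first path segment."""
    segments = defaultdict(list)
    seen = set()
    for url in urls:
        if not is_clean(url) or url in seen:
            continue  # skip parameterized URLs and exact duplicates
        seen.add(url)
        path = urlsplit(url).path.strip("/")
        section = path.split("/")[0] if path else "root"
        segments[section].append(url)
    return dict(segments)


if __name__ == "__main__":
    # Placeholder URLs for illustration only.
    sample = [
        "https://shop.example/category/shoes",
        "https://shop.example/product/1",
        "https://shop.example/product/1?color=red",
        "https://shop.example/product/1",
    ]
    for section, members in segment(sample).items():
        print(section, len(members))
```

Each resulting group can then be written to its own sitemap file, so the category sitemap stays small and separate from the product sitemaps.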

Spotting Orphan Pages

After submitting your partitioned sitemaps to GSC, monitor the "Sitemaps" report daily.

If GSC shows you submitted 50,000 URLs but only 5,000 are indexed, you have a massive quality issue. Google has left the other 45,000 submitted URLs out of the index, most likely because it considers them low-value noise.
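To make that comparison repeatable across many sitemaps, a small helper can rank them by the size of the gap. The sketch below assumes you have copied the submitted and indexed counts out of GSC by hand into a dictionary; the sitemap names and the 50% threshold are arbitrary illustrations, not recommended values.

```python
def indexing_gaps(report, threshold=0.5):
    """Return sitemaps whose indexed/submitted ratio is below `threshold`.

    `report` maps a sitemap name to a (submitted, indexed) tuple.
    Results are (name, unindexed_count, ratio), largest gap first.
    """
    flagged = []
    for name, (submitted, indexed) in report.items():
        if submitted == 0:
            continue  # nothing submitted, so no ratio to compute
        ratio = indexed / submitted
        if ratio < threshold:
            flagged.append((name, submitted - indexed, round(ratio, 2)))
    return sorted(flagged, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical counts for illustration.
    report = {
        "sitemap-products-1.xml": (50_000, 5_000),
        "sitemap-categories.xml": (200, 190),
    }
    for name, missing, ratio in indexing_gaps(report):
        print(f"{name}: {missing} URLs not indexed (ratio {ratio})")
```

Because the sitemaps are already partitioned by section, a low ratio on one file points directly at the part of the site that needs pruning.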

Use the Kong Metrics URL Clustering tool to identify which sections of your site are contributing to this bloat. If you see an entire cluster (e.g., "Archive Pages") that shows 10,000 impressions but zero clicks, these URLs are dragging down your overall crawl efficiency.
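If you want to approximate this kind of cluster view by hand, the following Python sketch groups rows from a GSC performance export by first path segment and lists clusters with impressions but no clicks. It is not the Kong Metrics tool and makes no claim about how that tool works; the row keys ("page", "impressions", "clicks") and the 1,000-impression cutoff are assumptions for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cluster_stats(rows):
    """Sum impressions, clicks, and URL count per first path segment."""
    totals = defaultdict(lambda: {"impressions": 0, "clicks": 0, "urls": 0})
    for row in rows:
        path = urlsplit(row["page"]).path.strip("/")
        cluster = path.split("/")[0] if path else "root"
        totals[cluster]["impressions"] += row["impressions"]
        totals[cluster]["clicks"] += row["clicks"]
        totals[cluster]["urls"] += 1
    return dict(totals)


def zero_click_clusters(rows, min_impressions=1000):
    """Names of clusters with at least `min_impressions` and zero clicks."""
    return sorted(
        name
        for name, total in cluster_stats(rows).items()
        if total["impressions"] >= min_impressions and total["clicks"] == 0
    )


if __name__ == "__main__":
    # Placeholder rows; real data would come from a GSC performance export.
    rows = [
        {"page": "https://shop.example/archive/2019", "impressions": 6000, "clicks": 0},
        {"page": "https://shop.example/archive/2020", "impressions": 4000, "clicks": 0},
        {"page": "https://shop.example/product/1", "impressions": 900, "clicks": 40},
    ]
    print(zero_click_clusters(rows))
```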

Use the sitemap as a surgical tool, not a storage bin. By segmenting your sitemaps and using Kong Metrics to prune the noise, you steer Google toward spending its crawl budget on your high-value, revenue-generating content.

To optimize further, ensure your sitemaps don't include error pages by checking How to Handle Soft 404 Errors, manage site architecture bloat with Diagnosing SEO Thin Content GSC, and verify your setup with How to Set up Verify GSC.