robots.txt and Indexing Directives After a CMS Migration: The Settings Nobody Checks
There is a category of migration disaster that has nothing to do with broken links, missing redirects, or slow page speed. It happens silently, often goes unnoticed for weeks, and can wipe out months of ranking progress in a single deployment. It is the misconfigured indexing directive — a robots.txt rule, a noindex meta tag, or a server-level header that tells Google to stop crawling or indexing your site, and Google complies immediately and completely.
These errors are common because they originate in legitimate staging environment settings. When you build a Next.js site before launch, you absolutely should block Googlebot from crawling your staging environment. The problem is that those same blocking directives survive the move to production if no one explicitly removes them. And since they are configuration details rather than visible content, no one notices until rankings start falling.
This post walks through every layer of indexing control — robots.txt, meta robots tags, X-Robots-Tag HTTP headers, and crawl budget considerations — with specific guidance for Next.js App Router projects. If you are planning a migration or recently launched one, this is the checklist you need to run before Google finds what you missed.
Quick Checklist
- Remove any staging Disallow rules before going live
- Verify robots.txt is accessible at yourdomain.com/robots.txt
- Check for noindex meta tags on every production page (automated crawl)
- Verify X-Robots-Tag headers aren't set at the CDN/server level
- Ensure /thanks and other form-confirmation pages are noindexed
- Test with the robots.txt report in Search Console (the standalone robots.txt Tester tool has been retired)
The Staging robots.txt Disaster (And Why It Happens Every Time)
The pattern is predictable. A developer builds the new Next.js site on a staging subdomain or a preview URL. They correctly add a robots.txt that blocks all crawlers — often something as simple as Disallow: / for all user agents. This is responsible staging practice. Google should not index a site that isn't ready.
Then launch day arrives. The staging site is promoted to production, the DNS records update, the CDN configuration switches over. The development team verifies that pages load, that redirects work, that forms submit correctly. They ship it.
Nobody checks the robots.txt.
Three weeks later, the site owner notices that impressions in Search Console have dropped by 70%. Rankings that took two years to build have collapsed. A quick check of the live robots.txt reveals the problem: Disallow: / is still in place, and Google has been respecting it faithfully since day one of the new site.
This happens on professional agency-managed migrations. It happens on in-house engineering team migrations. It happens because robots.txt verification is not part of most teams' launch checklists, and because the consequences are delayed enough that the connection between the launch and the traffic drop is not always obvious.
The fix takes five minutes. Finding the problem and recovering from it can take months.
How to Configure robots.txt for a Next.js Site
In a Next.js App Router project, robots.txt is generated through the robots.ts file in your /app directory. This is the correct, idiomatic approach — it gives you a type-safe, programmatic way to generate the file and makes environment-based configuration straightforward.
A minimal production-ready robots.txt configuration looks like this:
```typescript
// app/robots.ts
import { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: {
      userAgent: '*',
      allow: '/',
      disallow: ['/api/', '/admin/', '/thanks'],
    },
    sitemap: 'https://yourdomain.com/sitemap.xml',
  }
}
```
Note the explicit /thanks in the disallow list: form confirmation pages should not appear in search results, and that needs to be specified intentionally. Keep in mind that a robots.txt disallow only stops crawling; a URL that is linked from elsewhere can still be indexed without being crawled, so pair the disallow with a noindex directive on the page itself.
For staging environments, the correct pattern is environment-based configuration:
```typescript
// app/robots.ts
import { MetadataRoute } from 'next'

const isProduction = process.env.NODE_ENV === 'production'

export default function robots(): MetadataRoute.Robots {
  if (!isProduction) {
    return {
      rules: {
        userAgent: '*',
        disallow: '/',
      },
    }
  }
  return {
    rules: {
      userAgent: '*',
      allow: '/',
      disallow: ['/api/', '/admin/', '/thanks'],
    },
    sitemap: 'https://yourdomain.com/sitemap.xml',
  }
}
```
This approach makes the staging block automatic and environment-driven: it cannot accidentally survive into production because it is tied to an environment value that changes at deployment. One caveat: on some hosts, preview and staging deployments also run with NODE_ENV set to production, so a platform-specific variable (such as VERCEL_ENV on Vercel) or a custom environment variable is often a safer switch than NODE_ENV alone.
If you are using a static robots.txt file in your /public directory rather than the programmatic approach, the same logic applies: the file must be updated before production deployment, and that update must be verified after deployment. The programmatic approach is safer because it eliminates the manual step.
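Even with the environment-driven pattern, it is worth verifying the live file after every production deploy. As a minimal sketch (the function name and simplified parsing are illustrative, not part of Next.js or any library), a smoke-test helper could fetch https://yourdomain.com/robots.txt and fail the build if the body blocks all crawlers:

```typescript
// Does this robots.txt body contain "Disallow: /" inside a "User-agent: *" group?
// Simplified parser for a post-deploy smoke test; not a full robots.txt parser.
function blocksAllCrawlers(robotsTxt: string): boolean {
  let groupIsWildcard = false // current group applies to all user agents
  let inDirectives = false    // we've seen directives since the last user-agent line
  for (const rawLine of robotsTxt.split('\n')) {
    const line = rawLine.split('#')[0].trim() // strip comments
    const colon = line.indexOf(':')
    if (colon === -1) continue
    const field = line.slice(0, colon).trim().toLowerCase()
    const value = line.slice(colon + 1).trim()
    if (field === 'user-agent') {
      // A user-agent line after directives starts a new group.
      if (inDirectives) {
        groupIsWildcard = false
        inDirectives = false
      }
      if (value === '*') groupIsWildcard = true
    } else {
      inDirectives = true
      if (field === 'disallow' && value === '/' && groupIsWildcard) return true
    }
  }
  return false
}
```

Wired into CI against the fetched production robots.txt, a failing check here turns the three-week silent disaster into a one-minute build failure.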
For reference, see the Next.js robots.txt documentation and Google's robots.txt documentation for complete specification details.
Meta Robots, X-Robots-Tag, and Indexing Directives — The Full Picture
robots.txt is only one of three mechanisms that control how Google crawls and indexes your site. The others are meta robots tags and X-Robots-Tag HTTP headers, and they operate at different levels with different scopes.
Meta Robots Tags
The <meta name="robots" content="noindex"> tag is applied at the page level and instructs search engines not to index the specific page where it appears. This tag is appropriate for pages like /thanks, /cart, /checkout, and /user-dashboard — pages that serve a functional purpose but should not appear in search results.
The migration risk here is that development or staging versions of pages often carry noindex tags as a default, and those tags can survive into production in two ways:
First, a CMS or framework that defaults to noindex for all pages until explicitly enabled. Some headless CMS configurations work this way. If your migration involved switching to a headless CMS, verify that the indexing default is set to allow, not block.
Second, individual pages that had noindex set in WordPress for legitimate reasons — draft pages, internal tools, admin pages — may have those settings carried over during content migration. Audit these carefully. Not everything that was noindexed in WordPress should be noindexed in Next.js.
In a Next.js App Router project, the robots metadata is set through the metadata export:
```typescript
import type { Metadata } from 'next'

export const metadata: Metadata = {
  robots: {
    index: true,
    follow: true,
  },
}
```
A noindex page would have index: false. Verify this is not set on any page that should be indexed.
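Applied to a confirmation page, the same metadata export flips to noindex. A minimal sketch (the /thanks route is a hypothetical example):

```typescript
// app/thanks/page.tsx (hypothetical confirmation page)
import type { Metadata } from 'next'

// Keep this page out of search results, but let crawlers follow its links.
export const metadata: Metadata = {
  robots: {
    index: false,
    follow: true,
  },
}
```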
X-Robots-Tag HTTP Headers
The X-Robots-Tag is an HTTP response header that functions identically to a meta robots tag but applies at the server or CDN level rather than within the HTML. This is where the most dangerous and least visible indexing blocks can hide.
CDN configurations — particularly Cloudflare, Vercel's edge config, and custom Nginx setups — can inject X-Robots-Tag headers for entire path patterns. A rule like "add X-Robots-Tag: noindex to all /staging/* paths" is sensible in a staging context. The same rule applied incorrectly to production paths is catastrophic.
To check for X-Robots-Tag headers, use curl from the command line:
```shell
curl -I https://yourdomain.com/your-key-page
```
Look for x-robots-tag in the response headers. If you see noindex anywhere in that value, find and remove the rule that is setting it.
Check your Vercel project settings, your Cloudflare transform rules, and any Nginx or reverse proxy configurations for X-Robots-Tag injection rules. These should be documented and reviewed as part of every migration launch checklist.
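This header check is easy to script across many URLs. As a rough sketch (the helper and its parsing are simplified; Google's documentation defines more directive forms), a function that flags a blocking X-Robots-Tag value might look like:

```typescript
// Does an X-Robots-Tag header value contain a directive that blocks indexing?
// Handles comma-separated directives and the "googlebot: noindex" scoped form.
function headerBlocksIndexing(headerValue: string | null): boolean {
  if (!headerValue) return false
  return headerValue
    .split(',')
    .map((directive) => directive.trim().toLowerCase())
    // "noindex" and "none" block indexing; "somebot: noindex" scopes the
    // directive to one crawler but still blocks indexing for it.
    .some((d) => d === 'noindex' || d === 'none' || d.endsWith(': noindex'))
}
```

Feed it the x-robots-tag value from each response (via fetch or your crawler's export) and fail the audit on any true result.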
Noindex Leaks: How to Detect Them Before Google Does
A single noindex tag on a single important page can be caught manually. A systematic noindex leak — affecting dozens or hundreds of pages — requires automated detection.
The most reliable method is a full-site crawl using a tool like Screaming Frog SEO Spider, Sitebulb, or Ahrefs Site Audit. Configure the crawl to report on:
- Pages with noindex in the meta robots tag
- Pages with noindex in the X-Robots-Tag response header
- Pages blocked by robots.txt
- Pages returning non-200 status codes
Run this crawl against your production site within 24 hours of launch, before Google's crawlers have had time to act on anything they find. The crawl report becomes your baseline and your immediate sanity check.
Pay specific attention to:
High-value pages. Your homepage, service pages, key landing pages, and top-performing blog posts should all be explicitly verified as indexable. Do not assume they are correct — check them individually.
Template-level issues. If noindex is set at the component or layout level in Next.js, it may affect every page using that template. A single template-level error affects hundreds of pages simultaneously.
Recently migrated content. Content migrated from WordPress carries its original metadata, including any indexing directives set in Yoast, RankMath, or All in One SEO. Verify that migrated content has the correct indexing settings in the new system.
Form confirmation pages. These should be noindexed intentionally. Verify they are noindexed and that the noindex is only on these pages, not applied more broadly.
After the initial crawl, set up recurring crawls — weekly for the first 90 days, then monthly. This catches regressions introduced by code deployments, CMS updates, or CDN configuration changes.
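The individual spot checks above can also be scripted. A rough sketch (regex-based for brevity; a real audit should parse the HTML properly and inspect the X-Robots-Tag header as well):

```typescript
// Detect a noindex directive in a page's HTML.
// Illustrative regex check: it only matches <meta name="robots" ...> tags
// where the name attribute comes before content.
function htmlHasNoindex(html: string): boolean {
  const metaTags = html.match(/<meta[^>]+name=["']robots["'][^>]*>/gi) || []
  return metaTags.some((tag) => /noindex/i.test(tag))
}
```

Run it over the fetched HTML of your homepage, service pages, and top posts, and alert on any unexpected true.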
For guidance on what to do when problems are found, see our post on what to do when pages get deindexed. For sitemap configuration in Next.js, see XML sitemaps for Next.js migrations.
Crawl Budget Considerations for Large Migrations
Crawl budget — the number of pages Googlebot crawls on your site within a given time period — is a factor for sites with thousands of URLs. For smaller sites (under 1,000 pages), crawl budget is rarely a limiting constraint. For larger sites, it deserves specific attention during migration planning.
Why Migrations Affect Crawl Budget
A migration resets many of the crawl-priority signals Google has built up for your old site. URLs that were crawled frequently because of high internal link equity and strong engagement need to re-establish those signals on the new domain or path structure. In the weeks after a migration, Googlebot may deprioritize pages it previously crawled frequently.
You can accelerate re-crawl by submitting your updated XML sitemap immediately after launch, using the URL Inspection tool in Search Console to request indexing for priority pages, and ensuring your internal linking structure gives Google clear paths to all important content.
Blocking Low-Value Pages
Large WordPress sites often accumulate significant URL bloat: author archive pages, tag pages, date-based archives, search result pages, and paginated views of content that adds no unique value. These URLs consume crawl budget without contributing to rankings.
During a migration, take the opportunity to audit which archive and taxonomy pages serve a genuine user and SEO purpose and which are crawl budget waste. Block low-value pages from crawling via robots.txt Disallow rules and use noindex for pages that should be served but not indexed.
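In the programmatic robots.ts, those low-value WordPress leftovers can be disallowed in one place. A sketch with hypothetical paths, assuming the archive URLs kept their WordPress-style prefixes:

```typescript
// app/robots.ts (production rules; the archive paths are hypothetical examples)
import { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: {
      userAgent: '*',
      allow: '/',
      disallow: [
        '/api/',
        '/admin/',
        '/author/', // author archives carried over from WordPress
        '/tag/',    // tag archives with no unique content
        '/search',  // internal search result pages
      ],
    },
    sitemap: 'https://yourdomain.com/sitemap.xml',
  }
}
```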
Do this evaluation before launch. Changing robots.txt rules after launch triggers recrawls and re-evaluations that add unnecessary complexity to your post-launch monitoring.
Sitemap Hygiene
Your XML sitemap should list only canonically indexed URLs — no redirects, no noindexed pages, no 404s. After a migration, verify that your sitemap is accurate and submit it fresh to Search Console.
A sitemap that lists 800 URLs when 200 of them actually carry a noindex directive is a signal-integrity problem: Google has to reconcile the conflicting signals, and that reconciliation costs crawl budget. Keep the sitemap clean.
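That reconciliation check can be automated. A minimal sketch (the function and its inputs are illustrative): given the URLs in your sitemap and the URLs your crawl reported as noindexed, any overlap is a conflict to fix on one side or the other:

```typescript
// Flag sitemap entries that a crawl reported as noindexed: those URLs should
// be removed from the sitemap, or the noindex directive should be removed.
function sitemapConflicts(sitemapUrls: string[], noindexedUrls: string[]): string[] {
  const noindexed = new Set(noindexedUrls)
  return sitemapUrls.filter((url) => noindexed.has(url))
}
```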
FAQ
What happens if I accidentally block Googlebot with robots.txt?
Google re-fetches robots.txt frequently (generally at least daily, as the file is cached), so a Disallow: / rule deployed to production takes effect quickly: Googlebot stops crawling your site, and indexed pages gradually drop from search results because Google can no longer re-verify them. The fix, correcting the robots.txt and resubmitting your sitemap, signals Google to resume crawling, but re-indexing takes time. Depending on how long the block was in place, recovery can take days to weeks. This is why verifying robots.txt is the first check after any migration launch.
How do I configure robots.txt in a Next.js App Router project?
Create a robots.ts file in your /app directory and export a function that returns a MetadataRoute.Robots object. This generates the robots.txt file dynamically at the /robots.txt route. Use environment variables to serve a blocking configuration on staging and an allowing configuration on production. This eliminates the risk of manually forgetting to update the file at launch. See the Next.js robots.txt documentation for the full API reference.
What's the difference between robots.txt and noindex meta tags?
robots.txt controls crawling — it tells Googlebot whether to visit a URL at all. A noindex meta tag controls indexing — it tells Google not to include the page in search results even if it is crawled. A page can be crawlable but noindexed, or blocked in robots.txt entirely. The practical difference matters: if a page is blocked in robots.txt, Google cannot read the noindex tag on it, which can cause confusing behavior if you're relying on noindex for pages that are linked externally. For pages you want to prevent from being indexed, prefer noindex with crawling allowed rather than robots.txt blocking.
Can CDN-level headers block Googlebot even if robots.txt is correct?
Yes. X-Robots-Tag headers set at the CDN or server level are an independent mechanism; a correct robots.txt does not neutralize them. robots.txt may allow crawling, but if your CDN is injecting X-Robots-Tag: noindex on every response, those pages will not be indexed. Always check response headers using curl or a browser's developer tools network tab, especially after any CDN configuration change. This is one of the most commonly missed issues in post-migration audits.
Should I noindex my /thanks and confirmation pages?
Yes. Form confirmation pages, checkout success pages, and similar post-action pages should be noindexed. They provide no value to search users, they can create duplicate content issues if they share template elements with other pages, and in some cases they can expose information about user actions that shouldn't be publicly visible. The correct approach is an explicit noindex meta tag with crawling left allowed; if you also disallow these URLs in robots.txt, Google cannot crawl the page to see the noindex tag, so rely on the noindex directive first and add a robots.txt block only after the pages have dropped out of the index.
Next Steps
Indexing configuration errors are the easiest migration failure to prevent and one of the most damaging to recover from. A 20-minute audit of your robots.txt, meta robots tags, and response headers before launch costs nothing. Finding and recovering from a deindexation event costs weeks of lost traffic and months of recovery time.
Our SEO Parity Audit includes a complete indexation risk check: we verify robots.txt configuration, crawl every page for noindex leaks, check CDN-level header configurations, and deliver a written report of every issue found — before they cost you rankings.