/robots-txt

为 Next.js 项目生成 SEO 友好的 robots.txt 文件。包含基本的爬虫规则、Sitemap 引用和常见目录的禁止规则。

功能

✅ 生成标准的 robots.txt 文件
✅ 自动包含 Sitemap 引用
✅ 禁止敏感目录和文件
✅ 允许所有主要爬虫
✅ 支持爬虫延迟设置
✅ 适配 Next.js 项目结构

参数

$1 或 $ARGUMENTS: 网站域名（可选）
- 用于生成 Sitemap URL
- 示例：https://yourdomain.com
- 如果不提供，Sitemap 行会被注释或省略

使用示例

示例 1：基本用法

/robots-txt

输出：

# robots.txt 文件

为你的 Next.js 项目生成的 robots.txt 文件：

```txt
# Allow all crawlers
User-agent: *
Allow: /

# Disallow common Next.js and private directories
Disallow: /api/
Disallow: /_next/
Disallow: /static/
Disallow: /.well-known/

# Disallow admin and private areas (if applicable)
# Disallow: /admin/
# Disallow: /private/
# Disallow: /dashboard/

# Sitemap (uncomment and update with your domain)
# Sitemap: https://yourdomain.com/sitemap.xml

文件位置

将此文件放置在项目根目录的 public/ 文件夹中：

路径： public/robots.txt

复制上面的内容到 public/robots.txt 文件。

验证

创建后，访问 https://yourdomain.com/robots.txt 验证文件可访问。

Next.js 自动生成

如果你使用 Next.js 13.3+，也可以使用 generateRobotTxt 函数自动生成：

app/robots.ts (推荐):

import { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      {
        userAgent: '*',
        allow: '/',
        disallow: ['/api/', '/_next/', '/static/'],
      },
    ],
    sitemap: 'https://yourdomain.com/sitemap.xml',
  }
}

这会自动生成 /robots.txt 路由。


### 示例 2：指定域名

```bash
/robots-txt https://yourdomain.com

输出：

# robots.txt 文件（包含 Sitemap）

```txt
# Allow all crawlers
User-agent: *
Allow: /

# Disallow common Next.js directories
Disallow: /api/
Disallow: /_next/
Disallow: /static/

# Disallow private areas
Disallow: /admin/
Disallow: /_next/image/
Disallow: /private/

# Crawl delay (optional)
# Crawl-delay: 1

# Sitemap
Sitemap: https://yourdomain.com/sitemap.xml

部署

将上述内容保存到 public/robots.txt，部署后可通过以下 URL 访问： https://yourdomain.com/robots.txt

注意事项

robots.txt 文件必须位于网站根目录
文件名必须小写：robots.txt
文件更新后可能需要几天才能被搜索引擎重新抓取
使用 Google Search Console 的 robots.txt 测试工具验证


### 示例 3：博客项目

```bash
/robots-txt https://blog.yourdomain.com

输出：

# robots.txt for Blog

```txt
# Allow all crawlers
User-agent: *
Allow: /

# Disallow admin and management areas
Disallow: /admin/
Disallow: /api/
Disallow: /_next/
Disallow: /wp-admin/
Disallow: /wp-includes/

# Disallow search and filter pages (avoid duplicate content)
Disallow: /search?
Disallow: /filter?
Disallow: /tag/*/page/
Disallow: /category/*/page/

# Allow specific feeds
Allow: /feed/
Allow: /feed/rss/

# Disallow other feeds
Disallow: */feed/
Disallow: */feed/rss/
Disallow: */feed/atom/

# Sitemap
Sitemap: https://blog.yourdomain.com/sitemap.xml

博客特定规则说明

禁止搜索和分页页面：

避免重复内容问题
节省爬虫预算

允许主要 feed：

允许主 feed 被抓取
禁止其他 feed 变体

防止爬虫陷入无限循环：

禁止带参数的搜索 URL
禁止深层分页


### 示例 4：电商项目

```bash
/robots-txt https://store.yourdomain.com

输出：

# robots.txt for E-commerce Store

```txt
# Allow all crawlers
User-agent: *
Allow: /

# Disallow admin and backend
Disallow: /admin/
Disallow: /api/
Disallow: /_next/
Disallow: /checkout/
Disallow: /cart/
Disallow: /my-account/
Disallow: /wishlist/

# Disallow search and filter pages
Disallow: /search?
Disallow: /catalogsearch/
Disallow: /filter?

# Disallow sorted and filtered URLs
Disallow: /*?sort=
Disallow: /*?order=
Disallow: /*?filter=
Disallow: /*?price=

# Disallow session IDs and tracking parameters
Disallow: /*?PHPSESSID=
Disallow: /*?utm_source=
Disallow: /*?utm_medium=
Disallow: /*?utm_campaign=

# Allow product images
Allow: /media/catalog/
Allow: /products/images/

# Disallow temporary files
Disallow: /tmp/
Disallow: /var/
Disallow: /cache/

# Specific crawler rules
# Block aggressive crawlers (optional)
User-agent: SemrushBot
Disallow: /

User-agent: AhrefsBot
Disallow: /

# Sitemap
Sitemap: https://store.yourdomain.com/sitemap.xml
Sitemap: https://store.yourdomain.com/sitemap-product.xml
Sitemap: https://store.yourdomain.com/sitemap-category.xml

电商特定规则说明

禁止结账流程页面：

/checkout/, /cart/ 等页面不需要被索引

禁止搜索和过滤 URL：

避免重复内容
节省爬虫预算

允许产品图片：

图片搜索流量有价值

阻止 SEO 工具爬虫：

SemrushBot, AhrefsBot 等可能消耗服务器资源

多个 Sitemap：

电商网站通常有多个 sitemap（产品、分类、静态页面）


### 示例 5：多语言项目

```bash
/robots-txt https://yourdomain.com

输出：

# robots.txt for Multi-language Site

```txt
# Allow all crawlers
User-agent: *
Allow: /

# Disallow common directories
Disallow: /api/
Disallow: /_next/
Disallow: /static/

# Sitemap for all languages
Sitemap: https://yourdomain.com/sitemap.xml
Sitemap: https://yourdomain.com/en/sitemap.xml
Sitemap: https://yourdomain.com/zh/sitemap.xml
Sitemap: https://yourdomain.com/es/sitemap.xml

多语言 Sitemap 策略

选项 1：主 sitemap 引用子 sitemap

<!-- sitemap.xml -->
<sitemapindex>
  <sitemap>
    <loc>https://yourdomain.com/en/sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://yourdomain.com/zh/sitemap.xml</loc>
  </sitemap>
</sitemapindex>

选项 2：每个语言的 sitemap 单独引用

在 robots.txt 中列出所有语言的 sitemap
确保每个语言的 sitemap URL 正确


## 高级配置

### Crawl Delay（爬虫延迟）

```txt
User-agent: *
Crawl-delay: 1

# 或针对特定爬虫
User-agent: Googlebot
Crawl-delay: 0

User-agent: Bingbot
Crawl-delay: 1

说明：

设置爬虫请求之间的延迟（秒）
Google 通常忽略 crawl-delay
对小网站可能有用

阻止特定爬虫

# 阻止 SEO 工具爬虫
User-agent: SemrushBot
Disallow: /

User-agent: AhrefsBot
Disallow: /

User-agent: MJ12bot
Disallow: /

# 阻止存档网站
User-agent: archive.org_bot
Disallow: /

注意： 只阻止消耗大量资源的爬虫，不要阻止主要搜索引擎。

允许特定资源

User-agent: *
Allow: /images/
Allow: /css/
Allow: /js/

Disallow: /api/
Disallow: /admin/

说明： 在禁止规则之前使用 Allow 规则。

针对不同爬虫的不同规则

# Googlebot
User-agent: Googlebot
Allow: /
Disallow: /private/

# Bingbot
User-agent: Bingbot
Allow: /
Disallow: /private/

# 其他所有爬虫
User-agent: *
Allow: /
Disallow: /private/
Disallow: /admin/

robots.txt 语法

基本指令

User-agent：

User-agent: *        # 所有爬虫
User-agent: Googlebot # 特定爬虫

Disallow：

Disallow: /          # 禁止整个网站
Disallow: /admin/    # 禁止目录
Disallow: /file.html # 禁止文件
Disallow: /*.pdf     # 禁止所有 PDF

Allow：

Allow: /             # 允许整个网站
Allow: /images/      # 允许目录

Sitemap：

Sitemap: https://yourdomain.com/sitemap.xml
Sitemap: https://yourdomain.com/sitemap-2.xml

Crawl-delay：

Crawl-delay: 1       # 1 秒延迟

通配符和模式

* 匹配任意字符：

Disallow: /*?        # 禁止所有带参数的 URL
Disallow: /*.pdf     # 禁止所有 PDF 文件
Disallow: /private*  # 禁止以 /private 开头的所有路径

$ 结束标记：

Disallow: /*.pdf$    # 禁止以 .pdf 结尾的 URL
Disallow: /print$    # 禁止以 /print 结尾的 URL

最佳实践

1. 保持简单

User-agent: *
Allow: /
Disallow: /admin/

大多数网站只需如此简单。

2. 不要依赖 robots.txt 保护敏感内容

robots.txt 是公开的，任何人都可以查看
使用身份验证保护敏感内容
robots.txt 只是建议，不是强制

3. 禁止重复内容

# 禁止搜索结果、过滤、排序页面
Disallow: /search?
Disallow: /filter?
Disallow: /*?sort=
Disallow: /*?order=

4. 明确允许重要资源

User-agent: *
Allow: /images/
Allow: /css/
Allow: /js/

Disallow: /api/

5. 测试和验证

使用以下工具测试：

Google Search Console - robots.txt 测试工具
Bing Webmaster Tools - robots.txt 测试工具
在线验证工具

常见错误

错误 1：禁止整个网站

# 错误
User-agent: *
Disallow: /

修复：

# 正确
User-agent: *
Allow: /
Disallow: /admin/

错误 2：错误的路径

# 错误 - 缺少尾部斜杠
Disallow: /admin

# 正确 - 目录应该有尾部斜杠
Disallow: /admin/

错误 3：阻止搜索引擎

# 错误 - 阻止主要搜索引擎
User-agent: Googlebot
Disallow: /

除非有充分理由，否则不要阻止主要搜索引擎。

错误 4：过度复杂的规则

# 错误 - 过于复杂
User-agent: *
Allow: /images/jpeg/
Allow: /images/png/
Disallow: /images/gif/
Disallow: /images/webp/
Disallow: /images/svg/
# ... 更多规则

保持简单：

User-agent: *
Allow: /images/
Disallow: /private/

Next.js 集成

选项 1：静态文件（简单）

将 robots.txt 放在 public/ 目录：

public/
  robots.txt
  sitemap.xml

选项 2：自动生成（推荐）

app/robots.ts 或 app/robots.js:

import { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  const baseUrl = 'https://yourdomain.com'

  return {
    rules: [
      {
        userAgent: '*',
        allow: '/',
        disallow: [
          '/api/',
          '/_next/',
          '/static/',
          '/admin/',
        ],
      },
      {
        userAgent: ['SemrushBot', 'AhrefsBot'],
        disallow: '/',
      },
    ],
    sitemap: `${baseUrl}/sitemap.xml`,
  }
}

环境变量配置：

import { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  const baseUrl = process.env.NEXT_PUBLIC_BASE_URL || 'https://yourdomain.com'

  return {
    rules: [
      {
        userAgent: '*',
        allow: '/',
        disallow: ['/api/', '/_next/'],
      },
    ],
    sitemap: `${baseUrl}/sitemap.xml`,
  }
}

动态配置：

import { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      {
        userAgent: '*',
        allow: '/',
        disallow: process.env.NODE_ENV === 'production'
          ? ['/admin/', '/private/']
          : ['/'],
      },
    ],
    sitemap: 'https://yourdomain.com/sitemap.xml',
  }
}

验证和测试

1. 本地验证

# 开发环境
curl http://localhost:3000/robots.txt

# 生产环境
curl https://yourdomain.com/robots.txt

2. Google Search Console

打开 Google Search Console
选择你的网站
导航到"设置" → "robots.txt 测试工具"
输入你的 robots.txt 内容
测试特定 URL 是否被允许或禁止

3. 在线验证工具

Google robots.txt 测试工具
Bing robots.txt 测试工具
Screaming Frog SEO Spider

4. 手动检查清单

文件位于根目录：/robots.txt
文件名小写：robots.txt
文件可访问：返回 200 状态码
语法正确：没有错误
规则合理：不会意外阻止重要内容
Sitemap 引用正确：URL 完整且有效

常见问题

Q: robots.txt 更新后多久生效？

A: 通常需要几天到一周。搜索引擎不会立即重新抓取。

Q: 如何隐藏敏感内容？

A: 不要使用 robots.txt。使用：

身份验证（登录）
noindex meta 标签
HTTP 身份验证
IP 限制

Q: 禁止目录后，该目录的图片还能被索引吗？

A: 不能。禁止目录会禁止该目录下的所有资源。

Q: 如何阻止特定参数的 URL？

A: 使用通配符：

Disallow: /*?parameter=
Disallow: /*?utm_source=

Q: 需要为每个爬虫单独设置规则吗？

A: 通常不需要。大多数网站对所有爬虫使用相同规则。

Q: robots.txt 与 meta robots 标签有什么区别？

robots.txt: 控制整个网站或目录的爬取
meta robots: 控制单个页面的索引和跟随

通常结合使用以获得最佳效果。

注意事项

robots.txt 文件必须位于网站根目录
文件名必须小写：robots.txt
robots.txt 是公开的，任何人都可以查看
不使用 robots.txt 保护敏感内容
定期测试和验证规则
更新后可能需要几天才能生效
使用 Google Search Console 监控爬取错误

功能

参数

使用示例

示例 1：基本用法

文件位置

验证

Next.js 自动生成

部署

注意事项

博客特定规则说明

电商特定规则说明

多语言 Sitemap 策略

阻止特定爬虫

允许特定资源

针对不同爬虫的不同规则

robots.txt 语法

基本指令

通配符和模式

最佳实践

1. 保持简单

2. 不要依赖 robots.txt 保护敏感内容

3. 禁止重复内容

4. 明确允许重要资源

5. 测试和验证

常见错误

错误 1：禁止整个网站

错误 2：错误的路径

错误 3：阻止搜索引擎

错误 4：过度复杂的规则

Next.js 集成

选项 1：静态文件（简单）

选项 2：自动生成（推荐）

验证和测试

1. 本地验证

2. Google Search Console

3. 在线验证工具

4. 手动检查清单

常见问题

Q: robots.txt 更新后多久生效？

Q: 如何隐藏敏感内容？

Q: 禁止目录后，该目录的图片还能被索引吗？

Q: 如何阻止特定参数的 URL？

Q: 需要为每个爬虫单独设置规则吗？

Q: robots.txt 与 meta robots 标签有什么区别？

相关命令

注意事项