Scraping with Cheerio.js and Next.js API Routes

Yesterday, while looking into MLB home run songs, I learned that "The Hum" by the Belgian brothers was the Los Angeles Dodgers' home run song in 2022 😂.

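Preface

The main goal of this post is not really to show how to write a scraper in Node.js. It is to explain why Next.js is a full-stack framework (strictly speaking, it is not just a frontend framework).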

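If you ask ChatGPT today, "Is Next.js a frontend or a full-stack framework?", it will give plenty of examples showing that Next.js is full-stack, such as:

Next.js has its own routing system and supports server-side rendering (SSR);
Next.js supports API routes;
Next.js supports deploying serverless functions.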

In this post, Jim will use a simple scraper example to show that Next.js is a full-stack framework.

`cheerio.js` scraper

cheerio.js is a Node.js library for parsing HTML. Its API closely resembles jQuery's, so it is easy to pick up.

Installation

Install it with npm.

$ npm install cheerio

Usage

Using cheerio.js is straightforward: import cheerio, call its load method, and pass in an HTML string.

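// ES module syntax (top-level await requires an ES module context)
import { load, type CheerioAPI } from "cheerio";

// Fetch the page and read the response body as an HTML string
const resp = await fetch("http://www.baidu.com");
const html = await resp.text();
// Parse the HTML into a jQuery-like query function
const $: CheerioAPI = load(html);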

From there it works just like jQuery: select elements with $('selector'), then read their text content with .text().

console.log($("title").text());
// Prints Baidu's page title: 百度一下，你就知道
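
Selecting several elements at once works the same way. The following is a minimal sketch that is not part of the original example: the sample markup, the `extractLinks` helper, and the `Link` type are all made up for illustration, and it parses an inline string instead of a live page so that it runs without network access.

```typescript
import { load } from "cheerio";

// Hypothetical sample markup, used here instead of a fetched page
const sampleHtml = `
  <ul id="links">
    <li><a href="/news"> News </a></li>
    <li><a href="/maps">Maps</a></li>
  </ul>`;

// Illustrative result shape; not defined by cheerio
interface Link {
  text: string;
  href: string;
}

// Collect the text and href of every <a> inside #links
function extractLinks(html: string): Link[] {
  const $ = load(html);
  return $("#links a")
    .toArray() // convert the cheerio selection into a plain array of nodes
    .map((el) => ({
      text: $(el).text().trim(),
      href: $(el).attr("href") ?? "", // attr() returns undefined when the attribute is missing
    }));
}

console.log(extractLinks(sampleHtml));
```

Calling `toArray()` first means the rest of the pipeline is ordinary `Array.prototype.map`, which keeps the result a typed plain array rather than a cheerio wrapper.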

`next.js` API routes

API routes are a very handy Next.js feature: they let us create API endpoints directly inside a Next.js project, with no extra configuration.

// pages/api/example.ts
import type { NextApiRequest, NextApiResponse } from "next";

export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  // Respond with HTTP 200 and a JSON body
  res.status(200).json({ name: "Jim" });
}

In the example above we created an API endpoint named example, served at /api/example. All it does is return a JSON object.
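
A handler can also inspect the incoming request. The sketch below is an illustration rather than part of the original project: the file name `pages/api/hello.ts`, the `name` query parameter, and the `Data` type are assumptions. It only accepts GET requests and reads one query value.

```typescript
// pages/api/hello.ts (hypothetical file name, for illustration only)
import type { NextApiRequest, NextApiResponse } from "next";

// Illustrative response shape
type Data = { message: string } | { error: string };

export default function handler(req: NextApiRequest, res: NextApiResponse<Data>) {
  // Only allow GET; other methods get a 405 with an Allow header
  if (req.method !== "GET") {
    res.setHeader("Allow", "GET");
    res.status(405).json({ error: "Method not allowed" });
    return;
  }

  // A query value may be a string, an array of strings, or undefined
  const raw = req.query.name;
  const name = Array.isArray(raw) ? raw[0] : raw;

  res.status(200).json({ message: `Hello, ${name ?? "stranger"}` });
}
```

Under these assumptions, a request to /api/hello?name=Jim would return a JSON body with the message "Hello, Jim", and a POST to the same path would be rejected with status 405.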

We can go a step further and combine cheerio.js with a Next.js API route to build a simple scrape-then-render flow:

// pages/api/scrape_baidu.ts
import { load, type CheerioAPI } from "cheerio";
import type { NextApiRequest, NextApiResponse } from "next";

export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  // Fetch Baidu's homepage on the server and parse it with cheerio
  const resp = await fetch("http://www.baidu.com");
  const html = await resp.text();
  const $: CheerioAPI = load(html);
  // Return only the page title to the caller
  res.status(200).json({ title: $("title").text() });
}

This endpoint scrapes the title of Baidu's homepage and returns it to the frontend. Next, let the frontend render it:

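// app/page.tsx
"use client"; // useState and useEffect only work in Client Components

import { useEffect, useState } from "react";

export default function Home() {
  const [title, setTitle] = useState<string>("");

  // Fetch the scraped title from our own API route after the component mounts
  useEffect(() => {
    fetch("/api/scrape_baidu")
      .then((res) => res.json())
      .then((data) => setTitle(data.title));
  }, []);

  return <div>{title}</div>;
}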

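This is a common approach, but it comes with a potentially fatal pitfall: an infinite loop. With the empty dependency array the effect runs only after mount, but if the array is omitted, or holds a value that changes on every render, the effect re-runs after every render, and a state update inside it can trigger the next render, and so on. Next.js inherits this unfortunate behavior from React's useEffect. For details, see these two write-ups: How to Solve the Infinite Loop of React.useEffect() and the GitHub issue "Infinite render loop with server components when running next dev".

Jack Herrington has proposed one solution: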

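// app/page.tsx
"use client";

import { use } from "react";

function makeQueryClient() {
  // Cache promises by name so every render reuses the same promise
  // instead of starting a new fetch
  const fetchMap = new Map<string, Promise<any>>();
  return function queryClient<QueryResult>(
    name: string,
    query: () => Promise<QueryResult>
  ): Promise<QueryResult> {
    if (!fetchMap.has(name)) {
      fetchMap.set(name, query());
    }
    return fetchMap.get(name)!;
  };
}

// Created once at module level so the cache survives re-renders
const queryClient = makeQueryClient();

export default function Home() {
  // use() suspends until the cached promise resolves
  const baidu = use(queryClient("baidu", () =>
    fetch("/api/scrape_baidu").then((res) => res.json())
  ));

  return (
    <div>{baidu.title}</div>
  );
}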

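That is a brief look at how to build a simple scraper with Next.js API routes. The scraper is not efficient, though: every request scrapes Baidu's homepage again. If 100 users open the page at the same time, 100 requests hit Baidu at once, and Baidu may well block our IP.

So we need to cache the scraping result to avoid repeated scraping.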
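
One possible way to do that caching is sketched below. It is an assumption-laden illustration, not the author's implementation: the `cached` helper, its `ttlMs` and `now` parameters, the 60-second lifetime, and the `scrapeTitle` stand-in are all hypothetical. It keeps the result in module memory, so it only helps while the same server process stays alive; on serverless platforms each instance would hold its own copy.

```typescript
// One cached promise plus the time at which it stops being valid
type Entry<T> = { value: Promise<T>; expiresAt: number };

// Wrap an async fetcher so that calls within ttlMs share one result.
// `now` is injectable so the expiry logic can be tested without waiting.
function cached<T>(
  fetcher: () => Promise<T>,
  ttlMs: number,
  now: () => number = Date.now
): () => Promise<T> {
  let entry: Entry<T> | undefined;
  return () => {
    const t = now();
    // Reuse the stored promise while it is still fresh
    if (entry && entry.expiresAt > t) {
      return entry.value;
    }
    const fresh: Entry<T> = { value: fetcher(), expiresAt: t + ttlMs };
    entry = fresh;
    // Drop a failed result so the next call retries instead of caching the error
    fresh.value.catch(() => {
      if (entry === fresh) entry = undefined;
    });
    return fresh.value;
  };
}

// Stand-in for the real scraping call so the sketch runs without network access
let scrapeCount = 0;
async function scrapeTitle(): Promise<string> {
  scrapeCount += 1;
  return `title (scrape #${scrapeCount})`;
}

// Hypothetical 60-second lifetime
const getTitle = cached(scrapeTitle, 60 * 1000);

async function demo(): Promise<void> {
  // Two concurrent callers share a single scrape
  const [a, b] = await Promise.all([getTitle(), getTitle()]);
  console.log(a === b, scrapeCount);
}

demo();
```

Storing the promise rather than the resolved value means that concurrent requests arriving before the first scrape finishes also share it, which is the 100-simultaneous-users case described above.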
