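Tutorial: Scraping Sohu Articles with Crawl4AI

Introduction to Crawl4AI

Crawl4AI is an open-source asynchronous web crawling library designed for AI applications. It lets developers fetch web pages, extract structured data, and plug in custom extraction strategies. Because it drives a real browser with JavaScript support, it also handles dynamic pages. Official documentation: docs.crawl4ai.com/.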

Crawl4AI's core strengths include:

Asynchronous operation

Extraction strategies

Browser integration

Caching and configuration

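If you are a Python developer with basic experience in asynchronous programming and web scraping, this tutorial takes you from the basics to a working example: scraping articles from Sohu with Crawl4AI.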

Environment Setup

Installing Crawl4AI

First, make sure you have a Python 3.12+ environment. Then install Crawl4AI:

pip install crawl4ai

Next, run crawl4ai-setup to install the remaining dependencies. On Linux you may need sudo so that it can install some system packages (via apt):

crawl4ai-setup

Other Dependencies

asyncio

bs4

logging

json

Basic Concepts

The core class in Crawl4AI is AsyncWebCrawler, which runs crawl tasks. The key configuration objects are:

CrawlerRunConfig

JsonCssExtractionStrategy

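A crawl usually proceeds in two stages:

Fetch the list of links.

Crawl each link and extract the article content.

We use Sohu as the example to show how this is done.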
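The two stages can be sketched as a small pipeline that takes each stage as an injected coroutine. This is only an outline under assumptions: run_pipeline, fake_fetch_links, and fake_fetch_article are hypothetical names, and the stand-in stages return canned data instead of crawling.

```python
import asyncio
from typing import Awaitable, Callable


async def run_pipeline(
    index_url: str,
    fetch_links: Callable[[str], Awaitable[list[str]]],
    fetch_article: Callable[[str], Awaitable[dict]],
) -> list[dict]:
    """Stage 1: collect links from an index page. Stage 2: extract each article."""
    links = await fetch_links(index_url)
    return [await fetch_article(link) for link in links]


# Stand-in stages so the flow can be exercised without network access.
async def fake_fetch_links(index_url: str) -> list[str]:
    return [f'{index_url}a/1', f'{index_url}a/2']


async def fake_fetch_article(url: str) -> dict:
    return {'url': url, 'title': f'Article at {url}'}
```

Keeping the stages as parameters means the link source (news or self-published media) can change without touching the extraction stage.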

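Hands-On: Scraping Sohu Articles

Step 1: Fetch Article Links

Sohu has several sources, such as news (sohu_news) and self-published media (sohu_mp). We define a generic function, base_fetch_links.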

import logging
from typing import Callable

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig, CrawlResult

from dubhe.domain.vo import Link, new_link

# BROWSER_CONFIG, CONFIG, DataError and url_clean are project-level names;
# the original does not show where they are imported from.

logger = logging.getLogger(__name__)


def _filter_link(link: Link) -> bool:
  # Sohu article pages live under /a/
  return link.href.startswith('https://www.sohu.com/a/')


async def base_fetch_links(
  url: str,
  run_config: CrawlerRunConfig,
  cb_filter: Callable[[Link], bool] = _filter_link
) -> list[Link]:
  browser_config = BROWSER_CONFIG.clone(browser_type=CONFIG.browsers.urls_browser)
  async with AsyncWebCrawler(config=browser_config) as crawler:
    result: CrawlResult = await crawler.arun(url=url, config=run_config)  # type: ignore
    if not result.success:
      raise DataError.error(
        f'Failed to fetch Sohu articles, url: {url}',
        detail={'title': 'Failed to fetch links', 'url': url, 'error_message': result.error_message},
      )
    # Same-site links collected by Crawl4AI during the crawl
    internal_links = result.links.get('internal', [])
    links = [new_link(link, url_clean) for link in internal_links]
    logger.info('Total internal links: %d', len(links))
    # Keep only the links that look like article pages
    links = [link for link in links if cb_filter(link)]
    logger.info('Article links: %d', len(links))
    return links

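This relies on Crawl4AI's built-in links feature. links is a Python dict that usually contains two lists, internal and external. Each entry may carry href, text, title, and so on. Unless you have disabled link extraction, this information is captured automatically. We then decide whether a link is an article link by checking whether it starts with https://www.sohu.com/a/. The advantage is that we only need to inspect the URL, so the filter keeps working even if the page layout changes.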
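The prefix check can be tried on its own against a dict shaped like links. The sketch below uses hand-written sample data rather than real crawl output. The helper name article_hrefs and the de-duplication step are additions for illustration and are not part of base_fetch_links.

```python
from typing import Any

ARTICLE_PREFIX = 'https://www.sohu.com/a/'


def article_hrefs(links: dict[str, list[dict[str, Any]]], prefix: str = ARTICLE_PREFIX) -> list[str]:
    """Return de-duplicated internal hrefs that start with `prefix`, in page order."""
    seen: set[str] = set()
    hrefs: list[str] = []
    for entry in links.get('internal', []):
        href = entry.get('href') or ''
        if href.startswith(prefix) and href not in seen:
            seen.add(href)
            hrefs.append(href)
    return hrefs


# Hand-written sample in the shape described above (not real crawl output).
sample_links = {
    'internal': [
        {'href': 'https://www.sohu.com/a/123_456', 'text': 'Story one', 'title': ''},
        {'href': 'https://www.sohu.com/about', 'text': 'About'},
        {'href': 'https://www.sohu.com/a/123_456', 'text': 'Story one (again)'},
    ],
    'external': [{'href': 'https://example.com/', 'text': 'Elsewhere'}],
}
```

Here article_hrefs(sample_links) keeps the single article URL and drops both the non-article page and the repeated link.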

Step 2: Extract Article Content

Define a JSON schema that extracts the article's elements, such as title, author, content, and images, and pass it to JsonCssExtractionStrategy.

SCHEMA = {
  'name': 'Sohu-Article',
  'baseSelector': '#article-container',
  'fields': [
    {'name': 'author', 'selector': '.user-info h4', 'type': 'text'},
    {'name': 'title', 'selector': '.main div.text-title h1', 'type': 'text'},
    {'name': 'published_date', 'selector': '.main .article-info #news-time', 'type': 'text'},
    {'name': 'content', 'selector': '.main article', 'type': 'html'},
    {'name': 'original_link', 'selector': '.main .article-info [data-role="original-link"]', 'type': 'text'},
    {
      'name': 'imgs',
      'selector': '.article img',
      'type': 'list',
      'fields': [
        {'name': 'src', 'type': 'attribute', 'attribute': 'src'},
        {'name': 'alt', 'type': 'attribute', 'attribute': 'alt'},
      ],
    },
  ],
}

The SCHEMA in detail:

name: 'Sohu-Article'

baseSelector: '#article-container'

fields

The fields list in detail

author

name: 'author'

selector: '.user-info h4'

type: 'text'

title

name: 'title'

selector: '.main div.text-title h1'

type: 'text'

published_date

name: 'published_date'

selector: '.main .article-info #news-time'

type: 'text'

content

name: 'content'

selector: '.main article'

type: 'html'

original_link

name: 'original_link'

selector: '.main .article-info [data-role="original-link"]'

type: 'text'

imgs

name: 'imgs'

selector: '.article img'

type: 'list'

fields

src

name: 'src'

type: 'attribute'

attribute: 'src' (the src attribute of each img element)

alt

name: 'alt'

type: 'attribute'

attribute: 'alt' (the alt attribute of each img element)
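Since a typo in the schema tends to surface only as empty output, it can help to sanity-check its shape before crawling. The checker below is a hypothetical helper, not Crawl4AI's own validation. Its rules are assumptions read off the SCHEMA above: every field has a name and a type, attribute fields name an attribute, and list fields carry nested fields.

```python
from typing import Any


def schema_problems(schema: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in a CSS extraction schema."""
    problems: list[str] = []
    for key in ('name', 'baseSelector', 'fields'):
        if key not in schema:
            problems.append(f'schema: missing {key!r}')
    _check_fields(schema.get('fields', []), 'fields', problems)
    return problems


def _check_fields(fields: list[dict[str, Any]], path: str, problems: list[str]) -> None:
    for index, field in enumerate(fields):
        where = f"{path}[{index}] ({field.get('name', '?')})"
        if 'name' not in field or 'type' not in field:
            problems.append(f'{where}: needs both name and type')
            continue
        if field['type'] == 'attribute' and 'attribute' not in field:
            problems.append(f'{where}: attribute fields need an attribute key')
        if field['type'] == 'list':
            if 'fields' not in field:
                problems.append(f'{where}: list fields need nested fields')
            # Recurse into the nested field definitions
            _check_fields(field.get('fields', []), f'{where}.fields', problems)


# Demo schema with one deliberate mistake in the nested alt field.
demo_schema = {
    'name': 'Sohu-Article',
    'baseSelector': '#article-container',
    'fields': [
        {
            'name': 'imgs',
            'selector': '.article img',
            'type': 'list',
            'fields': [
                {'name': 'src', 'type': 'attribute', 'attribute': 'src'},
                {'name': 'alt', 'type': 'attribute'},  # missing 'attribute'
            ],
        },
    ],
}
```

Running schema_problems(demo_schema) reports the nested alt field, while a schema written like the one above yields an empty list.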

The crawl function:

import json
from typing import Any


async def fetch_sohu_articles(pages: list[OriginalPage], run_config: CrawlerRunConfig):
  # ... (see sohu_article.py for the full code)
  async with AsyncWebCrawler(config=browser_config) as crawler:
    # Iterating with `async for` expects run_config to enable streaming results
    async for result in await crawler.arun_many(urls=urls, config=run_config, dispatcher=dispatcher):
      # `extracted_content` is a JSON array; we usually take only the first element
      item: dict[str, Any] = json.loads(result.extracted_content)[0]
      # HTML fragment of the article body
      content = clean_html(item.get('content', ''))
      # Images found in the article
      images = list(Img.new_imgs(item.get('imgs', []), access_schema))
      # Publication time of the article
      published_time = parse_datetime(item['published_date'])
      # Author; fall back to the original-link text when the author field is empty
      author = item.get('author', '')
      author = author if author else remove_whitespace(item.get('original_link', ''))
      # Download the images and return the paths where they are stored locally
      img_paths = await download_imgs(images, access_schema)
      # remaining code omitted ....

The image download code:

from urllib.parse import urlparse

from httpx import AsyncClient  # assumed source of AsyncClient; the original omits this import


async def download_imgs(images: list[Img], access_schema: str) -> list[str]:
  if not images:
    return []
  img_paths = []
  async with AsyncClient() as client:
    for img in images:
      img_url = img.src
      parsed_url = urlparse(img_url)
      # Store under <download_path>/<host>/..., derived from the URL path
      save_path = CONFIG.download_path / (parsed_url.hostname or 'unknown') / remove_suffix(parsed_url.path[1:])
      # exist_ok=True makes a separate is_dir() check unnecessary
      save_path.parent.mkdir(parents=True, exist_ok=True)
      # Protocol-relative URLs (//host/path) need a scheme prepended
      if not parsed_url.scheme:
        img_url = f'{access_schema}:{img_url}'
      saved_path = await download_img(client, img_url, save_path)
      if saved_path:
        img_paths.append(str(saved_path))
  return img_paths
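The two pieces of URL handling in download_imgs, adding a scheme to protocol-relative URLs and mapping a URL to a local path, can be isolated and checked without any network access. This is a sketch under assumptions: the helper names and the example host are made up, and it deliberately leaves out whatever remove_suffix does to the path.

```python
from pathlib import Path
from urllib.parse import urlparse


def absolute_img_url(img_url: str, scheme: str = 'https') -> str:
    """Prepend a scheme to protocol-relative URLs such as //host/a.png."""
    if img_url.startswith('//'):
        return f'{scheme}:{img_url}'
    return img_url


def local_img_path(img_url: str, download_root: Path) -> Path:
    """Map an image URL to <download_root>/<host>/<url path>."""
    parsed = urlparse(img_url)
    # Strip the leading slash so the URL path is joined below the host directory
    return download_root / (parsed.hostname or 'unknown') / parsed.path.lstrip('/')
```

For example, absolute_img_url('//img.example.com/x/a.png') gives 'https://img.example.com/x/a.png', and local_img_path places the same image under the img.example.com directory of the download root.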

Step 3: Putting It Together

Combine the pieces in a main function:

import asyncio


async def main():
  pages = find_all_pending_pages(site=SiteEnum.SOHU_NEWS)
  await fetch_sohu_articles(pages, CRAWLER_RUN_CONFIG)


if __name__ == '__main__':
  asyncio.run(main())

Notes

Page type detection

BeautifulSoup

Error handling

Concurrency

arun_many

Practical recommendations
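To make the error-handling and concurrency notes concrete, here is a generic asyncio sketch that caps the number of in-flight requests and collects failures per URL instead of stopping at the first one. It is plain asyncio, not Crawl4AI's arun_many or its dispatcher; fetch_all and fake_fetch are hypothetical names, and the fetcher is a stand-in.

```python
import asyncio
from typing import Awaitable, Callable


async def fetch_all(
    urls: list[str],
    fetch: Callable[[str], Awaitable[dict]],
    max_concurrency: int = 5,
) -> tuple[list[dict], dict[str, BaseException]]:
    """Run `fetch` over `urls` with bounded concurrency; collect failures by URL."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> dict:
        async with semaphore:  # at most max_concurrency fetches in flight
            return await fetch(url)

    # return_exceptions=True keeps one failure from cancelling the rest
    outcomes = await asyncio.gather(*(guarded(u) for u in urls), return_exceptions=True)
    ok: list[dict] = []
    failed: dict[str, BaseException] = {}
    for url, outcome in zip(urls, outcomes):
        if isinstance(outcome, BaseException):
            failed[url] = outcome
        else:
            ok.append(outcome)
    return ok, failed


# Stand-in fetcher: fails for one URL so both branches are exercised.
async def fake_fetch(url: str) -> dict:
    await asyncio.sleep(0)
    if url.endswith('/bad'):
        raise ValueError(f'cannot parse {url}')
    return {'url': url}
```

Returning the failures alongside the successes lets the caller decide whether to retry, log, or mark those pages as pending again.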

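Summary

After working through this tutorial, you should have a firm grasp of the core techniques for scraping Sohu articles with Crawl4AI. To explore more advanced features, such as LLM-based extraction strategies, see the official documentation.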
