MarkTechPost@AI · 2 days ago, 14:10
A Coding Guide to Asynchronous Web Data Extraction Using Crawl4AI: An Open-Source Web Crawling and Scraping Toolkit Designed for LLM Workflows


🌐 Crawl4AI is a Python-based asynchronous web crawler that streamlines extracting web-page data directly in Google Colab. It combines asyncio, httpx, and JsonCssExtractionStrategy for efficient scraping.

💡 Crawl4AI's core strength is a unified API that switches seamlessly between browser-based (Playwright) and HTTP-only strategies, backed by robust error handling and declarative extraction schemas. This makes it well suited to scalable data pipelines, on-the-fly ETL in notebooks, or feeding LLMs and analytics tools clean JSON/CSV output.

🛠️ Extraction with Crawl4AI takes four steps: install the dependencies (crawl4ai, httpx), configure the HTTP request headers, define a CSS-to-JSON extraction schema, and orchestrate the crawl through AsyncWebCrawler and CrawlerRunConfig. The extracted data loads straight into pandas for further processing.

🚀 Crawl4AI supports pure HTTP crawling, avoiding headless-browser overhead for better performance, while also offering Playwright-driven browser automation, so users can choose the most lightweight and efficient backend for the job.

In this tutorial, we demonstrate how to harness Crawl4AI, a modern, Python‑based web crawling toolkit, to extract structured data from web pages directly within Google Colab. Leveraging the power of asyncio for asynchronous I/O, httpx for HTTP requests, and Crawl4AI’s built‑in AsyncHTTPCrawlerStrategy, we bypass the overhead of headless browsers while still parsing complex HTML via JsonCssExtractionStrategy. With just a few lines of code, you install dependencies (crawl4ai, httpx), configure HTTPCrawlerConfig to request only gzip/deflate (avoiding Brotli issues), define your CSS‑to‑JSON schema, and orchestrate the crawl through AsyncWebCrawler and CrawlerRunConfig. Finally, the extracted JSON data is loaded into pandas for immediate analysis or export. 

What sets Crawl4AI apart is its unified API, which seamlessly switches between browser-based (Playwright) and HTTP-only strategies, its robust error-handling hooks, and its declarative extraction schemas. Unlike traditional headless-browser workflows, Crawl4AI allows you to choose the most lightweight and performant backend, making it ideal for scalable data pipelines, on-the-fly ETL in notebooks, or feeding LLMs and analytics tools with clean JSON/CSV outputs.
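As a rough sketch of that backend swap (assuming crawl4ai exposes BrowserConfig, as in recent releases; parameter names may differ across versions), the same run configuration can drive a Playwright-backed crawl without any change to the extraction schema:

from crawl4ai import AsyncWebCrawler, BrowserConfig

async def crawl_with_browser(url, run_cfg):
    # Passing a BrowserConfig (instead of an HTTP crawler strategy)
    # selects the Playwright-driven browser backend; the extraction
    # schema inside run_cfg is untouched.
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        return await crawler.arun(url=url, config=run_cfg)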

!pip install -U crawl4ai httpx

First, we install (or upgrade) Crawl4AI, the core asynchronous crawling framework, alongside HTTPX. This high-performance HTTP client provides all the building blocks we need for lightweight, asynchronous web scraping directly in Colab.

import asyncio, json, pandas as pd
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, HTTPCrawlerConfig
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

We bring in Python's core async and data-handling modules (asyncio for concurrency, json for parsing, pandas for tabular storage) alongside Crawl4AI's essentials: AsyncWebCrawler to drive the crawl, CrawlerRunConfig and HTTPCrawlerConfig to configure extraction and HTTP settings, AsyncHTTPCrawlerStrategy for a browser-free HTTP backend, and JsonCssExtractionStrategy to map CSS selectors into structured JSON.

http_cfg = HTTPCrawlerConfig(
    method="GET",
    headers={
        "User-Agent":      "crawl4ai-bot/1.0",
        "Accept-Encoding": "gzip, deflate"
    },
    follow_redirects=True,
    verify_ssl=True
)
crawler_strategy = AsyncHTTPCrawlerStrategy(browser_config=http_cfg)

Here, we instantiate an HTTPCrawlerConfig to define our HTTP crawler’s behavior, using a GET request with a custom User-Agent, gzip/deflate encoding only, automatic redirects, and SSL verification. We then plug that into AsyncHTTPCrawlerStrategy, allowing Crawl4AI to drive the crawl via pure HTTP calls rather than a full browser.

schema = {
    "name": "Quotes",
    "baseSelector": "div.quote",
    "fields": [
        {"name": "quote",  "selector": "span.text",      "type": "text"},
        {"name": "author", "selector": "small.author",   "type": "text"},
        {"name": "tags",   "selector": "div.tags a.tag", "type": "text"}
    ]
}
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=False)
run_cfg = CrawlerRunConfig(extraction_strategy=extraction_strategy)

We define a JSON-CSS extraction schema targeting each quote block (div.quote) and its child elements (span.text, small.author, div.tags a.tag), then initialize a JsonCssExtractionStrategy with that schema and wrap it in a CrawlerRunConfig so Crawl4AI knows exactly what structured data to pull on each request.
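To make the schema concrete, here is the shape of one record we would expect after parsing the extracted content with json.loads (illustrative values only; with "type": "text", a selector matching several nodes, like div.tags a.tag, likely returns just the first match, whereas a list-typed field would collect them all):

# Illustrative shape of one extracted record (hypothetical values).
example_item = {
    "quote":  "“The world as we have created it is a process of our thinking.”",
    "author": "Albert Einstein",
    "tags":   "change",  # first matching tag; a "list" field type would keep all tags
}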

async def crawl_quotes_http(max_pages=5):
    all_items = []
    async with AsyncWebCrawler(crawler_strategy=crawler_strategy) as crawler:
        for p in range(1, max_pages + 1):
            url = f"https://quotes.toscrape.com/page/{p}/"
            try:
                res = await crawler.arun(url=url, config=run_cfg)
            except Exception as e:
                print(f"Page {p} failed outright: {e}")
                continue
            if not res.extracted_content:
                print(f"Page {p} returned no content, skipping")
                continue
            try:
                items = json.loads(res.extracted_content)
            except Exception as e:
                print(f"Page {p} JSON-parse error: {e}")
                continue
            print(f"Page {p}: {len(items)} quotes")
            all_items.extend(items)
    return pd.DataFrame(all_items)

This asynchronous function orchestrates the HTTP-only crawl: it spins up an AsyncWebCrawler with our AsyncHTTPCrawlerStrategy, iterates through each page URL, safely awaits crawler.arun(), handles any request or JSON-parsing errors, and collects the extracted quote records into a single pandas DataFrame for downstream analysis.

df = asyncio.get_event_loop().run_until_complete(crawl_quotes_http(max_pages=3))
df.head()

Finally, we kick off the crawl_quotes_http coroutine on Colab’s existing asyncio loop, fetching three pages of quotes, and then display the first few rows of the resulting pandas DataFrame to verify that our crawler returned structured data as expected.
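If the runtime instead raises "This event loop is already running" (common in Jupyter-style kernels), a frequently used workaround is nest_asyncio; this is an assumption about your environment, not part of the original tutorial:

# Hypothetical fallback for kernels whose event loop is already running.
# nest_asyncio patches the loop so run_until_complete() can re-enter it.
# !pip install nest_asyncio
import nest_asyncio
nest_asyncio.apply()
df = asyncio.get_event_loop().run_until_complete(crawl_quotes_http(max_pages=3))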

In conclusion, by combining Google Colab’s zero-config environment with Python’s asynchronous ecosystem and Crawl4AI’s flexible crawling strategies, we have now developed a fully automated pipeline for scraping and structuring web data in minutes. Whether you need to spin up a quick dataset of quotes, build a refreshable news‑article archive, or power a RAG workflow, Crawl4AI’s blend of httpx, asyncio, JsonCssExtractionStrategy, and AsyncHTTPCrawlerStrategy delivers both simplicity and scalability. Beyond pure HTTP crawls, you can instantly pivot to Playwright‑driven browser automation without rewriting your extraction logic, underscoring why Crawl4AI stands out as the go‑to framework for modern, production‑ready web data extraction.
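For that last hand-off to LLMs or analytics tools, plain pandas export methods suffice (a minimal sketch; the file names are arbitrary):

# Persist the scraped quotes for downstream LLM or analytics use.
df.to_csv("quotes.csv", index=False)                            # flat CSV
df.to_json("quotes.json", orient="records", force_ascii=False)  # JSON array of records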


Here is the Colab Notebook.


