MarkTechPost@AI | June 19, 02:15
How to Build an Advanced BrightData Web Scraper with Google Gemini for AI-Powered Data Extraction

This article explains how to build an enhanced web scraping tool that leverages BrightData's proxy network and the Google Gemini API for intelligent data extraction. It walks through structuring the Python project, installing and importing the required libraries, and encapsulating the scraping logic in a reusable BrightDataScraper class. Whether the target is an Amazon product page, a bestseller list, or a LinkedIn profile, the scraper's modular methods demonstrate how to configure scraping parameters, handle errors gracefully, and return structured JSON results. An optional ReAct-style AI agent integration shows how to combine LLM-driven reasoning with real-time scraping, enabling on-the-fly data analysis through natural language queries.

💡 This tutorial guides users through building an enhanced web scraping tool that combines BrightData's proxy network with Google's Gemini API for intelligent data extraction.

📦 First, install the key libraries: langchain-brightdata, langchain-google-genai, google-generativeai, langgraph, and langchain-core.

🛠️ The BrightDataScraper class encapsulates all BrightData web scraping logic and optional Gemini-powered intelligence. Its methods make it easy to fetch Amazon product details, bestseller lists, and LinkedIn profiles, handling API calls, errors, and JSON formatting.

🤖 The optional AI agent feature accepts natural language queries, runs with a Google API key, and prints results in a clean format.

In this tutorial, we walk you through building an enhanced web scraping tool that leverages BrightData’s powerful proxy network alongside Google’s Gemini API for intelligent data extraction. You’ll see how to structure your Python project, install and import the necessary libraries, and encapsulate scraping logic within a clean, reusable BrightDataScraper class. Whether you’re targeting Amazon product pages, bestseller listings, or LinkedIn profiles, the scraper’s modular methods demonstrate how to configure scraping parameters, handle errors gracefully, and return structured JSON results. An optional ReAct-style AI agent integration also shows you how to combine LLM-driven reasoning with real-time scraping, empowering you to pose natural language queries for on-the-fly data analysis.

!pip install langchain-brightdata langchain-google-genai langgraph langchain-core google-generativeai

We install all of the key libraries needed for the tutorial in one step: langchain-brightdata for BrightData web scraping, langchain-google-genai and google-generativeai for Google Gemini integration, langgraph for agent orchestration, and langchain-core for the core LangChain framework.

import os
import json
from typing import Dict, Any, Optional
from langchain_brightdata import BrightDataWebScraperAPI
from langchain_google_genai import ChatGoogleGenerativeAI
from langgraph.prebuilt import create_react_agent

These imports prepare your environment and core functionality: os and json handle system operations and data serialization, while typing provides structured type hints. You then bring in BrightDataWebScraperAPI for BrightData scraping, ChatGoogleGenerativeAI to interface with Google’s Gemini LLM, and create_react_agent to orchestrate these components in a ReAct-style agent.

class BrightDataScraper:
    """Enhanced web scraper using BrightData API"""

    def __init__(self, api_key: str, google_api_key: Optional[str] = None):
        """Initialize scraper with API keys"""
        self.api_key = api_key
        self.scraper = BrightDataWebScraperAPI(bright_data_api_key=api_key)

        if google_api_key:
            self.llm = ChatGoogleGenerativeAI(
                model="gemini-2.0-flash",
                google_api_key=google_api_key
            )
            self.agent = create_react_agent(self.llm, [self.scraper])

    def scrape_amazon_product(self, url: str, zipcode: str = "10001") -> Dict[str, Any]:
        """Scrape Amazon product data"""
        try:
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": "amazon_product",
                "zipcode": zipcode
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def scrape_amazon_bestsellers(self, region: str = "in") -> Dict[str, Any]:
        """Scrape Amazon bestsellers"""
        try:
            url = f"https://www.amazon.{region}/gp/bestsellers/"
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": "amazon_product"
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def scrape_linkedin_profile(self, url: str) -> Dict[str, Any]:
        """Scrape LinkedIn profile data"""
        try:
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": "linkedin_person_profile"
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def run_agent_query(self, query: str) -> None:
        """Run AI agent with natural language query"""
        if not hasattr(self, 'agent'):
            print("Error: Google API key required for agent functionality")
            return

        try:
            for step in self.agent.stream(
                {"messages": query},
                stream_mode="values"
            ):
                step["messages"][-1].pretty_print()
        except Exception as e:
            print(f"Agent error: {e}")

    def print_results(self, results: Dict[str, Any], title: str = "Results") -> None:
        """Pretty print results"""
        print(f"\n{'='*50}")
        print(f"{title}")
        print(f"{'='*50}")

        if results["success"]:
            print(json.dumps(results["data"], indent=2, ensure_ascii=False))
        else:
            print(f"Error: {results['error']}")
        print()

The BrightDataScraper class encapsulates all BrightData web-scraping logic and optional Gemini-powered intelligence under a single, reusable interface. Its methods enable you to easily fetch Amazon product details, bestseller lists, and LinkedIn profiles, handling API calls, errors, and JSON formatting, and even stream natural-language “agent” queries when a Google API key is provided. A convenient print_results helper ensures your output is always cleanly formatted for inspection.
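
Before wiring up the full demo, a minimal usage sketch shows the class in isolation. The key string is a placeholder, and the product URL is the same one used in main() below:

scraper = BrightDataScraper(api_key="YOUR_BRIGHT_DATA_KEY")  # placeholder key

result = scraper.scrape_amazon_product(
    "https://www.amazon.com/dp/B08L5TNJHG",  # demo URL reused from main()
    zipcode="10001",
)
scraper.print_results(result, "Single Product")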

def main():
    """Main execution function"""
    BRIGHT_DATA_API_KEY = "Use Your Own API Key"
    GOOGLE_API_KEY = "Use Your Own API Key"

    scraper = BrightDataScraper(BRIGHT_DATA_API_KEY, GOOGLE_API_KEY)

    print("Scraping Amazon India Bestsellers...")
    bestsellers = scraper.scrape_amazon_bestsellers("in")
    scraper.print_results(bestsellers, "Amazon India Bestsellers")

    print("Scraping Amazon Product...")
    product_url = "https://www.amazon.com/dp/B08L5TNJHG"
    product_data = scraper.scrape_amazon_product(product_url, "10001")
    scraper.print_results(product_data, "Amazon Product Data")

    print("Scraping LinkedIn Profile...")
    linkedin_url = "https://www.linkedin.com/in/satyanadella/"
    linkedin_data = scraper.scrape_linkedin_profile(linkedin_url)
    scraper.print_results(linkedin_data, "LinkedIn Profile Data")

    print("Running AI Agent Query...")
    agent_query = """
    Scrape Amazon product data for https://www.amazon.com/dp/B0D2Q9397Y?th=1
    in New York (zipcode 10001) and summarize the key product details.
    """
    scraper.run_agent_query(agent_query)

The main() function ties everything together by setting your BrightData and Google API keys, instantiating the BrightDataScraper, and then demonstrating each feature: it scrapes Amazon India’s bestsellers, fetches details for a specific product, retrieves a LinkedIn profile, and finally runs a natural-language agent query, printing neatly formatted results after each step.

if __name__ == "__main__":
    print("Installing required packages...")
    os.system("pip install -q langchain-brightdata langchain-google-genai langgraph")

    os.environ["BRIGHT_DATA_API_KEY"] = "Use Your Own API Key"

    main()

Finally, this entry-point block ensures that, when run as a standalone script, the required scraping libraries are quietly installed, and the BrightData API key is set in the environment. Then the main function is executed to initiate all scraping and agent workflows.
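
For anything beyond a quick demo, hardcoding keys in the script is fragile. A minimal variant of this setup, assuming the keys were exported as environment variables before launch, might look like this:

import os

# Read credentials from the environment instead of hardcoding them.
# Assumes BRIGHT_DATA_API_KEY (and optionally GOOGLE_API_KEY) were exported
# beforehand, e.g. `export BRIGHT_DATA_API_KEY=...`.
bright_key = os.environ.get("BRIGHT_DATA_API_KEY")
google_key = os.environ.get("GOOGLE_API_KEY")  # optional; only the agent needs it

if not bright_key:
    raise SystemExit("Set BRIGHT_DATA_API_KEY before running this script.")

scraper = BrightDataScraper(bright_key, google_key)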

In conclusion, by the end of this tutorial, you’ll have a ready-to-use Python script that automates tedious data collection tasks, abstracts away low-level API details, and optionally taps into generative AI for advanced query handling. You can extend this foundation by adding support for other dataset types, integrating additional LLMs, or deploying the scraper as part of a larger data pipeline or web service. With these building blocks in place, you’re now equipped to gather, analyze, and present web data more efficiently, whether for market research, competitive intelligence, or custom AI-driven applications.
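
As one example of that extensibility, a new dataset type can follow the same invoke() pattern the class already uses. The sketch below is illustrative only: the "instagram_profiles" dataset name is an assumption, so check BrightData's documentation for the exact dataset_type values available on your account.

from typing import Any, Dict

def scrape_instagram_profile(self, url: str) -> Dict[str, Any]:
    """Scrape an Instagram profile using the same invoke() pattern."""
    try:
        results = self.scraper.invoke({
            "url": url,
            "dataset_type": "instagram_profiles",  # assumed name; verify in BrightData docs
        })
        return {"success": True, "data": results}
    except Exception as e:
        return {"success": False, "error": str(e)}

# Attach the new method to the existing class without editing its body:
BrightDataScraper.scrape_instagram_profile = scrape_instagram_profile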


Check out the Notebook. All credit for this research goes to the researchers of this project.

