MarkTechPost@AI · 21 hours ago
Build a Gemini-Powered DataFrame Agent for Natural Language Data Analysis with Pandas and LangChain
This article shows how to use Google's Gemini models together with the Pandas library to perform in-depth analysis of the Titanic dataset. By combining the ChatGoogleGenerativeAI client with LangChain's Pandas DataFrame agent, it builds an interactive "agent" that understands natural-language queries. The agent can inspect data, compute statistics, discover correlations, and generate visual insights without manually written code for each task. The article walks from basic exploration steps to advanced analyses (such as survival rates across demographic groups and the age-fare correlation), then through multi-DataFrame comparison, custom scoring, and pattern mining, demonstrating how conversational AI can be used for deeper data insight.

💡 Install the required libraries: first, install langchain_experimental, langchain_google_genai, and pandas via pip to enable the DataFrame agent and Google Gemini integration, then import the core modules and set the GOOGLE_API_KEY environment variable.

🤖 Initialize the Gemini agent: create a helper function named setup_gemini_agent that initializes the LLM client with the chosen Gemini model and temperature, then wraps it in a LangChain Pandas DataFrame agent that can run natural-language queries against a DataFrame.

📊 Load and explore the data: the load_and_explore_data function fetches the Titanic CSV from the Pandas GitHub repository and prints the dataset's dimensions and column names for a quick sanity check, then returns the loaded DataFrame so exploratory analysis can begin immediately.

🔎 Basic and advanced analysis: the basic_analysis_demo function runs fundamental exploratory queries such as dataset dimensions, survival rate, sibling counts, and passenger-class distribution, while advanced_analysis_demo runs more complex queries such as computing correlations, performing stratified survival analysis, calculating median statistics, and detailed filtering.

🔄 Multi-DataFrame and custom analysis: multi_dataframe_demo shows how the Gemini agent can work across multiple DataFrames, such as the original Titanic data and a version with missing ages imputed, while custom_analysis_demo shows how to run bespoke, domain-specific investigations, such as building a passenger risk-scoring model, extracting and evaluating deck-level survival rates, and mining surname-based survival patterns and fare/age outliers.

In this tutorial, we’ll learn how to harness the power of Google’s Gemini models alongside the flexibility of Pandas to perform both straightforward and sophisticated analyses of the classic Titanic dataset. By combining the ChatGoogleGenerativeAI client with LangChain’s experimental Pandas DataFrame agent, we’ll set up an interactive “agent” that can interpret natural-language queries: it inspects data, computes statistics, uncovers correlations, and generates visual insights without manual code for each task. We’ll walk through basic exploration steps (like counting rows or computing survival rates), delve into advanced analyses such as survival rates by demographic segment and fare–age correlations, compare modifications across multiple DataFrames, and finally build custom scoring and pattern-mining routines to extract novel insights.

```python
!pip install langchain_experimental langchain_google_genai pandas

import os
import pandas as pd
import numpy as np
from langchain.agents.agent_types import AgentType
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
from langchain_google_genai import ChatGoogleGenerativeAI

os.environ["GOOGLE_API_KEY"] = "Use Your Own API Key"
```

First, we install the required libraries (langchain_experimental, langchain_google_genai, and pandas) with pip to enable the DataFrame agent and Google Gemini integration. We then import the core modules, set the GOOGLE_API_KEY environment variable, and are ready to instantiate a Gemini-powered Pandas agent for conversational data analysis.
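As a small quality-of-life addition (not part of the original tutorial), the key can be read from the environment first and only prompted for when missing, so it never ends up hard-coded in the notebook. A minimal sketch, where ensure_api_key is a hypothetical helper:

```python
import os
from getpass import getpass

def ensure_api_key(var="GOOGLE_API_KEY"):
    # Reuse an existing key from the environment; otherwise prompt
    # interactively (getpass hides the typed key in notebooks/terminals).
    if not os.environ.get(var):
        os.environ[var] = getpass(f"Enter {var}: ")
    return os.environ[var]
```

Calling ensure_api_key() before building the agent replaces the hard-coded assignment above.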

```python
def setup_gemini_agent(df, temperature=0, model="gemini-1.5-flash"):
    llm = ChatGoogleGenerativeAI(
        model=model,
        temperature=temperature,
        convert_system_message_to_human=True
    )
    agent = create_pandas_dataframe_agent(
        llm=llm,
        df=df,
        verbose=True,
        agent_type=AgentType.OPENAI_FUNCTIONS,
        allow_dangerous_code=True
    )
    return agent
```

This helper function initializes a Gemini-powered LLM client with our chosen model and temperature. Then it wraps it into a LangChain Pandas DataFrame agent that can execute natural-language queries (including “dangerous” code) against our DataFrame. Simply pass in our DataFrame to get back an interactive agent ready for conversational analysis.

```python
def load_and_explore_data():
    print("Loading Titanic Dataset...")
    df = pd.read_csv(
        "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
    )
    print(f"Dataset shape: {df.shape}")
    print(f"Columns: {list(df.columns)}")
    return df
```

This function fetches the Titanic CSV directly from the Pandas GitHub repo. It also prints out its dimensions and column names for a quick sanity check. Then it returns the loaded DataFrame so we can immediately begin our exploratory analysis.

```python
def basic_analysis_demo(agent):
    print("\nBASIC ANALYSIS DEMO")
    print("=" * 50)

    queries = [
        "How many rows and columns are in the dataset?",
        "What's the survival rate (percentage of people who survived)?",
        "How many people have more than 3 siblings?",
        "What's the square root of the average age?",
        "Show me the distribution of passenger classes"
    ]

    for query in queries:
        print(f"\nQuery: {query}")
        try:
            result = agent.invoke(query)
            print(f"Result: {result['output']}")
        except Exception as e:
            print(f"Error: {e}")
```

This demo routine kicks off a “Basic Analysis” session by printing a header, then iterates through a set of common exploratory queries, like dataset dimensions, survival rates, family counts, and class distributions, against our Titanic DataFrame agent. For each natural-language prompt, it invokes the agent, captures its output, and prints either the result or an error.
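The agent's answers to these basic queries are easy to cross-check, since each one maps onto a one-line pandas expression. A minimal sketch on a tiny illustrative DataFrame (the column names follow the Titanic CSV, but the values are made up):

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame with Titanic-style columns (not the real data).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Age":      [22.0, 38.0, 26.0, 35.0, 28.0],
    "SibSp":    [1, 1, 0, 4, 0],
    "Pclass":   [3, 1, 3, 3, 2],
})

rows, cols = df.shape                        # dataset dimensions
survival_rate = df["Survived"].mean() * 100  # percent who survived
big_families = (df["SibSp"] > 3).sum()       # more than 3 siblings/spouses aboard
sqrt_mean_age = np.sqrt(df["Age"].mean())    # square root of the average age
class_dist = df["Pclass"].value_counts()     # passenger-class distribution
```

Comparing these direct computations against the agent's replies is a quick way to verify it is executing the right pandas code under the hood.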

```python
def advanced_analysis_demo(agent):
    print("\nADVANCED ANALYSIS DEMO")
    print("=" * 50)

    advanced_queries = [
        "What's the correlation between age and fare?",
        "Create a survival analysis by gender and class",
        "What's the median age for each passenger class?",
        "Find passengers with the highest fares and their details",
        "Calculate the survival rate for different age groups (0-18, 18-65, 65+)"
    ]

    for query in advanced_queries:
        print(f"\nQuery: {query}")
        try:
            result = agent.invoke(query)
            print(f"Result: {result['output']}")
        except Exception as e:
            print(f"Error: {e}")
```

This “Advanced Analysis” function prints a header, then runs a series of more sophisticated queries against our Gemini-powered DataFrame agent: it computes correlations, performs stratified survival analyses, calculates median statistics, and conducts detailed filtering. It invokes each natural-language prompt in a loop, captures the agent’s responses, and prints the results (or errors), demonstrating how easily we can leverage conversational AI for deeper, segmented insights into our dataset.
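Again, each advanced query corresponds to a short pandas idiom. A sketch of the equivalents on illustrative data (the groupby and pd.cut calls are the techniques the agent typically generates for these prompts):

```python
import pandas as pd

# Illustrative Titanic-style rows (not the real dataset).
df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 0],
    "Sex":      ["female", "male", "female", "male", "female", "male"],
    "Pclass":   [1, 3, 1, 3, 2, 2],
    "Age":      [29.0, 40.0, 2.0, 27.0, 19.0, 70.0],
    "Fare":     [100.0, 7.25, 80.0, 8.05, 26.0, 10.5],
})

corr = df["Age"].corr(df["Fare"])                      # age-fare correlation
by_gender_class = df.groupby(["Sex", "Pclass"])["Survived"].mean()
median_age = df.groupby("Pclass")["Age"].median()      # median age per class
top_fares = df.nlargest(2, "Fare")                     # highest-fare passengers
age_groups = pd.cut(df["Age"], bins=[0, 18, 65, 120],
                    labels=["0-18", "18-65", "65+"])
survival_by_age = df.groupby(age_groups, observed=True)["Survived"].mean()
```

The stratified tables (by_gender_class, survival_by_age) are exactly the kind of output the agent summarizes back in natural language.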

```python
def multi_dataframe_demo():
    print("\nMULTI-DATAFRAME DEMO")
    print("=" * 50)

    df = pd.read_csv(
        "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
    )

    df_filled = df.copy()
    df_filled["Age"] = df_filled["Age"].fillna(df_filled["Age"].mean())

    agent = setup_gemini_agent([df, df_filled])

    queries = [
        "How many rows in the age column are different between the two datasets?",
        "Compare the average age in both datasets",
        "What percentage of age values were missing in the original dataset?",
        "Show summary statistics for age in both datasets"
    ]

    for query in queries:
        print(f"\nQuery: {query}")
        try:
            result = agent.invoke(query)
            print(f"Result: {result['output']}")
        except Exception as e:
            print(f"Error: {e}")
```

This demo illustrates how to spin up a Gemini-powered agent over multiple DataFrames: here, the original Titanic data and a version with missing ages imputed. We can then ask cross-dataset comparison questions (differences in row counts, average-age comparisons, missing-value percentages, and side-by-side summary statistics) using simple natural-language prompts.
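The cross-dataset comparisons the agent performs can also be verified directly. A hedged sketch of the mean imputation and the checks, on a small illustrative column with missing values:

```python
import numpy as np
import pandas as pd

# Illustrative frame with missing ages (stands in for the Titanic data).
df = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan, 30.0]})

df_filled = df.copy()
df_filled["Age"] = df_filled["Age"].fillna(df_filled["Age"].mean())

# Rows where the two versions differ (the formerly-missing entries).
changed = (df["Age"].isna() != df_filled["Age"].isna()).sum()
# Percentage of ages missing in the original.
missing_pct = df["Age"].isna().mean() * 100
# Mean imputation leaves the column mean unchanged.
means = (df["Age"].mean(), df_filled["Age"].mean())
```

The last line highlights a property worth knowing: filling with the mean preserves the average but shrinks the variance, which is why the agent's side-by-side summary statistics differ in spread but not in center.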

```python
def custom_analysis_demo(agent):
    print("\nCUSTOM ANALYSIS DEMO")
    print("=" * 50)

    custom_queries = [
        "Create a risk score for each passenger based on: Age (higher age = higher risk), Gender (male = higher risk), Class (3rd class = higher risk), Family size (alone or large family = higher risk). Then show the top 10 highest risk passengers who survived",

        "Analyze the 'deck' information from the cabin data: Extract deck letter from cabin numbers, Show survival rates by deck, Which deck had the highest survival rate?",

        "Find interesting patterns: Did people with similar names (same surname) tend to survive together? What's the relationship between ticket price and survival? Were there any age groups that had 100% survival rate?"
    ]

    for i, query in enumerate(custom_queries, 1):
        print(f"\nCustom Analysis {i}:")
        print(f"Query: {query[:100]}...")
        try:
            result = agent.invoke(query)
            print(f"Result: {result['output']}")
        except Exception as e:
            print(f"Error: {e}")
```

This routine kicks off a “Custom Analysis” session by walking through three complex, multi-step prompts. It builds a passenger risk-scoring model, extracts and evaluates deck-based survival rates, and mines surname-based survival patterns and fare/age outliers. Thus, we can see how easily our Gemini-powered agent handles bespoke, domain-specific investigations with just natural-language queries.
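To make the second prompt concrete, the deck analysis reduces to taking the first character of each cabin code and grouping on it. A sketch on illustrative cabin values (the real Titanic column has many missing cabins, which str[0] and groupby silently skip):

```python
import numpy as np
import pandas as pd

# Illustrative cabin codes; NaN marks passengers with no recorded cabin.
df = pd.DataFrame({
    "Cabin":    ["C85", np.nan, "E46", "C123", "B58"],
    "Survived": [1, 0, 1, 0, 1],
})

# Deck letter is the first character of the cabin code (NaN stays NaN).
df["Deck"] = df["Cabin"].str[0]
deck_survival = df.groupby("Deck")["Survived"].mean()
best_deck = deck_survival.idxmax()
```

This is the kind of short derived-column-plus-groupby pipeline the agent writes for itself when handed the natural-language version of the question.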

```python
def main():
    print("Advanced Pandas Agent with Gemini Tutorial")
    print("=" * 60)

    if not os.getenv("GOOGLE_API_KEY"):
        print("Warning: GOOGLE_API_KEY not set!")
        print("Please set your Gemini API key as an environment variable.")
        return

    try:
        df = load_and_explore_data()
        print("\nSetting up Gemini Agent...")
        agent = setup_gemini_agent(df)

        basic_analysis_demo(agent)
        advanced_analysis_demo(agent)
        multi_dataframe_demo()
        custom_analysis_demo(agent)

        print("\nTutorial completed successfully!")

    except Exception as e:
        print(f"Error: {e}")
        print("Make sure you have installed all required packages and set your API key.")


if __name__ == "__main__":
    main()
```

The main() function serves as the starting point for the tutorial. It verifies that our Gemini API key is set, loads and explores the Titanic dataset, and initializes the conversational Pandas agent. It then sequentially runs the basic, advanced, multi-DataFrame, and custom analysis demos. Lastly, it wraps the entire workflow in a try/except block to catch and report any errors before signaling successful completion.

```python
df = pd.read_csv(
    "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
)
agent = setup_gemini_agent(df)

agent.invoke("What factors most strongly predicted survival?")
agent.invoke("Create a detailed survival analysis by port of embarkation")
agent.invoke("Find any interesting anomalies or outliers in the data")
```

Finally, we directly load the Titanic data, instantiate our Gemini-powered Pandas agent, and fire off three one-off queries. We identify key survival predictors, break down survival by embarkation port, and uncover anomalies or outliers. We achieve all this without modifying any of our demo functions.

In conclusion, combining Pandas with Gemini via a LangChain DataFrame agent transforms data exploration from writing boilerplate code into crafting clear, natural-language queries. Whether we’re computing summary statistics, building custom risk scores, comparing multiple DataFrames, or drilling into nuanced survival analyses, the approach stays the same: with just a few lines of setup, we gain an interactive analytics assistant that adapts to new questions on the fly, surfaces hidden patterns, and accelerates our workflow.


Check out the Notebook. All credit for this research goes to the researchers of this project.

