MarkTechPost@AI · March 10
A Step by Step Guide to Build a Trend Finder Tool with Python: Web Scraping, NLP (Sentiment Analysis & Topic Modeling), and Word Cloud Visualization

This article offers a practical guide to building a trend-finder tool with Python. With no external APIs or complex setup required, you will learn how to scrape publicly accessible websites, apply powerful NLP techniques such as sentiment analysis and topic modeling, and visualize emerging trends with a dynamic word cloud. Web page text is scraped with requests and BeautifulSoup and cleaned with NLTK, including lowercasing, punctuation removal, and stopword filtering. Counter computes word frequencies to identify keywords, TextBlob performs sentiment analysis to gauge the emotional tone of the text, LDA topic modeling uncovers latent themes, and a word cloud is generated at the end to give an at-a-glance view of the main trends and topics.

🌐 Scrape text data from publicly accessible websites using Python's requests and BeautifulSoup libraries, preparing it for the NLP analysis that follows.

🧹 Clean the scraped text with the NLTK toolkit, converting it to lowercase, removing punctuation, and filtering out common English stopwords so the data stays tidy and focused.

📊 Count word frequencies to identify the most common keywords in the text data, giving a quick view of hot topics and important themes.

🤔 Run sentiment analysis on the text data with the TextBlob library, scoring each document's overall emotional tone as positive, negative, or neutral.

📚 Apply LDA (Latent Dirichlet Allocation) topic modeling to uncover latent topics in the text corpus and extract the key concepts of each topic.

Monitoring and extracting trends from web content has become essential for market research, content creation, and staying ahead in your field. In this tutorial, we provide a practical guide to building your own trend-finding tool using Python. Without needing external APIs or complex setups, you'll learn how to scrape publicly accessible websites, apply powerful NLP (Natural Language Processing) techniques like sentiment analysis and topic modeling, and visualize emerging trends using dynamic word clouds.

import requests
from bs4 import BeautifulSoup

# List of URLs to scrape
urls = ["https://en.wikipedia.org/wiki/Natural_language_processing",
        "https://en.wikipedia.org/wiki/Machine_learning"]

collected_texts = []  # to store text from each page

for url in urls:
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract all paragraph text
        paragraphs = [p.get_text() for p in soup.find_all('p')]
        page_text = " ".join(paragraphs)
        collected_texts.append(page_text.strip())
    else:
        print(f"Failed to retrieve {url}")

First, with the above code snippet, we demonstrate a straightforward way to scrape textual data from publicly accessible websites using Python's requests and BeautifulSoup. It fetches content from the specified URLs, extracts the paragraphs from the HTML, and prepares them for further NLP analysis by combining the text into structured strings.
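
If you plan to point this loop at more sites, it is worth hardening the fetch step beyond a simple status-code check. The sketch below is an optional extension rather than part of the original tutorial: it adds a request timeout, exception handling, and a short courtesy delay between requests (the one-second default is an arbitrary choice).

import time
import requests
from bs4 import BeautifulSoup

def fetch_paragraph_text(url, delay=1.0):
    # Fetch a page and return its concatenated paragraph text, or None on failure
    try:
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
        response.raise_for_status()  # treat 4xx/5xx responses as errors
    except requests.RequestException as exc:
        print(f"Failed to retrieve {url}: {exc}")
        return None
    soup = BeautifulSoup(response.text, 'html.parser')
    paragraphs = [p.get_text() for p in soup.find_all('p')]
    time.sleep(delay)  # courtesy pause so we don't hammer the site
    return " ".join(paragraphs).strip()

collected_texts = [t for t in (fetch_paragraph_text(u) for u in urls) if t]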

import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

cleaned_texts = []
for text in collected_texts:
    # Remove non-alphabetical characters and lower the text
    text = re.sub(r'[^A-Za-z\s]', ' ', text).lower()
    # Remove stopwords
    words = [w for w in text.split() if w not in stop_words]
    cleaned_texts.append(" ".join(words))

Then, we clean the scraped text by converting it to lowercase, removing punctuation and special characters, and filtering out common English stopwords using NLTK. This preprocessing ensures the text data is clean, focused, and ready for meaningful NLP analysis.
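
One optional refinement, not part of the original tutorial: lemmatizing the remaining words so that inflected forms ('models' vs. 'model') collapse into shared keywords before counting. A minimal sketch with NLTK's WordNet lemmatizer, which by default treats each word as a noun:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # lemmatizer data (newer NLTK versions may also need 'omw-1.4')

lemmatizer = WordNetLemmatizer()
lemmatized_texts = []
for text in cleaned_texts:
    # Reduce each word to its dictionary form before frequency analysis
    words = [lemmatizer.lemmatize(w) for w in text.split()]
    lemmatized_texts.append(" ".join(words))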

from collections import Counter

# Combine all texts into one if analyzing overall trends:
all_text = " ".join(cleaned_texts)
word_counts = Counter(all_text.split())
common_words = word_counts.most_common(10)  # top 10 frequent words
print("Top 10 keywords:", common_words)

Now, we calculate word frequencies from the cleaned textual data, identifying the top 10 most frequent keywords. This helps highlight dominant trends and recurring themes across the collected documents, providing immediate insights into popular or significant topics within the scraped content.
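
Single-token counts will split multi-word trends such as "machine learning" into separate entries. A small, optional extension that counts adjacent word pairs (bigrams) with the same Counter approach can surface those phrases:

from collections import Counter

tokens = all_text.split()
# Pair each word with its successor and count the resulting bigrams
bigram_counts = Counter(zip(tokens, tokens[1:]))
print("Top 10 bigrams:", bigram_counts.most_common(10))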

!pip install textblob

from textblob import TextBlob

for i, text in enumerate(cleaned_texts, 1):
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0.1:
        sentiment = "Positive"
    elif polarity < -0.1:
        sentiment = "Negative"
    else:
        sentiment = "Neutral"
    print(f"Document {i} Sentiment: {sentiment} (polarity={polarity:.2f})")

We perform sentiment analysis on each cleaned text document using TextBlob, a Python library built on top of NLTK. It evaluates the overall emotional tone of each document—positive, negative, or neutral—and prints the sentiment along with a numerical polarity score, providing a quick indication of the general mood or attitude within the text data.
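
TextBlob's sentiment result also includes a subjectivity score between 0.0 (objective) and 1.0 (subjective), which the tutorial doesn't use but which can help distinguish opinionated pages from factual ones. A short variant that reports both values:

from textblob import TextBlob

for i, text in enumerate(cleaned_texts, 1):
    sentiment = TextBlob(text).sentiment  # namedtuple of (polarity, subjectivity)
    print(f"Document {i}: polarity={sentiment.polarity:.2f}, "
          f"subjectivity={sentiment.subjectivity:.2f}")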

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Adjust these parameters
vectorizer = CountVectorizer(max_df=1.0, min_df=1, stop_words='english')
doc_term_matrix = vectorizer.fit_transform(cleaned_texts)

# Fit LDA to find topics (for instance, 3 topics)
lda = LatentDirichletAllocation(n_components=3, random_state=42)
lda.fit(doc_term_matrix)

feature_names = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    print(f"Topic {idx + 1}: ", [feature_names[i] for i in topic.argsort()[:-11:-1]])

Then, we apply Latent Dirichlet Allocation (LDA)—a popular topic modeling algorithm—to discover underlying topics in the text corpus. It first transforms cleaned texts into a numerical document-term matrix using scikit-learn’s CountVectorizer, then fits an LDA model to identify the primary themes. The output lists the top keywords for each discovered topic, concisely summarizing key concepts in the collected data.
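
Beyond listing each topic's keywords, the fitted model can also report how strongly each document leans toward each topic. A brief follow-on, reusing the lda and doc_term_matrix objects from the snippet above:

# Each row is a document's distribution over the 3 topics (weights sum to ~1)
doc_topic_dist = lda.transform(doc_term_matrix)
for i, dist in enumerate(doc_topic_dist, 1):
    dominant = dist.argmax()
    print(f"Document {i}: dominant topic {dominant + 1} (weight {dist[dominant]:.2f})")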

# Assuming you have your text data stored in combined_text
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
import re

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Preprocess and clean the text:
cleaned_texts = []
for text in collected_texts:
    text = re.sub(r'[^A-Za-z\s]', ' ', text).lower()
    words = [w for w in text.split() if w not in stop_words]
    cleaned_texts.append(" ".join(words))

# Generate combined text
combined_text = " ".join(cleaned_texts)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white',
                      colormap='viridis').generate(combined_text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("Word Cloud of Scraped Text", fontsize=16)
plt.show()

Finally, we generate a word cloud visualization displaying prominent keywords from the combined and cleaned text data. By visually emphasizing the most frequent and relevant terms, this approach allows for intuitive exploration of the main trends and themes in the collected web content.
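
If you'd rather keep the image than just display it, the WordCloud object can be written to disk, and it can also be built directly from the Counter computed in the keyword step instead of re-tokenizing the text. A small optional sketch (the output filename is an arbitrary choice):

from wordcloud import WordCloud

# Build the cloud from the precomputed word frequencies
wc = WordCloud(width=800, height=400, background_color='white')
wc.generate_from_frequencies(word_counts)
wc.to_file("trend_wordcloud.png")  # hypothetical filename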

Word Cloud Output from the Scraped Site

In conclusion, we've successfully built a robust trend-finding tool. This exercise equipped you with hands-on experience in web scraping, NLP analysis, topic modeling, and intuitive visualization using word clouds. With this powerful yet straightforward approach, you can continuously track industry trends, gain valuable insights from social and blog content, and make informed decisions based on real-time data.


Here is the Colab Notebook.



