Using NetworkX, Jaccard Similarity, and cuGraph to Predict Your Next Favorite Movie

As the amount of data available to everyone in the world increases, the ability for a consumer to make informed decisions becomes increasingly difficult. Fortunately, large datasets are a beneficial component for recommendation systems, which can make a sometimes-overwhelming decision much easier.Graphs are excellent choices for modeling the relationships inherent in the data that fuel recommendation systems, and NetworkX is a very popular option that many data scientists turn to for graph analytics in Python. NetworkX is easy to learn and use, stocked with a wide breadth of graph algorithms, backed by a large and friendly community, and has copious examples available in notebooks, documents, Stack Overflow, and your favorite LLM. However, and to the disappointment of countless developers that broke into graph analytics with or even because of NetworkX, it famously falls short in performance at the scales used by typical recommendation systems.This begs the question: Can an effective graph-based recommendation system be written in a few simple lines of Python? More generally, can developers and data scientists have both easy-to-use and high-performance graph analytics?The answer to both questions is “yes”.Read on to discover how you can create a simple and effective recommendation system in Python using NetworkX, a dataset of 33M movie reviews, the Jaccard Similarity algorithm, and the NVIDIA cuGraph back-end, which provides the over 250X speedup necessary for modern large-scale graph data.The MovieLens datasetLet’s start with‌ the most important part of the system: the data. The MovieLens dataset1 is generously made available for download to the public, and is described in more detail in the README file. The full set includes about 331k anonymized users reviewing 87k movies, resulting in 34M ratings.Figure 1. The MovieLens data can be represented as a graph, where the individual ratings easily map to edges between user and movie nodes.Extracting recommendations from the data: bipartite graphs and Jaccard SimilarityThe type of graph we’ve created from the MovieLens data is a bipartite graph because there are only two types of nodes: movies and users, and the reviews (edges) can only occur between a user and a movie. This makes it particularly easy to apply the Jaccard Similarity algorithm to find similarities between movies. Jaccard Similarity compares pairs of nodes and computes a similarity coefficient using their relationships in the graph. In this case, movies are related to each other based on how users have chosen to watch and review them.Figure 3. Jaccard Similarity computes a similarity coefficient using the sizes of the sets of neighbors for the two nodes being compared. Based on the viewing preferences of users, we can see m3 is more similar to m2 than it is to m1, and movies m4 and m1 aren’t similar at all. This system would recommend m2 to someone who likes m3, and wouldn’t recommend m1 to someone who likes m4.NetworkX makes it easy … for smaller graphsNot surprisingly, NetworkX supports the type of analysis we’ve described above, and it’s quite easy to start seeing results in just a few lines of Python. But as we’ll see, performance becomes a limitation for larger-sized graphs—such as those needed for our movie recommendation system—when using NetworkX without the GPU-accelerated cuGraph backend.We’ll look at the key pieces of the recommendation system below, but the full source code is available here.Because the Jaccard Similarity algorithm we’re using doesn’t take edge weights into account, it’ll consider all reviews equal. We don’t want movies with low reviews to be recommended, so we’ll filter out all reviews under a certain threshold, which has the side effect of making the graph smaller too.# Create a separate DataFrame containing only "good" reviews (rating >= 3).good_ratings_df = ratings_df[ratings_df["rating"] >= 3]good_user_ids = good_ratings_df["userId"].unique()good_movie_ids = good_ratings_df["movieId"].unique()If we print the sizes of the data we’re working with, we see our graph of good reviews is approximately 330k nodes and 28M edges, with an average degree (number of neighbors per node) of 84:total number of users: 330975total number of reviews: 33832162average number of total reviews/user: 102.22total number of users with good ratings: 329127total number of good reviews: 27782577average number of good reviews/user: 84.41As mentioned above, graphs of this size often present a challenge to NetworkX, but GPU acceleration using the cuGraph backend removes the performance limitations often associated with this much data. However, we’ll continue with a CPU-only environment to demonstrate default performance.Note: All examples below were run on a workstation using NetworkX 3.4.2 and a Intel(R) Xeon(R) Platinum 8480CL @ 2.0GHz with 2TB RAMUsing a NetworkX graph created from users and good movie reviews, let’s pick a user, find one of their highest rated movies, and use Jaccard Similarity to find other movies like it.# Pick a user and one of their highly-rated moviesuser = good_user_ids[321]user_reviews = good_user_movie_G[user]highest_rated_movie = max( user_reviews, key=lambda n: user_reviews[n].get("rating", 0))When we look up the node ID in the movie name map, we see one of this user’s highest rated movies is the animated film “Mulan”:highest rated movie for user=289308 is Mulan (1998), id: 1907, rated: {'rating': 5.0}We can now use Jaccard Similarity to recommend a movie based on the user’s preferences and viewing history:%%time# Run Jaccard Similarityjacc_coeffs = list(nx.jaccard_coefficient(good_user_movie_G, ebunch))CPU times: user 2min 5s, sys: 15.4 ms, total: 2min 5sWall time: 2min 14sThe Jaccard Similarity computation using the default NetworkX implementation ran for over two minutes. Using these results, we can now provide a recommendation.# Sort by coefficient value, which is the 3rd item in the tuplesjacc_coeffs.sort(key=lambda t: t[2], reverse=True) # Create a list of recommendations ordered by "best" to "worst" based on the# Jaccard Similarity coefficients and the movies already seenmovies_seen = list(good_user_movieG.neighbors(user))recommendations = [mid for (, mid, _) in jacc_coeffs if mid not in movies_seen] Now we can simply print the first movie in the sorted list of recommendations:User ID 289308 might like Tarzan (1999) (movie ID: 2687)The code is easy and the results look good, but performance holds us backAs we can see, the recommendation seems reasonable; someone who likes “Mulan” seems likely to also enjoy the 1999 Disney animated film “Tarzan”.However, if our goal was to provide a service, or to analyze hundreds or thousands of movies, the two-minute runtime would have us start looking for an alternative to NetworkX. We can see that finding similarities between other movies using this system isn’t any faster:%%time# 1196: "Star Wars: Episode V - The Empire Strikes Back (1980)"print_similar_movies(1196)movies similar to Star Wars: Episode V - The Empire Strikes Back (1980):movieId=260, Star Wars: Episode IV - A New Hope (1977)movieId=1210, Star Wars: Episode VI - Return of the Jedi (1983)movieId=1198, Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)CPU times: user 13min 47s, sys: 71.8 ms, total: 13min 47sWall time: 11min 30s%%time# 318: "Shawshank Redemption, The (1994)"print_similar_movies(318) movies similar to "Shawshank Redemption, The (1994)":movieId=296, Pulp Fiction (1994)movieId=593, "Silence of the Lambs, The (1991)"movieId=356, Forrest Gump (1994)CPU times: user 28min 28s, sys: 172 ms, total: 28min 28sWall time: 16min 49sThe quality of the recommendations returned is impressive given that this system is composed of only a few lines of code. However, the runtime performance makes it virtually unusable. ‌As seen above, finding recommendations based on “Shawshank Redemption, The (1994)” takes nearly 17 minutes.NVIDIA cuGraph makes it transformatively fasterThe graph algorithm in the above workflow is prohibitively expensive, but by using the NVIDIA cuGraph backend and a compatible GPU, we can dramatically improve performance without changing code.Jaccard Similarity is supported in nx-cugraph version 25.02 or later. Version 25.02 is available from nightly builds, and will be part of future stable releases later this month. Instructions on installing nx-cugraph, as well as other RAPIDS packages, from both nightly and stable channels using conda or pip, are available in the RAPIDS Installation Guide.Once installed, nx-cugraph can be enabled by simply setting an environment variable:NX_CUGRAPH_AUTOCONFIG=True cuGraph utilizes the GPU to dramatically accelerate the neighbor lookups and set comparisons needed for the Jaccard Similarity computation. Furthermore, as the graph scales and the number of movies and reviews per movie increases, performance remains almost constant.The best part of the system – the simplicity of the code – does not change, and the results are identical, but performance increases by over 250X for the run that previously took nearly 17 minutes, reducing it to under 4 seconds.Figure 4. Chart showing speedup of cuGraph over NetworkX for Jaccard Similarity computation for various movies.Software: NetworkX 3.4.2, cuGraph/nx-cugraph 25.02CPU: Intel(R) Xeon(R) Platinum 8480CL @ 2.0GHz 2TB RAMGPU: NVIDIA Quadro RTX 8000 48GB RAMConclusionThe blog post covered a simple and effective recommendation system that’s easy to write in Python using NetworkX. Although there are many other approaches one could take—as covered here—few would match the low effort required to start exploring data that graph analysis with NetworkX offers. However, productive and meaningful data exploration requires quick turnaround, and NetworkX has traditionally struggled to scale to larger, real-world problem sizes.The NVIDIA cuGraph backend for NetworkX accelerates the familiar and flexible NetworkX API to also make it performant at scale, generating results in seconds instead of tens of minutes, keeping you focused and productive. Users can now continue using NetworkX, the most popular graph analytics library, without concern for scaling issues simply by adding a GPU and the cuGraph backend to their environment.To learn more about accelerated graph analysis using NetworkX and NVIDIA cuGraph, visit https://rapids.ai/nx-cugraph.

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签