AWS Machine Learning Blog, September 26, 2024
Build a multimodal social media content generator using Amazon Bedrock

This post describes how to build a social media content generator application using the Amazon Bedrock API and Amazon OpenSearch Serverless. The application combines vision, language, and embedding models to produce engaging content. The solution uses models such as Anthropic's Claude 3, Amazon Titan Image Generator, and Amazon Titan Multimodal Embeddings in a step-by-step workflow that creates compelling social media content while maintaining brand consistency.

🎉 **Build a content generator with the Amazon Bedrock API and Amazon OpenSearch Serverless**: The solution uses the Amazon Bedrock API and Amazon OpenSearch Serverless to access high-performing foundation models (FMs) from leading AI companies, simplifying the construction of a multimodal social media content generator.

📸 **Generate content with vision and language models**: The solution uses models such as Anthropic's Claude 3, Amazon Titan Image Generator, and Amazon Titan Multimodal Embeddings in a step-by-step workflow that creates engaging social media content while maintaining brand consistency.

🔍 **Multimodal embeddings for search and retrieval**: The solution uses a multimodal embedding model to search for and retrieve similar posts and to offer suggestions for improving a draft post. This helps keep content aligned with brand guidelines and appealing to the target audience.

📈 **Improve content creation efficiency and consistency**: By automating the content generation process, the solution helps content teams and influencers save time and effort while maintaining brand consistency, freeing them to focus on more creative and strategic tasks.

🚀 **Increase content appeal and engagement**: The engaging content generated by the solution can help brands increase social media engagement and build stronger connections with their target audience.

💡 **Scalability and flexibility**: Built on AWS cloud services, the solution is scalable and flexible, so it can adapt to changing requirements and handle growing content creation workloads.

🛡️ **Security**: The solution runs on AWS, which provides strong security controls to help protect data and applications.

📊 **Analytics and monitoring**: The solution lets you analyze content performance and monitor its effectiveness, helping you optimize your strategy and improve the content creation process.

🤝 **Collaboration**: The solution can be integrated with other tools and platforms, allowing content teams to collaborate with other teams and stakeholders.

🌐 **Global reach**: The solution can be deployed in AWS Regions around the world, enabling you to reach a global audience.

In today's digital age, social media has revolutionized the way brands interact with their consumers, creating a need for dynamic and engaging content that resonates with their target audience. Competition for consumer attention in this space is growing, and content creators and influencers face constant pressure to produce new, engaging, and brand-consistent content. These challenges stem from three key factors: the need for rapid content production; the desire for personalized content that is captivating, visually appealing, and reflective of the consumer's unique interests; and the necessity for content that is consistent with a brand's identity, messaging, aesthetics, and tone.

Traditionally, the content creation process has been a time-consuming task involving multiple steps such as ideation, research, writing, editing, design, and review. This slow creation cycle does not fit the rapid pace of social media.

Generative AI offers new possibilities to address this challenge, and content teams and influencers can use it to enhance their creativity and engagement while maintaining brand consistency. More specifically, the multimodal capabilities of large language models (LLMs) allow us to create rich, engaging content spanning the text, image, audio, and video formats that are omnipresent in advertising, marketing, and social media. With recent advancements in vision LLMs, creators can use visual input, such as reference images, to start the content creation process. Image similarity search and text semantic search further enhance the process by quickly retrieving relevant content and context.

In this post, we walk you through a step-by-step process to create a social media content generator app using vision, language, and embedding models (Anthropic's Claude 3, Amazon Titan Image Generator, and Amazon Titan Multimodal Embeddings) through the Amazon Bedrock API and Amazon OpenSearch Serverless. Amazon Bedrock is a fully managed service that provides access to high-performing foundation models (FMs) from leading AI companies through a single API. OpenSearch Serverless is a fully managed service that makes it easier to store vectors and other data types in an index and delivers sub-second query latency when searching billions of vectors and measuring semantic similarity.
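The code snippets later in this post assume that a Bedrock runtime client (bedrock) and an OpenSearch Serverless client (oss_client) already exist. The following is a minimal setup sketch rather than code from the sample repository; the model IDs and the collection endpoint are assumptions that you should verify for your own account and Region.

```python
# Minimal setup sketch (not from the sample repository). The later snippets assume a
# Bedrock runtime client named bedrock and an OpenSearch Serverless client named
# oss_client. Model IDs and the collection endpoint are assumptions; verify them in
# your own account and Region.
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

region = "us-east-1"
bedrock = boto3.client("bedrock-runtime", region_name=region)

# Assumed model IDs for the three FMs used in this post
MODEL_IMAGE = "amazon.titan-image-generator-v1"
MODEL_TEXT = "anthropic.claude-3-sonnet-20240229-v1:0"
MODEL_EMBED = "amazon.titan-embed-image-v1"  # name introduced here for the embedding sketch below

# OpenSearch Serverless client, signed with SigV4 for the "aoss" service
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region, "aoss")
oss_client = OpenSearch(
    hosts=[{"host": "<your-collection-id>.us-east-1.aoss.amazonaws.com", "port": 443}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)
```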

Here’s how the proposed process for content creation works:

1. The user (content team or marketing team) uploads a product image with a simple background (such as a handbag).
2. The user provides natural language descriptions of the scene and the enhancements they wish to add to the image as a prompt (such as "Christmas holiday decorations").
3. Amazon Titan Image Generator creates the enhanced image based on the provided scenario.
4. Claude 3 generates rich and engaging text that describes the image while aligning with the brand guidelines and tone.
5. After the draft (text and image) is created, the solution performs multimodal similarity searches against historical posts to find similar posts and to gather inspiration and recommendations for enhancing the draft post.
6. Based on the generated recommendations, the post text is further refined and presented to the user on the webpage.

The following diagram illustrates the end-to-end new content creation process.

Solution overview

In this solution, we start with data preparation, where the raw datasets can be stored in an Amazon Simple Storage Service (Amazon S3) bucket. We provide a Jupyter notebook to preprocess the raw data and use the Amazon Titan Multimodal Embeddings model to convert the images and text into embedding vectors. These vectors are then stored in an OpenSearch Serverless collection, as shown in the following figure.
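The helper that computes these embeddings, get_titan_multimodal_embedding, is used again later in this post but isn't shown there. The following is a minimal sketch of what it might look like, assuming the documented Titan Multimodal Embeddings request format and the clients from the earlier setup sketch; the index name and field names are assumptions (the repository notebooks define the actual ones).

```python
import base64
import json

def get_titan_multimodal_embedding(image_bytes=None, description=None, dimension=1024):
    """Sketch: embed an image, a text, or an image-text pair with Titan Multimodal Embeddings."""
    body = {"embeddingConfig": {"outputEmbeddingLength": dimension}}
    if image_bytes is not None:
        body["inputImage"] = base64.b64encode(image_bytes).decode("utf-8")
    if description is not None:
        body["inputText"] = description
    response = bedrock.invoke_model(
        body=json.dumps(body), modelId=MODEL_EMBED,
        accept="application/json", contentType="application/json",
    )
    return json.loads(response["body"].read())  # contains an "embedding" key

# Ingest one historical post into the OpenSearch Serverless collection. The index and
# field names are assumptions; the index must map image_vector as a knn_vector field
# of dimension 1024. image_bytes, post_text, and file_name come from one prepared post.
embedding = get_titan_multimodal_embedding(image_bytes=image_bytes, description=post_text)["embedding"]
oss_client.index(
    index="social-media-posts",
    body={"image_vector": embedding, "file_name": file_name, "post_text": post_text},
)
```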

Next is the content generation. The GUI webpage is hosted using a Streamlit application, where the user can provide an initial product image and a brief description of how they expect the enriched image to look. From the application, the user can also select the brand (which will link to a specific brand template later), choose the image style (such as photographic or cinematic), and select the tone for the post text (such as formal or casual).
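As a rough illustration of this configuration step (not the actual application code from the repository), a Streamlit page that collects these inputs might look like the following; the widget labels and option values are assumptions.

```python
import streamlit as st

st.title("Social media content generator")

# Inputs described above; widget labels and option values are illustrative assumptions
product_image = st.file_uploader("Upload a product image", type=["png", "jpg", "jpeg"])
text_input = st.text_input("Describe the scene and enhancements (for example, 'Christmas holiday decorations')")
brand = st.selectbox("Brand", ["Luxury Brand", "Fast Fashion Brand"])
image_style = st.selectbox("Image style", ["photographic", "cinematic"])
tone = st.selectbox("Post tone", ["formal", "casual"])

if st.button("Submit"):
    # The three stages described below (image enhancement, post generation,
    # retrieval and refinement) would be orchestrated from here.
    st.write("Launching the content creation process...")
```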

After all the configurations are provided, the content creation process, shown in the following figure, is launched.

In stage 1, the solution retrieves the brand-specific template and guidelines from a CSV file. In a production environment, you could maintain the brand template table in Amazon DynamoDB for scalability, reliability, and maintenance. The user input is used to generate the enriched image with the Amazon Titan Image Generator. The enriched image, together with all the other information, is then fed into the Claude 3 model, which has vision capability, to generate initial post text that closely aligns with the brand guidelines and the enriched image. At the end of this stage, the enriched image and initial post text are created and sent back to the GUI to display to users.
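The brand template lookup itself is a simple table read. The following is a minimal sketch under assumed column and table names (the repository's brand_guideline.csv may use different ones), with the DynamoDB variant mentioned above shown for comparison.

```python
import boto3
import pandas as pd

def get_brand_guidelines_csv(brand: str, path: str = "brand_guideline.csv") -> dict:
    """Sketch: look up one brand's template row in the CSV shipped with the sample."""
    df = pd.read_csv(path)
    row = df[df["brand"] == brand].iloc[0]  # column name is an assumption
    return row.to_dict()

def get_brand_guidelines_dynamodb(brand: str, table_name: str = "brand-templates") -> dict:
    """Sketch: the production variant, reading the same template from DynamoDB."""
    table = boto3.resource("dynamodb").Table(table_name)
    return table.get_item(Key={"brand": brand}).get("Item", {})

guidelines = get_brand_guidelines_csv("Luxury Brand")
```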

In stage 2, we combine the post text and image and use the Amazon Titan Multimodal Embeddings model to generate the embedding vector. Multimodal embedding models integrate information from different data types, such as text and images, into a unified representation. This enables searching for images using text descriptions, identifying similar images based on visual content, or combining both text and image inputs to refine search results. In this solution, the multimodal embedding vector is used to search and retrieve the top three similar historical posts from the OpenSearch vector store. The retrieved results are fed into Anthropic's Claude 3 model to generate a caption, provide insights on why these historical posts are engaging, and offer recommendations on how the user can improve their post.

In stage 3, based on the recommendations from stage 2, the solution automatically refines the post text and provides a final version to the user. The user has the flexibility to select the version they like and make changes before publishing. For the end-to-end content generation process, steps are orchestrated with the Streamlit application.

The whole process is shown in the following image:

Implementation steps

This solution has been tested in AWS Region us-east-1. However, it can also work in other Regions where the following services are available. Make sure you have the following set up before moving forward:

We use Amazon SageMaker Studio to generate historical post embeddings and save those embedding vectors to OpenSearch Serverless. Additionally, you will run the Streamlit app from the SageMaker Studio terminal to visualize and test the solution. Testing the Streamlit app in a SageMaker environment is intended for a temporary demo. For production, we recommend deploying the Streamlit app on Amazon Elastic Compute Cloud (Amazon EC2) or Amazon Elastic Container Service (Amazon ECS) services with proper security measures such as authentication and authorization.

We use the following models from Amazon Bedrock in the solution. Please see Model support by AWS Region and select the Region that supports all three models:

Set up a JupyterLab space on SageMaker Studio

A JupyterLab space is a private or shared space within SageMaker Studio that manages the storage and compute resources needed to run the JupyterLab application.

To set up a JupyterLab space

1. Sign in to your AWS account, open the AWS Management Console, and go to SageMaker Studio.
2. Select your user profile and choose Open Studio.
3. From Applications in the top left, choose JupyterLab.
4. If you already have a JupyterLab space, choose Run. If you don't, choose Create JupyterLab Space, enter a name, and choose Create Space.
5. Change the instance to t3.large and choose Run Space. Within a minute, the JupyterLab space should be ready.
6. Choose Open JupyterLab.
7. In the JupyterLab launcher window, choose Terminal.
8. Run the following command in the terminal to download the sample code from GitHub:

```
git clone https://github.com/aws-samples/Build-a-multimodal-social-media-content-generator-using-Amazon-Bedrock.git
```

Generate sample posts and compute multimodal embeddings

In the code repository, we provide some sample product images (bag, car, perfume, and candle) that were created using the Amazon Titan Image Generator model. Next, you can generate synthetic social media posts with the synthetic-data-generation.ipynb notebook by following the steps below. The generated post texts are saved in the metadata.jsonl file (if you prepared your own product images and post texts, you can skip this step). Then, compute multimodal embeddings for the pairs of images and generated texts. Finally, ingest the multimodal embeddings into a vector store on Amazon OpenSearch Serverless.

To generate sample posts

1. In JupyterLab, choose File Browser and navigate to the folder social-media-generator/embedding-generation.
2. Open the notebook synthetic-data-generation.ipynb. Choose the default Python 3 kernel and Data Science 3.0 image, then follow the instructions in the notebook. At this stage, you will have sample posts created and available in data_mapping.csv.
3. Open the notebook multimodal_embedding_generation.ipynb. The notebook first creates the multimodal embeddings for each post-image pair and then ingests the computed embeddings into a vector store on Amazon OpenSearch Serverless.
4. At the end of the notebook, you should be able to run a simple query against the collection, as shown in the following example:

```python
query_prompt = "christmas tree, holiday, bags"
similar_items = find_similar_items_from_query(
    query_prompt=query_prompt, k=3, num_results=5,
    index_name=index_name, dataset=df,
    open_search_client=oss_client,
)
```

The preparation steps are complete. If you want to try out the solution directly, you can skip to Run the solution with Streamlit App to quickly test the solution in your SageMaker environment. However, if you want a more detailed understanding of each step’s code and explanations, continue reading.

Generate a social media post (image and text) using FMs

In this solution, we use FMs through Amazon Bedrock for content creation. We start by enhancing the input product image using the Amazon Titan Image Generator model, which adds a dynamically relevant background around the target product.

The get_titan_ai_request_body function creates a JSON request body for the Titan Image Generator model, using its Outpainting feature. It accepts four parameters: outpaint_prompt (for example, “Christmas tree, holiday decoration” or “Mother’s Day, flowers, warm lights”), negative_prompt (elements to exclude from the generated image), mask_prompt (specifies areas to retain, such as “bag” or “car”), and image_str (the input image encoded as a base64 string).

The generate_image function requires model_id and body (the request body from get_titan_ai_request_body). It invokes the model using bedrock.invoke_model and returns the response containing the base64-encoded generated image.

Finally, the code snippet calls get_titan_ai_request_body with the provided prompts and input image string, then passes the request body to generate_image, resulting in the enhanced image.

```python
import json
import random

# Assumes a Bedrock runtime client (bedrock), a logger, and the helper
# base64_to_image are already defined, as in the sample notebooks.

def get_titan_ai_request_body(outpaint_prompt, negative_prompt, mask_prompt, image_str=None):
    seed = random.randint(0, 2147483647)
    body = {
        "taskType": "OUTPAINTING",
        "outPaintingParams": {
            "text": outpaint_prompt,
            "negativeText": negative_prompt,
            "image": image_str,
            "maskPrompt": mask_prompt,
            "outPaintingMode": "PRECISE",  # or DEFAULT
        },
        "imageGenerationConfig": {
            "numberOfImages": 1,
            "quality": "premium",
            "cfgScale": 8,
            "seed": seed,
        },
    }
    return json.dumps(body)


def generate_image(model_id, body):
    """
    Args:
        model_id (str): The model ID to use.
        body (str): The request body to use.
    Returns:
        response_body (dict): The model response, including the base64-encoded image.
    """
    logger.info("Generating image with model %s", model_id)

    accept = "application/json"
    content_type = "application/json"

    response = bedrock.invoke_model(
        body=body, modelId=model_id, accept=accept, contentType=content_type
    )
    response_body = json.loads(response.get("body").read())
    return response_body


body = get_titan_ai_request_body(outpaint_prompt, negative_prompt, mask_prompt, image_str=image_str)
response = generate_image(model_id=MODEL_IMAGE, body=body)
image_enhanced = base64_to_image(response["images"][0])
```

The following images showcase the enhanced versions generated based on input prompts like “Christmas tree, holiday decoration, warm lights,” a selected position (such as bottom-middle), and a brand (“Luxury Brand”). These settings influence the output images. If the generated image is unsatisfactory, you can repeat the process until you achieve the desired outcome.

Next, generate the post text, taking into consideration the user inputs, brand guidelines (provided in the brand_guideline.csv file, which you can replace with your own data), and the enhanced image generated from the previous step.

The generate_text_with_claude function is the higher-level function that handles the image and text input, prepares the necessary data, and calls generate_vision_answer to interact with the Amazon Bedrock model (Claude 3 models) and receive the desired response. The generate_vision_answer function performs the core interaction with the Amazon Bedrock model, processes the model’s response, and returns it to the caller. Together, they enable generating text responses based on combined image and text inputs.

In the following code snippet, an initial post prompt is constructed using formatting placeholders for various elements such as role, product name, target brand, tone, hashtag, copywriting, and brand messaging. These elements are provided in the brand_guideline.csv file to make sure that the generated text aligns with the brand preferences and guidelines. This initial prompt is then passed to the generate_text_with_claude function, along with the enhanced image to generate the final post text.

```python
def generate_vision_answer(bedrock: boto3.client, messages: list, model_id: str,
                           claude_config: dict, system_prompt: str):
    """
    Generates a vision answer using the specified model and configuration.
    """
    body = {"messages": [messages], **claude_config, "system": system_prompt}

    response = bedrock.invoke_model(modelId=model_id, body=json.dumps(body))
    response = json.loads(response["body"].read().decode("utf-8"))
    print("Claude vision answer OK")
    formatted_response = post_process_answer(response["content"][0]["text"])

    return formatted_response


def generate_text_with_claude(image, prompt):
    """
    Generate text with Claude for post generation and historical post analysis
    """
    with BytesIO() as byte_io:
        image.save(byte_io, format="PNG")
        image_bytes = byte_io.getvalue()

    messages = {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",  # matches the PNG encoding above
                    "data": base64.b64encode(image_bytes).decode(),
                },
            },
            {"type": "text", "text": prompt},
        ],
    }
    claude_text = generate_vision_answer(bedrock, messages, MODEL_TEXT, CLAUDE_CONFIG, SYSTEM_PROMPT)

    return claude_text


initial_post_prompt = PROMPT_TEXT.format(
    role=role, product_name=product_input, target_brand=brand,
    tone=tone, hashtag=hashtag, copywriting=copywriting,
    brand_messageing=brand_messageing,
)

post_text = generate_text_with_claude(
    image=image_enhanced,
    prompt=initial_post_prompt,
)
```
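The PROMPT_TEXT template itself isn't reproduced in this post. Purely to illustrate the placeholders described above, a hypothetical template might look like the following; the actual wording in the repository will differ, and the placeholder names must match the format call (including brand_messageing).

```python
# Hypothetical prompt template; the actual template in the sample repository will differ
PROMPT_TEXT = """You are a {role} writing a social media post for the product "{product_name}"
on behalf of {target_brand}. Write in a {tone} tone and follow the brand guidelines below.

Copywriting guidance: {copywriting}
Brand messaging: {brand_messageing}

Describe the attached image, include the hashtag {hashtag}, and add emojis where appropriate."""
```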

The following example shows the generated post text. It provides a detailed description of the product, aligns well with the brand guidelines, and incorporates elements from the image (such as the Christmas tree). Additionally, we instructed the model to include hashtags and emojis where appropriate, and the results demonstrate that it followed the prompt instructions effectively.

Post text:

Elevate your style with Luxury Brand’s latest masterpiece. Crafted with timeless elegance and superior quality, this exquisite bag embodies unique craftsmanship. Indulge in the epitome of sophistication and let it be your constant companion for life’s grandest moments. #LuxuryBrand #TimelessElegance #ExclusiveCollection

Retrieve and analyze the top three relevant posts

The next step involves using the generated image and text to search for the top three similar historical posts from a vector database. We use the Amazon Titan Multimodal Embeddings model to create embedding vectors, which are stored in Amazon OpenSearch Serverless. The relevant historical posts, which might have many likes, are displayed on the application webpage to give users an idea of what successful social media posts look like. Additionally, we analyze these retrieved posts and provide actionable improvement recommendations for the user. The following code snippet shows the implementation of this step.

The code defines two functions: find_similar_items and process_images. find_similar_items performs semantic search using the k-nearest neighbors (kNN) algorithm on the input image prompt. It computes a multimodal embedding for the image and query prompt, constructs an OpenSearch kNN query, runs the search, and retrieves the top matching images and post texts. process_images analyzes a list of similar images in parallel using multiprocessing. It generates analysis texts for the images by calling generate_text_with_claude with an analysis prompt, running the calls in parallel, and collecting the results.

In the snippet, find_similar_items is called to retrieve the top three similar images and post texts based on the input image and a combined query prompt. process_images is then called to generate analysis texts for the first three similar images in parallel, displaying the results simultaneously.

```python
def find_similar_items(image_bytes: str, query_prompt: str, k: int, num_results: int,
                       index_name: str, dataset, open_search_client) -> []:
    """
    Main semantic search capability using kNN on the input image and prompt.
    Args:
        k: number of nearest neighbors to retrieve for the kNN query
        num_results: number of top results to return from the OpenSearch index
        index_name: index name in OpenSearch
    """
    query_emb = get_titan_multimodal_embedding(
        image_bytes=image_bytes, description=query_prompt, dimension=1024
    )["embedding"]

    body = {
        "size": num_results,
        "_source": {
            "exclude": ["image_vector"],
        },
        "query": {
            "knn": {
                "image_vector": {
                    "vector": query_emb,
                    "k": k,
                }
            }
        },
    }

    res = open_search_client.search(index=index_name, body=body)
    images = []
    texts = []

    for hit in res["hits"]["hits"]:
        id_ = hit["_id"]
        file_name = hit["_source"]["file_name"]
        post_text = hit["_source"]["post_text"]
        image = get_image(file_name=file_name, dataset=dataset)
        image.name_and_score = f'{hit["_score"]}:{hit["_source"]["file_name"]}'
        images.append(image)
        texts.append(f"Post Text: {post_text}")

    return images, texts


def process_images(_similar_items, PROMPT_ANALYSIS):
    pool = multiprocessing.Pool(processes=3)  # Create a pool of 3 worker processes
    args = [(image, PROMPT_ANALYSIS) for image in _similar_items[:3]]
    results = pool.starmap(generate_text_with_claude, args)  # Execute the function calls in parallel

    # Unpack the results
    analysis_text_0, analysis_text_1, analysis_text_2 = results

    # Close the pool and wait for the tasks to finish
    pool.close()
    pool.join()

    return analysis_text_0, analysis_text_1, analysis_text_2


similar_images, post_texts = find_similar_items(
    image_bytes=image_enhanced_bytes, query_prompt=text_input + " " + post_text,
    k=5, num_results=3, index_name=index_name, dataset=mapping_table,
    open_search_client=oss_client,
)

analysis_text_0, analysis_text_1, analysis_text_2 = process_images(similar_images, PROMPT_ANALYSIS)
```

An example of historical post retrieval and analysis is shown in the following screenshot. Post images are listed on the left. On the right, the full text content of each post is retrieved and displayed. We then use an LLM to generate a comprehensive scene description for the post image, which can serve as a prompt to inspire image generation. Next, the LLM generates automatic recommendations for improvement. In this solution, we use the Claude 3 Sonnet model for text generation.
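The analysis prompt (PROMPT_ANALYSIS, passed to process_images in the previous snippet) is likewise not reproduced here. A hypothetical version, sketched only to show the kind of instruction involved, might look like this:

```python
# Hypothetical analysis prompt; the actual prompt in the sample repository will differ
PROMPT_ANALYSIS = """You are a social media analyst. For the attached historical post image and text:
1. Write a detailed scene description that could be reused as an image-generation prompt.
2. Explain briefly why this post is engaging.
3. Give concrete recommendations for improving a similar draft post."""
```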

As the final step, the solution incorporates the recommendations and refines the post text to make it more appealing and likely to attract more attention from social media users.
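This refinement step can reuse the same Claude 3 helper shown earlier. The following is a minimal sketch with a hypothetical refinement prompt; the actual prompt text and orchestration live in the Streamlit application.

```python
# Hypothetical refinement prompt and call; the wording is an assumption
REFINE_PROMPT = """Here is a draft social media post:
{draft}

Here are recommendations derived from similar high-performing posts:
{recommendations}

Rewrite the post to apply these recommendations while keeping the brand tone, hashtags, and emojis."""

refined_post_text = generate_text_with_claude(
    image=image_enhanced,
    prompt=REFINE_PROMPT.format(
        draft=post_text,
        recommendations="\n".join([analysis_text_0, analysis_text_1, analysis_text_2]),
    ),
)
```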

Run the solution with Streamlit App

You can download the solution from this Git repository. Use the following steps to run the Streamlit application and quickly test out the solution in your SageMaker Studio environment.

1. In SageMaker Studio, choose SageMaker Classic, then start an instance under your user profile.
2. After you have the JupyterLab environment running, clone the code repository and navigate to the streamlit-app folder in a terminal:

```
cd streamlit-app/
sh setup.sh
sh run.sh
```

3. You will see a webpage link generated in the terminal, which will look similar to the following:

https://[USER-PROFILE-ID].studio.[REGION].sagemaker.aws/jupyter/default/proxy/8501/

    To check the status of the Streamlit application, run sh status.sh in the terminal. To shut down the application, run sh cleanup.sh.

With the Streamlit app running, you can begin by providing initial prompts and selecting the products you want to retain in the image. You have the option to upload an image from your local machine, plug in your camera to take an initial product picture on the fly, or quickly test the solution by selecting a pre-uploaded image example. You can then optionally adjust the product's location in the image by setting its position. Next, select the brand for the product. In the demo, we use a luxury brand and a fast fashion brand, each with its own preferences and guidelines. Finally, choose the image style. Choose Submit to start the process.

The application will automatically handle post image and text generation, retrieve similar posts for analysis, and refine the final post. This end-to-end process can take approximately 30 seconds. If you aren't satisfied with the result, you can repeat the process a few times. An end-to-end demo is shown below.

Inspiration from historical posts using image similarity search

If you find yourself lacking ideas for initial prompts to create the enhanced image, consider using a reverse search approach. During the retrieve and analyze posts step mentioned earlier, scene descriptions are also generated, which can serve as inspiration. You can modify these descriptions as needed and use them to generate new images and accompanying text. This method effectively uses existing content to stimulate creativity and enhance the application’s output.

In the preceding example, the top three similar images to our generated images show perfume pictures posted to social media by users. This insight helps brands understand their target audience and the environments in which their products are used. By using this information, brands can create dynamic and engaging content that resonates with their users. For instance, in the example provided, “a hand holding a glass perfume bottle in the foreground, with a scenic mountain landscape visible in the background,” is unique and visually more appealing than a dull picture of “a perfume bottle standing on a branch in a forest.” This illustrates how capturing the right scene and context can significantly enhance the attractiveness and impact of social media content.

Clean up

When you finish experimenting with this solution, use the following steps to clean up the AWS resources to avoid unnecessary costs:

1. Navigate to the Amazon S3 console and delete the S3 bucket and data created for this solution.
2. Navigate to the Amazon OpenSearch Service console, choose Serverless, and then choose Collections. Delete the collection that was created for storing the historical post embedding vectors.
3. Navigate to the Amazon SageMaker console, choose Admin configurations, and select Domains. Select your user profile and delete the running application from Spaces and Apps.
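If you prefer to clean up the first two resources programmatically instead of through the console, a rough boto3 sketch might look like the following; the bucket and collection names are placeholders for whatever you created.

```python
import boto3

region = "us-east-1"

# Empty and delete the S3 bucket (bucket name is a placeholder)
bucket = boto3.resource("s3", region_name=region).Bucket("<your-solution-bucket>")
bucket.objects.all().delete()
bucket.delete()

# Delete the OpenSearch Serverless collection that stores the post embeddings
aoss = boto3.client("opensearchserverless", region_name=region)
summaries = aoss.list_collections(
    collectionFilters={"name": "<your-collection-name>"}
)["collectionSummaries"]
for collection in summaries:
    aoss.delete_collection(id=collection["id"])
```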

Conclusion

In this blog post, we introduced a multimodal social media content generator solution that uses FMs from Amazon Bedrock, such as the Amazon Titan Image Generator, Claude 3, and Amazon Titan Multimodal Embeddings. The solution streamlines the content creation process, enabling brands and influencers to produce engaging and brand-consistent content rapidly. You can try out the solution using this code sample.

The solution involves enhancing product images with relevant backgrounds using the Amazon Titan Image Generator, generating brand-aligned text descriptions through Claude 3, and retrieving similar historical posts using Amazon Titan Multimodal Embeddings. It provides actionable recommendations to refine content for better audience resonance. This multimodal AI approach addresses challenges in rapid content production, personalization, and brand consistency, empowering creators to boost creativity and engagement while maintaining brand identity.

We encourage brands, influencers, and content teams to explore this solution and use the capabilities of FMs to streamline their content creation processes. Additionally, we invite developers and researchers to build upon this solution, experiment with different models and techniques, and contribute to the advancement of multimodal AI in the realm of social media content generation.

See this announcement blog post for information about the Amazon Titan Image Generator and Amazon Titan Multimodal Embeddings model. For more information, see Amazon Bedrock and Amazon Titan in Amazon Bedrock.


About the Authors

Ying Hou, PhD, is a Machine Learning Prototyping Architect at AWS, specialising in building GenAI applications with customers, including RAG and agent solutions. Her expertise spans GenAI, ASR, Computer Vision, NLP, and time series prediction models. Outside of work, she enjoys spending quality time with her family, getting lost in novels, and hiking in the UK’s national parks.

Bishesh Adhikari is a Senior ML Prototyping Architect at AWS with over a decade of experience in software engineering and AI/ML. Specializing in GenAI, LLMs, NLP, CV, and geospatial ML, he collaborates with AWS customers to build solutions for challenging problems through co-development. His expertise accelerates customers' journey from concept to production, tackling complex use cases across various industries. In his free time, he enjoys hiking, traveling, and spending time with family and friends.
