AWS Machine Learning Blog · July 8, 03:53
Agents as escalators: Real-time AI video monitoring with Amazon Bedrock Agents and video streams

 

This post introduces an intelligent video monitoring solution built on Amazon Bedrock that combines OpenCV, Amazon Bedrock, and Amazon Bedrock Agents to process video streams in real time with contextual scene understanding. By detecting motion, analyzing context, making decisions, and maintaining a semantic memory of events, the solution reduces false positives and improves monitoring efficiency. Its core strength is using the capabilities of Amazon Bedrock Agents to analyze video content intelligently and automate responses, ultimately producing a smarter monitoring system.

📹 Challenges of traditional video monitoring systems: conventional approaches often produce excessive false positives because their rules are too simple, become hard to maintain when rules grow complex, and rely on inefficient manual monitoring.

💡 Solution overview: the solution processes video streams with OpenCV, applies Amazon Bedrock for contextual understanding, and automates responses through Amazon Bedrock Agents to build an intelligent monitoring system.

⚙️ Core components and workflow: the solution comprises agents acting as event escalators, a video processing pipeline, and Amazon Bedrock Agents. It detects motion, analyzes context, makes decisions, and maintains a semantic memory of events.

🚨 Graduated response mechanism: the system implements a graduated response framework that classifies events by severity and required action, covering logging, notification, and immediate response, which improves monitoring efficiency and response speed.

Organizations deploying video monitoring systems face a critical challenge: processing continuous video streams while maintaining accurate situational awareness. Traditional monitoring approaches that use rule-based detection or basic computer vision frequently miss important events or generate excessive false positives, leading to operational inefficiencies and alert fatigue.

In this post, we show how to build a fully deployable solution that processes video streams using OpenCV, provides contextual scene understanding with Amazon Bedrock, and automates responses through Amazon Bedrock Agents. This solution extends the capabilities demonstrated in Automate chatbot for document and data retrieval using Amazon Bedrock Agents and Knowledge Bases, which discussed using Amazon Bedrock Agents for document and data retrieval. Here, we apply Amazon Bedrock Agents to real-time video analysis and event monitoring.

Benefits of using Amazon Bedrock Agents for video monitoring

The following figure shows example video stream inputs from different monitoring scenarios. With contextual scene understanding, users can search for specific events.

A front door camera captures many events throughout the day, but some are more interesting than others. Knowing whether a package is being delivered or removed (as in the following package example) limits alerts to urgent events.

Amazon Bedrock is a fully managed service that provides access to high-performing foundation models (FMs) from leading AI companies through a single API. Using Amazon Bedrock, you can build secure, responsible generative AI applications. Amazon Bedrock Agents extends these capabilities by enabling applications to execute multi-step tasks across systems and data sources, making it ideal for complex monitoring scenarios. The solution processes video streams through these key steps:

1. Extract frames from live video streams or local files when motion is detected.
2. Analyze context using multimodal FMs.
3. Make decisions using agent-based logic with configurable responses.
4. Maintain a searchable semantic memory of events.
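As a hedged sketch of the first step (frame extraction on motion), the following uses OpenCV's VideoCapture with a simple frame-difference threshold; the source path and threshold values are illustrative, not the solution's actual parameters:

```
import cv2

cap = cv2.VideoCapture("sample_event.mp4")  # live stream URL or local file
prev_gray = None
saved = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (21, 21), 0)
    if prev_gray is not None:
        # Count pixels that changed noticeably since the previous frame.
        diff = cv2.absdiff(prev_gray, gray)
        _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
        if cv2.countNonZero(mask) > 10_000:  # illustrative motion threshold
            cv2.imwrite(f"motion_{saved:05d}.jpg", frame)  # keep this frame
            saved += 1
    prev_gray = gray

cap.release()
```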

You can build this intelligent video monitoring system using Amazon Bedrock Agents and Amazon Bedrock Knowledge Bases in an automated solution. The complete code is available in the GitHub repo.

Limitations of current video monitoring systems

Organizations deploying video monitoring systems face a fundamental dilemma. Despite advances in camera technology and storage capabilities, the intelligence layer interpreting video feeds often remains rudimentary. This creates a challenging situation where security teams must make significant trade-offs in their monitoring approach. Current video monitoring solutions typically force organizations to choose among simplistic rules that generate excessive false positives, complex rule sets that are difficult to maintain, and labor-intensive manual monitoring.

These trade-offs create fundamental barriers to effective video monitoring that impact security, safety, and operational efficiency across industries. Based on our work with customers, we've identified three critical challenges that emerge from these limitations.

Without these capabilities, you can’t gain cumulative benefits from your monitoring infrastructure or perform sophisticated retrospective analysis. To address these challenges effectively, you need a fundamentally different approach. By combining the contextual understanding capabilities of FMs with a structured framework for event classification and response, you can build more intelligent monitoring systems. Amazon Bedrock Agents provides the ideal platform for this next-generation approach.

Solution overview

You can address these monitoring challenges by building a video monitoring solution with Amazon Bedrock Agents. The system intelligently screens events, filters routine activity, and escalates situations requiring human attention, helping reduce alert fatigue while improving detection accuracy. The solution uses Amazon Bedrock Agents to analyze detected motion from video, and alerts users when an event of interest happens according to the provided instructions. This allows the system to intelligently filter out trivial events that can trigger motion detection, such as wind or birds, and direct the user’s attention only to events of interest. The following diagram illustrates the solution architecture.

The solution uses three primary components to address the core challenges: agents as escalators, a video processing pipeline, and Amazon Bedrock Agents. We discuss these components in more detail in the following sections.

The solution is deployed with the AWS Cloud Development Kit (AWS CDK), an open source software development framework for defining cloud infrastructure as code and provisioning it through AWS CloudFormation.

Agents as escalators

The first component uses Amazon Bedrock Agents to examine detected motion events, analyze the captured frames in context, filter out trivial activity, and escalate events of interest.

We implement a graduated response framework that categorizes events by severity and required action: routine events (alert level 0) are logged for later search, noteworthy events (alert level 1) generate a notification, and urgent events (alert level 2) call for an immediate response.
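As a hedged illustration of how such a framework might be represented in code (the enum names and dispatch logic below are assumptions for this sketch, not the solution's actual types):

```
from enum import IntEnum

class AlertLevel(IntEnum):
    LOG = 0        # routine activity: store for later search
    NOTIFY = 1     # noteworthy: send the user a notification
    IMMEDIATE = 2  # urgent: escalate for immediate response

def route_event(level: AlertLevel, description: str) -> str:
    """Dispatch an analyzed event according to its alert level."""
    if level == AlertLevel.IMMEDIATE:
        return f"PAGE: {description}"
    if level == AlertLevel.NOTIFY:
        return f"NOTIFY: {description}"
    return f"LOG: {description}"
```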

The solution provides an interactive processing and monitoring interface through a Streamlit application. With the Streamlit UI, users can provide instructions and interact with the agent.

The application's key features include configuring monitoring instructions, processing video sources, and chatting with the agent about observed events, as sketched below.
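The following is a minimal Streamlit sketch of such an interface; the get_agent_response helper is a hypothetical stand-in for the agent invocation, and the widget layout is illustrative:

```
import streamlit as st

def get_agent_response(prompt: str, instructions: str) -> str:
    # Hypothetical stand-in; the real app forwards the request
    # to the deployed Amazon Bedrock agent.
    return f"(agent response to {prompt!r} under instructions {instructions!r})"

st.title("Video Monitoring Agent")

# Monitoring instructions steer what the agent treats as an event of interest.
instructions = st.text_area(
    "Monitoring instructions",
    "Alert me when a package is removed from the porch.",
)

# Chat interface for asking the agent about observed events.
if prompt := st.chat_input("Ask about recent events"):
    st.chat_message("user").write(prompt)
    answer = get_agent_response(prompt, instructions)
    st.chat_message("assistant").write(answer)
```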

Video processing pipeline

The solution uses several AWS services to capture and prepare video data for analysis through a modular processing pipeline. The solution supports multiple types of video sources, including live video streams and local video files.

When using streams, OpenCV's VideoCapture component handles the connection and frame extraction. For testing, we've included sample event videos demonstrating different scenarios. The core of the video processing is a modular pipeline implemented in Python whose key components include motion detection, frame sampling, grid aggregation, and storage and notification processors.

This multi-process framework optimizes performance by running components concurrently and maintaining a queue of frames to process. The video processing pipeline organizes captured frame data in a structured way before passing it to the Amazon Bedrock agent for analysis.
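The concurrency pattern can be sketched with Python's multiprocessing primitives. This is a minimal producer/consumer illustration, not the pipeline's actual internals:

```
import multiprocessing as mp

def capture(frames: mp.Queue) -> None:
    """Producer: read frames from the source and enqueue them."""
    for i in range(100):   # stand-in for a VideoCapture read loop
        frames.put(("frame", i))
    frames.put(None)       # sentinel: no more frames

def analyze(frames: mp.Queue) -> None:
    """Consumer: run the processing chain on each queued frame."""
    while (item := frames.get()) is not None:
        _, index = item
        print(f"processing frame {index}")  # chain.process(...) would run here

if __name__ == "__main__":
    queue = mp.Queue(maxsize=64)  # bounded queue applies backpressure
    procs = [mp.Process(target=capture, args=(queue,)),
             mp.Process(target=analyze, args=(queue,))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```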

The following figure is an example of an image grid sent to the agent. Package theft is difficult to identify with classic image models. The large language model's (LLM's) ability to reason over a sequence of images allows it to make observations about intent.

The video processing pipeline’s output—timestamped frame grids stored in Amazon S3—serves as the input for the Amazon Bedrock agent components, which we discuss in the next section.

Amazon Bedrock agent components

The solution integrates multiple Amazon Bedrock services to create an intelligent analysis system.

For structured data queries, the system uses the FM's ability to convert natural language to SQL. This enables users to query structured data conversationally, without writing SQL themselves.

These components work together to create a comprehensive system that can analyze video content, maintain event history, and support both real-time alerting and retrospective analysis through natural language interaction.

Video processing framework

The video processing framework implements a multi-process architecture for handling video streams through composable processing chains.

Modular pipeline architecture

The framework uses a composition-based approach built around the FrameProcessor abstract base class.

Processing components implement a consistent interface with a process(frame) method that takes a Frame and returns a potentially modified Frame:

```
from abc import ABC, abstractmethod
from typing import Optional

class FrameProcessor(ABC):
    @abstractmethod
    def process(self, frame: Frame) -> Optional[Frame]: ...
```

The Frame class encapsulates the image data along with timestamps, indexes, and extensible metadata:

```
from dataclasses import dataclass, field
from numpy import ndarray

@dataclass
class Frame:
    buffer: ndarray  # OpenCV image array
    timestamp: float
    index: float
    fps: float
    metadata: dict = field(default_factory=dict)
```

Customizable processing chains

The architecture supports configuring multiple processing chains that can be connected in sequence. The solution uses two primary chains. The detection and analysis chain processes incoming video frames to identify events of interest:

```
from datetime import timedelta

chain = FrameProcessorChain([
    SimpleMotionDetection(motion_threshold=10_000, frame_skip_size=1),
    FrameSampling(timedelta(milliseconds=250), threshold_time=timedelta(seconds=2)),
    GridAggregator(shape=(13, 3)),
])
```

The storage and notification chain handles the storage of identified events and invocation of the agent:

```
storage_chain = FrameProcessorChain([
    S3Storage(bucket_name=TARGET_S3_BUCKET, prefix=S3_PREFIX,
              s3_client_provider=s3_client_provider),
    LambdaProcessor(get_response=get_response,
                    monitoring_instructions=config.monitoring_instructions),
])
```

You can modify these chains independently to add or replace components based on specific monitoring requirements.
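To see how the two chains fit together, here is a hedged sketch of a read loop that drives them, assuming FrameProcessorChain exposes the same process(frame) interface as its components; the loop itself is illustrative:

```
import time
import cv2

cap = cv2.VideoCapture("sample_event.mp4")  # or a live stream URL
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0     # some sources report 0; fall back
index = 0

while True:
    ok, buffer = cap.read()
    if not ok:
        break
    frame = Frame(buffer=buffer, timestamp=time.time(), index=index, fps=fps)
    index += 1
    # The detection chain emits a frame grid only when an event completes.
    grid = chain.process(frame)
    if grid is not None:
        storage_chain.process(grid)  # persist to S3 and notify the agent

cap.release()
```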

Component implementation

The solution includes several processing components that demonstrate the framework's capabilities. You can modify each processing step or add new ones. For motion detection, we use a simple pixel difference, but you can refine this as needed or follow the same format to implement other detection algorithms, such as object detection or scene segmentation.
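A hedged sketch of what such a pixel-difference processor could look like under the FrameProcessor interface shown earlier (the repository's actual implementation may differ):

```
import cv2
from typing import Optional

class SimpleMotionDetection(FrameProcessor):
    """Pass frames through only when enough pixels changed since the last one."""

    def __init__(self, motion_threshold: int = 10_000, frame_skip_size: int = 1):
        self.motion_threshold = motion_threshold
        self.frame_skip_size = frame_skip_size
        self._prev = None

    def process(self, frame: Frame) -> Optional[Frame]:
        if int(frame.index) % self.frame_skip_size != 0:
            return None  # skip frames to reduce processing load
        gray = cv2.cvtColor(frame.buffer, cv2.COLOR_BGR2GRAY)
        prev, self._prev = self._prev, gray
        if prev is None:
            return None  # need two frames before motion can be measured
        diff = cv2.absdiff(prev, gray)
        _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
        changed = cv2.countNonZero(mask)
        if changed < self.motion_threshold:
            return None  # drop the frame: not enough pixels changed
        frame.metadata["motion_pixels"] = int(changed)
        return frame
```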

Additional components include the FrameSampling processor to control capture timing, the GridAggregator to create visual frame grids, and storage processors that save event data and trigger agent analysis. Each can be customized or replaced as needed; the sketch that follows shows one possible shape for the grid aggregation step.
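This is a minimal illustration, assuming frames of equal size and the Frame and FrameProcessor definitions above; buffering and padding details are simplified:

```
import numpy as np
from typing import Optional

class GridAggregator(FrameProcessor):
    """Buffer frames and emit one tiled grid image when the grid is full."""

    def __init__(self, shape: tuple = (13, 3)):
        self.rows, self.cols = shape
        self._buffer = []

    def process(self, frame: Frame) -> Optional[Frame]:
        self._buffer.append(frame.buffer)
        if len(self._buffer) < self.rows * self.cols:
            return None  # keep buffering until the grid is full
        rows = [np.hstack(self._buffer[r * self.cols:(r + 1) * self.cols])
                for r in range(self.rows)]
        grid = np.vstack(rows)
        self._buffer = []
        return Frame(buffer=grid, timestamp=frame.timestamp,
                     index=frame.index, fps=frame.fps,
                     metadata={**frame.metadata, "grid": True})
```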

Finally, the LambdaProcessor serves as the bridge to the Amazon Bedrock agent by invoking an AWS Lambda function that sends the event information in a request to the deployed agent. From there, the Amazon Bedrock agent takes over, analyzes the event, and takes action accordingly.

Agent implementation

After you deploy the solution, an Amazon Bedrock agent alias becomes available. This agent functions as an intelligent analysis layer, processing captured video events and executing appropriate actions based on its analysis. You can test the agent and view its reasoning trace directly on the Amazon Bedrock console, as shown in the following screenshot.

This agent will lack some of the metadata supplied by the Streamlit application (such as current time) and might not give the same answers as the full application.

Invocation flow

The agent is invoked through a Lambda function that handles the request-response cycle and manages session state. The function finds the highest published agent version, uses it to invoke the agent, and parses the response.
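A hedged sketch of that invocation pattern with boto3 (the agent and alias IDs are placeholders, and the repository's Lambda function additionally resolves the highest published version rather than hardcoding an alias):

```
import boto3

AGENT_ID = "YOUR_AGENT_ID"         # placeholder
AGENT_ALIAS_ID = "YOUR_ALIAS_ID"   # placeholder: the deployed alias

runtime = boto3.client("bedrock-agent-runtime")

def ask_agent(prompt: str, session_id: str) -> str:
    """Invoke the Bedrock agent and assemble its streamed response."""
    response = runtime.invoke_agent(
        agentId=AGENT_ID,
        agentAliasId=AGENT_ALIAS_ID,
        sessionId=session_id,        # reuse the ID to keep session state
        inputText=prompt,
    )
    chunks = []
    for event in response["completion"]:  # the completion is an event stream
        if "chunk" in event:
            chunks.append(event["chunk"]["bytes"].decode("utf-8"))
    return "".join(chunks)
```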

Action groups

The agent’s capabilities are defined through action groups implemented using the BedrockAgentResolver framework. This approach automatically generates the OpenAPI schema required by the agent.

When the agent is invoked, it receives an event object that includes the API path and other parameters that inform the agent framework how to route the request to the appropriate handler. You can add new actions by defining additional endpoint handlers following the same pattern and generating a new OpenAPI schema:

```
if __name__ == "__main__":
    print(app.get_openapi_json_schema())
```
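For illustration, a self-contained new action might look like the following; the endpoint path and handler logic are hypothetical, and only the BedrockAgentResolver usage reflects the framework named above:

```
from time import gmtime, strftime
from aws_lambda_powertools.event_handler import BedrockAgentResolver

app = BedrockAgentResolver()

@app.get("/current_utc_time", description="Returns the current UTC time")
def current_utc_time() -> str:
    # The agent can call this action when a request needs the current time.
    return strftime("%Y-%m-%dT%H:%M:%SZ", gmtime())

def lambda_handler(event: dict, context) -> dict:
    # Routes the agent's request to the matching endpoint handler.
    return app.resolve(event, context)

if __name__ == "__main__":
    # Regenerate the OpenAPI schema after adding actions.
    print(app.get_openapi_json_schema())
```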

Text-to-SQL integration

Through its action group, the agent is able to translate natural language queries into SQL for structured data analysis. The system reads data from assets/data_query_data_source, which can include various formats like CSV, JSON, ORC, or Parquet.

This capability enables users to query structured data using natural language. As demonstrated in the following example, the system translates natural language queries about vehicles into SQL, returning structured information from the database.

The database connection is configured through a SQLAlchemy engine. Users can connect to existing databases by updating the create_sql_engine() function to use their connection parameters.
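A minimal sketch of what such a function might look like (the connection URL is a placeholder, and the example query assumes a hypothetical events table):

```
from sqlalchemy import create_engine, text
from sqlalchemy.engine import Engine

def create_sql_engine() -> Engine:
    # Replace the URL with your own database, e.g. PostgreSQL:
    # "postgresql+psycopg2://user:password@host:5432/dbname"
    return create_engine("sqlite:///events.db")

# Example: run a query produced by the text-to-SQL step
# (assumes an events table exists in the target database).
engine = create_sql_engine()
with engine.connect() as conn:
    rows = conn.execute(text("SELECT COUNT(*) FROM events")).fetchall()
```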

Event memory and semantic search

The agent maintains a detailed memory of past events, storing event logs with rich descriptions in Amazon S3. These events become searchable through both vector-based semantic search and date-based filtering. As shown in the following example, temporal queries make it possible to retrieve information about events within specific time periods, such as vehicles observed in the past 72 hours.

The system’s semantic memory capabilities enable queries based on abstract concepts and natural language descriptions. As shown in the following example, the agent can understand abstract concepts like “funny” and retrieve relevant events, such as a person dropping a birthday cake.
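Under the hood, this kind of semantic lookup can be expressed with the Amazon Bedrock Knowledge Bases Retrieve API. The following is a minimal sketch; the knowledge base ID is a placeholder, and the solution's agent performs this retrieval through its action groups rather than directly:

```
import boto3

runtime = boto3.client("bedrock-agent-runtime")

def search_events(query: str, kb_id: str = "YOUR_KB_ID") -> list:
    """Semantic search over stored event descriptions."""
    response = runtime.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
    )
    return [r["content"]["text"] for r in response["retrievalResults"]]

# Example: abstract-concept query against the event memory.
for match in search_events("funny events involving a cake"):
    print(match)
```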

Events can be linked together by the agent to identify patterns or related incidents. For example, the system can correlate separate sightings of individuals with similar characteristics. In the following screenshots, the agent connects related incidents by identifying common attributes like clothing items across different events.

This event memory store allows the system to build knowledge over time, providing increasingly valuable insights as it accumulates data. The combination of structured database querying and semantic search across event descriptions creates an agent with a searchable memory of all past events.

Prerequisites

Before you deploy the solution, complete the following prerequisites:

1. Configure AWS credentials using aws configure.
2. Use either the us-west-2 or us-east-1 AWS Region.
3. Enable access to Anthropic's Claude 3.x models, or another supported Amazon Bedrock Agents model you want to use.
4. Make sure you have the required dependencies installed (Python 3, the AWS CDK, and the packages in requirements.txt).

Deploy the solution

The AWS CDK deployment creates the required resources, including the Amazon Bedrock agent, the supporting Lambda functions, and the S3 storage used for event data.

Deploy the solution with the following commands:

```
# 1. Clone the repository and navigate to the folder
git clone https://github.com/aws-samples/sample-video-monitoring-agent.git && cd sample-video-monitoring-agent

# 2. Set up the environment and install dependencies
python3 -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt

# 3. Deploy AWS resources
cdk bootstrap && cdk deploy

# 4. Run the Streamlit app
cd code/streamlit_app && streamlit run app.py
```

On Windows, replace the command in step 2 with the following code:

```
python -m venv .venv && .venv\Scripts\activate.bat && pip install -r requirements.txt
```

Clean up

To destroy the resources you created and stop incurring charges, run the following command:

```cdk destroy```

Future enhancements

The current implementation demonstrates the potential of agent-based video monitoring in a home security setting, but there are many potential applications.

Sample use cases

The following examples show the solution applied to various scenarios, along with the agent's JSON assessment of each event.

Small business

```
{
  "alert_level": 0,
  "timestamp": "2024-11-20T15:24:15Z",
  "reason": "Vehicle arrival in driveway",
  "description": "Standard vehicle arrival and parking sequence. Vehicles present: Black Nissan Frontier pickup (parked), silver Honda CR-V (arriving), and partial view of blue vehicle in foreground. Area features: Gravel driveway surface, two waste bins (County Waste and recycling), evergreen trees in background. Sequence shows Honda CR-V executing normal parking maneuver: approaches from east, performs standard three-point turn, achieves final position next to pickup truck. Daytime conditions, clear visibility. Vehicle condition: Clean, well-maintained CR-V appears to be 2012-2016 model year, no visible damage or unusual modifications. Movement pattern indicates familiar driver performing routine parking. No suspicious behavior or safety concerns observed. Timestamp indicates standard afternoon arrival time. Waste bins properly positioned and undisturbed during parking maneuver."
}
```

Industrial

```
{
  "alert_level": 2,
  "timestamp": "2024-11-20T15:24:15Z",
  "reason": "Warehouse product spill/safety hazard",
  "description": "Significant product spill incident in warehouse storage aisle. Location: Main warehouse aisle between high-bay racking systems containing boxed inventory. Sequence shows what appears to be liquid or container spill, likely water/beverage products based on blue colored containers visible. Infrastructure: Professional warehouse setup with multi-level blue metal racking, concrete flooring, overhead lighting. Incident progression: Initial frames show clean aisle, followed by product falling/tumbling, resulting in widespread dispersal of items across aisle floor. Hazard assessment: Creates immediate slip/trip hazard, blocks emergency egress path, potential damage to inventory. Area impact: Approximately 15-20 feet of aisle space affected. Facility type appears to be distribution center or storage warehouse. Multiple cardboard boxes visible on surrounding shelves potentially at risk from liquid damage."
}
```

Backyard

```
{
  "alert_level": 1,
  "timestamp": "2024-11-20T15:24:15Z",
  "reason": "Wildlife detected on property",
  "description": "Adult raccoon observed investigating porch/deck area with white railings. Night vision/IR camera provides clear footage of animal. Subject animal characteristics: medium-sized adult raccoon, distinctive facial markings clearly visible, healthy coat condition, normal movement patterns. Sequence shows animal approaching camera (15:42PM), investigating area near railing (15:43-15:44PM), with close facial examination (15:45PM). Final frame shows partial view as animal moves away. Environment: Location appears to be elevated deck/porch with white painted wooden railings and balusters. Lighting conditions: Nighttime, camera operating in infrared/night vision mode providing clear black and white footage. Animal behavior appears to be normal nocturnal exploration, no signs of aggression or disease."
}
```

Home safety

```
{
  "alert_level": 2,
  "timestamp": "2024-11-20T15:24:15Z",
  "reason": "Smoke/possible fire detected",
  "description": "Rapid development of white/grey smoke visible in living room area. Smoke appears to be originating from left side of frame, possibly near electronics/TV area. Room features: red/salmon colored walls, grey couch, illuminated aquarium, table lamps, framed artwork. Sequence shows progressive smoke accumulation over 4-second span (15:42PM – 15:46PM). Notable smoke density increase in upper left corner of frame with potential light diffusion indicating particulate matter in air. Smoke pattern suggests active fire development rather than residual smoke. Blue light from aquarium remains visible throughout sequence providing contrast reference for smoke density."
}
```

Further extensions

In addition, you can extend the FM capabilities, expand the agent's ability to take action, enhance the event memory system, and extend the monitoring capabilities beyond fixed cameras.

These enhancements can transform the system from a passive monitoring tool into an active participant in security operations, with increasingly sophisticated understanding of normal patterns and anomalous events.

Conclusion

The approach of using agents as escalators represents a significant advancement in video monitoring, combining the contextual understanding capabilities of FMs with the action-oriented framework of Amazon Bedrock Agents. By filtering the signal from the noise, this solution addresses the critical problem of alert fatigue while enhancing security and safety monitoring capabilities. With this solution, you can filter routine activity, escalate the events that matter, and search past events using natural language.

The combination of intelligent screening, graduated responses, and semantic memory enables a more effective and efficient monitoring system that enhances human capabilities rather than replacing them. Try the solution today and experience how Amazon Bedrock Agents can transform your video monitoring capabilities from simple motion detection to intelligent scene understanding.


About the authors

Kiowa Jackson is a Senior Machine Learning Engineer at AWS ProServe, specializing in computer vision and agentic systems for industrial applications. His work bridges classical machine learning approaches with generative AI to enhance industrial automation capabilities. His past work includes collaborations with Amazon Robotics, NFL, and Koch Georgia Pacific.

Piotr Chotkowski is a Senior Cloud Application Architect at AWS Generative AI Innovation Center. He has experience in hands-on software engineering as well as software architecture design. In his role at AWS, he helps customers design and build production grade generative AI applications in the cloud.
