Issue streaming response from Bedrock agent


I would like to stream a response from a Bedrock agent to the user. I'm working in Python with boto3 and AgentsforBedrockRuntime. Under the hood, agents use the InvokeAgent API, which is not built for streaming. Are there any temporary workarounds for this? Is the Bedrock team considering implementing this in the near future? Is there a way to see the roadmap for Bedrock?
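
For context, the basic pattern with boto3 looks something like this (a rough sketch; the agent/alias IDs and session ID are placeholders). The "completion" field is an event stream, but in practice the whole answer arrives in one or a few chunks at the end rather than token by token:

```python
import boto3

# Sketch of the current (non-streaming) pattern; IDs are placeholders.
client = boto3.client("bedrock-agent-runtime")

response = client.invoke_agent(
    agentId="AGENT_ID",
    agentAliasId="AGENT_ALIAS_ID",
    sessionId="my-session",
    inputText="Hello, agent!",
)

# "completion" is an EventStream, but the full answer only shows up once the
# agent has finished all of its intermediate steps.
completion = ""
for event in response["completion"]:
    chunk = event.get("chunk")
    if chunk and "bytes" in chunk:
        completion += chunk["bytes"].decode("utf-8")

print(completion)
```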

I think this post (not mine) articulates the issue well.

Thanks in advance!

Evan
asked a month ago · 254 views
2 Answers

You could consider using AWS Lambda's response payload streaming feature, which allows functions to progressively stream response payloads back to clients. This can be particularly useful when working with AI models that support streaming. If you're working in Python, you might need to create a custom runtime for AWS Lambda, as response streaming is not directly supported by the managed Python runtime.

Here is the documentation for the API: https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent-runtime_InvokeAgent.html
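
A rough sketch of what that proxy could look like (placeholder IDs; you'd still need a streaming-capable setup, such as a custom runtime or something like the awslabs Lambda Web Adapter, to actually flush chunks out of Lambda as they arrive):

```python
import boto3

# Sketch only: a generator that forwards agent output chunks as they arrive.
# The managed Python runtime does not support response streaming natively,
# so wiring this into an actual streamed Lambda response is left to your setup.
bedrock_agent = boto3.client("bedrock-agent-runtime")

def agent_chunks(prompt: str, session_id: str):
    response = bedrock_agent.invoke_agent(
        agentId="AGENT_ID",             # placeholder
        agentAliasId="AGENT_ALIAS_ID",  # placeholder
        sessionId=session_id,
        inputText=prompt,
    )
    # Yield each chunk of the completion event stream as it is received.
    for event in response["completion"]:
        chunk = event.get("chunk")
        if chunk and "bytes" in chunk:
            yield chunk["bytes"].decode("utf-8")
```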

EXPERT
answered a month ago
  • I don't think adding a layer in front of Bedrock that receives the whole response and then streams it out would really help: it only adds latency if that layer doesn't already exist, and if a Lambda is already present in the architecture, the Bedrock call should account for the clear majority of the overall latency. I wouldn't expect the transmission from Lambda to the client to be significant.


An important thing to consider about the agent pattern is that the final response is typically the product of multiple LLM calls, often chained together so that the output of one is used as input to the next.

This substantially reduces the value of response streaming for agents compared with the plain LLM-calling ConverseStream and InvokeModelWithResponseStream APIs: only the final generation in the agent flow can be meaningfully streamed, so the client is still waiting with no output through the intermediate steps.
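
For comparison, here's a minimal ConverseStream sketch against the plain model API (the model ID is just an example), where token deltas really do arrive incrementally:

```python
import boto3

# Minimal streaming sketch using the plain Converse API (not an agent).
bedrock = boto3.client("bedrock-runtime")

response = bedrock.converse_stream(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
    messages=[{"role": "user", "content": [{"text": "Tell me a short story."}]}],
)

# Print text deltas as they arrive from the event stream.
for event in response["stream"]:
    delta = event.get("contentBlockDelta", {}).get("delta", {})
    if "text" in delta:
        print(delta["text"], end="", flush=True)
```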

I can't really comment on roadmap or timelines, but for potential alternatives with the current API I'd suggest maybe:

  • Testing faster models, or optimizing or removing prompt steps in the agent, to reduce response latency subject to your quality requirements (an automated testing framework like the AWSLabs agent-evaluation project might help you validate these optimizations against a suite of example conversations)
  • Making more basic, UI-side changes to your application to reassure users that the model is working on a response: a typing/thinking bubble, a progress wheel, disabling the send button, and so on

Again, even if/when a streaming feature becomes available in this API, I'd avoid assuming it will be a massive change in perceived latency for your users, unless your agent often outputs very long messages, in which case streaming even the final generation in the chain would help.

AWS
EXPERT
Alex_T
answered a month ago