You could consider using AWS Lambda's response payload streaming feature, which allows functions to progressively stream response payloads back to clients. This can be particularly useful when working with AI models that support streaming. If you're working with Python, you may need to create a custom runtime for AWS Lambda, as response streaming is not natively supported by the managed Python runtime.
Here is the documentation for the API: https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent-runtime_InvokeAgent.html
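For reference, the `completion` field of the InvokeAgent response is an event stream of chunk events. A minimal sketch of consuming it with boto3 might look like the following (the agent/alias/session IDs are placeholders, and a call requires valid AWS credentials and a deployed agent):

```python
def join_chunks(events):
    """Concatenate the decoded 'bytes' of each chunk event into the full
    response text, skipping non-chunk events such as trace events."""
    parts = []
    for event in events:
        chunk = event.get("chunk")
        if chunk and "bytes" in chunk:
            parts.append(chunk["bytes"].decode("utf-8"))
    return "".join(parts)

def ask_agent(agent_id, alias_id, session_id, text):
    """Sketch: invoke a Bedrock agent and collect its streamed completion."""
    import boto3  # requires AWS credentials configured in the environment

    client = boto3.client("bedrock-agent-runtime")
    response = client.invoke_agent(
        agentId=agent_id,          # placeholder ID
        agentAliasId=alias_id,     # placeholder ID
        sessionId=session_id,
        inputText=text,
    )
    # The completion arrives as an event stream, but as discussed below,
    # the chunks cover only the agent's final generation.
    return join_chunks(response["completion"])
```

Note that even though the response is delivered as an event stream, the text you receive corresponds to the agent's final answer rather than its intermediate steps.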
An important thing to consider about the agent pattern is that the final response is typically a product of multiple LLM calls, often chained together so that the output of one is used in the input to the next.
This substantially reduces the value of response streaming for agents compared with the plain LLM-calling ConverseStream and InvokeModelWithResponseStream APIs: only the last generation in the agent flow can be meaningfully streamed, so the client still waits with no output through the intermediate steps.
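To illustrate the point with a toy simulation (not the Bedrock API): if an agent chains several non-streaming LLM calls and only the final one streams, the client sees nothing until every intermediate step has fully finished. The step functions below are stand-ins for real model calls:

```python
def run_agent(intermediate_steps, final_step):
    """Toy agent loop: each intermediate 'LLM call' must complete in full
    (nothing reaches the client), then only the final generation is
    yielded incrementally."""
    context = ""
    for step in intermediate_steps:
        # Intermediate calls: output feeds the next prompt, nothing streams.
        context = step(context)
    # Only the last generation in the chain can be streamed to the client.
    for token in final_step(context):
        yield token

# Stand-in "model calls" for the sketch:
plan = lambda ctx: ctx + "[plan]"            # step 1: produce a plan
lookup = lambda ctx: ctx + "[facts]"         # step 2: gather facts
answer = lambda ctx: iter(["The", "final", "answer"])  # step 3: streamed reply

tokens = list(run_agent([plan, lookup], answer))
# Every token arrives only after plan() and lookup() have both returned.
```

The first streamed token cannot arrive any earlier than the sum of the intermediate steps' latencies, which is why streaming helps agents far less than single-call APIs.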
I can't really comment on roadmap or timelines, but for potential alternatives with the current API I'd suggest maybe:
- Testing faster models, or optimizing or removing prompt steps in the agent, to try to reduce response latency subject to your quality requirements (an automated testing framework like AWSLabs agent-evaluation might help you validate these optimizations against a suite of example conversations)
- Making more basic, UI-side changes to your application to reassure users that the model is working on a response, such as a typing/thinking bubble, a progress wheel, disabling the send button, etc.
Again, even if/when a streaming feature becomes available in this API, I'd avoid assuming it will be a massive change in perceived latency for your users - unless your agent often outputs very long messages, where even streaming just the final generation in the chain would help.
I don't think adding a layer in front of Bedrock that receives the whole response and then streams it out would really help: it only adds latency if the layer doesn't already exist, and if a Lambda is already present in the architecture, Bedrock should account for the clear majority of the overall latency. I wouldn't expect the transmission from Lambda to the client to be significant.