♊️ Pairwise comparisons - our newest evaluation feature, and one I'm super excited to unveil

This is NOT the same as typical comparison views

✨ This is native pairwise comparison functionality - brand new, pretty unique, and pretty powerful

❓ How is it different, and why am I so excited about it?

1⃣ The previous way we had for comparing runs would evaluate each run individually, and then let you view the results side by side. For example, if you were evaluating summarization, you'd create an evaluator to score a summary on a scale of 1-10, and then compare the results

2⃣ This works well for a lot of use cases. That's why we implemented it first, and why most (all?) other eval tools focus on it

3⃣ However, it does have some downsides. It's often hard to come up with a prompt that reliably scores a summary on a scale of 1-10. This task is hard even for humans!

4⃣ It's often easier not to score an LLM generation in a vacuum, but rather to compare it against another generation and say whether it's better or worse

⭐️ There's a reason the most trusted LLM eval (the LMSYS Chatbot Arena) does this type of "pairwise comparison" - it's easier and more intuitive ⭐️

5⃣ We've taken this idea and added native support for "pairwise comparisons" in LangSmith. Specify two runs, as well as a comparison metric. We show how to do this using LLM-as-a-judge (with prompts from the LMSYS Chatbot Arena) - see the sketch below for the core idea

Check out the blog and Lance Martin's video in the tweet below for more information and a great walkthrough of how to use it
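To make the core idea concrete, here's a minimal sketch of a pairwise LLM-as-a-judge in Python. It is illustrative, not LangSmith's actual implementation: the judge model name is an assumption, and the prompt loosely follows the "[[A]]"/"[[B]]"/"[[C]]" verdict format popularized by the MT-Bench / Chatbot Arena judge prompts.

```python
# Minimal sketch of a pairwise LLM-as-a-judge: instead of scoring one
# answer on a 1-10 scale, the judge sees two candidate answers to the
# same question and picks the better one (or declares a tie).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Prompt loosely adapted from the MT-Bench / Chatbot Arena judge format.
JUDGE_PROMPT = """Please act as an impartial judge and evaluate the two
responses below to the user question. Choose the response that better
answers the question. Output your final verdict strictly as "[[A]]" if
assistant A is better, "[[B]]" if assistant B is better, or "[[C]]" for a tie.

[User Question]
{question}

[Assistant A's Answer]
{answer_a}

[Assistant B's Answer]
{answer_b}
"""

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie" according to the LLM judge."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any strong judge model works here
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,
    )
    verdict = response.choices[0].message.content
    if "[[A]]" in verdict:
        return "A"
    if "[[B]]" in verdict:
        return "B"
    return "tie"
```

Note the design choice: the judge never assigns an absolute score, only a relative preference, which is exactly what makes the task easier to prompt for.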
🍐 Pairwise Evaluation in LangSmith For LLM use cases like text generation or chat (where there may not be a single "correct" answer), picking a preferred response with pairwise evaluation can be an effective approach. LangSmith’s pairwise evaluation lets you (1) define a custom pairwise LLM-as-judge evaluator with any desired criteria and (2) compare two LLM generations using this evaluator. Dive into our latest blog & video tutorial to learn about pairwise evaluation and walk through an example of how to use custom pairwise evaluators in LangSmith. ✍ Read our blog post: https://lnkd.in/gdCZHxQp 📽 Watch the video: https://lnkd.in/gAz_dKZg 📄 Check out the docs: https://lnkd.in/gv4z2zEZ
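For reference, here's a hedged sketch of how a custom pairwise evaluator plugs into LangSmith's `evaluate_comparative` entry point, per the docs linked above. It reuses the hypothetical `judge_pair` helper from the earlier sketch; the experiment names and the dataset/output keys are assumptions, and exact signatures may vary across SDK versions.

```python
# Sketch: compare two existing experiments over the same dataset with a
# custom pairwise evaluator. Winner gets 1, loser 0, ties split 0.5/0.5.
from langsmith.evaluation import evaluate_comparative
from langsmith.schemas import Example, Run

def ranked_preference(runs: list[Run], example: Example) -> dict:
    """Pairwise evaluator: receives the two runs for one dataset example."""
    question = example.inputs["question"]    # assumed dataset input key
    answer_a = runs[0].outputs["output"]     # assumed run output key
    answer_b = runs[1].outputs["output"]
    verdict = judge_pair(question, answer_a, answer_b)  # judge from the sketch above
    if verdict == "A":
        scores = {runs[0].id: 1, runs[1].id: 0}
    elif verdict == "B":
        scores = {runs[0].id: 0, runs[1].id: 1}
    else:
        scores = {runs[0].id: 0.5, runs[1].id: 0.5}
    return {"key": "ranked_preference", "scores": scores}

# Hypothetical experiment names for two runs over the same dataset.
evaluate_comparative(
    ["my-experiment-a", "my-experiment-b"],
    evaluators=[ranked_preference],
)
```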
A great example of how LangChain stays at the forefront!
impressive!
cc: Liz Darnell. Harrison is a good follow for intuition on LLM application development. If you are ever in the SF Bay Area and he is giving a talk, I recommend it.