VG-SC-final - 7/7/2025, 7:29 pm === samuel: [00:00:00] There's a lot of AI that seems great when you build a demo, but actually the amount of engineering required to go from 85% complete to a hundred percent complete is many multiples of what it took you to get to 85%. And that's where evals come in: you cannot systematically improve systems without knowing how they're performing. hugo: That was Samuel Colvin, creator of Pydantic and founder of the observability platform Logfire. Samuel is one of the rare voices bringing real software engineering discipline back into AI product and software development. In this conversation, Samuel and I tackle the wall every LLM team eventually hits: the demo dazzles, but real users still break the model. Samuel explains a surprisingly lightweight fix, human-seeded evaluations, where a handful of thumbs-up or thumbs-down labels let him and Logfire build a live rubric that scores every call and flags for you the moment quality drifts. He even shows a prototype agent that rewrites its own system prompt and raises its hand when it's missing context. We dig into drift, vague queries that topple naive RAG pipelines, [00:01:00] and why tying evaluation to business goals beats chasing raw accuracy. If you are tired of hoping the next, bigger model will save you, this episode is your roadmap to reliability. Now, a quick heads up before we dive in: the next run of my Building LLM Applications for Data Scientists and Software Engineers course is coming up, and students now receive over $1,500 in credits, including 500 bucks from Logfire, plus perks from Modal, Gemini, and Hugging Face. We're starting soon, so sign up today; details are in the show notes. I'm Hugo Bowne-Anderson, and this is Vanishing Gradients. [00:02:00] Samuel, maybe to get started, I'm gonna quote you to you, from your website: you started working on Pydantic out of a mixture of frustration that type hints do nothing at runtime and curiosity as to whether they could be used to validate data. Turns out you were right, and/or lucky, probably both, with Pydantic's crazy growth. Just quickly, I don't just wanna say congratulations on all the well-known success Pydantic has had; speaking of helping people bring software engineering principles into Python and data-powered (and other) workflows, congrats on that, man. samuel: Thank you very much. Yeah. If you forget the history of Python, and indeed JavaScript, where type hints are kind of similar, and you go back to first principles, it is kind of weird that we have these things that hang around in our code [00:03:00] that allow us to define what types data should be, but at runtime you can do whatever you like. It's awfully convenient in some contexts, and even the best Rust developers around can do a lot of things quicker in Python than they can in Rust. But still, it's weird that type hints do nothing. And that weirdness remains nearly 10 years on from starting Pydantic. hugo: Absolutely. And also the amazing success and support Pydantic has had in LLM-powered applications now, right? samuel: What's weird, and one of the unforgivable things I've done as a founder, is that for the whole of 2023 and most of 2024 we were building general observability. We weren't working to get Pydantic used with LLMs.
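To make the "type hints doing something at runtime" point concrete, here is a minimal sketch of the kind of validation Pydantic does on data handed to you at runtime, for example JSON an LLM returned; the Invoice model and the payloads are invented for illustration.

```python
# A minimal sketch, not from the episode: Pydantic reads the type hints on this
# class and enforces them at runtime, e.g. on JSON an LLM (or any API) returned.
from pydantic import BaseModel, ValidationError


class Invoice(BaseModel):
    customer: str
    total: float
    paid: bool


# Valid payload: the string "42.50" is coerced to the float 42.5.
invoice = Invoice.model_validate_json('{"customer": "Acme", "total": "42.50", "paid": false}')
print(invoice)

# Invalid payload: missing fields raise a ValidationError that says exactly what is wrong.
try:
    Invoice.model_validate_json('{"customer": "Acme"}')
except ValidationError as err:
    print(err)
```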
We were just unbelievably lucky that a bunch of people, Jason being one of them, OpenAI, LangChain, others, went and picked up Pydantic and started using it as the way to validate structured outputs. I mean, lots of startups would die for that amount of adoption and would have worked really hard for it. We got it by accident while we were doing other things. hugo: Yeah, well, by accident, and also, as it so [00:04:00] happens, Pydantic's BaseModel is a really good abstraction for us to use. And hopefully we get more into this in this conversation, but we have these new, incredible, wonderfully horizontal technologies in foundation models, and we are figuring out what's happening at the application layer. That's what I do in my consulting and in my course, and alongside that, tool builders such as yourself are trying to figure out what abstraction layers work for people who are building applications. And we're not there yet, and we shouldn't expect to be. I mean, we've talked about this before, right? React came a decade or 15 years later. Yes, absolutely, exactly. Well, it's heartening to know we have the same line. So I'm interested: what pulled you into thinking so deeply about LLM-powered systems and evaluation? samuel: At the end of last year, we'd been building Logfire for about a year and a half, and we started trying to implement AI functionality within Logfire, and I think we suddenly realized how poor the building blocks were for actually going and using LLMs within your application. The analogy I use is that it's a bit [00:05:00] like the 17th century in science, when you had all these mad aristocratic scientists who went and built tools because they wanted to do science. That is what, in my opinion, lots of the agent frameworks and LLM libraries feel like: these people are no doubt experts in AI; they're not experts in building tools, in building software libraries. So I was not compelled by the state of LLM tooling, and I don't want to torture the metaphor too far and describe myself as the humble toolmaker who came along and built the right tool. I don't pretend to be a particular expert in the underlying maths and science of LLMs, but I do, from running Pydantic and a bunch of other open source libraries, have a lot of expertise in what it's like to build infrastructural Python libraries, and our team at Pydantic has lots more experience beyond that. And so I think we thought there was an opportunity, and it was, if not the right time, then not the worst time in the world, to go and try and build an agent framework with decent engineering principles at its foundation, rather than following the latest hype cycle as quickly as possible at the [00:06:00] expense of good engineering. hugo: Yeah, that makes a lot of sense. Why are you thinking about evaluation so much? Why is that so important? samuel: I think that, one, we've been pulled by our community: people want evals in their systems. There's definitely value in actually going in and setting up evals properly, but there's also value in evals as a mechanism to get people to go and actually dig in and see what's going on within their application, and actually read what the inputs are and what the outputs are, and work out what happened. It's all too easy.
Even prior to AI, all of us are digital natives who've grown up in a world where you assume computers are right when they do things, and I think we are all guilty of looking at what an LLM does and being like, oh, that looks about right, and not actually digging into it. And I think half the time, literally going and reading what it's done and working out why it might have done that is valuable. Part of the value of evals is just in being a mechanism to get people to go and do that. I know you talk about going and doing it in a spreadsheet; you don't need some fancy tool. But that is basically another way of saying: actually read what the LLM is spitting out and try to work out where it's making mistakes, and then you will identify where the [00:07:00] mistakes come from. So of course there is value in structuring that and in having nice mechanisms for testing how the LLMs are performing, but part of it is just pushing people to go and actually look at how they're behaving. hugo: Absolutely. And I mean, we do call it evals, but sometimes I like to call it making sure your product does what you think it does. But I do agree: we are in a space where there's been, I call it, proof-of-concept purgatory. There's the plateau of productivity, a whole variety of framings of going from prototype to production, all of these things. But this is a space where prototypes, and demos as well, are so straightforward, where you can show a nice, even mind-blowing, demo in a lot of ways, that it makes sense that's something that would occupy the space for some time. And then when we start trying to incorporate this into software, we start figuring out how to make sure these things work. And I do totally understand; with demos, I'm not as cynical as I may sound. I do think we're exploring the art of the possible as a community. [00:08:00] I think there's a whole bunch of marketing and kayfabe and VC money that goes into this crap as well, which we should be concerned about. But I do think the art of exploring the possible before doing evals is important. samuel: I agree with you, except that I think quite often it's tempting to be like, look, here's a sexy demo, therefore it's possible to do the end thing. And if you have an AI doctor which 95% of the time very quickly gets you a correct prognosis of what's wrong, but 5% of the time recommends you take up smoking, and therefore die sooner, and therefore don't get pancreatic cancer, then that's a doctor we can't use. And there's a lot of AI that seems great when you build a demo, but actually the amount of engineering required to go from 85% complete to a hundred percent complete is many multiples of what it took you to get to 85%. And that's where evals come in: you cannot systematically improve systems without knowing how they're performing. hugo: Yeah, absolutely. And to your point, what you mentioned earlier, and this is something we've discussed several times, is just the importance of looking at your [00:09:00] data. I mean, this is the Vanishing Gradients podcast, which I've always called a data podcast. We call it ML now, AI, but it's data-powered stuff, right? And I do think just getting people in the habit of looking at their data and the internals of their systems is incredibly useful.
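As a concrete version of the "read your traces in a spreadsheet" habit Samuel describes, a minimal sketch might look like the following; the traces.csv file, its columns, and the labels are made up for illustration.

```python
# A minimal sketch of spreadsheet-style error analysis: read each trace, label it
# good (+1) or bad (-1) with a one-line reason, then tally the reasons to see
# where failures cluster. File name and columns are illustrative.
import csv
from collections import Counter

annotations = []
with open("traces.csv") as f:  # assumed columns: prompt, response
    for row in csv.DictReader(f):
        print("\nPROMPT:\n", row["prompt"])
        print("RESPONSE:\n", row["response"])
        label = input("good (+1) or bad (-1)? ").strip()
        reason = input("one-line reason: ").strip()
        annotations.append({**row, "label": label, "reason": reason})

with open("annotated_traces.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(annotations[0].keys()))
    writer.writeheader()
    writer.writerows(annotations)

# The "pivot table": which failure reasons come up most often?
print(Counter(a["reason"] for a in annotations if a["label"] == "-1"))
```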
I also am concerned, and I think I've said this to you: in my consulting work and teaching, I tell people to maybe just look at 20 traces, like prompts and responses and some of the internals, just to see what's up, and give each one a plus one or minus one and a reason, right? Then you can start fixing things once you do that, after a pivot table, whatever it may be. But 50% of people say to me, oh, can I just get an agent to look at it for me? And firstly, I kind of dig that response in some ways, though I do think human learning before machine learning before agent learning is important. But my deeper concern is that these people aren't curious about what's happening in their system, and the incentives then skew, because the best data products, as far as I can tell, and in our community, have been built by [00:10:00] people who are obsessed with looking and seeing what's up and understanding it. These are, quote unquote, organic systems in the sense that they constantly incorporate the noisiness and stochasticity and entropy of the real world, right? So you do need to be deeply curious about your system to succeed, right? Yeah, absolutely. So, you've said evals are like flossing: important in theory, ignored in practice. I love that. Why do you think individuals and teams struggle with them? samuel: I think a lot of it is that... I mean, I remember back in, whatever it was, 2014, when I first started building web applications and writing tests, and I didn't know how to do it, and it seemed like a really hard task. And once I had done a bunch of it, sure, it's a lot of work, but at least I knew what I was trying to do. And I think evals are to everyone now what tests were to me in 2014, as in, there isn't a handle you can turn where you get useful evals out without any innovation, without any nuance or complexity. And again, the 85% thing without any evals is extremely tempting. And so [00:11:00] I think it's that mixture of the very high overhead of getting them to work and the apparently low return: if you're already 85% complete, if something's an eight and a half out of ten, is there any point in improving it? Sometimes no; sometimes eight and a half out of ten is perfectly satisfactory. Sometimes eight and a half out of ten is completely unacceptable, as I said in the doctor case. So I think there's a bit of that. I think also there's a sense that these systems are stochastic and I can't ever make them reliable, or I just have to wait for the model to get more intelligent, which is a very convenient sort of VC answer to the question. It doesn't work: if your job this year is to build a thing with an LLM, then "just wait for the models to get more intelligent" isn't actually a satisfactory answer, because even if it happens so fast that it's done by Christmas, that's no use if you need to get something out in the next few weeks. hugo: Absolutely. And I do agree, to a certain extent, that the development we're seeing in models is promising there, but in all honesty, when I've built RAG systems, for example, often when you [00:12:00] introspect into what the problems are after looking at your data, it's the OCR or the metadata extraction; it has nothing to do with the generative aspect. And a VC or someone really bullish on models might say, oh, wait for multimodal models to just do all of that for you. But you can't introspect into that.
If something's up, you don't actually get inside the black box. samuel: Right. If you're missing the key piece of information, there's no amount of intelligence that can solve it. And so one of the things that evals point out to you, or can identify for you, is when there is something fundamentally missing that you need to go and basically add to the context, in the simplest case, but sometimes it's not as simple as that. hugo: Absolutely. Before getting into human-seeded evals, we've got a great question on Discord from John K. John K asks: what success-metric dimensions could, or should, be considered by product owners and people helping product owners evaluate new AI data products? And he's thinking maybe accuracy, safety, fairness, reliability, trustworthiness, goal alignment, et cetera. samuel: Yeah, I've heard lots of different approaches to this. I think [00:13:00] those are reasonable, but there's also the no-bullshit one, where you ask the model to answer a question it doesn't have the answer, the context, for, and check that it says "I don't know". I think the best high-level answer is that you need to eval end to end if you possibly can. In the example I'm gonna give you in a minute, people are probably gonna jump up and down in their seats and be like, but this is too trivial, and this is why; but it's also reasonable to demonstrate in a podcast. To take an example: if you are doing text-to-SQL for some natural-language search, let's say, you really need to be able to run that SQL against an actual production dataset and check that it's returning the right things, not try to do some checks of whether or not the SQL looks good, whether they be LLM-as-a-judge or programmatic or anything else. You need to go end to end: can it find me the right piece of data? And that is, in some cases, relatively easy, if we have a dataset. In cases like Logfire itself, it's really hard, because the data that people are searching against is changing all the time; [00:14:00] we don't even have the data forever, so it's really hard to run good evals end to end. But I think that trying to do end-to-end evals, or as much as you possibly can, is a golden rule. You can do things like, if you're doing a RAG app, you could basically ask the LLM to generate a bunch of questions from the original source data and then use them as a way of evaluating it. It takes a bit of creativity, and this is the kind of thing that's hard. Coming back to your question earlier about why people don't do evals: because it's hard is the number one reason. hugo: Absolutely. And actually, John K, as you were talking about end-to-end evals, edited his question to also mention alignment with business goals. And I would actually say, in addition to everything you just said, Samuel: think about what your business goals are. And of course that isn't only ROI-positive stuff, well, it is in the end, but it involves fairness and reliability, trustworthiness, safety for users, all of those things. And essentially you do wanna make sure that you are measuring your [00:15:00] system's ability to deliver on business goals as much as possible.
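Samuel's end-to-end point above lends itself to a small sketch: execute the generated SQL against a known test dataset and compare result sets, rather than judging the SQL text. The schema, fixture rows, and the generate_sql() stub are hypothetical stand-ins for a real pipeline.

```python
# A minimal sketch of an end-to-end text-to-SQL eval: did it find the right data?
import sqlite3


def generate_sql(question: str) -> str:
    # Stand-in for the LLM call that turns a natural-language question into SQL.
    return "SELECT name FROM services WHERE error_rate > 0.05"


# Known fixture dataset we can evaluate against.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE services (name TEXT, error_rate REAL);
    INSERT INTO services VALUES ('api', 0.12), ('worker', 0.01), ('web', 0.09);
""")

question = "Which services have an error rate above 5%?"
expected = {("api",), ("web",)}

got = set(conn.execute(generate_sql(question)).fetchall())
assert got == expected, f"end-to-end eval failed: {got!r} != {expected!r}"
print("end-to-end eval passed")
```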
And actually, we have a case study we do in the course where we get people to ingest LinkedIn profiles, turn them into structured output, and generate hiring emails based on job specifications. That's something we need more of, to be honest, all scalable, but I think it is actually a very useful thing to do. And of course you can do structured-output tests in the middle and that type of stuff, and when you build your evaluation harness, you don't only have LLM-as-a-judge, you're doing structured-output checks and fuzzy and string matching and that type of stuff. But what I really want to get at here is that engineers, when they do evals, love measuring LLM-level things. But what you actually want with this system, in the end, is you're sending out emails to hopefully get more people into jobs more efficiently. So that's your business goal in the end. And of course you wanna measure accuracy and confusion matrices and all of that, and in the end you wanna do your cost and latency, right, time and money, people. But you do wanna tie it to your business goals as much as possible, right? samuel: Yeah. I think the other thing is, I used to be a mechanical engineer back in the past, and one [00:16:00] of the things I realized when I switched to being a software engineer is that one of the fundamental differences between the two disciplines is the feedback loop. If you're a software engineer or data scientist or whatever we call people who code, your feedback loop is a matter of seconds, right? You run the prompt or edit your code, you can run it, you see what happens. Maybe you have to go and push your PR and let tests run to see if it's good, and even if you have to put it into prod, you probably know within a week if what you've done has worked, but for the most part it's a matter of seconds. If you're a mechanical engineer and you need to check the mean time between failures after a thousand hours, you might basically have three feedback loops in your career to actually know how good you are at designing mechanical parts. So my point is, we're all addicted to this very short feedback loop, and one of the things you get from a unit test is a very quick feedback loop: you know immediately whether or not your thing has worked on a unit test. Some of these evals are gonna take longer than that. If you actually want to know how good your sales emails are, you need to wait and see fundamentally what the open rate is and what the reply rate is, and that is gonna take months. And developers are not used to [00:17:00] that. Maybe data scientists have a bit more of the "oh, I'm gonna have to actually go and wait and see what happens"; it's definitely true in business. But I think a lot of us are addicted to a small feedback loop that's often not possible in evals. You're muted, by the way. hugo: Thank you. And to that point, I'll dig it up and link to it later: I did a podcast a while ago with Roberto Meri, who's a VP of data and product at Instagram, and he launched Reels back during the pandemic, the plague, when TikTok had really started taking off, and they didn't see success with Reels for at least six months. A lot of people were freaking out, but it was a long-term experiment and they were certain it would take a longer time. So keeping your eye on the prize and tying that back to product, I think, is super important. I do wanna jump into the human-seeded evals, but there's something we've been talking around which is so intimately related to evaluation: of course we've had demos, now we're talking about evals, and I can't wait for the conversation to shift to robust tests
and the role of these in software development. So maybe you can speak a bit to how [00:18:00] you're thinking about testing as well. samuel: LLM testing in particular? Yeah, I'm not sure I have anything... I mean, the stuff I'll show you now I think touches on some of it. I think it is hard to test LLMs reliably, and I don't pretend to have anything particularly innovative to say there. I mean, I've spent a lot of my career writing unit tests of the standard variety, probably written more of them than most between Pydantic and pydantic-core and things like that. But I don't know that I have an immediate answer, because test latency is so important to running tests, and the overhead of running them is high. I think Pydantic Evals, which I will hopefully show you a bit of, is probably, if you care about type safety, if you are an experienced Python engineer, as far as I know the best framework for going and running test-style evals right now. But I don't pretend to have any particular special sauce in that space. Pydantic Evals is just a well-built, type-safe evals library that will run your evals concurrently, because [00:19:00] that's generally very valuable. hugo: Yeah, great. And I do think, 'cause a lot more software engineers are coming into the space now, we might need a different word than "test", right? Because software engineers, both in the work I do consulting and in the course I teach, come in and they're like, oh, your tests don't pass a hundred percent of the time, what's up? And of course, in data science and data-powered products, if your tests pass a hundred percent of the time, you're not writing all the right tests. samuel: Yeah. And we don't even have the beginnings of a concept for that, right? GitHub doesn't have a "this test is passing 80% of the time right now", or whether that should even be something that's run on a PR. hugo: Absolutely. And actually, Laura in the chat has a great point: tests are also super tricky to conceptualize when you don't, as you were saying, have a deterministic answer. How do you even test LLM outputs? One thing that I do, even before launch, and we talked about using synthetic data to feed into your system to see how it works, is you feed the same prompt in 20 times and look at the distribution. In the [00:20:00] LinkedIn example case, you can see how often it returns structured output and how often it doesn't, accuracy, capitalization, all of these, and you develop basic tests both in dev and in prod, with slightly different iterative loops there, but for the main cases. But we're here to talk about human-seeded evals, so this is the core concept for today. I'd love for you to break down what human-seeded evals are, walk us through how they work, and why you built this approach. samuel: Yeah, so the first thing I will say about human-seeded evals is that they are a work in progress. They are a concept that I have a demo of; they're not yet a feature of Pydantic AI or Logfire. I think we will probably tweak exactly what they are before we put something out into Pydantic AI as a formal release or into Logfire, as in, put something into prod. I'm gonna demo something today, and it will break in places. I can demo two things, actually, which are closely related. I would love feedback from people on the idea, on the implementation, on the use cases, and on where it's gonna break down.
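A minimal sketch of the repeated-runs idea Hugo mentions above: feed the same prompt in N times and measure a pass rate against basic checks, rather than expecting a deterministic pass or fail. run_pipeline() and the JSON check are hypothetical stand-ins for your own system and checks.

```python
# A minimal sketch of a non-deterministic test: run the same prompt N times and
# assert a pass-rate threshold instead of a single deterministic result.
import json


def run_pipeline(prompt: str) -> str:
    # Stand-in for your LLM call / extraction pipeline.
    return '{"name": "Ada Lovelace", "headline": "Analyst, programmer"}'


def returns_valid_json(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except (TypeError, json.JSONDecodeError):
        return False


def test_structured_output_rate(n: int = 20, threshold: float = 0.9) -> None:
    passes = sum(
        returns_valid_json(run_pipeline("Extract this LinkedIn profile as JSON"))
        for _ in range(n)
    )
    pass_rate = passes / n
    assert pass_rate >= threshold, f"only {pass_rate:.0%} of runs returned valid JSON"
```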
And yeah, we're gonna [00:21:00] try and iterate towards something that makes sense. But the basic idea is: evals are hard. They involve writing an awful lot of code. There are lots of cases where it would be very nice to be able to get something akin to LLM-as-a-judge, where some human input informs the rubric the LLM judge uses. So this came particularly from talking to Hugo, and Hugo trying some of the evals platforms and basically pointing out that none of them quite gave... I forget the exact details, but my opinion coming away from listening to you talk about it was that if you can do a bit of human annotation, that's probably valuable, and that's probably enough information, or at least valuable information, for an LLM as a judge to go and start evaluating more cases. And then the next step beyond LLM-as-a-judge is self-improving agents, which is basically where we use an LLM to go and improve the prompt. Now, to come back to what you said earlier, that is not a panacea. It is not like I can just put [00:22:00] up an agent with a crappy prompt and expect it to improve to the point where it's behaving well. But I think there probably are cases, whether it's really self-improving agents or agents that are improving through a mixture of LLM input and human input, where there is valuable stuff to be done. I think there's an enormous amount of uncertainty about where the prompt lives. There's the case of: we have some SaaS platform, let us own all of your prompts, and now it's easy for product managers to go and click edit on them, but they're not in source control. And the other end of the spectrum is where they're all deeply nested within your Python code. I think there's some innovation to be done about where prompts live, when they're updated, which bits of the system are ephemeral and which bits are permanent. But yeah, the basic principle of human-seeded evals is: human annotations, and the AI uses them to go and perform more evals, which is what I think I have a demo of today.
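To make that basic principle concrete, a handful of human labels seeding the rubric an LLM judge then applies to new cases, here is a minimal sketch. The seed examples, prompt wording, OpenAI client, and model name are all assumptions for illustration; this is not the Pydantic AI or Logfire feature Samuel goes on to demo.

```python
# A minimal sketch of a human-seeded LLM judge: a few human labels (thumbs up or
# down plus a reason) are folded into the rubric the judge applies to new outputs.
# The client, model name, and seed labels are assumptions, not the demo code.
from openai import OpenAI

client = OpenAI()

seed_labels = [
    {"output": "Your invoice is attached.", "label": "+1", "reason": "answers the question directly"},
    {"output": "As an AI, I cannot help with that.", "label": "-1", "reason": "refuses a reasonable request"},
    {"output": "The total is probably around $40 or so.", "label": "-1", "reason": "hedges instead of using the data provided"},
]

rubric = "\n".join(f"- {ex['label']} {ex['output']!r}: {ex['reason']}" for ex in seed_labels)


def judge(candidate: str) -> str:
    prompt = (
        "You are grading outputs of our support agent.\n"
        "A human has already labelled these examples, with reasons:\n"
        f"{rubric}\n\n"
        "Grade the following new output the same way, as '+1' or '-1' plus a one-line reason:\n"
        f"{candidate!r}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


print(judge("I think the invoice might exist somewhere, maybe check your email?"))
```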
hugo: Yeah, love it. And to your point about prompts and how to version them, how to store them: I think the jury's still out on a bunch of this, because it also depends on the use case. But we [00:23:00] actually did a survey in our last cohort of how many prompts people try before getting something that looks okay, right at the very start of product development. Some people said 10, some said a hundred, a bunch said 400. We had a handful of people who've worked on products where, over the first five to ten days, they'll try over a thousand prompts. So just making sure that people are aware of this type of process as well. samuel: Yeah. I think my point is that it's an evolving document. It's like, how many versions of your book did you edit? Or how many versions of that Notion document existed before it became your website homepage? There is no number. Basically it's constantly multiple people collaborating on it, and it doesn't make sense to commit every single one of them, particularly if you're in some big monorepo where tests take half an hour to run; you need a way to iterate on that prompt more quickly. So yeah, it becomes a living document, and the version history of it isn't particularly interesting. Maybe you wanna be able to snapshot, "this is where we were on the 2nd [00:24:00] of January", or whatever it was, to go back to, but we don't need every single version in the way that we kind of want every version in source control. I think the other big thing, and you've mentioned this already, and I'm trying to dampen some of this here, is that not everything can be fixed by improving the prompt with the data it already gets. Sometimes you're just not giving the right data to the agent. And one of the things, even in the self-improving agent, even my first stab at it, is trying to alert the developer when it's missing information, because I think that is part of it. You can't put lipstick on a pig, and similarly the agent can't produce the right prompt or the right context if it's fundamentally missing data. hugo: Absolutely. And, this is a place we'll get to, but I love that you're already speaking to what the human's role is even once we have this set up. So, in terms of seeding evals, humans seeding evals by doing some annotation of these types of things, what do you think is a good start? How many seeds do you need to get started? samuel: I'll show it today working with like four or five. I mean, in this case I [00:25:00] know the cases that are gonna not work, 'cause I've played with this demo before, and you immediately start to add more information as you add some cases. I don't really know the number; I suspect that under 30 will be valuable in almost all cases. I mean, if we think that the agents are anything like as intelligent as everyone claims they are, we can treat them as equivalent to an intelligent intern, or at least an intern on that particular job. And if you had an intern and you told them 15 different things that they had done wrong, you would expect them to be a heck of a lot better after you'd given them 15 pieces of feedback than when you'd given them none. And I think I remember Barry Zhang from Anthropic saying this at AI Engineer back in New York in February: if you try to imagine what the world looks like from the point of view of your agent, what context it has, and you blinker yourself down to what the agent actually can see, you will realize quite quickly how hard its job often is in terms of what it sees. If you can have that blinkered view of the world, [00:26:00] of what does my agent see, what does it need to be able to do a better job, that is the number one way of improving the performance of agents: just give it the data, the context, it needs. And I think one of the things that evals do is start to point that out to you. hugo: Yeah. samuel: And with the human-seeded evals thing, sorry, both human-seeded evals and the self-improving agents, which are kind of one step on, part of what I'm trying to do is basically surface to the user: I just don't have this information, so how am I supposed to be able to solve this problem? hugo: For sure. And I do agree that under 30 can be very valuable. Also, of course, it was a silly question in some ways, an instructive question, but in the abstract it really depends on what the nature of your system is and what the nature of the exact task is. Even think about ML, right, where you've got binary classification: how much a test set tells you about something depends on class imbalances, and you wanna look at false positives and false negatives and that type of stuff. So essentially you want
a number of cases which will cover true positives and then a bunch of failure modes, right? And I usually tell people to start with 20, and I've got this from my friends and colleagues [00:27:00] Shreya Shankar and Hamel Husain as well. Part of the rationale behind that is firstly just to get people started, right? And if people start looking at 20, they're likely then to be like, oh, this is actually interesting, let's look at a few more. Now, Shreya in the end says, and I kind of agree with this, that, well, you don't wanna look all day every day, but you shouldn't stop looking if you are still learning things from looking at your data, right? And think about when Netflix first launched their recommendation system: the engineers at that point were looking at the results every day, right, to see what was up. And then after a week, maybe a bit less; after a month, a bit more stable, those types of things. So it is really a function of how you iterate post-launch as well. So, we'll get to the demo soon, but I'm wondering, speaking of this iterative process: once you hand off, after seeding, to the LLM as a judge, how do you think about the role of the human there, and keeping quality control, essentially? samuel: Yeah, I mean, what's the point of running LLM-as-a-judge in [00:28:00] real time? Obviously what you want it to do is surface when you have issues, basically send you an alert when the performance drops, and that's one place where you dive in and start looking at it. I mean, I still think that the best products are built by people who use their own product. One of my rules for starting a company was that I had to be my own customer, because I had run a company where I wasn't my own customer and I don't like it. And fundamentally, most of the improvements we get to Logfire... sure, we care enormously about what you all think of it, please keep giving us feedback, but ultimately an awful lot of feedback comes from me ranting about some feature that's driving me around the bend, and it would not be as good a platform if it weren't that all of us care about how it works. That's a lot easier when you're building dev tools than when you're building hospital software, where not all of your developers are also gonna be doctors. But I think caring about your product and using it is the ultimate continuous human judge of any piece of behavior. hugo: Yeah. So I'd love to know what playing with it looks like before we dive into the demo. But before that, Julian has a wonderful question in the chat, and I think the jury's out on this for me in particular: do you have any advice on convincing [00:29:00] higher-ups of the importance of getting subject matter experts to review the final output as part of the dev and evaluation loop? samuel: Nothing specific on, kind of, how do you persuade management to spend money on quality? I think, for the most part, management in some abstract sense is reasonably compelled by AI. There's a bunch of things they will do if you can basically say it's AI that they wouldn't have done when it was general code quality or improving product quality. So I think calling it AI probably helps. I've seen lots of great research that says AI budgets are the fastest-growing budgets, and security is the other budget that's ring-fenced, so call it AI or call it security. hugo: Yeah, absolutely.
And I do think you actually reframed the question really nicely: how do you get the executive to invest in quality, right, to paraphrase slightly? And [00:30:00] that's really tough, 'cause everything's expensive. But I do think it helps to help them understand, and to send them to conversations like this; I'm happy to come and talk with your executives as well. I will show them cases, serious failure modes, of not getting SMEs to review the final output, where, once again, it's garbage in, garbage out. samuel: Right. And I think the other thing to say is there's often an assumption that this is tens or hundreds of hours of people sitting there combing through cases. Computers are so big, everyone talks about big numbers, everyone wants to show you their biggest number, and so people think we're gonna need thousands of bits of feedback. Actually, the fact is you can probably get some value out of having a subject matter expert look at 20 examples in two hours, and that will be extremely valuable for you. And if at the end of it you have a burning sense that you haven't done enough, then that's an indicator to do some more. My instinct is that two hours of a subject matter expert's time, if you've prepped correctly, is gonna be unbelievably valuable. And developers are expensive, GPUs are expensive; you can probably afford two hours of almost any subject matter expert on almost any [00:31:00] product. hugo: Yeah, absolutely. And in the case of the email generation I mentioned earlier, it's a no-brainer that you'd want the person who used to write the emails to check a few of them before you start sending them out, right? samuel: Right, yeah. But that's one where we've all got a lot of emails; we all have an intuition about it, right? I think the more esoteric, the more domain-relevant it gets... I mean, medical's a hard one, because everyone's like, ah, but medical, we shouldn't use AI for medical, so I'll put medical to one side. But if you think about a complex accounting question, obviously you're gonna have more luck if you have an accountant looking at it than if you have a software developer, just as if we had an accountant looking at software, they're not gonna do a great job. hugo: Yeah, totally. And I do agree that generally executives and employers and team leads are very interested in AI. I do think a bunch of middle management is probably quite scared, 'cause I think that's the type of thing that may be able to be automated: middle management, or as I like to call them, the bouncers of the executive world. I'm half [00:32:00] joking. samuel: I've had a couple of conversations in education about this, and I've heard a lot of people desperately not wanting AI to come in and be involved in education. And my argument to them is: if you rewind to 1830 and the beginning of mechanization in agriculture, who was most affected by it? It was the people who refused to accept it and ended up not having a living. And actually, how do you let AI not affect your life too much? You adopt it for the low-hanging fruit. The actual time spent with the child, tutoring them, carry on using a human for that, but prep them by giving them a great summary of that child's recent education. Get all of the invoice chasing done with an AI so you can spend your time teaching.
And I think it is possible, in general, to persuade people that they should look into AI before AI looks into them and their job, without a doubt. hugo: So, the human-seeded evals loop and what we've been talking about, and we'll get into a demo soon: you've tested this at Logfire. I'm wondering if you could share a case where it worked well, and one where it fell short, and I appreciate this is still in [00:33:00] development, in research. samuel: My impression is that if you have relatively single-shot cases, like the ones I'm looking at today, it works pretty well. And I'm not gonna claim I have evidence for this, but my impression is that as the interactions with the LLM, the conversations, get longer, the explosion of different use cases within the flow chart gets so complicated that it gets harder and harder to cover all of the cases. So I think, like all things, we can be most confident about its effect on simple cases, but that's not to say it shouldn't be useful in more complex ones. hugo: Yeah, makes sense. I'm also interested, 'cause I know there are incentives to build and not maintain, but maintenance is such an important, critical concern, right? So do you have any idea what the signs are that your rubric or judge might be drifting or degrading over time? samuel: I mean, if your rubric is hard-coded and you are not using previous AI responses within that rubric, you can be reasonably confident that it is tethered to your original [00:34:00] definition, your original system prompts, as I'll show. One of the interesting things about building this self-improving agents demo is that you can suddenly see how your prompt is effectively evolving over time as a function of the previous inputs, and so it can start to drift. And one of the things I realized in doing it is that if you are using the inputs and outputs from previous runs to basically improve your system prompt,
We have to be more careful when we're if, like, as I say, as the system by which we improve our system prompts or instructions gets more sophisticated. We're gonna have to be more and more careful about data lineage and categorizing data for doing that. Makes sense. Let's jump hugo: in, man. Cool. So just a quick heads up. At this point in the live stream, Samuel switched to a live screen share where he [00:36:00] showed exactly how human seated evals work. Then lets a self tuning agent rewrite its own prompt inside log fire. If you build with LLMs and wanna see the clicks, jump to the video on YouTube, links in the show notes. Otherwise, feel free to keep listening and we're rejoining the conversation a bit further downstream. samuel: I don't think we're in a world where we just like. Set off our agent and leave it to do its thing should be super valuable in terms of allowing us to get agents into production more quickly and behave decently. It's probably not gonna move the dial on the 1% of cases where people have spent hours building thousands of evals, but the vast majority of cases in production today, not like that. And I think coming back to the flossing case. If we can get people to basically drink less high sugar Coca-Cola, that's gonna probably have a better effect on tooth decay than telling everyone to do flossing and, and no one doing any flossing. Absolutely. I hugo: love that analogy once again, and we've got actually a couple of questions to wrap up on in the chat, which are very relevant to that wonderful demo Laura asks. Do you expect humans would need [00:37:00] to periodically review or label the elements of judge output to confirm alignment to the initial seed? And I'll generalize that slightly more by asking not only confirm alignment to the initial seed, but uh, if there are new things happening in the system that are outside the seed scope as well. Yeah. samuel: I guess one thing that I didn't share with you, which I think is actually super relevant and actually sort of half of the value of human CED eval of the self-improving agent, even if. It's not how you would, how I kind of described it, but if you look at the run to the coach run here, if you look at its final output, it contains a bunch of the patch, but it also contains an overall score of basically how well it thinks this context is done. And then it includes developer suggestions. And the idea of this is that if this number drops below, say five, we do a warning to the user. So this isn't just about like improving itself, it's also about warning. The user when or the developer when things aren't set up correctly. You see here it turned seven and it said the agent needs to be [00:38:00] consistently called inject time, yada, yada yada. Consider making this required first step automatic injection, which is literally what I had before. I like tweaked it for this demo, so it's actually come back immediately and given us. Valuable. Basically feedback to the developer even on the first time it ran. So I, I think that like this is almost as it is valuable in autonomously updating the prompt. There's also value in it. Basically doing a good job of telling you what's wrong with it, which isn't something we normally get from agents and it's not something we would, I mean, one thing we, we have done before ourselves internally is add a field. 
When you're doing, like let's say you are asking it to return a simple numerical value, you add another field, which is explanation and then like, or you could set that up to be like, how bad was my prompt? It'll take it a little bit longer to run, but like that information is super useful in both alerting you and in trying to work out how to fix it. But I think it's powerful having this like kind of background agent running where we don't care about latency, we don't care about cost as much because we're running [00:39:00] it. A fixed number of times per day to like alert us to when there isn't enough information. Yeah, that makes a lot, I mean, the main answer to Laura's question a hundred percent. We need to like have human in input here and I suspect that like the better flow with this would be that the agent goes and suggests an improved prompt and you accept it rather than it like automatically going out and using it. hugo: Exactly, and this actually speaks to a broader point. I do think we haven't seen a lot of this yet, but I'm very excited about the idea of proactive, not reactive LLMs and agents. I would love like an intern style agent that comes to me like at the start of every day and says, Hey, what's up? What can I help you with? Um, and, and get involved that way as opposed to me continually needing to go to it. Yeah. Sorry, that made me sound somewhat lazy, but I, I do think I possess. Qualities of good laziness as well. We actually have another question from John, which I think is like wonderful future music. John's question is, given this system can multiple people and or agents contribute to the same evaluations? samuel: Yes, and I think, as I say, I think we need, [00:40:00] it's good for us in a way as a platform, but like. This stuff needs UI and it needs a like authenticated system where we can go and see how these things are performing and the collaborative bit of getting these right and like commenting on the, like, I think there's value in like comments to your peers in a PR review sense of like, this is right and this is wrong, as well as just like edits to the system prompt. Totally, hugo: and the reason I refer to it as future music as well, I do think most LLM systems and agentic systems are single player currently, and I do think the future of firstly fun and good vibes, but secondly, economic impact, potential economic impact of these systems is in multiplayer mode. I mean, having an LLM with agentic capabilities. On a marketing team, for example, and everyone can chat with it and delegate and get responses back and get tasks done and that type of stuff. So we were samuel: talking today about, obviously we're writing a bunch of marketing material at the moment, or documentation. We know people are gonna let LLMs complete some of their documentation or writes some chunks of it. Like while I don't want LLM generated documentation everywhere, I [00:41:00] want some of it to be generated. One of the things we're working on as a team now is a. System basically. Well, we might implement it as an MCP resource, but basically our tone as a company for how we write. So we want basically telling the LLM. Don't be as verbose as usual. Don't be as formal as usual, be like a bit chattier and as concise as possible and to the point, but also it'll end up being a bit of like. How we think we should talk as pedantic and that like you can imagine, part of what's valuable in a company in the future is fundamentally the prompt for how do we talk in our marketing material. 
Mm-hmm. Like that's part of the IP as much as like the actual marketing material which changes over time and is just gonna be AI generated in some cases. This like, that's just inevitable. Absolutely. So to hugo: wrap up, I'd love if we could remind people how to get involved. So I've shared github.com/pedantic/pedantic stack demo, and the pool requests there. But if people really wanted to play around with these human seated evals and all the wonderful things you've just demoed, what's the best way for people to get involved and do you have a [00:42:00] slack or discord they can come and chill on as well? Yeah, please samuel: come and join us like there's links. On Lanza ai. I will try and answer any questions after this if people have any questions there, but like expect me to be easier to ping from our slack where I'm I am, or one of our team. Worst one of our team will bug me and say, Samuel, you haven't replied to someone who asked about your bloody human CD eval demo That I don't understand. So. I'll be there and I'll try and answer anyone. Amazing. hugo: Well, thank you so much. And also thank you for the spur of the moment sponsoring of our building with the LLMs course starting next week. So for those of you who've missed the start, next week, I'm starting, uh, third cohort of our building, LLM, powered Applications for Data Sciences and software engineers, which I teach with Stephan Krach, who currently works on. Agent infrastructure at Salesforce, and we have a whole slew of guest lectures from wonderful people such as Samuel and Jason Liu and Rehan and Paige from DeepMind and a bunch of people. And we also have a bunch of credits. So it's sponsored by Modal, $500 in modal credits, [00:43:00] $300 in Google and Gemini. And now to top it all off, we've got $500 in credits from Log Fire and Samuel and the wonderful team there. So really appreciate that sponsorship as well, Samuel, I'm sorry and can't wait to see you in the course. No problem. samuel: It was not spur of the moment. I had just forgotten to tell you when I was emailing you the other day, but amazing. Just it was like spur of the moment, remembering to say it. But anyway, thank you so much for having me. Thank you for dealing with my somewhat ad hoc demo, but I really enjoyed our conversation. I hugo: the, the fun of it as well. So also thank you. We've had 110 people join in real time and really appreciate you joining and sticking around and all your questions. But most of all, thanks for all the work you do bringing software engineering back to. Software Samuel and can't wait to continue the conversation. Awesome. Thank you so much. Thanks to tuning in everybody, and thanks for sticking around to the end of the episode. I would honestly love to hear from you about what resonates with you in the show, what doesn't, and anybody you'd like to hear me speak with along with topics you'd like to hear more about. The best way to let me know currently is on Twitter [00:44:00] at Vanishing Data is the podcast handle, and I'm Hugo. See you in the next episode.