The following is a rough transcript which has not been revised by Vanishing Gradients or the guest. Please check with us before using any quotations from this transcript. Thank you. === hugo: [00:00:00] We're here to talk about an incredible database that you've put together, but not just the database, Alex: a collection of resources and blog posts. I've already learned a lot about evaluation practices that are happening in industry from your posts. You've already tracked over 400 deployments of LLM-powered apps and agents. Before we dive into the meat of it, maybe you could just tell us what startled you, and what are the top three takeaways that stand out to you from analyzing all of this data? alex: Yeah, thanks, Hugo, and thanks for having me on. I think overall, the thing that was interesting while putting this together (and it was a labor of love over the past few months, just collecting all of these links; it was something I had in a little file, and I was just gathering these things together along the way) is that it's clear a lot of people are doing a lot of different things. So the first thing is the diversity of approaches and use cases, and this only continues to expand the more LLM capabilities [00:01:00] extend further. We've now got more multimodal things coming in, you've got these mega one million, two million plus context windows which keep growing, you've got video potentially coming down the line, audio inputs. People are doing a lot of things with that. So yeah, the first thing was: people are doing a lot of different things. Maybe not all of it is at Google scale or whatever, but people are trying this stuff out, and that's really cool to see. The other thing, I guess, is that when you dive into the technical details, it seems really clear that the fundamentals still matter. So all of the classic software engineering, bread-and-butter DevOps stuff: how do you reliably deploy applications? How do you make sure these things are robust? How do you handle failure? How does this thing scale? That stuff still matters, you still need expertise, and that stuff is actually hard to do. A demo app is one thing, but once you have real users, production [00:02:00] data, all of these kinds of things, that becomes a problem. And I guess the final thing, and maybe this gestures a little towards what we're going to talk about, is that the lasting products or experiences that I've seen work, or that people have written about, are projects or efforts which are driven by the problem rather than the solution or the technology. So it's less about "let's dump an LLM into our application" and more "we're finding that users aren't converting," or "we're finding that no one is reading our emails," or whatever the specific problem is. It's starting from something that relates to business use, or a feature that users are interested in, rather than "let's do gen AI." Those are the big ones which stood out to me; none of them are necessarily particularly technical. hugo: I love it. So just to reiterate: we have a wide variance and variety of use cases, from classic text-generation stuff to conversational agents to all types of [00:03:00] multimodal stuff; then, fundamentals still matter, which I love. And something I want to dive into is perhaps the idea that fundamentals matter even more now as well, right?
Because we're really pushing up against the limit of what's possible, particularly with the non-deterministic nature of language models. The third is related to that: what you've seen is that we've got a lot of business use cases driving adoption, not technology trying to drive business in some way. And I think this comes down to fundamentals as well in a lot of ways, because you want to be able to define a business problem, have a business way to evaluate it, have a macro-level evaluation of whatever software system you're building, and then dive down and start essentially doing evaluations of the micro aspects of it as well. Yeah, please. alex: Just to intervene: it does seem that a lot of people do start with something top-down, from a CEO or someone who says, "Put gen AI into the application." So a lot of people do [00:04:00] start there, but the things that work are when people actually realize, no, we have to go back to the problem. There's no data to say this, but I would imagine there's a majority of "dump gen AI into my app" versus the problem-driven gen AI use cases. hugo: Yeah, makes sense, and something we'll get to as well. I don't want to put the cart before the horse, so to speak, but it isn't like we've got these giant autonomous LLM brains that are like operating systems doing all types of stuff, right? The people who've been most successful so far embed them in incredibly structured workflows with well-defined business logic, which is absolutely key and against the zeitgeist in some ways. So I think this correction is incredibly important, particularly given the non-deterministic, stochastic nature of these things. I've linked to the database and one of your posts, the introductory post about it. And as I said, people should go through all the blog posts; there are all types of different lessons. Let me just say, there's the evaluation [00:05:00] playbook, a post on what you've learned from evaluations; then what you've learned from prompt engineering and management in production; then what you've learned from agents in production with respect to architecture, challenges and more; and then one on building advanced search, retrieval, and recommendation systems. Apologies for reading your own post titles to you, Alex, but I'm letting everyone who's listening know that this isn't just a database: it's an incredibly large collection of insights. And I'm starting to talk like an LLM; I say "insights" more than I ever did before chatting with these monstrous machines. But because there's so much in it, with all these blog posts, articles, and case studies: how would someone use it? If someone's working with LLMs or gen AI, how do you think they could use the database? Do you intend it as a reference tool, or is it a source of inspiration, or how are you thinking about it? alex: I mean, it's definitely all of the above, and take what you want from it. It's also available on the Hugging Face Hub as a dataset, if you want to take the data and dump it into an LLM or whatever, filter it and sort it as you [00:06:00] will. We wanted to allow for that use case, for sure.
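For readers who want to try exactly that, here is a minimal sketch of pulling the database from the Hugging Face Hub and filtering it down to a slice of interest. The dataset ID ("zenml/llmops-database") and the column names used below are assumptions; check the actual dataset card for the real ID and schema before relying on them.

```python
# Minimal sketch: load the LLMOps database from the Hugging Face Hub and
# filter it to a slice you care about. The dataset ID and the column names
# ("industry", "title", "summary") are assumptions; check the dataset card
# for the real schema.
from datasets import load_dataset

ds = load_dataset("zenml/llmops-database", split="train")

# Keep only case studies that mention healthcare, e.g. to study how teams
# handle privacy constraints in that space.
healthcare = ds.filter(
    lambda row: "healthcare" in (row.get("industry") or "").lower()
)

for row in healthcare.select(range(min(5, len(healthcare)))):
    print(row.get("title", "<no title>"), "-", (row.get("summary") or "")[:120])
```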
If you have a particular project that you're working on (let's say it's, I don't know, something audio-related or something in the healthcare space, and you want to see how people deal with privacy-related things when they're dealing with LLMs), then you can just drill down into the specific use cases you're interested in. Of course, there's also the option, since we do live in the world of models with large context windows, to dump the related examples that match your context in and have a conversation about that. That's also something you can do. Or read the source material. There are a lot of videos summarized in there; I think there's something like a hundred to 150 hours of video which has been summarized, so the database should save you a lot of time having to watch hours of video to extract what people were doing with their systems. So yeah, the idea is it's inspiration, to point you in a direction where you can learn more and understand more that relates to the specific problems you're working on. hugo: And you've also used Google Labs' NotebookLM to generate podcasts of each blog post. alex: Yes. [00:07:00] Sure. I mean, there was just so much information, and ZenML is not a research organization or a research lab or anything; it's not like we had unlimited time and manpower to generate reports and so on. I did my best, putting my former historian hat on. But yeah, the podcasts also help extract insights and summarize things in a useful way. Some people just like audio. hugo: Absolutely. I just want to say, and I don't want to get into this too much, but because we're talking about the structure of it now: in some ways I really admired the information architecture of it. We live in a world where we're bombarded by content and noise and shit constantly, to be honest, and it was refreshing to find something which was easy to navigate. And not only that: as I said before, you've got a blog post, there's a point you make, and you hyperlink to a number of case studies; I can go into those case studies, which are all structured in the same way, and which then provide the primary sources. I'm looking at the "LLM evaluation framework at Canva" one; it goes to a YouTube video where I can actually watch the primary source. And it's easy to [00:08:00] navigate. I know that this stuff isn't easy work and it takes a lot of time, so I really appreciate all the work you and the team have done there, Alex, very seriously. alex: Yeah, thank you, and shout out to Zuri, who works on our team, who made it look nice and laid it out on our website. hugo: Awesome, big ups to the whole team at ZenML. I want to get into an agent reality check soon, to see whether this will be the year of agents and what that means, but I am interested: looking at all these deployments, are there any things in there that people found surprising when they hit production? alex: I think maybe it's a counterintuitive one: prompt engineering and prompt management is actually hard. And that's something which came out clearly. We're all interested in the tech, we're all interested in how the boundary is being pushed on how these things work, and so on. But when you look at how teams were dealing with this stuff in production, it seemed like this was something they all had to learn at one stage or another. [00:09:00] Firstly, there is no one single magic prompt. Certainly within the life cycle of an application, this stuff is going to change. And of course, prompts are brittle.
When you switch out a model, or even sometimes with the same model, you change one thing, you add a comma, you put things in all caps or whatever, and suddenly you're getting completely different results. Those are grist for the mill for the prompt engineering stuff. And then there's the whole business of tracking prompts throughout the life cycle of the application, particularly once you get into the world of agents and they're off doing their own thing in the background. Tracing prompts through the life cycle of your application and all of your infrastructure, all of that is non-trivial. I think that would be surprising to many people: they would assume that the prompt engineering bit was the easy bit and that the hard part is, you know, managing clusters of GPUs or whatever. But more often than not, I saw that people had to, at some point, get serious about how they were handling their prompts, prompt sprawl, [00:10:00] progressions and all of that. hugo: Absolutely, because I do think there isn't enough conversation around this. Prompts almost seem like an ancient technology now that we have agents and all of these things. But I think we can all agree that prompt engineering is an aspirational term: we don't have an engineering discipline around it, and there isn't a lot of science to it either. And these things we're interacting with aren't computational in the way we expected. A lot of LLMs are horrible calculators, and we expect computers to be good calculators, right? So something funky is happening. I don't want to dig into this too much, but I also think we're in relatively early days, and we don't actually have many of the tools that we'll be using in a few years for prompt management. I'm very excited to see what arises in the space of open source and vendor tools to help people with prompt management. I think it'll be a beautiful space if done well. alex: Yeah, and beyond that, actually, I find it super fascinating that prompt engineering and tweaking things in the right way is something that [00:11:00] engineers don't seem to be that good at, innately. I've seen people who are super smart at very logical things, but when it comes to figuring out how exactly to manipulate the embedding space so that things come out a certain way, it requires a different mindset almost. hugo: And as someone with a non-traditional background, I quite like that there is this space within the world of LLMs where other people get to shine. Absolutely. And it brings in a lot of the natural language, NLP community: a rich history of so many different disciplines, linguistics and history and sociology, coming together to build all these wonderful technologies, particularly in the Python space as well. Let's talk about agents. It's the 7th of January, 2025, and it is the year of agents, Alex. But of course, we know that at one end of the spectrum we have this idea of LLMs where I say, "Hey, can you do something?" and it [00:12:00] goes out and does it and brings me back the results, or performs the action. And at the other end, we have entirely structured workflows that are totally deterministic, classic software. A lot of our cultural consciousness now goes to fully autonomous agents.
I do think Anthropic published a wonderful post recently, which I'll link to in the chat and in the notes, which really helped level-set expectations around when to use agents and when not to, and introduced the idea of almost partially agentic systems, a continuum from LLMs to agents. But you, through all the work you've done over these 400-plus deployments, emphasize that most production deployments still lean on structured workflows. Why do you think this approach remains dominant? alex: It goes back to what people are seeking from the systems in the end. No one wants (not to anthropomorphize agents) some slightly unreliable actor within their system which may or may not provide what you want for users or [00:13:00] customers. People want reliability, people want predictability, and they want control. And so having structured workflows or guardrails, or whatever metaphor you want to put around that, allows for this reliability. It allows for more predictability, and it also allows you to intervene more along the way, which can guide things in the right direction. And quite often you see, when you're watching the videos or reading the technical blogs that the teams have written, that people start out with a kind of agent god mode, where they let it go wild and just handle user queries. And then slowly they move into a more structured setting, because it allows for human oversight, human intervention, minimizing complexity, what we were talking about with prompt engineering earlier, and tracking all of the agents doing their autonomous things. And it gets even more complicated when you're talking about multi-agent systems. Engineers don't [00:14:00] want to have to debug this at two in the morning: "What was my agent doing 35 minutes ago during some internal conversation?" Engineers want something they can reason about. Obviously, as these systems continue to get more capable, you'll probably see a shift to devolving more responsibility, or less oversight, to those things, but you're still, in the end, going to have to debug these things at some point. hugo: Of course. Can you, just to ground this, and because you have such a large database, give an example of where a structured workflow outperformed a more autonomous agent design in production? alex: Yeah, there's a couple of good ones. There's a company called Rexera, which develops an agent-based system for real estate. Basically it's quality control: they needed something to review real estate transaction workflows, check things for errors, verify [00:15:00] accuracy. And it was built in as part of a bigger system; it's not the core thing the company does, of course. They started off with very simple prompts and found super high error rates, and it certainly couldn't handle multi-step or complex workflows. Then they said, okay, this is clearly too simple, let's go full AGI, let's have a multi-agent system. They built it with CrewAI, which, as I understand it, is like a wrapper around LangChain. And this was quite a bit improved; it was able to handle more complex workflows.
But they still had pretty high error rates, and one thing they really suffered from was agents veering off course and getting lost in some little eddy somewhere. So that wasn't working, and it wasn't reliable enough for them to just hand over this quality control process to the system. Then they took it one step back and said, okay, let's have structured workflows. They used LangGraph, I think, for this, and they had a kind of tree-like decision structure where, at certain points, the agent goes off in a certain direction. I [00:16:00] think they took away the multi-agent part, which made it a little bit simpler to reason about, and they heavily reduced the randomness of agents getting lost somewhere, because once you structure the decision tree and decision workflow a bit, these things happen less. That's one good example, and it also shows the learning process within the team of how they dealt with the problem of things going wrong. hugo: Yeah, I love the story, as opposed to just the final "what worked": having this story of their process, of what worked, what didn't, and how they went forward and went back. I like grounding it in the idea of a tree where agents make decisions at certain points and these types of things. I'm interested if you could break down, just for the audience, what a structured workflow really means, and how agents or LLMs are contained in different parts of structured workflows. alex: I mean, I would say this can differ quite a bit between different projects. It can be, like I said, this [00:17:00] tree-like structure where decisions are taken at certain points. It can be even more rigid than that, where the guardrails metaphor is better: you're going through certain checkpoints, and you also have this kind of "do not act in the case of uncertainty" principle, at which point the agent maybe wouldn't complete anything else, because it's better not to do something than to take an action in the face of uncertainty. I would say there are lots of different patterns, and the technical details in the blogs aren't enough to come out with a systematic idea of "these are the three patterns that we see." But I would say there's considerable diversity in how people are doing these things.
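To make that "do not act in the case of uncertainty" principle concrete, here is a minimal, framework-free sketch (not drawn from any specific case study in the database): a workflow step that only acts when the model's proposal parses, names a known action, and clears a confidence threshold, and otherwise defers to a human. The schema, threshold, and stubbed model call are all illustrative assumptions.

```python
# Illustrative sketch of a "do not act under uncertainty" guardrail.
# The schema, threshold, and `call_llm` stub are assumptions for illustration,
# not details from any case study discussed here.
import json
from dataclasses import dataclass

ALLOWED_ACTIONS = {"approve_document", "flag_for_review", "do_nothing"}
CONFIDENCE_THRESHOLD = 0.8

@dataclass
class Decision:
    action: str
    confidence: float
    rationale: str

def call_llm(prompt: str) -> str:
    """Stub standing in for a real model call; expected to return JSON with
    'action', 'confidence' (0-1), and 'rationale'."""
    return '{"action": "approve_document", "confidence": 0.55, "rationale": "dates look consistent"}'

def decide(task_description: str) -> Decision:
    raw = call_llm(
        "Review the task and answer with JSON keys action, confidence, rationale.\n"
        "Task: " + task_description
    )
    try:
        data = json.loads(raw)
        decision = Decision(str(data["action"]), float(data["confidence"]), str(data.get("rationale", "")))
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Unparseable output counts as uncertainty: don't act, escalate instead.
        return Decision("flag_for_review", 0.0, "unparseable model output")

    # Unknown action or low confidence: prefer inaction over a wrong action.
    if decision.action not in ALLOWED_ACTIONS or decision.confidence < CONFIDENCE_THRESHOLD:
        return Decision("flag_for_review", decision.confidence, decision.rationale)
    return decision

if __name__ == "__main__":
    print(decide("Check the closing date on this purchase agreement"))
```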
hugo: That makes perfect sense. And I am wondering, probably important to balance the conversation: are there specific areas where autonomous agents shine? Or do you think structured workflows will continue to lead for the foreseeable future? alex: Well, like in the example I was just mentioning about Rexera, and there's also another good one from Lindy.ai which followed a similar kind of [00:18:00] pattern: once you do get into more complex workflows and complex tasks, you do get some advantage by having something which has a bit more flexibility to it. So put this under the complex reasoning and planning bucket. Potentially, although there weren't so many examples of this in the database, there are also quote-unquote creative, exploratory tasks. There's one company, Ubisoft, which had LLM-driven dialogue generation for games, and there they had human oversight within it. So I guess that was the structure, the container, for that, but then there was a lot more ideation and so on, with a quote-unquote agentic framework around it. And yeah, again on the complexity side of things: if the environment is much more dynamic, or things are changing and it's harder to put structure around them because the input or the scenarios are less predictable, then potentially agents are, [00:19:00] again very hesitantly, maybe more powerful. But it's still super early days, and it's hard to say there are really standout cases where something could only have been accomplished with agents, or could only have been accomplished reliably that way. hugo: Yeah, super interesting. So I want to think a bit more about architectural patterns from the field. You've highlighted that companies like Anthropic, Amazon, and Klarna are using approaches like ReAct, RAG, and microservice-based architectures. I want to hear about which of these patterns are proving the most resilient, but I'm wondering if you could first break down what these are: the space of architectural patterns you've seen across all of these case studies. alex: Yeah. And I say this all with a caveat, and I apologize for the caveats, but I come from a research background previously: this database is the technical blogs which exist in the world, and it's hard to know what people aren't publishing. And [00:20:00] obviously the things they publish are generally the things which make them look good; you hear far less about the failures. But bearing that in mind: RAG, or RAG hybrids plus agents, is clearly a super big pattern. RAG for grounding, for knowledge, for the fact that you don't need to retrain these systems as new information comes in, and for the potential of citing where sources and information come from. This was fairly common. It's common at big players like Amazon Pharmacy (there's a really great case study published by Amazon in the database), LinkedIn also had one, and many others. It's a super common thing, generally because no one wants hallucinations; everyone wants factual responses, particularly when there are real users or customers and it's not an internal-facing tool. Then, architecturally, microservices: at the point where you start having real scale, this seems to be a super common thing. You want independent scaling of your components, you want to be [00:21:00] able to maintain them in a modular way, you want to allow your teams to intervene at different points within your system, and you potentially want to be tech-agnostic: maybe some new tool comes out and you want to switch it out. This is the classic software engineering we were talking about at the beginning, and there are lots of examples of it in the database. Stripe had a good one, and I think ElevenLabs also talked about microservices. And the other quite common pattern, particularly in the agents space, is human in the loop, and that's basically there for reliability. Any time you have agents dealing with customer service, support bots, this kind of thing, then almost always, and certainly in companies with real money on the line,
they will have some kind of rule: if a conversation gets weird or difficult or whatever, pass it on to a human agent, or at the very least log it for future review. There's a really great case study by Summer [00:22:00] Health, which is one of these medical transcription services (which I think are going to become much more of a thing this year), and they had exactly that kind of human-in-the-loop system, because obviously it's in the healthcare space: a lot of regulation, plus a lot of potential downside when things go wrong, so humans are there to handle edge cases and provide expert support. The key thing is that people are using LLMs where they're good and useful, but being very clear about the places where LLMs are weak and limiting the damage where possible in those cases. And yeah, I'm sure five years from now we will think of human-in-the-loop systems as fairly primitive tools for handling the reliability of this, but it's what we've got at the moment. hugo: I really appreciate that breakdown. And I'm wondering which of these patterns has proved the most resilient in real-world deployments? The answer maybe depends on the use case, but could you talk us through what works for what, essentially? alex: It certainly [00:23:00] depends on the particular use case you have. If you've got a high task-complexity requirement, then it's all about how you break it down, how you reason through it. And depending on your risk appetite as well, you will be more or less comfortable with something more structured versus not. And this is changing day by day: you've got o1 now, and o1 pro, and all of these other models which are supposedly better at this kind of planning and reasoning and breaking things down. So the spectrum of what will work there, and what will work without too much human oversight and intervention, is fluid. The other thing is how much autonomy the agent needs: if it's just simple Q&A, it's probably a lot easier, but if you're building something which is, again, a complex workflow, you're going to have to see how you handle that. Also, how many tools are involved? Is it really, like you say, much more of a [00:24:00] ReAct thing, or is it simpler? Plus all of the standard SRE stuff: what are your latency requirements, how quickly does this system need to respond, and how much money do you have to burn on these agents just thinking to themselves in the dark? hugo: Yeah, that's super helpful. I'm wondering, for teams just starting to build with LLM agents, what do you think is the simplest architectural pattern they should explore first to avoid unnecessary complexity? alex: The simplest is: don't use agents. I somehow have Hamel sitting in my ear saying "don't use agents" most of the time. But for an agentic, quote-unquote, system (and we didn't really define terms here, but I often think that when someone says "agent" I can replace it with the word "magic," because that often seems to be what's going on when people talk about agents and systems), a very simple, light pattern is router plus [00:25:00] code. It's glorified conditional statements, but you do have an LLM in there which helps do more complicated things. You can have this router, which is either a simple regex (if someone says "help," send it down the emergency services pathway, or whatever it is) or even an LLM in that place, to route people to different places. And then that query or task is passed to more specialized tools or even sub-agents. This is relatively easy to reason about, it's simple to start with, and you can make it more complex: you can chain components together, you can add more conditions, you can add loops, you can add state management. But again, this isn't fully autonomous agents, and it's not a multi-agent system; it is something a bit more responsive and reactive, potentially, than simply conditional clauses within code.
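Here is a minimal, hedged sketch of what that router-plus-code pattern can look like in plain Python: a cheap keyword check first, an LLM classifier as a fallback, then ordinary conditional dispatch to specialized handlers. The route names and the `classify_with_llm` stub are illustrative assumptions, not taken from any particular case study.

```python
# Illustrative router-plus-code sketch: regex first, LLM classifier as a
# fallback, then plain conditional dispatch to specialized handlers.
# The route names and `classify_with_llm` are assumptions for illustration.
import re

ROUTES = {"billing", "technical_support", "emergency", "other"}

def classify_with_llm(query: str) -> str:
    """Stub for an LLM call that returns one of ROUTES.
    Swap in your provider's client here."""
    return "other"

def route(query: str) -> str:
    # Cheap, deterministic checks first.
    if re.search(r"\bhelp\b|\bemergency\b", query, re.IGNORECASE):
        return "emergency"
    if re.search(r"\binvoice\b|\brefund\b|\bcharge\b", query, re.IGNORECASE):
        return "billing"
    # Only fall back to the LLM when the regexes don't match.
    label = classify_with_llm(query)
    return label if label in ROUTES else "other"

def handle(query: str) -> str:
    destination = route(query)
    if destination == "emergency":
        return "Escalating to a human immediately."
    if destination == "billing":
        return "Routing to the billing tool."
    if destination == "technical_support":
        return "Routing to the support knowledge base."
    return "Passing to the general-purpose assistant."

if __name__ == "__main__":
    print(handle("I need help, my account was charged twice"))
```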
hugo: And I do think it is good to set out the space here. It is funny, and I posted this in the chat and will put it in the show notes: it was only recently [00:26:00] that Anthropic posted their blog post called "Building Effective Agents," where they defined agents in a way that a lot of us felt was right as a first approximation. So I'm actually going to read that out. They write that workflows are systems where LLMs and tools are orchestrated through predefined code paths; agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks. I think that's a nice start. And as I said earlier, there's a whole spectrum and continuum. I think the example you're giving also speaks to their idea of augmented LLMs, which is the path to agents, where they can do some retrieval and use tools and maybe have the conditionals you're talking about as well. And then essentially you were talking about chaining them, prompt chaining and these types of things. So there is a graduated path there, but I love the examples you suggested to get started with. Just as in traditional machine learning, I was going to say: never start with machine learning; if you're trying to do binary classification, start with a [00:27:00] majority class classifier. alex: Yeah, and then have that as your baseline and move from there, right? And I always think about, when I'm about to start a new project or a new feature at work, someone specific I have in mind: what would this person do? And this person is an engineer with a lot of experience, and generally speaking their first reaction is going to be: simplify things down. Don't take the complicated route, don't add a new framework, don't add a new dependency, don't add anything new; boil it down to something really simple and take a small next step. Obviously this engineer has a lot of hard-won experience in terms of things which have failed and the ways you can shoot yourself in the foot, and those are certainly lessons from software engineering that we'd do well to bring into this space of LLM engineering. hugo: And of course, we'll have all the old failure modes, and new ones will emerge given the complexity we're introducing into these systems. But that dovetails really nicely into what I wanted to chat about next, which is [00:28:00] lessons from production failure.
So from what I've seen from the database and all your wonderful work around it, agent hallucinations and scaling fragility seem to be recurring themes, but I'm wondering if you could tell us the top three failure modes, or what the lion's share of failure modes is? alex: A super common one, particularly when it comes to agents, is this kind of agent going off somewhere else and getting lost. And it's not necessarily a fault of the principle of agents or the architecture; it's something which is correctable and handleable. But it's this idea of cascading failures stemming from some initial thing that went wrong (or right, maybe), where an agent ended up in some little place which compounded as the agent kept on acting, and so on. And backtracking within agentic systems, reversing to a previous step, is probably not so easy to implement; it's probably quite complex when [00:29:00] you're dealing with agents with a lot of tools. I think about Replit's agents, if you've tried that, or Devin, or one of these systems, where it's actually hard to go back to the state you were in, whatever, 13 steps before, and then explore some other way. So there's a reason why it's perhaps common: just the way these systems are, it's easy for these things to cascade, and it's a kind of inherent limitation of the current state of the art. We can talk about all the ways to handle this with better error handling and so on, but this kind of cascading failure, in silence, particularly once you have scale, I would say is a common one and a really nasty one, because it sucks as a user experience. hugo: Yeah, that makes perfect sense. And I do think, and we might get to this, we've been dancing around it a bit, that error analysis is an incredibly important part of our field, right? And you mentioned Hamel earlier; I chatted with him the other day, and I actually recorded it (it was [00:30:00] just a Zoom chat or whatever) and put some of it online. He talked about these open office hours he's doing, where people come in and say, "Hey, I've got this multi-agent architecture and I need evals. Where do I put the evals?" And he's like, "Have you looked at your data yet?" And they're like, "What?" And he says, let's just look at the data. You go in and look at the data at each agent and see what the conversations look like and what's happening, and the error analysis you do there will very much inform what the important evals are as well. Where you see it breaking down is where you need to be as rigorous as possible, right? I am interested: your database, dude, has teams from Dropbox and Slack, some pretty sophisticated organizations. That doesn't mean that what they're doing is what we should all be doing as well. But an undercurrent of what we've been speaking about is the supreme importance of observability, actually. How do you do error analysis or look at your data if you don't have proper observability in place? You were talking about failures spiraling out of control, [00:31:00] so I'm wondering how the big dogs, so to speak, like Dropbox and Slack, address observability gaps to prevent these failures from spiraling out of control.
alex: For sure, the start of it is just more logging, more tracing, a better ability to trace through the whole life cycle. Because with agents it's not just a question of input and output; rather, it's 40 steps of things that happened, maybe from that initial thing which caused something to go down a certain way. So you need the ability to trace through the entire life cycle. And I should say none of this is new. We've had distributed systems for a while now, and the whole discipline of tracing and telemetry and so on is a thing. So we shouldn't feel, as a field or as engineers, that we have to reinvent the wheel or relearn these things. I personally found a lot of benefit just rereading distributed architecture and distributed systems books to better understand [00:32:00] this, because that's something we've been dealing with for a while. People use different tools, and there are a bunch of cool ones which I don't necessarily need to name, but there's a spectrum of how much data these companies include. Some include system information along with what was going on with particular agents and queries; some people include whatever was going on in the embedding search space for whatever that agent was doing. The data needed to store all of this can increase or decrease depending on how much specialized tooling you have. And as I said, you can have more structured architectures to help deal with your observability: you have this router-and-code pattern, which maybe makes it easier. So you have your architecture supporting the observability, rather than just slapping observability onto whatever random system you've come up with. And obviously, if you have more money, you can afford to include more humans in the loop, whether that's experts or a quorum of crowdsourced, less expert [00:33:00] humans, to review things as a safety net. You often see this in healthcare and banking, and it's really great that there are a lot of examples from healthcare and banking in the database, because people are super cautious in those spaces for obvious reasons when it comes to money and health. So it's really interesting to see how people are trying to make things robust, because I think those industries are both really trying to do that.
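To ground the "more logging, more tracing" point, here is a minimal sketch of wrapping each step of an LLM pipeline in OpenTelemetry spans so a single request can be followed end to end. It is illustrative only: the span and attribute names are assumptions, and a real deployment would export to its tracing backend rather than the console.

```python
# Minimal tracing sketch: wrap pipeline steps in OpenTelemetry spans so one
# request can be followed end to end. Span/attribute names are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())  # swap for your real exporter
)
tracer = trace.get_tracer("llm.pipeline")

def retrieve(query: str) -> list[str]:
    with tracer.start_as_current_span("retrieve") as span:
        docs = ["doc-1", "doc-2"]  # placeholder for a vector-store lookup
        span.set_attribute("retrieval.num_docs", len(docs))
        return docs

def generate(query: str, docs: list[str]) -> str:
    with tracer.start_as_current_span("generate") as span:
        span.set_attribute("llm.prompt_chars", len(query) + sum(map(len, docs)))
        answer = "stubbed answer"  # placeholder for the model call
        span.set_attribute("llm.response_chars", len(answer))
        return answer

def handle_request(query: str) -> str:
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("user.query_chars", len(query))
        return generate(query, retrieve(query))

if __name__ == "__main__":
    print(handle_request("How do I reset my password?"))
```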
hugo: Yeah, and to your point, part of it is because they're so regulated as well, right? They've got to get things right; they've got serious money and class actions on the line as well. When things go wrong, whether it's hallucinations or misfired API calls, what are the best ways, or common practices, for teams to diagnose and recover as quickly as possible? alex: The first thing is what you were just saying: look at the logs, and have, possibly, a standard way of going through them. Some of this has to do with your team dynamic: have multiple people in your team looking at the logs, [00:34:00] discussing things, working through it. That's one thing. Then also, version control and rollback mechanisms make it easier to take the pressure off a situation. Obviously when you're firefighting you can maybe diagnose a problem, but you'll often diagnose only the first-order cause of what's going on. And if you have good rollback and good failure systems, then you can go back and say, okay, this is what we saw going on, but actually, was there something underneath it, or something underneath that? So that's important. And yeah, lots of feedback loops wherever possible, whether that involves humans in the loop, user feedback, or having options for users to bubble things back up to you. In the database that's sometimes an afterthought, but once people got burned a couple of times, they realized, no, we actually have to include this stuff. And just have better ways to surface the [00:35:00] data, in ways that make sense for your company. So a lot of companies are building their own bespoke tooling around this, bespoke dashboards and so on. It's great that there are a bunch of observability tools, but it seems like the companies with real stakes here are surfacing the data in little dashboards which they built themselves, which is great. hugo: It makes perfect sense. And maybe we'll see the next wave of open source tooling come out of these companies, and then the next wave of vendor startups as well. Only time will tell. The other thing we've been dancing around a bit: we're currently speaking as if people roll these things out immediately, which, for the successful ones, is as far from the truth as possible. So I'm going to read from one of your posts. It's a flow which I really love, and from my days of thinking and working in conversational AI and chatbots, and a lot of friends working in that space, it resonates deeply. You've written up many case studies, from Thomson Reuters' [00:36:00] LLM playground to BNY Mellon's virtual assistant and Alaska Airlines' NLP search pilot, all of which have hyperlinks to the case study, all of which link to primary resources, everyone, and these reveal a landscape dominated by internal tools and limited deployments. I'm not going to share my screen to show this figure, but people can check it out. Essentially it's a tree: an initial implementation with a proof of concept; then you do risk assessment and cost evaluation (you haven't even put this out anywhere, you're just looking at it yourself, with a team of engineers, stakeholders, whatever it is); then you do controlled testing with it as an internal tool, not rolled out to external users yet, with monitoring and evaluation systems set up; then you gather feedback; then you do a limited deployment, rolling out to 1 percent or 0.1 percent or 0.01 percent of users. And of course, anyone working in conversational AI or building chatbots over the past couple of decades would know you never put a chatbot out there immediately, right? alex: Because no matter what you think users will say, they'll say stuff that will turn your chatbot into a monster, right? hugo: Right, and then you do a gradual expansion to the whole user base and then send it out to full production. alex: Yeah. And this is, again, going back to "the fundamentals matter." People have been figuring out how to roll stuff out in a robust way, in a safe way, for a while. This isn't something unique to LLMs: the whole idea of feature flags, A/B testing, canary deployments, all of this stuff. There's a discipline to this.
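As a concrete illustration of that gradual-rollout idea (not taken from any specific case study), here is a small sketch of a deterministic percentage gate: hash the user ID so the same user consistently lands in or out of the new LLM-powered path, and ramp the percentage up as confidence grows. The feature name and both code paths are illustrative assumptions.

```python
# Illustrative percentage-rollout gate: deterministically bucket users by
# hashing their ID, so a user stays in (or out of) the new LLM feature as the
# rollout ramps from 0.1% -> 1% -> 100%. Names and paths are illustrative.
import hashlib

def in_rollout(user_id: str, feature: str, percentage: float) -> bool:
    """Return True if this user falls inside the current rollout percentage."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # stable bucket in [0, 10000)
    return bucket < percentage * 100       # e.g. percentage=0.1 -> buckets 0..9

def answer(user_id: str, query: str) -> str:
    if in_rollout(user_id, "llm_assistant", percentage=0.1):
        return "LLM-powered answer (new path)"   # placeholder for the LLM call
    return "Classic rules-based answer (old path)"

if __name__ == "__main__":
    enrolled = sum(in_rollout(f"user-{i}", "llm_assistant", 0.1) for i in range(100_000))
    print(f"{enrolled} of 100000 simulated users see the new path (~0.1%)")
```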
hugo: Yeah, exactly. Is there a standout case where observability and logging turned a potentially catastrophic agent or LLM failure into a manageable incident? alex: One which sort of stood out, and maybe it's not in the healthcare or banking space, so I guess the fallout was less life-threatening, is Honeycomb. They have this great LLM-based text-to-query assistant, which they rolled out, and which I think Hamel was involved in. Or was it Hamel... hugo: Hamel and Phillip built that together. And [00:38:00] if I recall correctly, Phillip was the internal person there. That was the first LLM in production outside the big vendor APIs, essentially. alex: So yeah, they launched this, and in their case study they spoke about how the launch went well and everything, but they found there were some weird bugs where small prompt changes could have huge effects, or were affecting the ability of the thing to perform the way it was supposed to. So they added in full LLM interaction logging (maybe they even had it before but weren't looking at the logs, I forget exactly), and there were over 40 different decision points where the pipeline needed to log exactly what was going on in the interaction. They had distributed tracing, with either LangSmith or OpenTelemetry, I forget which of the two, and they also included user feedback and behavior tracking. And this prevented the situation where these LLM-generated [00:39:00] queries could have been just quite wrong; by having this kind of feedback loop, they were able to prevent that. So yeah, that's at least one which stands out to me. hugo: Yeah, super cool. alex: There's certainly probably a selection bias, in that people aren't going to put their best failure stories in these kinds of blog posts, unfortunately. hugo: Not unless you saved it, right? alex: And you have a nice story to tell. There's the famous one where Amazon had built an ML model for recruiting and it had bias, so they removed it from production and made a huge publicity statement about it: "this is biased and we removed it," right? But yeah, there's a really nice one from Weights & Biases. Early on they built their own support bot, a Q&A bot for the Weights & Biases documentation, and they were very much building in the open; there's a ton of great blog posts from them about how they were handling their evals. One of them is about how they misconfigured [00:40:00] how embeddings were generated. I forget the exact details, but the result was that they had to go and redo all of their evals, and they put a number on it: they had to spend quite a few thousands of dollars regenerating things, re-embedding things, and rerunning their evals. So yeah, it's nice when companies actually do that.
hugo: Amazing. And also, we've mentioned embeddings here and there: in terms of garbage in, garbage out, get your embeddings right. Whatever the model, your whole system's only as good as your embeddings, along with a lot of other things, and we've almost culturally forgotten about the deep importance of embeddings, I feel, but things come in and out of context, don't they? alex: It's also seen as a kind of unsexy part of the stack, I guess; no one talks about embeddings, but they're super cool, super interesting and super powerful as well. hugo: Totally. So I'm interested in getting some actionable takeaways for all the devs listening and watching. For developers integrating LLMs into their pipelines, what's one practical [00:41:00] step they can take right now to hopefully mitigate unreliable agent behavior? alex: Like we were saying: add structure, add constraints to what your LLM or your agent is doing in your system. If possible, see if you can do it without an agent; see if you can do it in some way which is as structured and as robust as possible. The idea that you can just say "go away and do this complicated thing" and think there are going to be no failure modes or error modes is, yeah, insert some bad word. So just add constraints where possible. This is going to help simplify your life when you have to debug this thing; it's going to prevent problems, prevent users being unhappy, all of these kinds of things. And as part of that, add better error handling, better fallbacks, all of these kinds of things. hugo: Also for developers: something that's so exciting about the space currently is how much we can just rapidly experiment with new things and never ship anything that works, right? I suppose that's the half-joke, but my question is, how do you recommend balancing [00:42:00] rapid experimentation with the need for reliability in production systems? alex: Yeah, that's a big question; we could do several podcasts on it. That's the holy grail question. It's also like "when are notebooks useful versus hard code": someone will block you regardless of what you say anyway. But yeah: experiment away, start small, keep things constrained, take baby steps to the next thing. See if you can sandbox things as much as possible, or perhaps simulate environments, so that you're not necessarily touching what users are doing. Plus all of the stuff you've been talking about, the SRE stuff around A/B testing, phased rollouts, canary deployments: there you can have teams which are iterating fast and doing things, but your exposure to users when things go wrong can be super small, you can catch things early, and not many people will necessarily have interacted with it. It's obviously super use-case dependent: if you're in healthcare or banking, [00:43:00] take a little bit more care than you would if it's just an internal documentation Q&A tool or something like that. Plus just all of the standard stuff: version control, being able to reproduce things, logging, monitoring, all of that. hugo: Super useful. And I love that you mentioned simulating the experience as well. So even if you've just built a prototype and you're playing around with it yourself, you can simulate user queries, essentially using an LLM to simulate them, and in a for loop generate a thousand user queries. Look at the responses, output them as a CSV or whatever (don't do your SQLite or your Postgres databases yet), then put it in a spreadsheet and just have a look at the conversations in a Google Sheet or an Excel sheet. It will amaze you what these systems do, and you'll notice a bunch of failure modes anyway. The ability to generate synthetic data to simulate conversations at an early stage, I think, is incredibly useful.
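Here is a minimal sketch of that loop: use one model call to invent user queries, run each through your prototype, and dump everything to a CSV you can eyeball in a spreadsheet. The two stub functions stand in for your actual model client and application; they are assumptions, not any specific team's setup.

```python
# Illustrative synthetic-query loop: generate fake user queries with an LLM,
# run them through your prototype, and dump query/response pairs to a CSV
# for manual review. `generate_queries` and `my_prototype` are stand-ins for
# your own model client and application code.
import csv

def generate_queries(n: int) -> list[str]:
    """Stub: in practice, prompt an LLM for n realistic user queries
    (personas, edge cases, adversarial phrasing, typos, etc.)."""
    return [f"synthetic query #{i}" for i in range(n)]

def my_prototype(query: str) -> str:
    """Stub: call your actual LLM-powered app here."""
    return f"stubbed response to: {query}"

def run_simulation(n: int = 1000, path: str = "simulated_conversations.csv") -> None:
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["query", "response"])
        for query in generate_queries(n):
            writer.writerow([query, my_prototype(query)])
    print(f"Wrote {n} simulated conversations to {path}; open it in a spreadsheet.")

if __name__ == "__main__":
    run_simulation(n=50)  # start small; scale up once the loop works
```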
This is one way to think about these things, but I'm wondering, from all the [00:44:00] work you've done and all the case studies you've seen: if someone wanted to test an agent in production but minimize the blast radius, what's the best way to sandbox or constrain its actions? alex: It's like what you were saying about having constrained environments, and that can mean lots of different things. It can mean limiting exposure in terms of the number of users, or making sure you have what we were talking about earlier, a do-nothing strategy for uncertainty, instead of just always doing something, because often the worst things happen by just keeping on doing things, or keeping on talking. Obviously: strict access controls, strict permissions in terms of what these agents can do, making sure things can always be rolled back, and even just very boring things like limiting the number of API calls that this highly experimental agent can make, or limiting the resources it can call on, so it can't bring your whole system down because it commandeered a cluster of GPUs or whatever. A lot of this stuff is boring, but it's important if you want to be able to have your cake and [00:45:00] eat it: to experiment, but also have real users and scale and all of that. hugo: Yeah. And to your point, limiting the number of API calls is a great idea, and limiting the amount of resources it can take; you don't want it to accidentally turn into a paperclip maximizer. Even when Anthropic introduced computer use, what, six or seven weeks ago, they were like, hey, only use this if you know what you're doing, and please do it in an isolated Dockerized container that can't escape. They were very prudent about it. I am also interested, just as a heuristic: what's a simple guardrail you think can dramatically improve the reliability of agents without adding too much overhead? alex: I mean, that's an easy one, and it's still surprising how big an impact it can have: structured outputs and output validation. There's a spectrum of this, from having specific schemas to using the built-in tools which some of the API-driven [00:46:00] model providers offer. Whether it's Instructor or Pydantic or your flavor of choice, this relatively simple thing, which was one of the first things people wanted and got out of LLM providers, just brings so much more reliability, reduces errors, and makes it much easier to debug. It's pretty easy to do, minimal overhead, and in general it's just going to make your system nicer.
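As a concrete, hedged illustration of that structured-output guardrail: define the schema you expect with Pydantic and refuse to act on anything that doesn't validate. The model call below is a stub; libraries like Instructor, or provider-native structured outputs, wrap the same idea with retries built in.

```python
# Illustrative structured-output guardrail: validate the model's JSON against
# a Pydantic schema before acting on it. `call_llm` is a stub for your client;
# the SupportTicket schema is an assumption for illustration.
from pydantic import BaseModel, Field, ValidationError

class SupportTicket(BaseModel):
    category: str = Field(description="one of: billing, bug, feature_request")
    priority: int = Field(ge=1, le=5)
    summary: str

def call_llm(prompt: str) -> str:
    """Stub: ask the model to answer with JSON matching SupportTicket."""
    return '{"category": "billing", "priority": 2, "summary": "Charged twice"}'

def parse_ticket(user_message: str, max_retries: int = 2) -> SupportTicket | None:
    prompt = (
        "Classify this support message as JSON with keys "
        "category, priority (1-5), summary:\n" + user_message
    )
    for _ in range(max_retries + 1):
        try:
            return SupportTicket.model_validate_json(call_llm(prompt))
        except ValidationError as err:
            # Feed the validation error back so the model can correct itself.
            prompt += f"\nYour last answer was invalid: {err}. Return valid JSON only."
    return None  # caller falls back to a human or a default path

if __name__ == "__main__":
    print(parse_ticket("I was charged twice for my subscription"))
```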
hugo: Yeah, absolutely. So now I want to do a bit of future thinking. I really appreciate you setting out what we're all capable of currently and what's aspirational. We are seeing a lot of excitement around multi-agent ecosystems, so I'm wondering if you think this will become mainstream, or whether a simpler single-agent setup is more practical for now. alex: A lot of people really want multi-agent systems to become a thing, and it seems like a lot of money is being pumped into trying to make it a thing. [00:47:00] The short answer is, I don't really know. None of us know, and none of us know how long it will take for these systems to become really reliable. Think about self-driving cars or whatever, where we've been at this 95 or 98 percent point for many years now, and I don't know whether you get the same thing with multi-agents, where the last bit is really going to take a long time. At the moment, even single-agent systems are still a bit of a struggle; they still have a ton of failure modes and failure points. There is potential, for sure, but whatever we're going to get in the future, it's still going to need a ton of engineering discipline around it, and we're only just starting to develop the frameworks, or the ways, for single-agent systems to work coherently. I think it will still take a while before multi-agent systems; it's still a long way [00:48:00] before these things become mainstream, I think, even though there's a lot of hype and a lot of new frameworks and potential. I don't think that when it comes to this time next year we'll be having everything run by multi-agent systems. hugo: Absolutely. We all know about the hype cycle. The only thing we get wrong with the hype cycle is that we think it has one local maximum, whereas it's more like a wacky roller coaster over the decades, isn't it? We talked a bit about human-in-the-loop approaches, and I'm wondering what role you think they will continue to play, especially as agents become more capable. alex: I think it will be there for a long time, actually. It depends a little bit on the task, the extent to which people have either money, or legal, or high-stakes needs to add humans into the mix, but it helps with accuracy and reliability in those kinds of high-stakes domains, helps with handling edge cases and unexpected situations, gives you some feedback to bring into the loop, and can help with explainability. I just think [00:49:00] the more complicated these systems become, the more you're going to have humans in the loop, but I think you'll have a parallel development of tools to help humans analyze what was going on in these systems, because humans aren't LLMs: you can't just dump a full stack trace or whatever and expect people to cope. You're going to need better UIs and so on to navigate this stuff, particularly when you have multi-agent systems. hugo: And I'm wondering, just finally: if there's one mindset shift you'd recommend for developers moving from agent prototypes to production systems, what would it be? And I'm not only talking about agents; I'm talking about LLM-powered software more generally, from LLMs to augmented LLMs to agents. What would the mental shift moving from prototypes to production systems be? alex: I think it's as simple as shifting from thinking about the LLM at the center of things, which is obviously what we do since it is this new shiny thing, to thinking about the system: from [00:50:00] model-centric, LLM-centric to system-centric. People are focused on the LLM because it's this new thing, and it's what people are building.
It's the thing which keeps getting switched out and made more capable. But when you're talking about production systems, you do need a holistic view. You need to think about the system in as holistic a way as possible; the LLM is just one component. You need to think about performance, reliability, scalability, security, cost effectiveness, observability, latency, all of these things. And "holistically" includes your users, right? We're not just building tech for the sake of building tech; you want to serve users. Does this thing actually solve a problem which people have? And how are we monitoring it, and how are we continuously improving it? I think it's a different perspective, and it's easy to get lost in the technical deep dive of whatever is going on with a particular model, but in [00:51:00] production you need to think holistically if this thing is going to be useful and do good in the world. hugo: Absolutely. And from all the work I did on Metaflow and with the people at Metaflow, I'm a big fan of thinking about building systems, not just models, as well. Alex, you also added over a hundred case studies in the past week or something like that; did I get that wrong? What's the future of this project looking like? And if you do that over the holidays, I don't know what you're going to do when we're fully back into the year. alex: My ability to fill it up is limited purely by the engineers who write blogs. For sure, anyone listening: I encourage you to write about your failures, and I encourage you to write about your successes. The more technical details that are shared, the more useful it is for everyone. We'll continue to maintain this, and there are lots of different ways we can try to make it more useful. There are still a few more of these overview blogs that I'm going to put out, one on security and one on cost [00:52:00] effectiveness, that are still to come. Beyond that, we might see how we can make the search easier if you want to drill into one particular tool and see how lots of different companies are using it. There are different ways we've been thinking about that, but really it's just that I enjoy reading these case studies, and we wanted to make it helpful for others to do the same. hugo: Super cool. Well, thank you for going out and doing all the work and bringing it back. If people wanted to connect with you on LinkedIn or Twitter, what's your handle? You're Alex Strick van Linschoten on LinkedIn. alex: Yeah, and it's just strickvl everywhere on the internet. hugo: Awesome, strickvl, and I'll put that in the show notes as well. I'd love to thank you for all of your expertise, and I'd like to thank everyone for joining the live stream as well. I got so excited at the start that I forgot to ask what you're all up to at ZenML currently, because of course all of this work is supported by the wonderful people at ZenML. So, if you could tell us what you're up to, and people can visit as well. alex: Yeah, ZenML continues: [00:53:00] we're continuing to build out ways for ML platform teams to support the work that people are doing within organizations, whether it's enterprises or small companies, with machine learning pipelines and machine learning workflows. We're still figuring out how ZenML and MLOps fit into this world where LLMs and agents are involved.
I think everyone is still figuring that out. And we've been talking about agents and LLMs and whatever, and it's easy to forget that organizations are still doing bread-and-butter prediction, small models, decision trees and all of this good stuff as well. When you look at the overwhelming majority of machine learning in production, it is that stuff, which is accepted as normal now, but even a few years ago it was also cutting edge at the time. So yeah, we're working to support that, as well as to understand how this new world fits into it. hugo: Fantastic. And yet a lot of people are still counting things, still doing ETL and figuring out how to [00:54:00] get those basic dashboards up and running. I know I said we're going to wrap up, but one final question: you were kind enough to invite me on a podcast several years ago. Are you still podcasting at all, or doing anything that people can listen to? alex: That podcast is on hold at the moment; that was Pipeline Conversations, exactly. But we have, I don't know, 30 or 40 episodes, and there's a bunch of cool conversations; we stopped before the LLM wave started. Some of the NotebookLM podcasts we're putting out go through that podcast feed, so if you subscribe, you'll get those as they get released. hugo: Fantastic, and I'll put that in the show notes as well. And look, if people want to listen to a podcast from a couple of years ago, that could be very exciting; there's a great variety of stuff there. I'm just looking now: there's one with Goku Mohandas, who is incredible on all of the ops stuff, and one on education and full-stack work with Charles Frye from Modal, which is great, and I was on it, of course, so definitely check it out. And machine learning at the British Library: I'm going to go and listen to that. It's one of my favorite places in the world; little-[00:55:00]known fact, I love the Rosetta Stone. So there you go. Alex, we could wax lyrical for a long time, and I'm sure we will in future, but it's time to wrap up. Once again, thank you so much for your expertise and wisdom and generosity. And thanks everyone for joining as well. alex: Thanks for having me on. Real pleasure.