The following is a rough transcript which has not been revised by Vanishing Gradients or the guest. Please check with us before using any quotations from this transcript. Thank you.

===

Alex: [00:00:00] Imagine a future, like five years from now, where every single app has all of these little ambient agents or background agents doing things for you, improving things here and there. And you can already live that future. You can shove a thousand MCP servers into your application and you can have things autonomously improving it, whatever. What you're gonna have is chaos, basically. And the companies and the use cases where people are actually doing things that really work are where people have heavily scaled things down, where people have focused on specific use cases. So that's, I guess, one lesson: keep it as simple as possible.

Hugo: That was Alex Strick van Linschoten, a machine learning engineer at ZenML. Alex just gave us one essential lesson for building AI-powered software: keep it as simple as possible to avoid the chaos of premature scaling. In the rest of the episode, we'll dive into many other critical lessons Alex has uncovered through his phenomenal work building the LLMOps [00:01:00] Database, an incredible resource cataloging nearly a thousand real-world LLM and agent deployments. This database gives us a unique, data-backed view of what's truly working and what's failing in the transition from proof of concept to enterprise production. Our conversation today is a masterclass in production reality for builders and technical leaders. We discuss the critical importance of implementing simple MLOps and evaluation hygiene to effectively debug failure modes, how to manage the unpredictable reliability cliff of multi-tool agentic systems, how leaders can distinguish between agent hype and actual deployment reality, and much more. I learned a tremendous amount about moving LLM and AI projects from the lab to production scale, and I hope you do too. This conversation was a guest Q&A from a recent cohort of our course, Building AI-Powered Applications, which I have the great pleasure to co-teach several times a year with my [00:02:00] friend Stephan Rauch, who works on AI agent infrastructure at Salesforce. Links in the show notes if you're interested. Also, if you enjoy these conversations, please leave us a review and give us five stars on your app of choice, and share with your friends and colleagues. I'm your host, Hugo Bowne-Anderson, and this is Vanishing Gradients.

Hey there everyone, Hugo Bowne-Anderson here. I'm so excited to be here with Alex Strick van Linschoten. I hope I got that almost correct. That's alright, I'm improving. So great to have you here today, Alex. Everyone, as always, if possible, cameras on and microphones off would be super cool. Anything for the Q&A we'll have in Discord so we can keep the conversation going there, in the channel I mentioned yesterday called Special Guests July Build with LLMs. I'm so excited to be here with Alex, who's a machine learning engineer at ZenML, to talk about AI in production: [00:03:00] lessons from over 750 real-world industry use cases. Earlier this year, or was it December, you launched the LLMOps Database. Amazing. This was one of the first things of its kind: a huge collection of what's up in the space, what's happening, what people are building, the challenges people have. I very much encourage everyone to check it out. Not only is there a huge amount in there, but I love the information architecture.
As I've mentioned to you before, Alex, you've got these blog posts where you can drill down into particular summaries that you've put together, and then drill down into primary resources — you link through to YouTube videos of people talking about this stuff, or their blog posts. Really well done there. I presume part of it is because you are — are you trained as a historian? Did I get that right?

Alex: Yeah, yeah. So I've actually only been working on the technical side for three or four years. Before that, I was a historian. Yeah, running around the world, actually doing field research and so on.

Hugo: Amazing. And you may have seen that Alex and I actually recorded — we live-streamed — a podcast earlier this year. I can't believe it was only six [00:04:00] months ago or so; it seems like a long time ago. But let's jump in, man. I'm so excited about everything you've been working on and all the new things you have in the database, and there's a bunch of new stuff around agents, right? So I'm wondering, after cataloging over 750 LLM deployments currently, what are two to three lessons that jump off the page for you with respect to agents?

Alex: Okay. Yeah, obviously that's a really big question. I will give you my answer, which is what I take away from it, which isn't necessarily what the case studies themselves take away from it, if you can see the distinction there. Firstly, one of the key things which I have found while reading and uploading all of these things into the database is just to be aware of how many failure modes there are, and as a result of that, basically: whenever you are building some kind of a system like this, try to keep things as simple as possible for as long as possible. And if this means that you never [00:05:00] need to use AI in your application and a regex or something does the job, do that, a thousand times over. I think we can all see all of the tools which are available and things which are around, and imagine a future, like five years from now, where every single app has all of these little ambient agents or background agents doing things for you, improving things here and there. And you can already live that future. You can shove a thousand MCP servers into your application and you can have things autonomously improving it, whatever. And what you're gonna have is chaos, basically. And the companies and the use cases where people are actually doing things that really work are where people have heavily scaled things down, where people have focused on specific use cases. So that's, I guess, one lesson: keep it as simple as possible. Then, following closely on from that, just the importance of basic hygiene, or just the basic fundamentals of [00:06:00] tracking everything and using those traces — whatever you're tracking in terms of your application — in a kind of continuous evaluation system. And that's a lot of buzzwords, and there's a lot of different ways that you might actually have that happen or show up in your application, but yeah, if you're not doing that, then you have no idea where your application is failing and you also have no idea how to improve it. And I guess the third one is just another fundamental: quality in, quality out.
So whatever data — or "context engineering" is the new hype word these days — but yeah, whatever you are passing into your LLM call or into your agent system, or whatever you are getting from your RAG retrieval component, this stuff needs to be as good as possible if you're gonna get good results. Yeah, I mean, I probably could have given you the same answer a few months ago, and I also don't know that those fundamentals are gonna change too quickly. But yeah, [00:07:00] those are the top three lessons, I guess.

Hugo: I love that. And I think I probably did ask you something similar in January, and I think we did discuss similar things, so I'm interested in how the space has changed. And everyone, remember, if you have a question, pop it in Discord; if there's a question you want answered, give it a thumbs up as well. But Alex, I get the sense that you and the team have been somewhat surprised with respect to how much agent adoption has happened. And I posted a blog post — correct me if I'm wrong, but I posted a blog post that you wrote in December in Discord — and it's around the state of agents. There were four points you raised, among other things: reliability — LLM agents are notoriously unpredictable; scalability and cost — today's state-of-the-art LLMs are ravenously resource hungry; three, security and access control — LLM agents' open-ended nature and potential for misuse raise security questions; then observability and debuggability — understanding why an LLM agent made a particular decision is notoriously difficult. And we can index on any of [00:08:00] these. I think the last one's nice because it speaks to a lot of what you were just talking about, but I feel there's a bit of cognitive dissonance in my head: all this adoption of agents that you've noticed, yet they're still difficult to observe and still very unreliable. So can you help me mind-meld these two seemingly contradictory aspects of the space?

Alex: The boring part of the answer to that question is just that everyone is jumping on because there's money and there's investment, and when you say you have a multi-agent setup or whatever, people will add another extra million to whatever you're raising. So that's the perhaps less interesting part of it. I think we are a little bit further on from when we last spoke in December. I was even reminded today that OpenAI's deep research agent only launched in February — for some reason, I had the feeling that it's been around for much longer. So yeah, you have a strong adoption curve which is driven by hype and FOMO and so on, and I think [00:09:00] a lot of people will market what they're doing as something multi-agent, or they will talk about it in these kinds of terms, so there's a kind of vocabulary for the conversation. But when you actually look into what they're doing — and the case studies are quite interesting for this, actually — they structure things quite a bit, and the cases where things are a little bit further towards the autonomous side of the spectrum are most often internal use cases where they can roll things back easily.
Or it's chatbot research or agentic discovery, this kind of thing for internal customers, where the costs of failures aren't really that high. And anytime you are touching a very high-sensitivity domain — healthcare, finance, these kinds of things — then you start to have a lot of blocks along the way, or checkpoints, where things are farmed out to humans or whatever. So yes, agents are increasingly present, and yeah, 2025 somehow [00:10:00] is the year of agents, but probably 2026 will also be a year of agents, where these things continue to be worked on and people try to figure out how to make them robust.

Hugo: Absolutely. The agentic future is bright, and I really like your clarification — and I paraphrase — around how people are building a lot of agents, but they're not necessarily serving them to enterprise customers or healthcare patients, these types of things. Agents can deliver a huge amount of value at the moment, but in my opinion, it's when you have a human in the loop in the correct way. To step back a bit — remember, everyone, and we'll get to this in the course — agents are LLMs with tool calls, so they can access and use tools and functions, wrapped in some sort of while loop or for loop, with some context, right? And to that end, when an LLM is calling a tool, when the output of an LLM is triggering a tool call, the state of the art is like 85 to 90% accuracy or something. And if [00:11:00] you get four or five of those in a row, it goes down to 60; a few more, and it's a coin flip, right? So the point is that these things are still eminently unreliable. So either, as a developer, you need to make hard decisions around how you put these things in business logic, so they can only do certain things, or the user has to be in the loop in a way where they feel confident that they can work with the system. I think Cursor is a wonderful example. There are certain things, of course, that infuriate me when working with AI-assisted programming and building, but I am in the loop in a way, and expectations are set in a certain way. So I think that gets to — I tell people, stop building agents for the most part, but that's when trying to build consistent and reliable software. Right, Alex?

Alex: Mm-hmm. Yeah. I think "stop building agents" means many things to many different people. I think people are increasingly seeing the capabilities of these fairly specialized tools — even the OpenAI deep research tool is somewhat streamlined. [00:12:00] And I guess just as an addendum to what I said earlier, we do have these vertical agents which are very good at one quite straitjacketed, narrow domain, so that is one area where people are finding successful consumer deployment of agents. But yeah, it really depends on the domain you are working in and your level of risk, and the level of risk at different points in your workflow — it depends on lots of things, really. But going back to what I said at the beginning: if you can get away with not having something which is fully or semi-autonomous, then you really should, and it'll be much easier to debug and evaluate and improve.
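A quick back-of-the-envelope sketch of the compounding reliability Hugo describes above. It assumes each tool-calling step succeeds independently with the same probability, which real systems won't strictly satisfy, but it shows why a handful of chained calls turns "pretty accurate" into a coin flip.

```python
# Compounded success probability of a chain of tool calls,
# assuming each step independently succeeds ~90% of the time.
per_step_accuracy = 0.90

for n_steps in (1, 3, 5, 7, 10):
    chain_success = per_step_accuracy ** n_steps
    print(f"{n_steps:>2} chained tool calls -> {chain_success:.0%} end-to-end success")

# Five steps is already ~59%, and by seven steps you are close to the
# coin-flip territory mentioned above.
```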
Hugo: Absolutely. And a lot of the case studies that you have currently are people building things to have humans in the loop — whether it's BlackRock's, what is it again? Yeah, Aladdin Copilot — or what IBM is doing: they're things for people to really be in the loop with, and not just serve as software, right?

Alex: Yes. And [00:13:00] some of the places, or the things, where you might want to have split decision points — where something maybe gets sent to a human versus the LLM handling it — these can often be not only for robustness and reliability, but also for cost saving as well. Because a more reliable system might just reroute everything to a human, but that can end up being quite costly. A lot of the case studies interestingly showed that when you develop a system which does have some kind of routing capability, where you can pick out the queries which have the lowest risk versus the highest risk, then it's actually only a small fraction — on the order of 10 or 15% of queries — that end up needing that extra intervention or escalation. So I think that's a useful lesson which has come out of industry deployments where people have tried this: there is a kind of 80/20 going on with what needs that kind of higher touch.

Hugo: Makes sense. We've got a great question in the Discord chat, and that [00:14:00] is from Evan Williams — whose handle is, yes, "or the bourbon," which I love. Evan asked: is there a sweet spot for the number of tools an agent has access to? Is it preferable to have multiple agents with fewer tools instead of one agent with every tool? And I want to situate this question in the context of what you've seen recently, which is a lot more adoption of multi-agent systems.

Alex: Yeah. And the thing which really brings this into relief is the adoption of MCP, I would say, more or less everywhere. Okay, not everyone supports it and there's still a good deal more adoption to go, but you often will see people just shoving in more and more MCP servers, and each MCP server may have, I don't know, 10 or 15 tools conservatively — some of them have many more. And we don't actually have really good data or studies on where we reach some kind of maximum here, or minimum reliability. At least everything that I've seen in the older studies on tool calling suggests the number is probably way [00:15:00] lower than what people think. And I think there's gonna need to be a lot more work done in terms of figuring out exactly which tools to call and how to manage that context, because at the moment we're just living on the edge and seeing what we can get away with. So that's, I would say, a problem coming on the horizon around tool calls. The rule of thumb is just to try and constrain that as much as possible — a smaller number of tools where possible — and yeah, there will be a sweet spot. Obviously you can have an agent with only five tool calls, but if one of them is something like computer use, which is a kind of jack-in-the-box, that tool call can do anything, which obviously is a separate degree of unreliability, 'cause if it can do anything, there are infinite ways it can go wrong as well. Yeah, just bear in mind the principle of keeping it as low as possible.
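A minimal sketch of what that constraint can look like in code: each narrow agent gets a small, explicit allowlist of tools rather than everything from every MCP server, plus a human-escalation path for anything outside its lane. All the names here (check_order_status, AGENT_TOOLS, and so on) are hypothetical, just to illustrate the shape.

```python
# Hypothetical, illustrative tool functions -- stand-ins for real integrations.
def check_order_status(order_id: str) -> str:
    return f"Order {order_id}: shipped"

def issue_refund(order_id: str, amount: float) -> str:
    return f"Refund of {amount:.2f} issued for order {order_id}"

def escalate_to_human(reason: str) -> str:
    return f"Escalated to a human agent: {reason}"

# Small, explicit allowlists per sub-agent instead of one global tool pool.
AGENT_TOOLS = {
    "order_status": [check_order_status, escalate_to_human],
    "refunds": [issue_refund, escalate_to_human],
}

def tools_for(agent_name: str) -> list:
    """Return only the scoped tool set; anything unknown can only escalate."""
    return AGENT_TOOLS.get(agent_name, [escalate_to_human])
```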
Hugo: I think that is a really nice heuristic, to be honest, and that's something I do say when I consult and when I teach: when you have some business problem you want to solve, [00:16:00] start with the smallest possible thing and then build up from there. It's really fun to use frameworks like CrewAI to build these multi-agent things, but due to the observability and traceability issues and the failure modes, a lot of the time you do end up hand-rolling something yourself with respect to the number of things an agent can do. This is something that we saw — well, you saw, and we talked about — six months ago when you launched the database: that a lot of the most successful cases, at least then, actually had business logic which constrained what the capabilities were. So, for example, one can imagine a travel agent assistant which can do four things — Mm-hmm — book flights, book hotels, book excursions, and book transport, right? And what happens is, when you chat with the agent, it decides or figures out which piece of business logic to put you in, and in that piece of business logic you can only book a flight. If you then want to move to another part, you do that, and if you're trying to get outside these four things, it tells you, "I'll connect you with a human," or something along those lines.

Alex: Boil it down even further, right? Don't even expose a chat [00:17:00] interface — maybe just have four buttons, each of which takes you to the different agents, and then you rid yourself of a whole bunch of other possibilities when people try to, I don't know, tell it to pretend it's a cat or whatever.

Hugo: Totally. But is this something you're still seeing, or are people YOLOing more now?

Alex: I would say people are trying things out, and certainly we've seen at ZenML, with lots of our customers, that people are experimenting, and usually teams will have some kind of little LLM or GenAI vanguard that is trying things out and living on the bleeding edge a little bit. But when it comes to things which are customer-facing, people are just locking it down as much as possible, I would say.

Hugo: That's the only way that—

Alex: Works, right? Yeah.

Hugo: Exactly — at the moment, at least. A great question in the chat: are there any specific evaluation approaches for agentic systems? How do you think you could debug failure modes? And I'll kind of seed the witness: I do know that something you've seen emerge is a lot more evaluation-driven development, and also incorporating guardrails and testing through CI/CD [00:18:00] and that type of stuff.

Alex: Yeah, in terms of systematic approaches, I think it's just keeping it simple: firstly, instrumenting everything, making sure you have all of your data, and starting with a very simple harness. No need for crazy, fancy things — no need, from day one, for fancy systems or special frameworks. It really is just a bunch of cases where you've seen things that can go wrong, and a for loop which just checks whether, when you fix something, it still stays fixed. That's the simplest version of it. And then you can get more fancy — you can parallelize these things. You probably want to have a subset of those test cases that you want to go through, something that runs every time you make a change, and then a full set which runs before you make a release or something, 'cause with evals you can quite quickly escalate your costs, depending on your use case and your models and so on. And also come with your data scientist hat on and have some [00:19:00] hypotheses about how things might be improved — bring a kind of scientific mindset to what you think might be going wrong, trying to make a fix, and then iterating around that loop. There are by now quite a few cases in the database of people doing this and keeping it really simple: just focusing on one thing, improving that, doing the loop, and improving. That's a pretty common thing by now, I would say. But it's also quite common for people to just deploy things and not have any evals — I've seen quite a few, even in the enterprise category. You quite often see this kind of factory pattern, where people will have a way to create your own custom agent for your customers or your data within your enterprise. Obviously, if you're doing evals on this, then you need custom evals for each agent that your factory has created, and so on, and it gets very complicated. So people are living a little bit on the edge, I would say. Yeah.
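A minimal sketch of the kind of harness Alex describes: a list of cases you've seen go wrong, a cheap deterministic check for each, and a for loop that tells you whether a fix stayed fixed, with a small smoke subset for every change and the full set before a release. The run_app function and the cases themselves are placeholders, not a real framework.

```python
def run_app(query: str) -> str:
    # Placeholder: wire in a call to your actual LLM application here.
    return f"[stub response to: {query}]"

# (input, deterministic check on the output, include in the quick smoke set?)
CASES = [
    ("Refund order ORD-000123", lambda out: "refund" in out.lower(), True),
    ("What's the weather like?", lambda out: "refund" not in out.lower(), True),
    ("Summarise this 40-page contract ...", lambda out: len(out) < 4000, False),
]

def run_evals(smoke_only: bool = True) -> list[str]:
    """Run the smoke subset on every change; the full set before a release."""
    failures = []
    for query, check, in_smoke_set in CASES:
        if smoke_only and not in_smoke_set:
            continue
        if not check(run_app(query)):
            failures.append(query)
    print(f"{len(failures)} failing case(s): {failures}")
    return failures

if __name__ == "__main__":
    run_evals(smoke_only=True)
```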
Hugo: And as perhaps one should be — it's pretty [00:20:00] exciting and interesting to live on the edge, particularly for people who are curious and entrepreneurial. I'm glad that you mentioned building a basic eval and this iterative loop of building, observing, and iterating on your evals, because as you know, that's what this course is all about. I also do wanna say, to your point, that doing the LLM-as-judge stuff can be increasingly expensive as you scale. But something I see a lot, and I'm interested in whether you have as well, Alex, is people cleverly combining LLM-as-judge with regular Python: regular expressions, string matching, fuzzy matching. Making sure you're not going to use an LLM as a judge to see if you've got JSON output, right? That would be using a jackhammer to make an omelet.

Alex: Yeah, for sure. It's not every team — many teams do — and the tools unfortunately seem to encourage people to use LLM-as-judge for everything, even the very simple things. So that's something I hope, probably over time, there will be a kind of course correction to. But yeah, the more you can remove failure [00:21:00] situations via some kind of dumb method or easy tool, the better, 'cause obviously it's faster and more reliable, and there are lots of reasons why you should do that.

Hugo: I totally agree. And yeah, to your point, a lot of frameworks, if you go to the evaluation tab in their documentation, or the quickstart, or the hello world with evaluation, it will start with the LLM as a judge. I understand that's for several reasons — due to the shiny objects, and also these are venture-backed companies a lot of the time, right? So there is this "scale" vibe. I do jokingly call it a venture-backed tragedy of the commons, but part of the reason we have to do this work to help people is because of that side of things. So I do sincerely hope there's a realignment as well.
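To make that concrete, here is a hedged sketch of layering cheap deterministic checks in front of an LLM judge: structural and format failures get caught with plain Python, and only the genuinely fuzzy criteria fall through to the expensive call. The llm_judge function and the ORD- order-ID pattern are made-up placeholders, not a real API.

```python
import json
import re

def is_valid_json(output: str) -> bool:
    # Structural check -- no judge model needed for this.
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def mentions_order_id(output: str) -> bool:
    # Format check with a plain regex (hypothetical ORD-123456 convention).
    return re.search(r"\bORD-\d{6}\b", output) is not None

def llm_judge(output: str, rubric: str) -> bool:
    # Placeholder for whichever judge model and prompt you actually use.
    raise NotImplementedError("call your judge model here")

def grade(output: str) -> bool:
    if not is_valid_json(output):
        return False      # cheap, deterministic, fast to fail
    if not mentions_order_id(output):
        return False      # still no LLM call spent
    # Only now pay for the slower, noisier, more expensive check.
    return llm_judge(output, rubric="Is the tone polite and the policy stated correctly?")
```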
Hugo: Straight outta Hamburg, Germany, someone asks: do you have any experience, or know of experiences, of combining MCP and A2A concepts in multi-agent systems? And perhaps, for those who don't know, you could just quickly say what MCP is — in three words, no, I'm joking — and [00:22:00] A2A as well.

Alex: Sure. Yeah. So MCP is basically how to plug a bunch of tools into your system. Let's say you have an LLM — just whatever, Claude or OpenAI — but it doesn't know how to, I don't know, get the latest stock prices or whatever: that's MCP. And A2A is basically a new framework from Google for allowing different agents to communicate — it's a protocol to allow different agents within the same system to communicate with each other. That's the very, very short version. I would say A2A is very new. To put a geographic spin on it, I only know Americans who are building with it — Americans, in any case, are always a little bit ahead of the curve. But yeah, you won't really find anyone building with A2A in Europe, at least in big use cases; I think it's a little bit early. There are not too many things in the database — I would maybe even say nothing in the database — where people are using this in industry yet. But that will come; we do need [00:23:00] protocols for that kind of thing, but it's just a little bit early, I would say. So yeah, I don't know the answer to that yet. We'll figure it out as a—

Hugo: And also, to be clear, the adoption of MCP is actually quite wild, and I can't think of any historical precedent: Anthropic came out with MCP, and Google DeepMind, OpenAI, et cetera, and Microsoft adopted it almost immediately, within the course of a couple of months or something like that. Mm-hmm. And these companies aren't even frenemies, let's say, in a lot of ways, right? So that type of adoption is pretty serious stuff.

Alex: Oh, they couldn't ignore it. There were so many users, and so many people wanted it, that at a certain point they all joined, and it's great — we all benefit from that. No one benefits from everyone having their own custom standards or custom specifications. And, to put on my ZenML hat a little bit, it's actually really nice to see some standards starting to emerge in this LLM world. We saw [00:24:00] similar patterns in MLOps — model serving, model deployment, whatever, is a nice example. But yeah, in the LLM world, we're gonna need more standards if we're gonna have reliable ways to deploy and work on this stuff.
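For readers who haven't touched MCP, here is a minimal sketch of "plugging a tool in," to the best of my understanding of the official Python SDK's FastMCP helper. The stock-price tool echoes Alex's example and is a hard-coded stub, not a real data source.

```python
# Minimal MCP tool server sketch (assumes the official `mcp` Python SDK,
# i.e. `pip install mcp`; the tool itself is a hard-coded stub).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("market-data")

@mcp.tool()
def latest_price(ticker: str) -> float:
    """Return the latest price for a ticker (stubbed values for illustration)."""
    return {"AAPL": 210.0, "MSFT": 450.0}.get(ticker.upper(), 0.0)

if __name__ == "__main__":
    # Serves the tool so an MCP-capable client can discover and call it.
    mcp.run()
```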
Hugo: Absolutely. Laura Thomas has a great question: when you have lots of different places you can intervene in a multi-agent system, for example, how do you reason about which intervention point to choose and explore? Now, I do want to just give a brief answer to this first. The way I think about this, and what I've seen be most successful, is — once again — if you start small, you don't necessarily have this issue in the abundance you would if you start, rather than with a basic single call, with a multi-agent—

Alex: System.

Hugo: Exactly, right. So if you start with a basic system which has an LLM call or two, a couple of tool calls, maybe some retrieval, then there are only several points at which you need to intervene. And I do think the biggest lift I see in early product development — the intervention, or the levers you pull — are [00:25:00] prompt engineering, or prompt alchemy as I like to call it, and improving retrieval: what's your OCR, are you actually getting what you think you're getting, and better chunking and re-ranking your chunks, that type of stuff. I'm interested in what you've seen in the field, Alex.

Alex: Yeah, fully agree. Scoping it down so that you know exactly how things causally relate to each other is the main piece of this. If you have any retrieval in your system, work on that first before you start thinking about prompts or whatever. Too often I've seen problems where basically everything in your system is downstream of whatever is going on in your RAG, and then there's no way you can recover if you don't fix your RAG stuff first. And yeah, the prompt engineering stuff is sometimes hard to reason about, but if you have some way of knowing how, or the extent to which, you're improving based on how you're improving your prompts, then those two things will take you a long way, I would say.

Hugo: Absolutely. And think about what you're evaluating, and where, as well. So, for example, when evaluating retrieval, [00:26:00] evaluate the retrieval part before evaluating the generative part, right? Mm-hmm. And think about — yes, you want decent recall, because you probably want to accept some false positives to avoid false negatives, these types of things. So we're gonna have to jump off in a minute, sadly — we should do another podcast again soon — but Alex, with all of this, there is something that startled me, which was the increased adoption of vendor-backed agent frameworks. Because, as you know, something we've seen a lot is people adopting frameworks and then moving back and hand-rolling things, but I think what we've seen in the past several months, at least, is people becoming more comfortable with using frameworks such as LangGraph and LangChain. Is that accurate?

Alex: Yes, I would say LangGraph more than LangChain. And this is partly because it gives people a language or a vocabulary around which to deal with, or engineer, the current things which people are working on. So LangGraph is interesting because it allows these multi-agent things that everyone [00:27:00] is enthusiastic about at the moment — maybe LangChain was previous years. But in the end, people will use it to experiment and to see what exactly this new thing is that people are doing, and then I think it's still very much a common thing that people will take whatever they learned from there and then rewrite it themselves, often in a different language or so on. That's quite common, yeah.

Hugo: Makes a lot of sense. Thank you so much for your time and wisdom, Alex, and for all the work you do. This database just continues to nourish us and feed us and propagate knowledge and skills and wisdom back along the adoption curve. So really appreciate your work, man.

Alex: Thanks very much for having me on. Such a pleasure.

Hugo: Alright, see you soon. Bye.

Thanks for tuning in, everybody, and thanks for sticking around until the end of the episode. I would honestly love to hear from you about what resonates with you in the show, what doesn't, and anybody you'd like to hear me speak with, along with topics you'd like to hear more about. The best way to let me know is currently on LinkedIn, [00:28:00] and I'll put my profile in the show notes. Thanks once again, and see you in the next episode.