The following is a rough transcript which has not been revised by High Signal or the guest. Please check with us before using any quotations from this transcript. Thank you. === lance: [00:00:00] There's been a democratization shift in the industry. And because of that shift, most people using ML and AI today are working at a higher level of abstraction. So rather than an intense focus on model architecture and training, most users are now just handed this object, this extremely powerful new computing primitive, and ask: what do I do with this? And this is where I've been operating for the last few years at LangChain, building on top of these LLMs. Prompt engineering, context engineering, fine tuning, building agents: all of these new disciplines built on top of this new primitive that's being offered through an API by a small number of players. So I think those are a few major shifts that we've seen in the landscape over the last few years. hugo: That was Lance Martin, a machine learning engineer at LangChain, outlining a key shift in the generative AI era. We're moving from the challenge of training models to the new engineering discipline of orchestrating them at scale. Lance used to build and scale production ML systems at Uber, including self-driving technology, and now builds tools at LangChain [00:01:00] to help teams across all verticals build and deploy AI-powered applications. In this episode, Duncan Gilchrist and I speak with Lance about what's changed since the early ML days and how core principles like simplicity and observability must be adapted for non-deterministic systems. Lance brings so much insight from the bleeding edge of the space, with examples from Claude Code and Manus, discussing the practical disciplines of context engineering, including context rot (why the effective context window is often much smaller than the token limit) and the three-part playbook for managing it: reduce, offload, and isolate. This also includes using multi-agent architectures for context isolation. We also cover the emerging architecture of the agent harness, which manages tool calls (essentially, how LLMs can do things), and how builders can use only a few atomic tools, like a Bash tool, to expand the agent's action space dramatically. This episode is a deep dive into [00:02:00] engineering discipline, and it also gives technical leaders clear insight into how teams are building and delivering value at the bleeding edge, where the foundation models powering your applications are constantly and exponentially improving. High Signal is brought to you by Delphina, the AI agent for data science and analytics. If you enjoy these conversations, please leave us a review, give us five stars, subscribe to the newsletter, and share it with your friends. Links are in the show notes. I'm Hugo Bowne-Anderson. Welcome to High Signal. Let's jump in. Hey there Lance, and welcome to the show. lance: It's great to be here. I've known Duncan for many years, so it's a pleasure to be on, and great to meet you as well. hugo: Totally. And look, I'm so excited to hear about what's happening at LangChain and all the wonderful things that you're enabling people to do there. But what I'm also really excited about is that you worked on production ML systems at places like Uber prior to working on generative AI tooling at LangChain. So you have a wonderful perspective [00:03:00] on what's changed.
So I'm wondering if we could open by you letting us know what feels fundamentally different about building and maintaining ML systems versus what people are doing now with generative AI and LLMs in particular. lance: That's right. Yeah. So, you know, Duncan and I overlapped at Uber; this was back in the 2015, 2016 era, and I think there have been a few interesting shifts in the ML and AI landscape since that time. One is architectural consolidation. We saw the emergence of the transformer architecture, which is extremely expressive, and, driven by scaling laws across compute, data, and model size, we saw models get much bigger. We saw other architectures like CNNs and RNNs, which are a little more specialized, kind of get swallowed by transformers. So we had architectural consolidation and scaling laws driving much, much larger models. That's thread one. Thread two: I worked in self-driving for a number of years, at Uber and then after Uber, and in that era there were approximately as many orgs training models [00:04:00] as there were orgs using them. It was one-to-one in the sense that each self-driving company was training its own models. It was highly proprietary, with a lot of in-house expertise at really all of these companies. And beyond self-driving, where ML was being deployed in recommender systems, typically the organizations using ML were also training the models. Now that's entirely flipped. You've had the emergence of a small number of foundation model providers, models have become extremely large, and most people using AI today are not actually training models. So there's been a democratization shift in the industry. Architectural consolidation, a democratization shift, and then, because of that shift, most people using ML and AI today are working at a higher level of abstraction. Rather than an intense focus on model architecture and training, most users are now just handed this object, this extremely powerful new computing primitive, and ask: what do I do with this? And this is where I've been operating for the last few years at LangChain, building on top of these LLMs: [00:05:00] prompt engineering, context engineering, fine tuning, building agents, all of these new disciplines built on top of this new primitive that's being offered through an API by a small number of players. So I think those are a few major shifts that we've seen in the landscape over the last few years. duncan: I think that contrast from classic or traditional ML to gen AI is so interesting, and you and I have a lot of war wounds and battle scars from the Uber days. I'd love to explore with you which lessons from traditional ML still actually apply in gen AI systems and which start to fall apart. lance: Yeah, so that's an interesting one, and what I would say here is: even though we're now handed these incredibly powerful models through, for example, APIs offered by these frontier labs, simplicity remains essential. I think it is very important to start with the simplest possible solution. We see many organizations that we work with at LangChain, and others that I've consulted with, jump straight to agents. Agents are in the air, agents are a buzzword: I want to build an agent. We'll talk about this more in detail later, but really [00:06:00] think through the problem you're trying to solve. There are many different ways to do it: prompt engineering, a simple workflow,
building an agent with simple context engineering, all the way up to maybe building an agent with some kind of RL in the loop, or reinforcement fine-tuning. There's a spectrum of solutions you can use with these models depending on your problem, and I think starting simple is very important. The other thing that's really critical is observability and evaluation. You can build an agent, but having the ability to understand what's happening, and to evaluate it in a rigorous way, is obviously extremely important. So at LangChain we do a lot of work on observability and tracing, and also on test suites and evaluation suites. This goes beyond simple unit tests; many software orgs are familiar with simple unit tests, but working with ML systems, particularly LLMs, which are non-deterministic, needs a new kind of evaluation, which we can talk about later. I think the final point is a really interesting observation I heard from Jason Wei, who was at OpenAI for many years and is now at Meta Superintelligence Labs (MSL). He talks about [00:07:00] this idea of verifier's law, which says that the ability to train an AI to solve a task is proportional to how easily verifiable the task is. In coding, verification means you can just compile the code, run it, and make sure it passes; that's roughly what verification means there. And tasks that are easier to verify are easier to apply, for example, reinforcement fine-tuning to; that's what he's referring to. So setting up evaluation, and as part of evaluation establishing some verification metric, is actually a very helpful and important foundation if you ever want to apply reinforcement fine-tuning or train and fine-tune models for particular tasks. So setting up evaluations with clear verification criteria is very important for quality, and also if you ever want to move into fine tuning or more advanced things. I think those are three things that are still very important today with these new systems, and which were always, of course, important in the prior eras of ML. hugo: There's so much meat and insight in [00:08:00] there that we will get into throughout this conversation. One thing that I do think we'll talk about is how we go about building and evaluating agents. And it does feel like a two-cultures thing in some ways, because, as you've said, starting and building small is incredibly important to be able to inspect, evaluate, and deliver value. And yet, and we'll talk about a blog post you wrote about context engineering in Manus, a typical Manus task has around 50 tool calls, and Anthropic has told us to start slow with workflows, yet they've also published about their multi-agent research system, which is a big sprawling behemoth in a lot of ways. Before we get to that point, I'm interested in talking about a wonderful post of yours that we'll link to in the show notes called Learning the Bitter Lesson, in which you quote Rich Sutton: general methods that leverage computation are ultimately the most effective. I'm wondering if you could expand on this quote-unquote bitter lesson and tell us how it's shaped the way you approach designing systems today. lance: Yeah, actually I think this is one of the [00:09:00] most interesting challenges associated with building on top of this new kind of LLM compute primitive that we all have access to.
So basically, Rich Sutton in 2019 put up this very important, seminal essay called The Bitter Lesson, and the intro line goes: the biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. So simple methods plus more compute often beat more complex methods with more human biases baked in. A classic example of this is in vision. We had convolutional neural networks, which encoded various inductive biases about how we should solve vision tasks like classification or object detection. Transformers, which are a very general message-passing architecture, ultimately became state of the art for those tasks. So the observation is that a basic architecture like transformers plus more data and scale ultimately beats more handcrafted methods. That's been the [00:10:00] trend we've seen in AI for over 70 years. Now, the link here is that this also applies to the things we're building on top of LLMs, so it also applies at the AI engineering layer. This is one of the biggest lessons I've learned in building LLM applications, because here's the problem: just as the model layer is designed on top of exponentially increasing compute, we're now building applications on top of exponentially improving LLMs. And so what you build today, the assumptions baked into your architecture for whatever app you're building, will not be correct in six months when a new model is out that's much, much better. I saw this play out in my own work on a project called Open Deep Research, and that's what I talk about in the blog; I cover a year of working on this project. It's basically a fully open source deep research agent. I started it as a very simple workflow back in [00:11:00] 2024, because tool calling with LLMs was weak back then; it didn't really perform that well. We'll talk about what agents are versus workflows a bit later; suffice to say, I started with a particular architecture that was not using an agent, and over time, as tool calling got much better, I had to re-architect Open Deep Research three or four times to keep up with the improving models. I talked to Manus about this in a webinar we did about two weeks ago. Manus is one of the most popular general-purpose agent products out there today; the company is based in Singapore. They've actually re-architected Manus five times since launching in March. So what you're seeing across the industry is that building on top of this new primitive that's getting much, much better all the time forces you to continually reassess your assumptions and rebuild your applications. Another great example of this: Boris Cherny from Claude Code mentioned in passing in one of his talks, in the Q&A, that the secret sauce of Claude Code is [00:12:00] 70% model, 30% scaffolding or harness; they have an agent harness that they use with Claude Code. And the follow-up question was, okay, but over time, as the model gets better, does that mean all your work on the harness becomes irrelevant? He said, yeah, that is the case.
So what's happening is that over time, models get better, and you're having to strip away structure, remove assumptions, and make your harness or your system simpler to adapt to the models. And this is something that's very hard. You can't just have a fixed architecture, a fixed scaffold, a fixed harness and be done, because you're building on top of something that's always improving. I think this is one of the trickiest lessons, and we'll talk about it a little more later, but it's one of the biggest things I've observed in working with LLMs: embracing change and being willing to re-architect your system. The nice thing is, with LLMs being extremely strong at code assist (Claude Code, Cursor, Devin), it's very easy to rebuild things. But embrace the fact that you're building on top of a primitive that's always getting [00:13:00] better; you have to constantly re-architect your application and you can't be shy about that. I think it's one of the most disorienting things for people who are starting to work with LLMs for the first time. hugo: Would you mind just clarifying what you mean by a harness as well? lance: Yeah, right. So when you build an agent, and we'll talk about agents and workflows in more detail, you have the LLM, exposed through some kind of SDK; you could be using Claude, you could be using OpenAI. You take that LLM SDK, you bind a bunch of tools to it, and you basically allow that LLM to call tools. When it calls a tool, it just produces a structured output that adheres to whatever tool you provided. Let's say it's a search tool, and the search tool takes a single parameter, query. The LLM will produce a function call or tool call that just has query and whatever query parameters you want. That's all a tool call is. You need something that actually executes that tool call, and that's where the harness comes in. The harness will do that and other things; we'll talk about context [00:14:00] engineering in a bit, but the harness is the thing that says: okay, the LLM made this tool call, I'll go ahead and run the tool, I'll take that tool result, package it as a message, and add it to the message list. Typically you have a message list that's growing, and you're passing it back to the LLM every turn of your agent. So managing that message list, managing tool execution, packaging up the tool results as messages, and passing them back to the LLM: that's the harness. And we'll talk about it in a bit, but it typically handles more than that, usually some kind of logic or what we might call context engineering. Claude Code, for example, has a very interesting harness. You can actually see the tool calls it's making when it's running: if you ever work with Claude Code, you can see it run a Bash tool, run different search tools, and you see that in the trace as it's running. The harness is doing all of that under the hood, passing those results back to the model, and the model is reasoning and making additional tool calls, and so forth.
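To make the harness concrete, here is a minimal sketch in Python of the loop Lance describes: call the LLM, execute whatever tool call it emits, append the result to the message list, and repeat. The `call_llm` and `run_search` helpers and the message format are hypothetical stand-ins for illustration, not LangChain or Anthropic APIs; a production harness like Claude Code's does far more around this loop (context engineering, permissioning, sandboxing).

```python
import json

# Stand-ins to replace with a real LLM client and a real search backend.
def call_llm(messages: list[dict], tools: list[dict]) -> dict:
    """Return an assistant message; may include a 'tool_calls' list."""
    raise NotImplementedError("wire up your LLM provider's SDK here")

def run_search(query: str) -> str:
    """Execute the actual search and return text results."""
    raise NotImplementedError("wire up a search API here")

TOOLS = [{
    "name": "search",
    "description": "Search the web and return result snippets.",
    "parameters": {"type": "object",
                   "properties": {"query": {"type": "string"}},
                   "required": ["query"]},
}]

def agent_harness(user_request: str, max_turns: int = 10) -> str:
    """The harness: call the LLM, execute any tool calls it makes, append the
    tool results to the growing message list, and pass it all back each turn."""
    messages = [{"role": "user", "content": user_request}]
    for _ in range(max_turns):
        assistant = call_llm(messages, tools=TOOLS)
        messages.append(assistant)
        tool_calls = assistant.get("tool_calls", [])
        if not tool_calls:                       # no tool call -> final answer
            return assistant["content"]
        for call in tool_calls:                  # the harness actually runs the tool
            args = json.loads(call["arguments"])
            result = run_search(args["query"])
            messages.append({"role": "tool",
                             "tool_call_id": call.get("id"),
                             "content": result})
    return "Stopped after max_turns without a final answer."
```

Most of the interesting work in a real harness is everything around this loop, in particular deciding what stays in `messages`, which is where the context engineering discussed later comes in.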
duncan: A few times you've used words like workflows and pipelines and agents. Can we unpack those a little bit? [00:15:00] What do they actually mean, how do they fit together, and where are they in the generative AI hype cycle? lance: Yeah, this is a really good one to clarify. I've said the word agent a few times, and I might have said the word workflow, so we should break these down carefully. Probably the best blog post, which I'm sure we'll put in the show notes, is from Anthropic; it came out late last year and it's called Building Effective Agents. They define workflows as systems where LLMs and tools are orchestrated through predefined code paths. A workflow follows a predetermined sequence, and you can have LLM calls embedded in that sequence, but you have an application that goes A to B to C to D every time you run it. Step C could be an LLM call to do something. That's a workflow. An agent is a bit different: a system where an LLM, as they put it, dynamically directs its own processes and tool usage, meaning it controls how it accomplishes the task. So what does this actually mean in practice? All it means is: I have an LLM, I have some set of tools, and I bind the tools [00:16:00] to the LLM. Let's say I have tools A, B, C, D; the LLM can call those tools in any order it wants to solve the problem. Whereas in a workflow, you lay out the steps A, B, C, D very precisely. So that's the key difference. Workflows follow some predefined set of steps that I lay out as a developer, which can involve LLM calls, whereas agents allow an LLM to call tools autonomously in a loop, in any order it sees fit. And I think the key point to make is that the difference is autonomy. Agents are really good for tasks that you can't really enumerate ahead of time. Research is a classic; that's why deep research is one of the seminal agent products. Research is open-ended: the next step is conditioned on the prior one. I'm going to do a search, get some results, reason about the results, and do another search. Whereas "I want to run this test suite" is a classic workflow-type problem: every time a PR is put up, I want to run these five tests. That's more of a workflow thing.
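A toy contrast of the two shapes, reusing the hypothetical `call_llm` and `agent_harness` from the sketch above; the `run_linter` and `format_report` helpers are made up. The workflow's control flow is fixed by the developer, while the agent lets the model choose tools in a loop.

```python
def run_linter(diff: str) -> str:        # stand-in deterministic step
    return "lint: ok"

def format_report(text: str) -> str:     # stand-in deterministic step
    return f"PR report:\n{text}"

# Workflow: the developer fixes the sequence A -> B -> C. Step B happens to be
# an LLM call, but the control flow is identical on every run.
def pr_check_workflow(diff: str) -> str:
    lint_report = run_linter(diff)                          # A: deterministic tool
    summary = call_llm(                                     # B: embedded LLM call
        [{"role": "user",
          "content": f"Summarize this diff and lint report:\n{diff}\n{lint_report}"}],
        tools=[])
    return format_report(summary["content"])                # C: deterministic tool

# Agent: the LLM decides which tools to call, in what order, and when to stop.
def research_agent(question: str) -> str:
    return agent_harness(question)      # the tool-calling loop sketched earlier
```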
duncan: So it sounds like you advocate for structured workflows; they're simpler in certain kinds of cases. How do you think about what kinds of problems [00:17:00] are better solved by those versus more agentic approaches? lance: Yeah, so this is one of the classics: when to use workflows, when to use agents. I'll share some nice documentation in the show notes on this, but there are a few different resources I like. One in particular is a talk given by a guy from Shopify who developed what they call Roast. Roast is a framework built internally at Shopify for laying out workflows. It's very similar to a framework that we have at LangChain called LangGraph. LangGraph is an extremely popular framework for building agents or workflows, but I really like the Shopify example because it exemplifies a lot of the rationale for why we built LangGraph. Workflows are great when you have problems with predefined, predictable steps: migrating a legacy code base, running some set of tests. The Shopify talk mentions a lot of those as the things that motivated Roast: very well-defined, predictable steps. Two is consistency and repeatability, when you [00:18:00] need deterministic behavior and clear oversight. Testing is another great example of that: with every PR, you want these N tests run. Known sequences of steps A, B, C, D; that's really where workflows shine. And then agents are good for anything that requires ongoing adaptation, debugging, and iteration. Research is a classic; coding is another good one, which is why coding agents are so popular. Solving problems with code is often iterative: you try one solution, you might run it through a set of tests, it fails, you try again. More open-ended, adaptable problems are much better for agents: research, coding. Predictable, well-defined steps, like migrations and tests, are much better as workflows. And one nuance is that you can actually embed agents in workflows. You could have a workflow of N steps, and one of those steps could be calling an agent to do a thing. In fact, Shopify talks about that a lot in their Roast talk, which I'll be sure to link. It's a subtle point that they can play together, but it's also true that lots of problems people want to solve with agents you could absolutely solve just by [00:19:00] laying out a workflow. hugo: The other subtle point that I think is worth expanding on slightly, because it's tripped up a lot of people, is that it's not agent-or-not. When we talk about workflows, a lot of people will use the term agent or agentic to describe them, and of course Anthropic makes that clear in the blog post you mentioned from last December: there's a spectrum of agency, where maybe you've got an LLM and you're adding some memory, a couple of tool calls, retrieval, and then you're starting to build workflows. So it isn't an on switch or an off switch. The other thing worth mentioning, and I think you're speaking to this already, is that having an agent with incredibly high agency works very well when you have relatively strong supervision as well: a human in the loop who can guide it, train it, have conversations with it, and maybe even, God forbid, check the code it's written, for example.
So it is absolutely true that because agents have higher autonomy or agency. We often have to be a little bit more careful about what they're doing and sandbox 'em appropriately. hugo: Absolutely. And Anthropic actually made that very clear when they first released their first prototype for Claude Code. I think they're like, please do sandbox this. This is highly experimental. lance: Yes. hugo: And then like earlier this year, we see, we saw even people like Steve, Steve Jager say, say on Twitter, I can finally talk about this now. I deleted a production database by Vibe Coding. lance: Yes, hugo: the op the opposite of safe mode. Yes. Cursor used to call it YOLO mode. They don't anymore YOLO mode. They've changed the name. I know, but you actually spoke to it a really interesting point, which is models are getting significantly better. So can you tell us a bit about how the improvement or more affordances of models [00:22:00] have made building agents or made agents more reliable at what they do? lance: You know, it, it's, so there's a few different interesting threads here. So one is, I'll link this in the notes, but so meter. Publishes a kind of a leaderboard or a, or kind of an evaluation that measures the length of tasks that LMS can accomplish. And I have to go back and check. I believe it's doubling every seven months. So this is like the length of human equivalent work. So it's something like, agents can at 50, at a 50% success rate, accomplish tasks that take a human two hours today or something. And there's a bunch of different models evaluated. But the point is the. The kind of autonomy level of LMS is doubling every seven months. It's one of the interesting scaling laws to track. So that's a consequence of models getting better at tool calling largely that getting better at tool calling, allowing them to per perform longer horizon tasks. So it really comes down to the fact that the models are indeed getting quite a bit better at instruction [00:23:00] following and tool calling. And also I would note you see this with cloud code, they're getting better at adapting. So for example, if they do make an error, for example, they format a tool calling correctly. They can see that, for example, that trace and they can correct. So self-correction is another very important point. So it's really all these things coming together at the end of the day. Models getting much better at tool calling, allowing for longer horizon tasks. That's really the key driver here. hugo: And that's something, the self-healing is something Anthropic has published quite a bit about in the building, their multi-agent research system. Yes. And we do have a bunch of other topics to get to, but Yeah, something you've actually been speaking to implicitly is Yeah, the rise of background agents, right? So as the models get better at doing tool calls, the ability for us to send them off for longer. And so I'm wondering if you can tell us just. What you are seeing with respect to this burgeoning field of of background agents? lance: Yeah, it's funny, at least at Lang Chain, we call them ambient agents. We have a whole course on it. So I did a whole course on building Ambien agents, amazing in Lang Graph, and the use case I built in the course that you can build up to is an agent that will run your [00:24:00] email. And so it just runs autonomously every night on Aron, and it'll process all your emails. Actually, sorry, it's not every night you can run it. 
It actually pings every 10 minutes, so it's running in the background constantly. You can configure that any way you want, though; you could have it run once a day overnight. For me it's every 10 minutes, and it's constantly monitoring your emails. It pulls them in, it triages them, it decides which ones to respond to and which not to respond to, and it produces responses and queues them all up for you. You approve them through a little interface we built and fire them all off. Harrison, our CEO, actually uses it. I don't, because I don't get that many emails. But that's a good example of an ambient agent. I think it's a great point; it's a very good emerging form factor. Codex is a great example for code: just kick it off, it runs async and does stuff for you. It makes a lot of sense. The catch, I would say, is that in the context of coding it can actually increase the review burden. You have to really trust the system to do a bunch of work autonomously and come back to you after some period of time. If it spins [00:25:00] off on a task for a long time and it's on the wrong track, you get this big body of work at the end and you say, oof. So designing the right kind of guardrails or human-in-the-loop, how do you check it, how do you approve what it's doing, is still a little bit tricky. With my email agent, I have a few different gates: it pings me if it needs to ask a question, it pings me when it's going to prepare a response, like an email it's drafted, and it pings me if it decides something is worth ignoring, to let me confirm. Long story short, when you're working with these async or ambient or background agents, you do have to be careful to design the system so it has the right kind of human-in-the-loop checkpoints, because if it just goes off and does a bunch of work behind the scenes, that can be problematic for obvious reasons. That's a little bit of the trick with these async, ambient agents. Personally, for code I actually use Claude Code synchronously, but that's just me. And I do use async agents for things like email, or at least I built one and used it for a while; Harrison continues to use it, I just don't get enough email to [00:26:00] justify it. But I do think that with async or ambient agents, being very thoughtful about how you set up human-in-the-loop, and how much trust you can build in the system, is very important. I'll make a small note here that in the email example, a very important aspect of ambient agents, I think, is memory, because you want them to handle long-horizon tasks autonomously. Ideally they remember your preferences, because you're endowing them with longer-horizon work. In my little email example, I actually have a memory system: every time I give it feedback, it records the feedback and bakes it into a little long-term memory, which is just stored in a very simple set of files. Those files are updated constantly as I give it feedback, so it gets smarter over time. It runs autonomously, but it's also learning my preferences. It's going to be very annoying if these systems run autonomously but don't learn from our feedback and just keep making the same mistakes over and over. I think that's another tricky thing about autonomous agents: ideally they have some form of memory so they can adapt and learn.
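A minimal sketch of the kind of file-based long-term memory Lance describes, under the assumption that preferences live in one plain-text file that an LLM rewrites whenever new feedback arrives. It reuses the hypothetical `call_llm` helper from earlier and is not the actual code from the LangChain course.

```python
from pathlib import Path

MEMORY_FILE = Path("email_agent_memory.md")   # assumed location for the memory file

def load_preferences() -> str:
    """Read the accumulated user preferences, if any."""
    return MEMORY_FILE.read_text() if MEMORY_FILE.exists() else ""

def record_feedback(feedback: str) -> None:
    """Fold a new piece of user feedback into the stored preferences.
    An LLM rewrites the memory so it stays short and coherent over time."""
    prompt = (
        "You maintain a short list of the user's email-handling preferences.\n"
        f"Current preferences:\n{load_preferences()}\n\n"
        f"New feedback from the user:\n{feedback}\n\n"
        "Rewrite the preference list, merging in the new feedback. Keep it concise."
    )
    revised = call_llm([{"role": "user", "content": prompt}], tools=[])
    MEMORY_FILE.write_text(revised["content"])

# On every run (e.g. each 10-minute cron tick), the agent prepends
# load_preferences() to its system prompt, so feedback accumulates across runs.
```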
duncan: Zooming one click out on [00:27:00] that: the context you feed into the agent is obviously so critical to making the agent work correctly, and you've written a lot about this term context engineering. I'd love for you, as a domain expert, to define that term and help us think through why it's important and how leaders should think about it. lance: Yeah, so this is actually a very important term that's emerged in the last couple of months. The way I think about it is: let's say I build an agent. I take an LLM and I bind some tools; let's say it's a deep research agent. I actually made this mistake, so I'll walk through it exactly. You bind a bunch of search tools, you have it run, it does a search and returns the results. Those results are appended to a message list, you pass those messages back to the LLM, it makes another search tool call, and the same thing happens five or six times. By the end, that message list can be quite big depending on how many tokens are in each search result. You can be talking, in my case, hundreds of thousands of tokens; search results are extremely token heavy. [00:28:00] If you're just doing this naive tool-calling-in-a-loop thing, which is the base-case agent, it's extremely expensive and slow. So being very thoughtful about the context you feed to your LLM is important, and not just for latency and cost. Chroma, and I'll link this in the show notes, put out a really nice report on what they call context rot: basically, as context gets longer, performance degrades. Anthropic mentioned it recently in a nice post they have on context engineering: the attention mechanism starts to degrade with respect to context length. So the point is that agents in their naive form, if it's just tool calling in a loop, use a lot of tokens, depending on the tool calls you're using. A search tool, for example, is often pretty token heavy, so agents are very token hungry. Consider the fact that Manus typically makes around 50 tool calls per run; that's a lot of tokens. Anthropic mentions that production agents can make hundreds of tool calls or have hundreds of turns. So it's costly, it's [00:29:00] slow, and it can degrade quality. That's why we talked about agents and agent harnesses: agent harnesses often have mechanisms to manage this. A few of the trends I've observed: one is context reduction, and there are a few interesting tricks here. One of the intuitive ones is compacting older tool calls. Imagine you've called a tool, then another, then another; by your fourth turn, you can compact tool call one, you don't have to keep it in the message history. Manus does this; it's a good idea, and I've done it as well. Another thing, which you'll see if you use Claude Code, is summarization: once you start to fill up the context window of the LLM, you produce a summary of the entire message history, which compresses all those tokens into a much shorter form, and you move forward. So for context reduction, pruning or compacting older tool results and trajectory summarization are two tricks we've seen across Claude Code and Manus, and I've used them as well.
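A rough sketch of those two reduction tricks: compacting older tool results and summarizing the trajectory once it gets long. The token threshold, the rough characters-per-token estimate, and the message format are assumptions for illustration, not how Claude Code or Manus actually implement it; it again leans on the hypothetical `call_llm` helper.

```python
def compact_old_tool_results(messages: list[dict], keep_last: int = 2) -> list[dict]:
    """Replace all but the most recent tool results with short stubs.
    The full results are assumed to have been offloaded elsewhere (e.g. files)."""
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    to_compact = set(tool_indices[:-keep_last]) if keep_last else set(tool_indices)
    compacted = []
    for i, m in enumerate(messages):
        if i in to_compact:
            compacted.append({"role": "tool",
                              "tool_call_id": m.get("tool_call_id"),
                              "content": f"[result compacted, {len(m['content'])} chars omitted]"})
        else:
            compacted.append(m)
    return compacted

def maybe_summarize(messages: list[dict], token_budget: int = 100_000) -> list[dict]:
    """If the history is getting close to the (effective) context window,
    collapse it into a single summary message and continue from there."""
    approx_tokens = sum(len(str(m.get("content", ""))) for m in messages) // 4
    if approx_tokens < token_budget:
        return messages
    summary = call_llm([{"role": "user",
                         "content": "Summarize this agent trajectory, keeping every "
                                    "fact needed to finish the task:\n" + str(messages)}],
                       tools=[])
    return [{"role": "user", "content": "Summary of earlier work:\n" + summary["content"]}]
```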
duncan: Imagine you're using the latest [00:30:00] Sonnet 4.5 models with a million-token context window. You're saying you don't even want to use those million tokens; you only want to use a fraction of them, because with context rot the agent gets confused even if you start to use the whole context length? lance: This is a very subtle, good question, and it's actually under-reported. It is true that these models have some context window, for example a million tokens for the latest models. That doesn't mean that performance will be high quality through that entire context window, and you can get degradation in all sorts of non-obvious ways with respect to context length. The Manus folks and I chatted about this last week in our webinar, and they mentioned that the effective context window for these LLMs is often quite a bit lower than the stated technical one. So it's something to be very careful of: just because the context window is a million tokens doesn't mean you're necessarily going to get high-quality instruction following throughout that entire context length, and the failure modes can be non-obvious and subtle, as noted in that Chroma report on context rot. So I think it's [00:31:00] worth being careful and, even if you have a very large context window, being judicious about how it's used. Anthropic has come out and said as much; just last week they put out a very nice white paper or blog post on context engineering, and their new SDK has a bunch of updates that incorporate some of these ideas. In particular, it has compaction of older tool calls automatically built in with some of the new models, so they're employing exactly this idea. And as noted, Claude Code indeed uses summarization. hugo: For those wanting to dive a bit deeper: it definitely doesn't help with context rot, but there are tools like prompt caching that can help with cost and latency when you have long context as well. lance: Okay, so that's an important point, a good one. Prompt caching is indeed useful; it helps with cost and latency. It doesn't help with context rot, because if you're using a hundred thousand tokens, even if they're cached, it's still a hundred thousand tokens. Manus actually uses caching very extensively, [00:32:00] but they still perform context reduction through pruning and summarization; they do both. hugo: Yeah. The other thing I've seen, and this is anecdotal, from work I've done and from friends who work in the space: if something is super important, a really important piece of information, including it at the start of the prompt and at the end of the prompt can seem to increase performance. lance: That is true, and actually I've done that quite a bit: moving instructions into, for example, the most recent message to reinforce something. And that's part of this. If you're reducing context in an effective way, the overall message list is managed pretty appropriately, because you're pruning or compacting older messages, particularly the older tool calls, and you're doing summarization, so your message list will be much less token heavy than if you did not do those things. But there are actually three big ideas here. One is what we just talked about: reducing [00:33:00] context. The second is offloading. The third is isolation. I'll talk about those briefly. So what does offloading actually mean?
For example, Manus uses the file system for all tool results; it's tied to the pruning thing. You save the full tool message to the file system so it's preserved, and then when you do that compaction, you always have a reference to the actual file if you ever need it again. That's it. Offloading is a really good idea; I use it as well, and I think it's a very effective approach. The other small point I'll make with respect to offloading is offloading a lot of your functions or tools. This is a very subtle one, and we're seeing it more and more. Instead of giving your agent a hundred tools, which uses a lot of tokens because you have to include all those instructions in your system prompt (you have a hundred tools, here's how to use them all), do what Manus does, and I've used this quite a bit as well: use only a small number of atomic tools, fewer than 20, like a file system tool and a Bash tool. [00:34:00] Basically, allow your agent to use a computer, for example to run scripts, which expands its action space hugely without bloating the function calls. This is a good take that Manus mentioned to me recently; it's top of mind for them for all the MCP tools, and I think we'll talk about MCP in a minute. Instead of binding all those tools to the model and having them all live in the system prompt, they just have a CLI that the agent can call through a Bash tool to run any of those MCP tools. I want to make sure I make this clear: the main idea is that instead of binding a huge number of different tools to the model directly, you bind a small number of tools, like a file system tool and a Bash tool, and let the model or agent use the Bash tool to execute commands that do many other things. So the action space can be huge even though the agent is only making two or three different kinds of tool calls. That's the key insight, and you see a very good example with Claude Code. Think about Claude Code: how many different tools is it actually calling? I've used it a huge [00:35:00] amount since last February. File search, Bash, web search; I can't think of many more than that. Very simple, and it can do a huge amount, because if you just give it access to a computer, it's extremely powerful. So that's the second idea: offloading from function calls and instructions in the system prompt to, for example, just calling things directly from a terminal. The final one, which I'll mention briefly, is context isolation. This one is a little more straightforward: when you have a task that is token heavy, you can offload it to a subagent, let that subagent perform the task, and then just return some result to the main agent. We see this a lot. I've used it extensively in Open Deep Research, Anthropic uses it in their multi-agent researcher, and Manus uses it, so it's a very common approach: context isolation through multi-agent, very intuitive. So those are the three big ideas, reduce, offload, isolate, with a [00:36:00] number of examples from Claude Code, Manus, and some of my own work on Open Deep Research.
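A rough sketch of the offloading idea: the full tool result goes to the file system with only a short pointer left in context, and a single generic Bash tool stands in for many specialized tools. The paths and helpers here are illustrative assumptions, not Manus's or Claude Code's implementation, and in practice anything like `bash_tool` would run inside a sandbox.

```python
import subprocess
from pathlib import Path

OFFLOAD_DIR = Path("agent_workspace")          # assumed scratch directory
OFFLOAD_DIR.mkdir(exist_ok=True)

def offload_tool_result(turn: int, content: str) -> dict:
    """Write the full tool result to disk and return a short stub message,
    keeping only a pointer and a preview in the agent's context."""
    path = OFFLOAD_DIR / f"tool_result_{turn}.txt"
    path.write_text(content)
    return {"role": "tool",
            "content": (f"Full result saved to {path} ({len(content)} chars). "
                        f"Preview: {content[:200]}")}

def bash_tool(command: str, timeout: int = 60) -> str:
    """One generic tool that expands the action space: the agent can grep the
    offloaded files, run scripts, or call CLIs instead of needing 100 bound tools.
    This executes arbitrary shell commands, so sandbox it in any real deployment."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout + result.stderr
```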
duncan: Okay, another acronym came up there: MCP. LangChain has been a really big part of the emerging ecosystem of LLM infrastructure, and now we're seeing protocols like MCP become pretty popular. Can you talk a little bit about how this stuff fits together and what we all should know? lance: Yeah, this is a really good one. There's a really good talk from John Welsh at Anthropic at this year's AI Engineer Summit about the origin of MCP inside Anthropic; it's a good motivating story for how to think about this. MCP stands for Model Context Protocol. As we talked about before, models got really good at tool calling sometime mid to late last year. When that happened, and this is internal to Anthropic, people started writing all sorts of tools without much coordination, so there was lots of duplication, many custom endpoints for different use cases, and all these inconsistent interfaces basically confused developers, duplicated functionality, created maintenance challenges, and so forth. So the Model [00:37:00] Context Protocol emerged internally as a standard protocol to address this problem, and it was open sourced. It's basically a protocol that allows you to connect tools, context, and prompts to different LLM applications. It's a client-server model: there's an MCP server and a client application. In a tangible case, for me in my day-to-day work, I have an MCP server that serves LangGraph documentation. I work at LangChain; we use LangGraph as our open source agent and workflow framework. I want to write LangGraph code, so I have a little MCP server that connects to the LangGraph docs. That's all it does, but because I expose it as an MCP server, I can connect that same server to Claude Code, to Cursor, to Anthropic's Claude desktop app. So it's like a universal connector protocol for connecting tools, context, or prompts to different applications. That's really all it is. Now, the bigger picture here that I think is interesting relates a little more to the broader ecosystem and LangChain frameworks. [00:38:00] One of the points John made is that standardization is often very beneficial, particularly in large organizations. This speaks to some of your audience, and we've seen it a lot with LangChain and LangGraph: the reason certain frameworks like LangChain and LangGraph, and protocols like MCP, are popular is that standards are helpful. If you have a large org with many different people, a standard set of tooling is very beneficial. That was one of the reasons he argued MCP really took hold within Anthropic: the standard was very useful internally for a lot of different reasons, in terms of auth and security, consistent documentation, onboarding, and so forth. And we've found that's one of the main reasons people enjoy LangGraph. LangGraph is a framework for building agents and workflows; it's very popular, it's well supported, there's good documentation. Many organizations building agents onboard onto LangGraph because it's a standard set of low-level tools and primitives you can use to build agents, and everyone is speaking roughly the same language, [00:39:00] and it works seamlessly with MCP. You can build a LangGraph agent and use MCP to connect tools to it; that LangGraph agent is a little MCP client, and you can connect it to MCP servers, no problem. So they play well together.
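A minimal sketch of the kind of docs server Lance describes, using the FastMCP helper from the official MCP Python SDK; the naive full-text search and the local docs path are made up for illustration. Once running, clients like Claude Code, Cursor, or the Claude desktop app can be pointed at it through their MCP configuration.

```python
from pathlib import Path
from mcp.server.fastmcp import FastMCP   # official MCP Python SDK (pip install mcp)

mcp = FastMCP("docs-search")              # server name shown to MCP clients
DOCS_DIR = Path("./docs")                 # assumed local copy of the documentation

@mcp.tool()
def search_docs(query: str, max_results: int = 5) -> str:
    """Naive full-text search over local markdown documentation files."""
    hits = []
    for path in DOCS_DIR.rglob("*.md"):
        text = path.read_text(errors="ignore")
        if query.lower() in text.lower():
            hits.append(f"{path}:\n{text[:300]}")
        if len(hits) >= max_results:
            break
    return "\n\n".join(hits) or "No matches found."

if __name__ == "__main__":
    mcp.run()   # exposes the tool to any connected MCP client
```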
But I think the interesting point is this notion of standardization, and why, in larger orgs, standards are important; that's what motivates certain frameworks and protocols to take hold, in my view. hugo: So we've been talking around this, but I want to talk about evaluation, and in particular how people who work a lot on evals almost feel it should be called something other than evals, because it sounds niche, whereas a lot of the time it's really figuring out whether your product works and does what you want it to do. And you mentioned that Manus has re-architected five times this year, or something along those lines. lance: That's right. hugo: And if there is a constant need and push for re-engineering as models improve and new models come out, how can you test whether your system is future-proof, and how do you think about evaluation [00:40:00] and just making sure it works? lance: Yeah, so there are maybe two different interesting threads there. One observation is this idea of: you have your system, how do I know it's going to hold up to newer and improving models? One of the challenges that can happen is that I have some architecture, it works very well with the model of today, and as models get better, my architecture limits further improvement of my application. This can happen for lots of different reasons, and it's one of the major predictions of the bitter lesson: the structure we add to our applications today limits their improvement in the future. You can test this, and this is a good take that Manus mentioned that I really like, by testing against different model capacities even today. Take your system and evaluate it against, for example, a low-capacity model, a mid-capacity model, and the state of the art, and make sure performance goes up. If performance goes up [00:41:00] across capacities, you can tell that your harness, your system, is future-proof in that sense. So I think that's point one on evals and future-proofing your system. Point two, I think, is interesting and broader. Talking to Manus, and talking to a lot of Anthropic folks as well: a lot of the larger static benchmarks become saturated very quickly. So a lot of the evaluation, for example for Claude Code, and they've spoken about this publicly, is actually just dogfooding and direct user feedback in the app. Manus mentioned the same thing: a lot of their evals are born from direct in-app user feedback. So getting your product out there, having the ability to capture feedback in-app, and rolling that into eval sets is often what people are doing in practice. At LangChain we have LangSmith, which is a very feature-rich set of evaluation tools, with lots of nice tooling to capture feedback from traces into eval sets, so there's lots of support for that. But I think what we're seeing is that shipping product, capturing user feedback, and rolling that feedback into evaluation sets is the approach I'm [00:42:00] seeing more and more, which is maybe obvious. I think the only subtle point is that these larger static benchmarks often get saturated very quickly, so you need to constantly be surfacing new failure cases from users rather than relying on these big fixed datasets. Manus said they moved away from, I think it was GAIA and some of these other big QA benchmarks; they said they saturated relatively quickly on those.
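A toy sketch of the capacity-ladder check Lance describes in point one: run the same eval set through your harness with a weak, a mid-tier, and a frontier model and confirm the scores climb. The model names and the `run_agent` and `score` helpers are placeholders, not a specific vendor's API.

```python
# Placeholders: run_agent(task, model) runs your harness end to end with the
# given model; score(output, expected) returns a number in [0, 1].
def run_agent(task: str, model: str) -> str: ...
def score(output: str, expected: str) -> float: ...

CAPACITY_LADDER = ["small-model", "mid-model", "frontier-model"]   # placeholder names

def capacity_check(eval_set: list[dict]) -> dict[str, float]:
    """Average eval score per model capacity. If scores do not increase up the
    ladder, the harness is likely bottlenecking the model (the bitter lesson)."""
    results: dict[str, float] = {}
    for model in CAPACITY_LADDER:
        scores = [score(run_agent(ex["task"], model), ex["expected"])
                  for ex in eval_set]
        results[model] = sum(scores) / len(scores)
    return results
```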
MENA said they, they moved away from, for example, I think it was like Gaia and some these other big like QA benchmarks. They said they saturated relatively quickly on those. hugo: And just to be clear, thi this is amazing, uh, because relying on user feedback is wonderful in these cases, but if you are working in a regulated space and serving a conversational agent to financial customers or people coming to an online pharmacy or something like that, you can't actually do this. lance: Yeah, I That's true. May depend on your applications. So ideally our operating domain where actually can get some degree of user feedback to assess the quality of your app. I'm not actually sure if that's feasible in all domains, but ideally that would be the case. duncan: Actually in your mind, it's interesting to think through what good evaluation really requires in this space. 'cause where like the outputs right, are so [00:43:00] non-deterministic and context driven. It sounds like there's kinda these buckets of like user feedback, large scale eval, and maybe there's a bit of interchange between the two. Is it even possible in your mind to build great agents without sprinkling at least of those two pieces? lance: Yeah. For every agent I've built and for most of the popular production agents talking to Manis, talk to Claude folks. Evals are certainly being used across the board. They're obviously very useful. I think that the catch is, there's been some good blog posts on this. I'll link a few in the show notes. It's always just very important to actually look at your data and not rely strictly on evals. And I think there is, we've seen this quite a bit internally, there's an emphasis on just dog fooding, getting your applications to the hands of users, looking at the raw traces, looking at user feedback directly, rather than relying on large static eval sets because the models are changing so fast. And this is one of the interesting that came up from talking to, from talking to Claude folks. A lot of cloud code was really driven by internal dog fooding, so [00:44:00] basically being very aggressive about shipping updates, dogfooding internally, collecting feedback very rapidly, looking at traces and updating it in that manner. That does seem to be a common mode of evaluation that we're seeing across LM applications, and I think just looking at the raw data, having a good tracing system in place, just like table stakes. I think that having high quality evaluation sets is beneficial. And Jamel Hussein, all some of his stuff has posted a lot of good things on that particular topic. But I, I do think the case now, Jamel has mentioned this quite a bit, just setting up high quality tracing, looking at your data and being very aggressive about that, and shipping at least dogfooding internally very aggressively. Is where you start. hugo: Absolutely. And I think it's very important of course to keep an eye on your high level business like evaluation are your, are other metrics you think you want met actually being met, but let that guide development and I think a gotcha for a lot of people beginning building this type of software. Is focusing on the generative part and just thinking about [00:45:00] retrieval. A lot of the time it isn't the generative part, which is the issue. It's your retrieval. So your evals will guide focusing on, on, on retrieval, for example. Right? lance: Yeah. 
I think a related point there is that, when you're laying out these systems, I've found it can be very beneficial to set up evals for subcomponents. Retrieval, or RAG, is a good example. If you have a retrieval system, you're exactly right that the quality of the ultimate output depends on the retrieval itself and then the generation. You're retrieving from a vector store or database, passing that to an LLM, and the LLM is producing an output. You can actually do an evaluation on the retrieval itself, and that can be very beneficial. In my little email app example, I have evals for every component: an eval for just the triage step, an eval for the ultimate responses, an eval for the tool calls. So that's a good point: when you're laying out these applications, you can have evals for these subcomponents to make sure the smaller pieces are working as expected, and for your ultimate output you of course maybe have an eval set [00:46:00] or two, but you're also doing very aggressive dogfooding, getting it out there and using user feedback to update your datasets aggressively. So in development you can typically set up evaluations for subcomponents just to make sure they're working well as you build out your system. Then you have your whole system and you ship it, and you move a little more into guardrailing with some online evals, just making sure you're not seeing egregiously wrong outputs. And then you might also be collecting user feedback directly and rolling bad examples into datasets more on the fly.
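A small sketch of a subcomponent eval in the spirit of the email example: a labeled set for just the triage step, scored independently of drafting or tool calls. The `triage_email` function and the labels are hypothetical, not taken from Lance's agent.

```python
# Hypothetical component under test: in a real system this would be an LLM call
# that classifies an email as "respond", "notify", or "ignore".
def triage_email(email: str) -> str: ...

TRIAGE_EVAL_SET = [
    {"email": "Hi, can we move tomorrow's meeting to 3pm?", "label": "respond"},
    {"email": "Your weekly newsletter digest",              "label": "ignore"},
    {"email": "Invoice #4521 is overdue",                   "label": "notify"},
]

def eval_triage() -> float:
    """Accuracy of the triage step alone, independent of the rest of the pipeline."""
    correct = sum(triage_email(ex["email"]) == ex["label"] for ex in TRIAGE_EVAL_SET)
    return correct / len(TRIAGE_EVAL_SET)
```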
hugo: Totally. So to wrap up: a lot of our listeners and viewers are AI, data, and ML leaders. For leaders trying to make sense of the space, what do you wish more engineering managers or CTOs understood about how gen AI systems actually get built? lance: Yeah, maybe I'll walk through a few summary principles that hopefully hit a lot of what we talked about. One: keep things simple. Use just prompt engineering if you can get away with it. If you need a little more complexity, bump [00:47:00] up to a workflow. If you can't get there with a workflow, then consider an agent, if the problem is truly more open-ended, though try to minimize the tools and keep it very simple. If a single agent can get you there, then think about context engineering; you could, for example, offload to multi-agent if you need context isolation for more heavy-duty tasks. And finally, if all those things are insufficient, you might think about fine tuning or training models, but really only after all the others are exhausted. So that's point one: keep things simple. And I think it's tricky, because you might hear, for example on Twitter or in the timeline, people talking about reinforcement fine-tuning or building agents, and maybe your problem doesn't need that; don't increase the complexity arbitrarily. The second point is the bitter lesson thing: build for rapid model improvement. Recognize that what you build today will have to be re-architected fairly aggressively over time. The Manus example: five times since March. My example of Open Deep Research: I re-architected it three or four times in a year. You have to bake in the fact that models are [00:48:00] getting much better, and whatever little crutches you have in your application today to make it work go away as the model improves. That's exactly what's predicted by the bitter lesson, and you have to be aggressive about removing those assumptions and that structure as the models get better; otherwise you might bottleneck your performance. So that's a very important thing to keep in mind. Related to that: don't be afraid to rebuild. Manus rebuilt five times, Claude Code is constantly rebuilding, and I rebuilt Open Deep Research three or four times in a year, so you have to embrace that in this new era of LLMs. A subtlety is that the cost of rebuilding is much lower with code models, so it's much faster to re-architect things. Another subtle point I'll make is that things that don't work today will work tomorrow. Cursor is a great example of this: Cursor did not work well until Claude 3.5 Sonnet, then suddenly the product experience was unlocked, and the rest is history. So don't be shy or afraid if your product doesn't quite work yet, because if the issue is a model deficiency, it will very quickly get removed as models get better. I think Cursor is a great example of that. And maybe the [00:49:00] final point is to be wary of rushing to train models. It can be tempting, it can be really charismatic to think about applying fine tuning for your domain, but the frontier models are getting so good so quickly that you can take all this time to collect a dataset and train a model, and then you get bitter-lessoned, because the frontier model encapsulates the capability you fine-tuned for and you've wasted all that time. I'll give you a very specific example. Two years ago, structured outputs from frontier models weren't great, and people were fine tuning for structured outputs. Today, for the vast majority of use cases, that's irrelevant, because the LLM providers have gotten extremely good at structured outputs, complex nested schemas, JSON mode, and so forth. So that's just an example. So: keep it simple, build for rapid model improvement, don't be scared to rebuild, things that don't work today will work tomorrow, and don't rush to train models. I think those are the five things I would leave you with. hugo: Fantastic. What wonderful lessons, both from all your time building and from everything you've seen happen in the space working at LangChain as well. Thank you for bringing us [00:50:00] your wisdom and expertise, and for your time as well, Lance. This has been super fun. lance: Yeah, great to be here. Great to see Duncan again, and great to meet you, Hugo. Hopefully I make it up to Sydney sometime and we can get in the water. duncan: Go for a surf together. lance: Yeah. hugo: Thanks so much for listening to High Signal, brought to you by Delphina. If you enjoyed this episode, don't forget to sign up for our newsletter, follow us on YouTube, and share the podcast with your friends and colleagues. Like and subscribe on YouTube, and give us five stars and a review on iTunes and Spotify; this will help us bring you more of the conversations you love. All the links are in the show notes. We'll catch you next time.