The following is a rough transcript which has not been revised by Vanishing Gradients or the guest. Please check with us before using any quotations from this transcript. Thank you. === john: [00:00:00] If you start out by saying, I'm going to take my really complex thing, and I'll give it to this high-level framework where I just say, here's your task, go do anything you want, you're gonna come into a less nice outcome than if you just build a simple thing first. hugo: That was John Berryman, founder of Arcturus Labs, an early engineer on GitHub Copilot, and author of the O'Reilly book Prompt Engineering for LLMs. In this episode, John and I dive into what we call the seven deadly sins of AI application development. One of the biggest takeaways: the way you need to think about the john: models themselves. Empathize with the large language models. Think about them as a super-intelligent, super-eager, and forgetful AI intern with ADHD. hugo: Before founding Arcturus Labs, John was an early engineer on GitHub Copilot, contributing to both its code completion and chat features. John's background in search is [00:01:00] also fundamental here. He's the co-author of the Manning book Relevant Search and the O'Reilly book Prompt Engineering for LLMs, which he might just be renaming to Context Engineering. This conversation is all about practical patterns and anti-patterns for building with AI, which we've framed around the seven deadly sins of AI app development. We start with why demanding a hundred percent accuracy is a recipe for failure. Then we dive into the practical realities of working with agents, deconstruct the RAG black box, and explore why starting simple is always the right move. The core of our discussion is about building robust, debuggable systems, culminating in John's powerful mental model of treating an LLM as a forgetful, super-eager AI intern. I learned a ton from this conversation, and I hope you do too. This conversation was a guest Q&A from a recent cohort of our course Building LLM-Powered Applications, which I have the great pleasure to co-teach several times a year with my friend Stefan Krawczyk, who works on [00:02:00] AI agent infrastructure at Salesforce. Links in the show notes if you're interested. Also, if you enjoy these conversations, please leave us a review and give us five stars on the app you use, and share with your friends and colleagues. I'm your host, Hugo Bowne-Anderson, and this is Vanishing Gradients. Well, the thing is, John and I have been chatting about all the things that can go wrong and then the antidotes to those things. When we started chatting about these things, we looked at them and we realized we could bucket them in a variety of ways: in terms of quality and expectations, when and how we build apps, agentic issues that we run into, retrieval issues coupled with what we're calling context. Now, I think something we'll get into is that when we think about retrieval, a lot of people are focusing, for good but incorrect reasons, on the generative aspect and not the retrieval aspect, and decoupling these things can be very important. And then kind of challenges around tooling as well. So John, I'm wondering, there are many ways we can start here, but [00:03:00] we can start talking about the sins that involve expectations around quality and accuracy, or we can start talking about agents immediately. What's the first sin you wanna bring to the table?
john: The first thing that I had on my roster today was building an AI product with the expectation that it must be 100% accurate. And this is a problem that I run into with clients when I'm on introductory calls. Sometimes I tend not to take this work, because you get this similar pattern where the person will come in and say, hey, we're doing this thing. It's doing some sort of data extraction from these legal documents, or it's doing some sort of process, it's building something, and it needs to be 100% accurate. I realize they've got legality and liability all kind of bound up with that 100%-accurate thing. So the sin is requiring these AI systems to be more accurate than you would require a human to be, to be a hundred percent accurate. The penalty of the sin is that if you're really pushing a product that you're expecting to be absolutely a [00:04:00] hundred percent accurate, then you're probably gonna get into some liability issues, some trouble with that. I don't know how far you are down that road, but the penance to this is to reframe the problem. I think, you know, these agents are not at the point where they can do all the thinking for us. These systems get distracted; we'll go through all of this to empathize with the way these agents work. I think the goal for the current state of the art is to help our users save time. We make sure that whenever we're doing something on our user's behalf, it's transparent: they can see what the model has done and check it for themselves, so that they're the ones who are the final signatory. They sign off on it. If you do it right, there's a lot of benefit to your customers, because what used to be them, for extraction work, for example, filtering through tons of documents and finding the morsels, instead you can say, here, I've done that for you. [00:05:00] Here's an easy way to check it. And they can be like, right. So you've saved them a lot of time and foregone this penalty of 100% accuracy. hugo: I love it. The penance I really like, because you framed it as: understand that it's an expectations mismatch, right? People are expecting data-powered software, ML- and AI-powered software, to act like what we think of as software, which also doesn't work a hundred percent of the time, but it does a lot more often. So I think framing it as helping people understand that it's less dependable than classical software, less dependable than humans as well, and figuring out how we can work with it. john: Yeah, absolutely. It's all about reframing it so that your users know that you're gonna save them a lot of time and they have the correct expectations going into it. So they're not gonna be frustrated at your company. They're gonna be like, well, I've saved a lot of time, but this is just one of those things. hugo: I'm also interested in the fact that you say you often don't take this work on, because you could [00:06:00] imagine a parallel universe where you're like, oh, what an exciting opportunity to help reframe things for these people. I appreciate that that may be very challenging. So I wonder what advice you'd give to freelancers who come up against this problem if they wanna work with such people. john: I would say figure out how flexible your potential client is on this topic, and also how deep they are down this road to perdition. If your client is really adamant that this has to be 100% accurate,
if it's a domain where getting it wrong is really gonna get someone in hot water, like legal documentation or medical stuff, then maybe you don't take those. But a lot of times clients will come around and understand: okay, I don't really expect a human to be 100% accurate; it makes sense to me that an AI is gonna be less accurate these days than a human; you can talk me into that. And so a lot of people will be flexible enough, early enough, that it's still salvageable. You can still make the most of it. hugo: [00:07:00] Absolutely. Moving on to sin number two: they tell us 2025 is the year of agents. I think many years will be the years of all different types of agents, to be honest. What are you seeing, like some failure modes and anti-patterns for working with agents? john: Well, if you expect to give your agent a chunk of work and let it go off and work for 20 minutes and come back with the correct answer, then you're probably gonna be hurting a little bit. I suspect most of us are now vibe coding all the time. If you expect to say, build me this library, and walk away, get a coffee and come back, and it'd be exactly what you were thinking, then it's often gonna be a huge mismatch. And there's two problems with that right now. The one problem is that the models just aren't strong enough yet. It's been amazing how much we have improved the strength of these models in the past three years; it's like every month it's getting better and better. So the solution to this problem is to wait a month. You might just be lucky, it might be strong enough. [00:08:00] But the other problem when giving an agent a task that's just a little bit too big is something that we're never going to get away from. It's the nature of language. When I say I want something of medium complexity built and I walk away, the space of possible things that can be built is enormous. Not all of those meet all of the implicit assumptions that I had in my head when I said the thing and walked off to grab my coffee. When you come back, if the model is strong enough, it will have satisfied your request, but it's going to be in this kind of null space of, you know, all the assumptions: that one wasn't what I was really talking about. Now, there are ways to overcome this, and one important caveat. The one important caveat is: if the work doesn't have a closed-form solution, if it's open-form, then agents are actually kind of prime time, go for it with that. So agentic research is awesome, 'cause I can't read the internet over and over. These things can do a main search, you have a bunch of little threads [00:09:00] to untie, and then do parallel searches, and you have like an army of little agentic warriors pulling back and organizing information for you. That's a great usage. But if it has to be a specific solution, then the penance for this sin is to make sure you keep your user in the loop. When the user says, I wanna do this thing, make sure you give the model a chance to come back and say, you said this thing, do you mean this or do you mean this? Let's narrow it down a little bit, 'cause I'm confused. Give your model a chance to do that. And ideally, I still think it's gonna be really useful to keep your user in the loop as the process is happening: make the process transparent to the user, let the user work with the agent and see what it's looking at. Maybe if we get to the other side of this conversation, Hugo, that's one of the big points on that one: how to engage with the user.
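To make that keep-the-user-in-the-loop idea concrete, here is a minimal sketch, not John's implementation, of giving the model an explicit way to ask one clarifying question before committing to a plan. The JSON convention, the model name, and the use of the OpenAI Python client are all assumptions for illustration.

```python
# A minimal sketch: before acting on an underspecified task, let the model
# either ask one clarifying question or return a plan. JSON keys, model name,
# and client are illustrative assumptions, not John's actual setup.
import json
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "You help plan tasks. If the request is ambiguous, respond with JSON "
    '{"clarify": "<one question>"}. Otherwise respond with JSON '
    '{"plan": "<short plan>"}. Respond with JSON only.'
)

def plan_with_clarification(task: str) -> str:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": task}]
    while True:
        reply = client.chat.completions.create(
            model="gpt-4o-mini",  # hypothetical model choice
            messages=messages,
            response_format={"type": "json_object"},
        ).choices[0].message.content
        parsed = json.loads(reply)
        if "plan" in parsed:
            return parsed["plan"]
        # Surface the model's question to the human instead of letting it guess.
        answer = input(f"Agent asks: {parsed['clarify']}\nYour answer: ")
        messages += [{"role": "assistant", "content": reply},
                     {"role": "user", "content": answer}]

if __name__ == "__main__":
    print(plan_with_clarification("Build me a medium-complexity library."))
```

The same loop generalizes to surfacing intermediate steps as the agent works, so the user stays the final signatory.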
hugo: Absolutely. Giving too much to do is always a bad idea, whether it's an intern, an agent, or an LLM. The next sin revolves around using context irresponsibly, part of which can be [00:10:00] giving it too much. I also love that you showed some counterexamples, particularly research agents or analysts. This whole course is built upon the premise that we want to build consistent, reliable software that isn't too stochastic, and a lot of software we do want that with. But with research assistants or analysts, if you send out 10 different research agents, you do want 'em to come back with different stuff so you can synthesize it as well. So even the stochasticity there can be a quote-unquote superpower. I do wanna move into the next sin because it's so related, which is really about using context irresponsibly. Julian in Discord has just mentioned he loves your book, says fantastic book, in his opinion by far the best on relevance engineering. Thanks so much, Julian. And of course, any questions you have, please put them in the chat on Discord. John, it seems like July is the month of context engineering, so maybe tell us a bit about the sins of context. john: Right, so my previous book is Prompt [00:11:00] Engineering for LLM Applications, and what Hugo's referring to at the beginning of our conversation is that I just tweeted a post where I've covered up the name Prompt Engineering with Context Engineering, like I wrote a new book. But what we were really getting down to in that book, the whole middle chunk of that book, was this problem of context engineering. In times past, which would be like six months ago these days, they were talking about how these models have bigger and bigger and bigger context windows. And indeed, the first models, when I was working on Copilot, it was 2,048 tokens. That's nothing. So now, I think the standard with the larger models is like 200,000 at least, and a lot are at a million. But if you are filling up that many tokens, you are not gonna get the performance that you want. Both in terms of cost, it's gonna be a lot more expensive, and latency: it's gotta read all that crap, even though it reads fast, a lot faster than it writes. But also accuracy: the models' attention [00:12:00] mechanisms are not infinite. So, like Anthropic, they used to say, maybe they still say it on the website somewhere: if your content is less than 200,000 tokens, just shove it all into the prompt and do RAG over the full content. But actually, I guess I can copy and paste it into chat, let's see if this works, this thing came out in the past week where they're studying how much long context degrades the ability of models to solve hard problems, and it's kind of shocking, but anticipated. Empathize with these agents: understand that if you were handed, you know, 200,000 tokens, a stack of books that high, and told to find and assimilate an answer from everything you read, you're just not going to do a great job at it. So the solution, the penance for this particular sin, is to use context more responsibly. You [00:13:00] know, I'll sell my book again: buy that book, the middle three or four chapters are all about this. You're in a problem space; your user has some sort of problem domain they're looking at. First, lay out all of the potential context you might use on the table. Figure out how to create some sort of ranking or relevance algorithm to tier it: this has to be in there, otherwise it's not gonna make any sense; this would be nice, this is more of a nice-to-have. Maybe instead of tiers you can use something more like a floating-point score: this is more relevant, this is less. There's different ways you can do it; you have to figure it out for your own domain. Once you have those articles, you can say, okay, given all the content I could use, here's the stuff ordered by how I want to use it. Have a template that you shove the stuff into, and make sure that you as a human can read the thing and it's understandable, 'cause, as I've seen on teams I've worked with, you have to understand that the agent is waking up for the first time in its life with every request. It needs to read the prompt and understand it. Make sure you can trim the stuff and size it [00:14:00] so that it fits into your token budget, not the max token window size.
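Here is a minimal sketch of that tiering-and-budgeting idea: score candidate snippets, always keep the must-haves, then pack the nice-to-haves by relevance into a token budget rather than the model's maximum window. The scoring, the four-characters-per-token heuristic, and the budget number are stand-ins you would replace for your own domain.

```python
# A minimal sketch of context tiering: required snippets always go in,
# optional snippets are packed best-first until a token *budget* (not the
# model's max window) is used up. Numbers and scoring are illustrative.
def n_tokens(text: str) -> int:
    # Rough heuristic (~4 characters per token); swap in a real tokenizer
    # such as tiktoken if you need accuracy.
    return max(1, len(text) // 4)

def pack_context(snippets: list[dict], budget: int = 2000) -> str:
    """snippets: [{"text": ..., "score": ..., "required": bool}, ...]"""
    required = [s for s in snippets if s.get("required")]
    optional = sorted((s for s in snippets if not s.get("required")),
                      key=lambda s: s["score"], reverse=True)
    chosen, used = [], 0
    for s in required + optional:
        cost = n_tokens(s["text"])
        if s.get("required") or used + cost <= budget:
            chosen.append(s["text"])
            used += cost
    # Keep the template readable to a human; if it confuses you,
    # it will probably confuse the model too.
    return "Relevant background:\n\n" + "\n\n---\n\n".join(chosen)

print(pack_context([
    {"text": "User question: when do we restock SKU-42?", "required": True, "score": 1.0},
    {"text": "Email thread about SKU-42 stocking decisions...", "score": 0.9},
    {"text": "Unrelated newsletter footer...", "score": 0.1},
]))
```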
hugo: Awesome. Brad, you have a question? brad: Hey, thanks John. I fully agree with everything you guys just spoke about in terms of context engineering. I'm interested to know how you think about underlying models improving. Noam Brown refers to this as just scaffolding, and obviously he's very optimistic about how this is all gonna play out, but he's alluding to the fact that developers do less scaffolding and assume the model's gonna get better. So all of those things you've mentioned about context engineering: how much time do you put into those for clients? How fast do you think models are gonna get better and start abstracting away some of those things that you're doing? john: I don't think models will ever completely abstract away writing a good prompt and building context responsibly. I think there are some frameworks out there that are playing towards that end, and they're actually very interesting. Like, DSPy is a great example where you don't write any prompts, you just write functions, and it's very intuitive in some respects, [00:15:00] but I fear them, from my experience, in that they completely hide what's going on. You can't open the control panel and turn whatever knobs you want; it's completely hidden. I don't know what the future holds. It could get so much better that you just don't ever have to worry as much about the engineering, so maybe you just really rarely open up the whole thing. Or, and I've been kind of toying around with this idea, you could go the exact opposite way, where rather than having frameworks at all, the models are so rock-hard good at following the system message that you write your if statements in text, and the whole workflow might be encoded in that in the future. I think that's also a little bit far-fetched for the near future. But generally, I mean, I wrote the prompt engineering book, I suspect that prompt engineering is gonna be with us for a while. It's not just scaffolding. hugo: Thanks so much for that increased amount of quote-unquote context, John, I appreciate it very much. I think this ties into another sin we'll get to, which is that you don't wanna [00:16:00] treat systems as black boxes too much, 'cause we already have a lot of black boxes here, even in the transformer architecture, right? So to your question, Brad: even if there is a future where you don't need to do those things, in the edge cases or failure modes where things do go wrong, if you've used this approach it's almost impossible to debug, because you've created black boxes.
And what we're seeing, and John and I will get to this, is that when you have these modular workflows, remember we're working under the dataflow paradigm, right? So we've got these modular workflows and different pieces that we can improve upon and different levers we can pull. And once you do your failure analysis and see, maybe it's the OCR or whatever it is, you can work on that, like the ETL part of your pipeline, and not touch the model, and these types of things. Whereas when you have infinite context, it'll be very tough to debug and figure out these failure modes. I love the next sin, John, 'cause we are here in the spirit of simplicity as well. I haven't taught this workshop in this course yet, but I do teach a workshop where I build a demo of using CrewAI to build a multi-agent system where [00:17:00] I essentially need it to have nine tool calls. I need to run it a hundred times to actually do seven or eight, and it never actually does all the tool calls. There's also a lot of forgetting that happens. That isn't to say these tools aren't wonderful to use in a variety of ways, but the next sin, I think, is a complexity issue with respect to multi-agent systems. So maybe you can tell us a bit about that. john: Basically, if you start out by saying, I'm going to take my really complex thing and I'll give it to this high-level framework where I just say, here's your task, go do anything you want, I think a lot of times you're gonna come into a less nice outcome than if you just build the simple thing first. A lot of times there's an opportunity to say, okay, well, here's what I'm thinking through. Maybe even at the beginning I think, man, this is gonna be hard, I need stuff to happen in parallel. Try to build the simplest thing first and see if it falls flat. I've got a great example with a client right now where I'm [00:18:00] extracting basically inventory decisions, stocking decisions, from email threads. The old way of doing things was discussing over email; they're trying to modernize it by pulling stuff into web apps. I thought to myself, I'm gonna need some sort of agentic thing to pull this type of information, and I decided to just do the dumbest thing first, which is: I made a Pydantic type that had all the information I want, I made a system message that's like, hey, find this stuff, and it pulled out basically everything I wanted. I can now see the tweaks I'm gonna make. What I was able to forego there is building the complex thing that I thought we would ultimately need. Instead, the real pain felt is that I'm missing some corner cases; I think I'm just going to update the system message and the field definitions for the structured output I'm getting, and that'll be good enough this time.
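As a rough illustration of "the dumbest thing first", here is a minimal sketch of one Pydantic model plus one system message. The field names, schema, and model name are made up for illustration, not John's actual client code, and it assumes the OpenAI Python client with JSON-mode output.

```python
# A minimal sketch of structured extraction: one Pydantic model, one system
# message, one call. Field names are hypothetical stand-ins.
from openai import OpenAI
from pydantic import BaseModel

class StockingDecision(BaseModel):
    sku: str
    decision: str               # e.g. "reorder", "discontinue"
    quantity: int | None = None
    rationale: str | None = None

client = OpenAI()

def extract_decision(email_thread: str) -> StockingDecision:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[
            {"role": "system",
             "content": "Extract the inventory stocking decision from the email "
                        "thread as JSON with keys sku, decision, quantity, rationale."},
            {"role": "user", "content": email_thread},
        ],
        response_format={"type": "json_object"},
    )
    # Validation is where the corner cases surface; tweak the system message
    # and the field definitions as you find them.
    return StockingDecision.model_validate_json(resp.choices[0].message.content)

print(extract_decision("Re: SKU-42, let's reorder 500 units before the holidays."))
```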
But if you go head first and say, let's just make some orchestrated multi-agent system, I think there's a lot of chance for it [00:19:00] to run off course. I dumped another link into our chat, which is a great academic article on this. They look at multi-agent systems and how, in their case, you can read it yourself, they underperform analogous single-agent systems, or simple systems, and they categorize the reasons for going off the rails. And it's things just like the handoff: when you get multiple players in the same room, it's hard to figure out how to take all the information that this one needs from the orchestrator and make sure that it survives, and make sure it gets back to the orchestrator. That's where these multi-agent systems sometimes lose out. hugo: What's the penance? How do we solve this? john: Start simple and build until you really see that you need something complex. I think there still is a need for multi-agent systems, but knowing which multi-agent system you need and the shape it has to have is much better informed by starting simple, feeling the [00:20:00] pain, and then, you know, building into it, as opposed to saying, I think there's these roles, just go. Because then when something goes wrong, it's harder to pick it apart and figure out exactly where the damage occurred. hugo: Absolutely. And actually the upcoming sin is the black box of RAG, and it's not entirely unrelated; of course, this is why this ordering makes sense. In the first workshop of this course, we built a basic RAG system in five or ten lines of code using LlamaIndex and showed how amazing it is for demos, but how suddenly you can't even figure out what the prompt is. It abstracts over embeddings and chunking and all of these things. Of course, Hamel wrote a great post based around these experiences, right? But the lesson here is quite general: use vanilla Python and API calls and write your own functions initially, perhaps, in order to build things up, just as you would a retrieval system, so that you can inspect it and introspect it, right? john: Absolutely. That's one of my pet peeves, since my background is [00:21:00] retrieval. Everyone says RAG like it's a new thing, like it's one thing, and there's this voice in my head screaming, no, no, no, you're doing it wrong. If you take the RAG black box off the shelf and implement it, then when something goes wrong, you're kind of out of luck, of course, because it was a black box when you pulled it down off the shelf. Instead, if you really want to invest the time to make it good, think about it as a transparent box. Think about it as two things, at least. There is retrieval, which we have been doing as an industry in some form or another for probably 70 years, and then with Google, everyone knows search, so you've experienced it yourself for these 25 years. Then on the other side there is the large language model application that uses a tool; it happens to be a tool that is doing search. You can think about it as the ReAct flow, which is a paper that was written when I started all this stuff three years ago. [00:22:00] So if you look at these two ingredients in isolation, you're a lot better off when something goes wrong. Now you have a pipeline and you can break it down. The answer's wrong, but is the answer wrong because the model is weak? Then we improve the model. Actually, I'm looking at the context and I don't understand it as a human: then I need to fix the template so it's more readable, more legible to the model. Okay, maybe the template's good, but if you look at the information here, it's totally irrelevant. In the template we say this is relevant, and it ain't, so then we go back to retrieval land and we start figuring out precision and recall, and try to figure out what is going wrong with our system and why it's not matching the documents that we want. You can back it all the way up to, well, okay, the ETL pipeline: did I not tokenize the document correctly for lexical search? Did I not chunk the document correctly for semantic search? It's a pipeline. We can chop it up, isolate the damage, [00:23:00] and repair it.
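Here is a minimal sketch of what that transparent pipeline can look like: retrieval, prompt assembly, and generation as separate functions you can inspect and test in isolation. The toy keyword scorer stands in for real lexical or semantic retrieval, and the model name and documents are assumptions for illustration.

```python
# A minimal sketch of RAG as a pipeline rather than a black box: each stage
# is a plain function you can log, test, and fix independently.
from openai import OpenAI

client = OpenAI()

DOCS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Shipping to Australia usually takes 7-10 business days.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Toy lexical scoring: count shared lowercase tokens. Swap in BM25,
    # embeddings, or both for real use.
    q = set(query.lower().split())
    ranked = sorted(DOCS, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    # Keep the template readable to a human, and log it so you can debug it.
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def answer(query: str) -> str:
    prompt = build_prompt(query, retrieve(query))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(answer("How long do I have to return an item?"))
```

If the answer is wrong, you can check each stage in turn: the retrieved passages, the assembled prompt, and only then the model.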
hugo: And now for a word from our sponsor, which is, well, me. I teach a course called Building LLM-Powered Applications for Data Scientists and Software Engineers with my friend and colleague Stefan Krawczyk, who works on Agentforce and AI agent infrastructure at Salesforce. It's cohort-based, we run it four times a year, and it's designed for people who want to go beyond prototypes and actually ship AI-powered systems. The link's in the show notes. I love it. Just to reiterate: the sin is believing that RAG is a single thing in a black box. Realize that we can separate it into its component parts and see what's up. Some of them are black boxes in themselves, but see what's up there and figure it out. Once again, the penance is to build it yourself, using a custom end-to-end implementation for ingesting, indexing, querying, retrieval, prompt generation, and response generation. In John's words, if quote-unquote RAG then breaks, you can trace it down to the component in the particular pipeline and fix it. I love that we're having this conversation now because it foreshadows what's coming: we're doing a lot of RAG and starting with [00:24:00] agents, or a lot of retrieval and starting with agents. Next week we have guest talks with Jason Liu on the things he's up to at 567 Labs, with Shreya Shankar on retrieval for unstructured data, and with Ines from spaCy on a lot of the work they're doing at spaCy and Explosion around PDF ingestion in particular. Unsexy stuff, but the real stuff that helps us build these systems. Mike Powers, what's up, man? mike: So maybe I'll just ask John to say a couple of words on something that he helped me understand, which is: there's sort of an impression that retrieval in RAG means vector database, and the two are not the same thing. Retrieval is anything you do that's effective to manage or prep the context that you're gonna send in. Semantic search might be part of that, but as John was pointing out earlier, there's all sorts of other search techniques that might be applicable in various scenarios. I thought that might be a point worth reinforcing, John, if you have a couple of thoughts. john: Yeah, thanks for making a point that I love making in front of people. Part of it is that my background is lexical search. [00:25:00] That was the hammer that I hammered down every nail in the world with for 10 years of my career. Lexical search is looking for token matches: basically you've got a document, you're chopping it up into tokens and indexing those tokens, and through the whole process you can still see the tokens. Lexical search makes it really easy to debug problems. You can change ranking parameters, relevance parameters, while the data is live, and so it's really useful. But it's not quite as good at some things. Semantic search is really good at finding meaning; lexical search is really good at finding specific phrases. And sometimes you need both, and there's no reason that you can't have both. Mike's point, that he's helping me make, is: don't presume that semantic search is the only game in town. When you really get down to it, having your tool make SQL queries is a type of RAG, right? It is retrieving something. I [00:26:00] really like the idea of thinking in these generalities. There's a lot of simple things that you can use to build up the complexity of whatever you wanna build.
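A minimal sketch of that "no reason you can't have both" idea: blend a lexical overlap score with embedding similarity. The weighting, the toy lexical score, and the embedding model name are illustrative assumptions, not a recommendation.

```python
# A minimal hybrid-retrieval sketch: combine a toy lexical score with cosine
# similarity over embeddings. Weights and models are illustrative only.
import math
from openai import OpenAI

client = OpenAI()
DOCS = ["Return policy: 30 days.", "Error code E42 means the pump is jammed."]

def embed(text: str) -> list[float]:
    return client.embeddings.create(model="text-embedding-3-small",
                                    input=text).data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def lexical(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(1, len(q))

def hybrid_search(query: str, alpha: float = 0.5) -> list[tuple[float, str]]:
    q_emb = embed(query)
    scored = [(alpha * cosine(q_emb, embed(doc)) + (1 - alpha) * lexical(query, doc), doc)
              for doc in DOCS]
    return sorted(scored, reverse=True)

print(hybrid_search("what does E42 mean")[0])
```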
hugo: Really appreciate that clarification. I gotta say, the next sin, once again they're all related, but this is one near and dear to my heart. It's about, once again, reducing complexity, but this sin actually involves trying to pack too much into one go. So tell us about sin number six. john: Yes, the little brother of that. One of the earlier sins was like, don't say, solve this thing in 20 minutes and I'm gonna come back; that's kind of like one-shotting a project. This one is about one-shotting even per request. If you say to a model, here's a big bunch of data, extract all this complexity out of it, then it's often not going to do a good job of that. The solution is to try it first: see how well the models do, start getting your evaluation set up, and figure out if it's working. If it does, then go for it. [00:27:00] But if something is starting to break down, don't just keep changing that prompt. Break the problem up into smaller pieces. We're taking restaurant websites and extracting different pieces, like their schedule and their menu items. There's no reason to have one agent do everything. You can have parallel requests and actually speed stuff up potentially by doing things in parallel, and have it isolated. I've seen behavior that was kind of damaged when we've tried to stuff too much into our prompt. For example, when I was at GitHub, we used a large language model as a judge for looking at conversations inside the Copilot application and saying, was this good or bad? Does this have these types of qualities? We gave it a rubric, a list of questions to checkbox: it had this, it had this, it had this. And if you overloaded it, it would start completely lying. It'd be an obviously wrong answer that you'd look at and say, how do I trust the judge? It would even make up questions that didn't [00:28:00] originally exist. But shrinking that down really helps. In a similar domain, and I've learned this from one of my friends, this is kind of their whole thing: when you're looking at hallucinations and trying to figure out if these phrases are hallucinations, don't say, here's all the facts we pulled out, find the hallucination. Instead just say, here's the source document, here's the fact: true or false? Here's the same source document, here's the next fact: true or false? You have to trade off cost and accuracy, but it's generally better to break it up like that.
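Here is a minimal sketch of that one-fact-at-a-time pattern: a separate true/false call per extracted fact against the same source document, trading extra calls for accuracy. The prompt wording and model name are assumptions for illustration.

```python
# A minimal sketch of per-fact verification: one small true/false judgment
# per statement, instead of one giant "find the hallucinations" prompt.
from openai import OpenAI

client = OpenAI()

def is_supported(source: str, fact: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[{
            "role": "user",
            "content": f"Source document:\n{source}\n\nStatement: {fact}\n\n"
                       "Is the statement supported by the source? Answer true or false only.",
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("true")

source = "The restaurant is open Tuesday to Sunday, 5pm to 10pm."
facts = ["The restaurant opens at 5pm.", "The restaurant is open on Mondays."]
for fact in facts:
    print(fact, "->", "supported" if is_supported(source, fact) else "unsupported")
```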
hugo: Totally agree. Even back in the early days of this technology, people would say, an LLM will rarely give me the answer I want the first time. I would always say, do humans do that? You actually need to know someone quite well and have a communication style down with someone; very often, you need to ask follow-up questions. Understanding that with respect to these systems is absolutely key. The final sin is one that's so important to me, particularly in a course like this where we talk a lot about what frameworks [00:29:00] to use and when to use vanilla Python and APIs. And as we've hinted at, and we're being somewhat explicit about it, a lot of the time when building these systems, no matter what frameworks you adopt, you will eventually end up starting to hand-roll your stuff yourself, particularly as you begin even the basics of evaluation. But it has to do with frameworks and logos. So tell us about the deadly sin that can result from indexing too much on frameworks and logos. john: Yeah, maybe it's the least deadly, but if you spend too much time here, you'll starve to death. The temptation that even as a consultant I have is: every week there's new SDKs for, you know, putting together agents, and there's evaluations, how do we do evaluations? Well, there's a lot of different platforms out there. And there's all sorts of different models that are coming out every day with interesting nuances, and it's just an explosion of logos around you. There is a time to look into that, but honestly we're [00:30:00] still early in this; it has only been three years that the world has been upside down. I think the ecosystem is largely sorting itself out right now. You're starting to see standardization with the model APIs; you've noticed that they're all looking pretty similar at this point. That's great. You're seeing some standardization with observability platforms, evaluation-type stuff; there's a lot of similarity, they're offering similar feature sets. So we're seeing good signs. But still, you need to be able to say, show me the prompt. I think you need to be okay with just digging in and building something yourself. Realize that even though the information these models produce is amazing, and it's the first time humans have built a tool that talks back to us, it's still just an HTTP request, it's still just streaming data, and so it shouldn't be too scary to dig into it yourself. And there's things that you'll always want to do yourself without indexing on someone else's solution. [00:31:00] If you're evaluating in your domain, then build human review tools that are for your domain. It'll take you 30 minutes of vibe coding and you'll have the perfect thing for your domain. Don't just search around to find somebody that solved the problem for you. Hooking into other people who have solved the more general problem, yeah, maybe that's good, but you've got to do your own thing at some point. Eventually the ecosystem will have sorted itself out and I'll change my tune, but not yet. So what's the penance? Dig in. Do enough research to figure out what the options are, pick one, and then dig in, and anticipate that something crazy will happen. You're gonna rewrite a lot of stuff anyways. I think that's the world we live in: you can write code so fast, it's just that you're gonna have to write it three times. hugo: So you mentioned, for example, vibe coding certain tools like JSON viewers and annotation tools. I love this example 'cause it's a wonderful example of where vibe coding is a great thing to do. I don't know enough about front-end [00:32:00] stuff to really know if what I'm building would be right for anyone else, but it works for me. I have a few tests to make sure that it's actually pulling my data correctly and labeling it, like doing what I think it does.
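In that spirit, here is a minimal sketch of a home-rolled review tool: page through logged traces and record a pass/fail label with a note. The file names and the trace schema are hypothetical; the point is that something this small is often enough for your own domain.

```python
# A minimal sketch of a domain-specific human review loop over logged traces.
# File names and the trace schema ({"prompt": ..., "response": ...}) are made up.
import json
from pathlib import Path

TRACES = Path("traces.jsonl")   # one JSON object per line
LABELS = Path("labels.jsonl")

def review():
    with TRACES.open() as f, LABELS.open("a") as out:
        for i, line in enumerate(f):
            trace = json.loads(line)
            print(f"\n--- trace {i} ---")
            print("PROMPT:\n", trace["prompt"])
            print("RESPONSE:\n", trace["response"])
            verdict = input("pass/fail (or q to quit): ").strip().lower()
            if verdict == "q":
                break
            note = input("note (optional): ")
            out.write(json.dumps({"trace": i, "verdict": verdict, "note": note}) + "\n")

if __name__ == "__main__":
    review()
```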
I'm wondering, this being preoccupied with logos as a sin, as an anti-pattern: what type of stack do you see starting off with that works most of the time? john: Honestly, I would be interested in the audience's answer to this; I'm searching myself. Coming from Copilot at GitHub, we started only on OpenAI and there were no frameworks, and so I developed my experience in that ecosystem. When I'm building something like a simple demo for a blog post, I'll usually just do that, 'cause it's just HTTP requests. But I have been looking more into the established frameworks. I've recently started working more with LangGraph, and I find the way it thinks about directed nodes, it thinks a lot like the way I do about [00:33:00] workflows. But I'm still a little bit hesitant to say this is the way, because it's also still fairly heavyweight. I fear the day where I have a stack trace and it's 15 layers deep in something, where I've compiled a graph and it's not running my code but it's running code somewhere else, and it freaks me out a little bit. But I think it's becoming a decent starting point. I think they had a rough spot about a year ago and have largely overcome that. hugo: Totally agree. I'd love to hear from everyone on Discord; we can chat more about tools that have worked for you, as we have been doing throughout the course: using Python and APIs, hand-rolling things, and adopting tools and building on top of them as well. We've gone through the seven deadly sins of AI app development. I appreciate you carrying through with the sacred overtones of this experience. Sins and flashy things will divert our attention from the happy path, the straight and narrow way of building the things that serve us and our communities. So once we have [00:34:00] figured out our penance for all these things, what is the straight and narrow way that then emerges? How do we walk down it together? john: Alright, well, shall I speak in parables to you then? hugo: Absolutely. john: All right. I have several things, and I would love to talk more, but since we are at time, I'll leave you with something that I've referred to a lot. The straight and narrow path here is to empathize with the large language models. Think about them as a super-intelligent, super-eager, kind of forgetful AI intern with ADHD. And so the idea is that these things are really smart: they've read the internet five times and memorized a lot of it. You can ask for specific quotes, and it's amazing how much they've memorized. But they're not psychic: if anything is outside of their training set, you have to provide it to them. That's like providing tools to access information on the outside. And they don't read your garbage: they're trained on reading documents, just like we have been [00:35:00] trained to read, but if you just throw something at them that wouldn't make sense to you, it's not gonna make sense to them either. So empathize, read it through their eyes, as if they're coming to it fresh. The agents prefer familiarity. If you can describe something in terms that are readily available in an online domain, if you can use familiar motifs, use Markdown, the agents are gonna do a lot better about satisfying that pattern. When you get into stuff that's behind corporate walls, remember it's gonna be a little tricky; the models don't know this a priori. The models need time to think, so give them a chance to do chain-of-thought reasoning; these days, you can also use the reasoning models themselves to actually spend some more time on the problem and come to better solutions. And they've got ADHD, so don't just shove all the information in the world in there, 'cause they'll get distracted, they'll follow some other thread, biased by information you put in that you thought was just spurious; they'll latch onto it and do something with it. Also, [00:36:00] they are forgetful, and this might be, I think, the one I'll close with: they're forgetful. You have to remember, as soon as you issue that prompt, the model wakes up. It's a brand new day. It has never thought anything before.
So if you have something that. Your human intern that you welcome into your company, they're gonna make mistakes on their first day and they'll come back the next day and learn more. But the large language model is gonna make mistake and with no feedback mechanism that you have to supply, there's no way that it can possibly get better. It's going to make the same types of mistakes over and over again. So that's the way. I've kind of anthropomorphized the model for myself. I think about it as being my scatterbrained smart AI intern. hugo: I, I think that's it. I love the idea of empathizing with the model and understanding the system you are working with. Its affordances and, and what it's bad at. As I said earlier, we talk about hallucinations in the space a lot. We don't have as much conversations around the forgetting. And I, I think that's, that's incredibly important to [00:37:00] recognize. Along with everything else you identified, we are at time. I can stick around for another 10 minutes if anyone wants to chat a bit more. I don't know if you are able to, John, I got to the bottom of the hour, so that's, that's cool. Well, so we can call it a wrap with that, but if we have any questions or anything people would like to discuss with respect to these antipas and how John's thinking about fixing them, that would be great. Or any reflections as well. john: If there are no reflections. I'll say this and then we'll go to Brad, but I'm actually curious about how you guys would answer the second to last question I had. What is your stack? I've been asking everyone this and getting interesting answers. brad: Sorry to interrupt. Brad, let's go with you first. No, I don't mean to hog them like my questions related to the stack. I really like what you said about thinking and generalities. My, my question is, if you are building a stack for two separate clients at the moment. Where we are now, it's very custom to the user intent or the client or the problem. Do you see like things converging, so what I [00:38:00] mean by that, do you see yourself converging toward a similar framework or a similar agentic structure for those two clients as things progress or are we going the other way? That makes sense. I think for domains john: like observability and a lot of those are converging. I don't have to think hard about that. I think that the models themselves like. I had, everyone did completions at first, which no one really thought about. It was literally completing documents. I did check, so everyone did chat. So everyone did everything. This is actually a good sign. Everyone is understanding when things appear and the models are converging and their APIs are converging. Not perfectly though frustratingly not. Exactly, but the stack that you use to actually build the agent, which I think is what you're asking about, I don't know, because in some respects there's, there's so many good ideas out there. D-S-P-Y-D, spy, however you say, it is a really good idea. I disagree with some of its founding principles, but I like the idea of. Something that [00:39:00] automatically optimizes stuff. That's a really cool idea. There are several frameworks out there now that are kind of falling into this directed graph, which is very intuitive in a lot of respects, so I like that too. To them though, they act different. It's like they're not API equivalents. You can't learn one and immediately know the other. You can learn directed graphs and do whatever you want with it, but these are still different. 
There's still kind of the wild card of what happens next with the models as they get smarter. Like, in my crazy scenario, what if you don't need an SDK for anything? You just write, Dear Abby, I want this exact thing to happen. We might be there at some point. You might just talk to these things, and all it takes is OpenAI making another little feature that's like, hey, we're just gonna remember whatever you said and make sure it absolutely happens. I don't know what shape that would be, but we're still looking at a lot of wild cards, more wild cards in front of us than behind us. hugo: I posted a screenshot of a message I got last week from someone who may help build a few things. [00:40:00] He wrote to me and said there are several issues. The first is they started using LangChain and LangGraph, probably too early, he wrote; it's a choice that seemed to accelerate development in the beginning but became a regretful decision for many reasons. This is the anti-pattern I see in tool adoption: people adopting tools too early and then churning, you know, 12, 18 months later. You can see the tweet from Sam Hogan about this type of churn from frameworks. So once again, I'm not telling you which tools to use, but the principle is starting simple and then adopting tools as the need arises, and then recognizing you'll build things alongside, in addition to, or on top of these tools as well. And I have seen people, you know, Braintrust is incredibly popular, Arize Phoenix, Logfire, there are a variety out there, and all the Lang-star tools, like the LangChain ecosystem, right? But once again, adopt tools mindfully and realize the potential costs later on. I'm wondering, [00:41:00] someone actually posted Weights & Biases and Weave, and I'm wondering, Ilona, if there's anything you'd like to add. Have you used, or do you use, Weave in production workflows, or...? ilona: I've used it sometimes, but more for POCs at the moment. What I like is the evaluation and observability; it's so easy to use for everyone, even if you're not an engineer. So it was a great fit. hugo: I love that, and you're speaking to the power of dashboards, right? And creating dashboards. And of course, Weights & Biases was doing this; this was their bread and butter before the ChatGPT moment, right? But they were doing it for machine learning workflows; of course, observability changes in some ways but stays the same in others when we enter the generative AI landscape. Super cool. I see Julian typing. Julian, is there anything you wanted to say or add? We might wrap up after the next question or comment. john: I have one more question, I'm just curious; take Brad's question to the audience, so I'll have one more data point. What agent-building SDKs do you use to build your prompts, like that type [00:42:00] of domain? What's good? brad: Including doing it directly to the model? Was the question what do you use to refine, or...? john: Oh, sorry. Building a large language model application, so in the same vein as LangGraph, for example. mike: I'm still in experiment mode, where I'm trying different things, and for each project I try something new intentionally. But if I'm doing something straightforward, like a proof-of-concept type thing, LiteLLM is sort of a front end. I personally am not super interested in writing an adapter for each individual model that I wanna use. I wanna be able to just change a string somewhere: hey, see what Gemini does with this, see what OpenAI does with this, and compare some things. So LiteLLM, and it seems to be pretty well understood. Another question is just how much of a factor it is, 'cause I do a lot of AI-assisted coding: how well does the assistant understand it, or how up to date are the kind of MCP-style docs that I can feed [00:43:00] to Cursor? It's a negative if it's a bleeding-edge thing that I have to read the docs for myself, you know, because they're not accessible to the assistant. But LiteLLM checks some of those boxes for me right now. john: That's good feedback.
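For reference, here is a minimal sketch of the change-a-string workflow Mike describes, using LiteLLM's OpenAI-style completion() wrapper. The model names are examples, and each provider still needs its own API key in the environment.

```python
# A minimal sketch of swapping models by changing a string with LiteLLM.
from litellm import completion

def ask(model: str, question: str) -> str:
    resp = completion(model=model, messages=[{"role": "user", "content": question}])
    return resp.choices[0].message.content

question = "Summarize the trade-offs of long context windows in one sentence."
for model in ["gpt-4o-mini", "gemini/gemini-1.5-flash", "claude-3-5-haiku-20241022"]:
    print(model, "->", ask(model, question))
```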
hugo: I know, Ilona, you've definitely used a variety of different tools and hand-rolled different things in production. Are there any tools that you're a fan of in your work? ilona: What we need is a lot of things that are easy to understand for auditors. For example, in the healthcare area, they want to see evaluations a lot, and they want to understand all the results, and that's why I love Weights & Biases and all these things. Always having in mind that regulatory requirements are necessary for production-grade software development in the regulated domains; it's very important. hugo: Good feedback. We were briefly riffing on... oh, please, David. david: I have a quick question. Most of the people that I've been working with are really small businesses, and so I think there've been a lot of [00:44:00] examples. Sorry, I'm arriving late after I was triple-booked here, so maybe this was already discussed, but the notion of having a bunch of different cloud services to be able to store the information, or how the application actually gets deployed. The way that I've been doing it is packaging everything up in a Docker Compose solution, then leveraging Simon Willison's LLM package that does some auto-writing into a SQLite database. It's kinda like a poor man's version of these systems. But I'm just curious: is that how most people doing this are working? Or are you asking the client, maybe it is a larger client, medium to large businesses, where it's like, okay, get a subscription for Weights & Biases so that you can push your evals there, and get a subscription to this, and get a subscription to that? I'm curious about how it all gets tied and packaged together. What I'm describing is kind of this almost poor-man's hacked Docker Compose [00:45:00] solution; that's what I feel like I'm doing. hugo: I feel like the poor man's hack is almost like the wise man's hack as well, right? Like, you don't need the buffet when you're just trying to have lunch. john: This is the part of software that's the same as five years ago, not the part that's amazing and different. That sounds perfectly reasonable to me. hugo: Big shout out to, well, of course, Docker Compose, and Simon Willison's LLM client utility and Datasette are both wonderful for a lot of our use cases. I do think once you wanna get, like, multiplayer mode, with being able to have workflows that various people and different teams can use, enterprise solutions become more important, particularly with security and authentication and all of these types of things. The other thing worth noting is that enterprises adopt tools and vendor solutions when they want a phone number to call; you pay money to have someone else accountable for things as well, right? And we can't forget that. That's actually a very important reason behind adoption of enterprise solutions. Yeah.
But it sounds like you're very much [00:46:00] on the right track, David. The poor man's solution, and you can't quote me on this, what we call the poor man's solution is the wise human's solution. david: I was curious as to other people deploying these: what does deployment actually look like with all these different moving pieces? That's still something I'm trying to wrap my brain around, and what the right mechanism of action is with that, and then services, tools, subscriptions, things like that. hugo: Yeah, John and I were corresponding about doing some sort of survey, which we've started doing right now. I think running a survey as to what tools people use and why would be super useful on top of that. Cool. Well, thank you everyone for sticking around. I know it's been a long day of workshops and talks. Really appreciate you all being here; everyone who was here from the start is pretty much still here. Big ups to John for bringing the real religious, sacred knowledge from the front line of AI app development. Thank you for condensing it into understandable takeaways that we can all learn from. Really appreciate you, John. [00:47:00] john: Thank you. Thanks for the opportunity. I always learn more by putting this stuff together, so I appreciate you guys as well. Thank you for your time. hugo: Fantastic. And you are part-time auditing the course, so if people wanna chat in Discord, you'll be around every now and then as well, right? john: Indeed. hugo: Alright, thanks once again everyone, and thank you, John. Thanks for tuning in, everybody, and thanks for sticking around until the end of the episode. I would honestly love to hear from you about what resonates with you in the show, what doesn't, and anybody you'd like to hear me speak with, along with topics you'd like to hear more about. The best way to let me know is currently on LinkedIn, and I'll put my profile in the show notes. Thanks once again, and see you in the next one.