The following is a rough transcript which has not been revised by Vanishing Gradients or the guest. Please check with us before using any quotations from this transcript. Thank you. === shreya: [00:00:00] This is a $10 million budget project from the state of California, and they wanna have LLMs do this because the director of the school of journalism said if humans were trying to create this database of officers and any misconduct detailed in the police reports, it would take them over 35 years to do so. Now your question becomes, how do you have LLMs go through these kinds of police reports, extract police officers and instances of misconduct, and then somehow build this big database of all of that information. But in the police misconduct example, there's a really, really high cost of doing something incorrectly. So people are still working out, okay, how can we more automatically verify that things are correct, because we just can't publish an incorrect database. But for a lot of other applications, you don't really need perfect accuracy. For example, themes from customer support transcripts, right? As long as it's pretty aligned with what you want, and the quotes are pretty representative, you're gonna learn a lot, right? So that's where I see a bunch [00:01:00] of different applications differing. hugo: That was my friend, Shreya Shankar, who's completing her PhD in human-centered data management systems at UC Berkeley. What makes Shreya's perspective so unique is that she's a true builder with deep industry experience. Before her PhD, she was the first ML engineer at Viaduct, did research engineering at Google Brain and software engineering at Facebook. Shreya's research isn't just theoretical. It's been deployed in production at major tech companies like Meta and startups like Lang. On top of all of that, she co-teaches a hugely popular, nay, the most popular and exciting practitioner-focused course on LLM evaluation.
We start with real-world use cases, from building a database of police misconduct reports to analyzing customer support transcripts. Then we dive into Shreya's powerful framework for treating these as structured ETL pipelines. The core of our discussion is about reliability. We cover the non-negotiable process of manual error analysis, that's right, how to use agents to automatically optimize your pipelines, and the [00:02:00] practicalities of using an LLM as a judge. I learned a ton and I know you will too. This conversation is a guest Q&A from a recent cohort of my course, Building LLM-Powered Applications for Data Scientists and Software Engineers, which I have the great pleasure to co-teach several times a year with my friend and colleague Stefan Krawczyk, who works on Agentforce and AI agent infrastructure at Salesforce. Links in the show notes if you're interested. Enough outta me. Let's jump in. What is up everyone? Hey, Shreya. Great to see you. Thanks. It has been a while. I think last time we chatted and you had that background, you had Papaya in your lap, though. shreya: I know, he's sleeping. hugo: I mean, I'm a bit jealous. Just for context, everyone: Papaya is Shreya's beautiful, beautiful dog. Although I think Golden... shreya: Doodle. hugo: Yeah. One of the reasons I'm really excited to have you today is 'cause of your interest in unstructured data processing with LLMs. Yeah. Particularly with your work on DocWrangler and DocETL, and just your deep interest in MLOps from a technical side, from a sociological side, seeing what people do and [00:03:00] really figuring out what's happening in such a vibrant and very fast-moving space. Right? Yeah. So what we're here to talk about today really is how to accurately and cheaply process large collections of documents, and we're gonna get into all the agentic approaches you're thinking about there.
But I'm just wondering, at a high level, how should we think about this challenge of programming software and LLMs to accurately and cheaply process large collections of docs? shreya: Yeah, so I think, you know, for decades, enterprises have been wanting to make sense of large unstructured corpora. So think customer support transcripts, PDFs. I mean, it doesn't even need to be in a PDF, just long documents. And there are so many different types of workloads people wanna run on them. For example, document question answering. That's the one that you see most online, but there's a lot of batch workloads that people really wanna do, like thematic analysis: given [00:04:00] all my customer support transcripts, extract the themes and summarize sentiment around each theme. Given a bunch of product reviews, what are the things that people are most often complaining about? Right? These are the kinds of queries that traditional databases and data processing engines could not solve until generative AI came into play. So now the question becomes, okay, how do you make it very easy to program gen AI-powered pipelines to answer these kinds of batch-style, AI-powered queries? It turns out that this is very hard, because you specify some sort of pipeline, like extracting themes from a document, and then you try to chain these, what we call operators, together, but there's no way of truly knowing whether it's correct or not. How do you know that all the themes were extracted? Is it the theme that you imagined as a theme or not? Maybe [00:05:00] you wanted high-level themes and it went really, really low level, or you wanted themes that relate to complaints, but then it's bringing out themes that people like as well as complain about, right? There's all sorts of ways that these outputs can misalign with what the analysts expect. So accuracy is this challenge. And then when you do all of this work
to try to have an accurate pipeline, then how do you make it cheap, right? Because people want to run these kinds of pipelines on millions of documents. I work with the law school at Berkeley, I work with a major public defender in California, and all of these workloads are very, very expensive. People are spending upwards of tens of thousands of dollars, and they want to have cheaper models but still get the accuracy and the quality of the state-of-the-art models. So your question of how do you have cheap and accurate, that is the main center of my research. I look into, you know, how to search the space of all possible pipelines given a user query. It could involve [00:06:00] any number of LLM calls, any integration with external data sources, right? There's an infinite number of ways you can solve the query. And then once you've searched and found the most accurate pipeline you can find, how do you make that as cheap as possible? It turns out that there are ways to leverage cheap models, swap them in in place of expensive models, when cheap models are very confident or where the task is really easy. So if you have a classification task, you could use GPT-4o mini to do that, versus a summarization, maybe you want Claude to do that. So ultimately, I see this as a really big search problem: the user has a query and a document collection. First, you wanna search for the most accurate pipeline possible. Then you want to search for how you can swap in cheaper implementations for parts of that pipeline, cheaper models, maybe it's code that's doing the extraction. And you wanna do this cost optimization in a way that meets guarantees with respect to your most accurate pipeline. Like, I wanna [00:07:00] achieve 95% accuracy of the most accurate pipeline I've found. So that's the high-level overall picture. hugo: Great. And so when we talk about accuracy, though.
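The cheap-model swap Shreya describes can be sketched as a simple two-tier cascade: a cheap model handles each document unless a confidence check fails, in which case the expensive model takes over. This is a minimal sketch of the idea, not DocETL's actual optimizer; `cheap`, `strong`, and `is_confident` are hypothetical callables standing in for real model APIs.

```python
# Minimal sketch of a model cascade: route each document to a cheap model
# first, and escalate to the expensive model only when a confidence check
# fails. All callables here are hypothetical stand-ins for real model APIs.

def cascade(doc, cheap, strong, is_confident):
    """Return (answer, model_used) for one document."""
    answer = cheap(doc)
    if is_confident(doc, answer):
        return answer, "cheap"
    return strong(doc), "strong"  # escalate on low confidence

def run_batch(docs, cheap, strong, is_confident):
    """Process a batch, tracking how often the expensive model was needed."""
    results, escalations = [], 0
    for doc in docs:
        answer, used = cascade(doc, cheap, strong, is_confident)
        results.append(answer)
        escalations += used == "strong"
    return results, escalations

# Toy stand-ins: classify a support ticket as "billing" vs "other".
cheap = lambda d: "billing" if "invoice" in d else "unsure"
strong = lambda d: "billing" if ("invoice" in d or "charge" in d) else "other"
confident = lambda d, a: a != "unsure"

results, n_strong = run_batch(
    ["invoice is wrong", "double charge", "love the app"],
    cheap, strong, confident,
)
```

In this toy run, only the tickets the cheap model was unsure about reach the expensive model, which is the whole point of the cost optimization.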
When we ask how do you make things accurate, there's almost an assumption that we have consistency or reliability, right? So how do we even think about this when LLMs are kind of flaky? They're like your friend who turns up sometimes and doesn't other times, right? shreya: So in the DocETL project we're building, you can think of it as a MapReduce-like framework, where you specify map operations and reduce operations over your data, but these operations are specified by LLM prompts. So to do thematic analysis on your customer support transcripts, you might have a map operation that extracts themes and quotes from each document. Then you have a reduce operation which groups by the theme and then generates a summary of all the quotes that correspond to a theme. There's one problem with baseline reliability, [00:08:00] which is the LLM was just not consistent, as you said, or it just happens to return an empty string or some garbage, where if you asked it again, it would get the correct answer. So for that, we have retries built in. We have code-based guardrails. We have something that we call gleaning, which is: if you have a cheap validator to check for certain properties of the output, like let's say I want there to be at least three themes extracted from each document, that's a code-based validator. Maybe I want the themes to be sufficiently diverse, so I can have a very cheap LLM judge evaluate whether they're diverse, and if they're not, rerun the operation until they are. So we have a bunch of validation and guardrail strategies there to improve output. Now the second thing is overall alignment with user preference, right? If you didn't define what theme means in your original map operation prompt, there is no world in which you can expect the query engine [00:09:00] to extract the right themes that you have in your mind.
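The gleaning loop described here, run the operation, check the output with a cheap validator, feed the failure back, and rerun, might look roughly like this. It is a sketch of the idea rather than DocETL's implementation; `run_op` stands in for the LLM call, and the validators for cheap code- or LLM-based checks.

```python
def glean(doc, run_op, validators, max_rounds=3):
    """Rerun an LLM operation until every cheap validator passes.

    `run_op(doc, feedback)` stands in for the LLM call; `validators` is a
    list of (check, message) pairs, e.g. a code check that at least three
    themes were extracted, or a cheap LLM judge for diversity.
    """
    feedback = None
    for _ in range(max_rounds):
        output = run_op(doc, feedback)
        failures = [msg for check, msg in validators if not check(output)]
        if not failures:
            return output               # all guardrails satisfied
        feedback = "; ".join(failures)  # fed back into the retry prompt
    return output                       # give up after max_rounds

# Toy stand-in: the "LLM" returns too few themes until it sees feedback.
def fake_llm(doc, feedback):
    return ["billing", "latency", "UX"] if feedback else ["billing"]

themes = glean(
    "transcript...",
    fake_llm,
    [(lambda out: len(out) >= 3, "extract at least three themes")],
)
```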
So this is where our DocWrangler and our more HCI work comes into play, which is: what are the right interfaces and playgrounds that we can build for people to try out different versions of their pipeline, their prompts for their pipelines, and then annotate which outputs are good, which themes are good, which themes are bad, and then have LLMs go and analyze that feedback and improve the prompt automatically. Turns out you can do this really well in an interactive interface. It's much harder to do in a CLI or in the data processing system that you're building. hugo: I love that. And there are actually so many things in there to unpack, which I didn't necessarily intend. I don't wanna go down rabbit holes too much. So we are already getting into validation and guardrails and evaluation and, yeah, basic LLM-as-a-judge. We won't go too deep into this. Go and join Shreya and Hamel's course if you wanna go deeper, but I'll just link to your DocETL project and DocWrangler in [00:10:00] Discord. I wonder if you could speak to a particular example, for instance the US presidential debate analysis you did, or another example, just with respect to how you think about, when you build out these systems, when to start doing validation, evals, LLM-as-a-judge, and how to do it, and how the iteration cycle works. shreya: Yeah, so I'll give you a concrete example from the original DocETL paper, which is a collaboration with the school of journalism and the law school at Berkeley. There's this project called the California Police Records Access Project. What they did was collect all the police reports that all the police departments had in the state of California, and they wanna build this big database of officer misconduct. This is a $10 million budget
project from the state of California, and they wanna have LLMs do this because the director of the school of journalism said if humans were trying to create this database of officers and any misconduct detailed in the police reports, it would take them over 35 [00:11:00] years to do so. Now your question becomes, how do you have LLMs go through these kinds of police reports, extract police officers and instances of misconduct, and then somehow build this big database of all of that information. Well, the way that we think about it is: what does the user need to program? The user probably wants to program the logic that is being applied to documents. We think of documents as rows in a table. So presumably what they wanna do is apply some functions to each row to extract some attributes. We call that the map operation. Then they wanna be able to do further analysis on those attributes. For example, do deduplication on any extracted attributes, or group by an extracted attribute, group by the extracted officer, and then summarize a bunch of information related to that officer, or a case, or something like that. So the first layer is just this [00:12:00] programming framework of what the user even wants. Well, they wanna map over each document, then they wanna do an entity resolution step, then they wanna do a group-by and aggregation. These are very familiar data processing concepts. Then the second thing is, how do you specify these operations, right? You wanna extract police officers from a police record. Turns out you have to write that down as a prompt somehow. You have to explain that police officers are referred to by sergeant or officer or something, right? You have to give some instruction to the LLM, and it turns out that you need some way to explain anything that's domain-specific. So maybe it is that I only wanna focus on the police officer that filed that police report, or I want
police officers that are actually in the state of California and affiliated with that department, [00:13:00] not other random police officers. So there's a bunch of specifications and prompts there that you need to write. Then I think there's an output-formatting step, because I want to expect consistency from the LLM. So: make sure you give me the first name and last name of the police officer. If you don't say that, it might just give you Sergeant X or Officer X, and that's gonna be inconsistent across all of the LLM outputs. So again, right, it's just an exercise of clearly specifying what you want it to do and what you want it to output. And then I think the guardrails and the validation logic, the manual validation, comes in after that. So when people use DocETL, they'll run it on a sample of documents after they've written good specifications for their pipeline. And, like at the journalism school, there's a bunch of interns that sit there and analyze the outputs and will tell you, this was correct and this is not correct. You have three interns doing this for the span of a week, [00:14:00] and then, think of this as error analysis, literally, you go figure out: okay, what did our prompt not specify that we should have asked for? And then, you know, what is the LLM simply not capable of doing? Maybe this document was way too long and we realized that it was extracting too few instances of misconduct. So the way that we wanna fix this is to chunk up the documents and apply an LLM operation to each chunk, and then you unify the results, right? So there's so much of that improvement process. I think what we see is people will do like three rounds of this error analysis and improvement, so they have human annotators able to say, I got 90%-plus accuracy or alignment with my expectations, and then they will go and deploy it on the entire database.
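The chunk-and-unify fix for overly long documents, split the document, run the map operation per chunk, merge the per-chunk results, can be sketched as follows. The chunk size, overlap, and exact-duplicate merge are illustrative choices, not DocETL's defaults, and `extract` is a deterministic stand-in for the LLM-backed map operation.

```python
def chunk(text, chunk_size=200, overlap=50):
    """Split a long document into overlapping character windows."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

def map_over_chunks(text, extract, chunk_size=200, overlap=50):
    """Run an (LLM-backed) extraction on each chunk, then unify results.

    Deduplication here is exact matching; a real pipeline might need an
    LLM-based resolve step to merge near-duplicate entities.
    """
    found = []
    for piece in chunk(text, chunk_size, overlap):
        for item in extract(piece):
            if item not in found:  # unify: drop exact duplicates
                found.append(item)
    return found

# Toy stand-in: "extract" any known officer name mentioned in the chunk.
names = ["Smith", "Jones"]
extract = lambda piece: [n for n in names if n in piece]

long_doc = ("Officer Smith filed the report. " * 10) + "Officer Jones was present."
mentions = map_over_chunks(long_doc, extract, chunk_size=80, overlap=20)
```

The overlap between windows is what keeps an entity mention from being lost when it straddles a chunk boundary.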
Now with the police misconduct example, there's a really, really high cost of doing something incorrectly. So people are still working out, okay, how can we more automatically verify that things are correct, because we [00:15:00] just can't publish an incorrect database. But for a lot of other applications, you don't really need perfect accuracy. For example, themes from customer support transcripts, right? As long as it's pretty aligned with what you want, and the quotes are pretty representative, you're gonna learn a lot, right? So that's where I see a bunch of different applications differing. hugo: Fascinating. Firstly, I love that you mentioned error analysis, 'cause that's a huge focus of what we cover in this course with respect to the iterative cycle of LLM-powered software and the lifecycle. Also, Jason Liu came and gave a guest Q&A this morning, and he leaned heavily on, once again, the look-at-your-data movement. I know, I'm sorry everyone, I know I promised we'd talk about agents, and we will, but I also love that you mentioned, in terms of error analysis, the iterative loop of doing it, and that, in this case, you got people to do it three or so times in order to get a certain level of alignment and [00:16:00] accuracy. So I'm just wondering: 50-plus percent of the people I consult, when I tell them to do error analysis, they say, can't I get an agent to do that? And I'm like, human learning before machine learning, before agent learning, right? And I do understand, 'cause it's a serious lift, right?
But also you need to inspect what your software's doing to get a vibe for it, see what's up, and understand what's happening with these, you know, quote-unquote organic systems we're building. I'm just wondering if you can speak to how often and how much you encourage people to look at their data and do error analysis, and when is enough enough? 'Cause we don't wanna tell people that they need to be doing it every day or every week in perpetuity, right? shreya: It's a good question. I mean, in our own course we tell people: you have to do error analysis. Look at your data. We teach a structured process for this. [00:17:00] We have people do what's called open coding, basically, on each output. So write down any failure modes that you might observe, any theories for why it's bad. Many of the traces might not have failure modes, and that's fine; you don't have to write any notes for them. But the point is, you wanna keep doing this open coding until no new failure modes emerge, like you've gone and labeled 30 to 50 new traces, and there are failure modes, but they're failure modes you've already seen in other labeling. Then you hit the point where you're like: I have my set of failure modes that I would like to more systematically measure, or I know that I need to fix these failure modes. For example, I wanted the output to have the officer's first name and last name, and the best way to do that is to specify that in the prompt. I didn't think to specify that in the prompt, but now error analysis showed me this inconsistency, so now let me go specify that in the prompt, right? These are the kinds of [00:18:00] realizations that people have when they do error analysis. I find that when people ask if they can have agents do automated error analysis, the agents are not gonna find this, because how does the agent know that you expect consistency in outputs?
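Shreya's stopping rule for open coding, keep labeling until 30 to 50 fresh traces surface no new failure modes, amounts to a saturation check. In this sketch, `label` stands in for the human annotator returning a set of failure-mode tags per trace, and the window size is an illustrative knob, not a prescribed value.

```python
def open_code(traces, label, saturation_window=10):
    """Label traces until no *new* failure modes appear for a while.

    `label(trace)` stands in for the human annotator: it returns a set of
    failure-mode tags for one trace (empty if the trace looks fine).
    Stop once `saturation_window` consecutive traces add nothing new.
    """
    seen, since_new = set(), 0
    for n, trace in enumerate(traces, start=1):
        modes = label(trace)
        if modes - seen:       # a failure mode we haven't seen before
            seen |= modes
            since_new = 0
        else:
            since_new += 1
        if since_new >= saturation_window:
            break
    return seen, n             # failure-mode taxonomy, traces reviewed

# Toy stand-in: failure modes concentrated in the early traces.
notes = [{"bad_format"}, set(), {"hallucination"}, set(), set(),
         set(), set(), set(), set(), set(), set(), set(), {"unseen"}]
modes, reviewed = open_code(notes, label=lambda t: t, saturation_window=8)
```

Note that saturation is a heuristic: in this toy run the loop stops before ever seeing the rare `"unseen"` mode, which is exactly why later, stratified rounds of error analysis are still worth doing.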
The agent might have its own definition of what's correct. As long as it didn't hallucinate a new officer, it says it's good, but you want something else. So this is where human judgment needs to be incorporated in error analysis. The agent could be perfect and that's good, but at the end of the day, you are the one writing the spec for what the LLM pipeline should do, and you are the one who has to do the error analysis to make sure that your spec is exactly what you want. And I find that once people realize that, then it clicks for them to do error analysis, and they don't ask to use an agent to do it. hugo: Yeah. And the other thing I've seen is, once people start looking at their data, even if they're hesitant at first, they're like, oh wow, that's interesting, and they wanna dig in more and more, right? [00:19:00] shreya: Yeah. Yeah. Totally. hugo: The other question, and we've got a question from Brad, which we'll get to in a second. The other objection I hear is: look, we've got millions of traces now. Where do I even start? What do you tell people to do in these types of situations when they're already overwhelmed? shreya: Yeah. I honestly think going through a random sample of them is totally fine. Again, keep labeling until you haven't observed any new failure modes. If you want to be smarter about sampling, you can, but you don't have to do it your first time doing error analysis. Take a uniform random sample of the data; that is fine. Sometimes, if you wanna do a more sophisticated sampling strategy for the next time you're doing error analysis, maybe you can sample based on users. So, I don't know, take a few records from a random sample of users, or you could sample based on geographic region, or you could sample based on query modality. [00:20:00] I don't even know, right? It's your application, so you kind of
think of the dimensions of your application that probably differ, and then do some sort of stratified sampling around those. But other than that, you know, people are learning from random sampling. I don't see people saying that they looked at a random sample of their data and then learned nothing from it. In very late rounds of error analysis, sometimes I kind of already hypothesize my failure modes. So for example, if it's hallucinating a tool or something, for those I can write code or functions that can very quickly check for those kinds of traces in my unlabeled set. But again, that requires me to have done some error analysis in the first place, to know what failure mode I'm looking for, so then I can go try to approximately find them in the rest of the data set. hugo: Great. So we have a question from Brad. And Brad, if you wouldn't mind just introducing yourself and saying what you're working on at the moment. Brad also, just to note, has a funny accent. brad: Hey, Shreya, that's awesome. Thanks, Hugo. [00:21:00] I'm just tinkering at the moment. I'm taking a lot of courses and I follow a lot of your work, Shreya, so thanks for everything that you guys are doing. It's great. shreya: Of course. brad: I was just hoping you could expand on the open coding part and maybe provide an example. Does this mean you're logging all traces from the start and manually going through until you can broadly categorize everything? Are we talking about something as simple as a spreadsheet, or just whatever tool it is? shreya: Yeah, absolutely. You should check out chapter three. We have the first three chapters of our course reader as a free preview on our AI evals course landing page. Chapter three details this error analysis lifecycle, so all of you can read it right now.
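The sampling strategy Shreya described a moment ago, a uniform random sample for a first pass, stratified sampling by user, region, or modality in later rounds, needs only the standard library. The `user` field and the per-stratum count here are illustrative choices for the sketch.

```python
import random
from collections import defaultdict

def uniform_sample(traces, n, seed=0):
    """First pass: a plain random sample is usually enough to learn from."""
    rng = random.Random(seed)
    return rng.sample(traces, min(n, len(traces)))

def stratified_sample(traces, key, per_stratum, seed=0):
    """Later passes: sample a few traces per user/region/modality stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for t in traces:
        strata[key(t)].append(t)
    picked = []
    for group in strata.values():
        picked.extend(rng.sample(group, min(per_stratum, len(group))))
    return picked

traces = [{"user": u, "id": i} for i, u in enumerate(["a", "a", "a", "b", "b", "c"])]
first_pass = uniform_sample(traces, 4)
batch = stratified_sample(traces, key=lambda t: t["user"], per_stratum=1)
```

Stratifying guarantees that rare users (or regions, or modalities) show up in the labeled batch even when a uniform sample would likely miss them.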
But you literally look at your data and you write down freeform notes on what is bad about it, and worry about categorizing and turning those into more quantitative metrics later on. You wanna do this open coding for at least 50, [00:22:00] I'd ballpark 50 to a hundred traces, like do it for an hour or two. And this process gives you so much more lift than pretty much anything else, any other tool out there. hugo: Thanks so much, Brad, for the great question, and Shreya for the great response. And Brad's actually being really humble. Brad comes from a non-data/ML background, but is building some really wonderful personal AI knowledge management systems. He started in our January cohort of this course and he's back for more, building super interesting stuff. That's the other thing about this space at the moment: in terms of opening up, leveling the ground for everyone to come in and build things, it's just really inspiring for me. So, Shreya, what's up with agents? To ask a bit more specific of a question, we're talking about, once again, programming LLMs to accurately and cheaply process large collections of documents. We've been talking about a lot of the nitty-gritty of it, but you're really tackling a lot of [00:23:00] things with agents. So why agents, and what makes them a good fit for this kind of problem? shreya: So I think of an agent as an LLM that can kind of choose its own execution plan. It can choose what tools it wants to use, and it can choose what order it wants to execute them in. hugo: Just quickly, by choose we mean the tool choice is a function of the LLM output. shreya: Yeah. Yeah. Now, I'm not so bullish on open-ended agents that just do any data processing task. I think you first need to write the language with which you communicate
with the data agent. So our DocETL framework is, in a way, that kind of language. And you use agents for more scoped things. One of the things that we're doing is running a query optimizer that is agentic: it takes your pipeline, runs some evals on it, figures out what is not performing well, and tries to rewrite the pipeline to be [00:24:00] more accurate. So maybe it'll take an operator and rewrite it into a more complex pipeline of multiple operators that are much better scoped. I think agents are a great fit if there's a very good metric to optimize for and a very structured set of actions or tools they can take, for example, a structured set of ways to rewrite the user's pipeline; then the agent's gonna give you a big lift. But I don't know, I feel like people also wanna just use the agent. They wanna upload a folder of PDFs to ChatGPT and ask for something. But there's really no good way of knowing or improving on that accuracy. It's not gonna be accurate at all. I mean, you have no idea what it did. So that's where I'm not a fan. hugo: Yeah, and I mean, state of the art in function calling, or whatever, is 90% accurate or whatever it is. And if you have a few in a row, maybe six calls or something like that, it gets you down to a coin flip of whether it's gonna work or not. So, we talked about what I'd [00:25:00] refer to as the flakiness of LLMs before. What I'm hearing is that if you have a small set of tools it's capable of using, it will be less flaky, and if you are principled and clever about it and have, you know, robust evaluations, then that can get you a long way. shreya: Absolutely. Yeah. And I think there's a lot of interesting work now in the research community using reinforcement learning to train these agents to at least match o3's performance. So that's quite interesting.
Again, it really only works on very small tool sets, like two or three tools, and the number of turns in the conversation is like five for some of these models. But it's interesting to see the space explode so much. hugo: Now for a word from our sponsor, which is, well, me. I teach a course called Building LLM-Powered Software for Data Scientists and Software Engineers with my friend and colleague Stefan Krawczyk, who works on Agentforce and AI agent infrastructure at Salesforce. It's cohort-based, we run it four times a year, and it's designed for people who want to go beyond [00:26:00] prototypes and actually ship AI-powered systems. The link's in the show notes. I mean, I'm speculating here, but there's a lot of work in the big labs going into RL, and a lot of it's pretty secretive stuff, so I think this is something that a lot of people are thinking may form a moat. My hot take is that none of these companies have any moat, really. Like, memory is not a moat, right? And look, OpenAI doesn't even have the network effects of MySpace, right? And look what happened there. But I think RL is something which, you know, especially as trade secrets, could help there. But I'd love to know, in these types of workflows, what kind of tasks are these agents actually doing in your systems, and how do they integrate with what the humans and the domain experts are doing? shreya: We use agents for rewriting pipelines to be more accurate. So let's say you've given me a map-reduce pipeline. You've specified your [00:27:00] map operator to extract themes from each document. Your reduce operator is gonna group by the theme and summarize insights per theme. If your document is very, very long, you might not be able to extract all of the themes, so that map operator is not accurate. So what our agent will do
is run an eval to assess the performance, see that it's not very good, and then it will rewrite that map operator, maybe into a chain of thought, a decomposition into multiple map operators. Or it might split up the document, run a map operator on each chunk, and then aggregate the results of the map operator on each chunk. These are what we call rewrite directives, right? They're very structured rules, like: hey, take an operator that is not performing well and rewrite it into a somewhat more complex pipeline, still in our domain-specific language, in our framework, and then see if that performs better. Agents are a really good way to do this because they can match rewrite directives to operators, and they can [00:28:00] implement the rewrite directives. If you say, rewrite this map operation prompt to run on a chunk instead of the full document, the agent can come up with that prompt. So we totally use agents for these kinds of rewrites, and as an LLM judge to verify whether one rewrite is better than the original or not. hugo: Super cool. And you probably know I wrote a blog post recently called Stop Building AI Agents, yeah, half tongue-in-cheek, but one place where I think they are really powerful is coding and coding assistants, but also doing these types of things in an automated manner. And we're just gonna see that get better and better. I'm interested: how do you tell if a rewrite actually makes the pipeline better? Do you get an agent to do that, and then use an LLM as a judge to figure out whether it makes it better? How do all these moving parts interact? shreya: Good question. So we use LLM as a judge, but again in a structured way. So imagine you have two pipelines that are logically equivalent, and you wanna compare which pipeline is more accurate than the other. You'll run [00:29:00] each on a sample, so you'll have two samples of outputs.
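The comparison Shreya is setting up here, run two logically equivalent pipelines on a sample and let a rubric-driven judge pick a winner per pair, reduces to a small win-counting loop. `judge` below is a deterministic stand-in for the rubric-driven LLM-as-judge call.

```python
def pick_best_plan(outputs_a, outputs_b, judge):
    """Pairwise-compare two pipelines' sample outputs under a rubric.

    `judge(out_a, out_b)` stands in for an LLM-as-judge call that returns
    "A" or "B" for whichever output better meets the rubric; the plan with
    the most wins across the sample is declared best.
    """
    wins = {"A": 0, "B": 0}
    for out_a, out_b in zip(outputs_a, outputs_b):
        wins[judge(out_a, out_b)] += 1
    return max(wins, key=wins.get), wins

# Toy rubric: "more themes extracted is better" (recall-flavored).
judge = lambda a, b: "A" if len(a) >= len(b) else "B"

best, wins = pick_best_plan(
    [["billing", "latency"], ["UX"], ["billing", "UX", "latency"]],
    [["billing"], ["UX", "refunds"], ["billing"]],
    judge,
)
```

Counting pairwise wins avoids the calibration problems of asking a judge for an absolute one-to-five score on each output in isolation.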
We have an LLM-as-judge that has a specific prompt, specific to that pipeline, that operator. So it's going to judge theme extraction, and it's gonna have a rubric of criteria that is specific to the operation, so both precision and recall, in a sense, are assessed in that rubric. Then the LLM judge takes this custom rubric and does pairwise comparisons on the plans to see which sample output better meets the criteria specified in the rubric. Then we just take the plan that has the most wins in the pairwise comparison and say that this one is the best. So the key points to take away here are: one, the LLM judge has a very structured rubric that is domain-specific, which can be defined by the user or by a rubric-generating LLM or something. And two, we do [00:30:00] pairwise comparison, like a ranking against that rubric. So it's not a one-to-five scale or pass-or-fail against the rubric; it is truly, is plan A better than plan B, given this rubric and the sample outputs? And that tends to align a lot more with accuracy. hugo: These sound like wonderful processes. How do you figure them out? Is there any science to this? 'Cause, to give more context for us, we've been working in machine learning for a decade-plus, and, you know, we say a lot of LLM stuff is incantations, but so was a lot of the machine learning we were doing, to be honest. So I wonder, how much is science and how much is art currently? shreya: Yeah, that's a great question. So a lot of the stuff I'm talking about is actually five different problems that we have individually written papers on. In our research, right, we have a big vision, which is the system for unstructured data processing. And as we build that, or we build components of it, we realize, [00:31:00] it's exactly your question: you know, is this an art or is this a science?
Like, how do I know the technique that I'm doing here is very good? We might not know. So then we decide, okay, let's write a paper on it. Let's figure out: what are the baselines, what is the obvious way of solving the problem, did that work, can we measure that performance? And if it didn't work, can we build something better? So you'll notice there are a lot of papers I've written in my PhD that are not the DocETL paper but are precursors to DocETL, for example the SPADE paper and the EvalGen paper, which might be familiar: a lot of papers on validation, where we simply tried different ways of doing evaluation for domain-specific data processing pipelines. So once we figured that out, it turns out you can reliably infer criteria from prompts, you can reliably elicit human feedback to refine those criteria, and rubrics are a really good way of systematically evaluating accuracy. If you don't have a rubric, if you just ask an LLM judge "good or bad," we found that wasn't very [00:32:00] good. So a lot of these problems, again, are smaller problems solved in these precursors to the DocETL project. hugo: Amazing. So, Mitch has a great question in the chat. Mitch, are you in a position where you can turn your microphone on to ask your question? If so, could you introduce yourself and let us know what you're up to at the moment? That'd be great. mitch: Yeah, sure. So I work for the UK civil service at the moment, as a data scientist in an AI incubator they have there, and we have a product that's very similar to the DocETL stuff that you described, which was really interesting. I've never thought of it as framed as an ETL pipeline before, which is really funny given I've been doing it for a year; I've just been thinking of it as topic modeling.
And so my role there specifically is on evaluation, trying to figure out whether the product does what it is supposed to. And we've done several kinds of blind evaluations where, like you said, we'll get three people (they're not interns, they're civil servants) to independently undertake the task that we want the AI to do. And then we might even get a supervisor to [00:33:00] blindly rank those four sets of outputs and see whether the supervisor can even tell which is the AI and which are the people actually in their team. And we're getting reasonably positive results from that kind of stuff. But one of the questions we keep running into is: can you reduce that burden of evaluation over time and get to the point where people can just use your product out of the box? I think the intuition is that, you know, if I go to the supermarket and buy paracetamol, I don't worry that half of them are sugar pills; we've decided at a certain point that paracetamol works, and they're all gonna be paracetamol. So I'm kind of intrigued how we get to that. At the moment we've kind of gone down the route of doing a meta-analysis of all the evaluations we've done, trying to get some kind of consensus estimates on effect sizes, and then trying to use that to justify a reduction in sample size going forward. But we're finding that quite tricky conceptually. shreya: Yeah, I agree with you, this would be a [00:34:00] great question to solve. So I think there's no world in which there will be zero evaluation for a high-stakes application. You're gonna have to go through at least one round of ensuring that there's, say, 90% alignment or something with your preferences; otherwise you have no way of knowing. So now the question becomes: how do you reduce the number of rounds, and then, as you said, the number of samples needed for a round?
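One common way to reason about "how many samples per round," in the spirit of Mitch's sample-size question, is a normal-approximation confidence interval on the observed agreement rate. This is standard statistics, not anything specific to their product; the function name and defaults below are my own.

```python
# If earlier rounds suggest roughly p agreement between human and model
# outputs, this estimates how many samples keep a 95% confidence interval
# on the agreement rate within a chosen half-width.
import math

def samples_needed(expected_agreement: float, ci_half_width: float,
                   z: float = 1.96) -> int:
    """Samples needed so the CI on the agreement rate has the given half-width."""
    p = expected_agreement
    return math.ceil((z ** 2) * p * (1 - p) / ci_half_width ** 2)
```

Notice the payoff of high observed agreement: at 50% agreement you need the classic ~385 samples for a ±5% interval, but at 90% agreement the variance term shrinks and far fewer samples justify the same precision.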
I think in these cases it's very domain specific. The other thing to think about is how you improve the specification in the prompt so much that you reduce the number of specification failures that emerge in error analysis. So many errors that emerge in error analysis would not have appeared if you had just mentioned the requirement in the prompt. And I think there's a good blueprint for how to write your prompt, right? Are you using few-shot examples? Are you clearly specifying do's and don'ts at the beginning? Are you specifying the output format? [00:35:00] I find that tends to eliminate many of the errors that come up. The other thing is using some ensembling strategy, or not even ensembling, but trying multiple models, seeing for which inputs the models substantially differ in their outputs, and using those to prioritize human review. That might surface failure modes a lot faster. So to answer your question, short story: I don't think you can get rid of manual validation entirely, but you can try to reduce samples. mitch: Perfect. Thank you. hugo: Thanks for the great question, Mitch, and thanks for elucidating all of that, Shreya. That actually partially answers a question that came to mind for me. When we started talking, we talked about how accuracy and cost are really important, particularly when you have really large corpora [00:36:00] of docs; it can get prohibitively expensive. I know you're thinking about that on the ETL side, or rather the pipeline side, and when you introduce LLMs as a judge, that is another place where costs can blow up completely. So I'm wondering about the juggling you have to do with these different costs and trade-offs.
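The multi-model disagreement heuristic Shreya mentions (route the inputs where models differ most to human review first) could be sketched as below. This is a rough illustration: real model calls are replaced by plain functions, and string similarity is just one possible disagreement measure.

```python
# Run several models on each input and surface the inputs whose outputs
# diverge most, so manual review is spent where failure modes are likely.
import difflib

def disagreement(outputs):
    # Average pairwise dissimilarity across model outputs (0 = identical).
    pairs = [(a, b) for i, a in enumerate(outputs) for b in outputs[i + 1:]]
    if not pairs:
        return 0.0
    sims = [difflib.SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return 1.0 - sum(sims) / len(sims)

def review_queue(inputs, models):
    # Highest-disagreement inputs first in the human review queue.
    scored = [(disagreement([m(x) for m in models]), x) for x in inputs]
    return [x for score, x in sorted(scored, reverse=True)]
```

Any divergence score works here; the design choice is simply that human attention is allocated by model disagreement rather than uniformly at random.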
So how do you think about the different types of costs that can be incurred? shreya: Yeah. My recommendation for the judge, to minimize cost, is to use the cheapest model possible and limit it to binary evaluation, whether that be pairwise comparison (output A is better than output B) or basically any call that has single-token or very short outputs, 'cause remember, you pay the most for output token cost. The other thing with LLM-as-judge is that you can order the requests in a way that leverages caching. I don't know if you're familiar with the model providers' [00:37:00] caching pricing models, but for OpenAI I think you can save 50%; across the LLM providers, you save between 50 and 90% on subsequent calls that share the same prefix of the prompt. So in this case, if you want to judge three different things on some output, make sure the output is going to be cached: specify the criteria to judge after the output that you wanna judge. And then just have binary evaluation criteria. So you're really, really minimizing the cost there. If you find that you're using the cheapest model possible, you have ordered your calls to take advantage of context caching, and they're all binary, so single-token outputs, then the only thing you really can do is judge fewer outputs, right? You've already minimized on all the other fronts. But in practice I [00:38:00] don't see people minimizing on any of these fronts. hugo: Yeah. Do you see that being something people do more and more as they incur more and more costs? shreya: Absolutely. I mean, people wanna save money all the time. I think we'll see more specialized systems for LLM-as-judge that will do this optimization for you, like reordering of LLM calls and so forth.
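The caching-friendly ordering Shreya describes (long shared prefix first, short varying criterion last, single-token answer) could look like the sketch below. It only builds the prompts; whether a call actually hits a provider's prompt cache depends on that provider, and the exact wording here is my own.

```python
# Build judge prompts so that the long output-to-judge is a shared prefix
# and only the short criterion varies at the end. Calls after the first can
# then benefit from provider prefix caching, and the requested answer is a
# single YES/NO token to keep output-token cost minimal.

def build_judge_prompts(output_to_judge: str, criteria: list) -> list:
    shared_prefix = (
        "You are grading a pipeline output against one criterion.\n"
        f"Output to judge:\n{output_to_judge}\n"
    )
    # Same prefix for every call; only the short suffix differs.
    return [
        shared_prefix + f"Criterion: {c}\nAnswer YES or NO only:"
        for c in criteria
    ]
```

The reversed ordering (criterion first, output last) would make every prompt's prefix unique and forfeit the cache discount, which is exactly the mistake the call ordering avoids.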
Like, we're already doing this in DocETL: we're trying to reorder all our LLM calls to save as much money as possible. So the more you can have a declarative system, maybe, for LLM judges to do this, the better. However, I feel like we're quite far out from that, because people don't know how to write LLM judges in the first place, and the eval vendors don't really know how to do this either, so I wouldn't expect it anytime soon. hugo: Well, also, I don't know if things have changed, but last time I looked, a lot of the eval vendors and framework vendors talk about LLM-as-judge before hand labeling in their docs and tutorials, which is a challenge. I don't want you to give [00:39:00] away any Shreya Shankar secret sauce, but I am interested whether you're thinking about productizing any of the things you're working on. shreya: Oh, nothing. Everything is just open source. I'm a researcher trying to be a professor; I'm applying for jobs this year. hugo: Amazing. I appreciate we're at time. Do you have time for one more question? Totally cool if not. shreya: Sure, sure. One question. Yeah. hugo: One thing I very much appreciate about the workflows we've discussed is how you get humans in the loop. As I mentioned, we had Jason come and give a guest talk today, and Ines Montani from spaCy and Explosion. One of the reasons I love the work all of you do, and appreciate all of you, is how humans are at the center in a lot of ways, and figuring out how to get humans working with the machine. I'm wondering how you think about how humans fit into these processes and workflows generally, and what it means to treat humans as first-class citizens in these workflows. shreya: [00:40:00] Yeah. There's a funny phrase in the HCI community; people don't say "human in the loop" anymore.
It should be "computer in the loop," because the loop is around the human, not the human around the computer. I don't remember where I heard it, but I thought that was funny. I don't think I have a great answer to this question. I'm trained in a lot of HCI research methodology, and I'm in the HCI research community, so that informs a large part of it. For every solution I build: how do you think about designing it, what is the user experience going to be, and then, second, what are the technical challenges that need to be solved given these user problems? But yeah, sorry, I don't have a great answer to that other than: think about users. hugo: Yeah, no, and that comes across in the workflows you've described as well, so I really appreciate that. Shreya, as always, I learned so much from you. There's so much I need to go away and think about; I just wanna go and lie in the sun and cogitate on all these things. Thank you for your time, expertise, [00:41:00] and wisdom. I know how much you've got going on at the moment, so I really appreciate you making the time to speak with the course. shreya: Of course, of course. Thanks for having me. hugo: Absolutely. And I can't wait to see you later this week in your course as well. shreya: Yeah, sounds good. hugo: Great. Alright, thanks once again, Shreya. Thanks everyone. See ya. Thanks for tuning in, everybody, and thanks for sticking around until the end of the episode. I would honestly love to hear from you about what resonates with you in the show and what doesn't.