The following is a rough transcript which has not been revised by Vanishing Gradients or the guest. Please check with us before using any quotations from this transcript. Thank you.

===

hugo: [00:00:00] Shreya is completing her PhD in data management systems, with a human-centered focus, at UC Berkeley. Although that's literally and technically true, I don't think that sentence gives enough breadth to all the things you work on. So I'm really excited to dive into all the things, from classic ML work to all the wonderful stuff you're doing building robust and reliable AI systems and pipelines. Shreya's work does focus on addressing data management challenges, and one thing we'll touch on is Shreya's opinion that most, if not all, ML problems are actually data management issues, particularly in production machine learning pipelines. Prior to her PhD, Shreya was the first ML engineer at Viaduct, has done research at Google Brain, and software engineering at Facebook as well. That is definitely enough out of me. So with that brief introduction, Shreya, I'm just wondering if you can tell us a bit about your journey into machine learning, [00:01:00] AI, and what sparked your interest in this field.

shreya: Yeah. 2015 was, I think, the year that Seq2Seq became super popular, or deep learning, basically. It really took over Stanford, which is where I did my undergrad, and everybody around me was taking an AI class. So I thought, okay, I have to take an AI class. And I took an AI class, and it felt like such a classroom setting that I didn't think it was cool. What really made me realize it was cool was an internship at Google, just normal software engineering. There was this AI-generated music research team that was trying to generate Bach-like music based on RNNs, and I thought this was super cool. So I went back to college, took five more deep learning classes, and emailed [00:02:00] everyone I knew: can I do AI research? And I truly didn't know that much. I think, yeah, that's how I got into AI.

hugo: Of course. I'm someone who went to grad school far too early. No regrets, but if I did everything again, I'd probably experiment with either not going to grad school at all, or going and working for some time and then coming back to grad school. So I'm just wondering about your journey from working, not only in industry, but at some pretty heavy-hitting, hardcore places as well, what that was like, and then the journey back to academia.

shreya: Yeah, I did AI and ML research and I learned a lot, but I still felt like I didn't know how it was deployed in products. And to be fair, it was quite early when we were doing research, and it wasn't super well deployed. The only people who were really deploying ML were Google and Meta, or Facebook at the time, or whatnot. So I thought, okay, I want to go to industry and learn how to deploy it, and [00:03:00] then understand maybe how to make deployments more successful, reliable, whatever. That got me interested in industry. And I realized 99 percent of my job was engineering: assessing data quality, building data engineering pipelines. I probably trained a model twice in my entire job working as an ML engineer, and I thought that was crazy. Okay, I trained a model more than twice, if you consider me triggering a job that trains a model.
But in terms of writing actual training code, I only had two projects doing that. And that really made me realize, okay, maybe there's something else here in terms of how we make ML more accessible to people. Maybe the solution isn't to work on training more models, but to work on the pipelining code around it.

hugo: Yeah. And particularly, I think the common task framework, the leaderboard-style stuff such as Kaggle, which is incredible, and all of these things are incredible, but it definitively put the focus on training, and not all the other things, in the cultural consciousness, I think. [00:04:00] And of course, having a single metric to optimize for makes the competition easier, and all these types of things. But I am wondering: is there room for data pipeline and ML and AI pipeline-style competitions that could get people excited about all these things? Is that something you've thought about?

shreya: Yeah, there are actually some data preparation competitions, or data-centric AI competitions, where the goal is to get good datasets or dataset augmentations to train models. I think that's certainly part of the equation, but another part of the equation is simply: how do you have good data quality, not necessarily for training models, but also for model inference? How do you do the flywheel of continually training your model in response to new data? How do you have your metrics shift over time as you're moving towards product-market fit? These are a lot of questions that I think the research community is well suited to explore, even if they're not very easily expressible in competitions.

hugo: I love that, and I'm glad you've asked those [00:05:00] questions, because I don't have answers to them, but I know you at least have provisional answers to some of them, and those are some of the things we're here to chat about today. So jumping in: I know you're currently working a lot on, and thinking a lot about, how people can write bespoke and/or custom AI pipelines that are reliable. Why is this important? Why are you interested in this?

shreya: Yeah, foundation models and large language models are really awesome. In-context learning, which is being able to provide demonstrations in the prompt, can yield almost intelligent answers to many people's different tasks, across different domains like finance, medicine, and whatnot. That's super cool. The other main cool thing is that you don't need to train a model and collect data. But what ends up happening is, you might have people in the same domain who are prompting a model to do some business task for them, but they actually have different constraints on their outputs, or different definitions of good and bad outputs. [00:06:00] One person writing a document summarizer pipeline might want different key-value pairs extracted than another person. And in that case, if they're both using the same base model, how are we supposed to ensure accuracy? Clearly the metric definitions have to be a little different, and the systems that we use to evaluate those metrics also have to be a little bit different. That, I think, becomes a challenge: as people are engineering their own pipelines, how do we build assistance for them that will help them with evaluation specific to their own project and implicit preferences?
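A minimal sketch of what "demonstrations in the prompt" can look like in code, assuming an OpenAI-style chat client; the task, the example receipts, and the model name are illustrative assumptions, not something from the episode:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def build_prompt(receipt: str) -> str:
    # Two worked demonstrations teach the model the output shape we want,
    # with no fine-tuning and no collected training data.
    return (
        'Extract the vendor and total from the receipt as JSON.\n\n'
        'Receipt: "Blue Bottle Coffee -- 2 lattes, total $11.50, paid by card"\n'
        'Output: {"vendor": "Blue Bottle Coffee", "total": 11.50}\n\n'
        'Receipt: "Office Depot invoice #882: paper and toner, amount due $214.00"\n'
        'Output: {"vendor": "Office Depot", "total": 214.00}\n\n'
        f'Receipt: "{receipt}"\n'
        'Output:'
    )


def extract_receipt_fields(receipt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model works here
        messages=[{"role": "user", "content": build_prompt(receipt)}],
        temperature=0,
    )
    return response.choices[0].message.content


print(extract_receipt_fields("Lyft ride to SFO, $38.20 charged to Visa"))
```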
hugo: Makes sense. I'm glad you mentioned in-context learning as well, because I think it's so important, and it's one of the things we talk about less than the generative aspects. And it's super exciting. I did something the other day: I've been playing around with Runway ML, their Gen-3 stuff. It's mind-blowing what's possible there, [00:07:00] and they have a nice prompting guide with a few examples of prompts. There's a little markdown cell, I think, which says: if you do your initial setup and camera, then details of the scene, then extra details, along those lines. And I was like, oh, this sounds like something where I could just put this entire guide and the examples into Claude and see what it generates. So I played around with it, and it started generating prompts for me. Then, of course, I don't know if you've played around with Claude Artifacts yet. For everyone who hasn't: it allows you to generate code, or whatever else it may be, and it will even render React code in the browser for you, that type of stuff. And I thought, okay, I'm chatting with it to generate prompts for Runway Gen-3, but I could actually automate a bunch of this stuff. So I got it to build a little React app with drop-downs that allowed me to just play around with things, and then it would generate prompts. And because of the in-context learning it got [00:08:00] from the examples, it was relatively good, actually.

shreya: That's so awesome. I think that's so incredible. It just enables everybody who's a little bit of a developer to make these applications. Not that you're trying to make a company out of it, but now in 10 minutes you can build an application that maximizes the next three hours of your time when you're playing with Runway. That's super cool.

hugo: Exactly. And this is actually something we hadn't planned to talk about, but maybe we'll get to it later. I think something we don't necessarily talk about enough is the sense of play that we all get to experience now. And I think there are significant implications we really need to think about, for work and for creative fields, from all of these models and infrastructure. But I don't think we talk enough about the idea of thought-to-app. I'm being a bit over the top by stating it that way, but the removal of friction in going from having an idea to building an MVP is absolutely wild. [00:09:00] Okay, I'm really excited about how interested you are in people being able to build bespoke AI pipelines that are reliable. What are the moving parts needed to be able to write AI pipelines that are reliable and custom and bespoke?

shreya: Great question. It's very similar to traditional ML. I think there's a first data preparation part. Data is not going to fall out of the sky, right? You've got to do some work to create what you're going to pass in to an LLM. Then you have either a single LLM call or a series of LLM calls. Think of it as transforming your data in some way. So maybe you start with a document that you want to summarize, as well as some extra context that might be relevant to that document, which you might have retrieved from external data sources via data preparation. Then you pass it into the LLM to maybe extract some key information. [00:10:00] Then you might have a second LLM call to write
the summary based on that key information. And then you have some transformation of that output back to the end user, or back to storage, or something like that. So there are a lot of moving parts in there. Some are LLM-specific, some are not; all of them involve data transformation. And at every step you need to have confidence in the quality of the input and the quality of the output. If the AI made a mistake halfway through the pipeline, good luck trying to recover the rest of the pipeline. So I think that's really what makes it hard to prototype each component, string it together, and then have an overall system that... it feels like everyone should have faced this problem before, but in reality you're building your own bespoke version of it.

hugo: Yeah, that makes sense. I'll share these in the show notes, and I've shared them in the chat: a link to your wonderful article from [00:11:00] eleven days ago on data flywheels for LLM applications, and I've also shared your SPADE paper, "Synthesizing Data Quality Assertions for Large Language Model Pipelines." So, in terms of getting these data flywheels going for LLM applications: if people would walk away with two or three takeaways on how to start thinking about this, what would you like them to be?

shreya: Yeah. The first thing is, with LLMs, you can deploy something without having a lot of data upfront, like collected training data or evaluation data. So when you're doing that, you have to commit to a process of continually evolving your application based on production data. Second, if you buy this idea that you should be trying to continually improve your pipeline throughout a larger-scale deployment: how do you best take your production data and improve the components of your pipeline? [00:12:00] The prompt is not going to change out of thin air, and you have no real assessment of whether your production outputs are good or not. So in some way you need to be able to label those efficiently, correlate that against your human judgment, and then feed that back in to improve your base prompt, maybe via few-shot examples or some demonstrations. So: thinking about the dimensions along which you want to improve your prompt, and coming up with good metrics to evaluate. I think that's the process we start thinking about when we want to create such a flywheel.

hugo: And so when doing this, how do you initially think about testing data and quality of data and these types of things? Do you start with assertions, or where do you move from there?

shreya: Good question. I think a lot of people have different preferences. I like Hamel's take (Hamel's a good friend of ours) on having unit-test-like assertions up front. What are the absolute basics that [00:13:00] your outputs need to respect? If you have a prompt, and in that prompt you say something like "don't include this phrase in your output," or "you must always start your output with some phrase," your assertion should test that this property holds for your output. Those are the basics. Then I would say a level above that is three or four still-binary, but more complex, metrics of output quality. So it can be something like "captures a diversity of information," or "has values for at least this many keys." I'm just making stuff up.
This is just stuff that I've seen, but it's about being able to codify what makes for success, and then also inspecting your data every day in deployment to see what gives you bad vibes, and then seeing if you can codify that into some high-level metric description. When you have metric descriptions, then it's just a question of evaluating [00:14:00] those metrics. Should you write a code function to evaluate the metric? That might work for things like checking response length. Should you use an LLM to evaluate the metric? In that case, I probably need to provide examples of good and bad with respect to that output in the LLM prompt, so that I can be confident the LLM evaluator is doing something that I would do, and then watch that over time. So a lot of it is just figuring out how to codify what's good and bad, figuring out how to teach the LLM, via in-context learning, why you think it's good and bad, and then using the LLMs to really scale up your judgments.

hugo: Using the LLMs to really scale up your judgments. I like that, because it frames you, once again, as the domain expert and the LLM as an aid. This is cool. Stepping back a bit: I don't know if I said this when I introduced you, but a lot of your [00:15:00] work is framed as research into human-computer interaction, HCI, and this is really speaking to, once again, getting humans to do what they're good at and getting computers, ideally, to do what they're good at. So, with respect to the data flywheel we're talking about, you can slice it in a number of ways, and one way to slice the proposals in the paper we've linked to (I've also linked to Hamel Husain's blog post "Your AI Product Needs Evals," which speaks to the levels Shreya was talking about as well) is: you need evaluation, monitoring, and continual improvement. So establishing clear, actionable success metrics, then implementing robust monitoring systems, and then using the data collected from monitoring to iteratively improve models. The reason I'm stating all this in the context of human-computer [00:16:00] interaction is that I'm wondering what patterns and anti-patterns you've seen for humans doing what they're good at and computers doing what they're good at, throughout this entire process of establishing a data flywheel.

shreya: Hmm. Something that is extremely successful for LLM-as-judge, which papers are not writing about because it's so new (I think it's tied to the release of GPT-4o), is providing good few-shot examples. So if I have a metric that evaluates whether the tone is professional: put in two examples that are professional in tone, put in two examples that are unprofessional in tone, ship it. That evaluator is very aligned with what you might think is professional, much more so than if you didn't have examples. I think the anti-pattern is precisely the opposite of that. People will ship off calls to LLMs to be a judge with some [00:17:00] prompt, and there's absolutely no specification of what professional means to them, or what good means to them. And it doesn't even have to be subjective. Here's one example, for a pipeline summarizing medical documents and trying to strip PII, personally identifiable information, from the medical document. If you just tell the LLM, "make sure there's no PII in the output,"
it doesn't really know what PII means. But if you say "don't provide a name, or gender, or age, or location, or phone number," if you do the work to define it pretty robustly, all of a sudden LLM performance skyrockets, right? So, a long-winded answer to your question: the good pattern is people who specify [00:18:00] things the way you would if you were working with a colleague. You might need to write a pretty well-defined spec for them, especially if you've never worked together before, to make sure you're on the same page about what certain things mean. And the anti-pattern is just assuming that LLMs read your mind.

hugo: So Shreya, this is my experience as well. I love this. I haven't spoken with many people about this, but the types of things you just mentioned seem like wonderful ways to work with humans as well. And I'm wondering: when we're chatting with LLMs or working with them, we're being far more mindful about how we communicate than we ever are with humans, for the most part. Do you think there are things we can learn about human communication from our interactions with LLMs?

shreya: Oh gosh, I haven't thought about that explicitly. I'm sure a lot of things.

hugo: The n-shot, or rather the in-[00:19:00]context learning, I think is a wonderful example for me. Because when thinking about how, let's say, I have an intern and I'm telling them how I approach projects, I can tell them in the abstract, or I can sit down and show them a good example and then a bad example, and that's way more instructive. You don't want them to over-index on the examples, though. That's the concern.

shreya: Totally. But it can be extremely illustrative. I think another thing is that LLMs have such a fast feedback loop. Compared to when you're talking with another person, or collaborating with someone over Slack, where it might take them a bit to respond, the LLM responds immediately, for good or for bad. And for myself personally, I've realized that sometimes when I struggle to communicate with the LLM, it's often because I was a little bit vague. And it's given me a little bit more appreciation for, okay, how do I be more specific and thoughtful when I'm communicating with other people? Otherwise it can go back and forth and devolve into a long conversation. [00:20:00]

hugo: Totally. And for the most part I agree. I don't agree, for example, with ChatGPT currently with 4o: the amount of time it takes, because it's so verbose, even when you tell it to respond in 30 words, you'll sit there and just watch it. I'll be working on, let's say, some code, or a markdown document with it, and I'll say, can you just give me the new section, do not reproduce the entire thing, and it will just constantly reproduce the entire thing, sometimes several times. It's gone wild.

shreya: That's crazy. Yeah. You gotta put it in memories.

hugo: Yeah.

shreya: In memory, in the ChatGPT memory.

hugo: Does it use memory for that, or just the system prompt as well?

shreya: I think the memory is included in the system prompt, if I'm not mistaken; maybe there's a separate one.
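A rough sketch of the difference Shreya is describing: an LLM-as-judge prompt that spells out what "professional" means and shows one passing and one failing example, rather than shipping off an unspecified "is this professional?" call. The criteria, the graded examples, and the model name are illustrative assumptions, and the client assumes an OpenAI-style chat API:

```python
from openai import OpenAI

client = OpenAI()

# Anti-pattern: "Is this reply professional? Answer yes or no." with no definition.
# Pattern: define what "professional" means to you and show graded examples.
JUDGE_PROMPT = """You are grading customer-support replies for professional tone.
"Professional" here means: complete sentences, no slang, no blame directed at
the customer, and a concrete next step.

Example (PASS): "Thanks for flagging this. I've reissued the invoice and you
should see it in your inbox within the hour."
Example (FAIL): "yeah thats on you tbh, you typed the wrong email lol"

Reply to grade: {reply}

Answer with exactly one word: PASS or FAIL."""

# The same idea applies to the PII example: don't just say "strip PII",
# enumerate what PII means for your pipeline.
PII_FIELDS = ["name", "gender", "age", "location", "phone number"]


def judge_professional(reply: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(reply=reply)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")
```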
hugo: The other thing I love is that you mentioned PII, because I did something the other day which speaks to a few things I find interesting in this space, including local models, [00:21:00] composability, and using different models together. A friend sent me an email asking me to help with some financial stuff for their nonprofit, just doing some math and that type of thing. But when I read it, I thought, actually, this looks a lot more challenging than I expected, and I don't know if I can do this. And I thought, oh, I'd love to paste it into an LLM and ask Claude about it. But I can't, because this is my friend's business. So what I actually did was use Ollama and Llama 3 7B locally, which is surprisingly incredible for a model that size. I put the email in locally and worked with it to anonymize it. And to your point, I didn't say "anonymize"; I said, anything that is a business name, a person's name, or a something-name, please change to something else, and so on. It spat it out immediately, and then I was able to paste it into Claude and verify what I thought was happening.

shreya: Yeah, it's crazy. If you had just asked it to anonymize, it probably wouldn't have [00:22:00] worked as well as it did when you specified what anonymity means. And it's silly, because of course, if you ask the LLM, hey, what does it mean to anonymize, what kind of information should be stripped out, and then you in turn use its response in the anonymization prompt, it might do a lot better. I think these are the interesting idiosyncrasies with LLMs that are different from working with humans. A human might explicitly reason about something. But I don't want to get into the territory of anthropomorphizing LLMs.

hugo: Totally. I'm interested in your SPADE paper, which I've linked to. SPADE is an acronym, as you know, for synthesizing... oh wait, what is it? I'm just reading the title of the paper, "Synthesizing Data Quality Assertions for Large Language Model Pipelines," but that isn't what the acronym stands [00:23:00] for. So maybe you can tell us a bit about it.

shreya: Yeah, it's System for Prompt Delta Analysis... gosh, I need to see it in front of me to be able to tell you. Prompt Analysis and Delta-based Exploration, or something. Okay, I can tell you the story behind SPADE. When GPT-3.5 came out, and even shortly afterwards, people wanted to build bespoke data processing pipelines with it. When I say data processing pipeline, I mean a document summarizer pipeline, or a "strip all PII from this document" sort of pipeline, or "write a response to this email covering these three keys," whatever. And people found that it works like 75 percent of the time, and in the 25 percent of the time it doesn't work, it's because of some stupidly simple thing that the LLM forgot to do or didn't listen to. And maybe if you prompt it again, that error will go away and it'll be [00:24:00] fine. So they wanted a set of constraints, or assertions, that they could just run on the outputs to figure out which outputs are bad and should get thrown away, re-prompted, or retried. So we were like, okay, how do we do that? How do we come up with custom constraints, or custom assertions, whatever you want to call them, for pipeline outputs? Maybe we can ask an LLM to come up with constraints, but no one really liked that. So we thought, okay, is there a broader way to think about generating constraints that is applicable to any kind of pipeline?
For example, is there a taxonomy, or categorization, of the types of constraints that people might want? Then we can use an LLM to identify those types within custom pipelines. So, for example, one common constraint people want is length-based constraints: [00:25:00] "give me at least two outputs," or "say no more than five words," something like that. If we ever detect a phrase like that in someone's custom prompt, we should definitely write a constraint around it, because we know LLMs are pretty bad at following those instructions, and they're easy to evaluate with clearly defined success. So we did an analysis of a bunch of people's bespoke pipelines and their prompt histories to determine, okay, how did these prompts evolve? Presumably, when people add to their prompts or reword their prompts, it's because there's a failure mode in the LLM, or it points to something that they care about. For example, if you change your "no PII" instruction to something like "don't include name, gender, age, et cetera," then clearly this PII thing is something that you care about in your outputs, and so we should design a constraint or assertion around it. So that's how the idea for SPADE was born: [00:26:00] can we come up with a pretty good taxonomy of prompt deltas, turn those into custom assertions or constraints based on your custom pipeline, and then operationalize it? Where can we find good implementations of these constraints that actually align with your preferences, maybe as code functions, and make sure they're correct? So yeah, that's SPADE.

hugo: Great. So I'm interested in takeaways for someone working in industry, building pipelines right now. What can they do immediately as a takeaway from this work?

shreya: Yeah, and this is something people have already explored, actually, since we wrote the paper. You can bootstrap your own metrics for your own prompts, or for your customers' prompts, based on a taxonomy of edits, like the taxonomy that we have in our SPADE paper. There's a taxonomy also in a similar [00:27:00] HCI paper, and they're largely the same. Put this into ChatGPT, along with your prompt, and ask for constraints or assertions that match the categories, like inclusion rules, exclusion rules, count-based rules, qualitative rules, stuff like that. This is a great start to evals. I think people just don't explore; they get so road-blocked, or writer's-blocked.
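A hedged sketch of that bootstrapping move: hand an LLM your own pipeline prompt plus a handful of assertion categories and ask it to propose candidate constraints, alongside the cheap code-based checks you can write yourself. The category names below are paraphrased from the conversation rather than taken from the SPADE paper, the helper functions are hypothetical, and the client assumes an OpenAI-style chat API:

```python
from openai import OpenAI

client = OpenAI()

CATEGORIES = ["inclusion rules", "exclusion rules", "count-based rules", "qualitative rules"]


def propose_assertions(pipeline_prompt: str) -> str:
    """Ask an LLM to draft candidate assertions for a bespoke pipeline prompt."""
    request = (
        "Here is the prompt used in my LLM pipeline:\n\n"
        f"{pipeline_prompt}\n\n"
        "Propose concrete assertions I should run on every output, grouped by "
        f"these categories: {', '.join(CATEGORIES)}. "
        "Prefer checks that could be written as simple Python functions."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": request}],
    )
    return response.choices[0].message.content


# The cheapest assertions need no LLM at all: codify the "absolute basics".
def within_word_limit(output: str, max_words: int = 200) -> bool:
    return len(output.split()) <= max_words


def avoids_forbidden_phrase(output: str, phrase: str = "as an AI language model") -> bool:
    return phrase.lower() not in output.lower()
```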
hugo: Totally. This dovetails really nicely into the eval stuff that we're here to talk about as well. I'm just quickly interested in a bit more context around your work. I'm looking at the SPADE paper: a lot of people working in academia, and then people at LangChain as well. I'm wondering, firstly, why this group of people? And secondly, I know a lot of your work is almost sociological, if not actually sociological, in terms of seeing how people in industry are actually working and figuring out what the [00:28:00] patterns and anti-patterns are. So I'm wondering if you did that for this paper as well.

shreya: Yeah, we didn't do any specific user studies for this. We did some user studies and talked to a bunch of people who used SPADE, to understand more about its shortcomings, but we didn't write about any of those insights in the SPADE paper, just to keep it focused on the technical contribution. Typically the HCI papers are where we write about more of these qualitative insights. And the "Who Validates the Validators?" paper was more of a follow-up to this. Coming out of the SPADE paper, I felt we needed to have an interface: we need to allow people to iterate on assertions, to interactively create them. It's not just a system where you drop in your CSVs of gold-standard data. A lot of systems are like this; it's crazy. A lot of LLM optimization systems are like this: you just drop in a CSV of gold-standard data and they say, okay, we're going to maximize performance on some set. It just doesn't [00:29:00] always work. There's a UX problem in creating that gold-standard data in the first place. So that's the motivation for what got the HCI project started.

hugo: Great. I love the fact that, with a lot of the high-quality LLMs, I can drop in a CSV and it won't mind about delimiter issues. That's one of my favorite things in the world. The other thing, which is actually incredible, is that I can paste in pretty much any traceback or error. If I'd pasted it into Google, it would get confused by my local file directory structure and paths, whereas LLMs know to just abstract away all of that stuff and focus in on the actual error message, which is quite fantastic. I'm glad you mentioned "Who Validates the Validators," because that's what I wanted to get [00:30:00] into next. This is all around LLM-assisted evaluation, and, once again, what we should get humans doing and what it makes sense for LLMs to do. You've mentioned that this is in some ways a follow-up paper, or inspired by the results of SPADE, but I'd love to hear about the genesis of it, what it was like doing this research, and what the takeaways are.

shreya: I went into it with a lot of learnings from SPADE, and feeling like, okay, there are some things I learned from SPADE that are sticky. Creating custom metrics based on prompt deltas: that's awesome, everyone should do it. The idea that you could actually solve for accuracy, that you can formulate this as an optimization problem and probably get assertions with good coverage: that's great. No one's going to do this on a daily basis, but the fact that it's possible is great, I [00:31:00] think, for improving reliability, or saying something about reliability. So that's the mindset I went into the EvalGen work with. It was also a collaboration with the folks at Chainforge, which is a really awesome open source tool for prototyping chains in a no-code, low-code interface. And Chainforge users were apparently also feeling, okay, I'm iterating on my prompt, but I have no idea how to assess whether it's good or not for my pipeline. So how do I do that? So both Ian, who's the Chainforge creator, and I came into it having learned things from our respective projects, and that's how EvalGen was born: as an interface for soliciting human labels.

hugo: Very cool. And I've linked to Chainforge. On their landing page you can see chainforge.ai/[00:32:00]play, which is a cute little sandbox. Well, I use the word cute expansively: an awesome sandbox. I'm a huge fan of sandboxes more generally, particularly in the space we're working in, if you can get to a wow moment quickly. I do want to dive more into evaluating the evaluators.
Because you mentioned Chainforge: it's an open source visual programming environment for prompt engineering, right? And it's low-code and no-code. I am interested in your thoughts on the future of low-code and no-code tools for this space more generally.

shreya: You asked the right question. I'm super excited about it. My next set of projects is around building low-code tools for people to build complex data processing pipelines with LLMs. So if you want to write LLM pipelines that operate on streams of unstructured data, that summarize or do LLM calls across document [00:33:00] boundaries: I think all of this can now be done by LLMs, and thus we should build an expressive, no-code interface for people to be able to do that. Yeah, I'm super excited.

hugo: Incredible. I used to joke that low-code and no-code for data science generally wouldn't work because of delimiter issues. That's one of the jokes I used to make. But seriously, that's one of the first things you have to deal with; nearly all the other things I think you can deal with. I am wondering, in the LLM and generative AI space, what are the things that you think are doable with no-code, low-code environments, and where will the stumbling blocks be? What's challenging?

shreya: Yeah. So right now a good number of low-code tools offer what I think is akin to a map functionality, where a generic mapper function would take in a sequence of inputs and then return an output for each element: an LLM mapper. So this might be "take every document and [00:34:00] extract some information from it." Where I think tools are lacking is in the reduce functionality. How do we find groups of similar mapped outputs and do more LLM calls on those? How do we do deduplication? There are a lot of complex issues there. Say I wanted to count the number of people who were super excited about a holiday in a collection of documents, or news articles. This is quite difficult, because somehow you need to get all of these people, and then somehow you have to get all the instances of these people in the documents, and then group them and do a count. This becomes incredibly complex, given that LLMs might output a person's name differently in two different LLM calls, like first and last name, or just first name. So how do you identify the same person? So there are a lot of interesting technical challenges there. And I think there's also the data quality challenge. You want to do this at scale; it needs to work every [00:35:00] single time, and we know LLMs don't work all the time. So how do we identify the times when they don't work, and surface that to the human to correct, or employ another LLM to correct, so that it can be used in downstream applications? I think no existing data processing framework has that validator functionality built into it as a first-class citizen.
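To make the map-versus-reduce gap concrete, here is a rough sketch under the same assumptions as above (an OpenAI-style client, illustrative prompts): the map step is easy to express, while the reduce step already runs into the name-normalization problem Shreya mentions, which the naive lowercase-and-strip grouping below only papers over:

```python
from collections import Counter

from openai import OpenAI

client = OpenAI()


def llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content


def map_step(document: str) -> list[str]:
    # The "LLM mapper": one call per document, extracting the people who sound
    # excited about the holiday, one name per line.
    text = llm(
        "List the people in this article who are excited about the holiday, "
        "one full name per line, or NONE.\n\n" + document
    )
    return [line.strip() for line in text.splitlines() if line.strip() and line.strip() != "NONE"]


def reduce_step(documents: list[str]) -> Counter:
    # The hard part: the same person may come back as "Jane Smith", "Jane", or
    # "J. Smith" across calls, so this naive normalization will under-merge.
    counts: Counter = Counter()
    for doc in documents:
        for name in map_step(doc):
            counts[name.lower().strip()] += 1
    return counts
```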
hugo: So, agreed, LLMs don't work all the time. Also, needless to say, machine learning models don't work all the time. In what similar ways do they not work, and in what different ways do they not work?

shreya: That's a good question. I think LLMs are robust to errors that traditional ML pipelines are not robust to. For example, missing values in traditional ML. In data preparation for traditional ML, there are many different steps involving imputation, normalization, et cetera. So you could have a missing value or a [00:36:00] corrupt value that ends up getting transformed in a weird, unintelligible way that totally corrupts the downstream model, for example if it was an important feature. This problem is less likely to occur with LLMs, because if there's a missing value in a document, or it says something gibberish, an LLM still might be able to contextualize that and use the other information. I don't know if what I said makes sense, but I think LLMs are just a little bit more robust in that way.

hugo: Let me say, it definitely makes sense to me; I don't know if it's correct yet, so let me provide some sort of potential counterexample. In a traditional ML pipeline, if I have missing values, I may get an error, which means I need to explicitly set how I'm imputing missing values. And there may be, in a lot of cases, a very different outcome if I impute the mean as opposed to the median, [00:37:00] right? An LLM may do it automatically and fail silently, in that sense.

shreya: Your ML pipeline also might fail silently.

hugo: True.

shreya: If you have good validation, then... I think it's easier to have good validation in ML because we have frameworks for it. We have TFX, we have TFDV, and we have research around what makes for good data validation; we do schema validation. We don't have this for LLMs, so maybe it's easier to silently have bad data show up with LLMs. That I would agree with.

hugo: Yeah. I'm even thinking, in a classic scikit-learn pipeline, I need everything to be a NumPy array, so I do actually need to hard-code it in that case. Shreya, how do we evaluate the evaluators?

shreya: Hard question.

hugo: Hey, you wrote the paper.

shreya: Yeah, that's true.

hugo: How do we think about it? What does that question even mean? [00:38:00] Let's start with that.

shreya: Okay. What I think is bad is when you ask an LLM to validate something and you have not explained to the LLM how to validate it. You have not defined the criteria that you're trying to evaluate. You have not provided examples of how you, as the human, might evaluate it. You're just shipping it off to the LLM and saying, well, it was RLHF'd somewhere, on some data, someday, so it's going to work. That is what it means to not validate the validators. What I think is a good way to think about validating the validators is to try to encode your process of validation, as much as you can, into the validator prompt, or function, if it's a code-based function. So, as a human, how would you [00:39:00] grade this output? Would you write a Python function to do it? If you were checking word count, you certainly would not eyeball it and look at every single word to check the word count. No, you would write a function to do this for you, and just use that function every time. If you were checking for professionalism in the tone: what does that mean to you? Are you looking for complete sentences? Are you looking for short sentences? Are you looking for certain phrases, or the absence of certain phrases? What is an example of professional for you? What is an example of unprofessional for you? Being able to come up with this spec: that's the validating-the-validators process. Does that make sense?

hugo: Makes perfect sense. So then how do we do it?

shreya: Do the process! Read the paper! Come up with your metrics, or bootstrap them with SPADE or an LLM.
Then think about [00:40:00] example outputs that would pass the metric. Think about example outputs that would fail the metric. Code these as few-shot examples for your LLM validators, and then see how it does. Keep looking for instances where your judgment differs from the LLM validator's judgment, and update your few-shot examples accordingly. I feel like at some point, for the people I've talked to (and I've also done this for a company), you align. It's not 100 percent, but it's certainly not the 60 percent where you start. You can go from 60 to 90. I think that's a huge boost.

hugo: Yeah, absolutely. I'm going to ask a general question; interpret it in any way you will. I'm just wondering how scientific this is, or if there's much science to it. A vibe check clearly isn't scientific; I mean, it's the beginnings of science and experimentation. I think what we have in data science now is only getting to be scientific, and I [00:41:00] don't think machine learning as a discipline is as scientific as it could be. So I'm wondering how you think about this: are we trying to establish rigorous scientific methodologies for LLMs and generative AI in production?

shreya: Okay, this is a great question. What does scientific mean? I think there are a lot of aspects of scientific workflows that don't always have to hold in the industry setting, or in the LLM setting. Working with humans, collaborating on a project: is that a scientific thing? It's certainly not reproducible. You certainly have a goal you're working towards, but you're not provably taking the optimal path toward that goal. It's okay that it's not scientific in that sense. I think we should study things scientifically, but the workflows that we use don't necessarily have to be scientific to have business value. And maybe that's a take.

hugo: I do think it's a hot take in [00:42:00] some ways, but I like it. I like all hot takes. We're both, as we said, close friends with Hamel Husain, so who wouldn't like a hot take? I love it. And I love that you asked what even is science, because I've got a good friend who used to be a research scientist, and I once told him that I joke that I'm a retired scientist. And he said, no, you are not, Hugo, you're still a scientist: you employ the scientific method every day in a lot of the things you do. So a lot of it is science.

shreya: You have to consider the different flavors of science. There's the physics way of studying things, empirically making observations and fitting models to them, versus... I don't know, there are so many fancy ways of studying problems.

hugo: Yeah. I've got one that I really like, and I have a half joke, which is not a joke at all, actually: generative AI stole the term generative, in my opinion, just like deep learning stole the term inference. Deep learning "inference" is not [00:43:00] inference; it's actually the opposite of inference in many ways, because you're learning nothing about the system per se. There's a Bayesian sense in which you can flip it and it is inference, but going in the other direction. My background is... I'm probably a Bayesian, is what I say, but generative modeling in a Bayesian sense is understanding the data-generating process, right? Whereas in generative AI, we're so far from that.
And one thing I love about Bayesian modeling is that it allows us to model the internals of the system we're trying to study and develop some form of robust predictions that we can test the model against. So I probably think of scientific as having some sort of model of the world, and it seems like here this is all contained within the LLM, so you essentially have a black box that you're prodding and poking. So it's actually tough to do something that's somewhat deterministic and somewhat reproducible, which I think matters, [00:44:00] especially if we're using terms like data science. Part of it is that businesses already have a lot of processes that are successful. I've worked with a lot of incumbent companies, in a variety of spaces, on data strategy and AI strategy. They've already got systems that have worked for them for years, that shareholders are happy with, and now they need to think about these types of things, and it's not clear that the new systems will work any better, if not worse, than the previous ones. So I suppose that's a word salad in terms of how one could think about science here. I'm not happy with what I just said, by the way; I could make it a lot crisper.

shreya: Think about it and then get back to me.

hugo: Yeah. But I'm wondering... it feels, and machine learning does in a lot of ways as well, like incantations: I'm in my wizard hat with a cauldron, hoping things go for the best. So I'm wondering if there is an idea to make these processes more reproducible, deterministic, and scientific.

shreya: Yeah, it's tough. I think a lot of times, when you're doing experimentation, the goal is [00:45:00] not to have it be reproducible. You want to learn something. Data science is a great example of this. Sometimes you just want to get high-level vibes of what's going on in your data, and you're not out here to build a reproducible machine learning model. That being said, you should go about your exploration process with a huge grain of salt, and only try to get very high-level vibes from your data, not anything concrete that you have to swear by numbers. But it's certainly useful to get these high-level vibes, and it doesn't need to be reproducible. You can determine, say, the majority value in a dataset.

hugo: I agree it doesn't need to be reproducible in the way that classic scientific research and computation needs to be. But it does need to be reproducible in an extended sense, in that if you make a prediction, it needs to reflect something that will likely happen in the real world, right?

shreya: Totally. But I think you can [00:46:00] use common judgment to interpret your own findings.

hugo: That makes sense. I'm dropping in the notes your paper, which I love so much, from 2022: "Operationalizing Machine Learning: An Interview Study." And finally: we've had public conversations many times now, and yesterday I wondered, wait, how many times have I chatted with Shreya publicly? This is actually our fifth public conversation, Shreya.

shreya: So...

hugo: I know, right? And I'm surprised you haven't blocked me yet, but I appreciate you coming back every time. The reason this is relevant is that the first time we had a public conversation, for an Outerbounds fireside chat, it was on your "Operationalizing Machine Learning" paper. And ever since then, we've done conversations on generative AI and on LLMs.
So we did the one at the Exploratorium, of course, and then the one with Jeremy [00:47:00] Howard and Hamel Husain, where Jeremy famously said OpenAI is not going to make it. And then we recently did one with you and all your coauthors on what you learned working with LLMs for a year. The reason this is interesting to me is that I'd love to go back to the start of our journey together and think about machine learning. We've been talking about LLMs and generative AI for so long now, and I'm wondering what it can tell us about machine learning and data science. We've been touching on this so far, but what similarities carry over to the world of machine learning and data science, and what are the differences?

shreya: Okay. Interesting that you said similarities to traditional ML and then started talking from the algorithms perspective; my thoughts went to engineers. The first thing is that the [00:48:00] success of LLMs shows how difficult ML actually was, and that there's a huge appetite for tools that allow anybody, any day, to prototype intelligent software systems, especially in a low-code, no-code setting. I think that is the number one takeaway I've gotten from Gen AI. I think the most surprising thing, which I would have never expected, but which I think I've done a good job adapting my research direction to, is in-context learning. I would have never expected in-context learning to work as well as it does, so well that people actually don't need to train models for their bespoke tasks. In the past, I thought ML only worked because we were always overfitting on the world, on whatever world of data people were collecting, and just hoping that at query time, at inference time, the distribution matched the training distribution, and so forth. But apparently you can have foundation models that just do in-context learning. I think that's crazy, and I think it remains to be seen what [00:49:00] the downstream applications of this are going to be, for sure.

hugo: And there is a joke that probably a lot of ML overfits. Clearly LLMs are themselves overfitted to at least swathes of their training data. GPT-4o, I get stuck in local minima with it all the time, actually.

shreya: The chat thing is crazy to me. I think when you have chats with super long histories, a lot of things go haywire. You're going to get hallucinations, apparently, even if it's GPT-4 and your conversation history is 25 messages long. You might not have the LLM listening to your instructions. It might start being verbose and then saying "I apologize, blah, blah, blah," and then all this garbage verbosity. I don't know how to solve this problem of long chats. I don't know why we're doing chats. Maybe it's because it's the best way to productize. I don't know.

hugo: Also, to your point earlier about n-shot [00:50:00] prompting: a point that you made in your applied LLMs paper, and that Hamel has talked about, is that in these long conversations, seeing what the actual prompt is could be super useful. Because if I've contradicted myself in a conversation... I don't even know how the memory works, or what it actually looks like, because the actual prompt is more than likely long, full of contradictions and nonsense, right?
So I think maybe this is a UX question in the end: how can it surface to me that perhaps I should start a new conversation, or something along those lines?

shreya: Yeah, absolutely. I don't have long conversations with LLMs. I try to keep it to three messages max, and then move on.

hugo: Fascinating. I had one with Claude last night, and you know Claude will chat with you for a certain amount of time, even on the Pro plan, and then it'll cut you off for a while, which is probably for the best. But somehow I pasted a lot [00:51:00] of long documents into it, and then it freaked out. I've never seen this before: in the browser, Claude started flashing different colors at me, and then Firefox crashed. And I thought I'd had a hallucination, not just the LLM. So it got pretty wild pretty quickly. I'm glad you mentioned differences between what we train on and what happens at inference, and overfitting to training data, because this relates to drift as well. So I'm wondering if you can say a bit about the types of drift that are important in ML, and then, in your EvalGen paper, you introduced the concept of criteria drift. I'm interested if you can chat about that, and why it's important.

shreya: Yeah. I like Gen AI now because it's exposing different kinds of drift that people get, and that can be understood and characterized. Criteria drift is an example of human preferences changing over time: what metrics mean [00:52:00] "good," and the implementations of those metrics. Maybe verbosity meant something yesterday, but after ChatGPT came out, suddenly verbosity means something else to me now. Okay, that's interesting. I think there's also a business logic drift that I don't know how to characterize, but people are observing it: when they deploy LLM products, they think their users are going to have some query workload, and then that changes, or it refines. I don't have a great explanation of it, and I don't know if that made any sense, but somehow there are user patterns, or behavioral patterns, that emerge, and there's a drift there. I think in traditional ML, people were trying to quantify drift a lot more, because that's all ML was, right? ML was just vectors of floats that were shipped to a model, and out came predictions. So when we thought about drift, we didn't think about it in this human-centered way, or business [00:53:00] logic way, or product logic way. We thought of it as, oh, this feature distribution has a KL divergence of whatever number, which isn't interpretable. And I'm excited now that we actually have better ways to reason about what drift means and what its impact is.

hugo: That's fascinating. I need to go away and think about that; it's generating a lot of ideas about how LLMs, and our interactions with them, can teach us so much about other aspects of what we've done previously. I love that you mentioned KL divergence. Of course, that's Kullback-Leibler divergence, but I actually heard "kale divergence." Maybe I'm getting a bit hungry, but I like the idea of kale divergence as well. Spinach is one of the great greens. My parents were lucky in the sense that I actually loved green vegetables as a kid. I used to love raw broccoli, and sometimes I'd get my dad to pack a green bell pepper in my lunch and I'd eat it like an apple. So I was easy in [00:54:00] that sense, and a nightmare in every other possible way.
But also, I said pepper then, but of course we call bell peppers capsicum in Australia.

shreya: Yeah.

hugo: Yeah. So that's a thing. And we call cilantro coriander. I've totally lost my train of thought. So, we have been talking about evaluation, and I'm wondering if you could tell us a bit about EvalGen, which is something you've been working on. And actually, before that, more generally: if people want to go away and play with the ideas in the validators paper, what type of tools or toolchain would they need to build with and play with? I don't want to get too tool-oriented, but how would people implement these types of things? Python and f-strings?

shreya: Python and f-strings, print statements and asserts. [00:55:00]

hugo: Yes, that's cool. That's what's up.

shreya: That's the way to go.

hugo: Notebooks?

shreya: Sure. Whatever you want. Notebooks or your IDE. Yeah, EvalGen. So EvalGen is the interface that we built, which we're still in the process of integrating into Chainforge. It's an interface that gets humans to label outputs as good and bad. EvalGen will present outputs to the user, force the user to give feedback on them, and then, in the background, come up with good implementations of metrics such that, when they evaluate to true or false, they align with how the user's grading would grade them. So, biggest takeaways from this: I thought it would be very simple, like you just put an interface in front of people, get thumbs up and thumbs down, and after a while you've [00:56:00] gotten like 50 grades and then you're good to go: we've got aligned assertions. But we did a user study and we found a lot of things. First, people don't want to provide that many grades without seeing concretely what is happening. What are the metrics so far? What is the alignment so far? How many more do I need to do? People are very antsy. They don't want to just keep grading if they feel like it's not being used in some way. Another thing is that people need to see a lot of LLM outputs before they have concrete criteria for what their metrics or their rubric is. So they might think that they have some criteria, and then after they see outputs, they might say, oh, I don't want the ChatGPT outputs to be super verbose, and they might characterize that in a certain way. And then, after seeing 12 outputs for their task, it might be, oh, actually, no, I want to change my definition of verbose; I'm going to go back and regrade things, redo this. So I was like, oh, okay, this makes the system really complicated, [00:57:00] because now we have to account for metrics not being a fixed set, or not having optimal implementations. It's a fuzzy implementation that's going to align with your own fuzzy definition that's always changing. So then the very notion of optimization here is thrown out the window, because you're not optimizing towards a fixed set of goals. I thought that was super fascinating, because the eventual project I wanted to build was optimizing validators, and now I'm not doing that anymore, because what's the point? Yeah, I'll leave it at that.

hugo: And I've linked to the paper. Actually, you did a demo of EvalGen recently, which we can link to in the show notes, but next time, when you have the new version, we'll promote that as well.
Something we've talked a lot about is using LLMs to help us work with LLMs, from LLM-as-judge to synthetic data generation using LLMs. Now I [00:58:00] want your thoughts on the concerns. I sent you this article the other day, which I've put in the notes, by Cory Doctorow. Weirdly, I read out this quotation recently on the podcast, and I apologize for any cussing beforehand, but it's a quotation, and it discusses the potential challenges associated with feeding the output of LLMs into other LLMs, and what we get with these types of recursive processes. So, Cory Doctorow says: "Sadowski has a great term to describe this problem: Habsburg AI. Just as royal inbreeding produced a generation of supposed supermen who were incapable of reproducing themselves, so too will feeding a new model on the exhaust stream of the last one produce an ever-worsening gyre of tightly spiraling nonsense that eventually disappears up its own arsehole." Firstly, I just love reading that out, so I appreciate you humoring me there. But I do want to use it to frame a conversation [00:59:00] around the concern: is there a concern that we're generating nonsense when LLMs are interacting with each other as judges, as data generators, as interpreters, as machine learning classifiers? How do we then insert ourselves in there to stop this tightly spiraling nonsense that eventually disappears up its own arsehole? That is my question.

shreya: Great question. This is why I think you need a human in the loop everywhere: humans thinking of LLMs as scaling up their judgment and how they would do their workflows, and at any point where it's diverging, you have to go back. So, first of all, you need to continually provide the mechanism for assessing alignment in some way. Always be grounding; always do check-ins. If you're going to work with somebody, you don't just send them one Slack message and then never talk to them for the rest of your time working together, right? You have one-on-ones; you establish some cadence or sync. And then, whenever there is divergence, you do the work to align. [01:00:00] I feel like there's no other solution, unless you want to automate away your own brain, or thoughts, or preferences, and be okay in this LLM hellscape. That's fine too.

hugo: Absolutely. And funnily, I was just in our Twitter DMs, because that's where I read this from, and someone just responded to your paper saying, "I have reservations about using LLMs as judges. It is costly, and I'm not sure it is even effective," and they've linked to a blog post that says the challenge of AI is reliability. So I think that's something we both agree with, in the limit of not doing this correctly with HCI, essentially.

shreya: Totally. It's not perfect. As a researcher, I'm not thinking about absolute costs today; I'm thinking about costs next year, costs two years from now. Costs are rapidly decreasing over time. Just the fact that GPT-4o has come out, and it's significantly cheaper, and it came out this year: I can only [01:01:00] imagine that, in the best case, it's 10x cheaper next year, and 100x cheaper the year after that. And when you think about it in a deployment setting, you're really only paying the cost of evaluating with GPT-4o until your metrics are somewhat fixed and you have enough data to go and fine-tune evaluator models.
So if you collect enough data, certainly go and fine-tune evaluator models on that data. It will be significantly cheaper and will have the same, if not better, alignment. hugo: Great. So I am interested: you're publishing so much and thinking so much and working with so many interesting people. Are you intending to productize anything you're working on? How do you think about getting all these ideas out there for practitioners to build cool stuff with? shreya: Oh, we're so early. So first of all, I think of myself as a researcher first. I'm not actively trying to start a startup or whatever. I'm not searching for the idea that's going to start me a startup, if that makes sense. I [01:02:00] am interested in helping advance the ecosystem of AI or ML engineering, and I love being in research, or in academia, because I get this impartial view. I don't have a loyalty to any tool provider or even any LLM. What does it matter to me if OpenAI is better than Claude, or vice versa? So in that sense, I like my job, what I do here, and I would love to stay in academia. I would love to get an academic job and be a professor. And then, how do I think about getting my ideas into products? As a researcher, it's really hard to also run an open source project or build a product. I would write less and think about interesting research questions less, and my day would be closing GitHub tickets. So I personally can't do that. What I try to do is work with companies, with startups, with tool developers, and help advise: oh, [01:03:00] you should build this interface; or, here, let me play around with this, actually, no, I think this will be better; let me help get you the assertions that will get you over the line, and really just do my small part to advance the process there. hugo: Cool, and I love that. We've spoken more about going back to the ML and data science stuff than we have for some time. A not insignificant part of the audience of Vanishing Gradients are classic data scientists and machine learning engineers who perhaps do want to start thinking more about generative AI and that type of stuff. They may have played around with it. There are a lot of ideas in this conversation. I'm wondering what type of skills you'd encourage. So firstly, it's obvious that data scientists have some of the most important skills when working with generative AI, but what other types of skills or ways of thinking would you encourage this persona [01:04:00] to develop? shreya: Absolutely. I think a level of modular, systems-based thinking is really important here. It's very easy to prototype, you know, an LLM pipeline around one LLM call. But typically all applications start out like this: there's one LLM call, there are inputs and outputs. And after you look at a good number of your outputs, you realize, hey, maybe I will improve performance overall if I decompose this into two steps, or I actually include a router that classifies the original intent and then sends it off to a different prompt. And you can see here, right, as you're continually improving, suddenly your system gets a lot more complex. If you're doing this correctly, you're going to end up with a graph of LLM calls, right? It's not going to be a single prompt that does everything for you.
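To make the decomposition Shreya describes concrete, here is a minimal sketch of the "router plus specialized prompts" pattern. It is an illustration only: `call_llm` is a stand-in for whatever client you actually use (OpenAI, Anthropic, a local model, etc.), and the prompts and labels are hypothetical.

```python
# Sketch of a two-node LLM graph: a router call that classifies intent,
# then a task-specific prompt.

def call_llm(prompt: str) -> str:
    # Plug in your own LLM client here; this stub just marks the call site.
    raise NotImplementedError("wrap your LLM client here")

def route(user_input: str) -> str:
    """One LLM call whose only job is to classify the request."""
    label = call_llm(
        f"Classify this request as 'summarize' or 'extract':\n{user_input}"
    )
    return "summarize" if "summarize" in label.lower() else "extract"

PROMPTS = {
    "summarize": "Summarize the following document in three bullet points:\n{doc}",
    "extract": "Extract the key fields as `key: value` pairs from:\n{doc}",
}

def pipeline(user_input: str, doc: str) -> str:
    """Router node feeds one of two downstream prompt nodes."""
    node = route(user_input)
    return call_llm(PROMPTS[node].format(doc=doc))
```

Starting from a single prompt, adding a node like `route` is exactly the kind of "add a new module to the graph" step discussed next.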
And having the ability to recognize, okay, I need to add a new module, or I need to add a new node to this graph, [01:05:00] is super important. hugo: Awesome. I love that you mentioned a focus on modularity. In data science and machine learning, composability has always been incredibly important in different forms. Even on the analytics side, the pipe operator in R and RStudio, and the way it works with dplyr and ggplot2, is one of the bigger reasons I think R gained so much adoption for analytics and visual data reporting, that type of stuff. Similarly, in scikit-learn, pipelines are one of the best abstractions, and that's among the reasons it's garnered a lot of adoption as well. But I think now with generative AI, because we are seeing hundreds, if not thousands, of bespoke models that we want to chain together in a variety of ways, the idea of composability and Unix-like philosophies are becoming even more important again. [01:06:00] shreya: Totally. Yeah. And it's interesting. Look, notebooks are not designed for this kind of visualization of this DAG of, I don't even know what to call it, right? It's difficult. It's definitely a new way of building AI pipelines. In traditional ML, of course you couldn't know all of your steps up front, but you did know, okay, there are only so many data preparation things one can do, and there's not an infinite number of steps you're chaining together, and there are some steps you're not going to do after other steps and vice versa. AI is just very different. We're still learning what our toolbox of ways to decompose is. hugo: Yeah, totally. And when I had Jono Whittaker, who's now at Answer.AI, on the podcast, we talked about composability, but also that the generative AI mindset is thinking through atomic units as well. And [01:07:00] I'll link to a blog post that we published on this as well. But the idea being, if you have something you want to build, attempting to decompose it into atomic units and finding the best-of-breed models. So let's say I want text-to-image, right? Sorry, speech-to-image. Maybe I'll use something like Whisper large for speech-to-text, and then use something else for the text-to-image, right? So I might use Stable Diffusion 3 or something along those lines. So I'll be chaining these together. Now, okay, what I'm making here is an argument for understanding composability and pipelines to put together best-of-breed models. There is a future in which perhaps we have some big platforms that have all of these combined, and you don't need this style of thinking so much. I'm wondering your thoughts on the medium- to long-term future of best-of-breed models versus all-in-one platforms. shreya: Totally. I think the apt analogy here is AutoML. [01:08:00] AutoML in some ways was a massive success, but you could also have a very pessimistic take, in that absolutely zero people are starting out building an ML model with AutoML. It just doesn't make sense until you're at the stage at which you really need to do that kind of optimization, not just whether you're at the size of company that needs it, but also whether your product is at that level of maturity. It's just a lot. I think the same thing is going to hold true here for building LLM agents: thoughtfulness when building your application is going to get you 95 percent of the way.
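As a rough sketch of Hugo's speech-to-image example, here is what chaining best-of-breed models as small composable units might look like. The wrappers are placeholders, not real library calls: `transcribe` stands in for whatever speech-to-text model you use (for example, a Whisper model) and `generate_image` for whatever text-to-image model you use (for example, a Stable Diffusion model).

```python
# Compose atomic units: audio -> transcript -> image.

def transcribe(audio_path: str) -> str:
    # Wrap your speech-to-text model here (e.g. a Whisper checkpoint).
    raise NotImplementedError("plug in a speech-to-text model")

def generate_image(prompt: str) -> bytes:
    # Wrap your text-to-image model here (e.g. a Stable Diffusion pipeline).
    raise NotImplementedError("plug in a text-to-image model")

def speech_to_image(audio_path: str) -> bytes:
    """Chain the two best-of-breed steps; each can be swapped independently."""
    transcript = transcribe(audio_path)
    return generate_image(transcript)
```

The point of the composition is that either step can be replaced with a better model later without touching the other, which is the Unix-like philosophy Hugo is gesturing at.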
And then the auto-AI is going to get you the rest of the way, but the auto-AI is not going to get you from zero to 95. hugo: Awesome. shreya: And that's a hot take, no matter what the marketing copy for all these auto-AI providers may say otherwise. hugo: Totally. Look, we're going to have to wrap up in a second. I've linked to your website. And to everyone here in real time and [01:09:00] watching or listening afterwards, do follow Shreya across all the channels. I can almost guarantee you, and Shreya may hate me saying this, that if you work in the space, you will more than likely be able to deliver more value and eventually command more salary by listening to people such as Shreya. I honestly think that in my own work I am able to deliver significantly more value, not necessarily command more salary. But yeah, so thank you all for joining. Also, Shreya, just thank you for going out there. I framed it this way before, but I view the work you're doing, and that of a lot of our mutual friends, as being really at the frontier, bringing back knowledge that you've found there, in such a moving, growing, tempestuous space. That work is so much appreciated. shreya: Thank you. Keep talking about my work so I continue to have a job in academia. hugo: Totally, right on. I am interested, so do read [01:10:00] Shreya's papers and follow her on the socials and all of that. But if there's something, just a call to action for people: what would you like to see people doing more of if they want to explore more in the AI space? shreya: Yeah, building things and sharing. hugo: Yeah. shreya: But not in a salesy way. You don't have to be like, I built this thing and it achieves this percent accuracy. I think social media is actually full of those posts and lacking in posts where people actually talk about the challenges they run into when building these systems. So yeah, share more of that, give people stuff to work on. hugo: Yeah, amazing. Once again, thank you everyone for joining. I've linked to the podcast itself, and we'll release this next week as a podcast episode. Once again, please do share with friends. It'll remain on YouTube in perpetuity until YouTube folds, which, I don't know if my account will get canceled for saying this, but YouTube seems pretty chill these days. Yeah, and [01:11:00] do subscribe if you're interested as well. Thank you once again, all, for joining, and thank you once again, Shreya. shreya: Thanks, Hugo. Screenshotting.