The following is a rough transcript which has not been revised by Vanishing Gradients or the guest. Please check with us before using any quotations from this transcript. Thank you.

===

eric: [00:00:00] That's when you can imagine the simplest, dumbest thing, which is: okay, let me try to upload these five papers into a vector database and just do a dumb and simple query of those papers. And you'll very quickly find out that under these circumstances, under certain queries, an LLM will end up hallucinating ideas that go across papers. That's great if you're trying to synthesize new ideas, but it's horrible if you're trying to be factual and understand the concepts to begin with, right? So that turns out to be a huge disadvantage of just dumping everything inside LanceDB and trying to do dumb and simple cosine similarity queries. I should also preface this: there is the classic query that any person who is too busy to think will ask, and that query is, "What's this all about?" And if you think really hard about how naive RAG works, "what's this all about?" has zero [00:01:00] relevance to any particular textual chunk of scientific papers. You're just going to get crap retrieved information, and then that's just not going to work. So that's when I started exploring a few other slightly more advanced ideas, like GraphRAG, originally by Microsoft.

hugo: That was Eric Ma, who leads research data science in the Data Science and AI group at Moderna Therapeutics. Eric is also one of my oldest friends in the space, and I'm so excited to release this episode, in which we dive into what retrieval actually means, whether you are working with scientific papers, product documentation, or internal company wikis, and why naive RAG pipelines often fall apart in practice. Eric walks us through why cosine similarity isn't enough, how vague queries like "what's this all about?" break most RAG systems, how he used graph RAG and manual overrides to improve relevance, and why aligning retrieval with user intent is just as important, if not more so, [00:02:00] than prompt engineering. We also talk about the trade-offs between simple hacks and fancy architecture, what it takes to build helpful LLM-powered assistants, and why most real-world retrieval is still more art than science.

Now, this conversation was originally recorded during the first run of my course, which I teach with Stefan Krawczyk, who's working on agent infrastructure at Salesforce: our course, Building LLM Applications for Data Scientists and Software Engineers. We're releasing it here because Eric just brings such sharp thinking and practical experience to a space where a lot of people are still figuring things out. The next cohort kicks off July 8th. It's our biggest one yet, with guest speakers from DeepMind, Hugging Face, spaCy, Raza, and more, plus $500 in Modal credits, $300 in Google Cloud compute and Gemini credits, eight months of Hugging Face Pro, and a bunch of other good stuff. Links in the show notes. Eric also gave a second talk in the following cohort, focused entirely on agents. I might release that one too; if you're interested in hearing it, or want us to bring Eric back for a [00:03:00] future cohort, which I'm working on at the moment, let me know on LinkedIn or Twitter. I'm Hugo Bowne-Anderson, and this is Vanishing Gradients.

eric: I'm in the research space, and as I introduced myself in the Discord, I'm a bio nerd, data nerd, computer nerd. Not that good with chemistry, but I can speak some chemistry.
And so I'm actually in the research space in biotech data science, which means that most of the work we do, we do in collaboration with other scientists; we're really part of the scientific process, discovering molecules and the like. So it's a really fun thing, 'cause I get to use a lot of my life science training from 2006 to 2017, ten or eleven years' worth. Not the exact stuff I did in grad school, but the broad general themes are there.

hugo: Incredible. There's so much to unpack, and I would love to chat with you more another time about specific foundation models for biological research as well, where I think we're just beginning to see what's possible. We've been talking about information retrieval a bunch this week; [00:04:00] that's partly why you're here. You've been using a bunch of RAG and information retrieval stuff at work, Eric, so I thought maybe you could just tell us about some of the use cases. Why even adopt this technology?

eric: Yeah, maybe I can explain a little bit. The focus I put is on the retrieval part, and cosine similarity really isn't everything. I think there's a blog post I wrote a little while back, which is: what are five ways that RAG shows up? What most of us think about with RAG is: oh yeah, go get a vector database, put in some query text, get the most cosine-similar thing, and stuff that into the prompt. But I think of RAG in a slightly different way: the prompt to the LLM includes my intent. Just for example, one of the things I wanted to build back in 2023, when LLMs were first becoming a thing, was an automatic git commit message writer, 'cause I hate writing git commit messages, right? So my intent is: I want an LLM to write the git commit message for me. The problem [00:05:00] is that without additional context retrieved from the computing environment, I don't have the substrate material to write an accurate and good git commit message. So how did we build it? In that particular system, the retrieval piece was literally: read the standard output of git diff with a few flags turned on, then take that plus my intent and combine them together. I built that when RAG was really just starting to become a thing, and once I had built this thing that didn't use cosine similarity or some fancy vector database, that's when, philosophically, it made a lot more sense to me that the core of RAG is: how do we do retrieval, right?

You can actually have manual RAG, in my opinion, where a human is curating this, and this was another tool that I built at Moderna and in my open source life: an automated documentation writer [00:06:00] in which I'm declaring my intents. But I'm not getting an agentic thing to read through my entire code base and try to figure out what exactly should be the source material. I'm the human specifying: go read this file, go read that file, combine the contents of those files automatically with Python code, combine that with the intents, and then write the documentation for me. So is that using string matching and grep-type things? No, it's actually even simpler than that. It follows exactly how Anthropic suggests we build LLM systems, which is: be dumb and simple right from the beginning.
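As a minimal sketch of the commit-message writer Eric describes above, where the "retrieval" step is nothing more than reading `git diff` output and combining it with the intent, something like the following could work. The prompt wording, the specific flags, and the `call_llm` placeholder are illustrative assumptions, not LlamaBot's actual implementation.

```python
import subprocess


def retrieve_staged_diff() -> str:
    """Retrieval step: the context is just the staged diff from git."""
    result = subprocess.run(
        ["git", "diff", "--staged", "--no-color"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def build_commit_prompt(diff: str) -> str:
    """Combine the intent (write a good commit message) with the retrieved context."""
    return (
        "Write a concise git commit message (subject line plus a short body) "
        "describing the following staged changes:\n\n" + diff
    )


def call_llm(prompt: str) -> str:
    """Placeholder: wire this up to whatever LLM client you use (OpenAI, Ollama, ...)."""
    raise NotImplementedError


if __name__ == "__main__":
    diff = retrieve_staged_diff()
    if not diff.strip():
        print("Nothing staged; nothing to describe.")
    else:
        print(call_llm(build_commit_prompt(diff)))
```

The point of the sketch is that the intent plus the `git diff` output is the entire "retrieval" pipeline; no embeddings or vector store are involved.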
eric: The way this documentation system works is: I write my intents as bullet points in the YAML header of a Markdown file. I have one section called intents, which is just a list of stuff, and then I have another section called linked files. The linked files are just literally file paths, .md or whatever, [00:07:00] relative to the repository root in which I'm writing my documentation. And then the rest of my Python code orchestrates opening each file, reading it in, combining it together with the rest of the text, and that's it. It's very dumb and simple.

There were fancier components that I started to add in as well, like LLM-as-a-judge: let me judge whether the documentation, the Markdown components that I have, are out of date with the intent. And then I had to do a few rounds of iteration to really refine what I meant by that. There are at least three ways this can be out of date, right? My content might not satisfy the intents; the code might have changed, which induces a change in the content; or, I forget what the third one is, but I needed to write three separate LLM bots to each judge whether the content was out of date in a particular way. That ended up using [00:08:00] structured outputs as well: I was generating JSON, and the JSON included a boolean flag, yes or no, and also included the LLM's reasoning for it. That other experience was like: yeah, okay, this was RAG in spirit, because I'm retrieving additional context to satisfy the intents of what I need, except I didn't do simple cosine similarity on a vector database, right? So this kind of exploration helped me broaden how I think about RAG to, basically: how do I curate the additional textual context that I need to feed into the LLM prompt to accomplish my goals? That is what RAG really is.

hugo: Absolutely. Philosophically we may disagree in terms of semantics, 'cause the way I framed it in this course is that we want information retrieval more broadly, and RAG is really just a very small part of information retrieval. What you are doing is saying the same thing, but you're using the term RAG to describe it, yeah. Okay. [00:09:00] Without a doubt. Now, you have also been doing RAG applications that are more researchy and recent, right?

eric: Yeah, because I do science, I read papers, and there have been times when I've been confronted, because the data science team interfaces with immunologists, antibody engineers, analytical chemists, all of whom have PhDs in their respective fields, and I need to catch up with their knowledge really quickly. So one of the things I do is ask them: hey, what are the papers I should read? Then I go and get a bunch of papers that are relevant to onboarding onto a new initiative, and they're disparate: there's one paper on analytical chemistry chromatography, and then there's another one on how mass spec is used to do this measurement, and I need to understand these papers together.
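Going back to the documentation writer Eric walks through above, here is a minimal sketch of that "manual RAG" pattern: the human lists intents and linked file paths in the YAML header of a Markdown file, and plain Python assembles the prompt. The frontmatter field names (`intents`, `linked_files`) and the overall layout are assumptions for illustration, not the actual LlamaBot code.

```python
from pathlib import Path

import yaml  # pip install pyyaml


def parse_frontmatter(doc_path: Path) -> tuple[dict, str]:
    """Split a Markdown file into its YAML header and its body."""
    text = doc_path.read_text()
    # Assumes the file starts with '---'-fenced YAML frontmatter.
    _, header, body = text.split("---", 2)
    return yaml.safe_load(header), body


def build_docs_prompt(doc_path: Path, repo_root: Path) -> str:
    """Manual RAG: retrieval is just reading the files the human pointed at."""
    meta, body = parse_frontmatter(doc_path)
    intents = meta.get("intents", [])       # bullet-point list of intents
    linked = meta.get("linked_files", [])   # paths relative to the repo root
    sources = "\n\n".join(
        f"### {rel}\n{(repo_root / rel).read_text()}" for rel in linked
    )
    return (
        "Write documentation that satisfies these intents:\n"
        + "\n".join(f"- {i}" for i in intents)
        + "\n\nSource material:\n" + sources
        + "\n\nCurrent draft:\n" + body
    )
```

A documentation file would then start with a header along the lines of `intents: [...]` and `linked_files: [src/foo.py, docs/design.md]` (hypothetical paths), and the assembled prompt goes to whichever LLM you prefer.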
eric: One of the things I've been trying to build out is basically: how do I onboard onto this collection of [00:10:00] papers fast enough that my conceptual understanding is not the bottleneck to collaborating with my really smart and talented colleagues? That's when you can imagine the simplest, dumbest thing, which is: okay, let me try to upload these five papers into a vector database and just do a dumb and simple query of those papers. And you'll very quickly find out that under these circumstances, under certain queries, an LLM will end up hallucinating ideas that go across papers. That's great if you are trying to synthesize new ideas, but it's horrible if you're trying to be factual and understand the concepts to begin with, right? So that turns out to be a huge disadvantage of just dumping everything inside LanceDB and trying to do dumb and simple cosine similarity queries.

I should also preface this: there is the classic query that [00:11:00] any person who is too busy to think will ask, and that query is, "What's this all about?" And if you think really hard about how naive RAG works, "what's this all about?" has zero relevance to any particular textual chunk of scientific papers. You're just going to get crap retrieved information, and then that's just not going to work. So that's when I started exploring a few other slightly more advanced ideas, like GraphRAG, originally by Microsoft. Someone came up with nano-graphrag, making it even simpler, and I'm making my own poor man's graph RAG to try to understand how I can create knowledge graphs automatically from a small collection of papers in a way that's economical and not expensive; there are some engineering goals I have in there too; and then use that knowledge graph to enhance this query of "what's this all about?" If I am not in a state to think deeply, I'm asking: what's this pile of papers all about? My hypothesis at this point, [00:12:00] which is why this is all a bit more researchy, is that having an auto-constructed knowledge graph, and having an LLM bot that takes my query and uses the knowledge graph to enhance the query a little bit more, will help me with the retrieval piece, which will then help downstream with answering accurately across these papers. So that was a very long-winded way of saying: I'm trying to better answer the question, "what's this pile of papers all about?", and there's fancier RAG that I hypothesize should be usable here.

hugo: Super cool. And do you have any blog posts on this, in particular with respect to graph RAG?

eric: The graph RAG one is coming. I do want to give myself a little bit of space to figure out, one, the architecture. I have an architecture diagram sketched out somewhere, but I want to figure out [00:13:00] whether that architecture diagram explains how graph RAG works accurately enough or not. And I do want to try to successfully run the kinds of queries I'm after across a collection of 25 papers on something I'm completely not trained in: analytical chemistry, right? The 25 most seminal papers that I need to know, according to my analytical chemistry PhD colleagues, and try to understand that. So I want to give myself a little bit of space and time to figure that out.

hugo: Super cool, man.
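Here is a minimal sketch of the "poor man's graph RAG" idea Eric describes: extract entity-relation triples from each chunk with an LLM, accumulate them into a graph, and use the most connected entities to expand a vague query like "what's this all about?" into concrete sub-queries. The prompts, the JSON shape, and the `call_llm` placeholder are illustrative assumptions, not Microsoft's GraphRAG or nano-graphrag.

```python
import json

import networkx as nx  # pip install networkx


def call_llm(prompt: str) -> str:
    """Placeholder for your LLM client of choice."""
    raise NotImplementedError


def extract_triples(chunk: str) -> list[tuple[str, str, str]]:
    """Ask the LLM for (subject, relation, object) triples as JSON."""
    prompt = (
        "Extract factual (subject, relation, object) triples from the text below. "
        "Respond with a JSON list of 3-element lists.\n\n" + chunk
    )
    return [tuple(t) for t in json.loads(call_llm(prompt))]


def build_graph(chunks: list[str]) -> nx.Graph:
    """Accumulate triples from every chunk into one knowledge graph."""
    graph = nx.Graph()
    for chunk in chunks:
        for subj, rel, obj in extract_triples(chunk):
            graph.add_edge(subj, obj, relation=rel)
    return graph


def enhance_query(query: str, graph: nx.Graph, k: int = 10) -> list[str]:
    """Turn a vague query into specific sub-queries about central entities."""
    central = sorted(graph.degree, key=lambda pair: pair[1], reverse=True)[:k]
    entities = ", ".join(name for name, _ in central)
    prompt = (
        f"The user asked: {query!r}\n"
        f"The document collection is mainly about: {entities}\n"
        "Rewrite the question as three to five specific retrieval queries, one per line."
    )
    return [line for line in call_llm(prompt).splitlines() if line.strip()]
```

The enhanced sub-queries, rather than the original "what's this all about?", would then be what gets embedded and matched against the chunk store.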
hugo: I did want to say: everyone, do connect with Eric on LinkedIn and follow him on the platform formerly known as Twitter as well; it's Twitter in my heart, always. I love that; that's just the rebrand that didn't work. The small wins in the modern world. Now, you've already partly answered what unexpected challenges arose, 'cause you can read blog posts about how to build RAG and information retrieval systems and that type of stuff, but [00:14:00] in practice, to your point, super sharp people will write, "Hey, what's this all about?" When designing a system you'd rarely think of that, but when you see it the first time you're like, oh, of course people ask that. I just wonder: what are some other unexpected challenges or gotchas that could help people who are building these types of systems at the moment?

eric: What are the unexpected gotchas or challenges one might encounter? Let me see. I've seen this with my colleagues: they get very quickly enamored with a particular technology stack, and I just encourage us to resist the urge to go, "this has to be OpenSearch or AWS OpenSearch," or "this has to be Chroma DB," right? The tech stack landscape is evolving fast enough that you don't want to be bought into any one of them. I know this because I build my own LLM package called LlamaBot, and initially I was just trying to make it easier to interact with LLMs in the Python world. I initially thought LangChain was a little bit complicated, so let me just try to write an abstraction on top of LangChain. [00:15:00] That was okay for a little while, and then soon enough I was experiencing growing pains trying to do other things with LangChain. That taught me one lesson: not to rely solely on, or be too heavily bought into, one particular technology stack. So that's the first one.

The second one that I try to adhere to is not to build for scale. There's a common lesson I've learned, which is: at first, don't build for scale; build to solve the one or two people's problems that you need to solve, and then build for scale later. It's very tempting to go, "oh yeah, my LLM application should be able to do this big, wide scope of things," and I think it shouldn't be built that way.

I'm reminded, thanks to that point, of one other one, from another blog post I wrote. I pinned down an idea that came to me while talking with colleagues [00:16:00] and reading Anthropic's blog post, which is basically: if at all you can, you want to start with a deterministic, traditional program. You start with that, and the place where LLMs come in is when you relax a particular constraint, right? Say you're doing a bill calculator and a tip calculator all in one. Your initial inputs would be integers and floats, right? That would be your initial input. But you can then say you want to relax the input constraint to: I just want to feed in natural language, "calculate the total tip for me on this bill of $2,400," rather than calling a calculate-bill function with 2,400. After that, you move on to relaxing the order of execution; that's when you get into agent planning. So when building with LLMs, I just want to see it as a relaxation of constraints.
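A minimal sketch of that "relax one constraint at a time" idea, assuming a placeholder `call_llm` helper: the most constrained version is an ordinary deterministic function taking floats, and the relaxed version keeps that same function but uses the LLM only to turn natural language into structured arguments.

```python
import json


def total_with_tip(bill: float, tip_rate: float = 0.20) -> float:
    """Most constrained form: a plain deterministic calculation."""
    return round(bill * (1 + tip_rate), 2)


def call_llm(prompt: str) -> str:
    """Placeholder for your LLM client of choice."""
    raise NotImplementedError


def total_with_tip_nl(request: str) -> float:
    """Relaxed input constraint: natural language in, same deterministic core."""
    prompt = (
        "Extract the bill amount and tip rate (a fraction, default 0.20) from this "
        'request. Respond with JSON like {"bill": 2400.0, "tip_rate": 0.20}.\n\n'
        + request
    )
    args = json.loads(call_llm(prompt))
    return total_with_tip(args["bill"], args.get("tip_rate", 0.20))


# total_with_tip(2400.0) == 2880.0
# total_with_tip_nl("Calculate the total with tip on this bill of $2,400")
```

Relaxing the order of execution, so that the model decides which of several such tools to call and in what sequence, would be the further relaxation Eric refers to as agent planning.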
eric: You start with the most [00:17:00] constrained form, right? That's usually just a deterministic Python program, or a program in whatever language, and you relax constraints. That's where LLMs can become handy.

hugo: I love it so much, and I do think it's almost heresy to say this in the space, because a big part of the vision and the dream has been: oh, we have these LLMs that can do things now, so let's just go. Actually, I published an essay yesterday called Beyond Prompt and Pray, where people are just prompting and praying for great results. The point I'm trying to make is that we've hoped we can just speak to this one giant galaxy brain and have it do stuff, but actually it makes a lot more sense to start from the other side. It's a lot more difficult, but instead of starting with this one galaxy brain and trying to put band-aids around it using LangChain, for example, we start with incredibly structured software and then start to relax constraints, as you put it. I think that's really well said.

When you mentioned "don't engineer for scale," it actually reminded me of something, which I hope you'll all humor me on for a second. Can you see my screen, Eric?

eric: Oh yeah.

hugo: Dude, this [00:18:00] is DJ Patil, who was the first Chief Data Scientist in the White House, and this is a note he wrote himself on his first day there. It's this section that you reminded me of: prototype for one, build for ten, engineer for a hundred, which I love a lot. But the other things too: dream in years, plan in months, evaluate in weeks, ship daily. Incredible. What's required to cut the timeline in half? What needs to be done to double the impact? I love this because it's so clear. To be able to take a plan like this, thoughts like this, and convey it on a single piece of paper is, I think, incredible. Clearly that's how you end up at the White House, by writing stuff like that. And I really do mean that. In fact, when I went freelance seven months ago, "dream in years, plan in months, evaluate in weeks, ship daily" was incredibly helpful for me in figuring out what on earth I was actually trying to do.

But back to RAG. I'm wondering, and we'll get to questions from everyone else in a minute, how do you think about evaluating your RAG systems? Really [00:19:00] my question is twofold: both at the macro, product level and at the micro, individual-LLM-call level, and then connecting the two.

eric: Yeah, that is a good one. When it comes to evaluations, let me use as an anchoring example this "understand 25 papers" problem that I want to solve. I need to take the time to develop the evaluation criteria on just this one set of analytical chemistry papers, while also trying to be generalizable, right? I can't overfit my evaluation criteria to just analytical chemistry papers if my dream is to parse scientific literature in general. So I need to look at, for example, the query enhancer, this little component of my system: if I'm asking "what's this pile of papers all about?", does the query [00:20:00] enhancer enhance the query in such a way that the newly generated queries are relevant, based on my knowledge of the knowledge graph that was generated from this set of 25 papers?
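As one way of making that component-level check concrete, here is a minimal sketch of an eval for the query enhancer: given the classic vague query, assert that the generated sub-queries actually mention entities present in the knowledge graph built from the papers. The `enhance_query` function and graph object are the hypothetical ones from the earlier graph RAG sketch, and the thresholds are arbitrary assumptions.

```python
def grounded_fraction(sub_queries: list[str], entities: set[str]) -> float:
    """Fraction of generated sub-queries that mention at least one known entity."""
    if not sub_queries:
        return 0.0
    hits = sum(
        any(entity.lower() in query.lower() for entity in entities)
        for query in sub_queries
    )
    return hits / len(sub_queries)


def eval_query_enhancer(enhance_query, graph) -> dict:
    """Tiny component-level eval on the 'too busy to think' query."""
    entities = set(graph.nodes)
    subs = enhance_query("What's this all about?", graph)
    score = grounded_fraction(subs, entities)
    return {
        "n_sub_queries": len(subs),
        "grounded_fraction": score,
        "passed": len(subs) >= 3 and score >= 0.8,  # arbitrary pass criteria
    }
```

This only checks that the enhancer stays grounded in the collection; whether the sub-queries match what a domain expert would have asked still needs the human ratings Eric describes next.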
eric: So I have to look at it component by component, and that effort has to be proportional to the kind of impact I'm trying to drive, right? If I want this to be accurate enough for all of my research colleagues, all 600 of them, to be able to onboard onto new knowledge, and if it will save hundreds of thousands of hours of researcher time over the course of two or three years, then I need to make sure I get it accurate. But if the goal is drafting, or just skimming the papers, I actually don't feel the need to invest too much time into [00:21:00] that exact amount of systematic evaluation; what I mean is being systematic upfront about criteria development, which is maybe not as necessary there. The amount of effort you put in needs to match the consequences you anticipate suffering in the long run.

That said, I can treat this as an exercise in practicing how to be rigorous with evaluation criteria, right? Does the query enhancer match what I would have done, for example, or does it exceed my expectations? Then I run into this measurement problem of: okay, I gave a rating; now I need to pull someone else in and ask them to give a rating as well. So I do need to really develop these criteria if I'm going to practice this and do it over and over on other important projects. I think the TL;DR is: there are things you can vibe check. The git commit message [00:22:00] writer was built vibe-checked; there were no formal evals on it, because the consequences weren't high enough. But if the consequences were important enough, I would really not just stick with vibe checking. That's, I think, the TL;DR I've got.

hugo: That's awesome.

eric: And to be fair, a lot of people use a lot of vibe checks these days, and vibe checks are how we feel out the system, right? Shreya Shankar has that paper about how our expectations drift as we evaluate more outputs from an LLM. That is the nature of vibe checking: I'm checking, I'm iterating, I'm being quick and dirty with my experimentation. You only go in and establish formal metrics once patterns are settled and you know you need, say, accuracy. There's another thing we were building at work, which is extracting biological sequence information from papers, from the literature, [00:23:00] also paper parsing, and there our evals have to be very well set, right? We have an evaluation set of papers, we tweak the prompts, and then we check for certain facts: were they extracted or not? If the facts were extracted, were they extracted accurately, were the numbers extracted accurately, et cetera? Those are the things we need to sit down and properly measure, 'cause our colleagues are going to depend on this extraction of information being accurate.

hugo: So much of this sounds, and we'll need to wrap up in a second, sadly, but so much of this sounds like what you and I have been doing for most of our careers, which is looking at data. In this case a lot of it is natural language, so I'm just wondering if you can speak explicitly to the role of looking at data and how you do it. Do you do it in spreadsheets, or do you use more advanced tools?
Are you using pivot tables to do EDA, exploratory data analysis, on your error analysis? I'm just trying to get into the mind and practice [00:24:00] of Eric Ma.

eric: Okay. Within LlamaBot, one of the things I did was log every single LLM API call inside a local SQLite database. So if you use LlamaBot, don't worry, I'm not transmitting any of your data to my servers; I have no business with that. It's all kept locally within the repo, and that message log DB file is gitignored as well. LlamaBot is what I use, 'cause I developed it, and I actually built a UI for it, built entirely with Cursor's help, 'cause I'm no UI developer. What that allows me to do is look through [00:25:00] the logs of all my LLM calls, and I've found ways to structure LlamaBot such that I can filter by certain prompts that are named by Python functions. So I can filter for, say, any LLM API call that involved the commit-message-writing prompt. And then, because there's a UI on top, I can do the most rudimentary thumbs-up, thumbs-down kind of measurement: okay, this one looks good, and I record that inside the database; then I go to the next LLM API call, which has the full context, and this one looks not good, right? One of my hopes is that over time, as I use the commit message writer and iteratively build this eval set up, I'll take the ones I've thumbed up, and instead of calling out to OpenAI's GPT-4o, I can fine-tune something like Phi-4 instead and do this completely locally: have the commit message writing done entirely locally, by automatically filtering and exporting, in fine-tuning format, the set of LLM API interactions where I said "yes, this was a good commit message" or "no, this was not a good commit message."

hugo: Amazing. And I just want to [00:26:00] reiterate a few of those points, which we've been discussing for weeks here as well. Log stuff to the simplest tool that works for you; SQLite is such a great option, or Postgres, when you're doing things like this. Then start hand-labeling stuff and make a binary decision, which may be tough sometimes: thumbs up, thumbs down. Log all of these things, then use these human-labeled data to iterate rapidly on your product, and store it all for later. And then perhaps, to Eric's point, start thinking about, and we may talk about this next week depending on what you all want to cover, choosing a small model, fine-tuning it on those interactions, and using that, perhaps even locally. I'm so excited you're doing it locally. Can I just ask, do you use Ollama locally, or what do you use?

eric: Yeah, a hundred percent. It's too good. In fact, I saw Ollama early in its early days and I was like, okay, I want LlamaBot to be able to connect to Ollama ASAP. It was one of the first local LLM packages that I wanted to support with LlamaBot. I really love it, 'cause it gives me the ability to very quickly pull in a new [00:27:00] model. I recently pulled in DeepSeek R1 and tried to use it to write evergreen notes, and found out it's not good for writing evergreen notes, all by vibe checking; but then I asked it to solve a math equation and it solved it perfectly.
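Going back to the logging-and-labeling workflow Eric described a moment earlier, a minimal sketch of the pattern might look like the following: every LLM call is appended to a local SQLite database tagged with the name of the Python function that produced the prompt, a thumbs-up or thumbs-down can later be recorded against each row, and the thumbed-up rows can be exported in a chat-style fine-tuning format. The table and column names are illustrative assumptions, not LlamaBot's actual schema.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS message_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT,
    prompt_name TEXT,   -- e.g. the Python function that produced the prompt
    prompt TEXT,
    response TEXT,
    rating INTEGER      -- NULL = unlabeled, 1 = thumbs up, 0 = thumbs down
);
"""


def log_call(db: sqlite3.Connection, prompt_name: str, prompt: str, response: str) -> int:
    """Append one LLM call to the local, gitignored log."""
    db.execute(SCHEMA)
    cur = db.execute(
        "INSERT INTO message_log (timestamp, prompt_name, prompt, response) "
        "VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), prompt_name, prompt, response),
    )
    db.commit()
    return cur.lastrowid


def rate(db: sqlite3.Connection, row_id: int, thumbs_up: bool) -> None:
    """Record the rudimentary thumbs-up / thumbs-down label."""
    db.execute("UPDATE message_log SET rating = ? WHERE id = ?", (int(thumbs_up), row_id))
    db.commit()


def export_finetune_set(db: sqlite3.Connection, prompt_name: str) -> list[dict]:
    """Thumbs-up rows for one prompt, in a chat-style fine-tuning format."""
    rows = db.execute(
        "SELECT prompt, response FROM message_log "
        "WHERE prompt_name = ? AND rating = 1",
        (prompt_name,),
    ).fetchall()
    return [
        {"messages": [{"role": "user", "content": p},
                      {"role": "assistant", "content": r}]}
        for p, r in rows
    ]


# db = sqlite3.connect("message_log.db")  # keep it local and gitignore the file
# row_id = log_call(db, "write_commit_message", prompt, response)
# rate(db, row_id, thumbs_up=True)
# examples = export_finetune_set(db, "write_commit_message")
```

The exported examples could then feed whichever local fine-tuning stack you prefer; the key idea is that the labels accumulate as a side effect of normal use.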
eric: So now I know it's very easy to quickly get a vibe check on what a given model is going to be good for and what it's not going to be good for.

hugo: Without a doubt. Super cool. So, we have some questions in the chat. Sadly we don't have time, and I take full responsibility for that. But if you do want to ask questions in Discord, please do. I don't expect Eric to answer them all, you've been so generous with your time, but if you're able to answer a question or two, please do. Otherwise, I'm gonna pretend to be you and just answer them.

eric: I've got to go deal with the kids, um, but once I'm done with the kids, I'm happy to answer a few questions in the Discord.

hugo: Appreciate it, man. So thanks once again for your time, and please do connect with Eric on LinkedIn and follow him on Twitter, because the types of blog posts we've shared, and these types of conversations, he has a lot of, and he's very generous with them. I think you also clarify things for yourself [00:28:00] by writing them up, and that's one of the reasons you share so much as well.

eric: Yes, you're writing documentation for your future self. It's very helpful.

hugo: Amazing. Thanks so much once again, Eric.

eric: Yeah, my pleasure. Thanks. Thanks for having me.

hugo: Absolutely. Thanks for listening, everyone, and thanks for sticking around to the end of the episode. If you're enjoying the show, please leave a review and a five-star rating on whatever app you're listening on. It really helps. I'd also love to hear what topics you'd like us to cover and any guests you'd like to hear from. Feel free to reach out on LinkedIn or on Twitter. The podcast handle is vanishing data, and I'm Hugo Bowne-Anderson. See you in the next episode.