The following is a rough transcript which has not been revised by High Signal or the guest. Please check with us before using any quotations from this transcript. Thank you.

===

tomasz: [00:00:00] So we've created about one and a half trillion, maybe two trillion dollars, in value in software over the last 20 years. And all of that value, or a lot of it, is predicated on workflows that were created 20 years ago, right? The way that Salesforce is designed to have an SDR hand off to an AE and then a customer success manager, and that whole workflow, we created a $250 billion market cap company based on it. And now, as a result of AI and GPT-5 and tool calling and all these kinds of things, we can reinvent those workflows in a pretty material way. Which is awesome for startups, because the legacy companies have calcified their software around those workflows and it will be extremely difficult for them to move, which means that $250 billion of market cap for Salesforce is really up for grabs. We had dinner with a bunch of executives who are running publicly traded companies, and we were asking them, what is the impact of AI? One of the things they said is, we've ripped out all of our marketing automation solutions, and the next thing to go is the CRM.

hugo: That was Tomasz Tunguz, [00:01:00] legendary SaaS investor and founder of Theory Ventures. Previously Tomasz was at Redpoint Ventures, where he invested in Looker, Expensify, Kustomer, and Monte Carlo. In this conversation, we explore why generative AI could put a trillion dollars of market cap up for grabs, how liquid software will be built in minutes, and what it means when a single person can manage hundreds of AI agents at once. We dig into the skills founders will need in this new era, the hidden technical debt slowing AI adoption, and the opportunities no one's talking about yet. This one's packed with insights you can use today. This is the High Signal podcast, brought to you by Delphina, the AI agent for data science, produced by Jeremy Hermann and Duncan Gilchrist. I'm your host, Hugo Bowne-Anderson. Let's jump in. Hey there, Tomasz, and welcome to the show.

tomasz: Thrilled to be here, Hugo. Thanks for having me on.

hugo: Such a pleasure to have you here, especially at such a [00:02:00] pivotal and exciting time for all the things we're working on. And last time we spoke, you said something that has given me a wonderful set of sleepless nights. You said all the market cap since 1999 is now up for grabs, and I'm interested in what led you to that view and what the implications are for founders, builders, investors, and so on.

tomasz: Yeah, so we've created about one and a half trillion, maybe two trillion dollars, in value in software over the last 20 years. And all of that value, or a lot of it, is predicated on workflows that were created 20 years ago, right? The way that Salesforce is designed to have an SDR hand off to an AE and then a customer success manager, and that whole workflow. They created a $250 billion market cap company based on it. And now, as a result of AI and GPT-5 and tool calling and all these kinds of things, we can reinvent those workflows in a pretty material way, which is awesome for startups, because the legacy companies have calcified their software around those workflows and it will be extremely difficult for them to move, [00:03:00] which means that $250 billion of market cap for Salesforce is really up for grabs. We had dinner with a bunch of executives who are running publicly traded companies, and we were asking them, what is the impact of AI?
And one of the things they said is, we've ripped out all of our marketing automation solutions, and the next thing to go is the CRM.

hugo: I'm particularly interested in your perspective as someone who's both investing in and experimenting with these tools. So I'm just wondering, particularly with respect to everything you just said, how do you think about balancing excitement over shiny new capabilities with what actually works in production?

tomasz: Yeah, so I think the MO has changed, or it's changed because these workflows are changing really quickly. It's really hard to buy a piece of software now and plan on using it for a year or two, because if the underlying workflow is changing, I don't know what the workflow will be in six months. Right? GPT-5 launches more sophisticated tool calling. There's the open source GPT-OSS that just came out. You can use smaller "large action [00:04:00] models," as they're called. And the fastest moving teams and the most innovative teams are the ones that, every day, are iterating on the workflow. Maybe they're using a vibe coder to build a new, I dunno, triage tool or a new way of prioritizing leads. And so this makes it really difficult for executives to buy new workflow software. There are certain categories where this is not the case, but for a lot of categories and for many leaders that we've spoken to, they're trying to build, and they know that the building is an interim step, and at some point someone will discover: this is the new way that we sell and this is the new way that we market. And once that new local maximum is achieved, then there'll be the next HubSpot and the next Salesforce that are built. But in the meantime, nobody knows. We're all figuring it out together. And so many of these executives are actually hiring generalists rather than specialists. So there's no SEM specialist, there's no SEO specialist. There's a smart person who is willing to reinvent the workflow, and then maybe one or two technical people to help them. [00:05:00] And we're doing this candidly inside of Theory: we have a head of AI and we have four technical people on staff, and every week we are going through, like, okay, what's working? What's new? What new capabilities exist? What can we do that we couldn't? The hardest part about all this is re-imagining the way that we used to work. And if you've been trained, like I have, to work with email for the last 25 years in a particular way, all of a sudden it's, okay, I need to step back and say, knowing everything that I do now, how would I process email, or how would we generate memos, or how would we generate marketing copy? I think that's the hardest thing. It's not even the knowledge of the technology, it's the discipline to go and take a step back and try to reimagine.

duncan: That's such a fascinating and exciting observation, Tomasz. I think something really unique about you as an investor is how much you're actively building at Theory and actually trying out the tools, and writing a lot about it too, publicly. We see a lot in many of your blog posts. I'm curious if you could take a step [00:06:00] deeper there and talk more about what is really guiding your work there and what's your North Star?

tomasz: Yeah, so on an individual level, I love to learn, and I think as an investor, touching the technology gives you a sense of what it can and can't do at a deeper level. So I'll give you an example. So over July 4th weekend, I wanted to
automate all of my email processing, right? I've had 20 years of answering emails. I think that's enough. I started in customer support at Google and went through the old training about voice, and there was a wonderful woman named Palo who trained us. And anyway, I've had enough of answering each individual email. And the observation is, if I'm now summarizing large tracts of the internet using deep research, and it's pretty good, right? If I ask it, hey, go and compare these two kinds of technologies, when it comes back it's, whatever, 80, 90% accurate. Is that good enough for my email? Do I really need to read every email anymore? And I think the answer for 90% of emails is no. [00:07:00] There might be some quip that somebody sends, some casual thing, hey, how's the family? And then there's an ask or two or a next step. And if you ask your chatbot, like I did this morning: cluster my emails, for the newsletters figure out all the startups that we should be tracking that are not in the CRM, and for all of the emails from the internal team, what's the ask? Then you can start to manage the inbox at a higher level of abstraction. And there are tools that you can build: go and look up this URL to see whether it's in the CRM; if it's not, create a research report on it, then go and create an Asana task due tomorrow so that I can find it and I can remember. So if you identify these individual workflows first and say, how can I operate at a higher level of abstraction? That's where we're going to get the leverage. And the models themselves are becoming much better at this. Over July 4th, I spent about a thousand dollars in three days on Claude Code because I was using the wrong subscription. I was hitting the API instead of using the Claude Code $200-a-month [00:08:00] subscription, and I didn't realize that for a while. So I started using open source models and small models. And so we're testing all of these to figure out how it works, and we're doing it at small scale. But if we do it at small scale, then we can project, for a large company of a hundred thousand people or 250,000 white collar workers, what is the impact? How would these workflows change? And that should inform some investment theses.

duncan: That's incredible. And to what extent have you been able to actually implement these workflows at Theory, inside of your investment operation? Have they actually meaningfully changed how you work today?

tomasz: I'm just starting on my computer, whatever, my workspace for now. But yeah, I'll give you another example. We need to listen to a lot of podcasts. There's a lot of alpha in podcasts when people speak, but if we wanted to listen to all the podcasts that are valuable, there would be no time to do any work, right? So we have a pipeline that takes 40 podcasts, transcribes them, and then extracts a whole bunch of information out of them every day. And so there you have time compression, right? This is what we're really after when we mean [00:09:00] productivity: how do I accomplish the same thing or more in the same unit of time? And across those 40 podcasts, assume each one produces an hour of content. If I can read that content in, say, 40 minutes, I go from 40 hours to 40 minutes in a week, with a little bit of loss of fidelity, but maybe not too much. And the better I can prompt, the better I can ask the machine learning model, the AI model, what I want, the less lossy the compression.
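To make that pipeline concrete, here is a minimal sketch of a transcribe-then-extract loop, assuming the OpenAI Python SDK; the folder layout, model names, and extraction prompt are placeholders rather than the actual Theory Ventures pipeline.

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EXTRACTION_PROMPT = (
    "Summarize this podcast transcript in a few hundred words, then list any "
    "startups, funding rounds, and notable claims mentioned."
)

def transcribe(audio_path: Path) -> str:
    # Speech-to-text for a single episode.
    with audio_path.open("rb") as f:
        result = client.audio.transcriptions.create(model="whisper-1", file=f)
    return result.text

def extract(transcript: str) -> str:
    # Compress an hour of listening into a few minutes of reading.
    # Very long transcripts may need chunking before this call.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable model works
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    out_dir = Path("digests")
    out_dir.mkdir(exist_ok=True)
    for episode in sorted(Path("episodes").glob("*.mp3")):
        digest = extract(transcribe(episode))
        (out_dir / f"{episode.stem}.md").write_text(digest)
```

The better the extraction prompt, the less lossy the compression, which is exactly the trade-off described above.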
duncan: That's an extraordinary example. I think we all face the problem of too many podcasts and how to actually glean useful insights. Yeah, there's one I listen to, but the other ones I will compress. Amazing.

hugo: So you spoke to this briefly with respect to the number of marketing leaders, among other types of executives, you've been speaking to recently. I'm wondering if we could go a bit deeper into what you are hearing from them about how AI is changing their workflows and what they're buying and what they're not buying now.

tomasz: The leaders that we've spoken to, they're trying to reinvent the workflows. I guess I would categorize them as two [00:10:00] different kinds of marketers. There's the extremely senior, extremely accomplished marketers who are running very large companies, and they have an excitement that the discipline of marketing is moving more in the direction of the humanities. If it's that much easier to produce a piece of content, or a hundred, then the skill is no longer operating the machinery. It is establishing the human connection. It's the Scottie Scheffler Nike ad that went out, right? No amount of AI will make that ad better. It's the human connection: how do I establish the human connection for Nike Golf? And then there's another class of marketer who think of themselves not really as marketers. They think about themselves as engineers of the marketing function. And that's a very different mentality. They are typically at early stage companies, and they're in the process of building the marketing machine to consistently produce top of funnel. There, it's rapid tool iteration: use a tool for three months, figure [00:11:00] out whether or not it works. It's a little bit like, I think about marketing as the hedge fund discipline within startups, where you have a portfolio of different customer acquisition techniques. One month, one stock does well, another one does not. But you need a broad enough portfolio that overall the performance of the marketing system succeeds. And so there, the universe of different tools and the novelty of different tools, like AI SEO, or what some people are calling GEO, generative engine optimization, is a really big one. The machine generation of tens of thousands of pages is another one. Somebody was telling me there's an agency, I think it's called 64 Stories, and the premise is, if you produce one piece of content like this podcast, you should be able to produce 64 marketing assets from it. That number is overly precise, but the concept is there's a tremendous amount of insight that you can extract and repurpose across different channels, TikToks, X, and so on. And so I think there's a tremendous amount of reinvention around what the machine looks like, in contrast to the [00:12:00] senior level marketer who is, I think, really excited, and for good reason, about eliminating a lot of the rote work that dominated the last 15 years of B2B software.

hugo: Absolutely. And I am very interested, and we've talked around this, in the fact that a lot of people, yourself included, ourselves included, have started to hand roll a lot of tools ourselves, right, that suit our purposes and our teams, but they wouldn't necessarily service or support an entire market. So we've gone, or we're seeing, a movement from software such as Salesforce, which for very good reasons needs to satisfy the needs and demands of a huge market, to the ability to start hand rolling things. And I just wonder what you see as
And I just wonder what you see. The end game or even middle game in ending up with respect to will we all be building ephemeral, quote unquote ephemeral software in, in, in perpetuity, or will there be some form of consolidation? tomasz: Okay, [00:13:00] so we hired this amazing intern named Path and we took him out to lunch for his first Monday and he's, he has a colleague, also Cole, great intern. But this story is about Pav. So Pav, we asked him, Hey, have you, how many AI apps have you built? And he said, 500 to a thousand. I fell outta my chair. I was like, what? What in the world? How did you build? Because these products are only six months old. He said, every time we need to learn new concept in school, because he's at university, they build a an ephemeral app. And so he told us a story of how they were living the physics of acoustics, and he and his classmates built an app to learn the physics of acoustics. And then that was it. Single purpose used. So I think that's, and then you, in the GPT five announcement yesterday, they were talking about this, right? How there is now a mode for education where it will teach you, maybe this is two days ago, it'll teach you how to learn in a progressive way. So I think just what happens in academia will happen in business. There's no reason why. If there's, [00:14:00] if you're like the head of data quality, a major bank, and you want to train your employees on how to use ai, why wouldn't you build a little app? If it really is a one shot, you just put in the prompt and then iterate two or three times and you, you throw it away. So I think there's the learning component and then there's the ongoing workflow. I think there's this moniker, which is software is becoming liquid, which means I have this little gap, right? I have this little gap in my workflow where this thing is driving me. Great. I have to take this really long list and reformat it and then put it into this thing. And I know that software can do it. And historically, even if I was a software engineer. I might know some arcane language like a or Pearl or Lis that is super powerful and was designed to do this stuff, but somebody helped me because I don't know what the flags look like or I don't know how to put together this super fancy Excel macro and debug it because the tools just aren't there. This PB script or app script or whatever it is, and now you know what? Like I'll give it a shot five, 10 minutes, maybe I'll get somewhere and then it actually saves me two or three hours. And the first couple of times, [00:15:00] well, I'll probably get it wrong, but the third time or the fourth time, I'll know what to do. I'll know how to tell the model, test and make sure it works before you tell me that you're finished. And okay, I think you're right, Hugo. You'll have this explosion of all these little pieces of software all over the place. And I noticed it on my desktop because clock code creates all these little files in my home folder. Now, my home folder's thousands and thousands of files, and I will forget about them. They disappear In a large company, this represents a. Compliance problem and a control problem. And so at some point there'll be this rationalization like IBM, I think it was IBM or some major Fortune 500 company went from 5,000 internal tools to 500 under the regime of a new CTO. 
And so you go through these waves of explosion and contraction. We're definitely in the explosion, and the order of magnitude might be two or three orders of magnitude larger than the previous wave.

hugo: I love it. And with respect to that, I think there is an interesting take that small teams, and this is something I've seen a lot, and I think we've all seen, small teams are very well poised currently to really accelerate what they can do with [00:16:00] AI-powered systems and technologies, whereas large teams and larger organizations haven't necessarily been able to adopt them and perhaps have been slowed down by them as well. So I'm wondering your thoughts on whether, in future, the future will be small networked teams or still our 20th century large scale hierarchical organizations.

tomasz: Yeah, I think we're moving to smaller. Look at the architecture of Rippling, right? Or what Parker Conrad calls compound startups, where you have GMs. So let me take a step back. If we really do believe that AI is making us much more productive, then we need smaller teams to accomplish the same goals, right? Another core precept is that the time to produce complex software is significantly less. And okay, if you want to build a big company within the world of software, you can't replicate the playbook of the last 20 years, which was to take a feature out of Siebel and start Salesforce. You really have to build a big, broad compound startup, and that's actually to your competitive advantage, because if you [00:17:00] can do that better than any other startup, the breadth of software that you'll have will differentiate you. You'll present an enterprise image much more quickly. The organizational design for a strategy like that looks like a CEO who's a capital allocator. It's effectively a holding company, the way that Alphabet is a holding company with Google and YouTube and Android and so on, but it starts as a startup. And then you have individual CEOs of different business units who are building these products on top of a substrate, and at the bottom of the substrate there's at least a common data layer. And all these different products either thrive or fail as a result of the efforts of the individual teams. AWS is famously architected this way. Those organizations tend to be much more nimble and build far more product quite quickly. The challenge, the trade-off, in any organization is the cohesiveness of all of those tools working together, and so that's the tension in the organization. But there's no [00:18:00] reason to believe that you couldn't build a modern day CRM in a fraction of the time, and we can debate the fraction, but a very small fraction of the time.

duncan: Can we double click on the implications for individuals in that? As you think about career advice, what you would tell, like, the 25-year-old engineer in technology, what kinds of skills and what kinds of individuals are really best positioned to thrive in those environments?

tomasz: Yeah, I think there's probably, boy, I think there's probably two archetypes. One is an agent manager. I think initially this will be a vanity metric, but ultimately this will be a core productivity metric, and a metric that's tied to your bonus, which is: how many simultaneous agents can I manage? And so today, like, okay, if I really push, maybe I could manage four or five, maybe.
I was talking to a very sophisticated software engineer and I asked him, how many can you manage? And he said, 15. Okay, 15. So that's [00:19:00] my new target, right? Okay, internal competitive instinct, let's go 15. If I'm a young software engineer in a university, I think the question is, how do I manage a hundred? If I really like building and writing software and I wanna be able to do it quite quickly, then I need to be able to manage a hundred agents. And if I can do that, I am eminently employable anywhere for a long period of time, because I will be at least a hundred x engineer, at least.

duncan: There's this amazing trend of, kind of, all of a sudden every IC is a manager. Mm-hmm. And yet simultaneously, every manager is also an IC again. And so there's this kind of amazing confluence of roles. And so to your point, we just talked about the IC being a manager.

tomasz: So let's talk about the manager being an IC. The other archetype are architects. Now all of a sudden you have these people who are producing these machines incredibly quickly, right? You have software engineers managing 15 agents or 10 agents, and the lines of code or the number of PRDs, product requirements documents, or engineering requirements [00:20:00] documents that they can rip through in a day will be much faster. And as the architect of the system, your operating cadence has to be so much faster. Like, one of the things that I really struggle with in AI is delegation. I'm really surprised this is not taught in schools. Because what I'm doing is, I have an image of what I want in my mind for the AI to do, and I have to communicate it in a way that the agent will achieve that goal. And to manage four agents at once simultaneously, I need a pretty big backlog, and what I'm finding is I'm slow. I'm really slow in being able to generate a backlog that is big enough to keep the AI busy for long periods of the day.

hugo: I totally agree, and actually there's a tweet from Chip Huyen that comes to mind, who's the author of AI Engineering, among other things. She wrote, I'm slowly beginning to accept that my productivity when working with AI coding agents is limited by my human brain.[00:21:00] She goes on to say, AI can do many tasks in parallel, but I can only track the context of a few, so I only run a few tasks at a time. And she concludes, I am the bottleneck. She places it so wonderfully with respect to your point of how many agents can I manage at once as well.

tomasz: That's right. Okay. And so then the question is, that means, okay, if you think about management theory, I am micromanaging. I am limiting the productivity of my team by paying attention to details that no longer matter. And okay, the first manifestation of this is, I publish a blog post and typically I produce charts, and I produce charts in the language called R, which is a statistical language. It's relegated to the dark rooms of academia, and I spent a long time learning this language, took me like three or four years. I still produce the charts, but honestly, I've forgotten most of the syntax, because now the computer, the AI, does it. So, okay, now I'm operating at a higher level of abstraction. Instead of operating at the individual line of code [00:22:00] level, I'm operating at the chart output level. But I'm still limited, because it's not like the AI is producing charts all the time that I'm reviewing and saying, oh, that's a good
blog post, that's a good blog post. I need to operate at an even higher level. And so this is the reinvention. This is what I think is really hard. And there's this tension, because in order to operate at a high level of abstraction, in order to manage even a human, a person, at a high level of abstraction, you need a lot of trust. There's a lot of context that needs to be transferred. The skills of the human, of the person that I'm managing, change with time, right? It's just like you're training somebody out of school, they have a learning slope. So does this AI: it's GPT-4, then GPT-5. And so there's this tension where it's, oh, I'm gonna test it. Does it work? I'm gonna test it. Does it work? And so the faster that we can have that trust building, the more that we can reinvent the way that we work at a higher level of abstraction. But that takes time.

hugo: There's a term that you've used, that a lot of people have used, called background agents, and because it's so relevant, I'm wondering if you could just explain what that term means and why it's so relevant to this conversation.

tomasz: Okay. So let's walk [00:23:00] through a workflow to ground this in something real. So let's say I have an email, and it's an email from a startup incubator with a list of 50 different startups that are raising. We focus on AI startups, and so there might be healthcare startups, there might be space startups, there might be a whole bunch of other startups. I want it to go through and first find all the companies that might be relevant, and then for each one, kick off an API call to the CRM and add it to the CRM. I don't wanna wait and watch it go through all this work. I trust it, right? I trust the AI to go through and figure out whether or not it exists in the CRM, and add it, and find the right URL, and all that kind of stuff. And so I want it to disappear into the background. Spin up, suppose there are 25 startups, 25 background processes that all come back and only tell me: hey, for these two we couldn't find the URLs, can you help me out? I don't wanna sit there and watch. And so this is what I've been wrestling with: okay, if we do want this parallelization, and that's how I'll manage a hundred agents, candidly, if I can kick off [00:24:00] 25 from a single email and then have it come back and report, great, now I'm managing 26: the email processing agent and then the 25 that are doing the CRM work. But it's not obvious today, because where do those processes go, and how do I manage them when they come back? There's no Asana task list for this. Harrison at LangChain talks about the idea of an agent inbox, and this does exist within the world of coding agents that do background work and come back after 15 minutes, or two and a half hours in the case of GPT-5, and say, I worked on this problem, now I'm done, can you check my work? But it doesn't exist within the world of non-technical white collar work, and I think it will. And so this is, I think, a new inbox that needs to exist and needs to be reinvented. And then once that UX is clear and simple, then it'll be very straightforward, I think, for tens of millions of people to be managing a hundred agents at a time.
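To ground that fan-out pattern in something runnable, here is a minimal sketch in Python with asyncio: one email spawns many background tasks, and only the ones that get stuck report back. The relevance check and CRM helpers are hypothetical stand-ins, not a real CRM or LangChain API.

```python
import asyncio

async def is_relevant(startup: dict) -> bool:
    # Hypothetical relevance check; in practice this could be a small LLM call
    # asking "is this an AI startup we should track?"
    return "ai" in startup["description"].lower()

async def ensure_in_crm(startup: dict) -> str | None:
    # Hypothetical background task: look the company up in the CRM, add it if
    # missing, attach a research note. Return None on success, or a question
    # for the human when the agent gets stuck.
    if not startup.get("url"):
        return f"Couldn't find a URL for {startup['name']} - can you help?"
    # ... CRM lookup, insert, and research-report generation would go here ...
    return None

async def process_email(startups: list[dict]) -> list[str]:
    relevant = [s for s in startups if await is_relevant(s)]
    # Fan out: one background task per relevant startup, all running concurrently.
    results = await asyncio.gather(*(ensure_in_crm(s) for s in relevant))
    # Only the items that need a human come back.
    return [msg for msg in results if msg is not None]

if __name__ == "__main__":
    batch = [
        {"name": "Acme Agents", "description": "AI agents for ops", "url": "https://acme.example"},
        {"name": "Vector Labs", "description": "AI video analysis", "url": ""},
        {"name": "OrbitalCo", "description": "space logistics", "url": "https://orbital.example"},
    ]
    for question in asyncio.run(process_email(batch)):
        print(question)  # only the stuck task surfaces: "Couldn't find a URL for Vector Labs..."
```

The list returned by process_email is, in effect, the "agent inbox" being described: the human only sees the handful of tasks that need help, not the 25 that succeeded.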
[00:25:00] hugo: Amazing. And I am interested, I mean, we have a lot of agents that do wonderful things, but they don't do them all the time. They still hallucinate, they do the wrong tool use whatever proportion of the time, they'll dead-loop. So I'm just wondering, what are some of the hardest challenges you've faced and what are the failure modes you've seen in just trying to get agent systems to do the job?

tomasz: Yeah. The first is tool selection, right? So I have built, I think it's 57 or 58, tools in the last two weeks for doing different kinds of things. Let's say I want to change the music on Spotify, that's a tool, right? Or send an email to somebody, or reply to an email, that's a separate tool. And it's a different kind of programming. Why is that? We had classic programming: you design every step, do this, then do this, and every time you run the program, it works the same way. And then we introduced LLMs, where you can ask it exactly the same question and it doesn't produce the same output each time. So deterministic versus [00:26:00] non-deterministic. And when the GPT-4 moment came around, we said, okay, this thing, GPT-4 or Claude, the version of Claude or the Gemini version at the time, will replace all of coding because it's so amazing. And it is amazing, and it's strong in certain areas and weak in other areas. And so then we decided, okay, let's only have it do these things. We only want it to do things like: right now I want to speak, and I want it to understand my English and translate that into a tool call. So when I say, reply to the email from Hugo and Duncan, I want it to go and look up the email with search, find the email, and then draft a reply. But if I ask the LLM to do that, it won't do it the right way every time. So instead, I need to build a tool. And I can build the tool in the Unix philosophy, which is the smallest possible part: just look up an email, and then just draft the [00:27:00] email, and then just send the email, and then I can have the LLM stitch those three pieces together. That doesn't work, because you have a little bit of error between each step. So if you have a 15% error between each step and they compound, you have something like a 25% error rate; it fails one in four times. So then what you have to do is say, okay, I need to build a tool that combines these three steps. And so this is what we're wrestling with: now we're changing the way that we code to meet the needs of the LLM, to meet it where it's strong. And the hard part, at least now, is that there are no design patterns. I remember when I graduated from grad school and I got my first job as a Java engineer, there was a 300 page book that a senior software engineer on staff gave me and said, these are design patterns. These are the ways that software engineers have figured out is the best way to design Java software. There are no design patterns for AI today. And so that's what we're in the process of figuring out.
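A quick back-of-the-envelope sketch of the compounding being described here, using the rough 15% per-step error figure from the conversation (the numbers are his estimate, not a measurement):

```python
PER_STEP_SUCCESS = 0.85  # ~15% chance the model fumbles each hand-off

for steps in (1, 2, 3):
    success = PER_STEP_SUCCESS ** steps
    print(f"{steps} step(s): {success:.0%} end-to-end success, {1 - success:.0%} failure")

# 1 step(s): 85% end-to-end success, 15% failure
# 2 step(s): 72% end-to-end success, 28% failure   <- roughly "one in four"
# 3 step(s): 61% end-to-end success, 39% failure
```

Two hand-offs already fail roughly one time in four, which is the argument for collapsing "look up, draft, send" into a single combined tool that the model only has to invoke once.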
hugo: I love a lot of the blog posts and [00:28:00] essays Anthropic writes, but there was one late last year on building effective AI agents where they spelled out augmented LLMs and LLM workflows, which was really interesting in terms of showing us some patterns that they've seen emerge through the use of Claude. And then there's their more recent one this summer on building their research agent, essentially, which spawns subagents, and it's almost the opposite end of the spectrum. But with respect to the first blog post I mentioned, something they really tell you about is: hey, before agents, let's think about tool use, let's think about retrieval, let's think about memory. And memory is something that I think is still dealt with horribly in a lot of ways. So I'm just wondering, on the emerging stack of retrieval, tool use, and memory, do you think we're getting close to something reliable, or are we still missing key pieces?

tomasz: I think there's two parts to memory. There's local memory, so there's memory for a conversation that you're having with a large language model, and then there's institutional memory, [00:29:00] which is: these are all the ways that Theory knows how to do things, and some of this we'll expose to AI and some of it we won't. So I think there's two categories of software. There's another dynamic within the first, which is, I think the margins of AI businesses are lower than software companies', and that's because the compute costs are pretty high. I think we'll start to see hybrid architectures where, both on your mobile phone and your computer, coding agents and others will start to run GPU workloads on your machine. Mm-hmm. And they'll use small models on the machine, because, like, a tool model, whatever. Salesforce has an xLAM model that's 8 billion parameters. It's sublime at tool calling, better than some of the very largest foundation model companies. There's no reason I can't run that on my machine, and I don't need a behemoth of a trillion-parameter model running basic tool selection across a hundred tools. I can use an 8 billion parameter model locally, and that'll materially improve the margins. It'll also materially improve my experience, because the latency is a tenth. Okay, so if that's the case, [00:30:00] now I have shared memory between the local device state and the server state. So that needs to be solved as well. So I think you're right. I think we're in extremely early days for memory. There's also, if you use Claude Code for a long time, you have to do manual or automated compactions, and you have lossy compression of the previous conversation. So I agree with you, Hugo. There's a ton of work to be done in memory, and the better that we can manage the memory, ultimately, the better the long-term performance. In fact, one of the things that I really crave is, I never wanna shut Claude Code down. The more it remembers, the more useful it is to me. Do you remember that email that I sent yesterday? I don't think it went through, or maybe it went to the wrong person. Or there was a PDF that I kind of remember from last week that Duncan sent me, and there was a quote there that I need. And so that's really what I want. I'm using AWS analogies here, but I want that Glacier memory, and then I want S3, and then I want Redis. So cold, medium, and very hot, [00:31:00] and we're really early there. And then there's probably a hybrid and a cloud component to it.

hugo: I love it. And there was something you stated explicitly, but it was almost embedded in that wonderful narrative, which is modular LLMs working together, and perhaps things being handed off from big models to smaller models. And the reason I wanna double click on this is, I think it's easy in this space to believe there'll be one giant model that will do absolutely everything. So I just wanna tease apart your thoughts on, almost, a Unix philosophy of models piped together, as opposed to the Apple philosophy of one product to rule them all.

tomasz: Yeah. There are four different model servers running on my machine right now. I run a Gemma 1 billion parameter model that's always running, which is for very simple tasks.
There's a 12 billion parameter instruct model that I run primarily for summarization of emails I don't [00:32:00] want to send out. So when I summarize my emails, I don't wanna send that to a server; I want to control it. There is a 27 billion parameter model that is optimized for speed of summarization of podcasts, which I run using an MLX server. And then I run an 8 billion parameter Salesforce xLAM model for tool selection. And so I think one of the debates that large companies will have is: is it better for me to spend a lot of money on the cloud, or should I buy my employees pretty beefy GPUs and have endpoints for sensitive models or tool calling, or places where I can run local models pretty consistently and aggressively? Can I actually drive some savings there? I don't know the answer. There's obviously additional operational complexity with managing GPUs and hardware and on-prem data centers if people decide to go that route. But I do think this idea of a monolithic AI app is suboptimal, right? So maybe I'll put it a different way. [00:33:00] When new categories are created, a buyer typically wants to buy the entire cake. They wanna buy an end-to-end solution that solves their problem, and with time, they say, you know what? I want that second layer to be chocolate, and I want the third layer to have sprinkles, and I want the fourth layer to be vanilla. And so they'll swap out layers of the cake for better and better components. Up to a year ago, we all just bought the cake, and now we're at a place where we're swapping out different models that are built for purpose.

hugo: For us technical people, I totally agree. But thinking about a global community of not necessarily super technical people, do you think they'll be able to have modular models doing different things for them, or will they all be embedded in different products, for example?

tomasz: Yeah, I wonder if there'll be like a router. I think there might be a classifier, and I don't know if it'll be a classic classifier, like an SVM, a support vector machine, or if it'll be an LLM-based classifier, where you type in a query and then it figures out, this is the model. This is one of the big innovations [00:34:00] within GPT-5: you shoot a query to it and it decides, is this a thinking query or is it not? Does it go to a nano model? Does it go to a regular sized model? Does it go to a big model? How long am I spending on this? How many thinking tokens am I applying to it? And I don't know how confident I am in this, but I can imagine a world where my MacBook M4 is running a local router and then ultimately deciding, where am I passing this query to? Just the way that OpenRouter does today for businesses, right? And so I think that's a real future, particularly in a B2B context where cost optimization will really matter. Right? One of the questions we were debating, I think two weeks ago, is: if you're a startup, what fraction of your R&D budget should be spent on AI? Is it 25%? Is it 50%? It's not a hundred percent, because then you have no people, so there is some upper bound, but it should be pretty significant. And so if that's really the case, then if you can save 10%, it's a lot of money.
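As a rough illustration of that setup, here is a minimal sketch of a local router that sends each task to one of several locally served models. The model names echo the ones mentioned above, but the ports, task labels, and the assumption that each model sits behind an OpenAI-compatible local endpoint are all illustrative.

```python
import requests  # assumes each model is served behind an OpenAI-compatible local endpoint

LOCAL_MODELS = {
    "simple":        {"model": "gemma-1b",           "port": 8001},  # always-on, trivial tasks
    "email_summary": {"model": "gemma-12b-instruct", "port": 8002},  # private data stays on the laptop
    "podcast":       {"model": "gemma-27b",          "port": 8003},  # long-form summarization
    "tool_call":     {"model": "xlam-8b",            "port": 8004},  # tool selection only
}

def route(task: str, prompt: str) -> str:
    # The "router": here a plain lookup, but the task label itself could come
    # from a tiny classifier model, as discussed in the conversation.
    cfg = LOCAL_MODELS.get(task, LOCAL_MODELS["simple"])
    resp = requests.post(
        f"http://localhost:{cfg['port']}/v1/chat/completions",
        json={
            "model": cfg["model"],
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=120,
    )
    return resp.json()["choices"][0]["message"]["content"]

# Example: route("email_summary", "Summarize today's inbox and list the asks.")
```

Swapping the lookup for a small classifier, or falling back to a cloud endpoint for hard queries, gives the hybrid local-plus-cloud architecture described above.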
duncan: I'd [00:35:00] love to double click on the infrastructure question. In particular, we talked earlier about how humans now should be managing agents, and way more agents, maybe dozens, maybe a hundred agents over time, but that obviously has implications for the infrastructure that companies are using and the load on that infrastructure. So I'm curious for your reflections on what happens to our developer tools and infrastructure as we start to really scale up agent use.

tomasz: Yeah. Okay. So one question is: let's say each software engineer is managing a hundred agents. What happens to the CI/CD platform if each software engineer is now committing, I don't know, 25 PRs a day, and you have the integration test suite and all the tests that you need to run? Does that break? Probably. I would have to imagine it does. The other limiting factor here, going back to the serial task processing of humans, is somebody needs to review all these PRs. Sure, you might have [00:36:00] GANs, generative adversarial networks, where one model is critiquing the other. The problem with that architecture is the papers indicate that actually a single model is superior, which actually was counterintuitive to me: a single model critiquing its own work is actually superior to two models critiquing each other's work. So I think everything around the PR and the CI/CD and the integration tests and the merge queue, all those things, I think, become unusually stressed. So I think that's probably the first thing that breaks. I do wonder, I think for large volumes, maybe all of front end code, whether you'll have a class of software engineers who never actually read the code, and if that happens, are they product managers? And I think they are. And then there's another class of software engineers, say super high performance networking or database or kernel level code, where the AI just won't help that much. It might be able to help on an algorithm here or there, but the [00:37:00] universe of open source code in those domains is relatively limited, and they actually remain within the Copilot tab-autocomplete world for a while, until there are fine-tuned models trained on those domains.
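As a small illustration of that single-model self-critique idea, here is a minimal sketch in which the same model drafts a patch, critiques it, and revises it before a human ever reviews it. The model name and prompts are placeholders, and this is a sketch of the pattern, not a claim about how any particular coding agent works.

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder model name

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def draft_and_self_review(task: str, rounds: int = 2) -> str:
    # One model drafts, critiques its own draft, and revises it;
    # only the final version lands in the human review queue.
    draft = ask(f"Write a patch, with tests, for this task:\n{task}")
    for _ in range(rounds):
        critique = ask(f"Critique this patch for bugs, missing tests, and style:\n{draft}")
        draft = ask(
            f"Task: {task}\nPatch:\n{draft}\nCritique:\n{critique}\n"
            "Rewrite the patch, addressing the critique."
        )
    return draft
```

Even with this kind of pre-filtering, a hundred agents per engineer still multiplies the PRs, test runs, and merge-queue traffic that the surrounding CI/CD systems have to absorb, which is the stress point raised above.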
duncan: Does that kind of scaling start to unlock a really new category of developer tools? You mentioned CI/CD systems, and I think those are traditionally a sleepy world in SaaS. Is that a new opportunity in startups?

tomasz: Yeah, it's an open question. I think one of the things about, okay, so you know, I've been a VC for 17 years, almost 20 years, and there were all these rules that I was taught growing up in the business, which were like, you never invest in this category, never invest in that category, this category doesn't produce venture scale outcomes. And reflecting on the last five years, all of those categories are the hottest categories in AI. Like, investing in an IDE: surest way to lose money 10 years ago. Can't monetize it. And now, okay, the fastest growing companies in the world are AI-enabled [00:38:00] IDEs, right? Legal software, same, right? And testing historically has not produced huge outcomes, typically a billion dollars, which is nothing to sneeze at. It's an amazing outcome, but it's not a decacorn. And so this is one of the questions that we're asking: does the developer tooling actually ultimately change, large database companies, pipelines, and all that kind of stuff? And so now all of a sudden, the amount of computation that's required... already the big companies were struggling with merge queues and CI/CD flows and test caching and library caching and all those kinds of things, but now it's an even more intense problem. So it could be one of these categories that in the last 10 years was not as loved by investors, but is transformed.

hugo: Something we've been talking around, but getting closer towards in the past five, ten minutes, as opposed to all the wonderful things AI is capable of, is technical debt in AI systems. And I'll actually link to a recent blog post of yours, Tomasz, called Hidden Technical Debt in AI, which riffs, [00:39:00] among other things, on Google's famous paper, which I'll also link to, on hidden technical debt in ML. But I'm wondering if you can tell us a bit about how you think about the type of technical debt we're already accruing and what we may accrue in the future.

tomasz: Yeah, so this goes back to the idea that initially we thought the LLM would manage the entire workload, and now we need lots of different kinds of systems. So Hugo, you brought up memory, right? There's information retrieval architectures, there's prompt caching, there's prompt optimization, there's evaluations, and so all of a sudden we went from this beautiful single box to this mosaic of different tools, and different teams need to specialize in each one of these. I don't think we are at the end, right? There's security, there's guardrails. We saw in an internal chat, one of my colleagues sent an example of, okay, this is wild: there's a guy in Israel who I have to imagine is a security researcher, and his home is controlled by a Google Home. So he has blinds that go up and down, and he fires [00:40:00] up Gemini on his phone and says, summarize my calendar for tomorrow. And it summarizes his calendar. And then he says, thank you. And then the mobile phone launches Google Home and opens up all of his blinds. And so he had attacked himself, right? He had created a prompt injection attack by putting something into one of the calendar invites on his schedule with a prompt that said, fire up Google Home and then open all the blinds. And so I think we're going to discover more of the security challenges associated with these things. Particularly with tool calling, it's really unexplored, and so that'll be a huge part of that.

hugo: Incredible. I also love that you mentioned evaluation, and as we know, we've been talking about GPT-5 as well, which went live yesterday. Sam Altman live tweeted the announcement, and I think it's very telling that his second tweet in the thread, his second tweet after saying, I'm gonna live tweet this now, [00:41:00] he opened by writing, evals aren't the most important thing; the most important thing is how useful we think the model will be, but it does well on evals. Now, I actually find this sleight of hand quite interesting, because he clearly thinks they're important enough to make it the second tweet that it performs well, but also wants to say they're not the most important thing. Because of course they are and they aren't, because we evaluate things in a variety of different ways. So I'm wondering, when you build, and also when you evaluate other products out there, how you think about evaluation as a builder and as a VC as well?

tomasz: Yeah, I mean, it's a good question. Evals.
I think evals were really important, say, a year ago, and they remain so. Well, there's two different things. I think there's benchmarks, right? And so for GPT-5, hitting certain benchmarks or exceeding certain benchmarks is really important, just as a press strategy. But now we're so close on, like, the SWE-bench benchmark and the AIME benchmark, and the systems are so [00:42:00] good across many of them, that their relative value, or the information they provide, is now limited. It's not that benchmarks are useless. So there's that aspect to it. There's another aspect to it, which is the way that a lot of people build LLM systems: you create a golden set of evaluations and you run it through, and you figure out, okay, this model versus that model, how well does it come through? I think that will remain a critical component of the way that we build systems, and we will automate some fraction of it, but there will always be, even in classical systems, all major search engines employ tens of thousands of people for evals, effectively, and there's no replacement for that labor, even with automation. So I think that's important. I think as a builder, the way that we measure quality is, how often does it work? And there's some human level tolerance for error, just as there is if two people are talking, but not if it fails more than one in 10 times.

hugo: Absolutely, [00:43:00] and I love that you mentioned that we do need some form of robust evaluation. Evals matter, for example, when seeing what happens when we switch out the hottest new model for something else in our system. But also, taking it back to Google search and those types of things, how we approach these things hasn't necessarily changed. Some of the implementation details may be different, but a lot of the processes remain the same. Something we've talked about is this trend towards ephemeral, purpose-built software, especially among younger developers. I'm just wondering, in the next several years and then beyond that, how you see that changing how we all build and maintain, or even value, software.

tomasz: Yeah, I think there'll be, I don't know, an order of magnitude more software built, and the expectation is that almost every person in a company should have the capability to build little scripts or little pieces of software that improve their productivity. It'll be a new axis of [00:44:00] benchmarking employees against each other, or achieving a certain level of goal. So if I'm a customer support rep, which was my first job, I had a huge, huge volume of tickets, and I operated within one segment of the Google advertiser network, and one of my colleagues operated in a different one, and we might need slightly different tools, right? Maybe he was in one vertical, I was in social networks, so managing the ads of Facebook and MySpace and all those companies, and some people were operating within the world of pharmaceuticals, where there's a whole lot more regulation. And so even though we're on the same team, we need slightly different tools. Historically, those tools have never been built, because given the cost to build them, the ROI is negative. But as the cost to build those tools plummets to zero, all of a sudden it's actually hugely positive. So then the question is just, okay, can the person build them and manage them? But I think at this point, I really believe English is the new programming language.
And if that's the case, then anybody who speaks English should be able to automate much of their work.

duncan: This has been a really fun conversation, Tomasz. One question to [00:45:00] wrap things up, around hype. We obviously go through cycles of hype and real capability, in both the Valley and in the world at large, and I think we're clearly at a stage in AI where there is extraordinary real capability. But I am curious if you have any reflections on what the biggest disconnect is today between the hype and the real stuff, and if there are areas that are being overlooked that people should be paying attention to.

tomasz: Yeah. I think this is a wave worthy of the hype, and it's not that every category will be impacted by AI at the same time. It takes time, and the limiting factor, I think, is just diffusion: extracting human knowledge, what's in humans' brains, and then providing it as context to the AI. I mean, a major category like robotics is going through a huge innovation. I think manufacturing, which is of critical importance to the US here given our trade policy, will be absolutely essential. Coding has been [00:46:00] revolutionized. Search is another category that I think has been completely transformed. And there are other categories, but maybe I'll put it a different way: I don't think there's a single job, especially in desk work, that will not be completely upended by AI in some form. It'll just take time.

hugo: Thank you, Tomasz, for such an enlightening and enjoyable conversation.

tomasz: Oh, I loved it. Thank you very much for having me on, Hugo and Duncan. I appreciate it. Thank you.

hugo: Thanks so much for listening to High Signal, brought to you by Delphina. If you enjoyed this episode, don't forget to sign up for our newsletter, follow us on YouTube, and share the podcast with your friends and colleagues. Like and subscribe on YouTube and give us five stars and a review on iTunes and Spotify. This will help us bring you more of the conversations you love. All the links are in the show notes. We'll catch you next time.