The following is a rough transcript which has not been revised by High Signal or the guest. Please check with us before using any quotations from this transcript. Thank you. ===
hugo: [00:00:00] Hi, I'm Hugo Bowne-Anderson and welcome to High Signal. The goal of this podcast is to help you advance your careers in data science, machine learning, and AI by learning from experts who are at the forefront of the field. Today, I'm thrilled to be speaking with Ramesh Johari, Professor of Management Science and Engineering at Stanford University. Ramesh also has extensive experience working with many multi-sided marketplace platforms, in both technical and advisory capacities, including companies like Uber and Airbnb. In this episode, we explore the art and science of online experimentation, especially in the context of marketplaces and tech companies. Ramesh shares insights on how organizations evolve from basic experimentation practices to becoming fast, adaptive, self-learning organizations. We dive into challenges like the risk aversion trap, the importance of learning from negative results, and how generative AI is reshaping the experimentation landscape. We also talk about common failure [00:01:00] modes and the types of things you're probably doing wrong, along with strategies to avoid these pitfalls. Plus, we discuss the role of incentives, the necessity of data-driven decision making, and what it means to experiment in high-stakes environments. To set the stage, I asked Ramesh what he thinks the future of experimentation is, and how organizations should prepare for the next wave of innovation. Here's what he had to say.
ramesh: I want to maybe summarize a little bit some of the stuff that's come up, and then I think there's a neat way to think about what this means for the future org. The two biggest ideas that came through: one of them is that good experimentation practice in the future should encourage a lot more testing, potentially a lot more risky testing. And that's the way to get out of a lot of the risk aversion that early adoption of experimentation can create within an org. The second one is that I think in the future, experimentation should involve much more learning across tests, thinking more broadly about the strategy of the business rather [00:02:00] than just individually within each experiment. Now, within all of that, there's been this through line; the way I like to summarize it for my students is the inequality that data is greater than methods. The reason it's so important to test more, in part, is because better data is always going to be better than applying a fancier method to try to learn something you never had the data for in the first place. I'm a huge believer that data trumps methods. And I think this whole theme here about learning across tests and learning from more tests is really about having the best data possible. And then from there, you can worry about the methods. So you put all this together, where does it lead us? What I like is a vision of the cycle I talked about a second ago, where you run a lot of tests, you learn from those tests, that generates ideas, and that lets you run more tests. And if you imagine that kind of pushing forward, it's almost as if you're moving towards what we could call a self-learning organization, where experimentation is just an arm through which the org as a whole is [00:03:00] trying things out, learning from them, and innovating into the next step.
And returning to where we started, as I said, you asked me what experimentation is. I said even toddlers experiment, and that's really what they do: toddlers are self-learning beings. They experiment in the world around them, they learn from it, they adapt and they evolve, and eventually, hopefully, become high-functioning adults. And so I think for orgs it's the same thing. They take baby steps with experimentation, but in the end, what you're heading towards is a world in which experimentation is helping you, helping augment your learning. And I think that phrase, self-learning, is a North Star to aim for. Obviously, if you think realistically about any org that's going to move in that direction, there's going to be an augmentation process between humans and this algorithmic process of innovation, so that the faster you test, the more you learn across tests, the more you're able to test into the future. So that's, I think, pretty exciting, pretty invigorating. And obviously it'd be amazing to have this chat again in 10 years and see: have we gotten to this [00:04:00] kind of future of the self-learning org?
hugo: So before we get into the full conversation with Ramesh, let's take a quick moment to chat with the team at Delfina, who make High Signal possible. I'm here with Jeremy and Duncan, the co-founders of Delfina and producers of High Signal. Jeremy, Duncan, maybe you can tell us a bit about what you're up to at Delfina and why we're doing this podcast.
jeremy: Awesome. Thanks, Hugo. At Delfina, we're building agents for data science. And as part of our work, we speak with lots of interesting people in the space, and so we're looking to identify and share the high signal.
hugo: And in terms of the clip we just showed from Ramesh, I'm wondering what resonated with either or both of you.
duncan: I love the discussion of institutional learning. One of the first things I did at Uber was to create a review for the major experiments at the company. And the origin of that review was ostensibly to make sure we were shipping the right experiments. But very quickly, we realized there was huge value in that review to [00:05:00] actually educate executives about what was actually working, about what bets made sense and what bets did not turn out well. And that really enabled the leadership and the team to stand on the shoulders of the giants who came before.
hugo: Fantastic. And I think that's a through line throughout the entire episode that a lot of people will get value from. And to your point, Ramesh has worked with some very sophisticated platforms that do large-scale experimentation and have essentially become these types of self-learning orgs. Jeremy, I'm wondering if there's anything you'd like to add to Duncan's great points.
jeremy: Yeah, I also loved the episode. I've found in most domains of life, personal, professional, athletic, that velocity of learning is the key to doing well, and often the key to having fun. And so I loved his discussion of this type of experimentation, the importance of high-quality data, and this idea of a self-learning organization. And of course, this is why startups exist and are important. The one advantage startups have over big companies [00:06:00] is that we are smaller; we have worse distribution and are less visible, but the one thing we have is an environment where we can iterate and learn very quickly and find novel solutions to important, hard problems.
hugo: I couldn't agree more. Velocity is so important. Thank you both once again for making this podcast possible. And I'm excited to get into the interview with Ramesh. So thanks once again, both. Hi there, Ramesh. It's great to have you here.
ramesh: Thanks, Hugo. It's great to join you.
hugo: So we're here today to talk about online experimentation, the role of marketplaces and marketplace dynamics, and experimentation in data science, machine learning, and building businesses. You have a lot of experience across all these dimensions, but I thought maybe you could start by giving us a brief introduction to who you are and what you do.
ramesh: Sure. Yeah. I'm a professor at Stanford, in the Department of Management Science and Engineering, which is a mouthful. I think a better way to describe myself is, broadly speaking, I'm someone who likes to think about how we make decisions from data. I've done that kind of work over the last decade-plus [00:07:00] working with a large number of online platforms and marketplaces, both in an advisory capacity as well as, of course, in my research and teaching roles. And then more recently, I'd say over the last seven or eight years, I've spent a lot of time thinking about the different data-scientific tools that these kinds of companies use to make decisions from data. So that's where the bulk of my expertise is these days.
hugo: Hearing that, it sounds like you could be more on the academic side of things, but I just want to be very clear that the types of companies you've advised and helped build their businesses range from Uber to Airbnb to Bumble, right? So you've been really in the weeds with some pretty serious tech companies.
ramesh: Yeah, I will say, one of the great things about Silicon Valley is that I've had an opportunity to get up close and personal with companies, often in situations where they were scaling through these processes. Yeah, almost all of the companies you mentioned, as well as having a chance to work with [00:08:00] Optimizely, which launched one of the earliest third-party A/B experimentation platforms available to companies. And early in my career, the very first start I got was working at a company that was then called oDesk and is now Upwork, helping build out an online labor market. Each one of these experiences is different, but one of the great things about being an academic is you can start to chain patterns together and think about common problems, common obstacles, and common ways to solve those problems. So that's part of what I hope we get to talk about today.
hugo: Let's jump in. I want to talk about the importance of experimentation and the current state of experimentation. But perhaps we could even step back a bit and work towards a definition of what experimentation even is.
ramesh: Yeah, that's a great idea. Experimentation is something everybody does, whether you're a toddler or you're an organization. In the most vanilla sense, it just means trying stuff: seeing what works and seeing what doesn't work. In tech companies, experimentation has a little bit more of a rigorous definition, and I think the best way to think about it is that often, as [00:09:00] a product manager or product leader, you're thinking about trying out different ideas for how to do something. Imagine you have different buttons for a checkout flow and you're testing which one works better.
The simplest kind of experiment is one where you take some fraction of the users coming to your site and show them one of the experiences, one type of button, and the other fraction of users who come to your site, you show them the other type of button, and you just see which one converts better. So that's the most standard example of what would be called an A/B test, A and B here referring to the two different versions of the button. As for the analysis of these kinds of things: A/B testing is, I often find, a bit of a funny term as an academic, honestly, because I have a world-class stats department right next to me, and in statistics the A/B test is just what has classically been called a randomized controlled trial. So referring to those things by this new, modern name is a bit odd for some of my colleagues, to say the least. But that said, that's all it is. It's the same kind of trial we hear about when people are testing vaccines for something like COVID. It's the same [00:10:00] basic architecture of an experiment, which also means we have over a hundred years of statistics backing us up when we think about how to analyze the data that comes out of these kinds of experiments. I think for the tech industry, probably what's important, especially for business leaders thinking about this, is what's changed in the last 10 to 15 years. I mentioned Optimizely earlier; they're one of many companies that contributed to this. Experimentation has gone from something that was niche, practiced by the biggest, most successful tech companies, the Googles, the Microsofts, the Netflixes of the world, to something that's basically available to anyone. If you started a company today, you could be up and running with an A/B testing platform by this evening. There are enough third-party platforms available that allow you to test your marketing messages, test your site design, and test the different backend algorithms you're building out. And that's important as a change, because it means that experimentation, this idea of being able to compare and learn from different alternatives, is no longer a competitive advantage. It's just something that everybody does, or at minimum has access to. [00:11:00] Everybody should be doing it. Whether everyone does it or not is a different question, but the point is, the barrier to entry has become basically nothing. I think what's interesting about where we are right now is: if you're a business leader, I would say you should already be in a place where this is standard practice when you're trying new things out. The more interesting question on my mind is, where do we go from here? What do you do in a world in which experimentation has become standard practice? What's working, what's not working, and how do we move it forward from where we are?
hugo: So we discussed earlier how, being in academia but also working with a significant number of pretty serious companies, you're able to abstract and see high-level patterns of what's happening in the space as well. I'm wondering if you would give us a broad overview of the current state of experimentation in data science within organizations, and I'm wondering if there are almost class structures to it.
I'm using class very loosely: people who do it very well, then a long tail of people who do it with a [00:12:00] variety of levels of sophistication.
ramesh: Yeah. I think it's easiest to tell this story in terms of the evolution of an organization as it becomes more sophisticated. In particular, given what I just described, imagine that you've just started with experimentation. There are a few things that are really interesting about that, what I'll call the novice environment. And it's not really novice, because you're already pretty sophisticated if you're running experiments in the first place. I want to say kudos to you: if you're running experiments, that's a good thing, you're on the right track. But that said, there are a bunch of interesting things that can happen. One thing I think is important is that almost everyone that starts with experiments uses them just to pick winners and losers, right? I've got some ideas, and I want to see which are the best ideas and which are the worst ideas. That seems obvious, and it's so obvious you might almost ask, what else could there be? And one theme that's really important is that experiments aren't just about finding winners and losers. They're also about teaching us something. In fact, one of the things I think is funny about experiments [00:13:00] is that the history of experimentation dates back to thinking about the scientific method, right? We all learned about this in elementary school. When you go down the track of the scientific method, you don't run an experiment unless you had a hypothesis you were testing. So usually you're asked to write down: my hypothesis is that water boils because I'm heating it up. So I'm going to test that by putting two pots of water on the stove; in one of them I'll turn the heat on, in the other one I won't. What's happening there is that your hypothesis is that putting heat into the water boils the water. When we run tests in tech companies, we almost never have hypotheses like these when we start out with experiments. We're just like, I just want to know which one works better; I don't care why it works better. A key feature of companies that are better, and "better" is a qualitative term here, at running experiments, or at using experiments for strategic innovation, is that they think a lot about the why. They're not just trying to understand what worked better; they're trying to understand why it worked better. And crucially, one of the things that means is that when things you thought would work well don't end up working well, you also care [00:14:00] why. As a good example of this: we might think that if we increase the salience of various kinds of content in search results, think of an e-commerce platform, this is going to make people more likely to convert. So maybe we go run this experiment, and what we see is that indeed the click-through rates went way up. But a really common experience is that you see higher click-through rates and actually worse conversion. And the reason that's often happening is that you've confused the user. You've thrown a whole bunch of extra stuff at them, and now they're actually clicking more to try to find what they want. So all those clicks you're seeing are actually confusion, not high intent leading to higher conversion. This is a great example where you really need to dig in more to understand what's going on, not just think about what's better and what's worse according to some very simplified view of the metrics in front of you. So that's, I think, one interesting thing.
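To make the kind of readout Ramesh describes concrete, here is a minimal sketch, in Python, of comparing two arms of an A/B test on both click-through and conversion. The counts and the helper name `two_proportion_z` are invented for illustration; a real pipeline would also handle sequential looks, multiple comparisons, and guardrail metrics.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Classic two-proportion z-test: the century-old statistics behind a basic A/B readout."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-sided
    return p_b - p_a, z, p_value

# Hypothetical counts: treatment B makes content more salient.
n_a = n_b = 50_000
clicks_a, clicks_b = 6_000, 7_500   # click-through goes up ...
orders_a, orders_b = 1_500, 1_350   # ... but conversion goes down

for name, a, b in [("click-through", clicks_a, clicks_b), ("conversion", orders_a, orders_b)]:
    lift, z, p = two_proportion_z(a, n_a, b, n_b)
    print(f"{name}: lift={lift:+.4f}, z={z:+.2f}, p={p:.2g}")
```

The point of the toy numbers is exactly the pattern described above: a "winning" engagement metric and a business metric can move in opposite directions, which is when you need to dig into the why.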
The other big one I want to point to has to do with velocity, or speed. There's something funny about the ease with which you can adopt experimentation as a practice, and [00:15:00] yet the slow pace at which it usually happens in organizations that first start to do it. There are a lot of different reasons for this. I think we're all familiar with the range of organizational frictions that exist anytime you try to do something new. In many companies that try to adopt experimentation, this new way of thinking about decisions means there's a lot of discussion that goes into each test. One thing that can happen here is what I like to refer to as the risk aversion cycle. Essentially, you're not running a lot of tests, so each test takes on a magnified importance. And you think to yourself, well, I really better be careful: what are we going to spend this limited bandwidth and time and budget we have for experiments on? Okay, you spend all that time, and then you get results back, and you look at them and think, oh, wait a second, there were a whole host of ways in which we should have done things differently. We should have tried something different, or maybe our analysis was broken. Another very common feature is that because you're running a small number of tests, you try to test [00:16:00] a lot of things at once. And when you walk out of that test, you can't tell: maybe nothing moved, maybe something moved, and you don't know why, because you tested so many things at once. So the next time around, you slow down more. You think, oh, I better think even harder before I run the next test. And you get into this cycle where you're running slower and slower, despite the fact that the best organizations have learned to test a lot and test frequently. The more you're testing, the more you're reducing the sense of heightened importance of each individual test, and the more you're maximizing your learning. So these two things I think are key. I mentioned this concept of the novice, and the two features you see there are relative slowness of testing, not a lot of tests run, and not a lot of learning, especially across experiments. Conversely, in the most experienced organizations, what you see is that they test a lot. They test often; even simple changes they'll go ahead and test. Almost everything is tested, [00:17:00] and they have more structures in place to learn across experiments and think strategically about innovation. I just want to interject, by the way, that for a lot of this way of thinking, there are a few collaborators I'm working with whom I want to acknowledge, who have been crucial in framing up these ideas: Martin Tingley at Netflix, Iavor Bojinov at Harvard, Sven Schmit at Eppo, which also runs an A/B testing platform, and David Holtz at Berkeley.
hugo: I appreciate all of those insights, and the team you've been collaborating with on these types of things. Something I want to drill down on in there:
I think if we consider the spectrum from novice to expert, on the more novice side there's also an incentives challenge that you've spoken about before, whereby if I'm a data scientist at an organization, there are incentives for me to get as many positive results as possible for my promotion cycle, which incentivizes me to need statistically significant results in a positive direction, which takes a longer [00:18:00] cycle. So maybe I don't operate at velocity because of that. There's also a de-emphasis on negative results, without the recognition that if we're talking about learning, negative results may not have the highest impact in the short to medium term, but we can learn a lot from them, and that can help the long-term impact of all of these experiments. Correct?
ramesh: Yeah, there's a lot to unpack there. I want to start by saying that hitting the nail on the head with incentives is so important. I teach a data science class at Stanford, and one of the things I really like to tell the students is that data science is like any other business function: it lives in an organization that creates incentives for how it gets used. I believe that sometimes we fall prey to thinking that if you're making decisions from data, then you're being completely rigorous and there's no subjectivity that enters into it. But there's a lot of what I like to call data scientist degrees of freedom in how experimentation actually gets instantiated, which you brought up. [00:19:00] Which tests actually get run? How long do they run for? And how do I interpret the results? There are often a lot of levers moving here that can materially affect the business, but the choices that are made are affected a lot by these incentives. So just to drill into a couple of the things you brought up, one thing I want to raise is this paper by some researchers at Microsoft that focuses on A/B testing with what are called fat tails. I won't get into the technical definition, but imprecisely, what it's referring to is the idea that, hey, you've got a bunch of ideas you could try out. Some of them are incremental ideas, right? Maybe last year you tried a light shade of blue for the color of a particular button, and this year you're going to try a slightly lighter shade of blue. Okay, that's a very incremental idea. You have some understanding of what you think it's going to do, and worst case, it's not going to break anything; it's a pretty low-risk endeavor, with limited upside and limited downside. And then there are really big, really [00:20:00] potentially risky ideas. For example, you might completely alter the checkout flow by removing an entire page from what the user sees on the way to checkout. That's a big change. Who knows what it's going to do? It could be massive upside, because you remove friction. It could be massive downside, because you removed a bunch of information the user needed to feel comfortable before they checked out. There are examples of both out there, and you don't know in advance what's going to happen.
The fat tails in the title of the Microsoft paper refer to the idea that if you look at the returns you'd get from the kinds of ideas you could experiment with, there are a lot more ideas with these outsized effects than companies typically test, and there's a lot of value in running shorter tests, and more of them, with the idea of trying to identify those outsized returns. Now, you're absolutely right: when you do that, sometimes you're going to be picking up on big negative changes. You're going to take out that page and realize, [00:21:00] oh, that was a mistake, I really needed that page in there. There are a few things to say about that. One is that it's a learning experience that informs future testing, back to that earlier point. In the long run, we want our experiments to be informative about growing the business, and sometimes that negative experience actually informs choices you make down the road: let's not mess with taking a page out. The other big thing, from an incentive standpoint, is that this isn't going to happen on its own. Data scientists, exactly as you said, are not going to sit around and think, why don't I be the one to come up with that super high-risk idea of deleting an entire page in the checkout flow? No data scientist today is going to have that idea on their own, except if they're empowered to do so. And this is one of the things I really want to point out: in the organizations that run with the highest velocity and learn the most across experiments, there's a somewhat different role for the data scientist. There's [00:22:00] an understanding that what you're evaluated on doesn't just come down to, say, the number of positive experiments you ran that quarter. A metric like the number of positive experiments you ran focuses entirely on that winners-and-losers view of A/B testing, first of all, and second, it focuses on the incremental value the data scientist is generating locally, instead of the bigger growth of the overall business. In those organizations there tends to be greater attention to long-run impact in evaluation, and greater attention to overall business and org impact, rather than just the local impact of a single data scientist's test. One last comment I'll make about this: that Microsoft paper is great to take a look at, and one of the interesting things is that they point to the significant value that could be generated if you change the structure and organization of experimentation towards running more of these shorter tests. And in order to execute that, you need some risk mitigation measures. [00:23:00] I'll quickly throw out a couple for people who are curious. One thing you can do is phase your rollout. A lot of companies do this already: you incrementally increase the allocation to the new idea, so that if it looks like it's really bad, you can shut it off quickly and it doesn't take down the business. The other one, which is what we worked on at Optimizely, is what's called real-time monitoring, where essentially you give statistical tools to the user that let them continuously monitor the test results and stop whenever they want. So if there are early returns that indicate a big hit, you can stop. And if there are early returns that indicate a huge positive lift, you can launch faster as well. So there are a lot of tools out there to get us to a place in tech experimentation where we can move fast and try out a lot of these ideas that are in the fat tails the Microsoft paper refers to.
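As a rough illustration of the phased rollout idea, here is a sketch of ramping an experiment's allocation in stages while watching a guardrail metric. It is deliberately simplified: the stage sizes, threshold, and `run_stage` stand-in are hypothetical, and real systems use always-valid sequential statistics rather than the naive check shown here, because repeatedly peeking at raw results inflates false positives.

```python
import random

RAMP = [0.01, 0.05, 0.20, 0.50]   # hypothetical allocation stages
GUARDRAIL_MAX_DROP = 0.02         # stop if conversion drops by more than 2 points

def run_stage(allocation, n=20_000, base=0.10, true_lift=-0.03):
    """Stand-in for serving `allocation` of traffic to the treatment and logging outcomes."""
    treated = int(n * allocation)
    control_rate = sum(random.random() < base for _ in range(n - treated)) / max(n - treated, 1)
    treat_rate = sum(random.random() < base + true_lift for _ in range(treated)) / max(treated, 1)
    return control_rate, treat_rate

def phased_rollout():
    for allocation in RAMP:
        control_rate, treat_rate = run_stage(allocation)
        print(f"allocation={allocation:.0%} control={control_rate:.3f} treatment={treat_rate:.3f}")
        if control_rate - treat_rate > GUARDRAIL_MAX_DROP:
            print("Guardrail breached: shutting the test off early.")
            return
    print("Survived all ramp stages: candidate for a full launch or a proper read.")

phased_rollout()
```

The point here is only the shape of the ramp-and-guardrail loop; in practice the stopping decision at each stage would come from a sequential test designed for continuous monitoring, of the kind Ramesh alludes to.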
hugo: Fascinating. I love those two methods. I'm also wondering, have you seen people roll things out internally and test them internally before serving them to [00:24:00] their audiences and markets?
ramesh: Yeah, that's a practice sometimes called dogfooding, and it's pretty common for a lot of different products and services. There are some challenges in extrapolating from it, though. You can do things like debug and validate whether there are serious engineering or interface issues, but I tend to think it can be a dangerous position to put yourself in to rely on that as the primary source of information about the treatment effects you should expect. And that's basically because, for any platform operating at a scale where A/B testing is going to be meaningful, almost certainly the reason you're running experiments is that the heterogeneity in how your user population interacts with your product is high enough that you can't predict it without running the experiment. And I think, by the way, that's an important point: if I could predict [00:25:00] how people are going to behave when I make a change, you don't need experiments. That's not what experiments are for. Experiments aren't there just because we feel like running something tech-savvy. The real reason experiments were designed is that they're the only way to be sure you've ruled out everything except the change in question as the reason you're seeing the effect you're seeing, because you randomize between users, so in every other way the populations exposed to A and B should be similar. And the thing is, if you could predict how a particular kind of user is going to respond to A instead of B, you don't really need the experiment; you can just go ahead and use the predictions. That's an important point because for certain things, for example machine learning models that predict the kinds of content you might engage with, we sometimes get very good at that on the largest platforms, and that means that rather than experimentation, we can move towards a model where the predictions from the algorithm gradually step in. There are a number of algorithmic techniques under the name of contextual bandits that have this flavor, [00:26:00] where you move gradually from experimentation towards a model that's essentially saying, I understand how to predict how you're going to react, let's say, to a new intervention.
hugo: Oh man, I wish we had time to get into bandits and reinforcement learning. And actually, while you were talking, I wished we had time to get into the importance of A/A testing as well, before we even get to A/B testing.
ramesh: Oh, I'm a huge, yeah, I'm a huge fan of A/A testing. So I can definitely talk about that for the rest of the podcast too, if we want. But yeah, let's keep going.
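To give a feel for the contextual bandit idea Ramesh mentions, the gradual shift from pure randomization towards acting on predictions, here is a toy epsilon-greedy sketch. Everything in it (the contexts, the "true" rates, the decaying exploration schedule) is invented for illustration and is far simpler than production systems.

```python
import random
from collections import defaultdict

ARMS = ["layout_a", "layout_b"]
CONTEXTS = ["new_user", "returning_user"]

# Hypothetical "true" conversion rates the bandit has to discover.
TRUE_RATE = {("new_user", "layout_a"): 0.05, ("new_user", "layout_b"): 0.09,
             ("returning_user", "layout_a"): 0.12, ("returning_user", "layout_b"): 0.08}

counts = defaultdict(int)     # (context, arm) -> times shown
rewards = defaultdict(float)  # (context, arm) -> total conversions

def choose_arm(context, step, total_steps):
    epsilon = max(0.05, 1.0 - step / total_steps)  # explore a lot early, less later
    if random.random() < epsilon:
        return random.choice(ARMS)                 # behave like an experiment
    # Exploit: pick the arm with the best estimated rate for this context.
    return max(ARMS, key=lambda a: rewards[(context, a)] / counts[(context, a)]
               if counts[(context, a)] else 0.0)

TOTAL = 50_000
for step in range(TOTAL):
    context = random.choice(CONTEXTS)
    arm = choose_arm(context, step, TOTAL)
    converted = random.random() < TRUE_RATE[(context, arm)]
    counts[(context, arm)] += 1
    rewards[(context, arm)] += converted

for key in sorted(TRUE_RATE):
    shown = counts[key]
    print(key, f"shown={shown}", f"est={rewards[key] / shown:.3f}" if shown else "never shown")
```

The decaying epsilon captures the transition he describes: early on the system behaves like a randomized experiment, and as its per-context estimates firm up it increasingly just serves the predicted best option.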
hugo: I am interested. We've been discussing that there is, at best, a not insignificant cost to learning and running experiments, and a lot of the time it's a relatively significant cost in the short to medium term, but very important for long-term success. So we've danced around this a bit, but I want to know specifically: how can data leaders encourage their teams to embrace a mindset of continuous experimentation, even when many of the tests might not lead to immediate success? [00:27:00]
ramesh: Yeah, this is such an important point. One of the things that's so hard for people to accept when experimentation is first adopted, or even when it's matured, is that the success rates of experiments are not that high, even at the best companies, perhaps even more so at the best companies. There are a number of published studies where these rates of success are on the order of 20 to 30 percent of experiments. And by succeeding, a term I don't love to be totally honest, I mean, as a proxy, that the new thing you launched ended up quote-unquote winning in the A/B test relative to the existing control or existing version. A 30 percent rate of success is amazing; that's actually quite high. And that's hard, and I go back to the incentives point: first of all, if I told you I'm going to judge you on how many experiments you won, and you already know in advance that at best 30 percent of things win, this is not a rosy picture I'm painting for you about your job experience. So that's [00:28:00] already an issue. Besides that, we should ask ourselves, how do we think about that 30 percent? What's really going on with it? There are a bunch of different things happening there. The important thing is that optimizing that metric, making it larger, may not be the best goal. Now, in the long run, yes, that's what we want. Over the long run, say a six-month or one-year horizon, on the whole we want the business compass to be pointing in a positive direction; it's very hard to justify experimentation if it's only a loss, if it's not generating more value than the losses it's creating. But the point is that the 30/70 framing is missing something: a package view of experimentation, where you try a lot of different things, and some of them win, and some of them win really big. And that goes back to the Microsoft point about fat tails. If you really experiment following the guidance of that sort of approach, you are going to [00:29:00] have some losses, but you're going to cut them off early. Which means that if you look at the overall returns, it's just like good investing in finance: the things that don't work, you cut off quickly; the things that work, you learn from faster and you implement all the way, faster. And so the returns to experimentation have to be measured differently than the raw percentage success rate across the experiments you ran. They should really be measured in terms of business-relevant metrics and some aggregation across those experiments.
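A toy calculation can show why aggregate returns, not the raw win rate, are the right lens. The sketch below simulates a hypothetical portfolio of experiments with fat-tailed effects, where losers are caught early at a small ramp allocation and winners are launched fully; all the numbers and distributions are invented for illustration.

```python
import random

random.seed(0)

def simulate_portfolio(n_experiments=200, ramp_allocation=0.05):
    """Toy model: most ideas do roughly nothing, a few are big losers or fat-tailed winners."""
    total_value = 0.0
    wins = 0
    for _ in range(n_experiments):
        r = random.random()
        if r < 0.70:
            effect = random.gauss(-0.1, 0.15)   # incremental ideas, small effects
        elif r < 0.90:
            effect = -random.expovariate(0.5)   # clear losers
        else:
            effect = random.expovariate(0.25)   # fat-tailed winners
        if effect > 0:
            wins += 1
            total_value += effect               # launched to everyone
        else:
            total_value += effect * ramp_allocation  # caught during the ramp, damage capped
    return wins / n_experiments, total_value

win_rate, value = simulate_portfolio()
print(f"win rate ~{win_rate:.0%}, aggregate value {value:+.1f} (arbitrary units)")
```

Even with a win rate in the 20 to 30 percent range, the portfolio comes out well ahead, because downside is capped by the early ramp while the fat-tailed winners run at full scale.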
So you asked me, how can data leaders do a better job of incentivizing this kind of behavior, creating this culture, and managing the costs? I think there are a couple of pieces to that. One of them is that it's not really only the data leaders who accomplish this. In many organizations, one reason experimentation gets hamstrung from the get-go is that it's set aside as a cost center, while the profit centers of the business are all the different verticals actually putting the features and the product out there for customers to use. When you make that kind of division, you inherently limit the [00:30:00] extent to which experimentation can have a say in the strategic innovation that happens in the business, and also the resources available to experiment. So one thing I would push on there is that it's useful to think about how you integrate experimentation. In some businesses, this explicitly means a hybrid approach where data scientists get embedded within teams, and the returns that accrue to a product org, literally in terms of the business value generated by that unit over six months or a year, are also attributed back to the experimentation stack; you're turning it from a cost center into a profit center. Another piece is that the same way there's a risk aversion cycle, there's also a virtuous cycle that gets generated if you move at high velocity. Because as you lower the bar for experimentation, which means having tools and platforms in place that democratize access to experimentation and make it available to a much wider population of your employees, you also lower the temperature: you don't make any [00:31:00] individual test as consequential for any individual employee. And that alone creates a culture of experimentation, because it enables individuals to move faster and try out ideas faster, autonomously. So again, in the best organizations, you'll see this democratized access to experimentation capability. That doesn't mean there's no gatekeeping; it doesn't mean there aren't safeguards and controls in place to prevent massive downside risk. But it is the case that the wider the practice of experimentation spreads, the lower the temperature gets around the consequences of any individual test.
hugo: I love that. I love that you mentioned the hybrid model, where we have data scientists embedded in different parts of the org. One thing that also achieves very well is this: a lot of data scientists tend towards optimizing quite technical stuff. Having them embedded in different parts of the organization, different departments, actually brings them very close to [00:32:00] thinking about business value, as opposed to optimizing some loss function or something like that.
ramesh: Oh man, yeah, this is a topic very near and dear to my heart. By day, I'm a technical researcher. I love working on fancy methods. I love coming up with new ways to use data to make decisions or to generate insight, technically. But when I've worked with companies, I very religiously take that hat off, and I apply a really hard rule: let the nails do the speaking, not the hammers. What I mean by that is, when you're working with data scientists, these are often technically very well trained people who have a lot of expertise across a wide range of data science domains that they want to bring to the table. They want to show that the stuff they know is what you need to solve the problem.
And I actually think one of the hardest skills for a data scientist to learn is to put that aside and recognize that what your PhD or advanced technical training gave you [00:33:00] was not so much any individual technique that's going to win here; it was actually a way of thinking, a way of framing uncertainty, a way of framing problems, and that maybe what the business problem needs is not the most advanced methodology. This is a really hard lesson to learn. I used to joke that there's not a lot of difference in technical capability between a new PhD grad and someone who's been working for three years, but there's a huge difference in how they approach problems. And that difference comes down to this thing we're talking about: the methods can't be the key. You have to let the data do the speaking. You have to let the business problem do the speaking. Often that means very basic, off-the-shelf techniques are the right fit for the problem. You get to an MVP, a minimum viable solution, faster, and you're able to prove out an idea faster, if you work that way than if you insist on implementing the fanciest technique or an esoteric approach to something.
hugo: Oh, totally [00:34:00] agree. The data, the business question, and how you evaluate, I think, are the three things that are so important. I'd have to fire myself if I didn't ask you about the impact of generative AI on experimentation. And to set the scene a bit more clearly, I've heard you speak about something I wholeheartedly agree with: that one of the big things we see with generative AI is essentially what I'd call some form of combinatorial explosion of ideas and hypotheses, right? So in a world in which we're able to generate a lot more hypotheses, what are the challenges that we encounter in the realm of experimentation?
ramesh: Yeah, it's interesting to me. If you watch the arc of generative AI over the last couple of years, it's interesting to look at where it's come to, and I think we're all wondering where it goes. So I've been thinking about that question, like you said, through the lens of experimentation: what does it mean within the world of experimentation? And I think there are two complementary [00:35:00] pieces to that. One is the one you brought up: done well, generative AI should be an incredible partner in ideation. And I really think the product manager of the future should be on the leading edge of the AI revolution. There's a lot of fear that AI is basically going to replace a bunch of product management. I think it's the opposite. Done right, product managers should be among the best positioned to leverage how AI is used within work. And that's because what you'll need more than anything else is a partner in coming up with ideas, right? Historically, how did that work? The product managers are the ones responsible for the roadmap. So if you have an ideation approach that generates 10x, 100x the number of ideas you had before, and you view yourself as an org that experiments to learn, now you've got a lot more experimentation that has to happen. So that puts on a lot more pressure, circling back to where we started. One of the comments I made is: what's the competitive edge in the future? And one of the competitive edges is going to [00:36:00] be not whether you are experimenting.
It's whether you are experimenting fast enough. Because if you're not experimenting fast enough, you can't keep up with the number of ideas being generated. So in an ideal experimentation world that's keeping up with generative AI, you're running a lot of experiments, you're cutting off the things that aren't working fast, and you're learning fast from the things that are working and continuing to innovate on those. Now, the other thing I want to say is that there's something I glossed over that's really important, which is that if you're running a lot of tests, you're also generating a lot of information. And again, circling back to where we started, a common practice today in the more novice orgs is a very bespoke, manual review process for each experiment, which is another part of this risk aversion cycle where you go really slowly, because every experiment that's run needs a manual doc created as a readout. Now, imagine doing that for a hundred times [00:37:00] the number of ideas in this world of generative AI. That's a disaster. Nobody would be able to keep up with the number of memos and meetings that experiment reviews would generate in that world. And one of the things my coauthors and I have been exploring is, look, this is actually a place where AI should also be coming to the rescue, because another great ability unlocked by modern LLMs is the ability to synthesize complex knowledge in a way that's incredibly valuable for future innovation. At the same time that you're testing a lot, you're generating a lot of information, and one of the pieces you want, to generate this kind of cross-experiment learning, is a knowledge management system that's really leveraging AI as well: to mine what's happening in those tests, to detect the patterns that are emerging, and to feed back into this innovation cycle to help you think about the next set of ideas you should test. So it's a kind of funny arms race, right? AI is both the generator of a bunch of new ideas and, I think, a key tool in managing the information overload that testing all those ideas out can [00:38:00] generate. So in the ideal world, that's what we envision the future looking like.
hugo: Awesome. So that actually brings me to my final-ish question. Looking ahead, I'm wondering what you believe is the future of experimentation in data science, and how organizations should prepare for the next wave of innovation in this area.
ramesh: Yeah, I think I want to maybe summarize a little bit some of the stuff that's come up, and then I think there's a neat way to think about what this means for the future org. The two biggest ideas that came through: one of them is that good experimentation practice in the future should encourage a lot more testing, potentially a lot more risky testing. And that's the way to get out of a lot of the risk aversion that early adoption of experimentation can create within an org. The second one is that I think in the future, experimentation should involve much more learning across tests, thinking more broadly about the strategy of the business rather than just individually within each experiment. Now, within all of that, there's been this through line; [00:39:00] the way I like to summarize it for my students is the inequality that data is greater than methods.
The reason it's so important to test more, in part, is because better data is always going to be better than applying a fancier method to try to learn something you never had the data for in the first place. So I'm a huge believer that data trumps methods. And this whole theme about learning across tests and learning from more tests is really about having the best data possible. From there, you can worry about the methods. So you put all this together, where does it lead us? What I like is a vision of the cycle I talked about a second ago, where you run a lot of tests, you learn from those tests, that generates ideas, and that lets you run more tests. If you imagine that kind of pushing forward, it's almost as if you're moving towards what we could call a self-learning organization, where experimentation is just an arm through which the org as a whole is trying things out, learning from them, and innovating into the next step. And returning to where we started, as I [00:40:00] said, you asked me what experimentation is. I said even toddlers experiment, and that's really what they do: toddlers are self-learning beings. They experiment in the world around them, they learn from it, they adapt and they evolve, and eventually, hopefully, become high-functioning adults. And so I think for orgs it's the same thing. They take baby steps with experimentation, but in the end, what you're heading towards is a world in which experimentation is helping you, helping augment your learning. And I think that phrase, self-learning, is a North Star to aim for. Obviously, if you think realistically about any org that's going to move in that direction, there's going to be an augmentation process between humans and this algorithmic process of innovation, so that the faster you test, the more you learn across tests, the more you're able to test into the future. So that's, I think, pretty exciting, pretty invigorating, and obviously it'd be amazing to have this chat again in 10 years and see: have we gotten to this kind of future of the self-learning org?
hugo: I love the idea of the self-learning org. And just to [00:41:00] reiterate: encourage more and riskier testing, and avoid the risk aversion trap, that's number one. Number two is that data trumps methods. And then learn across tests, not just from each test. I love your example of a toddler. I actually think, with the velocity that toddlers learn at, they're probably the ultimate experimentalists; and they're learning for several reasons, but one is to get basic needs met, so it's high stakes in a lot of ways as well. So learning from toddlers about experimentation would be wonderful. Also, executives of successful companies in the pre-data-driven era have always been fundamentally very strong experimentalists, and detailed in their experimentation. As you probably know, the t-test was developed by Gosset at the Guinness brewery, right, when they were testing different types of hops.
ramesh: Yeah, I think there's another thing embedded in there. I love the Guinness story; I think that's a great one. And I actually want [00:42:00] to key in on something you just said, which is about business leaders being experimentalists. I think that's such an important thing to say, not only to recognize that history, but also to change the conversation around what experimentation means to a business.
I think business leaders will often think of experimentation as the domain of the tech folks within the business. But there's a different way to think about experimentation, which is that everybody's experimenting. Even the things that aren't a formal A/B test are an experiment. Often you're trying something out without all the statistical rigor underneath it, but that's still an experiment. And the reason that perspective is important is that it makes clear there's a lot of what we do as a business where uncertainty is present and we're doing the best we can to grapple with it; fundamentally, we're making decisions in the face of uncertainty. One of the things I hope for business leaders is that they recognize the ways in which experimentation [00:43:00] can help quantify some of that uncertainty. One of the things I hope for data science leaders is that they recognize that not all uncertainty of that form can be quantified through experiments, and sometimes you're doing things where experimentation is actually not the right answer, but the business still needs to grapple with the uncertainty that's present. A great example would be starting a new business line. You can't necessarily A/B test that, but thinking about uncertainty is still important. There's so much in the comment you made, about the fact that when you shift the perspective from experimentation being the province of just technical data scientists to being a framework within which all uncertain decisions made within a business can fit, some of them statistically rigorous, randomized experiments, others involving coarser ways to deal with uncertainty, it helps open up a lot more interesting conversation between people who otherwise maybe wouldn't see what they're doing as variants of the same thing.
hugo: Absolutely. And, for another time, there's a [00:44:00] big difference between statistical significance and practical significance. I'm also reminded by this conversation of an old friend of mine, Rob Phillips, a professor of biophysics and biology and physics at Caltech. He once said to me, because I was teaching some of his students frequentist versus Bayesian techniques and that type of stuff, why do we even need statistical tests? And I said, what? And he said, did Newton ever use a statistical test? Did Einstein ever use a statistical test? And of course, we do need them for a variety of reasons in certain settings, but they're not necessarily the be-all and end-all. And I'm excited, downstream, to have another conversation with you about these things, and perhaps even dive into more technical topics, such as the A/A test, or thinking about how Bayesian priors can inform statistical tests.
ramesh: Yeah, it's interesting you brought up Bayesian testing as well. That's another one of those topics I'm a huge fan of, for exactly the reasons I was voicing a second ago: I think it's easier to accept the ways in which our [00:45:00] beliefs are influencing decisions within a Bayesian A/B testing framework. But yeah, that's definitely a technical topic for a future conversation.
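For readers curious what a Bayesian A/B analysis looks like in the simplest case, here is a minimal Beta-Binomial sketch: encode prior beliefs about each variant's conversion rate as Beta distributions, update them with observed data, and estimate the probability that B beats A by sampling. The priors and counts are invented for illustration.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, prior=(1, 1), draws=100_000):
    """Beta-Binomial Bayesian A/B comparison: P(rate_B > rate_A | data, prior)."""
    a0, b0 = prior  # Beta(a0, b0) prior; (1, 1) is uniform, larger values encode stronger beliefs
    wins = 0
    for _ in range(draws):
        rate_a = random.betavariate(a0 + conv_a, b0 + n_a - conv_a)
        rate_b = random.betavariate(a0 + conv_b, b0 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical results: 10,000 users per arm.
print(prob_b_beats_a(conv_a=1_000, n_a=10_000, conv_b=1_070, n_b=10_000))
```

This is where prior beliefs enter explicitly, in the way the conversation suggests: a skeptical prior such as Beta(50, 950), centered near a 5 percent rate, pulls noisy early results back towards what you already believe, making the influence of those beliefs visible rather than hidden.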
hugo: Absolutely. I do have one final question. We have definitely talked about a lot of the benefits of having a culture of experimentation. I do want to issue a buyer-beware warning of some sort here. I'd love to know what experiments are capable of and what they're not capable of. So I'm going to lead the witness slightly and say: perhaps there's something in the cultural consciousness that, given the laptops and phones that companies make, an online experiment, or sorry, an A/B test, wouldn't tell us; it wouldn't create the idea of an iPhone. Henry Ford would probably say you'd get a faster horse if you ran an experiment with horse businesses, and you wouldn't get Ford motor vehicles, that type of thing. So what are we missing when we're only doing experiments?
ramesh: Yeah, it's interesting. I think that's more a matter of perspective. Let me actually frame that within the way I describe things, in terms of everybody being an experimenter, [00:46:00] right? Think about what it would mean to create a phone, to just imagine a phone from scratch. Even if you imagine the phone from scratch, and by the way, going back to Steve Jobs and the iPhone, there's a lot that led up to that, right? The portable music players and their different instantiations.
hugo: The iPod was actually one of the biggest steps.
ramesh: Yeah, exactly. And I think those are all experiments. Those are all processes of innovation where you're trying something. There's a different world in which maybe the first smartphone was a total flop, for some reason we couldn't foresee. Maybe there was something about the interface that was totally broken. And throughout history, of course, there are creative forces that lead to experiments that are wildly successful from the get-go, whether that's physical theories or the iPhone. But we shouldn't allow the relative magnitude of the success or failure to mask the act of stepping into the unknown by generating a new idea and trying it out, which is the experiment. And I think [00:47:00] that's what I would emphasize. The way I think about your question is not so much that some things can't be tried and some things can; everything is tried. It's that some things can't be supported by the type of evidence that the straitjacket of current A/B testing supports. And if you can't do it that way, that doesn't mean you're completely not data-driven, right? There's always data that informs your thinking about the choices you make. In that case, it might have been data about how the iPod did, how it was used, and who was using it that probably informed the development of the iPhone once the grand idea popped in, right? Data about the relative cost of cellular coverage, and the amount of data an iPhone would generate, and how you'd pay for that data with a cellular plan, I'm sure informed the launch of the iPhone. My point is, even those experiments are data-driven; it's just that the [00:48:00] design for collecting the data, and the analysis of that data after it comes in, doesn't look anything like the standard A/B test in an Optimizely dashboard. But this is, I think, more a failure of imagination in how we talk about experimentation, and how we talk about uncertainty, frankly, within organizations, than it is anything about a fundamental difference in terms of what it means to experiment.
And I guess that's a bit of a controversial way of saying things, but I've found it useful to think of things this way, because it allows me to put a discussion of uncertainty on a common plane when thinking across different ideas.
hugo: Fantastic. You just widened my horizons in a variety of ways, because, just to recap, my question was: what are the limitations of experiments? And what you led me to do is actually think, oh no, wait, all these things I thought were limitations of experiments, that you could address via other means, are also experiments. They're just not classical A/B tests, for example.
ramesh: Yeah, that's exactly right. And I think it changes a lot once you start thinking [00:49:00] as a Bayesian, for example, because that allows you to bring in this idea of what your prior beliefs are, what you think will happen, and it opens up potentially different ways of thinking about these steps that are taken. So yeah, it's a fascinating world. Maybe one takeaway from that is that I'm not someone who religiously believes literally every single idea has to be A/B tested. That just locks you down in a way that you can't make big steps sometimes.
hugo: Yeah. For example, Netflix experimenting with streaming as opposed to sending stuff to your door. That wasn't a classic A/B test, but look at what happened to the world, right?
ramesh: Exactly. Yeah. That's a great example. And you brought up the iPhone, but also new businesses, right? Stitch Fix was a company I worked with; they launched a men's line, and it's very hard to imagine A/B testing a men's line. It's not even obvious what that would mean as a matter of experiment. So those kinds of things are fascinating examples of this type of thing: you can't really run a classical A/B test, and yet you're grappling with uncertainty in the environment around you.
hugo: In terms of grappling with uncertainty, I [00:50:00] hope we get to have another conversation sometime about how Bayesian thinking can help everyone as well. I always joke that I'm probably a Bayesian; there are some slight error bars around that. But Ramesh, if people who are watching and listening want to follow you or get in touch, is LinkedIn, for example, a good place?
ramesh: Yeah, I'm not the world's most frequent poster, but LinkedIn is a good place to pay attention to. Certainly if there's something important or interesting, I'll post there. And then of course, I'm an academic, so I have my own webpage at Stanford, and it's easy enough to look up the kinds of stuff I do there, as well as my Google Scholar profile, which will always be up to date with the publications we've posted.
hugo: Fantastic. We'll put all those links in the show notes. I just want to thank you not only for your time and expertise, but also for the deep clarity with which you speak. I've learned a lot in this conversation, and I'm sure a lot of our viewers will as well.
ramesh: Great. Yeah. Thanks for having me. It was a lot of fun.
hugo: Thanks so much for listening to High Signal, brought to you by Delfina. If you enjoyed this episode, [00:51:00] don't forget to sign up to our newsletter, follow us on YouTube, and share the podcast with your friends. All the links are in the show notes. We'll catch you next time.