HS-Tingley-final === [00:00:00] Level three is about really recognizing the vastness of our ignorance and realizing that we can use AB testing to inform and optimize decisions. The bottleneck that remains is you're using humans to make decisions, and if you're using humans to make decisions, you're still in quite an expensive mode of AB testing. Our bottleneck at level four is the creation of the things, the variants or objects or artifacts we're optimizing over. And so let's take away that bottleneck. Let's put a gen AI system in that loop. That was Martin Tingley, head of Windows experimentation at Microsoft and former head of the experimentation platform and analysis team at Netflix, on why humans are the bottleneck in experimentation, and how a five-level maturity framework points the way towards self-optimizing software. Most organizations, Martin argues, need to break out of running human-led AB tests that work well enough that teams don't realize they could be doing more. The path [00:01:00] forward runs through building flexibility into every decision point so teams can test a multitude of options instead of just several, automating decision making via contextual bandits, and ultimately having a closed loop where generative AI produces, tests, and refines product variants without human intervention. Martin explains how this shift will turn the world of software from a static experience into a fluid, self-correcting system, and it's happening now. We talk about a startup called Coframe, for example, which is already doing this for Fortune 500 e-commerce companies. We also dig into the strategic side of testing, covering heterogeneous treatment effects, how to use experimentation data for capital allocation, and why the permission to play varies so much between an entertainment service like Netflix and a foundational operating system like Windows. If you enjoy these conversations, please leave us a review, give us five stars, subscribe to the newsletter, and share it with your friends. Links are in the show notes. [00:02:00] I'm Hugo Bowne-Anderson, and welcome to High Signal. Hey there Martin, and welcome to the show. Hi, Hugo. Hi, Duncan. How are you both doing today? Wonderful. Good to see you, Martin. So great to have you here and to see you. And I think it's time to congratulate you on a recent one-year work anniversary at Microsoft. That's right, I think we're within about a week of that one-year point at Microsoft. Super cool. You've been leading Windows experimentation. I'd just love to hear what you've been up to and what that looks like. Sure, happy to talk about that. Windows is an amazing piece of software. It has a very rich history, and it has a lot more history than most of the digital products we're used to using. It's a 40-year-old piece of technology, and from the experimentation perspective, that means it actually predates digital distribution. Windows is an organization that has its history in multi-year product roadmaps and shipping on physical discs, and [00:03:00] some of that heritage you can still see in the organization. So that's an important point when we think about how to apply experimentation in the context of Windows. Another point is Windows is an operating system. It is not a web app. It is not a standalone application. It is not a website. It is the operating system, and it is the operating system upon which the world really depends, and there's sort of two consequences to that.
First, just getting the bits out, getting the experience in front of customers, is somewhat harder than with other services we think of as having rich histories of experimentation. Then there's an element of risk tolerance. The world relies on Windows in a very profound way, and so experimentation with Windows is always coupled with this very careful approach to risk tolerance and risk management. Of course, experimentation is a great way to reduce risks as we ship out software, but this real sense of responsibility to the product and our customers is something that I've really enjoyed learning about and [00:04:00] embracing here at Windows. Awesome, and it must be so interesting to have such a change from Netflix, where you had a long tenure and which thinks about experimentation in its own ways as well, and to see the differences between these types of organizations and products. I'm so excited to talk about online experimentation and the impact of generative AI on what's happening now and what's possible in future, particularly with respect to continual learning and deployment of products. But that's what I'd call future music for a lot of people in many forms. And a lot of our listeners are at different stages in their online experimentation journey. So I'm wondering if we could kick off by you just giving us a brief history of online experimentation and where you see most organizations currently at with respect to it? Sure. I think one of the most interesting things that's happened over the last sort of five-ish years is what I would call the [00:05:00] commodification of experimentation. There was a time when the only companies that could run experiments, and by an experiment here we just mean, hey, have a hypothesis about how to improve the experience for your customers, build out that variant, ship that new variant to some of your customers and the existing variant to the others, get back some telemetry, see if you're moving those success metrics, those business metrics, and then make a decision. For a long time, the ability to do this was really only held by the big tech companies who could build out their own infrastructure: places like Google, Amazon, Meta, Netflix, Microsoft with ExP. But what we've seen over the last order-of five years or so is a real proliferation of vendor solutions to experimentation. And so you can go buy a product, and there are dozens of them, that will set up a feature flag management system for you so you can do experience management, that will help you collect back telemetry, that will do the stats and really help you make decisions. So the kind of basic ability [00:06:00] to run experiments is no longer a competitive edge. It's standard. It's what everyone is doing. And so one of the questions that's been on my mind a lot lately is, given that reality, how do you as a company think about deriving real competitive advantage from your experimentation practice, given that the capability is now more widely available than it used to be? As you think about experimentation, Martin, I think you have a concept of different levels of experimentation that you've written about publicly before. Curious if you could expand on what those levels are, how organizations should think about the levels of experimentation, and the common pitfalls or ways that they can go wrong. Love to, and please throw in questions as we go so I don't get on the train tracks here. This is my five-level framework.
I've been talking about this a lot externally, and I've found it's a message that often resonates, so I'm excited to talk about it here. At level one, we're not even doing experimentation. Level one is shipping products based on [00:07:00] conviction, based on doing market research and just firing something out there to the world to see how it lands, without measurement. And this is often the realm of entirely new product categories. I think the best recent example here is OpenAI just YOLOed ChatGPT out to the universe and found amazing product-market fit. That wasn't a controlled experiment. They weren't testing ChatGPT against an existing thing. They were just firing it out there. That's also the realm of Windows pre digital distribution, when Windows had to ship on physical discs and it was really hard to update that experience, and so you're just shipping out something and seeing what happens. So that's level one. Level two is where we get into product experimentation that will be familiar to most folks. At level two, the goal is to have a really specific hypothesis about how to drive value for your business or your customers, and then the product team goes and builds that variant, and then it's an AB test, the challenger versus that incumbent experience. [00:08:00] You look at metrics and you make a decision. So one of the examples I like to use at this level is the Netflix Top 10 row. The Top 10 row is available on Netflix throughout the world now, but back years ago, that was a test. Somebody had this hypothesis that showing what was popular on Netflix, showing a top 10 list as many entertainment services do, really helps members plug into the cultural zeitgeist and helps them find something to watch. So the team builds out this Top 10 feature, explores it, tests it in several markets, realizes it's a winner, drives the key metrics the organization was looking at, and then it gets deployed globally. And Duncan, you mentioned pitfalls. I think organizations can get stuck at this level-two sort of state of experimentation. I've been thinking about this a lot over the years, and it's led to this framework, but I think they get stuck because everything is working okay. And when everything's working okay, it's hard to recognize that there's a better solution, that there's more that's possible. [00:09:00] The feature teams are running experiments, experiments are part of the DNA of the company, and so experimentation is happening. The types of experiments tend to be relatively large investments; again, think of that Netflix Top 10 example: a lot of engineering and design and PM work went into pulling that off. And so it's the world where individual experiments are celebrated; memos and emails are sent out to the whole company when things work well. And because of that kind of style of experimentation, the experimentation platform, the infrastructure for executing on experiments, the folks running that aren't really incentivized to go change up the paradigm, because they're not getting pushback from the product teams; from the product teams' perspective, things are working well. I have this thesis that things could get better, and that what you tend to see in that type of realm is diminishing returns as you pick off all of the big low-hanging fruit. Your experiments stay expensive, the returns might diminish, and you might slip into ROI-negative territory. [00:10:00] So I think that's a big pitfall.
I love that example, and correct me if I'm wrong, but you're also being incredibly humble, Martin, because Netflix tested the Top 10 row when you were leading experimentation there, and it was your team that did it. Yeah, it was our platform that executed that test. Yep. Super cool. I am wondering, with respect to tests like that, about another pitfall that you mentioned in your wonderful HBR piece that I'll link to in the show notes: heterogeneity of users, right? So how do you think about things like Top 10 or other types of experiments in North America compared to India? Yeah, it's a great question, and one of the big themes of that article you're talking about is, hey, looking at the mean is not enough. A lot of AB tests may have differential results across different user segments that are really important to you, right? Some are only impacting your power users, and depending on your business model, that may be a good thing or a bad thing. There are often geographic impacts of [00:11:00] experiments; certainly when you think about entertainment services, the way that consumers want to consume and access entertainment varies a lot by region. So it's important to think through, as you consider the results of an experiment, not only what did this do overall for my business, but what did this do for important subsets of my user base or my customer base. That's a great way to then learn more about that customer or user base and motivate future tests, maybe make decisions about modifying the core experience so it can vary geographically, give slightly different experiences to different user bases. And before moving further up your hierarchy or ladder of experimentation: we all know William Gibson's statement that the future is already here, it's just not very evenly distributed. In this case, I feel like the past is already here and it's not evenly distributed, in that we have things that people in online-experimentation-native companies such as Netflix, [00:12:00] LinkedIn, Booking.com, and Uber, of course among many others, have been doing for over a decade, but other people aren't. So I'm just wondering how you see the distribution currently of these affordances across organizations. Oh yeah, I think most of the world is at level two, let me be clear. And I think it's a really good place to be. I just think there's a way to achieve more and deliver better results for your company by thinking about what it means to go above level two. And maybe that's a good segue, Hugo, into how I define level three. I love that. And so in my mind, at level two, the people making the decisions about exactly what to test are the product team. You're gonna go out to market with an AB test of a specific version of that Top 10 experience, for example. Level three is about really recognizing the vastness of our ignorance and realizing that we can use AB testing to inform and [00:13:00] optimize decisions. And so the goal is, as a product team, every time you need to make a decision, instead add optionality into your code base. And so as an example here, say you're running a subscription service and you want to update the plan selection page. This is a super important decision point for your potential customers. It's where they're gonna tell you how much money they're going to spend. It's where you can put all of your calls to action, all your value propositions, like, hey, premium has these features.
Look at this great discount plan we have for cost-conscious consumers, right? And so thinking about something like a plan selection page, the product team has to make an enormous number of specific decisions, like: what is the ordering of the plans? Which versions of these calls to action work better? How should I stack them? What should I lead with? And the big transition at level three is to simply encode all of that as options within your code base and recognize that, through iterative AB [00:14:00] testing, you can simply optimize over the parameter space that you've defined. Mentally, this is a shift from "I'm gonna AB test a product variant" to "I'm going to set up this parameterized space and use AB testing to optimize over it." And for a lot of data scientists, this is a mental leap. It's a change in thinking from this highly scientific process of AB testing to instead recognizing that what we want to do as a product team is optimize; AB testing is a mechanism we have for optimization. I wonder if you can tell us what type of companies are at level three currently. I think a lot of companies do small things at level three. They may try a few different variants of an experience. For example, I've seen multivariate AB tests where you have one variant that's called design's best guess, or simplest possible version, or rich text version, or [00:15:00] things like that, which suggests that folks really are building a certain amount of flexibility into the code base. I think the real rallying cry of level three is to take that to the next step, to recognize: no, don't just go with design's best guess; leverage that design team to really identify the important choice points and then fully embrace AB testing as the mechanism to make those decisions. So I think a lot of companies operate at that level-two boundary into level three, and maybe a big takeaway that I hope folks will get from this session is to fully embrace level three: experimentation is optimization. Parameterize your code base. Let loose multivariate, iterative AB testing. And always, as you're shipping something new, make the commitment that you're gonna find the best possible version of that thing via AB testing. To what extent, Martin, is adoption of level three heavily a function of the type of product or experience you're optimizing? Granted, the Netflix homepage gets [00:16:00] so many views, you can obviously optimize it like crazy. Most companies do not have anywhere near the volume of usage that a product like Netflix or, for that matter, Windows would have. And so to what extent do you think the obstacle to moving from level two to level three is often the actual opportunity in front of the business, versus more institutional friction, or maybe not even understanding the opportunity that could be there? We'd love to unpack that a little bit. Yeah, it's a great question, Duncan, and there's a reason companies start at level two: you have to build up that kind of institutional capacity, and many small companies have relatively small user bases or customer bases, so there's only so much learning you can do. But I still think there's a lot of value in thinking through the lens of optimization, right? We're using this formal scientific practice of AB experimentation, but it's in service of optimizing for your business. And so if you're a [00:17:00] smaller company with a smaller member base, you can't micro-optimize the experience.
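To make that level-three framing concrete, here is a minimal Python sketch of the idea Martin describes: encode each decision on a plan selection page as a parameter, enumerate the variant space, and bucket users deterministically so iterative AB testing can optimize over the space. The parameter names, values, and experiment name are hypothetical illustrations, not anything a specific company actually tests.

```python
# A minimal sketch of "level three": encode each product decision as a parameter,
# enumerate the variant space, and let iterative AB testing pick the winner.
# All parameter names and options here are hypothetical.
import hashlib
import itertools

# Each decision point on the plan selection page becomes a parameter with options.
PARAMETER_SPACE = {
    "plan_order": [("basic", "standard", "premium"), ("premium", "standard", "basic")],
    "cta_copy": ["Start watching", "Try premium free"],
    "show_discount_banner": [True, False],
}

# The full variant space is the cross product of all options (2 x 2 x 2 = 8 variants here).
VARIANTS = [
    dict(zip(PARAMETER_SPACE, combo))
    for combo in itertools.product(*PARAMETER_SPACE.values())
]

def assign_variant(user_id: str, experiment: str = "plan_page_layout") -> dict:
    """Deterministically hash a user into one of the variants (a standard bucketing trick)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]

# In production, conversion telemetry per variant would feed a stats engine that
# prunes losers and seeds the next round; here we only show the assignment side.
if __name__ == "__main__":
    for uid in ["user-1", "user-2", "user-3"]:
        print(uid, assign_variant(uid))
```

The point is the mental shift Martin calls out: the code base exposes a parameter space, and the experimentation platform, rather than a one-off build, searches over it.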
And so maybe at that level-three transition, it's about aligning the size of the decisions with the things you're gonna be able to do measurement on. And I think through that framing, there's always an opportunity to check yourself as a product development team and say, wait, we're arguing over what we should do; we should just let our customers vote with their actions on what we should do there. That's great. And maybe part of that is accepting that there actually is relatively high uncertainty around what customers truly want, but that you can think about that formally and actually think through how many different arms you can effectively measure here and distinguish between. And if it's more than two, then you probably should be trying more than two things, right? And I always take a lot of inspiration from the Linus Pauling quote: if you wanna have a good idea, the first thing you need to do is have many ideas. And that really resonates with the type of experimentation we do [00:18:00] in the tech sector. We're frequently crushed when ideas that folks have a lot of conviction in are shot down by paying customers with their actions. And then occasionally we're just surprised on the upside. And so really predicting customer behavior is a real challenge, which is of course why we run the experiments in the first place. So let's just lean into that. I wonder to what extent there's an institutional thing as well that's so important, around accepting that you will be wrong and being okay walking away from ideas that seemed good but just didn't work out. You know, it reminds me of a conversation we had last year with Roberto Matri, who was a VP at Instagram, talking about how they are really thoughtful at Instagram about cutting off losers and having low ego about those types of things. And that can be so hard in organizations, but it is so important if you're going to have lots of ideas and test all of 'em. Yeah, there's a culture of humility that organizations need to [00:19:00] embrace if they're going to leverage experimentation, because experiments are really good for knocking you in the ego, because it's hard to argue with what your customers want. And your customers, in their actions, will often say that ideas the product team had really strong conviction about just don't work. And Duncan, I think there's two parts of it. One is, let's not fall into the sunk cost fallacy here; when ideas aren't working, move on and try something else. But the second part is every experiment is an opportunity for learning. And back to the discussion about those heterogeneous treatment effects: so an idea didn't work; did it work for any subset of the customer base? Can we eke out any interesting observations there? Maybe the experiment didn't work because it confused customers in a particular way, but that points to, say, a new customer need, and that inspires the product team. And not all experiments are shipped out to the whole world, but every single one is an opportunity for learning and an opportunity to keep making that product better. [00:20:00] I love all these threads you're pulling out for level three. And one thing that I think is essential here, part of level three, is making sure you can reduce the marginal cost of a test to near zero. And that involves democratizing online experimentation as much as possible across the org.
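Duncan's earlier point about matching the number of arms to the traffic you can actually measure lends itself to a quick back-of-the-envelope power calculation. A rough sketch, assuming a binary conversion metric, an even split across arms, a 5% significance level and 80% power; the baseline rate, minimum detectable lift, and traffic figure are made-up numbers.

```python
# Rough answer to "how many arms can we effectively distinguish?"
# Assumes a binary conversion metric, even split, two-sided test; all inputs are made up.
from statistics import NormalDist

def users_per_arm(baseline_rate: float, relative_lift: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per arm for a two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    pooled = (p1 + p2) / 2
    n = ((z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
          + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
         / (p2 - p1) ** 2)
    return int(n) + 1

if __name__ == "__main__":
    weekly_traffic = 200_000                              # hypothetical eligible users per week
    per_arm = users_per_arm(baseline_rate=0.04, relative_lift=0.10)
    print(f"~{per_arm:,} users per arm; traffic supports ~{weekly_traffic // per_arm} arms per week")
```

With these made-up inputs the answer comes out to a handful of arms per week of traffic rather than dozens, which is exactly the kind of constraint to check before parameterizing everything.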
And something that's really interesting here, with respect to Duncan's point and your follow-up, Martin, is that you need humility, and to do that you need a shift of incentives as well, right? And you've written about this, and we'll link to your "want your company to get better at experimentation" HBR article, but to get everyone on board and get to level three, you do need incentive shifts as well, right? Yeah, you need incentive shifts. They're all, I think, positive for a business, but they can be hard to execute on. And it really comes down to that humility piece. If you're not running experiments, how do you reward folks? You reward them for shipping stuff, right? Hey, we shipped all these new features out to customers, we must be doing a good job. [00:21:00] Once you shift to leveraging experiments to make decisions, the incentive structure does change. Of the companies that I've spoken with, Meta does the best job at this, of being really disciplined in evaluating product teams based on measured impact to their customer base. But it certainly does change those incentives. And one of the things I love about tech product managers is if you put an incentive in front of them, they're gonna go optimize that, and they're gonna do incredibly well at that because they're amazing people. And so incentive alignment is really important. And so as you're starting to scale out experimentation, you're probably gonna find yourself putting incentives in front of folks that are slightly gameable, but still will have benefits. One of the things I talk a lot about at Windows is let's just get better at running more experiments. We need to run more experiments, we need to take more shots, we need to try more things, we need to get more signal back. It doesn't take a lot of creativity to figure out how you [00:22:00] could game that system. But on the other hand, so what if we game it a little bit? We build the institutional capacity to run tests at high volume, and that's gonna help us, and we can continue to refine and hone those incentives. So before moving on to level four, I agree that building that capacity is incredibly important and we shouldn't over-index on Goodhart's law or hacking incentives. But I am interested in your thoughts on how to mitigate crossed incentives. If you have a bunch of different teams, you know, you ship your org chart, right? But if you have a bunch of teams optimizing for slightly different things, I'll actually give a concrete example. An old friend of mine, Robert Chang, now at Airbnb, did early experimentation at Twitter. They had an issue about notifications and emails: the email team, part of the marketing team, wanted to send emails out every time something happened to you; the people doing the feed wanted to do something else; and users got completely bombarded, right? So how do you think about setting up infrastructure and organization to mitigate these risks? It's a great question, and it's one we think about a [00:23:00] lot at Windows, 'cause Windows is one of the central nodes of the Microsoft ecosystem. There's a lot of business that flows through Windows, and a lot of other Microsoft experiences that flow through Windows. There's the old joke from OR papers: assume your favorite utility function. And what you're really saying here is that getting that utility function right is important. And it's also the hard part.
Which is why, in the kind of causal inference community, this notion of how do you use short-term signals to predict long-term outcomes is such a massive and important problem. I would say, as a personal position, I think getting that utility function correct is slightly less important than simply getting agreement on it. You want the whole organization to be swimming in the same direction, and I think you'll get better outcomes if you achieve that, even if the utility function is slightly imperfect. So if everyone is optimizing towards a given metric that you've got a pretty good idea is reflective of long-term [00:24:00] value for the business, you're in a position to make coherent decisions. And maybe, Hugo, to link that back to a comment you made about the importance of democratizing experimentation: if you are going to democratize experimentation, you need a coherent way to make democratized decisions, which usually means you need to set up a clear metric that the whole organization is going after. Once again, that's something that, to your point, Meta, Facebook, has been very good at, on and off since then. And to what extent, Martin, does thoughtful metrics setting, and almost just good management, play a really important role here? At least at Uber, when we thought of metrics like counting the number of experiments people were doing, that was a useful input, but it wasn't ever the only input. It was also important to measure impact on our users, on the metrics we cared about, and, taken collectively, if you're looking at overall impact on business metrics and number of experiments, that gives you a [00:25:00] picture of velocity across the organization, and it makes sure that the experiments people are running actually are meaningful and doing the right things for users. It never is number of experiments in a vacuum; that as a metric in a vacuum is not very helpful. Is that kind of how you would think about it, or curious if you could expand on that? Yeah, number of experiments is always a good metric for the platform itself, 'cause it gives a sense of throughput, the overall kind of capacity to learn from experiments. A few of us at Netflix, my last couple years there, started playing around quite heavily with an idea we called experimentation programs. And the goal there is to just recognize that within your organization you can break out the experimentation that's going on into different parts of your product. Just copy the org structure as a starting point, and if you start collecting the right metadata and can map every experiment to the program it's part of, then you can just start making really simple plots, like what is the distribution of treatment effects that this [00:26:00] program is running? How frequently are they launching experiments? And our goal in doing that was to really leverage analysis and insights from AB testing, from the individual experiment level to the product strategy level. And so one of the things you might observe from plots like that is, hey, program A is running a whole bunch of experiments. They all have really small treatment effects. At low frequency, but with some consistency, they are finding things that move the needle in the positive direction. They should probably just automate the living daylights out of their system and let it run. Maybe on the other hand, program B, they're not running many experiments, but those they do launch are really moving the needle,
usually in a negative direction, occasionally positive. And so probably that team needs to figure out how to run more experiments, 'cause they don't have much capacity, but they're finding real customer sensitivity. A big thing we started pushing was: how do we provide summary information over sets of [00:27:00] experiments that can actually inform product strategy itself, versus experimentation just being that tactical decision-making lever? Almost, there you're learning about where the actual levers in the business are, the ones that you have leverage over and that you can move. Yeah. And then how do you inform the managerial layer on how to deploy capital, human capital or financial capital, effectively across them? Yeah, so in the framing as an optimization problem, I think most frontline product managers, their goal is to make a metric go up as fast as possible. The goal of a chief product officer is to make the sum of all of those charts go up as fast as possible, which is suddenly, to your point, a resource or capital allocation problem. So just to wrap up level three: essentially reducing the marginal cost of a test to near zero for as many people as possible, investing in platforms that automatically test and optimize, and leveraging the power of parameter optimization and [00:28:00] hill climbing to do this. Now, what's the bottleneck then? So the bottleneck that remains at level three is you're using humans to make decisions. And if you're using humans to make decisions, you're still in quite an expensive mode of AB testing, even if you reduce the marginal cost of executing on experiments to be relatively low. So at level four, the kind of mental leap is to cede decision making to your tooling versus the humans, and so this is the realm of contextual bandits. This is the realm of micro-optimizations and explore-exploit algorithms. The goal is to find very small, repeated parts of your product where you can generate multiple variants. And so two examples would be, say, the artwork on Netflix: on the homepage, every piece of artwork you see is actually selected for you from a set of, say, order n; I can't remember what n is. [00:29:00] Another example would be, familiar to marketers, you're sending out a lot of emails and you want to optimize and personalize the subject line. And so these are two cases that have the same flavor, and at level four, the structure is: you arrive, through probably a human-based process, at a set of candidates, so a set of artworks, a set of candidate subject lines for your marketing emails. Then you unleash an explore-exploit or contextual bandit system on that, which says, hey, let me give a random variant to some small subset of my customer base, let me collect information back from those randomly allocated users, and then do targeted or personalized treatment. So let me figure out from that random allocation: for customers like Duncan, we're gonna send you subject line A; for customers like you, Hugo, down in Australia, we're gonna give you subject line B. We can exploit that learning. And again, the big step here is you've ceded the decision making on that AB test [00:30:00] to the machine. You need to do that because these examples we're talking about are very small, right? We're talking about the artwork for a title on Netflix; Netflix has thousands of titles. We're talking about the subject line for a single marketing email campaign.
You wanna run hundreds or more of those as a large organization. So again, this theme: you've gotta drive down the marginal cost, and here we've done that by ceding decision making to the machine. That allows us to run many experiments and get small optimizations across many small product surfaces. This is great, because I thought about mentioning this earlier: I've seen a not insignificant number of organizations get to level three and have the promise of acting on what the data says, and yet part of the promise is it's no longer HiPPO, right? It's no longer the highest paid person's opinion. And yet when you have it being humans who need to click the buttons and that type of stuff, you can do all the experiments you want, but still have a culture of someone being like, oh, that doesn't [00:31:00] align with what I thought, so let's do the other thing anyway. Yeah, certainly some folks could have such conviction in their ideas that they're willing to override the data. I think that's okay in limited uses. I'll say experimentation doesn't replace sound business strategy, and sometimes as a business you do need to make bets. You do need to say, okay, I can look this data in the eye, but I can also try to look into the future. The problem is an over-reliance on that. So Hugo, it's a great point that at level four we shut down the possibility for doing that by ceding and automating the decision process itself. One of the reasons that works is there's just too many decisions to be made for humans to be actively involved in each of them. And correspondingly, each decision is relatively small stakes, so it's not worthwhile to have humans in that loop. And then the goal with those level-four systems is you win as a business in aggregate by eking out very small gains many times. Such a good point. And I'd be interested in your thoughts on this, Duncan, because [00:32:00] to your point, Martin, this isn't a replacement for sound product strategy and innovation. And one big concern with online experimentation is climbing to local maxima, or down to local minima. We always say, given a horse and cart, an online experiment wouldn't have come up with the car. And to that point, Duncan, you worked on some really sophisticated, important products at Uber which an online experiment wouldn't have told you to experiment with, something like UberPOOL. That's right. So I guess I think that within any given product family, there can be maybe a few different levels of experimentation at play. There can be the decision of: should we have a product like UberPOOL at all? And there can also be a decision of: how do we finely tune all of the parameters that enable UberPOOL to operate so that the users get the best experience [00:33:00] possible given market conditions? And within that, there can be lots of micro experiments where you're heavily figuring out what is the best way to
And so I think they all go together, but I think that it's really exciting and powerful as you enable that, that level four of experimentation and lean heavily into it and recognize that there's just enormous heterogeneity in the world at large that you can take advantage of [00:34:00] and thereby improve customer experiences. Love it Duncan. And so launching something like UberPOOL, that's back in level one, and then I like how you described. Then once that's launched, we start hill climbing, we can start leveraging these higher experimentation velocity solutions to really start climbing those hills. So now Martin, we have quote unquote, the machine being very much in the loop with making decisions. The bottleneck now is product and content generation. How can we leverage generative AI to take us to level five? Thank you for the setup, Hugo, and you've nailed it there. Our bottleneck at level four is the creation of the things, the variance or objects or artifacts we're optimizing over. And so let's take away that bottleneck. Let's put a gen AI system in that loop. So let's use that subject line for marketing emails. Example. We can ask the machine, the Gen AI system for say, half a dozen variants [00:35:00] that, that meet our campaign objectives. We can send those into our level four Explore exploit system. Then what we can do is get the machine to analyze the results of that, and the machine might find K variants three and four. Those ones are doing really well. So why don't we go back to the Gen AI system and say, make more variants along the lines of number three and four, and then inject those back into the system. So what we've done is we've really closed the loop. We've really established there's a generation system. Which in this case we've just moved from, from humans to ai and there's an evaluation system, which is our, our experimentation platform. The other part, which I classify as level five, is recognizing the parts of product experiences where we can do this, are getting bigger all of the time. So the folks who've done a lot of work on a hyper parameter optimization for algorithms are sort of used to this, right? You say, I've got some hyper parameters. I need to tune them. Let me try some variants and let me [00:36:00] spin that loop and subtly improve that rexi. With Gen AI text, we can basically get variants for free. Images, we can get variants for free whole UI components. We can get variants for free. And so the amount of the product experience, which we can start viewing is very dynamic and very flexible, is evolving on us in a really profound way. And so what does it look like if we take those level three ideas, right? Parameterize, the whole code base. We take those level four ideas, which is we can serve out different variants to different people and we can optimize, given a limited set of variants. And then this level five idea of just use gen AI to make additional variants, and suddenly the whole product itself is fluid, it's self optimizing, and it's not really a static, should it be A or B? It's just. Getting better all of the time through these closed loop of generate variants, test them, personalize the [00:37:00] outcomes, generate new variants, test them, personalize the outcomes. And I know this is starting to to sound a little bit like sci-fi, but I honestly believe this is where we're going. 
Well, I was actually gonna ask about that. It sounds like what I called future music before, but last time we spoke you mentioned a platform called Coframe, which actually grounded this really well for me. So maybe you could just tell us a bit about Coframe. Yeah, so Coframe is a startup, led by CEO Josh Payne. I met him at a conference in Texas in the fall where I gave this talk, and then somebody pointed out, hey, there's this company Coframe here, I think they're doing this. And so Coframe, they're open for business. They have Fortune 500 customers, and they're really taking this level-five concept and applying it to prominent web pages and landing pages, I think primarily for e-commerce companies. What they've seen is they can get production-level variants to go to market with for AB tests in hours instead of weeks, and they're able to quite rapidly [00:38:00] 10x the kind of experimentation throughput of these companies they work with by viewing this generation problem through an AI lens. And then you use the gen AI system to create variants of these landing pages, send them out to market, you get signals back, you feed that back into the gen AI system so it learns what's working, generate new variants, and so it's very much this dynamic, perpetually learning system. One of the questions that comes up with AI is how do you keep it under control? And so at this level five, and I know Coframe is doing this, it's really important to provide the model with the right context around brand voice, brand guidelines, hard limits that it cannot exceed, to keep that gen AI system under control. And also to that point, though, when you go up the levels, some of those guardrails you'll have built in at this point; you will have done that early, when you democratized online experimentation. You do things like that to make [00:39:00] sure that people don't do experiments that perhaps they shouldn't, as well.
But that came from the sense that the recommendations provided to you are different from me because your viewing habits are different. But what if your like UI preferences are fundamentally different? Can the whole thing look completely different for you versus me? For folks to our earlier conversation in different geographic regions that have different preferences around density versus simplicity of information. And so I feel like the lids getting blown [00:41:00] off and. The set of things that we can optimize over in a digital product using these techniques is unlimited right now. And I'm really curious to see what this means for how products are gonna evolve and for how consumer tastes are gonna evolve. I love that. And I actually quick moving towards a conversation about some form of malleable software as people like Jeffrey Lit and, and, and such, such write about. And I actually spoke with some agentic search leaders the other day. And they were even thinking through search experiences where let's say e-commerce search experience, you're at a clothing shop or shoe shop and you have particularly wide feet, the product learns that and it gives you a dropdown to tell you like the width of shoes, not only the length of shoes or something like that. So moving to that form of personalization. Yeah, that's a great example. And, and so. Why, why not? There's a million examples like that where kind of the level of personalization and [00:42:00] learning, there's just much more that can be done in that space at relatively low cost. And the real theme I hope that's come through today is as we go up these levels, one of the important considerations is to keep driving down the marginal cost of your next experiment. And so at level five, we've said building software is becoming free. What does that mean for building digital products? And I don't have all the answers there. I'm not, certainly not pretending that, but I think it means you can try out a lot more things and you can do a lot more personalization. Absolutely. I am also interested in tying this back to product strategy, product innovation, and I think that one end game of personalization. Reduces shared context and shared reality for end users, right? For example, Spotify, I don't know whether they think about this. So this is just a hallucinated, human hallucinated example. Spotify could potentially create personalized music that [00:43:00] just speaks to Hugo's tastes based on what I've listened to. But then I can't go to the pub in Australia and sing the songs with my friends. Right. So there, there's a trade off there. So how do we think of the broader level about retaining shared context and reality while personalizing experiences? I, it's a great question and I. One of, one of the challenges that most digital products that include recommendation systems face is the balance between giving you the exact thing you wanna say in the Spotify example. Listen to right now. Versus giving you opportunities to explore outside of that horizon, which may keep you more engaged in the future. And so I, I wonder if there's a mapping from the problem you described to, to that problem, namely, if Spotify only gives you like the exact perfect song loop for you, that kind of resonates with you. [00:44:00] It might work in the short term, but I think it puts you too far down a very narrow hole and your long-term retention with the product might suffer. 
And so I really do wonder if this is a short-term versus long-term optimization problem, which again comes back to getting the right metrics to go after in that optimization framing. So Martin, what implications does climbing up this five-step ladder of generative experimentation have for product, data, ML, and AI leadership? Let me hit on a few of those. We've talked a lot about how, as we move up this ladder, we're reducing marginal costs and we're ceding more and more decision making, and then variant generation, to automated systems. So from the data science perspective, it makes the job a whole lot more fun if you're in an experimentation data science role, because you can move from writing lab reports about individual experiments to performing [00:45:00] really interesting analyses that span many tests and look for patterns across tests. And we talked earlier about how at level five you have a generation system and an evaluation system, and what you can start doing is running experiments on those. What are the inputs to the generation system that result in the production of higher-quality variants that do well in your evaluation system? How do we tune or optimize that evaluation system to maximize the outcomes for the business? And those are more strategic-level problems for the data scientists to go after. So one of the reasons I'm really excited about climbing these levels is it makes data science jobs more interesting and engaging. And then more on the product development side, maybe the product management side, it's a move from thinking about "I'm gonna build this feature" to "I'm a shepherd of a system, and my goal is to help shepherd this system in a way that optimizes for the business." And so it requires a lot of [00:46:00] systems-level thinking. It requires understanding the technology. Duncan, to your earlier point, how do you identify the areas in your product that are ready to climb to the next level of this ladder, and then how do you enable that to happen? And so folks have to stop thinking narrowly about "I have a roadmap of features" and move towards "I have this system and I wanna nudge it and add guardrails to make sure it's evolving in the right direction." I'm wondering what changes at levels four and five for organizations to be able to build out these muscles. For example, at level four, I think there's a common conception that reinforcement learning can be tough, both in terms of infrastructure and technology frameworks, and in hiring good talent to do it as well. Yeah, it's a great question, and one of the things I find so interesting is these systems we talk about at level four are all about continuously running AB tests. The foundations of this, the science, are very similar to an AB test. We're [00:47:00] giving different users different experiences, we're collecting telemetry, we're making decisions; we're just doing that in this automated way. But when I look at a place like Netflix, and I know this is true at other companies as well, the tech stacks for running standard AB tests and for running things like contextual bandits tend to be quite independent, and
The other hand we've seen, as we talked about at the top, the commodification of AV testing. I suspect we're gonna see the same thing here. It's a ripe opportunity for startups to come and build solutions and sell them. I like that. The other thing I'm actually interested in, and of course. I speak with Duncan a lot and he's influenced me here, but it seems to me, you know, I've worked in data science for an ML for over a decade now, and I don't think there's been enough, [00:48:00] like economists come into data science and the types of tools and techniques from causal inference, but also thinking robustly about concepts such as equilibrium states. And what I mean by that, and I'm you, I'm sure you've put the pieces together now, but for the. For the listener and viewer. When you do an online experiment, you are perturbing a system away from steady state, so you almost need to wait until it relaxes in some sense, to see the actual results. So thinking about skill sets and ways of thinking, do we need more people with this type of causal inference and econ backgrounds and thinking in the space? 100%. I recently hired an economist. They haven't started yet. I'm excited for them to join us at Windows, and I was super excited. To, to bring them on board. For all of those reasons, economists bring a different way of framing problems, a different set of skills, and absolutely, that's a, an increasingly important part of data science. Science, I think, is to make sure teams [00:49:00] have a real diversity of perspectives, diversity of problem solving capabilities. One of the interesting points you make is around like. Another bottleneck to the speed of experimentation, which is how long does that system take to relax back to equilibrium? And so a lot of folk wisdom out there, but certainly for UI changes, you don't want to change it and make a decision the next day. You've gotta let users adjust to that new experience and try to measure long-term behavior versus that immediate reaction to a UI change. Totally. And the other point, and you've made this point elsewhere. Make changes that have rationale and meaning behind 'em. This is actually why I always hated the that the one example that was everywhere, you know, 10 in the past 15 years was like, change the color of a button for an AB test. And there was never a hypothesis why this would be I impactful. So I suppose I'm speaking to something you've written about, which is hypothesis driven [00:50:00] online experimentation. Yes, but the attentive listener will realize I going to contradict myself, and that's fine. Like particularly at these lower levels we've talked about, I really do believe you need really good hypotheses, right? Like the top 10 example, you read that hypothesis, it makes a lot of sense. Other companies show top 10 lists, way to plug into what everyone else is watching. Let's try it out. But then when we get to these higher levels, it is a little bit more of like throw it to the wall and see what sticks. On the other hand, you're talking probably about smaller optimizations to the product as well. I talked to somebody I worked with in grad school about the kind of work I do now in online experimentation, and his comment was, oh, you don't have the governing equations. That's your whole problem. And I thought, yeah, that's true. And so one of the things I think we're trying to do with online experimentation, particularly at those higher levels, is like. 
How quickly can we adapt to [00:51:00] changing customer preferences and changing customer expectations? And there it's not so much a really firm causal hypothesis. It's more like, can I stay ahead of the curve somehow? And I recognize those thoughts are starting to get a bit fuzzy, but it's another thing that's been on my mind. No, it actually makes a lot of sense because the reason, several reasons you want hypothesis driven on the early stages of the latter are due to the costs you incur or the cost of e experimentation, right? So as. Certain marginal costs go down. You're afforded more ability to experiment. I suppose that, and this is a point that you've implicitly made here. There is another cost of, if I roll something wacky out to the user, like something wild that makes no sense, like that's a bad experience. So there's a cost of potentially bad experience, but as you point out, we're doing smaller optimizations and climbing smaller hills, so the chance of that is less likely. Yeah. I, I like that. And the, it's also a good point that we're never gonna drive the cost of [00:52:00] experimentation to exactly zero, because I think there always is a cost to changing that experience on your customer or on, on your, your user. And so to circle right back to Windows, that's something that I've been trying to internalize in my work at Windows so that the permission to play did. The amount of change that users are willing to put up with. I think it varies a lot based on the type of product you have. Leveraging my own experience, Netflix, it always felt like we, we had a lot of sort of permission to play. It's an entertainment service and you open it up and the UI is different. Maybe that's just gonna be exciting. Folks who use Windows, they're often opening that machine. They have specific tasks. They wanna achieve those tasks very quickly. And so we have to be a little bit more mindful about that. Change management. I think, and I don't know how much this impacts the difference here, but something that I haven't thought about this but that just came to mind is the ability to update or. Change. The Netflix experience is continuous, it's calculus in some sense, right? Whereas [00:53:00] with something like Windows, there are updates and such things. So the feedback loop is actually qualitatively different, right? The, and I mentioned this at the top, just getting the dips out to the machines on Windows is a. Actually a pretty substantial ordeal, right? And so that that acts as a natural limiter to how much we can evolve that experience, how quickly we can evolve it, and how we can do that in a really safe way. And there's a cultural thing for users, right? I mean, like I'm, me and my family or whatever, I find getting different experiences on Netflix. Whereas if my operating system was different to the person next to mine, that may be a concern as well. Right. Certainly if you're, say, teaching someone in your family how to use their Windows machine and the two machines look different, and so maybe to tease out a theme there, it's like rec, recognize the product you're building and recognize what it means to, to change that experience on your customer. Yeah. [00:54:00] So it's time to wrap up, but I'm wondering for a company that wants to start moving up the hierarchy today, what's one high signal piece of advice you'd give them to start the journey? The high signal piece of advice is to find the bottleneck in your experimentation, practice and tooling, and get rid of it. 
And so figure out how to reduce the marginal cost of experimentation, experiment more, figure out what your new bottleneck is, and get rid of it. I firmly believe there is massive opportunity cost in not experimenting, and so the high-signal advice I would have is figure out how to experiment more and figure out how to climb that ladder. Amazing. Thank you so much for not only such a wonderful conversation, but taking the time to come and share your wisdom and all your hard-learned lessons with us, Martin. It's been my pleasure, Hugo. Thanks so much for listening to High Signal, brought to you by Delphina. If you enjoyed this episode, don't forget to sign up for our newsletter, follow us on YouTube, and share the podcast with your friends and colleagues. [00:55:00] Like and subscribe on YouTube, and give us five stars and a review on iTunes and Spotify. This will help us bring you more of the conversations you love. All the links are in the show notes. We'll catch you next time.