The following is a rough transcript which has not been revised by Vanishing Gradients or the guest. Please check with us before using any quotations from this transcript. Thank you. hugo bowne-anderson 0:00 Hugo Bowne-Anderson here, the host of the data science, ML and AI podcast Vanishing Gradients, for a livestream with Vincent Warmerdam. Vincent, g'day, how are you? Speaker 1 0:10 Good. So, yeah, you're in Australia. I'm definitely in the Netherlands. The kid's awake, we just did the whole morning ritual, and life is good, but we did have to go through a morning ritual. hugo bowne-anderson 0:23 I can appreciate that. So, as Vincent mentioned, everyone, he's in the Netherlands and I'm in Sydney, Australia, so we're at opposite ends of Tuesday. If you wouldn't mind introducing yourself in the chat, let us know where you're calling in from, where you're watching from, and what your interest in such things is. Do you work in AI or machine learning or data science? Have you seen some of Vincent's videos? I've got to say I'm so excited to be here today. I do want to say a few things by way of introduction. Vincent is a senior data professional and machine learning engineer, among many other things, at Probabl, which is a relatively recently formed company that is the exclusive brand operator of scikit-learn. Vincent is known for challenging common assumptions and exploring innovative approaches in data science and machine learning, which is one of the many reasons I invited him on here today. We're here to talk about what an interesting, rapidly accelerating space we work in, but perhaps also about ways we can rethink what we do in data science, machine learning and AI. We're going to talk about a bunch of projects Vincent works on, make some cool announcements as well, and wrap up with some live coding and live demos. Please do ask questions in the chat, and we'll do what we can to feed them into the conversation. Also, the obligatory: if you enjoy such things, please like and subscribe and share with friends as well. But without further ado, I want to jump in, and maybe, Vincent, I've said a bunch about you, but maybe you can introduce yourself and let us know how we ended up here today. Speaker 1 2:09 Okay, well, it was a nice summer in 1988 when I was born in New York, and hugo bowne-anderson 2:17 I didn't know it was New York. I knew you were born in America, though. Speaker 1 2:19 I was born in the United States, yes, but I was only there for a few years, up until I was two. There are pictures of me at Sesame Street, the theme park, in New York. Then I moved back to the Netherlands. I did do two years of junior high school in the US, one in Boston, one in California, in Cupertino. I lived there for a year, a block away from Steve Jobs, as I learned years later, because Cupertino is a small town. I did my high school and college in the Netherlands, and I did have one year of design engineering as a major before I switched to maths. I do think that's relevant, because one thing you learn when you're doing design engineering is to always check your constraints, test your assumptions, and also things like: if you're going to be designing a wheelchair, the best way to do that is to actually use a wheelchair for a week and see what that's like.
hugo bowne-anderson 3:11 Who would have thought? Yeah, well, I mean, Speaker 1 3:15 I noticed that was a stark contrast with the mechanical engineering department, because of the way they learned how to build, say, a wheelchair. I will say this was a fun competition back in my day: they would give everyone motors and parts and see who could build the best motorized wheelchair. That was one of their challenges, and that's a very mechanical way of looking at a wheelchair, and there's merit to that. But that sort of design thinking is something that stuck with me, even though I decided not to go for a career in it. Like, hey, you've got the theory, that's nice, but in practice things can still be broken because the UI is bad or whatever. And I guess that's part of the story of how I got here. When I was in college I studied math, but I had lots of extracurricular activities. I was a bartender at one of the most famous Dutch comedy theaters, and I do like to think that if you're watching a comedy show twice a week in the Netherlands, that has an effect on your presentation skills as well. So I guess that's my background, and if you then look at what I'm doing in data science today, that background helps paint the picture. I think that's the short story. I did a bit of consulting, did a bit of life at startups. After consulting I was at Rasa for a bit, then I was at Explosion for a bit, the company behind spaCy, where I was one of the core devs on Prodigy, which was a lot of fun. And now I do stuff over at Probabl, or "pro-bah-bluh", which is how a lot of us pronounce it, since a lot of folks at the company are French. We are a company that tries to figure out a nice funding model for scikit-learn. We have a bunch of the scikit-learn maintainers working for the group, and we're looking for ways to figure out the funding behind that and make a proper company around it, hopefully a nice, big one. It's nice to have more European tech companies; that's also kind of the dream here. But yeah, I'm part of that effort at this point in time. hugo bowne-anderson 5:06 That's super cool. There are lots of points in your history that I think we'll flesh out in this conversation. But because we're doing a data, ML and AI podcast, let's focus in on working at the bar at the theater, because you actually, and I can link to this talk in the show notes, you have a... so let's step back a bit. As we've said, we're here to talk about rethinking common approaches in data science, and I want to know what inspired this philosophy and how it's evolved over time. But you have a story from your time in a bar that nicely illustrates some of the problems we have in the space. So maybe you can tell that anecdote. Speaker 1 5:50 Yeah, this is one of the anecdotes I give in my talks; it is a fun one. So I was working at a bar, and it was my first year of statistics class, right at the end of the year. And when you're done with that statistics class, they give you the assignment of: now go out there and apply statistics, which, you know, is fair. You don't want to just do theory. You want to find a data set and do something fun with it.
And I remember some of my classmates would go and find, say, the Olympic medal results, and they found that there was a correlation: if you want to win the Summer Olympics, population size helps; if you want to win the Winter Olympics, GDP tends to help. They were doing all sorts of things like that. But I thought, oh man, maybe I want to become a consultant later, so it would be better for me to actually go out and find a business or a company and do the consultancy thing with a data set. And the company I worked for was a theater. Now, this was a small theater, not the big one I would join later, but the small theater was actually wondering: should we expand or not? That was a question they had, and they had the growth numbers for every year, so you could see how many seats on average were taken in the first year, the second year, and so on, and this was year 12 or something like that. So me, being the naive first-year student that I was, I figured, well, that's 12 data points, we can fit a regression line through that, and if the regression line goes up or down and we can back it up with statistics, that'll be good. This perfectly matched the theory in my statistics book. I got a nice little curve fit, the professor was super happy with it, I presented it to the director, and he was super happy with it too. But the conclusion was that the number of seats we were selling was on a downward trend. We weren't selling more seats every year; it definitely looked like the growth had stagnated. And I had a statistical model, and everyone was convinced: okay, Vincent, we should not expand the theater. So far so good, this was super cool. But then after I handed this in, I had to work the night shift again, and the house was super packed. The day after, again, the house was super packed. And at some point it was like, hang on, the last three nights I worked here the house was super packed, but I'm giving the recommendation that we should not expand. Something feels weird about this. Then I started thinking about it a bit more, and it started dawning on me: a theater has a max capacity. If you're booked solid every single evening, then that's the reason you're not growing. And then you can report that, yeah, the growth is stagnating, but that's because you can't sell any more; you're at capacity. So basically the regression I was doing was bogus. It made no sense to think about the problem that way, but the tunnel vision that my professor of course also had didn't prevent him from giving me an A-plus for this work. I guess you could say this was my first experience with a proper vanity metric: it's very easy to take a metric that doesn't necessarily apply to reality and overfit on it. And this story comes back a whole lot of times when I give talks, because I think it's a nice little anecdote. One thing that is kind of ironic, and also funny in hindsight: it was also good that they didn't expand, because five years later or so the Dutch government did a massive budget cut in theater, so they would never have been able to make the money back, right?
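As a concrete illustration of that trap, here is a tiny sketch with made-up attendance numbers (the real figures are not in the transcript): a trend line fitted to year-over-year growth dutifully points down, while the one number the model never sees, the occupancy rate, tells the real story.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up attendance numbers: steady growth until a 450-seat house sells out.
years = np.arange(1, 13).reshape(-1, 1)          # seasons 1 through 12
avg_seats_sold = np.array([150, 190, 230, 270, 310, 350, 390, 430,
                           448, 450, 449, 450])

# A trend line on year-over-year growth dutifully points down ("stagnation").
growth = np.diff(avg_seats_sold)
trend = LinearRegression().fit(years[1:], growth)
print(trend.coef_)                               # negative slope: "don't expand"

# The constraint the model never sees: the house is simply full.
print(avg_seats_sold / 450)                      # occupancy rate per season
```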
Um, but the reasoning was flawed, and that's the main point: if you're arguing from theory, you can fail in practice if you're not careful. And that's a common theme I keep on seeing. hugo bowne-anderson 9:20 Absolutely. So I am interested: this is a case of something that happened at a theater, but what's the general failure mode happening here? Speaker 1 9:31 If I were to describe it in the most general way: what I think has happened in the last 10 years is that at some point someone claimed that data science was going to be the sexiest profession ever. So, lots of people, lots of degrees, lots of curricula, and they all focused on: how do you get the best metric for an algorithm? And that's not necessarily a bad practice, right? You want your accuracy to be high and your mean squared error to be low; that stuff is fair game. But it is also just one way of thinking about the problem, and that's, I think, where some issues can arise. You can optimize for a metric, but what would be even better is worrying about whether the metric matches reality in the first place. In my mind there's a Venn diagram with two circles: one is the stuff that your users care about, and the other is the stuff that we optimize. And in a data science curriculum you are focusing on how you optimize, basically, but you're not worried about whether those two circles overlap. And this phenomenon seems to translate itself into many, many different applications. That's something I keep seeing over and over again. Not everywhere, of course; there are also really solid applications of machine learning, that's totally true. But it's almost as if this is something that can tell me whether someone is junior or senior in their career. If they're senior, they will have seen this go wrong, and they're therefore just a little bit more skeptical about optimal metrics, whereas someone junior might be a little bit more impressed, because they're taught to think this way. hugo bowne-anderson 11:03 Yeah. And so that's why I wanted to open with this example, which I think speaks nicely to how much you advocate for rethinking common approaches in data and machine learning. So I'm wondering, more generally, what inspired this philosophy in you and how has it evolved? Speaker 1 11:24 So I guess there are two stories that come to mind, at least. A while ago I was at the Dutch BBC, the NPO, the Nederlandse Publieke Omroep. If you're Dutch you know what this is, but it's basically the Dutch BBC, and they wanted to have a recommender. I was part of a group that was there; there were also other groups. And these people were actually relatively mature, they did do some of their homework. So they said things like: hey, when you're at the end of a video there are three slots where we can put a recommendation. We like it if people click on the recommendation, but we also want people to actually watch the thing afterwards, so they were aware of the fact that you could be click-baiting. So we care about a metric that doesn't just summarize how many people click on the recommendation; we also care that people actually watch the thing afterwards.
That's the thing we're actually going to judge the algorithm on. We're not going to train on that phenomenon, but we are going to judge the algorithm on it. So that was cool, right? The management there was aware of the fact that you can optimize for the wrong thing, and they had thought about that beforehand. So that was a really great feeling. But there was also stuff missing: they hadn't built an A/B testing system yet. So even though there were people around me, also from this other group, who were definitely into deep learning and telling us, oh, can we do the deep learning thing, please do that, we had to disappoint them by saying, well, we should build an A/B testing system first, because otherwise you can't compare algorithms. It makes no sense to have an algorithm if you cannot judge it. So they said, okay, fair enough, and we built it. And then once the A/B testing system was done, they said, can we do the deep learning thing now? And we said, not really, because we have to test the A/B testing system. So we ran an A/A test, and we did that kind of stuff. So again they said, okay, Vincent, can we now do the deep learning thing? And we said, well, there's actually this other thing we want to do first. We came up with the worst recommender ever. The idea was: we're just going to recommend a random thing, because you can imagine there's an order in which the videos appear, and if people click the first thing anyway, we want to measure that. If 2% of people click the video, and our algorithm only boosts that to, say, 2.1%, then you know your algorithm is not really doing anything good, because it's barely outperforming the random thing. So again the deep learning people said, okay, that actually makes sense, let's do that first. And by this time, of course, it's not just that we're building things; I'm also looking at data, doing a bit of data mining, trying to figure out what people are watching afterwards, et cetera. And then it started to dawn on me that most of the time people just watch the next episode of a show, and that's not in the UI. You have to go back a level and then click the next episode in order to do that. So my first algorithm was just: for the first slot in the recommender, let's have that be the next episode. And no one had thought of that, because the group that was pushing for the deep learning algorithm had not considered that maybe the next episode would be a good idea. And that, to me, was the biggest eye-opener, because that group had some pretty senior people in it. I believe they even had a PhD or two, with backgrounds in writing very impressive papers and all that. But if you're so distracted by "oh my god, I want to use the cool algorithm" that you forget to recommend the next episode, then something really weird is happening in industry. And that was the moment where I started thinking: okay, maybe when I do talks at PyDatas and such, I shouldn't focus on how algorithms work, but
rather on how they're applied and how people are thinking about them. Because if people are forgetting to recommend the next episode, then we are in for a world of hurt when the AI winter happens. That was just such a mind-boggling thing. I think this was about 10 years ago, and around that time I really started rethinking my career as well. Before that, I was really, really worried about understanding all the algorithms, because I thought that was the path forward, but then the clash with reality at some point just became way more interesting to me. So that's what I started focusing on. hugo bowne-anderson 15:29 That's a wonderful example. I've got some questions about that, but you mentioned there was a second example that came to mind as well. Speaker 1 15:36 Yeah. This is kind of a weird one. Around that time I went to a little conference called spaCy IRL; it was the one conference they did. I had never done NLP before, but I was at a PyData and I saw a talk by Lynn Cherny, I hope I pronounce her last name correctly, and she did this one talk about word embeddings, and it just seemed so magical to me that that would even work. So I figured, okay, I need to learn about this NLP stuff. What's the best library? Well, back then it was definitely spaCy, and spaCy looked cool, so okay, I'll go there. And I was hanging out with the people who made spaCy. We went into a bar, then into another one; one bar was super busy, the other one was definitely more spacey. That was the joke I made that got the attention of Ines, and that's how we got the ball rolling. I ended up making some YouTube videos for them. And while I was exploring the spaCy library, at least in spaCy version 2, I noticed something kind of interesting, which was that spaCy took an opinionated approach. Normally, if you're doing scikit-learn and you think about classification, the assumption is that you're dealing with mutually exclusive classes. That is to say, we're doing classification: it is either A or it is B or it is C, right? So we have a photo and there's either a cat or a dog in it. You could do that inside spaCy, but at least in version 2 it was a little bit harder, because the spaCy assumption was: well, there's a sentence and there could be a label we can attach, but we assume the labels are never exclusive. So you can imagine, instead of there being a cat or a dog in a photo, the definition would be more like: there could be a dog in it, and there could be a cat in it, but it's not cat or dog. And I started thinking, that's also kind of interesting. This was definitely version 2; I think in version 3 this was nuanced a bit more, because the community did want mutually exclusive classes to be supported more directly. But it was a little eye-opener for me: scikit-learn offers you a great API, but the way you define the problem is not set in stone; you can also change the way you think about classification algorithms a little. And it makes complete sense, because a photo can have a dog and a cat in it at the same time, but the way you define your classification problem might be: well, it's either cat or dog, right?
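To make that distinction concrete, here is a minimal scikit-learn sketch on invented data: the first framing forces every photo to be exactly one of "cat" or "dog", while the second treats each label as an independent yes/no, so a photo can be both or neither.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up feature vectors for six photos.
X = np.random.RandomState(0).rand(6, 4)

# Mutually exclusive framing: every photo is exactly one of "cat" or "dog".
y_exclusive = ["cat", "dog", "cat", "dog", "cat", "dog"]
exclusive_clf = LogisticRegression().fit(X, y_exclusive)

# Non-exclusive framing: a photo can contain a cat, a dog, both, or neither.
y_labels = [["cat"], ["dog"], ["cat", "dog"], [], ["dog"], ["cat"]]
Y = MultiLabelBinarizer().fit_transform(y_labels)     # one 0/1 column per label
multilabel_clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)

print(exclusive_clf.predict(X[:1]))    # exactly one class comes out
print(multilabel_clf.predict(X[:1]))   # an independent yes/no per label
```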
Yeah, something about the way you define a classification algorithm there also creates an opportunity to mismatch with reality. And the way I saw that in spaCy was, I think, the first time I saw a library take an opinionated stance that mutually exclusive classes are not inherently the way the world works. hugo bowne-anderson 18:21 And on the flip side, I love that example because it reminds me of another one: in one of your talks you give an example of binary classification that uses a probabilistic prediction, where a threshold turns it into a binary classification. Usually we set a threshold above which we'll classify as X and below which we'll classify as Y. And you give a nice counterpoint to this, whereby, let's say the probability is between 0.2 and 0.8, I think is your example, then maybe you don't want to predict either of the things. You only want to make a concrete prediction after a threshold has been passed, and in the middle say: I'm not going to predict this, actually. Speaker 1 19:10 Yeah, that's the "won't predict" flag, that's what I like to call it. Maybe I should introduce this, since you've watched the same video. The thinking here is: hey, we're dealing with a classification problem, and I'm assuming we're dealing with an algorithm that has this predict_proba thing attached. So it doesn't just say class A or B; it also emits some sort of proba value. And then you would hope that that proba value says something about confidence, and therefore you could say: the closer you are to 0.5, the less confident you are about a decision. So maybe we should add some wiggle room there, so that we don't automate every single decision out there. The algorithm is only allowed to automate a decision if it is, quote unquote, confident. That was the premise. But the point I was making in that talk is that that's not enough, right? The idea is of course pretty sound for a lot of algorithms: if your predict_proba is 0.9, you can assume it's more confident than at 0.8. The only issue is that you can still have a proba of 0.999 and the algorithm can still be wrong. So the question is, how much faith do you want to put in that predict_proba value? Is it maybe not enough? And the main point I'm making in that talk is outliers. Imagine the sigmoid picture: there's a splitting line in the middle; on one side we've got blue dots, on the other side we've got red ones. Then we can say everything close to the splitting line is uncertain. But then take an extreme example: go six miles away from the closest blue dot, but on the side of the line where all the blue dots are. It's miles away, we've never seen anything like it. Well, the proba value can still be super high, but the point is unlike anything we've seen before. So maybe a good idea would be not just to take the proba value as a confidence measure; let's also do outlier detection while we're at it. So if we can confirm that this is something so far away from the bell curve that we can't hear it ring anymore, right?
Really, really far away, then we should also just raise the "won't predict" flag. It'll be a different reason to raise the flag, but there is more than one way to not automate a decision. And again, in my mind at least, if you're so trained to think about classification systems that you care about that predict_proba value, and that proba value is going to be your proxy for confidence, then it's going to be hard to take a step back and consider that there are other ways of doing it. That's the point I want to make in that talk. The talk, I think, is called "Constraining Artificial Stupidity", and it's all about techniques you can use to constrain that sort of thing from happening. But to put the general hat on again: it's the same lesson of taking a step back and rethinking the approach, because there's usually more than one way to go about it. hugo bowne-anderson 22:04 Yeah, and that's what I want to get into next. I do want to elucidate something here, which you clarify in that talk, among others. I think part of the challenge is that machine learning algorithms, and statistical modeling more generally, are particularly good at interpolation, but not extrapolation. And when we're dealing with outliers, we're trying to extrapolate to essentially out-of-training-set stuff, where whatever signal we've discovered may just not apply. And, we'll get to Bayesian stuff later, but the posterior after seeing the data is still relatively broad out where the outliers are. Speaker 1 22:45 Yeah. I mean, you can do stuff like: okay, we've got our data set, and, I don't know, Gaussian-mixture-model your way out of it, maybe. You can say: this data set probably has a distribution behind it; if we can model the distribution, we can also do something with the likelihood values, and that might be a way out of it. But another way to think about it, and this is a theme that also comes back in my talks: you can imagine that there's a real-life problem, and we're going to take that real-life problem and translate it into something analytical. Then, in the analytical domain, the tensors flow, we do our thing, and we get a 10% boost or something like that. And that's also kind of what's happening here. In the end I don't care about the classifier; I care about how the classifier reacts to reality. But what's always the case is that we make a translation from real life to the analytical domain, and when we have a 10% boost you can wonder: well, if we translate it back to the real world, how much translation error do we actually have? Does the 10% boost compensate for the translation error that might be happening? And if not, then maybe we should worry more about the translation error than about the 10% boost the algorithm gives us. That's also a way to think about it. If the analytical boost is small but it maps really well to reality, that is usually way better than if your analytical boost is huge but it bears no relation to reality whatsoever.
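Stepping back to the "won't predict" flag for a moment, here is a rough sketch of one way to wire it up in scikit-learn, on made-up data. The 0.2 to 0.8 band comes from Hugo's example above; the choice of IsolationForest as the outlier detector is an illustrative assumption, not necessarily what the talk itself uses.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

# Made-up data; the 0.2 / 0.8 band comes from Hugo's example above.
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
clf = LogisticRegression().fit(X, y)
outlier_detector = IsolationForest(random_state=42).fit(X)

def predict_or_abstain(X_new, lower=0.2, upper=0.8):
    """Return 0/1 predictions, or None where the 'won't predict' flag is raised."""
    proba = clf.predict_proba(X_new)[:, 1]
    is_outlier = outlier_detector.predict(X_new) == -1   # -1 marks outliers
    decisions = []
    for p, weird in zip(proba, is_outlier):
        if weird or lower < p < upper:
            decisions.append(None)        # unsure, or unlike anything seen before
        else:
            decisions.append(int(p >= upper))
    return decisions

print(predict_or_abstain(X[:10]))
```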
hugo bowne-anderson 24:18 I couldn't agree more. I think this also speaks to something we've been talking around, which is focusing on the problem rather than focusing on the solution, right? And I think the recommender system example is the most obvious case of this out of the ones we've discussed, where you want to recommend something to someone, and perhaps the next thing you'd intuitively think someone would watch is a good approach, as opposed to starting with deep learning and neural networks. Speaker 1 24:56 My favorite example of this, and this is from the same talk: in the weirdest possible way, I ended up talking to someone who helps out with the World Food organization. A little bit of background: my master's degree was, in the end, in operations research, and that field is all about optimizing logistical chains, traveling-salesman kinds of problems. Pretty math-heavy, but really, really interesting stuff too. And you can imagine that at the World Food organization they are allocating foodstuffs, and they would like to do that cheaply. So the logistical part of the equation is not insubstantial; the transportation costs are real here, as well as the cost of goods and all that. And the way they used to work was they would go up to, quote unquote, a village. They don't actually go to a village, but what they do do is, for certain regions, they ask the region: hey, what foodstuffs do you need? And the region will say something like: okay, we need this many kilos of beef, this much beans, and they would list the foodstuffs they need. And they have PhDs at their disposal who work on this, and they have been working on this for years. You can imagine that this is traveling-salesman-problem kind of stuff; the algorithms for this don't scale very well, they're very hard to parallelize, so you might have to wait for weeks until the big machine is done finding the optimal allocation. The tricky thing with all of these operations research algorithms is that if the world changes only a little bit, you have to rerun the algorithm: if the price of a good changes, the allocation is already stale. So this is a really, really tricky technical problem, and I do have great respect for people who make progress just in that domain, because it is fundamentally hard stuff. But then someone just observed, out of the blue: they don't need beef, they don't need beans, they need nutrients. If we're dealing with people who are facing hunger, they're not going to be that picky. If we can give them lentils instead of beans, which have a very similar nutrient profile... oh, hang on, maybe the algorithm is being forced to look in a search space where the optimal solution doesn't even exist. And I probably have the numbers wrong, because this was 10 years ago, but what I was told is that this observation, and then switching to a simpler algorithm to deal with it, led to a 5% cost reduction, where usually they'd be happy with a 0.1% reduction after a year of research. But this was 5% overnight, with this one observation that fits the world better.
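The reformulation Vincent describes is essentially the classic diet problem from linear programming. Below is a minimal sketch with entirely invented numbers, just to show what "optimize over nutrients rather than specific foods" can look like; the real allocation problem is of course far richer than this.

```python
import numpy as np
from scipy.optimize import linprog

# Entirely made-up numbers: three foods, two nutrients.
foods = ["beef", "beans", "lentils"]
cost_per_kilo = np.array([9.0, 2.0, 1.8])        # purchase + transport, invented
nutrients_per_kilo = np.array([
    [260.0, 210.0, 240.0],                       # protein, grams per kilo (rough)
    [26.0,  50.0,  65.0],                        # iron, milligrams per kilo (rough)
])
minimum_needed = np.array([2000.0, 150.0])       # required protein and iron

# Minimise cost subject to nutrients_per_kilo @ x >= minimum_needed.
# linprog expects "<=" constraints, so both sides are negated.
result = linprog(
    c=cost_per_kilo,
    A_ub=-nutrients_per_kilo,
    b_ub=-minimum_needed,
    bounds=[(0, None)] * len(foods),
)
print(dict(zip(foods, result.x.round(2))))       # cheapest kilos of each food
```

The point of the sketch is the search space: nothing forces the solver to ship beef if lentils cover the same nutrient requirements more cheaply.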
And, you know, this is one of those moments where lives are properly saved, and this is also, I think, the best example of: you can have the best algorithm, but if you apply it to the wrong problem, it doesn't matter. And one philosophy that relates to this, his name is Ackoff, I forget his first name, but he was this person in the 80s and 90s who used to be an operations researcher and then pivoted to systems thinking; he was one of the founders of the systems thinking domain. And he has this nice example of, if you're designing the best car... hugo bowne-anderson 28:22 Sorry, Russell Ackoff? Could be. Speaker 1 28:25 Russ, I think so. But he has this one analogy in one of his books where, if you're trying to design the best car, the way you get there is not by designing the best engine and then the best exhaust pipe. You get there by figuring out how to hook up a really good engine to a really good exhaust pipe. If the communication between two parts is broken, it is simply not going to work. And if you're in consulting, by the way, there's a pretty good analogy for your day-to-day: instead of worrying about optimizing one thing, if you try to optimize the communication between two things, then usually it's easier to claim that you've improved the entire system. Something like that is also happening, I think, in other domains, and usually that's the stuff that matters: can I have a cog that fits the system better? Because just having a better cog doesn't mean that you have a better system. I think that message and analogy hold a lot of truth in general. hugo bowne-anderson 29:19 Well, this actually speaks to a far greater point as well, I think, which is: what are we actually doing? What are we building and why? You've spoken about the challenges of what happens when we're optimizing for area under the curve and that type of thing, F1 score and so on, when perhaps we should be doing more design thinking and thinking about product and experience in a lot of cases. So maybe you can speak to how systems thinking can help us with all of this. Speaker 1 29:53 Yeah, I mean, the easiest explanation is the story I just gave you. It's always good to be aware of the fact that you might be chasing a vanity task, right? It's very easy to say: oh, I have a classification problem, I'm going to optimize the heck out of it, and not necessarily worry about whether the labels are correct, or where the data even comes from, that sort of thing. But here is a weird and interesting exercise, something I have seen go quite nicely. I used to work at Dutch eBay, a site called Marktplaats, and one thing they did, which I just thought was really, really cool: they would get random users in and show them a new feature, and developers were forced to sit in on that session. So they had a new feature, let's say it was a new bannering system or something like that, and they would say: hey, you've just clicked around the website, you can see your front page update, does this change make sense to you? And very often the user would say: no, it makes no sense to me at all. And then: could you tell us why?
Well, because for years I've been looking for this one thing, and the banners have now all changed because of my recent searches. And an engineer would go: well, isn't that what you want? And: no, no, I want to be able to configure this. Just having a human tell you that to your face... this is maybe also a Dutch thing. The joke I always make is: we're Dutch, we don't like to stab you in the back, we prefer to stab you in the front, it's much quicker that way. But something about that punch in the face from reality, exposing the engineer to the end user, was really beautiful to see. I don't know if they still do that practice, but I thought it was a really beautiful thing to do: just have the actual human being in front of you, so when something fails, you can actually see how it fails. And that is so much more information than looking at the logs of the website. One thing is data, the other is information, and this is insight; that's the top-of-the-line knowledge you can get. hugo bowne-anderson 31:57 Without a doubt. And that's why I said, almost facetiously at the start of this conversation when you mentioned that if you're designing a wheelchair it may be useful to sit in a wheelchair for a week, "oh, who would have thought", slightly cheekily. But we are discovering this a lot in our space. I think part of the challenge, one of the big failure modes you're speaking to, is that people just don't understand what the end user needs or wants or experiences, and so they can't appreciate that or have empathy. Speaker 1 32:32 Well, I think there's also another avenue to this. One thing I'm really thankful for, as I mentioned, is that I have a background both in design thinking and in operations research. In operations research there's one lesson they really drilled into me, at least my professors did. You've got to imagine, in operations research, let's say you do the traveling salesman problem: you need to know the distances between the cities, or the cost, like how much it costs for a truck to go from A to B. If you're going to make the optimal allocation, you really need to make sure those costs are accurate, because a constraint is not going to be useful unless the constraint actually matters, right? You're going to get the wrong allocation if you don't think about that. And the same holds for the cost of goods that you're purchasing, that kind of thing. But one thing you can also imagine is: hey, I am trained in thinking about algorithms in a very constrained way. Constraints actually don't appear that much in machine learning algorithms; they might appear inside of a solver or something like that. But, hang on, I'm used to thinking about constraints. Wouldn't it be cool if I could tell my algorithm: hey, optimize for accuracy, but only if these five examples get the right class? That's a constraint I want to just attach to my algorithm, so to speak. So I think part of the solution is also that maybe we're trained too narrowly as far as what algorithms can even do.
We do have mathematically sound algorithms for adding constraints to systems, and if they're quadratic, which most linear algorithms are, we can usually get a closed-form solution for adding constraints to our classification and regression tasks. And I also like to think that there are these tricks and techniques from the realm of search, and all these different algorithms and techniques from computer science that we can build upon. Part of me sometimes also thinks that we have lost a lot of creativity by only focusing on one way of thinking about the algorithmic side of the problem. It's not just always the match with reality; it's also just, hey, there are all these fancy algorithms out there, not all of them are deep learning, and sometimes they can be a source of inspiration too. We'll see an example of that at the end; I have a demo of this. I really, really wish I had done this more earlier in my career. To give you another example: imagine you've done deep learning and you've done scikit-learn, but you've never done PyMC3, or the Bayesian thing at all, the Bayesian way of thinking about the problem. Actually, that can really help you get out of a problem sometimes, but it's a very different field if you look at the classic assumptions and all that. So that's also an escape hatch, that's also a way to think about it. It's not just always "talk to people" and "how does the business work"; sometimes it's also just realizing there are so many tools you can consider. hugo bowne-anderson 35:22 And I wasn't going to mention this yet, but of course this way of thinking shows up in everything you do: you work in education, in building product yourself, and in working with algorithms and data, but you've also built packages around the kind of stuff you think is useful. So perhaps you could tell us a bit about scikit-lego and how that evolved, and then about scikit-learn now. Speaker 1 35:50 Okay, there are a few anecdotes here. One thing I do want to mention up front is that I am going to brag a bit: depending on who you talk to the story might be slightly different, but we just reached a million downloads. So I'm definitely going to brag a bit. hugo bowne-anderson Congratulations! Speaker 1 And that's for scikit-lego, yes, scikit-lego has reached a million downloads. That was a get-a-celebratory-drink moment. Okay. So I used to work at this company, and at some point I just started noticing that when I go to a different client, there are a few of these scikit-learn components that I always re-implement. People around me thought it was a bad idea, but I thought: I don't want to rewrite this stuff all the time. I would prefer to have it somewhat open source, so other people can contribute cool ideas too, because there are also some of these weird components that, for sure, shouldn't be in scikit-learn core, but it makes sense to have a few of them around. For example, I have some time-series tricks for seasonal patterns; there are estimators for that. So I figured, okay, I need these Lego bricks around. And one of my colleagues and I thought, hey, one thing that's actually kind of cool is that we also teach this stuff sometimes. His name is Matthijs.
Matthijs and I said: you know what, we'll just make an open source project, and we'll tell our employer it's also going to be useful for training. And it actually was, because one thing we could do during a training is say: today you're going to learn how to commit to an open source project. We could do that in a day, because we control the repo, so we can show what it's actually like to make an issue, make a PR and all that. That was the original reasoning behind it. But then Matthijs and I started challenging ourselves: what if we actually took it somewhat seriously and started adding genuinely useful algorithms to this? What would happen if we did that? So we opened up the project and said: hey, anyone who can give us an algorithm with a benchmark that convinces us it's a good idea to host it here, we'll just accept it. And Matthijs also did some good work implementing algorithms for algorithmic fairness, so if you're interested in mitigating some bias, we have some features for that. This was almost seven or eight years ago. We also added some support for pandas conversion inside scikit-lego, and a whole bunch of other stuff got added. But it's now a few years later, and a few interesting things have happened, one of which is that some of the algorithms that started in scikit-lego have found their way inside scikit-learn. Scikit-lego had a quantile regressor, I think, a few years before scikit-learn did. One thing I do like to think is that what scikit-lego is really good at is just getting ideas out there, and if people end up liking them, then more serious projects, if they really want to, can pick them up. And that's, I think, kind of the story here. I'm still a maintainer, I still hang out there. Matthijs is definitely doing other stuff now, so we were looking for a new maintainer, and we have someone called Francesco. If you're listening: once again, Francesco, thanks for all your effort, you've really been helping out here. We've also got some new faces in the mix who are contributing new ideas. But the whole point of scikit-lego, at least for me now, is that it should just be fun to maintain, and hopefully it should inspire some people to play around with different ideas. And I think it's pretty good at that. It's definitely not one of the biggest scikit-learn plugins or anything like that, but it is big enough for people to at least play around with. And that's been fun and cool to see. hugo bowne-anderson 39:17 That's awesome. I've linked to it in the chat, and I'll put it in the show notes as well for those listening. Speaker 1 39:22 One thing, if you want to have a blast: you know how in Python you can "import this" and it gives you a poem? Most of my open source packages have a poem as well. The one in scikit-lego is all about the fact that we really don't want to have a fight with Lego's lawyers, because it is a trademarked term, and we make it really clear in the "import this" poem that we are not interested in a copyright fight. But do "import this" when you're playing around with scikit-lego; you're probably going to smile a bit. hugo bowne-anderson 39:52 I will. And not only does "import this" in Python give you a poem, it's not any old poem, it's the Zen of Python.
It is deep, yes. Speaker 1 40:00 Well, this is actually something I learned from SymPy. SymPy was the first open source project I saw that did this too. The way scikit-lego goes about it is: if someone contributes a pretty big idea, we tell them, hey, this is a meaningful contribution; if you want to, you can add a line to the poem. What's the lesson you learned while maintaining scikit-lego? The first paragraph of the poem is about "please, Lego, don't sue us"; the rest are genuine lessons we've learned while maintaining this thing. And I do want to share one part of it, actually. Speaker 1 40:34 Because I think we have the poem here; the poem is also just online. hugo bowne-anderson 40:43 Are you going to read me poetry, Vincent? Speaker 1 40:46 It does seem to be happening, yeah. So: "There's a lot of power in simplicity. It keeps the approach strong." "If you understand the solution better than the problem, you're doing it wrong." That was a good one. And there's another one, command-F, yeah: "Be careful with features, as they tend to go sour." And this is my favorite one: "Defer responsibility to the end user. This might just give them power." One thing we stumbled across in scikit-lego is that at some point there were users who had pretty good ideas, but it started to feel like the library had to think of every single concern. And at some point we figured, well, we can also just document that something is a concern, and we don't have to implement every single thing, because in the end, understanding the problem is the responsibility of the end user; that's something a library can't do for you. So we made a very conscious decision that we're not going to implement everything, because some responsibility has to stay with the end user. That's a lesson I did learn while maintaining scikit-lego, so I think that's my favorite line. But anyway, there are lots of cool stories in that poem. Please, if you're going to play around with scikit-lego, read it. That's my advice. hugo bowne-anderson 42:00 Fantastic. And as you know, I've been podcasting on and off for years, what year is it now, 2024? Yeah, at least six years or something like that, and this is the first time anyone has recited poetry to me on a podcast, Vincent. And I've got to say, I don't know if you can see it under this beard, but I am blushing, really. Speaker 1 42:24 Okay then, okay, fine, then I'm going to read you the first paragraph. "Roses are red, violets are blue. Naming your package is really hard to undo. Haste can make decisions in one fell swoop. Note that LEGO is the trademark of the LEGO Group. It really makes sense, we do not need the bluff: LEGO does not sponsor, authorize or endorse any of this stuff." hugo bowne-anderson 42:45 That is awesome. That's the first paragraph Unknown Speaker 42:47 of the Zen of scikit-lego. hugo bowne-anderson 42:50 So everyone: pip install, or poetry install, or pixi add scikit-lego, and import this and check that out.
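For anyone following along, this is roughly what that looks like. The exact import path is an assumption based on how CPython's own Easter egg works, so check the scikit-lego docs if it has moved.

```python
# Hugo's suggestion as code; the import path below is an assumption, modelled on
# CPython's own "import this" Easter egg. Check the scikit-lego docs if it moves.
#
#   pip install scikit-lego    (or: poetry add scikit-lego / pixi add scikit-lego)
#
from sklego import this   # importing it prints the Zen of scikit-lego, poem and all
```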
I want to move on now. We've mentioned the importance of focusing on the end product, what you actually want to measure there, and what you want your success to be. Of course, at the other end is data, and the data-generating process. So I just want to go back to you in a bar while at college, making drinks in a theater, because I do think that's a really nice example. If you think about the process, and I'm not even talking about doing any of the mathematics or statistics, if you sit down and write down what you think the process generating the data will result in, you may come up with the idea of the capacity of a theater. Speaker 1 43:57 You're exposing yourself to the problem. That's the way I would phrase it, yeah. hugo bowne-anderson 44:02 Yeah, I like that, and I hadn't quite thought about it like that. But, you know, I jokingly say I'm probably a Bayesian. Unknown Speaker 44:13 That's a good joke, actually. hugo bowne-anderson 44:15 Yeah, I thought you'd like that. And one of the reasons for that, for me... I mean, it's a shame it's called Bayesian in some ways, because we all joke it probably should just be called statistics, and statistics could have another name. But all jokes aside, it does force you to explicitly write down your assumptions and the data-generating process. And of course, in one of your talks you also mention PyMC3 and how that can be used to model certain things. So on the Bayesian side, maybe you can tell us about the data-generating process and then talk about the pros of Bayesian inference here. Speaker 1 44:52 Yeah, okay, so I'll start with this: sometimes it's also about finding the right word, right? A year ago, when I was preparing a keynote, I noticed, hey, exposure is maybe the right word. How do you actually understand the problem? By exposing yourself to it, not by hiding from it. And similarly, I was thinking about what it is about this Bayesian thing; there was a word I was sort of missing. And the word that I think describes the Bayesian stuff pretty well is that it's very articulate. It's not a black box. You have to write down what the assumptions of the entire system are before you say dot-fit. And also, in hindsight, you can look at some of the numbers and ask, okay, does this distribution make sense? It's very explicit, and something about that makes it very articulate, and that's something I like about it. Another way to think about these Bayesian systems, and this is one of those things where it really helps to have a presentation ready with some charts, but if you look at where Bayesian systems are typically used, a lot of that is indeed the A/B testing domain. If I talk to people at Booking, a lot of them like the PyMC3 stuff because of the A/B testing nature of it. And one way to think about that is: suppose that you have some sort of hypothesis. Okay, I'm going to go over the Booking example. Booking is probably not using this exact example, but Booking is one of the bigger employers in Amsterdam, which is why I'm going for them. hugo bowne-anderson 46:24 And they have done incredible work in online experimentation and A/B testing.
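As a flavour of what that looks like in code, here is a minimal Bayesian A/B comparison with invented click counts, written against the current PyMC API (the transcript mentions PyMC3, whose interface is very similar). It is a sketch of the general pattern, not a claim about how Booking actually runs its experiments.

```python
import pymc as pm   # the transcript mentions PyMC3; the modern package is "pymc"

# Made-up counts: impressions and clicks for two recommender variants.
impressions_a, clicks_a = 10_000, 200
impressions_b, clicks_b = 10_000, 230

with pm.Model():
    # The "articulate" part: priors and likelihood are written down explicitly.
    rate_a = pm.Beta("rate_a", alpha=1, beta=1)
    rate_b = pm.Beta("rate_b", alpha=1, beta=1)
    pm.Binomial("obs_a", n=impressions_a, p=rate_a, observed=clicks_a)
    pm.Binomial("obs_b", n=impressions_b, p=rate_b, observed=clicks_b)
    pm.Deterministic("uplift", rate_b - rate_a)
    idata = pm.sample(2000, random_seed=0)

# Posterior probability that variant B really has the higher click-through rate.
print((idata.posterior["uplift"] > 0).mean().item())
```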
Yeah. Speaker 1 46:31 It's very hard to spell A/B testing without the word Booking in it these days. They have some good papers, and if you go to the PyData talks, there are a couple of really solid ones that all come from that little garden, I suppose. But let's say that I am Booking. I have a website, and there are some hotels that claim to be kid-friendly, and let's say that some of these hotels also offer business suites. Sometimes people book and something about their order makes it very clear that they've got kids. Then, very quickly, you want to calculate things that sound like: given that the family booking right now has kids, what is the likelihood that this matters? And there you already see it: you are very explicit about the information that is known up front, there's something very explicit about the experiment that you're doing or the subgroup, and then you're interested in inferring some knowledge, quantified in a probabilistic way. But you have to be very explicit: you really have to say this is given, that's the thing I want to measure, and these are the other variables whose effect I want to measure. And if you're interested in the A/B testing setting: oh, then I really want to know what the effect is of this thing on that thing, keeping everything else in mind. There's also a really cool little causal example here. Have you ever exposed yourself to the smoking data set? hugo bowne-anderson 48:01 No... is it part of calmcode? Yeah, I discovered it through you, through calmcode, some time ago. Speaker 1 48:09 It's one of my favorites. We'll talk about calmcode later, because it's also a thing I do. But the smoking data set has a really cool lesson in it. A lot of data sets do, but this one is just so explicit. Okay, so you've got a data set and there are a couple of variables in it. One of them is, quote unquote, health, which is basically: are you still alive after 10 years? The other variable is: are you a smoker? And the other variable is: what is your age? And what you can do is say: I group by whether or not you're smoking, and then I track whether or not you are alive in 10 years. And when you do that exercise, you're doing, quote unquote, statistics, and you come to the very reasonable conclusion that smoking is indeed good for you, because people who smoke end up being alive after 10 years more often than people who don't. And this is kind of a conundrum: are we lying with statistics, or what is happening here? But the pattern you're really measuring there is that people who smoke are typically younger. At least if I look at my friend group, most smokers stop once they start having kids; something about responsibility happens when the kids are around. So, okay, if it's mainly the younger people who are smoking, and we're checking whether they're alive after 10 years, then by just averaging on the smoking column you're going to lose too much information. So the way you should model this instead is: whether or not you smoke depends on your age, and then, given your age, whether or not you're alive in 10 years depends on smoking.
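That flip is Simpson's paradox, and it can be reproduced in a few lines of pandas. The counts below are invented to mimic the pattern Vincent describes, rather than taken from the actual data set.

```python
import pandas as pd

# Synthetic counts, invented to mimic the pattern Vincent describes:
# smokers are mostly young, and the young are mostly still alive ten years later.
rows = []
for age_group, smoker, n_people, n_alive in [
    ("young", True, 90, 81), ("young", False, 10, 10),
    ("old",   True, 10,  1), ("old",   False, 90, 36),
]:
    rows += [{"age_group": age_group, "smoker": smoker, "alive_10y": i < n_alive}
             for i in range(n_people)]
df = pd.DataFrame(rows)

# Naive view: group by smoking alone and smoking looks "good for you".
print(df.groupby("smoker")["alive_10y"].mean())                  # 0.46 vs 0.82

# Condition on age and the effect flips within every single age group.
print(df.groupby(["age_group", "smoker"])["alive_10y"].mean())
```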
When you do it that way, you can see that there's a negative trend when you're smoking, but it's completely hidden if you don't include this one fact about age. And again, this is why Bayesian models are, in my book, considered to be articulate, but you do have to do this exploration yourself a bit. One thing I also think is so nice about this example is that you're able to learn this lesson because there's a column called age in the data set. Imagine if that was missing; you would totally come to the wrong conclusion. hugo bowne-anderson 50:16 Well, that's one of the biggest challenges, I think, when thinking about causal inference. We don't necessarily want to get too much into causal graphs and that type of thing, it's a field of its own, but in your lovely example you can take age and build a graph: age is a node, with an arrow to smoker-or-not and an arrow to survival or death. I didn't mean to be flippant about survival or death, but my point is that it's impossible to ever know if your causal graph is complete, and if you don't have the age column, then what other things do you want to take into consideration? Speaker 1 50:59 Well, and here's my point: therefore the data set will not be enough. That's the main conclusion of the smoking example. There are actually two stories, I suppose. One story is that the way you model sometimes really has to come from you; natural intelligence is still a good idea. That's one lesson, I think. Another lesson that I think is really cool about that data set: why on earth are we taking "still alive 10 years from now" as the target? Why is that the best proxy for health? It isn't, let's be real, right? But again, both of these very important conclusions come from a human. They don't come from the data set, they don't come from the algorithm, they only come from you. It's a "natural intelligence is all you need" kind of approach. And yeah, I'm passionate about this sort of thing, because it sometimes does feel like I'm the only one talking about this stuff. But I do think it really matters if we want to prevent artificial stupidity from happening. These kinds of ways of thinking are maybe going to help us more than throwing more tensors at it. That feels like a very easy hill to defend, in my book. hugo bowne-anderson 52:11 Absolutely. I mean, it's challenging when, at the same time, a lot of the cultural consciousness is obsessed with how many GPUs you can fire up simultaneously. But cynical, snide remarks aside, there are so many interesting stories and ways of thinking in here. I'm wondering if there's some way we can collect them all. Is there a list of the most common mistakes that people make? I'm trying to reduce it down to a few takeaways. Speaker 1 52:49 Yeah, I mean, okay, I guess this is a really good segue to a book I'm writing. The way I started thinking about this is that people have come up to me after my conference talks and asked: Vincent, where do you learn all of these lessons? Because they seem kind of valuable. Are they in a book or something?
And I've always kind of said no. But I talk to people, and I expose myself to problems, and I do want to admit it has always been relatively easy for me, because I used to have an employer that would basically pay for all my conferences. So I would go to lots of conferences, drink beers with people, and hear stories. I've been able to expose myself to a lot of interesting anecdotes that way. But it's kind of a cautionary tales thing; I think that's the way to think about it. There might be a few general lessons out there, but maybe it's also about recognizing the patterns in all of these stories. So what I figured I might do, and I've been telling people I should do this for a year and have finally actually started: my goal for the next year or two is to just start writing these stories down in some way, so you have a little book. The title of the book is Data Science Fiction, and the whole point is to have a resource where lots of these stories live, that people can just go ahead and read. Hopefully, when they have their business use case, they might think: will this work, or is this going to be data science fiction? Is this something we tell ourselves is going to work, or are there cautionary tales that would be of help? I do want to give one shout-out, though, because one project that I am very impressed by in this realm is the Deon checklist. The crew over at DrivenData made it. It is a very, very sensible checklist of things you want to check for when you're doing data stuff in practice, and everything that can go wrong in that checklist is backed by a newspaper article of something actually having gone wrong. So if you want to have a conversation with your boss about it, you can pull up a newspaper article and say: this could also happen to us. The Deon checklist is definitely one of the most underappreciated resources in data science, in my mind. But the goal of the book is to be a little bit more story driven. There's going to be some fiction in there, there are going to be some charts, but basically, instead of me waiting for another conference to happen to share some of these anecdotes, I'm just going to write them down in a little book. I'll show the link at the end, but the first three chapters are already up, and everyone can just go ahead and hugo bowne-anderson 55:22 read them. Fantastic, and I'm excited to chat about them as well. I've included the link to the Deon checklist in the chat and in the show notes, and I've also included a link to a set of essays that DJ Patil, Hilary Mason, and Mike Loukides wrote in northern hemisphere summer of 2018, called "Of Oaths and Checklists," something I think about quite often. I think it may be what inspired Peter Bull and Isaac and all those wonderful people at DrivenData, though I could have that wrong. Speaker 1 55:55 It is sort of the same, or at least similar, social circles; I could see the Venn diagrams overlapping there, and it's also the same time frame, so it wouldn't surprise me. Wouldn't surprise hugo bowne-anderson 56:03 me, yeah.
But one of the points of the essay, essentially, is that people talk a lot about data scientists, or the practice, maybe having oaths, something like the equivalent of a Hippocratic oath. This piece positions the idea of checklists instead, along the lines of The Checklist Manifesto, which I highly recommend to everyone. Also, are you doing a live stream today with the people from DrivenData? Speaker 1 56:32 I guess I can mention that: on behalf of my employer, Probabl, I do a podcast called Sample Space, and yes, the next guest is going to be Peter from DrivenData. I will be interviewing him later today about the Deon checklist and other stuff, because I think it's a really cool project, and it's also relevant to stuff that we do over at Probabl. So that's a very typical guest I would have on the podcast. hugo bowne-anderson 56:53 Fantastic. Well, definitely check out Vincent and Probabl's podcast as well. I'd like to jump into a demo soon, but first I'd like to get your thoughts on something we've been talking around: something you argue for a lot is the value of simple, interpretable models in a variety of scenarios. I'm interested in how you balance this with the push for more complex and potentially more powerful models. Speaker 1 57:32 There are a couple of attitudes to have. I gave a talk about linear models, like how to win with simple linear models, and I think there's still a lot of truth to that. The internal joke over at Probabl is that we have the Vincent benchmark: did you also run a linear model, and how did it compare? If you haven't done that exercise, then please do, and otherwise, you know, Vincent's going to be angry. That's hugo bowne-anderson 57:57 almost equivalent to a majority classifier in the binary case. Speaker 1 58:03 The dummy classifier, yes. It's basically: this is Vincent's baseline. And one thing that I've always noticed is that if you force yourself to just do the linear model thing for a bit, then you also force yourself a little bit more to get value out of your feature engineering, which also forces you to think a little bit more about your problem. So I've always appreciated this exercise as a forcing function. That said, I do want to mention that there's something cool about the LLMs that is a bit new, and I do think it's actually kind of useful. This is something that back at Explosion we definitely did experiment with. A trick that I've always loved using when figuring out data quality is having two models. Let's say you've got one model that's on the relatively simple side: very rule based, maybe a Python function with some domain knowledge from a colleague, or, if you're doing NLP, a relevant-words list where, if that word appears, it's a match. A relatively simple classifier; it shouldn't be stupid, but it should be relatively, quote unquote, simple. That's the goal there. And then the other model you've got is the hello-deep-learning, mega-XGBoost, mega-black-box thing; just worry about accuracy on that one. Now, what happens if they disagree? Because the cool thing is, even if you have lots of unlabeled data, you can still train two of these models on a relatively small subset.
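A minimal sketch of that two-model trick on a toy text task follows. The keyword list, the tiny labeled set, and the logistic regression are all stand-ins for the real rule system and the "mega black box"; the point is only the disagreement check.

```python
# Toy version of the "two models" trick: a simple keyword rule versus a trained
# classifier, and we review only the rows where the two disagree.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["refund my order please", "love this product", "broken on arrival",
         "great support team", "item never arrived", "works as described"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = complaint, 0 = not (tiny made-up set)

def rule_model(text: str) -> int:
    """Domain-knowledge baseline: flag as a complaint if a trigger word appears."""
    triggers = ("refund", "broken", "never arrived")
    return int(any(word in text.lower() for word in triggers))

ml_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
ml_model.fit(texts, labels)

unlabeled = ["package arrived damaged", "thanks, all good", "where is my refund"]
for text in unlabeled:
    rule_pred, ml_pred = rule_model(text), ml_model.predict([text])[0]
    if rule_pred != ml_pred:
        # Disagreement: either the rules miss a pattern or the label/model is wrong.
        print(f"review this one: {text!r} (rule={rule_pred}, model={ml_pred})")
```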
But I have found, just in practice, that when two models disagree, something interesting is usually happening. It's not that you can throw away all of active learning, but either there's a misclassified label, which you can use to improve the XGBoost model, or you're going to notice a pattern that's not in your rule-based system. Both are very valuable in the end. Coming back to the LLM: one thing I do think is really cool about it is that, because it's so promptable, you always have in your back pocket, for a lot of NLP tasks, a not necessarily perfect but available second model. The only thing you need is a prompt, and you have a model that should be able to do a bit of classification and NER. I recall a lot of benchmarks we did over at Explosion; they're not perfect or anything, and we found that if you get around 5,000 labeled examples, a lot of the spaCy models can outperform these LLMs on a lot of tasks, simply because you've got more labels at your disposal. But in that beginning phase, as you're labeling, that's actually not a bad use case for an LLM: use it as the always-available second model and keep the human in the loop. I think that's a very legit use case, and I do think there's a lot of value in that. So on the one hand I am a little bit against maximalizing the LLM; I'm not an LLM maximalist or anything like that. But from a distance I can look at a few of these use cases and go: hey, that's actually kind of cool, but the human is still in the loop. That's the only thing I would add. The moment you automate the whole thing and don't keep a human in the loop, then I do start getting a lot more critical. The main shame is that there is a lot of LLM maximalism going on. It's kind of weird to see how, as an industry, it felt like: oh my god, we've got ChatGPT, the company now really has to do something with this. That caused a lot of proofs of concept, but not necessarily great engineering practices yet. That's the vibe I'm getting from people in industry: a lot of proofs of concept, not necessarily a whole lot of stuff in production, and not necessarily a whole lot of proper engineering. There's been a bit of FOMO, which caused businesses to start taking this stuff a little bit more seriously; I wish people would not use FOMO but use curiosity a bit more. But again, there's a spacy-llm package people can check out that's pretty good: it makes it easy to configure an LLM as a back end, you get a spaCy object back, and that stuff just works pretty well. Having a good second model always at your disposal for NLP tasks, that's cool; for labeling, that's a clear use case. I'm convinced that's actually fair. hugo bowne-anderson 1:02:13 Yeah, I think that assessment's pretty reasonable. I will say, within that, you mentioned that people are finding it difficult to productionize certain things with LLMs, or to integrate them into traditional software systems and stacks; I'm paraphrasing slightly. But there is an argument, and I'm not making it now, though I've made it before, that LLMs, the way they've played out, are almost antithetical to software.
I mean, in terms of building software, we want it to be reliable, correct, and consistent, for example, and LLMs are definitively not, anywhere. Speaker 1 1:03:02 So make it a small cog in the system, but still have the system right, and have a box hugo bowne-anderson 1:03:07 around it as well. Don't necessarily make it the big brain. We do have this when we think about agentic workflows, for example: a big brain that sends everything out to everything else, essentially. Speaker 1 1:03:19 Well, or just don't have it automate the thing. That's also a way to think about it. Maybe an anecdote, something I did learn over at Rasa: say you're in a chatbot scenario, and the user types, "Hello, my name is Vincent. I would like to order a pizza." Then you can infer that the user's name is Vincent. In the case of Vincent, that's relatively easy, because Vincent is a relatively common name in English, so it'll be somewhat easy to detect. But one of my high school buddies, his name is Fuat. That's a less common name in newspaper articles in the West, and therefore probably also a harder name for an NLP system to detect. So what do you do? Do you automatically detect the name and move forward, or do you have a step in the chatbot that says: hey, just to confirm, is this your name? Yes/no button, right? This is a general theme; we're going to see an example of it in a bit. I also feel like a lot of machine learning problems are actually more UI problems. Do we really need a bigger tensor for the retrieval, or do we need a better interface for the user to find the thing that they're looking for in search? Can we do something clever with icons instead of having everything be text? Stuff like that might be more what a user is interested in. One of the coolest chatbots that I've ever seen was from a bank in the US, I forget which one, but they had a chatbot which was basically almost like a form. The chatbot would just be there to help you find the button in the menu somewhere. Imagine you've got your iPhone and you're trying to find that one setting that changes the Bluetooth; it's in a really deep menu. The whole point of that chatbot was just to get you to the right part of the menu, and the app would just kind of update. It was a nice experience: it worked like a session, asking a follow-up question rather than making you do a one-shot search. But is that really chatbot technology, or is that also just UI? In my experience, a lot of it really should be UI; that's probably going to be the easier win. hugo bowne-anderson 1:05:25 I couldn't agree more. I want to jump into a demo, but before that, I am wondering Unknown Speaker 1:05:32 I have the time, right? So go. hugo bowne-anderson 1:05:34 Yeah, exactly. If someone had one takeaway from this conversation about how to rethink data science and machine learning, what would it be? Speaker 1 1:05:52 Try to expose yourself to really different problems once in a while, and then reflect. That's kind of the tactic that I've been using.
So just to give an example: you've been hanging out with Hamel and the crew that just did this big LLM course thing. I basically signed up for that not because I was particularly interested in fine-tuning my own LLMs, but just because I felt like exposing myself to that field. The compute credits also totally helped, by the way; that was a fun little bonus. But the reason I went for it is that I'm now also in a Discord group where I can see what people are talking about, and that is a source of inspiration for me to go do my own personal experiments a bit more. I do question everything that I see there, though. One thing that really dawned on me, and I still don't quite know how to think about it, is that people really like to use ChatGPT to simulate training data for other systems. I was almost shocked to learn that people wanted to do it that way, because before, I have seen that go wrong over and over and over again. But then again, we have not had generative systems as good as the stuff we've got now, so maybe it is actually valid to play around with that some more. Okay, forcing function: I'll figure out a way, but that's something I will go ahead and explore, and I'm pretty confident I'll at least be able to do something that's relatively interesting, something that could be a PyData talk or a tool that's useful to me personally. Another thing I'm doing right now is scraping weather data because I want to predict the output of my solar panels. Silly stuff like that, but it is a problem in reality, right? So if you're going to go away with any lesson: try to broaden your horizon a bit, and you do that by exposing yourself to different problems and also going down a rabbit hole and then back up again. hugo bowne-anderson 1:07:45 I love it. And as to the question, I can't speak to what that group would say, and I don't work enough with synthetically generated data, Speaker 1 1:07:57 same, but I want to now. That's the thing: I want to now. hugo bowne-anderson 1:08:02 I will quote something I've just inserted in the chat and will put in the show notes: a blog post by Cory Doctorow talking about someone called Sadowski, who coined the term Habsburg AI. Essentially, they're reflecting on what will happen if we continually feed the output of certain LLMs into other LLMs. "Sadowski has a great term to describe the problem: Habsburg AI. Just as royal inbreeding produced a generation of supposed supermen who were incapable of reproducing themselves, so too will feeding a new model on the exhaust stream of the last one produce an ever-worsening gyre of tightly spiraling nonsense that eventually disappears up its own asshole." People can read that. I promised, Vincent, I wouldn't swear on this podcast, but it's a quotation. Speaker 1 1:08:55 The only reason I'm smiling is that it really does sound like Cory; he likes to use those words. So that's always something to giggle at. But one thing I really do appreciate about that course that Hamel and friends made is that it was nice to be able to cling on to a forcing function. My life is a bit busy now; I have a wife and kid, and it's hectic and all that.
But if I'm able to find a great forcing function that at least exposes me to a problem, that's usually the hard bit. Buying a book is easy; actually reading it is the part that's hard. If you can find a forcing function that really motivates you to go deep, that might be the better game to play. Find the forcing hugo bowne-anderson 1:09:44 function. Let's jump in; we're demoing soon. Maybe you can tell us a bit about calmcode first? Speaker 1 1:09:53 Yeah, so that can be the first demo, actually, but I do have to share my screen. hugo bowne-anderson 1:09:59 And perhaps we could narrate it as well. Sometimes I leave these demos in the actual podcast, so people may get something out of it, or I can cut it. Speaker 1 1:10:11 Yeah, it is definitely good to dictate what we're looking at. So I'm sharing this in case people hugo bowne-anderson 1:10:17 have not seen it, but it'll be in the show notes as well. Speaker 1 1:10:21 This is a very short first demo, but this is calmcode. It's a project that I've maintained since covid. It is just a learning resource for lots of tools that I think are very likable. It's me and another maintainer now; I've got a partner in this. Just a whole bunch of free content with tools that make your day to day a lot easier. We try to prefer the stuff that's very relevant but not necessarily hypey, and this is the stuff that I do recommend. One thing you can do now is go to calmcode.io/book, and you will find this thing called Data Science Fiction. This is where I'm going to write the book; you can give feedback and all that. Right now there are three chapters. One chapter is about data at a hospital, another is about sending letters, and another is about waiting lines. If you've seen my talks before, these are basically the anecdotes that you might expect would go into one of my talks. The goal is to have around 50 of these; they're all short essays, similar to how the Rework people do it. And I guess the first demo is now also done: Data Science Fiction, if people want to go ahead and read it, this is where you can find it. I'm going to go about writing this somewhat slowly; it's chugging along nicely, and I don't want to rush it. But when there are new chapters out, it'll be in the emails; if you follow me, I'll talk about new chapters. So if this sounds interesting, do check out Data Science Fiction over at calmcode.io/book. Okay, end of that speech, because now I want to talk about actual code. The first thing I want to talk about is solving a problem that I've got on my own. If people saw my keynotes, they might have seen this project of mine called arxiv-frontpage. This is a GitHub repository that does a little bit of Git scraping. There are GitHub Actions in this thing, and basically what I do is, every day, I download the new articles from arXiv, and then I have a bunch of classifiers. If a classifier detects a topic that I'm interested in, say active learning, or new data sets that appear, et cetera, then another part of the workflow gets triggered and the GitHub Pages site will update.
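For a rough idea of what that daily job does, here is a minimal sketch that pulls recent abstracts from the public arXiv Atom API and flags sentences with a made-up keyword scorer. The query string, topic list, and scoring rule are assumptions for illustration, not the actual arxiv-frontpage code, which uses trained sentence classifiers.

```python
# Minimal "fetch and flag" sketch: pull recent cs.LG abstracts from the public
# arXiv Atom API and flag sentences that mention topics of interest. The
# keyword scorer stands in for the real trained classifiers.
import feedparser

ARXIV_API = ("http://export.arxiv.org/api/query?"
             "search_query=cat:cs.LG&sortBy=submittedDate&sortOrder=descending"
             "&max_results=25")
TOPICS = {
    "active learning": ["active learning"],
    "new datasets": ["new dataset", "we release", "benchmark"],
}

feed = feedparser.parse(ARXIV_API)
for entry in feed.entries:
    abstract = entry.summary.replace("\n", " ")
    for sentence in abstract.split(". "):
        for topic, keywords in TOPICS.items():
            if any(kw in sentence.lower() for kw in keywords):
                # In the real project this would be a probability from a model,
                # and a hit would end up on the GitHub Pages front page.
                print(f"[{topic}] {entry.title}\n    {sentence.strip()}")
```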
That's kind of the way this system works. So you can go to the GitHub Pages site; the whole MLOps thing, by the way, is just GitHub Actions. I have a section on new data sets that you can just go ahead and look at. The system does sentence classification, so if a topic is detected in a sentence, that's the easiest way to bootstrap this, and it also gets highlighted with a probability value. This was my way to expose myself to interesting ideas on arXiv when a new paper comes around on a topic that I'm interested in. It's not a perfect classifier, but it does kind of work. What I can also do is label some extra data: if I see a line that I think is misclassified, I can go back to my labeling tool and fix it. But it did start to dawn on me that, even though this is a cool way to expose yourself to new ideas, it only exposes you to papers that are being written now, and I'd like to be able to find stuff that's a bit older. So I was thinking: okay, this kind of works, but maybe I should do something else, and that is this little demo over here. Just to clarify how it works: I went to Kaggle, and on Kaggle I found the big arXiv dump. These are all the abstracts from arXiv, I think from the start, but I could be wrong. I took the computer science ones, and I think I've got around 600k abstracts from arXiv, with titles. Then I went ahead and embedded them. So I've got sentence-transformers; I took the Matryoshka embeddings, which are relatively new, and I have LanceDB running locally. I didn't even build an index; this is all happening brute force. But what I can now do is come up with some sort of a query, embed that, that's the first step, and then maybe build myself a nice little search engine and find interesting articles that way. And this is something I did learn from that Hamel course: because we're doing embedded search, we can actually do a little bit of prompting. So let's say I'm interested in label quality; I could write a prompt like "this data set has bad labels in it." I have a little search box over here, I'm typing "this data set has bad labels in it," and then I can go ahead and hit search. Again, I'm brute forcing, so this does take a while, but then some results come back. So for just that query, one of the articles that I see here is about exploring large-scale public medical image data sets. That does kind of fit, right? But because I'm dealing with neural search, I can actually make this prompt a whole lot larger. It's not a search term; I can look for vibes a bit more. So I could add something like "label quality is important" and "bad labels have consequences on downstream tasks." These are things I can add, right? And again, I am doing brute force, so this takes a little while. But one thing I did pick up from Hamel's course, as I'm exposing myself to neural search a bit more, is that maybe we should celebrate prompting a bit more. Normal search also has its place, but I can really come up with a pretty long query, and because I'm searching for vibes a bit more here, hey, maybe the search bar is going to become a search box.
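A rough sketch of that setup follows, assuming LanceDB's Python API and its convention of a column named "vector". The model name, table layout, and example rows are stand-ins rather than the exact configuration from the demo.

```python
# Embed abstracts once, store them in LanceDB, then search with a long,
# prompt-style query instead of a few keywords.
import lancedb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the Matryoshka model

abstracts = [
    {"title": "Label noise in medical imaging benchmarks", "text": "We study ..."},
    {"title": "A new large-scale dataset for code search", "text": "We release ..."},
]
vectors = model.encode([a["text"] for a in abstracts])
rows = [{**row, "vector": vec.tolist()} for row, vec in zip(abstracts, vectors)]

db = lancedb.connect("./arxiv-demo")
table = db.create_table("abstracts", data=rows, mode="overwrite")

# The query can be a whole paragraph rather than keywords: we search on meaning.
query = ("This data set has bad labels in it. Label quality is important, and "
         "bad labels have consequences on downstream tasks.")
hits = table.search(model.encode(query)).limit(5).to_pandas()
print(hits["title"])
```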
Now, that is, I think, a pretty cool observation. But at the same time it's also not enough, because if I look at the examples that I get here, they are only kind of in the domain of what I'm looking for. This is the point where I took a step back and noticed that it's actually kind of hard to come up with a great prompt, because I don't know what's in this data set. So I also don't necessarily know what jargon I should put in the prompt in order for it to do proper work. There's still a challenge here, and I'm not intent on learning all the jargon up front; I just want to find the thing. And this is where it helps to have worked at Explosion before, where I was working on this thing called Prodigy, which is a labeling tool. This is one of those moments where I go: yeah, this is a great starting point, but at this point I just want to go ahead and label. Because if I look at this example, it really looks like something I'm interested in, so I want to hit "positive match" on that. Then I can scroll down and look for something that I'm not interested in. Oh, and "identifying mislabeled instances in classification data sets," good grief, that is so what I'm after, right? Maybe scroll down a bit: something on the state of German text summarization, okay, that's something I'm just totally not interested in. So again, this is kind of a UI thing, because sometimes it's easier to write a prompt, and sometimes it's much easier to say yes on that, no on that. hugo bowne-anderson 1:18:01 We do have a question in the chat: why did you decide not to use BM25 or some other semantic search algorithm? Speaker 1 1:18:10 So BM25, is that the token-based stuff? hugo bowne-anderson 1:18:17 It's a ranking function; it ranks a set of documents based on the query, I think. Speaker 1 1:18:26 Right. Well, the thing is, I am doing something else instead. But one thing to keep in mind: I have a kid, I have a wife, I have a job and lots of responsibilities. I might have a few hours in the week to work on this sort of thing, which also means that I try to consciously work in sprints that are relatively simple, and there are definitely things that I could do to improve this. It's just that at this point in time, I hope you can imagine that the UI was the more interesting thing to me. So it wasn't so much throwing more tensors at it or throwing fancier cost functions at it; I can definitely still do all of that later. The main observation at this point is that I'm searching in a session and not in a single search query. And I think that's the more interesting thing, because I'm using the fact that, for me as a human, it's much easier to also do a little bit of labeling. I'm assuming this is a longer-term search: I'm looking for rare things here, and in that case it helps if there's a bit of interaction. The way this works internally, though, is that I am doing a bit of re-ranking. I think I hit search, yeah, because these results are here. So internally I'm retrieving, I think, 500 or 1,000 examples from LanceDB, and then those things have to be sorted. And I'm actually using scikit-learn for that, because scikit-learn, as it turns out, has support for semi-supervised learning.
It's a bit of a hidden module in scikit-learn, because it's not the use case most people will have. But the way it works is: imagine, if you will, I've got an image in front of me right here, and there are two circles, one scattered circle in the middle and one scattered circle on the outside, and there's just one labeled point in each circle. The algorithm has to figure out how to classify the rest of the data. If this were a plain classification algorithm, there would probably be some separating line between the two labeled points, but it wouldn't use the underlying structure of the circles, and that focus on underlying structure is kind of the point of what people call unsupervised machine learning. So the idea here is: yes, we are going to do classification, but we are going to use the underlying structure, or clusters, of the data to propagate the labels forward, kind of like a Markov chain, if you will. You can imagine we connect all the nearest neighbors together inside a cluster, and because the clusters are so well defined, there's never going to be an edge from one cluster to the next, and that is going to help propagate the label within the cluster, so to speak. And what's super cool about that? Well, that might be exactly what I need here. We have embeddings, they are going to form clusters, and if I give a few positive labels, then I've given a very strong signal: find me more like that, right? And that's how this works for now. There are still things I want to improve, but there are a few things I want to highlight that I think are very useful. One, I am definitely pursuing the idea that the UI might be more important than the ML. Two, at the same time, I am trying to expose myself to a field that's relatively new to me. I have done embedding stuff before, but I am also exposing myself to prompting a bit more in doing this, and not prompting an LLM, but prompting for neural search. I do think something about that is interesting, and this is going to help me learn an aspect or two of a new field. And three, this is a forcing function: I'm forcing myself to expose myself to something, because this is a problem I actually have. I want to read more interesting articles, and Twitter doesn't really provide that to me anymore; five years ago it was better at that than it is these days. But I don't really care about the machine learning under the hood; I really just care about solving this problem. That means I can also explore all sorts of other avenues that aren't just throwing more tensors at it. hugo bowne-anderson 1:22:28 Really cool, Vincent, and thank you for sharing. It's a really nice foray into things that interest you, but LLMs as well, and thinking about the data that's generated, then creating a UI where you can label things yourself, then doing a bit of semi-supervised learning. It's reaching for things when they become necessary based upon the problem, which I really appreciate. I was also going to ask: is this on GitHub or something? Can people use some of the code to build something themselves? Speaker 1 1:22:58 Soon. What I want to do now is generalize this somewhat, so I think the best thing I can do right now is make it easy to reproduce for other people.
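The two-circles picture described above maps almost directly onto scikit-learn's semi-supervised module. Here is a minimal sketch with LabelSpreading; the kernel choice and neighbor count are arbitrary, not the settings from the demo.

```python
# Two ring-shaped clusters, one labeled point each: label propagation spreads
# the labels along the structure of the data, with -1 marking "unlabeled".
import numpy as np
from sklearn.datasets import make_circles
from sklearn.semi_supervised import LabelSpreading

X, y_true = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

y = np.full(len(X), -1)                  # everything starts unlabeled
y[np.argmin(X[:, 0])] = 0                # one labeled point on the outer ring
y[np.argmin(np.abs(X).sum(axis=1))] = 1  # one labeled point on the inner ring

model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y)

# With only two labels given, the propagated labels should largely recover
# the two rings on this well-separated toy data.
print("agreement with true rings:", (model.transduction_ == y_true).mean())
```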
I think arXiv is general enough that more than just me will be interested in this, and then I can turn it into a bit of a competition. So I want to host this on Hugging Face with the idea that other people can try all sorts of other embeddings and maybe other indexing techniques. I need to write the right abstractions for it, but it's just a little Flask app, and people can always shoehorn a Python function in. The idea has been proven to me enough that it would help to do this in public if I want to learn more from it. I've previewed this before on my YouTube channel, and people have come to me personally saying: hey, I would love to help out. I'm also going to a Taylor Swift concert soon, so the next two weeks are insanely hectic, but after that it is my goal to put this live a bit more so everyone can have their own little playtime with it, because that is way more scalable than me trying out every single embedding myself. hugo bowne-anderson 1:24:09 Well, do follow Vincent on Twitter and on LinkedIn, and I'll put those links in the show notes so you can find out when he releases this stuff. Vincent, I do have one last question, and maybe stop sharing your screen now, unless there is one more demo. Oh, there is one more. Let's jump in. Speaker 1 1:24:31 Even better. I promised three demos; those were the first two. So this is something that came out of the live stream that I do on behalf of my employer, Probabl. I'm not going to do the live coding, because I don't think we have the time for it, but I do want to paint a bit of a picture of some of the stuff I'm exploring at my employer. What we're looking at here is a Jupyter notebook, nothing super fancy, and we're looking at the Titanic data set. There is going to be a chapter about this data set in Data Science Fiction, by the way, because there's something very insane about it once you dig in, but hugo bowne-anderson 1:25:08 is it even a real data set? I've never found any documentation to suggest that it actually comes from the Titanic. Speaker 1 1:25:16 Oh my, the rabbit hole really goes deep on that one. No spoilers. hugo bowne-anderson 1:25:22 Kaggle is the first place I'd been able to trace it back to. Speaker 1 1:25:27 Kaggle, and a generation of O'Reilly books, use this data set. We're going to ignore some of the interesting bits for now, in favor of: what do people typically do when they model this? Well, you've got to hugo bowne-anderson 1:25:41 you can do a majority class classifier to begin with, as a baseline; it gives you 65% accuracy or something. Speaker 1 1:25:50 And that's totally fine. You can basically do an average here: who survived, yes or no? It's a binary classification, so doing a dummy classifier is very good practice. All well and good. hugo bowne-anderson 1:25:59 Biological sex is a really good predictor as well. Speaker 1 1:26:03 And passenger class, you know; you can make do with simpler models. But the reason I think this data set is interesting is from a tooling perspective. So this is the thing we want to predict, and we want to use all of these separate columns as inputs. We've got this string column that says male or female.
We have this integer column that gives you the passenger class, so that's one, two or three, and we want to predict survived or not. But we also have to do some one-hot encoding on some of these things; there are preprocessing steps involved to get this data ready for prediction. Most machine learning algorithms cannot deal with a string column; you have to turn it into something numeric at hugo bowne-anderson 1:26:46 some point. Do you remember when we used Speaker 1 1:26:50 to use get_dummies? There's a video on the Probabl YouTube channel about why that's a very, very terrible idea in production, in case you're wondering, but yes, people used to do this with get_dummies, the pandas function. In general, please use the OneHotEncoder instead, because the OneHotEncoder will remember which classes were seen during training. Pandas get_dummies doesn't care if there's a new class at prediction time; it will just encode it anyway and see what happens, and sizes get messed up, et cetera. The OneHotEncoder will give you a proper error. But again, from a tooling perspective, there are some columns we can just pass to the algorithm as-is. We've got age, that's a numeric value, we've got the fare, and there are a couple of columns we can just pass in, that's fine. But there are a couple of other columns, like the passenger class and the sex, that have to be one-hot encoded. And that's not even mentioning the name of the passenger: if you really wanted to do something with machine learning on that, those are very long strings, so you would maybe have to make bigrams or bag-of-words features out of them. You have to encode that in a different way as well. And if you want to get started with a fairly basic algorithm in scikit-learn, this is the way you still have to do it. The pipeline API is amazing, because you can write proper, elaborate pipelines with it. But I was looking at this, wondering: imagine you're a recent graduate with a very strong stats background. There's a function in here, there are some objects in there; you actually do have to know a bit of Python in order to do some modeling here. Now, the amount of Python you have to know is not the worst thing out there, but I was wondering: is there an opportunity to rethink a little bit of stuff here? What I'm about to show you is not what I think is the best practice out there, but it is, I think, an interesting idea that I'm at least exploring. So I'm wondering: can we rethink the way we declare the pipeline? I still like the idea of the scikit-learn pipeline, but can we do something clever when it comes to declaring it? By the way, what you see over here is the representation that scikit-learn gives you of the pipeline. But I've been working on this library called playtime, so you can pip install scikit-playtime right now. And what I'm saying is: what if we just say there are some features, and we have this one function? Then we have some features over here, and these are just the columns to grab; then there's some stuff to one-hot encode.
And then there's another feature that we just want to do the bag-of-words thing for. And maybe, if we just have functions like these, with plus operators, and we put them together, then maybe that can generate the same pipeline for us. hugo bowne-anderson 1:29:41 So what do those pluses make? I kind of want pipes there or something. Speaker 1 1:29:46 So this is a feature union. The way it works internally is that when you see a plus here, a make_union is actually happening under the hood. But you are pointing at something that's equally interesting: what if we multiply here instead? What we could do then is say, well, multiply everything here with everything there. And hang on, what if we have a bar, a pipe? Well, then I can pass the result to whatever scikit-learn thing comes after; that's also something I could build in. Cool, huh? Okay, so I'm not sure yet if this is a great idea, but I am tickled a bit here, because I do think there's something about changing the notation that also changes the way you think about it. And that, I think, is again UI. You're definitely not going to be as elaborate as the normal scikit-learn pipeline allows you to be, by the way; this is not something that's going to replace the pipeline anytime soon. And also, I have some colleagues and collaborators on GitHub doing something different in the same realm for the skrub project. Skrub is also a thing under the Probabl umbrella, and there are other ways you could rethink these pipelines. But I do want to give a glimmer, to tell people: this is the level of rethinking that you can also do. Some of it is also just: do I like the code that I'm writing? And if not, there's a UI question there too, which I think is good to think about. Now, to give you a final preview of the "Vincent, how on earth did you end up doing this?" part: the future of this is going to be time series. One thing you can also do here is use a seasonal operator: you can pass in a date column, and it can generate seasonal features during the year, so you can have hills in January and such. But if you imagine that I've got a seasonal pattern feature here, and I multiply it by something that's one-hot encoded for, say, the day of the week, then I can have a seasonal pattern for every day of the week. In retail, that's super useful, because the seasonal effect you're going to have in the weekend is usually really different from the one you're going to have during the week. And why stop there? We might also have a different seasonal feature per store. This particular kind of time series model, to me, feels a lot easier to declare this way than with the normal scikit-learn pipeline. And again, what I hope people get from this is: oh yeah, this is also UI, and a bit of exposing yourself to a different problem, then taking a step back, rethinking, and iterating. But I'm not thinking about tensors, and I'm hoping more people will join me in this.
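For reference, the "classic" pipeline declaration Vincent is contrasting this with looks roughly like the sketch below. It is a generic reconstruction, not the notebook from the demo: it uses seaborn's copy of the Titanic data for convenience, an arbitrary model choice, and it also shows the OneHotEncoder point from earlier, namely that categories are remembered at fit time, unlike pd.get_dummies.

```python
# The "classic" way: one-hot encode sex and passenger class, impute and scale
# the numeric columns, then fit a model on top.
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = sns.load_dataset("titanic")
X, y = df[["pclass", "sex", "age", "fare"]], df["survived"]

preprocess = ColumnTransformer([
    # OneHotEncoder remembers the categories seen during fit; an unseen
    # category at predict time raises an error instead of silently changing
    # the shape of the feature matrix, which is the advantage over get_dummies.
    ("categorical", OneHotEncoder(handle_unknown="error"), ["pclass", "sex"]),
    ("numeric", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
     ["age", "fare"]),
])

pipe = make_pipeline(preprocess, LogisticRegression())
pipe.fit(X, y)
print(f"train accuracy: {pipe.score(X, y):.3f}")
```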
hugo bowne-anderson 1:32:41 I love it, and I think it's lovely to bring this full circle: rethinking what we're doing, but also the tools we're using and the tools we're building. Speaker 1 1:32:52 Yeah, and again, you can also go too far in rethinking. If you keep rethinking but never build anything, that's also a rabbit hole, right? Don't jump into that. But hopefully this is enough for people to think about. And again, most of the tools we want to have haven't been built yet; that's also a thing to remember. So, yeah, enough stuff to do, I would say. hugo bowne-anderson 1:33:20 So I'd like to wrap up. We've talked a lot about how to rethink the space we work in and certain failure modes. I'd love to know what you love about what you do, what we do, and what you're excited about in the coming months and years in the space. Speaker 1 1:33:43 So one thing is really particular to my current employer: I work at a company called Probabl, and a couple of the scikit-learn maintainers hang out there. Part of my job is to be kind of a dev advocate, but in this particular case I really get to dive into some of the scikit-learn code base and expose myself to decades of proper computer science. There are some real gems in there, and part of my job is to share some of those ideas and make sure they're properly archived and not forgotten. So personally, that is something I'm really looking forward to. My role is a little bit different, because I don't just do DevRel stuff. Usually what I do is hide in the cave for two weeks, make sure I've got a month of content, and then I hang out with another team and join them for a sprint. That's more my role at the company at this point, but I do enjoy being able to do that with the scikit-learn ecosystem. It's a really weird library, in the sense that it's rare that for a decade you keep having this feeling of: huh, scikit-learn can do that? I'm past the decade now, and I still have that feeling. So on the short term, that's what I'm really looking forward to. But also, as much as I really like the employer, I think my life is more interesting if I also do stuff besides just working for an employer, so I have personal projects that I'm really looking forward to. One thing the Hamel course did is motivate me to make a new version of bulk; Google it, and if you go to the repo, you'll find what it does. Bulk is going to get a really cool upgrade, so I'm looking forward to that. There's calmcode, now with a collaborator, and I'm really eager to keep doing stuff like that. I have the Data Science Fiction book, and I have stuff like this for the Probabl live stream, and I think something about that chaotic mix is super cool; it really blends well with the family situation as well. So honestly, I'm just looking forward to continuing to ride this wave. That's the main thing. hugo bowne-anderson 1:35:44 We have a lot in common, my friend; that feeling of the chaos and all. Unknown Speaker 1:35:49 Yeah, that productive chaos works really well for me right now. Awesome.
hugo bowne-anderson 1:35:55 And I've put links to Probabl, to your Twitter, and to calmcode in the show notes and in the chat as well. I am interested if there's anything else you'd encourage people to do, or a call to action for listeners who want to find out more Speaker 1 1:36:10 and do more. There are two things, I guess, on the short term. PyData Eindhoven is happening next week, and I will be speaking there. You're going to see a very similar demo, maybe, but I can reveal the title of the talk hugo bowne-anderson 1:36:23 first, here on Vanishing Gradients, yes. Speaker 1 1:36:27 So the title of the talk is "scikit-learn can do that," because that's the feeling I want to describe. I'm going to go a little bit in depth into some of the details; there's one trick that I think more people should just know for their day to day, and there's one other thing that scikit-learn does that's hidden that I want to talk about. So if that sounds interesting, check it out. hugo bowne-anderson 1:36:46 Come on, man, I can't make it to Eindhoven. Speaker 1 1:36:50 The PyData video's going to be online. hugo bowne-anderson 1:36:54 I'll bite my tongue, then. Speaker 1 1:36:57 But the PyDatas are really fun, and I've had a small role in kickstarting PyData in the Netherlands, so for me it's also really cool to hang out there just to see what the new generation of PyData people are doing. It's pretty epic. Another thing that might also be good: I promised Michael I would mention it. Michael from Talk Python, that's another one everyone should check out. hugo bowne-anderson 1:37:17 I mean, if you're listening to this, you probably know of Talk Python To Me, but if you don't, Michael's, Speaker 1 1:37:22 Michael's really cool. And also Python Bytes, which he does with Brian Okken, similar story: really cool, relatively calm, likable podcasts. I have collaborated with him, so the final thing I did as I was leaving Explosion and spaCy and joining scikit-learn and Probabl was work on a course. You can check it out, no pressure; if you do end up buying it, it's relatively cheap, and it also helps support the Talk Python podcast, which is something I really enjoyed about working with Michael there. So that's a call to action I can come up with. Other than that, if I can give the best call to action: I would love for people to blog more. If you have a feeling of "hey, today I learned something," just write that paragraph down and make sure people can find it. Twitter still has some Twitter vibes to it, but it's not the same in terms of cool information floating up. One thing we could do more is have good blogs with RSS feeds, and I've noticed that more people are doing that actively. But if I can motivate you to do something similar: I wish more people had really solid data science blogs, like "hey, here's a cool trick," or "here's a weird little scikit-learn thing." Probabl employs someone to just focus on the scikit-learn docs; that's a podcast for another episode, because there are lots of stories there. But even employing someone to try to share all the lessons out there is not enough.
If the community could help out by just sharing lessons they've learned individually, I think that would be really cool for everyone. hugo bowne-anderson 1:39:00 I totally agree. And I don't know who out there thinks this, but I've had the challenge of thinking that to write a blog post I need to write something very meaty and substantial, whereas you can just write a quick little thing about something you learned as well. And actually, Vincent, when you and I hung out in person a couple of months ago in Amsterdam, we talked about how Simon Willison is so prolific. Do check out Simon Willison: his LLM CLI project, his Datasette project; he's also a co-creator of Django, among other things. Unknown Speaker 1:39:33 Minor thing. Just casually. hugo bowne-anderson 1:39:36 I asked him how he did so much, and he said there are two things. One, he has his TIL blog, his "Today I Learned" blog, where he can just put something he learned; it doesn't need to be new or novel, he just dumps it there. The other thing, which you spoke to earlier, is that he often will only engage in a project if he knows it'll be a small project that takes a short amount of time and can be time-boxed, so he can write it up quickly. So these types of tools are useful for structuring work as well. Speaker 1 1:40:05 Another fun one, okay, I'm going to make another pitch. I wrote a library a while ago called doubtlab, and the library is all about doubting labels and giving you a little lab to do that in. There are really just a couple of functions with tricks, like comparing two model outputs or finding outliers, and the point is to find reasons to doubt a label, to help you prioritize which labels to check first. That's the goal of the library. And what I think is the main contribution anyone can make to that library is to use some of the tricks to actually find bad labels in actual public data sets. The tiniest blog post about that would be the best contribution you can make to the docs. But you don't have to do it on my blog; you can also do it on your own. If you can find a data set that is widely used, and you can find some label errors in it, that's usually a fun story. Another thing you can do, and this is how I got started with talking at conferences: there tend to be a couple of these data sets about video games. Can you data-mine your way into an edge? That's a fun thing to do, pulling data down from an API. Something I did when I got started, back in the day: World of Warcraft actually had their auction API public. I wanted to learn Apache Spark, so I needed a big data set, and so I downloaded gigabytes of World of Warcraft auction house data in order to do economic analysis between auction house servers. What influences the price of a potion, so to say; silly stuff like that. Do not underestimate how fun those conference talks are, or how fun those blog posts are. Another thing I did: Unknown Speaker 1:41:57 you know the Lego minifigures? Those are collectibles, right? If you buy 16 packs, what's the probability you get a full set?
hugo bowne-anderson 1:42:08 How many are there? Speaker 1 1:42:09 Sixteen, I think; there are 16 in a collection. But there's an interesting math or simulation exercise here: how many packs do I have to buy in order to guarantee that I've got five full sets? hugo bowne-anderson 1:42:20 Yeah, I mean, you do the one-minus trick, right? And then you Speaker 1 1:42:25 Okay, but now say: how many full sets do I have when I buy 200 packs? It gets subtle; you need a different formula for that, which I think is intractable, but you can simulate it very easily, and then you can actually decide for yourself: does it make sense to invest in Lego minifigures with a view to selling them 10 years down the line on eBay, because they are collectibles? Silly, tangible things like that really do go a long way. Also, a pro tip for any blog post or conference talk in general, from back when I was reviewing conference talks for PyData: a talk has to prove two things to me. Is it enjoyable, as in, do I think it's interesting? And will I learn something? If it does those two things, you're good, because most talks and blog posts don't do both at the same time. Even if I learn only a little, and even if I think it's only a little bit interesting, that's sufficient for me to want to read it. Yeah, totally. And come up with a good title, sure. But still, a lot of this stuff takes time; I don't want to suggest it's super easy, but it might be easier than you think. And you might also want to appreciate how crucial it is to keep good ideas coming: if we want good ideas to come around and keep happening, we have to share them, right? That is something that needs to happen. So again, I implore you: please do interesting stuff and blog about it, or talk about it, or something like that. The world definitely could use more of that, and feel free to focus on stuff that you are super nerdily interested in. Those are usually the cool things to read. hugo bowne-anderson 1:44:07 I encourage you all to do the same. Normally, if someone came on the podcast and said the world doesn't have enough content, I'd say: get out of here. But in this case, we are talking about writing and communicating things that increase the signal-to-noise ratio, which is one of the most important jobs we can do amid the noise currently. Speaker 1 1:44:30 Yeah, and don't be the influencer, be the nerd. Be the person who's celebrating that you're super excited about this one little weird-ass detail. I was talking to a guy the other day about knitting machines, how they can knit a figure into a fabric. Okay, can you come up with an algorithm that can do that with the least amount of yarn? A super hard problem, right? Super interesting, because you also want it to look pretty, and there are all sorts of interesting constraints there. But again, I encouraged the person to definitely blog about it, because that stuff is super duper interesting. I wish more people hugo bowne-anderson 1:45:07 would do it. And for anyone still listening, Vincent has just given us ten ideas of questions to ask and figure out and write about as well. Speaker 1 1:45:17 Yeah, figure out what makes you tick and go for it.
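For anyone who wants to chase the minifigure question from a few minutes ago, here is a quick Monte Carlo sketch along the lines Vincent describes. The 16-figure series size comes from the conversation, and the uniform blind-pack assumption is a simplification.

```python
# Coupon-collector style simulation: 16 distinct figures, packs drawn
# uniformly at random.
import numpy as np

rng = np.random.default_rng(0)
n_figures, n_sims = 16, 10_000

def complete_sets(counts: np.ndarray) -> int:
    """Number of full sets you can assemble from per-figure counts."""
    return int(counts.min())

# Question 1: how many complete sets do you end up with after buying 200 packs?
draws = rng.integers(0, n_figures, size=(n_sims, 200))
sets_after_200 = [complete_sets(np.bincount(d, minlength=n_figures)) for d in draws]
print("average complete sets after 200 packs:", np.mean(sets_after_200))

# Question 2: how many packs until you hold five full sets?
packs_needed = []
for _ in range(n_sims):
    counts, packs = np.zeros(n_figures, dtype=int), 0
    while counts.min() < 5:
        counts[rng.integers(0, n_figures)] += 1
        packs += 1
    packs_needed.append(packs)
print("median packs to reach five full sets:", np.median(packs_needed))
```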
hugo bowne-anderson 1:45:24 I'd like to thank you once again, and I'd like to thank everyone who's joined. There are 25 people who are still here and have been with us for over an hour and a half; it's getting up to two hours, which is super awesome. Please do check out calmcode, check out Probabl, check out Vincent on LinkedIn and Twitter. Someone has mentioned they're excited for the PyData Eindhoven event as well. Thank you all for sticking around. But most of all, Vincent, I always love chatting with you, I always learn a lot, and I'm always entertained as well. Thank you most of all for your wisdom, for doing the work you do, and for bringing back all the battle stories. Speaker 1 1:46:06 My pleasure. Thanks for doing the podcast. There are not a lot of solid data science podcasts, and this is hugo bowne-anderson 1:46:10 one of them. I appreciate it. So thanks, everyone. And oh yeah, once again, like and subscribe, of course, if you're still around. And what I'm also going to do, since we have these live streams quite often now, is put a link to the Luma calendar with our events on it in the chat as well; it was on my checklist to do at the start. We've got a couple coming up, and actually, you mentioned Hamel and Dan's course: I've got one in a couple of weeks with them about what they learned teaching LLMs to thousands of people. Speaker 1 1:46:44 One thing, it's kind of, you remember NormConf? Yeah, I do. Same vibe. The thing that they did had a NormConf vibe to it. I did not expect that in LLM land, but it was an unexpected hit; at some point, everyone just kind of jumped in on it. Something about that also made hugo bowne-anderson 1:47:04 it a really cool event, absolutely. And of course the credit goes to them, but also, to your point earlier about blogging and communication and having searchable communication, the Discord for this course is a phenomenal resource; loads of stuff in there. Speaker 1 1:47:17 Yeah, I'm not a huge fan of Discord and the way the app works, but I'm still eager to hang out there. hugo bowne-anderson 1:47:22 Definitely. I'm not a huge fan of many apps and the way they work. Speaker 1 1:47:26 Yeah, notifications are crazy on that thing; you really have to turn a lot of stuff off. But I did learn a bunch of stuff there, so that's definitely cool. Looking forward to that podcast; sounds like fun. hugo bowne-anderson 1:47:36 Likewise. All right, everyone, thanks once again, and we'll see you in the next episode. Transcribed by https://otter.ai