Hello friends, we are back with episode 117 of the R Weekly Highlights podcast. I am personally fresh from quite an adventure out in the Great Smoky Mountains, so I'm unwinding a bit, but it's great to be back in the swing of things, and to help me do that, of course, is my great co-host, Mike Thomas. Mike, how are you doing this morning? I'm doing well. Not quite as traveled as you lately, but looking forward to the warm weather this summer getting me out there a little bit more, and as you know, we have some travel to do together later this year. You bet. You bet. Luckily, the travel I'll do to the conference is much less than what I just did. It'll be a three-hour drive for me. You'll have a short flight, but yeah, we are very much looking forward to Chicago in September for the Posit Conference, and certainly, while we're on that subject, it's not too late to register for our Shiny in Production workshop, so if you want more info on that, you can just search for posit::conf 2023. You'll find the registration right there, and we'd be happy to have you. Shameless plug. You know, that's what we do here, but we don't just plug our stuff, of course. We're plugging R Weekly, because R Weekly is the reason we're here, and our issue this week was curated by Jon Calder, another fantastic release, and as always, he had tremendous help from our R Weekly team members and contributors like you all around the world. And we're going to get really geeky with this one, because our first highlight is a GitHub gist, but it's got a lot packed in it, and it is from a good friend of ours who's been very active in the Shiny and R communities, Dr. Péter Sólymos, who I was very fortunate to have join me in the recent ShinyConf Shiny Dev Series live recording, which hopefully we'll have published very soon. He was amazing with this post, which talks about how we can do web APIs in R. We have talked a lot about the plumber package, for good reason: it has become a huge part of how R can stand on, in essence, equal footing with all these other web technologies, exposing anything from simple to very sophisticated processes as a web API. But plumber was not the first to do this, folks. In fact, this gist is a great stroll down memory lane for me, because there is a lot here: Peter has 10 examples of producing web APIs from R, all with different frameworks. The first one that caught my eye (this isn't quite the chronological order of the gist, just how I scrolled through it) is a package called Rserve. This is authored by Simon Urbanek, who, if that name sounds familiar, is a member of the R Core Team, so there's some great cachet right there. He released Rserve, get this, in 2006. That is, if my math is correct, 17 years ago R had this capability. Can you believe it? It's just crazy, isn't it? I remember experimenting a little bit with Rserve back in my grad school days, when we were trying to put together a simple mix of PHP, MySQL, and a little bit of R analysis for some local parks to serve up their data and give them some insights. I admit I was way over my head and didn't know how any of this stuff worked, but knowing what I know now, it doesn't seem too bad. And then there are others that struck my fancy a bit. Rook is another package, released in 2011. It's actually been forked more recently; it was originally authored by Jeffrey Horner, and it is based on the use of reference classes, or RC.
You don't hear about those as much as the R6 and S4 paradigms, but if you ever want to know how you could do web APIs that way, Rook is an interesting thing to explore as well. And then I caught OpenCPU. This is authored by Jeroen Ooms, who we just talked about a few weeks ago as the architect behind the R-universe infrastructure. I always thought to myself that OpenCPU was really kind of ahead of its time; it can do all sorts of things from a web API standpoint. I think it may have been a precursor to his adventures with R-universe, though of course don't quote me on that. Jeroen, if you ever want to write back to us, let us know, but I think you definitely got some inspiration from your adventures with OpenCPU, because it is amazingly intricate and very innovative. And then, of course, there are some other nice ones here too, such as Thomas Lin Pedersen's exploration of web APIs with the fiery package. That's an awesome name, fiery, I love it. Right off the bat, Thomas is quick to say that this is different from Shiny in that it's very general: you have to do more of the heavy lifting, but with that you get more flexibility, in his opinion. So it is an interesting take on how you could use fiery both in a Shiny-like app context and, as in this context, for just a simple web API. You can do either one; it's up to you to stitch it all together. And so the whole gist is basically the same hello-world type example, but coded in 10 different packages. Then at the end you get a little comment in the gist that lists the release dates of each of these packages, and that's where, as if I didn't feel old enough with Rserve being out in 2006, you go all the way to the more recent adventures of John Coene with the ambiorix package, which I believe has been archived actually, but it was an adventure nonetheless. So there's a great timeline there at the end, and he even has a nice little ggplot snippet, which you can run in your R console, showing the historical download stats of what's been happening here. Really awesome; I'd say just run with it, and you get a nice comparison of where we've come as a community with this technology. I really enjoyed seeing all the technical bits here and comparing and contrasting how you can do web APIs in R with many of these packages. Well, you covered quite a bit of it, and yeah, I couldn't agree more. There are some big, big names behind these different API framework R packages, as you mentioned, and I thought it was really cool that Peter took the time to put this all together, because I think it's a great demonstration of all the different options we have and the different ways of doing the same thing, right? One might be a little more tailored to your use case than another, depending on what you're trying to accomplish, so it's nice to have the different options out in front of us. And for sure, it includes some of the better-known R packages, such as plumber and httpuv, that I'd seen before, but also so many others that I did not know about, including beakr, Rook, and OpenCPU, as you mentioned, Rserve, and RestRserve as well, which maybe is a more recent incarnation of Rserve. So really, really interesting.
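For a flavor of what the gist's examples look like, here is a minimal hello-world sketch using plumber, one of the frameworks discussed above. This is only in the spirit of the gist: the file name, route, and port are illustrative assumptions, not Peter's exact code.

    # plumber.R -- a minimal "hello world" web API in R (illustrative sketch only)
    library(plumber)

    #* Return a friendly greeting
    #* @param name who to greet (defaults to "world")
    #* @get /hello
    function(name = "world") {
      list(message = paste0("Hello, ", name, "!"))
    }

    # In a separate script or console, serve it on a port of your choosing:
    # pr("plumber.R") |> pr_run(port = 8000)
    # Then try http://localhost:8000/hello?name=R in a browser or with curl

Each of the other frameworks in the gist implements essentially this same endpoint, which is what makes the side-by-side comparison so useful.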
And APIs, at a higher level, are just really a phenomenal way to wrap your model's logic into something that folks across your organization can use from their other applications without needing to know anything about R. Some people have strong opinions against microservices, but I am not one of them. And it doesn't even have to be a model, right? It can be any R logic that you wrap up into an API for others to leverage. So again, it's really nice to see all these different options in front of us and know that there are so many ways to stand up an API from R, and not everything needs to be deployed with Python. Oh, hot take number one of many, perhaps, I don't know. But no, I've lived that life as well in recent years: sometimes I'll put a simple API in front of a complicated database, and that way the user doesn't have to care about database credentials. I take care of all that on the back end; they just have to call a little GET request, and I'm able to get them the info they need and protect everything else. The art is stitching it all together. I think separating business logic into fit-for-purpose pieces is kind of like the Linux philosophy: one tool that does one thing and does it well. You can have microservices that each do one thing and do it well, but again, the art is how you orchestrate them together so that it's easy for the users of your overall app to work with, but also so that you, as the developer or the team of developers, can maintain those pieces and develop them in sync. Those are separate issues, of course, but you can have as much flexibility as you want with these frameworks, and that's very important in my day-to-day job right now. Yes, and shout out to a project that I actually worked on with Peter, which was a Shiny app that included the ability to download, I think, TIFF files or raster images or something like that from within the app; those links in the app called out to an API that he had developed to do exactly that, query a database and download just the data that you wanted. Oh, yeah. I'm a little jealous. I'd love to collaborate with Peter on a project someday. That sounds like a lot of fun. Yeah. I guess I'm kind of curious now; I've got to figure out what framework he used for standing up that API. Yeah, you've got 10 possibilities to choose from, I guess, if you're guessing. Now you mentioned, Mike, trying to wrestle some data files from a remote source or another location. Well, that's a good transition to our last highlight of today. You've got your data, and it could be stored locally, on a web server somewhere, or somewhere in between. What are some interesting and hopefully easier ways to get it into R so that you can actually get to your data analysis? Our next highlight is a blog post authored by Kieran Healy, a professor of sociology at Duke University, all about reading, in this case, remote or local data files; here, we're talking about some congressional-type data. He starts with case one: you have a bunch of CSV files that, for whatever reason, are stored as separate files. This might be the easiest case to deal with, because of how functions from the readr package, like read_csv (note that's different from base R's read.csv), handle their input.
You can pass them a whole vector of files to read from. Well, geez, all you have to do then is make sure you can get the paths to your files, maybe from a directory somewhere, feed them into read_csv, and Bob's your uncle, as they say: you've got all the data tidied up and bundled together. Now, this is where we take a turn. We're about to drive into another curve here, Mike. It's time for Excel. Do you have your seatbelt on? Yeah, I might have to put two on. Yeah, me too. This is giving me flashbacks to my recent road trip: we were driving up this mountain area with high winds, and I'm like, oh, is our car going to tip over? Well, Excel sometimes makes you feel that way too, because you never know what you're going to get. Unfortunately, with Excel, not everything is as easy as it was with the CSV version. Yes, there are packages to read Excel files into R; you almost have to have them at this point, so one way or another you can deal with Excel. But read_xlsx from the readxl package, unfortunately, does not bind multiple files together at once. So what can you do when you have a directory of Excel files stored somewhere and you want to minimize the number of times you call read_xlsx yourself? That's where our friend the purrr package comes in: with map_dfr, the easiest way is to feed in those file names and bind the results together by row. Now, that's all local. The big picture here, though, is about what happens when the files are remote, right? Even today you might run into a website that looks just like an index of links. This is actually called a directory listing, much like you might see on your own computer, but in web form. The example in the post involves a whole bunch of Excel files stored on an FTP front-end site from the CDC. Now, how can you get these? Well, this is where you have to get a little creative, because you notice that all these links have a pattern to them: they share the same base address, the file name prefix might change, but they have the same suffix. The way Kieran tackles this is that you can actually scrape the page using rvest to grab all the hyperlinks, which are of course the a tags in web lingo, then grab the href of each of those links, and with a little massaging get the file names with the right extensions. Now you've got that remote set of file names. Had these been CSVs, you'd be done, because you can read remote files directly with read_csv; it doesn't care whether a file is local or remote as long as you get the path right. But this is where it gets tricky with Excel: you can't quite do that so easily. So Kieran uses a neat trick, which admittedly I've used in a production setting in a much different context: you temporarily download the Excel file to a temp location and make sure that the file name, no matter what cryptic thing gets put in front of it, has the right extension at the end. That's important, because readxl will not be happy with you if it's not an .xlsx or .xls file. Once he does that little bit of magic, wrapped in a nice little utility function he calls get_lifetable, he can simply call readxl's read_xlsx on that temp file and import it.
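To make that workflow a bit more concrete, here is a rough sketch of the pattern described above. The directory paths, the listing URL, and the helper name are illustrative assumptions (the real CDC links would need the href handling adjusted), so treat this as the shape of the approach rather than Kieran's actual code.

    library(readr)   # read_csv
    library(readxl)  # read_xlsx
    library(purrr)   # map_dfr
    library(rvest)   # read_html, html_elements, html_attr

    ## Case 1: a folder of CSVs -- read_csv() accepts a vector of paths
    csv_files <- list.files("data/csv", pattern = "\\.csv$", full.names = TRUE)
    all_csv <- read_csv(csv_files)

    ## Case 2: a folder of Excel files -- read_xlsx() takes one file at a time,
    ## so map over the paths and row-bind the results
    xlsx_files <- list.files("data/xlsx", pattern = "\\.xlsx$", full.names = TRUE)
    all_xlsx <- map_dfr(xlsx_files, read_xlsx)

    ## Case 3: remote Excel files behind a directory-listing page
    ## (listing_url is a placeholder, not the real CDC address)
    listing_url <- "https://example.org/lifetables/"
    links <- read_html(listing_url) |>
      html_elements("a") |>
      html_attr("href")
    xlsx_links <- paste0(listing_url, grep("\\.xlsx$", links, value = TRUE))

    ## readxl can't read from a URL, so download each file to a temp file
    ## that keeps the .xlsx extension, then read it
    read_remote_xlsx <- function(url) {
      tmp <- tempfile(fileext = ".xlsx")
      download.file(url, tmp, mode = "wb")
      read_xlsx(tmp)
    }
    all_remote <- map_dfr(xlsx_links, read_remote_xlsx)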
These are tricks that those of us who have been in the trenches will have to pull off at some point, but luckily the example is very easy to follow in this blog post; it narrates all the different steps very well, along with why Kieran chose the tools he did. Once you've got it, you've got your tidy data once again and you're ready to do some analysis. So yes, Excel strikes again, but hey, it doesn't have to be too painful. You don't have to feel like you're driving up a 2,000-foot mountain to get through it. This blog post is a great way to see just what's possible, even when your files are not exactly stored in a convenient place. But Mike, what did you think about this adventure? Yeah, I think it's a great demonstration of how, as data scientists, we usually have to employ a few different tools to get to the end of what we need to accomplish, and that's very well demonstrated in this blog post: the use of rvest to scrape those links, and the use of curl to create a connection to the FTP site to retrieve the file names and control what Kieran gets back. I was going to say, you know, leave it to the life sciences industry not to provide an API to their data, but instead an FTP site with a bunch of Excel files. Am I right, Eric? Oh, yeah. You couldn't have said it better. Oof. That hurts. Sorry. I'm just kidding. No, but it's really beautifully concise code that Kieran puts together, and that helper function, get_lifetable, is really nice. There's assignment inside of a function, which is something you don't see out in the wild too often, but I can definitely see what he's going after there. He employs the httr package, and temp files as well, like you said, to download the data temporarily, then reads it into a data frame with read_xlsx from the readxl package. And then, just like you said, at the very end, what sound does a cat make? We are employing purrr: this one-liner is essentially a map_dfr across that get_lifetable function he custom created, and it couldn't be easier at the end to concatenate all that data together into a single table. So, a really well-articulated blog post, and I think it's a use case that probably all of us have encountered at one time or another. It's going to be a great reference for me to have in my back pocket for the next time I inevitably run into a bunch of Excel files. Yeah, you can't avoid it. It's like having clouds in the sky: they're going to happen someday. But what I really like is that the post shows how, if you're encountering this situation, you start to see a pattern in how things are similar, in this case the links, right? If you see that they're similar, stop yourself from going copy, paste, copy, paste; just put the brakes on for a second, because in this case it would have been 50 or so of those copy-pastes, a file for each state. You could have just one typo somewhere and then it's all going to come crashing down. So if this is part of a process that you want to automate, and hopefully have hands-off once you initially develop it, you want to automate as much as you possibly can and avoid the manual effort. I don't think you need to hear us tell you that twice, but take it from someone who's been down that trap of trying to get something done quickly and ended up shooting myself in the foot, so to speak, because I had one error, one typo, and it was so difficult to track down.
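As a footnote on the curl piece Mike mentioned, a bare-bones way to pull a file listing from an FTP directory with the curl package might look like the sketch below; the address is a placeholder, and the specific handle options Kieran used may well differ.

    library(curl)

    # Placeholder FTP address, not the real CDC location
    ftp_dir <- "ftp://ftp.example.org/lifetables/"

    # dirlistonly asks the server to return file names only, one per line
    h <- new_handle(dirlistonly = TRUE)
    con <- curl(ftp_dir, open = "r", handle = h)
    file_names <- readLines(con)
    close(con)

    # Paste the names back onto the base address to build full file URLs
    file_urls <- paste0(ftp_dir, file_names)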
So if you can scrape or find some logic to piece the paths together, like Kieran does here, you're going to be much better off. Again, just take it from somebody who's been through it the hard way. Absolutely, couldn't agree more. And one note, just because I think I got burned by this recently: if you absolutely need to download a file first before you read it into a data frame, please use the tempfile function from base R. Don't create a new directory in whatever working directory your user happens to be in at that point in time. Oh, you're not the only one who's been down that trap before. One of my most well-known internal packages was doing that for everything when I didn't know any better, but that's been changed since, because, you know, you live and learn. There's a lot of learning to be had, folks, in the rest of this week's R Weekly issue, and Jon, like I said, has done a tremendous job. So we'll take a few minutes here to call out some additional finds, and for me it's more of a call to action, you might say. Oscar Baruffa, if that name sounds familiar, is the one who authored the Big Book of R site, which has been featured many times in previous episodes and is a wonderful resource for people to get their learning on across different subjects. Well, he's looking for community input on a possible proposal that he wants to send to the R Consortium, which, if accepted, could become a sanctioned project with funding for additional development to basically supercharge much of the infrastructure and overall quality of the Big Book of R. He's got a pretty hefty wish list, but if this gets selected, he would like to use the community input he gets from this call to action as part of his proposal. So we'll have a link in the show notes; if you want to get in touch and send your feedback to Oscar, he would greatly appreciate it. I did have a look at some of the things he's asking to work on in this R Consortium proposal. There are very important items in my opinion, like having a robust database structure for the content that he could do a lot more with, and he also wants to port it to Quarto. Why not? That's a big-ticket item. He has a big wish list here, but I think the last item might be the one that most got my attention: imagine if, when you click on a book, you could see its table of contents right in the Big Book of R resource. That could be a huge win for usability and user experience, but it's going to take some infrastructure enhancements and engineering development to make happen. So, you know, I wish Oscar the best of luck. Hopefully this little plug will get some more eyes on this and some more community input, and we look forward to hopefully seeing some great improvements to the Big Book of R in the near future. I feel like there could potentially be some overlap between what's happened with R-universe from a technological standpoint, in terms of doing a great job at representing a ton of information, and what the Big Book of R, at a high level, is sort of doing as well, right? Trying to do a great job at encapsulating a ton of content and resources. So I think we've got some momentum on this topic from what's gone on with R-universe, and hopefully Oscar can catch some of those tailwinds as well for this project.
Very cool. Yep. Well, Mike, what did you find this week? I found two that I wanted to call out. One is something that I think is talked about quite often, and folks have quite varying opinions on it from time to time: balancing classes in classification problems. It's a blog post by Matt K., and the subtitle is essentially why that's generally a bad idea. He looks at it from a few different use cases, using all of your favorite tidyverse packages, to explore differences in prediction and accuracy, as well as different resampling approaches like SMOTE for balancing and rebalancing training data, and to see how that impacts the actual accuracy of the predictions at the end of the day. There's some really great discussion in this blog post about when it may make sense to rebalance and when it might not. So that was one highlight that I found. The second one is an interview with Woojoon Jung, the founder of the R Korea group, who was interviewed by the R Consortium to discuss his efforts on using R in finance and accounting in Korea. I feel like we hear a lot about the life sciences and other sciences where R is used heavily, and finance and accounting maybe don't get quite as much recognition. This could be biased a little on my end, because I do a lot of work in that field, but I thought it was nice to see this interview and Woojoon Jung's perspective on how R is being leveraged currently, and hopefully in the future as well, in this particular industry. Yeah, that's something I'm going to be looking at as well. I've greatly admired many of the prominent members of the R finance community, and in fact, one little fun fact: I believe the author, or at least one of the main authors, of the VS Code R extension, Kun Ren, is heavily involved in finance as well. So yeah, I've greatly admired that work, and I'll be reading that interview after we record this one, I guess, in my downtime. Very, very nice. And what else is nice? Well, again, the rest of the issue, and of course we will have links to everything we talked about in the show notes. We always love hearing from you. If you want to get in touch with us about the podcast itself, we have a little contact page linked directly in the show notes. As well, if you have one of those fancy new modern podcast apps like Fountain or Podverse, you can send us a little boostagram inside the app itself to give us a nice message right in your podcast player; it's very easy to connect with a service called Alby, and I have links to how you can do all that in the show notes as well. And of course, R Weekly loves your contributions. We are just a pull request away from the markdown draft of the current issue. If you find a great link to a package, a great blog post, an interesting find, or a great community resource, we'd love to hear about it. So go to rweekly.org; you're going to see a link to the draft right there and a way to quickly send a pull request. We'd love to hear what you have to share with the R community. And you can also get in touch with us directly. I am still sporadically on Twitter at @theRcast, but I'm also on Mastodon at @rpodcast@podcastindex.social, causing all sorts of mischief over there. Mike, where can people find you? Yes.
Likewise, I find myself using Twitter less and less these days; the content just seems to be less and less data science in terms of what I'm seeing on my timeline, for whatever reason. So I am trying to be more active on Mastodon, and you can find me over there at @mike_thomas@fosstodon.org. And by the way, we do want to give a great shout out to James Wade, who very much surprised us last night, as we record this, with an enhancement to his gpttools package, which we just talked about recently: apparently now we can transcribe our very own podcast with gpttools. Can you believe that? You don't have to listen to us anymore. Now, wait a minute, please still download our episodes. But yes, I may have to put this into production; this is awesome stuff. That space is evolving quite rapidly, and he's got links to that and to the other examples that have come into gpttools, so it's worth checking out as well. Yes, thank you, James, and thanks for the shout out. We appreciate it, and it's great to see the innovations in this space. Well, that's going to wrap up episode 117. My little one is frantically pulling my chair away, so that means I've got to go, but we'll be back with episode 118 of R Weekly Highlights next week.