Yes, I’m ready to touch the hot stove. Let the language wars begin.
Actually, the first thing I’ll say is this: Use the tool you’re familiar with. If that’s Python, great, use it. And also, use the best tool for the job. If that’s Python, great, use it. And also, it’s OK to use a tool for one task just because you’re already using it for all sorts of other tasks and therefore happen to have it at hand. If you’re hammering nails all day, it’s OK to also use your hammer to open a bottle of beer or scratch your back. Similarly, if you’re programming in Python all day, it’s OK to also use it to fit mixed linear models. If it works for you, great! Keep going. But if you’re struggling, if things seem more difficult than they ought to be, this article series may be for you.
I think people way over-index on Python as the language for data science. It has limitations that I consider quite noteworthy. There are many data-science tasks I’d much rather do in R than in Python. I believe Python is so widely used in data science because of a historical accident, plus its being sort of OK at most things, rather than because of any inherent suitability for data-science work.
At the same time, I think Python is pretty good for deep learning. There’s a reason PyTorch is the industry standard. When I’m talking about data science here, I’m specifically excluding deep learning. I’m talking about all the other stuff: data wrangling, exploratory data analysis, visualization, statistical modeling, etc. And, as I said in my opening paragraphs, I understand that if you’re already working in Python all day for a good reason (e.g., training AI models) you may also want to do all the rest in Python. I’m doing this myself, in the deep-learning classes I teach. This doesn’t mean I can’t be frustrated by how cumbersome data science often is in the Python world.
Observations from the trenches
Let’s begin with my lived experience, without, for now, offering any explanation of its cause. I have been running a research lab in computational biology for over two decades. During this time I have worked with around thirty graduate students and postdocs, all very competent and accomplished computational scientists. The policy in my lab is that everybody is free to use whatever programming languages and tools they want. I don’t tell people what to do. And more often than not, people choose Python as their programming language of choice.
So here is a typical experience I commonly have with students who use Python. A student comes to my office and shows me some result. I say, “This is great, but could you quickly plot the data in this other way?” or “Could you quickly calculate this quantity I just made up and let me know what it looks like when you plot it?” or something similar. Usually, the request is for something that I know I could do in R in just a few minutes. Examples include converting boxplots into violins or vice versa, turning a line plot into a heatmap, plotting a density estimate instead of a histogram, performing a computation on ranked data values instead of raw data values, and so on. Without fail, the response from the students who use Python is: “This will take me a bit. Let me sit down at my desk and figure it out, and then I’ll be back.” Now let me be absolutely clear: these are strong students. The issue is not that my students don’t know their tools. It very much seems to me to be a problem of the tools themselves. They appear to be sufficiently cumbersome or confusing that requests I think should be trivial frequently are not.
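To make the boxplot-to-violin example concrete, here is a minimal sketch of what such a swap looks like in R with ggplot2. The data frame `df` and its columns `group` and `value` are hypothetical stand-ins used purely for illustration, not taken from any real analysis; the point is that only the geom changes between the two plots.

```r
library(ggplot2)

# Hypothetical data frame `df` with a categorical column `group`
# and a numeric column `value`.

# Boxplots of value by group
ggplot(df, aes(x = group, y = value)) +
  geom_boxplot()

# The same data shown as violins: swap out a single geom
ggplot(df, aes(x = group, y = value)) +
  geom_violin()

# A density estimate instead of a histogram works the same way
ggplot(df, aes(x = value)) +
  geom_density()   # instead of geom_histogram()
```

The plot specification (data, aesthetics) stays untouched; the representation is a one-line change.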
No matter the cause of this experience, I have to conclude that there is something fundamentally broken with how data analysis works in Python. It may be a problem with the language itself, or merely a limitation of the available software libraries, or a combination thereof, but whatever it is, its effects are real and I see them routinely. In fact, I have another example, in case you’re tempted to counter, “It’s a skill issue; get better students.” Last fall, I co-taught a class on AI models for biology with an experienced data scientist who does all his work in Python. He knows NumPy and pandas and matplotlib like the back of his hand. In the class, I covered all the theory, and he covered the in-class exercises in Python. So I got to watch an expert in Python work through a range of examples. And my reaction to the code frequently was, “Why does it have to be so complicated?” So many times, I felt that things that would be just a few lines of simple R code turned out to be quite a bit longer and fairly convoluted. I definitely could not have written that code without extensive study and a complete rewiring of my brain in terms of which programming patterns to use. It felt very alien, but not in the sense of “wow, this is so alien but also so elegant”; rather, “wow, this is so alien and weird and cumbersome.” And again, I don’t think this is because my colleague is not very good at what he does. He is extremely good. The problem appears to be in the fundamental architecture of the tools.
Some general thoughts about what makes a good language for data science