Introducing "Smash Data" and a Book Review

Welcome to "Smash Data"

Oct 08, 2023

It took me a while to come up with a good name for this blog, but I did it! Introducing, SMASH DATA!

I really, really like the name, because it’s what we do. We data professionals are either part of managing data or part of creating and using data, and creating and using often go together. We smash data to create new data in a data refinery, for instance. For those of us in data science (or any science), data engineering, or software engineering, the common denominator is that we “SMASH DATA” and use it to do something useful. For more on that, you can read my old post WTF Even is Data Science.

We are data smashers! It’s what we do! For the kickoff of this blog, I’m going to review a book. I very often review books that I buy or people/companies send me. I amplify good information, and do my best to smash noise.

I am a Packt Author

Many of these will be Packt books. Packt sends me cool books relevant to my domains (software engineering, cybersecurity, and data science), and I am happy to review them. This relationship helps both of us. OTHER PUBLISHERS ARE WELCOME TO DO THE SAME, but they do not. Authors occasionally send me their books, and I’ll review them as I can find time, as well.

My approach: I review what is useful and good, and I don’t review what is low quality. I have a pile of books in my closet that I will never review. I don’t donate them anywhere, because I don’t want anyone to learn from them. I simply do not amplify bad. For instance, if a book pretends to teach Graph Machine Learning but is actually just a book about Neo4j, I’m not going to review it. Other people with lower standards can review it.

So, if I review a book, I like how the author presented the information, and I found it factual.

My reviews are honest. I am a Packt author, but that doesn’t mean that I will always agree with other Packt authors. Even in the books I find good or great, there are still parts that I will disagree with or want to discuss, and a blog is good for that. So, don’t be surprised if my blog posts are dead honest. Discussion helps improve information.

Need Help? Send it my way!

I will happily review books from my connections and from other publishers. That has always been the case, as I have shown in my previous reviews on LinkedIn. If you have written something that is in one of my domains (software engineering, cybersecurity, data science, network science), let me know if you want my thoughts! Just keep in mind it takes time to read books and do reviews. I’m not going to review everything. I don’t have time.

Blog Purpose

I enjoy reading and learning. I also enjoy building and solving problems. This blog gives me a place to write about what I am thinking about, what I am reading, and interesting ideas for building. This isn’t a “How to Learn Python” blog. I am an explorer. I get weird with data, using Natural Language Processing to identify context and Network Science to identify influence and communities, and I play with all kinds of cool stuff like Causal Discovery, Information Theory, Computational Linguistics and Humanities, and more. I like to get weird with data, not teach people how to do loops and if statements, or how to do object-oriented programming. Someone else can do that. I am a mad scientist. This blog gives me a creative space to explore data science and software engineering.

My other blog gives me a much more focused space to explore Network Science and Social Network Analysis. I need that one to stay focused on that topic, and this one to be my playground for other things.

Enough about me and the why behind this blog.

Book Review: Machine Learning Engineering with Python

You can get the book here!

Previously, in my career, I was a Senior Platform Engineer on McAfee’s AI Research team. That’s a fancy title, but it was essentially a Senior Machine Learning Engineer. I worked on the platform that supported large models and did work such as setting up retraining pipelines. So, this book is all about what I did in my previous job, and what I still do now.

Andrew P. McMahon has written a dense tome on the field of ML Engineering. At 400+ pages, there is plenty to learn from this book. I have read about a third of it so far, so I’m going to give my thoughts on what I have read. I can already tell that he knows this subject well, and I’ll return to this book as needed.

Before I start praising this book, I will say that the one downside so far is that each chapter has a bit too much going on, so I took a break whenever the author moved on to a new topic. Just go in prepared for that. The information is good; the chapters are just dense. That’s my only complaint. It doesn’t take away from the content; it just could have been broken up a bit more.

My review is going to be a discussion, not a feature list.

In chapter one, the author breaks the workflow down into “Discover, Play, Develop, Deploy”, and he shows that this simple workflow contains the important work specified in CRISP-DM. He compares his four step plan with CRISP-DM and describes the work done in each of the four steps.

For instance, the steps in Discover include:

Speak to the customer and then speak to them again
Document everything
Define the metrics that matter
Start finding out where the data lives

And Play includes:

Detailed understanding of the data
Building a working proof of concept
Agreeing on the model/algorithm/logic that will solve the problem
Collecting evidence that the solution is realistic
Collecting evidence that good ROI can be achieved

There are two other parts, but I’ll stop there. You should get the book and read it if you are new to ML Engineering or if you are an ML Engineer and want to learn how others are doing their work.

Personally, I have my own process that I like, and it has similarity with this plan. Mine is Plan, Collect, Experiment, Validate, Deploy, Monitor. In my approach, the Plan stage is all about figuring out what is needed and why, figuring out what data can be helpful and where it lives, and considering approaches that can be useful. Collect has to do with collecting data for building the model and setting up retraining pipelines. Experiment is where I build several models and compare them. In Validate, I spend a lot of time making sure that the model is actually working well enough, and I patch up holes. This is iterative. In Deploy, the model is added to a model registry, and then picked up for deployment. In Monitor, I keep an eye on the model and the infrastructure that supports the model.
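To make the Validate and Deploy steps a bit more concrete, here is a minimal sketch of a model registry with a validation gate. Everything in it is hypothetical and for illustration only: the class names, the "malware-classifier" model name, the scores, and the 0.95 threshold are all made up, and a real registry (a managed service or an open source tool) would do far more than this in-memory toy.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    name: str
    version: int
    metric: float           # validation score, higher is better (assumption)
    stage: str = "staging"  # staging -> production -> archived


@dataclass
class ModelRegistry:
    """Toy in-memory registry; a stand-in for a real registry service."""
    versions: list = field(default_factory=list)

    def register(self, name, metric):
        # Version numbers count up per model name.
        version = 1 + sum(1 for v in self.versions if v.name == name)
        entry = ModelVersion(name, version, metric)
        self.versions.append(entry)
        return entry

    def promote(self, entry, min_metric):
        # Validate gate: refuse models that miss the agreed threshold.
        if entry.metric < min_metric:
            return False
        # Archive whatever was in production for this model name.
        for v in self.versions:
            if v.name == entry.name and v.stage == "production":
                v.stage = "archived"
        entry.stage = "production"
        return True

    def production_model(self, name):
        # This is what a deployment job would pick up.
        for v in self.versions:
            if v.name == name and v.stage == "production":
                return v
        return None


# Experiment: several candidate models with made-up validation scores.
registry = ModelRegistry()
candidates = [registry.register("malware-classifier", m) for m in (0.91, 0.96, 0.88)]

# Validate, then Deploy: only the best candidate that clears the bar is promoted.
best = max(candidates, key=lambda v: v.metric)
promoted = registry.promote(best, min_metric=0.95)
```

The point of the gate is that deployment only ever sees what the registry marks as production, so a model that fails validation never gets picked up.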

I’m giving my approach not to say that I disagree with his but to say that we all do things a little differently. His plan has fewer steps, but more complexity in each part. My plan has more steps, but each step has less complexity. I like his plan, and it is a good plan if you don’t have one.

In Discover, he mentions “Speak to the customer! And then speak to them again!” It is important to get detailed requirements, but it is also important to get to the WHY. For everything someone wants you to do, ask WHY. It is often useful to ask WHY repeatedly.

“We need a model that can do this thing.”
“Why?”
“Because the CMO saw that somebody has a model that can do this thing and wants one too”
“Why?”

“We need a model that can predict malware with much higher accuracy.”
“Why?”
“Because low quality malware models break computers”
“Ok then!”

That might be the real why. It might not be an important model. Or, it might be a checkbox model. Even checkbox models are often prioritized. ML Engineers don’t usually decide company priorities.

The author also says to document everything, because you will be judged on how well you deliver against the requirements. This is true, but I want to add one more reason: it is a good defense strategy. In any kind of engineering, you’re going to be faced with some people giving you unrealistic requirements. You need to document their ask, and you need to document your response. You need to actively respond when someone tries to give you unrealistic requirements. Even if they do not intend to, they are setting you up for failure. You will run into this in Machine Learning, especially because many people do not understand the actual capabilities of ML.

There’s also lots of cool tech to learn in this book. For instance, by chapter 3, the author has already shown and described model registries, version control, retraining pipelines, drift detection, and explainability. You will need to know all of this. I personally loved these parts, as I had some catching up to do. Thank you very much!

I really appreciated the overview of the process of building models, but also the fact that the author discusses infrastructure. His experience is obvious. If you are only building models with no understanding of the infrastructure that can support the model, you are not doing ML Engineering. You need to understand the stacks that exist before and after the model and anticipate retraining, human-in-the-loop correction, and so on.

I also really appreciated that the author didn’t just describe “data drift” as one all-encompassing thing but described data drift and concept drift separately. The author shows some fancy ways of identifying drift, but I have to wonder if Causal Discovery might also be useful in detecting drift. That would be a cool experiment. Also, I mainly do NLP, and in language, words are occasionally repurposed and take on new meaning. I would be fascinated to play with approaches to detect concept drift in language. I will probably do that at some point. I need it.
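To show why the two kinds of drift deserve separate treatment, here is a minimal sketch using only the standard library. It is not the book's method. Data drift is checked by comparing the input distribution at training time against live inputs with a two-sample Kolmogorov-Smirnov statistic, which needs no labels. Concept drift is checked with a crude proxy: a rise in the model's error rate once labels arrive. That proxy can also fire because of data drift, so treat it as a signal to investigate, not a diagnosis. The function names, thresholds, and synthetic data are all assumptions for illustration.

```python
import random


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    i = j = 0
    max_gap = 0.0
    for value in sorted(set(ref + cur)):
        # Advance each pointer past every sample <= value.
        while i < len(ref) and ref[i] <= value:
            i += 1
        while j < len(cur) and cur[j] <= value:
            j += 1
        max_gap = max(max_gap, abs(i / len(ref) - j / len(cur)))
    return max_gap


def data_drift(reference, current, threshold=0.2):
    """Inputs moved: compares feature values only (threshold is an assumption)."""
    return ks_statistic(reference, current) > threshold


def concept_drift(reference_errors, current_errors, tolerance=0.1):
    """Proxy check: error rate rose by more than `tolerance` (1 = wrong, 0 = right)."""
    ref_rate = sum(reference_errors) / len(reference_errors)
    cur_rate = sum(current_errors) / len(current_errors)
    return (cur_rate - ref_rate) > tolerance


# Synthetic data: one feature at training time, then two live samples.
rng = random.Random(0)
train_feature = [rng.gauss(0.0, 1.0) for _ in range(500)]
live_same = [rng.gauss(0.0, 1.0) for _ in range(500)]
live_shifted = [rng.gauss(1.5, 1.0) for _ in range(500)]

print(data_drift(train_feature, live_same))
print(data_drift(train_feature, live_shifted))

# Made-up prediction outcomes: 10% errors before, 40% errors now.
print(concept_drift([0] * 90 + [1] * 10, [0] * 60 + [1] * 40))
```

The first check can run as soon as new data shows up, while the second has to wait for ground truth, which is one practical reason to monitor them separately.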

This is a very advanced book. This is not a “How to build your first ML model” book, and it doesn’t tell you how SVMs work. You should read this book AFTER you know a bit about ML and have some experience building models.

Chapters are long and cover a lot of material, but it is all important material, and accurately presented.

I’m going to end my review here. This is a long book and I’ve already learned some new things. It is clear that this author is experienced in ML Engineering and that this book is a useful resource for those of us who are in the field. This is an excellent contribution to Data Science.

The book is both technical (it shows code implementation) and high level. You need to understand both, and it will help you.

Personally, this book has helped bring me up to speed on certain things where I fell a bit behind, so THANK YOU VERY MUCH. It has also shown me other gaps. This is useful! I’m looking forward to playing with some of the drift stuff for NLP. It’s been a pleasure to read this, and I will continue.

Thank You!

Thank you to everyone who read this post, the first post for this publication! If you enjoyed this, you might also like 100daysofnetworks. I also wrote a book about Network Science that you might enjoy. You can get it here!

Smash Data!
