7 Tips for Machine Learning Success
The first part of our Business Guide to Machine Learning (ML) broke down how the umbrella concept of ML is far more nuanced in a business environment. The most effective strategies look at ML in a practical sense, employing both complex deep learning and less-intensive "cheap learning" techniques to optimize enterprise processes and gain tangible business intelligence (BI) insights.
The goal of deploying ML within your business applications is to improve your bottom line or press your company's competitive advantage. But in the larger scheme of your organization, making the most of the time and resources you invest in this process goes far beyond the algorithms. The IT decision-makers in your business need to make sure everything factoring into your ML implementation, from the data and logistics to how you're engaging with users, works together cohesively to maximize effectiveness.
Ted Dunning, Ph.D., is the Chief Application Architect at enterprise Hadoop vendor MapR, and co-author of two books on what he refers to as "Practical Machine Learning." Dunning has developed ML technologies for a number of companies over the years, including the ID Analytics fraud detection system (purchased by LifeLock) and the Musicmatch Jukebox software, which later became Yahoo Music. He also currently serves as Vice President of Incubation for the Apache Software Foundation.
Dunning has watched the ML space evolve over decades, and learned a lot about what works and what doesn't in a practical business environment. Below, Dunning lays out seven best practices to follow when developing business solutions rooted in ML.
1. Don't Forget Logistics
Successful ML isn't just about choosing the right tool or algorithm. Dunning said you also need to figure out what approach is a good fit and design it for the particular situation you are addressing. For example, Dunning talked about ML in an online marketing campaign as opposed to far more complicated scenarios such as algorithms guiding an autonomous car. Expending your resources for an incremental algorithm improvement is worth the trouble for the car, but in the marketing scenario, you would see a far better return from optimizing all of the logistics around it.
"Oftentimes, for businesses, it's the logistics, not the learning, which gives you the value. That's the part you should be spending your time and resources on," said Dunning. "Adjusting the algorithm would give you a small improvement. But adjusting that data, the [graphical user interface or] GUI, and how you're listening to and engaging with your users could easily give you a 100 percent improvement. Spending time tweaking the algorithm is worth a fraction as much to businesses as is listening to your users."
To illustrate this point, Dunning explained how he once built a model for identifying application fraud (opening fake accounts with stolen identities) in a company's customer database. The model he built got great results, but Dunning noticed it weighted the gender of the applicant very heavily.
It turned out that the logistics were off. The way the application process worked, the applicant only filled out their gender after they had already become a customer and had passed a number of screening steps to filter out fraudsters. So by using the gender field, the ML model was cheating the logistics of the whole fraud process. That has nothing to do with the algorithm, and everything to do with how the company was getting its data in the first place.
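One way to catch this kind of leak is to check, for every input field, whether its value had actually been captured by the time the screening decision was made. The sketch below is only an illustration under stated assumptions: the field names, timestamps, and the `fields_available_at_decision` helper are hypothetical and are not taken from Dunning's system.

```python
from datetime import datetime

def fields_available_at_decision(captured_at, decision_time):
    """Return the fields whose values existed when the decision was made."""
    return {
        name
        for name, filled_at in captured_at.items()
        if filled_at is not None and filled_at <= decision_time
    }

# Hypothetical application record: field name -> when the value was captured.
captured_at = {
    "name": datetime(2016, 3, 1, 9, 0),
    "address": datetime(2016, 3, 1, 9, 2),
    "gender": datetime(2016, 3, 8, 14, 30),  # filled in after onboarding
}
screening_time = datetime(2016, 3, 1, 9, 5)

usable = fields_available_at_decision(captured_at, screening_time)
leaky = set(captured_at) - usable  # fields the model should not train on

print("usable:", sorted(usable))
print("leaky:", sorted(leaky))
```

Any field that lands in `leaky` describes the outcome of the screening process rather than the applicant at the moment of screening, so a fraud model should not be trained on it.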
2. Mind Your Data
Dunning is full of catchy tidbits of wisdom. After starting with "it's the logistics, not the learning," he said the other half of that idea is "it's the data, not the algorithms." A large part of ensuring your ML algorithms are delivering valuable insights is making sure you're feeding them the right data. Dunning said that if you're not getting the result you're looking for, more often than not it's because you're not using the right data.
"People get all wound up and ego-bound to particular algorithms, but nowadays, because of the tools out there, everyone and their mother can and is coming up with all sorts of new algorithms," said Dunning. "The data is far more important, and will give you far more lift than endlessly tweaking your algorithms. If you're working on a hard problem like speech recognition or computer vision, that's one thing. But this is a data-driven field. In the majority of scenarios, you'll benefit far more from adjusting what data you're getting and changing the question."
That's what Dunning did in the mid-2000s when building a video recommendation engine at a company called Veoh Networks. The team was working to identify pairs of user-generated videos that people clicked on more than expected, but the algorithm wasn't working. They were thinking in terms of music, where users know their favorite artists and songs by name. So they changed the question by tweaking the user interface without touching the algorithm itself.
"In user-generated videos, nobody knows the artists and lots of videos had really spammy titles to get more views. Cycling on algorithm tweaks would have never given us good results," said Dunning. "What we did was changed the user interface to emit a beacon signal every 10 seconds [to gauge how long viewers were watching a video]. We found that if we used the beacon instead of clicks for the raw data of the recommender, we got awesome results. The lift for this one change was several hundred percent improvement in engagement due to recommendations, with no algorithmic changes."
3. Algorithms Are Not Magic Bullets
ML implementations thrive on continual trial and error. No matter how good your algorithms are, if your system is interacting with humans, then it will need to be adjusted over time. Dunning stressed that businesses should constantly be measuring the overall effectiveness of their implementation, and identifying the changes and variables that are making it better or worse. This may sound like a platitude, but Dunning said very few people are doing this, let alone doing it well.
"A lot of people want to deploy a system or take some action, and they want their algorithm to run perfectly forever," said Dunning. "No algorithm is going to be a magic bullet. No user interface design will stick forever. No data collection method will never be superseded. All of this can and will happen, and businesses need to be vigilantly measuring, evaluating, and reevaluating how their system works."
4. Use a Diverse Toolset
There are dozens of ML tools available, many of which you can use for free. You've got popular open-source frameworks and libraries such as Caffe, H2O, Shogun, TensorFlow, and Torch, and ML libraries in a number of Apache Software Foundation (ASF) projects including Mahout, Singa, and Spark. Then there are subscription-based options including Amazon Machine Learning, BigML, and Microsoft Azure Machine Learning Studio. Microsoft also has a free Cognitive Toolkit.
There are countless resources available. Dunning has spoken to numerous businesses, data scientists, and ML practitioners, and always asks them how many different frameworks and tools they use. Most, Dunning said, use a minimum of five to seven tools and often far more.
"You can't become glued to one tool. You're going to have to use several, and as such, you'd better build your system in a way that it's agnostic," said Dunning. "Anyone who tries to convince you that this tool is the only one you'll ever need is selling you a bill of goods.
"Something might happen next week that upsets the apple cart, and at the rate of innovation we're seeing, that will keep happening for another five to 10 years at least," Dunning continued. "Look at a cheap learning example where maybe you're re-using an existing image classifier to analyze pictures in a catalog. That's deep learning with computer vision thrown in. But there are tools out there that have packaged it all up. You need to measure, evaluate, and vacillate between different tools, and your infrastructure needs to be welcoming to that."
5. Experiment With Hybrid Learning
Dunning said you can also mix cheap and deep learning together into something of a hybrid. For example, if you take an existing computer vision model and reconstruct the top few layers where a decision is being made, then you can co-opt an existing framework for an entirely new use case. Dunning pointed to a Kaggle competition in which contestants did just that; they took a data set and wrote a new algorithm on top to help a computer distinguish cats from dogs.
"Distinguishing cats and dogs is a very subtle thing for an ML algorithm. Think about the logic: Cats have pointy ears but so do German Shepherds. Dogs don't have spots, except for Dalmatians, etc. That can be pretty difficult to recognize in and of itself," said Dunning. "The guy who won developed a system that did this with 99 percent accuracy. But I was more impressed by the person who came in third. Instead of building from scratch, he took an existing image recognition program from a different task, took off the top layer, and put a simple classifier in there. He gave it some examples, and soon, it was 98 percent accurate in differentiating cats from dogs. The whole process took the guy three hours."
6. Cheap Doesn't Mean Bad
Despite the overt connotation, Dunning said cheap learning doesn't mean bad learning. The amount of time you spend on an ML implementation doesn't directly correlate with its business value. The more important quality, he said, is to make sure the process is repeatable and reliable. If the business is able to achieve that without investing an undue amount of resources, then that's all the better.
"Cheap doesn't mean bad. If it works, it works. If it's cheap and it works, that's grand. But the effort you put into building it doesn't define the value. That's a sunk-cost fallacy," said Dunning. "What defines the value is how it improves the business. If the [machine learning implementation] improves profits or decreases costs or improves your competitive situation. It's the effect, not the effort."
7. Don't Call It AI
Dunning stressed that, when talking about these techniques, businesses should use the precise terminology: ML, computer vision, or deep learning. All of this tends to fall under the umbrella term "artificial intelligence" but, to Dunning, the definition of AI is simply "stuff that doesn't work yet."
"The best definition I've ever heard for AI is that it's the things we can't explain yet. The stuff we haven't figured out," said Dunning. "Every time we get something to work, people say 'Oh, that's not AI, it's just software. It's just a rules engine. It's really just logistic regression.' Before we figure something out, we call it AI. Afterwards, we always call it something else. In many ways, AI is better used as a word for the next frontier, and in AI, there will always be a next frontier. AI is where we're going, not where we've already reached."
This article originally appeared on PCMag.com.