It's all about the data, stupid.
Knowing your task and knowing your data
The most important part of the machine learning process is understanding the data you are working with and how it relates to the task you want to solve.
Randomly choosing an algorithm and throwing your data at it will not be effective. You need to understand what is going on in your dataset before you begin building a model.
Each algorithm differs in the kinds of data and problem settings it works best for. While you are building a machine learning solution, you should answer, or at least keep in mind, the following questions:
- What questions am I trying to answer? Do I think the data collected can answer that question?
- What is the best way to phrase my questions as a machine learning problem?
- Have I collected enough data to represent the problem I want to solve?
- What features of the data did I extract, and will these enable the right predictions?
- How will I measure success in my application?
- How will the machine learning solution interact with other parts of my research or business product?
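Several of these questions can be given a rough first answer before any model is built. The following is a minimal sketch, not a prescribed workflow: the `summarize` helper, the list-of-dicts layout, and the toy rows are all hypothetical, chosen only to illustrate the idea. It counts rows and missing values (is there enough usable data?), shows the class distribution, and computes the accuracy of always predicting the most common class, which gives a floor that any success measure should beat.

```python
from collections import Counter


def summarize(rows, target):
    """Return basic facts about a dataset given as a non-empty list of dicts.

    `rows` and `target` are illustrative names, not part of any library.
    """
    n_rows = len(rows)
    # Feature columns are taken from the first row (assumes consistent keys).
    features = [key for key in rows[0] if key != target]
    # Count missing (None) values for each feature column.
    missing = {f: sum(1 for r in rows if r.get(f) is None) for f in features}
    # Class distribution of the target column.
    classes = Counter(r[target] for r in rows)
    # Accuracy obtained by always predicting the most common class.
    baseline_accuracy = classes.most_common(1)[0][1] / n_rows
    return {
        "n_rows": n_rows,
        "missing": missing,
        "classes": dict(classes),
        "baseline_accuracy": baseline_accuracy,
    }


# Hypothetical toy data, made up for this example.
rows = [
    {"petal_length": 1.4, "petal_width": 0.2, "species": "setosa"},
    {"petal_length": 4.7, "petal_width": None, "species": "versicolor"},
    {"petal_length": 1.3, "petal_width": 0.2, "species": "setosa"},
    {"petal_length": 5.1, "petal_width": 1.8, "species": "virginica"},
]

summary = summarize(rows, "species")
print(summary)
```

On a real dataset the same checks would normally be done with a data analysis library, but the questions being asked stay the same: how much data there is, how complete it is, and what a trivial predictor already achieves.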
The algorithms and methods in machine learning are only one part of a larger process for solving a particular problem, and it is good to keep the big picture in mind at all times. Many people spend a lot of time building complex machine learning solutions, only to find out that they don’t solve the right problem.
When going deep into the technical aspects of machine learning, it is easy to lose sight of the ultimate goals.