“Have an open mind. It is not very uncommon that a classical and simple algorithm might beat the hottest techniques.”
For this week’s machine learning practitioner’s series, Analytics India Magazine got in touch with Tien-Dung Le, a seasoned data scientist and a Kaggle Grandmaster. In this interview, he shares his experiences from a career that spans over a decade. He also offers a few valuable tips for cracking Kaggle competitions.
About The Kaggle Journey
Tien studied computer science and machine learning in college and, as part of his PhD coursework, worked on reinforcement learning for robotics systems. Currently, he works as a data scientist at KBC Bank & Verzekering.
“It has been a long journey. I first started learning AI at university, but it was more rule-based. I learned the new ML fashion via Kaggle community,” said Tien about his machine learning journey. The participants on Kaggle post reliable machine learning resources, which helped Tien in adapting to the competition.
Tien said that during his initial days on Kaggle, he found regression prediction problems challenging. “I was a data engineer harvesting data from different resources. We needed to use Hadoop to process our data transformation. Then I was thinking about how to mine them (we used the term ‘data mining’ at that time). It took quite a while before I actively participated in Kaggle competitions,” explained Tien.
Since most of the contests on Kaggle are aimed at solving real-world use cases, Tien advised the participants to first understand the business problems. “We need to know what happens behind the competition, what are the purposes of the organisers and how can businesses profit from our technical solutions,” added Tien.
Once one gets a feel of what the organisers are expecting, Tien recommends understanding the evaluation metrics and how to design a solution to optimise these metrics.
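As a rough illustration of designing a solution around the evaluation metric, the sketch below assumes a hypothetical competition scored on F1 and picks the decision threshold that maximises F1 on held-out predictions, rather than using the default of 0.5. The labels, scores, and function names are made up for this example and do not come from the interview.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels (1 = positive), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, scores, candidates):
    """Return the (threshold, F1) pair with the highest F1 among candidates."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        preds = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(y_true, preds)
        if f1 > best_f1:  # strict '>' keeps the lowest threshold on ties
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical validation labels and model scores (illustrative only).
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.35, 0.40, 0.45, 0.55, 0.70, 0.20, 0.30]

candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
default_f1 = f1_score(y_true, [1 if s >= 0.5 else 0 for s in scores])
threshold, tuned_f1 = best_threshold(y_true, scores, candidates)

print(f"F1 at 0.5: {default_f1:.3f}")
print(f"Best threshold: {threshold:.2f}, F1: {tuned_f1:.3f}")
```

In practice the threshold would be chosen on a validation split or out-of-fold predictions, never on the test set, so the reported score remains an honest estimate.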
This is usually followed by a long trial-and-error process. “I test different approaches until I see one or two good solutions. Starting from these baselines, I improve them until the deadline of the competition. Of course, stacking is also important, but we should only exploit it at the very end of any competition,” said Tien.
“While improving the solution, it is recommended to have an open mind. It is not very uncommon that a classical and simple algorithm might beat the hottest techniques. Also, a note that, at Kaggle, magic features might be very important. Magic features exist in many forms from leakage of the test set split, a leakage of the business processes to a leakage of the predictions themselves.”
Another important aspect of his Kaggle journey, he explained, is that he focuses on one competition at a time. Tien strongly believes that participating in multiple competitions is good for learning but not for competing. So, he usually spends at least two weeks on each problem before switching his attention to others.
“In case I need to build a machine learning model, I will try first with H2O. The advantage of H2O is that you could see many metrics directly without coding.”
Tien has been in the industry for over a decade. As an industry insider, he advises the data science aspirants to read the problem description carefully. “Ask business people to demonstrate their daily work. Ask what inputs and outputs are they expecting from their processes. What do they expect from an ML solution and how do they want to evaluate the performance. Note that in reality, the cost to evaluate an ML solution could be very expensive and time-demanding,” explained Tien.
He also pointed out a recurring issue between business stakeholders and data scientists. For instance, if the objective says “increase the sale”, one should ask what it really means. It could be increasing the volume (predicting the probability of production and optimising) or it could be increasing the revenue (predicting the expected sale and optimising the total revenue).
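The difference between the two readings of the objective can be made concrete with a small sketch. It assumes a hypothetical campaign that can contact only two customers, and that a model has already estimated each customer's purchase probability and order value. All names and numbers are invented for illustration.

```python
# Hypothetical model outputs: purchase probability and order value per customer.
customers = [
    {"id": "A", "p_buy": 0.90, "order_value": 10.0},
    {"id": "B", "p_buy": 0.60, "order_value": 20.0},
    {"id": "C", "p_buy": 0.30, "order_value": 200.0},
    {"id": "D", "p_buy": 0.10, "order_value": 500.0},
]


def top_k(items, k, key):
    """Pick the k items with the highest value of `key`."""
    return sorted(items, key=key, reverse=True)[:k]


def expected_volume(selected):
    """Expected number of sales among the selected customers."""
    return sum(c["p_buy"] for c in selected)


def expected_revenue(selected):
    """Expected revenue among the selected customers."""
    return sum(c["p_buy"] * c["order_value"] for c in selected)


# "Increase the sale" read as volume: rank by purchase probability.
by_volume = top_k(customers, 2, key=lambda c: c["p_buy"])
# "Increase the sale" read as revenue: rank by probability times order value.
by_revenue = top_k(customers, 2, key=lambda c: c["p_buy"] * c["order_value"])

for name, chosen in [("volume", by_volume), ("revenue", by_revenue)]:
    ids = [c["id"] for c in chosen]
    print(name, ids, expected_volume(chosen), expected_revenue(chosen))
```

With these made-up numbers the two rankings select entirely different customers, which is why the objective needs to be pinned down with the business before any modelling starts.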
Once these questions have been answered, the next step is to talk to the different stakeholders to learn about the stored data. Underlining the importance of communication, Tien urges data scientists to speak up quickly if the data is not good enough to build a good solution. “If not, what could we do or improve? And then ask them how they want to integrate the ML solution in the business pipelines,” advised Tien.
A Few Words For The Newcomers
Talking about the hype around machine learning, Tien said that many developing countries still lack proper data collection techniques and AI applications. They still use basic binary classification and/or regression with structured data, which might continue in the coming decades as well.
“We talked a lot about the latest deep learning technologies and new challenges, but in my opinion, our AI capability is still very far from our human ability. I think in the next 10 years, the AI community will be working primarily on problems concerning a combination of images, audios, text, and other sources,” said Tien.
Addressing the inflated expectations of outsiders who consider ML to be magic, Tien explained that machine learning needs historical data and data scientists need support from business people. And to get data, a good digital infrastructure is needed. So it is not correct to think that data scientists can solve all problems. A machine learning application requires much more effort than just building models.
For those who want to become great data scientists, he has the following tips:
- Understand the business domain and search for areas/business workflow where AI could bring value.
- Be open-minded, cooperate with business experts when building a model, analyse the results together and understand why a model works or does not work.
- Consistently learn and test new AI techniques or solutions.
For the aspirants, he suggests:
- Reading books/articles or following an ML course, then applying the techniques to gain hands-on experience.
- Playing with Kaggle is also a good way to have an idea of how machine learning works. Then go back to reading books/articles to have a deeper understanding.
- While participating in a Kaggle competition, follow the forum as there are many good ideas/resources shared by other participants.
On a concluding note, Tien advised the readers to keep working and not to give up. Because who knows what kind of surprising outcome awaits!
The post A Seasoned Data Scientist: Interview With Kaggle Grandmaster Tien-Dung Le appeared first on Analytics India Magazine.