Data Science Competitions

Data science competitions are a great way to connect theory and practice. Over the past three years I have worked on over thirty competitions, mostly on Kaggle but also on other platforms. These competitions are real-life problems posed by well-known companies from a variety of industries. Taking part in them has sharpened my skills in predictive analytics: I can now efficiently identify innovative ways to formulate a business challenge as a prediction problem, and then solve that problem with state-of-the-art machine learning techniques such as factorization machines, gradient boosted decision trees, and deep neural networks.

Below is a selection of the data science competitions I took part in. For each one I include a description of the problem, some key insights for achieving good performance, and my reflections after completion. Links to the competition pages are also included.

I'm one of only 122 users on Kaggle to have achieved "Grandmaster" status. There are currently about 84,000 users on Kaggle who have completed at least one competition, and well over 1 million users when including those who have not competed. My all-time high ranking was 14th worldwide, achieved in 2016 during my most active period of competing. Since then I have mostly focused on applying my experience to my research work.

Here is a link to my Kaggle profile.

End Date: August 28, 2015

Result: 1st place out of 2236

Best Method: XGBoost or LightGBM

  • Description: Predict a transformed count of hazards or pre-existing damages using a dataset of property information. This will enable Liberty Mutual to more accurately identify high risk homes that require additional examination to confirm their insurability.
  • Type of Data: Tabular, completely anonymized feature/variable names
  • Type of Problem: Regression, prediction can be any positive integer
  • Evaluation metric: Normalized Gini coefficient
  • Key Insights: This problem was particularly well-suited for gradient boosted decision trees due to the combination of categorical and numerical variables (features). The data was anonymized, so we could not derive any specific business insights, but the predictive models still scored very well on the normalized Gini metric (a sketch of that metric follows this list).
  • Reflections: This competition was particularly challenging because there weren't many options for feature engineering or model ensembling. As a result, competitors had very similar scores. What made my winning solution unique was an alternative formulation of the problem that performed better than the standard regression approach.
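For reference, the normalized Gini coefficient used for scoring can be computed by ranking the true targets by the model's predictions and comparing against a perfect ranking. Below is a minimal NumPy sketch of one common formulation (ties are not handled specially):

```python
import numpy as np

def gini(y_true, y_pred):
    # Rank observations by predicted value, highest first.
    order = np.argsort(y_pred)[::-1]
    sorted_true = np.asarray(y_true, dtype=float)[order]
    # Cumulative share of the total target captured as we walk down the ranking.
    cum_share = np.cumsum(sorted_true) / sorted_true.sum()
    n = len(sorted_true)
    # Area between the cumulative-gain curve and the diagonal of a random ranking.
    return cum_share.sum() / n - (n + 1) / (2 * n)

def normalized_gini(y_true, y_pred):
    # Normalize by the Gini of a perfect ranking so a perfect model scores 1.0.
    return gini(y_true, y_pred) / gini(y_true, y_true)
```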

End Date: August 19, 2015

Result: 2nd place out of 600+

Best Method: XGBoost or LightGBM

  • Description: The goal for this competition is to use data from social media to narrow the search for health code violations in Boston. Competitors will have access to historical hygiene violation records from the City of Boston — a leader in open government data — and Yelp's consumer reviews. The challenge: Figure out the words, phrases, ratings, and patterns that predict violations, to help public health inspectors do their job better.
  • Type of Data: JSON file with information about restaurants, Yelp profiles and reviews
  • Type of Problem: Regression, prediction can be any positive number
  • Evaluation metric: Root mean squared logarithmic error (RMSLE)
  • Key Insights: At the time of this competition, natural language processing (NLP) had yet to become popular in predictive analytics. My result came almost exclusively from restaurant profile information and the history of health violations recorded in previous inspections. As one might expect, violations in previous inspections are good predictors of future violations, but it turns out that violations at geographically nearby restaurants also hold predictive power (a sketch of such a neighborhood feature follows this list). It would be interesting to do this competition again with the NLP skills that the predictive analytics community now wields.
  • Reflections: This was a really fun competition because the data is real and the problem is relevant. However, as with many other competitions, there was a leak in the inspection results dataset: restaurants with severe violations would be re-inspected within two weeks, and the re-inspection would be recorded the same way as the original violations. Simply copying violations forward into the next recorded observation (the re-inspection) therefore gave high predictive accuracy in the training sample. Fortunately, the leak wasn't an issue in the test sample because evaluation was conducted in real time, so failed inspections weren't known at submission time and the leak could not be exploited.
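A minimal sketch of the kind of neighborhood feature described above, using scikit-learn's BallTree with the haversine metric. The column names lat, lon, and past_violations are hypothetical placeholders, not the competition's actual schema:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

def nearby_violation_feature(df: pd.DataFrame, k: int = 5) -> pd.Series:
    """Mean past-violation count of each restaurant's k nearest neighbors."""
    coords = np.radians(df[["lat", "lon"]].to_numpy())
    tree = BallTree(coords, metric="haversine")
    # Query k + 1 neighbors because the closest neighbor of a point is itself.
    _, idx = tree.query(coords, k=k + 1)
    neighbor_idx = idx[:, 1:]
    values = df["past_violations"].to_numpy()[neighbor_idx].mean(axis=1)
    return pd.Series(values, index=df.index, name="nearby_violation_mean")
```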

End Date: February 15, 2016

Result: 4th place out of 2619

Best Method: XGBoost or LightGBM

  • Description: Prudential wants to make it quicker and less labor intensive for new and existing customers to get a quote while maintaining privacy boundaries. The goal of this competition is to develop a predictive model that accurately classifies risk using data provided by Prudential.
  • Type of Data: Tabular, contains basic information about insurance applicants along with some information about employment, insurance, family, and medical history. Data is anonymized.
  • Type of Problem: Regression, prediction must be an integer from 1 to 8
  • Evaluation metric: Quadratic weighted kappa
  • Key Insights: Similar to the Liberty Mutual competition, this problem is well-suited for XGBoost given its data format. Surprisingly, nothing really worked beyond the very basics, and there was a serious problem with overfitting because the noise was large compared to the signal.
  • Reflections: I made a silly mistake with my final submissions that resulted in me finishing 4th instead of 2nd; lesson learned. The real takeaway is that although it's addictive to optimize predictions against the public leaderboard, doing so can and will lead to overfitting, so having a proper local validation framework is truly important (a minimal sketch of such a validation loop follows this list).
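A minimal sketch of the kind of local validation loop meant above, scoring quadratic weighted kappa across K folds. GradientBoostingRegressor is only a stand-in for XGBoost/LightGBM, and clipping-and-rounding is just one simple way to map continuous predictions onto the 1-8 labels:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for XGBoost/LightGBM
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import KFold

def local_cv_qwk(X, y, n_splits=5, seed=0):
    """Estimate quadratic weighted kappa with a local K-fold loop
    (X and y are NumPy arrays) instead of relying on the public leaderboard."""
    scores = []
    for train_idx, valid_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        model = GradientBoostingRegressor(random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        # Predictions are continuous; clip and round into the 1-8 label range.
        preds = np.clip(np.rint(model.predict(X[valid_idx])), 1, 8).astype(int)
        scores.append(cohen_kappa_score(y[valid_idx], preds, weights="quadratic"))
    return float(np.mean(scores))
```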

End Date: April 25, 2016

Result: 2nd place out of 2125

Best Method: XGBoost or LightGBM

  • Description: In this competition, Home Depot is asking Kagglers to help them improve their customers' shopping experience by developing a model that can accurately predict the relevance of search results. Search relevancy is an implicit measure Home Depot uses to gauge how quickly they can get customers to the right products. Currently, human raters evaluate the impact of potential changes to their search algorithms, which is a slow and subjective process. By removing or minimizing human input in search relevance evaluation, Home Depot hopes to increase the number of iterations their team can perform on the current search algorithms.
  • Type of Data: Tabular, but each row contains search terms and product descriptions in text
  • Type of Problem: Regression, prediction can be any number between 1 and 3
  • Evaluation metric: Root mean squared error (RMSE)
  • Key Insights: This is a competition that relies heavily on NLP concepts. Simple text features such as the length of the search term and product description, the Jaccard similarity between the description and the search term, and the search terms themselves are very effective in predicting relevance (a sketch of such features follows this list).
  • Reflections: This was my first NLP-type competition. I learned a lot through it, and my team was able to keep improving our performance until the very end.
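A minimal sketch of the simple text features mentioned above, using plain whitespace tokenization (a real solution would normalize, stem, and spell-correct the text more carefully):

```python
def text_features(search_term: str, description: str) -> dict:
    """A few simple overlap features between a query and a product description."""
    q = set(search_term.lower().split())
    d = set(description.lower().split())
    return {
        "query_len": len(q),
        "description_len": len(d),
        "jaccard": len(q & d) / len(q | d) if (q | d) else 0.0,  # shared-token ratio
        "query_tokens_in_description": sum(tok in d for tok in q),
    }
```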

End Date: June 6, 2017

Result: 5th place out of 3307

Best Method: XGBoost, LightGBM, Factorization Machines, Recurrent Neural Networks

  • Description: Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term. In this competition, Kagglers are challenged to tackle this natural language processing problem by applying advanced techniques to classify whether question pairs are duplicates or not. Doing so will make it easier to find high quality answers to questions resulting in an improved experience for Quora writers, seekers, and readers.
  • Type of Data: Unstructured text, an observation is composed of two strings of text
  • Type of Problem: Classification
  • Evaluation metric: Logarithmic loss (logloss)
  • Key Insights: This is very much an NLP problem, and as with most NLP competitions, most features work; it's really about creating as many features from the text as possible. However, the key to winning this competition lay in the data itself, which was manually subsampled by engineers at Quora. People quickly found out that there was a network aspect to the data, where cliques of duplicate questions would appear together, so using network-based information (e.g., the number of questions linked to both members of a pair in the training data) as features led to dominant performance that NLP alone could never match (a sketch of such a feature follows this list). This was fun because it was the first time I saw network-based features play such a major role in a predictive analytics competition, but it was also frustrating because this aspect of the dataset was an artifact of the data creation process and would not be of any use in the actual real-life problem of detecting duplicate questions.
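A minimal sketch of one such network-based feature: counting how many questions are linked to both members of a pair somewhere in the training data. Representing the data as bare (q1, q2) id pairs is a simplification of the actual competition files:

```python
from collections import defaultdict

def common_neighbor_counts(pairs):
    """For each (q1, q2) pair, count the questions paired with both q1 and q2."""
    neighbors = defaultdict(set)
    for q1, q2 in pairs:
        neighbors[q1].add(q2)
        neighbors[q2].add(q1)
    return [len(neighbors[q1] & neighbors[q2]) for q1, q2 in pairs]
```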

End Date: May 15, 2017

Result: 1st place out of 579

Best Method: XGBoost, LightGBM

  • Description: We have acquired news articles containing potentially relevant information. Using these, we need you to use historical reports to determine the topics for new articles so that they can be classified and prioritised. This will allow analysts to focus on only the most pertinent details of this developing crisis. The data for this challenge has been acquired from a major international news provider, the Guardian. The training data represents the past historical record, and the test data represents new documents that require classification.
  • Type of Data: Unstructured text, an observation is an article from The Guardian newspaper
  • Type of Problem: Classification, each article can belong to multiple topics
  • Evaluation metric: F1-score
  • Key Insights: This is purely an NLP problem. Surprisingly, most of the more complicated things I tried didn't work very well; the basics of using words as feature vectors did most of the work (a minimal sketch follows this list).
  • Reflections: It took a couple of days to set up a proper train/test/validation framework because this competition didn't come with the same kind of infrastructure as those hosted on Kaggle. As a result, there were fewer competitors and the entry barrier was much higher.
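A minimal sketch of the bag-of-words baseline meant above: TF-IDF features with one logistic regression per topic to handle the multi-label setup. Variable names and hyperparameters are illustrative rather than the settings I actually used:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

def fit_topic_model(texts, topics):
    """texts: list of article bodies; topics: list of topic-label lists."""
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(topics)  # one binary column per topic
    model = make_pipeline(
        TfidfVectorizer(max_features=100_000, ngram_range=(1, 2), sublinear_tf=True),
        OneVsRestClassifier(LogisticRegression(max_iter=1000)),  # one classifier per topic
    )
    model.fit(texts, Y)
    return model, mlb
```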

End Date: February 29, 2016

Result: 32nd place out of 974

Methods: XGBoost or LightGBM

  • Description: Predict service faults on Australia's largest telecommunications network. Telstra is on a journey to enhance the customer experience - ensuring everyone in the company is putting customers first. In terms of its expansive network, this means continuously advancing how it predicts the scope and timing of service disruptions. Telstra wants to see how you would help it drive customer advocacy by developing a more advanced predictive model for service disruptions and to help it better serve its customers. Using a dataset of features from their service logs, you're tasked with predicting if a disruption is a momentary glitch or a total interruption of connectivity.
  • Type of Data: Log files, semi-anonymized feature/variable names
  • Type of Problem: Multi-class classification
  • Evaluation metric: Multi-class logarithmic loss
  • Key Insights: On the surface this is a very simple problem of mapping anonymous log information to service faults. However, due to a data leak (observations were unintentionally sorted by time), many time-series-related features were extremely useful in predicting faults. Whenever time is involved in a prediction problem, a number of time-based features can be created, and they tend to carry a lot of predictive power. That was certainly the case here (a sketch of such order-based features follows this list).
  • Reflections: I spent about 10 days on this competition and was very happy with the result. It took some time to realize the leak existed and to track it down, but once I did, I spent a few fun days creating and testing new features. I can imagine this type of problem playing a major role in many companies.
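A minimal sketch of the kind of order-based features the leak enabled. The location column and the assumption that file order tracks time illustrate the idea rather than reproduce the exact features I built:

```python
import pandas as pd

def add_order_features(df: pd.DataFrame) -> pd.DataFrame:
    """Position-based features, assuming rows are kept in original file order."""
    df = df.copy()
    df["row_order"] = range(len(df))                   # global position in the file
    grp = df.groupby("location")["row_order"]
    df["pos_in_location"] = grp.rank(method="first")   # position within its location
    df["location_size"] = grp.transform("size")
    df["relative_pos"] = df["pos_in_location"] / df["location_size"]
    return df
```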

End Date: February 29, 2016

Result: 6th place out of 1212

Methods: XGBoost, LightGBM, k-Nearest Neighbors, Kernel Density Estimation

  • Description: The goal of this competition is to predict which place a person would like to check in to. For the purposes of this competition, Facebook created an artificial world consisting of more than 100,000 places located in a 10 km by 10 km square. For a given set of coordinates, your task is to return a ranked list of the most likely places. Data was fabricated to resemble location signals coming from mobile devices, giving you a flavor of what it takes to work with real data complicated by inaccurate and noisy values. Inconsistent and erroneous location data can disrupt experience for services like Facebook Check In.
  • Type of Data: Log files, x-y-coordinates and check-in locations
  • Type of Problem: Large scale multi-class classification
  • Evaluation metric: Mean average precision @ 3
  • Key Insights: This is one of the rare problems where nearest neighbors works very well (a sketch follows this list). There is a very large number of classes (i.e., place labels), so standard machine learning methods aren't as suitable. However, tree-based methods like XGBoost or LightGBM still performed best when the features were designed properly and the models were given enough time to learn all of the classes.
  • Reflections: This competition was a lot of fun because many different methods worked well enough to allow for a powerful ensemble. Furthermore, each method had its own feature needs, so it was more than just plug-and-play, leaving room for experimentation and creativity.
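A minimal sketch of the nearest-neighbor idea: a distance-weighted kNN over the (x, y) coordinates that returns the three most likely places per query point. In practice the 10 km by 10 km map would first be cut into small grid cells to keep the class count manageable; that step is omitted here:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def top3_places(X_train, y_train, X_query, k=25):
    """X_* hold (x, y) coordinates, y_train holds place ids; returns a ranked top 3."""
    knn = KNeighborsClassifier(n_neighbors=k, weights="distance")
    knn.fit(X_train, y_train)
    proba = knn.predict_proba(X_query)                 # shape: (n_query, n_classes)
    top3 = np.argsort(proba, axis=1)[:, -3:][:, ::-1]  # indices of the 3 highest probabilities
    return knn.classes_[top3]                          # map back to place-id labels
```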

End Date: February 21, 2018

Result: 10th place out of 2383

Methods: XGBoost, LightGBM, Deep Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks, LSTMs

  • Description: Mercari, Japan’s biggest community-powered shopping app, would like to offer pricing suggestions to sellers, but this is tough because their sellers are enabled to put just about anything, or any bundle of things, on Mercari's marketplace. In this competition, Mercari’s challenging you to build an algorithm that automatically suggests the right product prices. You’ll be provided user-inputted text descriptions of their products, including details like product category name, brand name, and item condition.
  • Type of Data: Product descriptions, basic features, with the price that was chosen by the seller
  • Type of Problem: Regression, prediction can be any positive number
  • Evaluation metric: Root mean squared logarithmic error (RMSLE)
  • Key Insights: This is a unique competition because it is a kernel competition: competitors are required to submit code to Kaggle's kernels and run it on Kaggle's servers. There are memory and time limits, so the code has to be fast and efficient, and it has to be reliable because a failure to run in the second stage is unacceptable. The text descriptions of products play a major role in predicting the price a seller would like to ask, so recurrent neural networks are a natural fit; however, they are much slower than convolutional neural networks, which makes the latter the better choice for this kernel competition (a minimal model sketch follows this list).
  • Reflections: This was the first time that I actively used deep learning methods in an NLP-based competition. Surprisingly, deep learning methods are very easy to use and don't require as much tuning as I'd thought.
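A minimal sketch of the kind of fast text model meant above: a small 1D convolutional network over token ids, trained to predict log1p(price) so that a mean-squared-error loss approximates the RMSLE metric. All hyperparameters here are illustrative:

```python
from tensorflow.keras import layers, models

def build_text_cnn(vocab_size=50_000, seq_len=60, embed_dim=32):
    """Tiny 1D-CNN price regressor over padded sequences of token ids."""
    inp = layers.Input(shape=(seq_len,), dtype="int32")
    x = layers.Embedding(vocab_size, embed_dim)(inp)            # token ids -> vectors
    x = layers.Conv1D(64, kernel_size=3, activation="relu")(x)  # local n-gram filters
    x = layers.GlobalMaxPooling1D()(x)                          # strongest response per filter
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(1)(x)                                    # predicts log1p(price)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")                 # MSE on log prices ≈ RMSLE
    return model
```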

End Date: August 30, 2016

Result: 11th place out of 1969

Methods: XGBoost, LightGBM, FTRL, Factorization Machines

  • Description: In this competition, Grupo Bimbo invites Kagglers to develop a model to accurately forecast inventory demand based on historical sales data. Doing so will make sure consumers of its over 100 bakery products aren’t staring at empty shelves, while also reducing the amount spent on refunds to store owners with surplus product unfit for sale.
  • Type of Data: Sales transaction values for each product at each retail store
  • Type of Problem: Regression, prediction can be any positive number
  • Evaluation metric: Root mean squared logarithmic error (RMSLE)
  • Key Insights: The key to successfully predicting future demand is to construct the data in a way that lets you leverage past time-series information. In predictive analytics terms, this means having the right validation framework so that the validation error reflects the true prediction problem at hand (a sketch of such lag features and a time-based split follows this list). Due to the large number of products and stores, this problem has high-cardinality categorical variables, and given this aspect of the problem, factorization machines are naturally the best method.
  • Reflections: Product demand prediction is important, and a good machine learning method performs significantly better than a simple moving average at the store-product level. I highly recommend machine learning for demand prediction, but it's important to have the right validation framework to avoid overfitting.
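A minimal sketch of the lag features and time-based split described above. The column names week, store_id, product_id, and demand are placeholders for the competition's actual schema:

```python
import pandas as pd

def add_lag_features(df: pd.DataFrame, lags=(1, 2, 3)) -> pd.DataFrame:
    """Demand observed 1, 2, 3 weeks earlier for the same store-product pair."""
    df = df.sort_values("week").copy()
    grp = df.groupby(["store_id", "product_id"])["demand"]
    for lag in lags:
        df[f"demand_lag_{lag}"] = grp.shift(lag)
    return df

def time_split(df: pd.DataFrame, valid_week: int):
    """Train on earlier weeks, validate on a later one, mirroring the real forecast task."""
    return df[df["week"] < valid_week], df[df["week"] == valid_week]
```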

End Date: January 18, 2017

Result: 29th place out of 979

Methods: XGBoost, LightGBM, FTRL, Factorization Machines

  • Description: Kagglers are challenged to predict which pieces of content Outbrain's global base of users are likely to click on.
  • Type of Data: Tabular data; each observation records which ad was clicked out of a set of displayed ads
  • Type of Problem: Classification; rank the ads in each display set by their click probability
  • Evaluation metric: Mean average precision @ 12
  • Key Insights: As with most display-advertising prediction problems, there are a large number of observations and the categorical features have very high cardinality. Factorization machines have proven to be the best algorithm for this kind of problem (a sketch of the FM scoring function follows this list).
  • Reflections: Although factorization machines work best here, I didn't use them. Instead, I used FTRL, which is a less effective method.
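For reference, a minimal NumPy sketch of the factorization machine score for a single dense feature vector, using the standard O(n·k) identity for the pairwise-interaction term. Real click data would be sparse and the parameters would be learned with SGD; both are omitted here:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization machine score: x has shape [n], w0 is the bias,
    w the linear weights [n], and V the factor matrix [n, k]."""
    linear = w0 + x @ w
    # Pairwise interactions sum_{i<j} <V_i, V_j> x_i x_j computed in O(n*k).
    interactions = 0.5 * np.sum((x @ V) ** 2 - (x ** 2) @ (V ** 2))
    return linear + interactions
```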

End Date: July 20, 2017

Result: 62nd place out of 938

Methods: Convolutional Neural Networks

  • Description: Planet, designer and builder of the world’s largest constellation of Earth-imaging satellites, will soon be collecting daily imagery of the entire land surface of the earth at 3-5 meter resolution. While considerable research has been devoted to tracking changes in forests, it typically depends on coarse-resolution imagery from Landsat (30 meter pixels) or MODIS (250 meter pixels). This limits its effectiveness in areas where small-scale deforestation or forest degradation dominate. Furthermore, these existing methods generally cannot differentiate between human causes of forest loss and natural causes. Higher resolution imagery has already been shown to be exceptionally good at this, but robust methods have not yet been developed for Planet imagery. In this competition, Planet and its Brazilian partner SCCON are challenging Kagglers to label satellite image chips with atmospheric conditions and various classes of land cover/land use. Resulting algorithms will help the global community better understand where, how, and why deforestation happens all over the world - and ultimately how to respond.
  • Type of Data: Images, with multiple classes that each image belongs to
  • Type of Problem: Classification, each image can belong to multiple classes
  • Evaluation metric: F2-score
  • Key Insights: Pre-trained convolutional neural network models (e.g., VGG16) work very well; they only require retraining of the final layer on the competition's training set (a sketch follows this list).
  • Reflections: I used this competition to learn how to use convolutional neural networks for image classification problems. It turned out to be a great experience.
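A minimal PyTorch sketch of the transfer-learning setup described above: load a pre-trained VGG16, freeze the convolutional backbone, and replace only the final classifier layer. The 17 output units correspond to this competition's label set, and because an image can carry multiple labels the model would be trained with a sigmoid/BCE loss:

```python
import torch.nn as nn
from torchvision import models

def build_finetune_model(num_classes=17):
    """Pre-trained VGG16 with only the last classifier layer replaced and trained."""
    model = models.vgg16(pretrained=True)
    for param in model.features.parameters():
        param.requires_grad = False                   # freeze the convolutional backbone
    in_features = model.classifier[-1].in_features
    model.classifier[-1] = nn.Linear(in_features, num_classes)
    return model

# Multi-label training: score each label independently with nn.BCEWithLogitsLoss().
```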