Air quality prediction with XGBoost

While travelling in South East Asia, I noticed the air quality problems in some of the bigger cities. Poor air
quality affects people’s lives directly: they may develop breathing problems, stay indoors, and/or wear
masks. If people know in advance that the air quality will be bad at a certain time, they can take measures to
prevent possible (health) problems.
In the “Human Health Effects of Air Pollution” study (Marilena Kampa, Elias Castanas 2007) the relation
between air quality and the health of the people exposed to that air has been shown. Concerns like these
are behind the Air Quality Index (AQI), an index for reporting daily air quality. It tells how
clean or polluted the air is, and what the associated health effects might be.

For my final project in Udacity’s Machine Learning Nanodegree I created a model to predict air pollutants, using the data from a Kaggle competition.

Since the competition is already 5 years old (btw, did you know Kaggle started back in 2010?), I wanted to use newer techniques to see if I could improve on the scores. Therefore I decided to use eXtreme Gradient Boosting (XGBoost), which is a very popular algorithm nowadays because of its speed and accuracy.

Please see my final report on GitHub for the results. The exploratory plot below displays the relation between hour of day and the mean pollution level for 39 pollutants.

Mean pollution level per hour of day calculated using XGBoost
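The aggregation behind a plot like this can be sketched in a few lines. This is only an illustration with made-up readings for a single pollutant; the function name `mean_by_hour` and the sample values are hypothetical and do not come from the competition data.

```python
from collections import defaultdict
from statistics import mean


def mean_by_hour(readings):
    """Group (hour, level) pairs by hour and return {hour: mean level}."""
    by_hour = defaultdict(list)
    for hour, level in readings:
        by_hour[hour].append(level)
    return {hour: mean(levels) for hour, levels in sorted(by_hour.items())}


# Made-up readings: (hour of day, pollutant level).
sample = [(8, 40.0), (8, 60.0), (14, 20.0), (14, 30.0), (20, 55.0)]

hourly = mean_by_hour(sample)
for hour, level in hourly.items():
    print(f"{hour:02d}:00  {level:.1f}")
```

Repeating this per pollutant gives one curve per pollutant, which is roughly what the plot shows.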