(Cover image from Quartz)
In part 1 of this blog post, I described how I prepared an Azure ML Studio experiment to try to predict Hong Kong's air quality (AQ) from past AQ and weather data. Let's see what results this first attempt produced.
A disappointing accuracy
I ran a first test with the "Boosted Decision Tree Regression" algorithm and its default parameters and got mediocre results: the model scored a Mean Absolute Error of 0.89. In other words, it was on average wrong by nearly 1 point, which is pretty high considering that the index we're trying to predict has 11 possible values (from 1 to 10+). At this stage, I could have started fiddling with the algorithm's parameters, but I preferred to compare it with another algorithm first.
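To make the metric concrete: the Mean Absolute Error is simply the average magnitude of the prediction errors. A minimal sketch in Python, with made-up AQ values for illustration:

```python
# Mean Absolute Error: the average absolute difference between
# predicted and actual values.
def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy example with AQHI-style values (the real index runs from 1 to 10+):
actual    = [3, 5, 7, 4]
predicted = [4, 5, 6, 6]
print(mean_absolute_error(actual, predicted))  # → 1.0
```

An MAE of 0.89 therefore means the model's guesses were, on average, almost a whole index band away from reality.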
So I created a parallel pipeline to train a model with the "Decision Forest Regression" algorithm and compare the score with the one I had before.
Unfortunately, the score was even worse, with a Mean Absolute Error of 0.92! I was starting to have serious doubts about the outcome of the experiment. Either there wasn't any relevant correlation between my data points, or the data just wasn't good enough. I could still fine-tune each algorithm's parameters, but my results were so far off that this fine-tuning wasn't likely to bring satisfying improvements.
And then I stumbled upon this article, which basically explains that, back in 2015, a team from Microsoft ran "data analysis" experiments (well, "machine learning" as they would call it today!) to predict China's air quality from weather data. They got satisfying results for a 6-hour forecast but less accurate predictions over the following days. And all of a sudden, my own results didn't seem that bad anymore!
Switching to an hourly model
I understood that predicting air quality from day to day is pretty challenging, but I could try the hourly approach that Microsoft had tested and see if I could get better results. So back in LiteDB, I generated a new dataset composed of:
- AQ and weather data for 3 consecutive hours
- AQ for 6 consecutive hours after that
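The reshaping above amounts to sliding a window over the hourly series. Here is a hypothetical Python reconstruction of the idea (the real dataset was generated with LiteDB queries, and the column names here are invented):

```python
# Sketch: reshaping an hourly time series into supervised-learning rows.
# Each row pairs 3 consecutive hours of inputs (AQ + weather) with the
# AQ values of the 6 hours that follow.

def make_hourly_rows(hours):
    """hours: chronological list of dicts, e.g. {"aq": 4, "temp": 22.5}."""
    rows = []
    for i in range(len(hours) - 9 + 1):  # need 3 input + 6 target hours
        window = hours[i:i + 3]          # inputs: hours H-2, H-1, H
        future = hours[i + 3:i + 9]      # targets: hours H+1 .. H+6
        row = {}
        for h, obs in enumerate(window, start=1):
            for key, value in obs.items():
                row[f"{key}_h{h}"] = value
        for h, obs in enumerate(future, start=1):
            row[f"aq_next{h}"] = obs["aq"]
        rows.append(row)
    return rows

# 10 hours of toy data yield 10 - 9 + 1 = 2 training rows.
hours = [{"aq": i % 5 + 1, "temp": 20 + i} for i in range(10)]
print(len(make_hourly_rows(hours)))  # → 2
```

Because every hour (except the edges) starts its own window, the same measurements get reused across many rows, which is why the row count jumps so much compared with the daily model.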
A great benefit of this change was that my dataset now had a lot more rows: from 1,400 in the daily model to more than 25,000 in the hourly model! Hopefully that would mean more relevant data to train on.
Time to test this new approach. After importing the new dataset into the existing experiment, I added a "Select Columns in Dataset" stage before the split to filter out the AQ data for hours H+2 to H+6, as I only wanted to predict the AQ value for the hour following the past AQ and weather measurements.
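What that filtering stage does can be mimicked with a simple column filter; a sketch, reusing the invented aq_next1..aq_next6 names for the 6 future AQ columns:

```python
# Keep every feature column plus the H+1 target; drop the H+2..H+6 targets.
# Column names are invented for illustration.
def keep_h1_target(row):
    return {k: v for k, v in row.items()
            if not (k.startswith("aq_next") and k != "aq_next1")}

row = {"aq_h3": 3, "temp_h3": 22.0, "aq_next1": 4, "aq_next2": 5, "aq_next6": 2}
print(sorted(keep_h1_target(row)))  # → ['aq_h3', 'aq_next1', 'temp_h3']
```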
I ran the experiment again, this time producing much better results: I got a Mean Absolute Error of 0.29 for the "Boosted Decision Tree Regression" and 0.31 for the "Decision Forest Regression".
I was curious to see whether I could tweak the parameters of the "Decision Forest Regression" to improve its accuracy, so I followed the tips in the algorithm's documentation and increased both "Number of decision trees" and "Number of random splits per node". This significantly increased the training time but didn't dramatically improve the accuracy, with a Mean Absolute Error of 0.30. So I decided to stick with the "Boosted Decision Tree Regression".
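The tuning step boils down to a small search over parameter settings. A runnable sketch of that loop, where `train_and_score()` is a hypothetical stand-in for re-running the Azure ML experiment; the MAE values are the two actually observed (0.31 with defaults, 0.30 after increasing the parameters), while the parameter values themselves are illustrative:

```python
# Hypothetical tuning loop over (number of trees, random splits per node).
# The parameter values are illustrative; the MAE scores are the two
# observed in the experiment.
canned_mae = {
    (8, 128): 0.31,    # default-like settings (illustrative values)
    (64, 1024): 0.30,  # increased per the documentation tips (illustrative)
}

def train_and_score(num_trees, splits_per_node):
    # Stand-in for retraining the Decision Forest inside Azure ML Studio.
    return canned_mae[(num_trees, splits_per_node)]

best = min(canned_mae, key=lambda params: train_and_score(*params))
print(best, canned_mae[best])  # → (64, 1024) 0.3
```

The marginal gain (0.31 to 0.30) against a much longer training time is what made sticking with the boosted trees the pragmatic choice.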
The last step was to see how accurate the model could be on a 6-hour forecast. So instead of running the same dataset through 2 different algorithms, I modified the experiment to run 2 differently filtered datasets through the same algorithm: the left path would predict the AQ for H+1 and the right path for H+6.
I already knew the accuracy of the model when predicting the AQ for the next hour, but I was quite eager to see how well it would perform on a 6-hour prediction. The Mean Absolute Error turned out to be 0.54. In other words, the predicted air quality 6 hours ahead was off by only about half a point on average.
What are the lessons learned from this exercise? Well, the first one was the confirmation that, as any ML practitioner will tell you, data preparation can be very time-consuming: the cumulative time I spent looking for the right data sources, importing them, and creating datasets from them was much longer than the time I actually spent running the experiments. The second takeaway was that we need to have reasonable expectations of the data we use. In my case, I realized that the influence of weather conditions on air quality was not as strong as I had originally anticipated, and the chaotic nature of weather certainly doesn't help in producing accurate predictions over long time frames.