Exploring technologies with real-life use cases is always more fun. When experimenting with machine learning, I try to avoid made-up datasets, because they are usually biased and built to yield good results. Instead, I prefer to use real data to evaluate ML algorithms and hopefully produce useful predictions.
I live in Hong Kong, where air quality is an ongoing concern. The Environmental Protection Department monitors air quality at a number of locations and publishes an air pollution index, ranging from 1 to 10+, publicly on an hourly basis. So I saw an opportunity to apply some machine learning and predict air quality for the coming hours and days!
I'm no weather or environment specialist, but my intuition was that the weather has some influence on air pollution; hot days usually bring lower air quality, and the wind obviously moves pollutants around. So my initial goal was to see if I could predict the air quality (AQ) index based on past AQ and weather, or more precisely, predict tomorrow's AQ based on the AQ and weather of the past 3 days.
Getting the data
Next, I had to select the data sources. For such an experiment, it's important to remember that you need both:
- historical data to train your model
- current / real-time data to feed your model and produce the prediction
Some datasets provide historical data and an API to retrieve current data, some don't, so this has to be carefully planned beforehand.
Air quality data
The Hong Kong government launched an open data initiative some years ago, and there's now a portal that registers all public data sources. There's quite a lot of stuff available there, with varying degrees of quality (in both usefulness and convenience of access), but we should recognize the goodwill shown by the authorities, and I hope it will expand and improve.
When it comes to environment and air quality, things are pretty good, as we have access to an "API" returning the current AQ readings from 16 stations, updated every hour. I put "API" in quotes because what the endpoint actually returns is XML - which is fine - with all the relevant data (the AQ readings) in plain text within a CDATA section! So close yet so far... So I ended up extracting the AQ values with regular expressions.
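The extraction step can be sketched like this; note that the XML excerpt below is a made-up illustration of the feed's general shape (a plain-text table of readings wrapped in CDATA), not the actual payload:

```python
import re

# Hypothetical excerpt mimicking the feed's shape: the relevant
# readings sit as plain text inside a CDATA section.
sample = """<rss><channel><item><description><![CDATA[
Central/Western : 4
Kwun Tong : 6
Tap Mun : 3
]]></description></item></channel></rss>"""

def extract_readings(xml_text):
    """Pull (station, index) pairs out of the CDATA payload with regexes."""
    cdata = re.search(r"<!\[CDATA\[(.*?)\]\]>", xml_text, re.DOTALL).group(1)
    return {
        station.strip(): value
        for station, value in re.findall(
            r"^(.+?)\s*:\s*(\d+\+?)\s*$", cdata, re.MULTILINE
        )
    }

readings = extract_readings(sample)
```

The `\d+\+?` pattern allows for the "10+" top value of the index.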
Historical data is also available in CSV format through monthly digests since December 2013.
Now that I had the AQ data sorted out, it was time to do the same for the weather data. Naturally, the first thing I checked was the very same open data portal, and I did find a current weather "API" (here again, XML with all the data in a CDATA section!). But two things bothered me about this data source. First, access to historical data wasn't straightforward and required first using the open data portal to generate URLs for each past day. But most importantly, this API didn't return any data about winds, which made it less interesting for my experiment.
I spent quite some time looking at the available online services to see which one would provide the best data. I ended up using OpenWeatherMap, which has a current weather API that's free to use. I also purchased a "history bulk" set for 10 USD so I could download 5 years of hourly data, which is a good deal. I was a bit worried that the readings from OpenWeatherMap seemed a bit off compared to those of the Hong Kong Observatory (a couple of degrees warmer, for example), but I figured they were probably using a single station located in some urban area.
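For reference, calling the OpenWeatherMap current weather endpoint and picking out the fields used here could look like the sketch below. The JSON sample is trimmed to the relevant fields, and the API key is a placeholder:

```python
import json
from urllib.parse import urlencode

API_KEY = "YOUR_API_KEY"  # placeholder, get one from openweathermap.org

def current_weather_url(city="Hong Kong"):
    # OpenWeatherMap "current weather" endpoint; metric units give °C and m/s.
    return ("https://api.openweathermap.org/data/2.5/weather?"
            + urlencode({"q": city, "units": "metric", "appid": API_KEY}))

# Parsing, shown on a trimmed sample of the JSON the API returns,
# limited to the readings this experiment uses.
sample_response = json.dumps({
    "main": {"temp": 27.3, "humidity": 78, "pressure": 1012},
    "wind": {"speed": 4.6, "deg": 90},
})

def parse_weather(payload):
    data = json.loads(payload)
    return {
        "temperature": data["main"]["temp"],
        "humidity": data["main"]["humidity"],
        "pressure": data["main"]["pressure"],
        "wind_speed": data["wind"]["speed"],
        "wind_dir": data["wind"]["deg"],
    }

weather = parse_weather(sample_response)
```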
Preparing the data
So I had my raw data, consisting of air quality and weather readings for every hour since January 2014. Next, I had to somehow merge them to generate a clean dataset that could be used as training data for my machine learning experiment.
The approach I took was to ingest everything into a lightweight database, from which I could then query, extract and clean up the data. I chose LiteDB, a NoSQL datastore that's very easy to embed in a .NET application. The schemaless nature of LiteDB made the ingestion process very straightforward, and once all the data was in LiteDB, I just had to query it back using LINQ.
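The core of the merge is a join of the two hourly series on their timestamp. I did this with LiteDB and LINQ in .NET, but the idea can be sketched in a few lines of Python (the field names below are illustrative, not the actual feed schemas):

```python
from datetime import datetime

# Toy records as they might come out of the two sources.
aq_records = [
    {"time": datetime(2014, 1, 1, 0), "aq": 4},
    {"time": datetime(2014, 1, 1, 1), "aq": 5},
]
weather_records = [
    {"time": datetime(2014, 1, 1, 0), "temp": 16.0, "humidity": 70},
    {"time": datetime(2014, 1, 1, 1), "temp": 15.5, "humidity": 72},
]

def merge_by_hour(aq, weather):
    """Join the two series on their hourly timestamp."""
    weather_by_time = {w["time"]: w for w in weather}
    merged = []
    for r in aq:
        w = weather_by_time.get(r["time"])
        if w is not None:  # skip hours missing from either source
            merged.append({**r, **{k: v for k, v in w.items() if k != "time"}})
    return merged

rows = merge_by_hour(aq_records, weather_records)
```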
As explained above, my goal was to predict the AQ for day d from the AQ and weather data of days d-1, d-2 and d-3. As the weather data consisted of temperature, humidity, pressure, wind speed and wind direction, I had to generate a CSV with the following "schema": aq, aq_1, t_1, h_1, p_1, ws_1, wd_1, aq_2, t_2, h_2, p_2, ws_2, wd_2, aq_3, t_3, h_3, p_3, ws_3, wd_3. My data sources provided hourly data, so I computed average daily values... which was easier said than done for the wind! A basic average wouldn't work here, so I had to convert wind speed and direction into a vector, average those vectors and convert the result back into speed and direction. And yes, I had to look up the good old trigonometry on Google... I also added extra columns for the weekday (ranging from 0 to 6) and the month, as I expected these to help with the prediction.
Running the first experiment
With my data ready, I could finally prepare the actual experiment. I decided to use Azure ML Studio as it provides an excellent user experience, a good choice of learning algorithms to play with and the convenience of publishing your model as an API in just a couple of clicks.
This is not the place to explain fundamental machine learning concepts; if you're interested in learning more about the different flavors of ML and their respective algorithms, the ML Studio documentation has a nice section about that, as well as a handy cheat sheet that can help with deciding which algorithm to choose. Let's just cut to the chase: we're dealing with a regression problem, because we want to predict the air quality index, which is a continuous value. So among all the algorithms available in ML Studio, I had to experiment with the ones from the "Regression" section.
I built a pretty classic pipeline, consisting of:
- the input dataset
- a data splitter that routes 70% of the data to training and the remaining 30% to scoring
- a model trainer, applying the chosen algorithm to the training data
- a model scorer, testing the resulting model against the scoring data
- an evaluation module to output the performance of the model
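The same pipeline can be replicated locally, for instance with scikit-learn. The sketch below uses synthetic stand-in data shaped like the experiment's dataset (18 lagged features plus weekday and month), since the real CSV isn't reproduced here, and a plain linear regression standing in for whichever ML Studio algorithm is chosen:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in data: 20 features per row, AQ target roughly
# driven by the first feature (think aq_1) plus noise.
rng = np.random.default_rng(0)
X = rng.random((500, 20))
y = 1 + 9 * X[:, 0] + rng.normal(0, 0.3, 500)

# 70/30 split, mirroring the data splitter module
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LinearRegression().fit(X_train, y_train)  # model trainer
predictions = model.predict(X_test)               # model scorer
mae = mean_absolute_error(y_test, predictions)    # evaluation module
```

ML Studio wires these same stages together graphically and reports several metrics at the evaluation step; mean absolute error is just one convenient choice here.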
So, what results did this experiment produce? Head to part 2 to find out!