In part 3 I discussed how I was surprised that the weather was so predictable based on the linear regression model I put together. To recap, my features (the model inputs) are Cloud Coverage, Barometric Pressure, Humidity, Min. Temp, Max. Temp, Wind Degree, and Wind Speed. The label is the Temperature itself and the index column is the Date and Time.
My first attempt was to use 100% of my data, which at this point is nearly a month's worth of hourly updates on nearly 20,000 cities in the United States. As of a few minutes ago this is now over 5 million rows and growing rapidly.
Utilizing 99% of the data for training and 1% for testing, I was able to obtain an accuracy rating of 80.10% and in all tests came within 2 degrees of the actual temperature when predicting one day in advance. Yes, I know, it is only one day in advance, but I am keeping it small to start with. I honestly expected an accuracy of around 20% or less when doing a simple linear regression, so you can imagine my surprise with 80%.
I then asked myself the question: if I can obtain 80% on my first run through, what if I start tuning this data in an attempt to gain further accuracy? That is exactly what I did. I started adding fields, removing them, and giving them higher/lower weights while normalizing, and in all situations I was never able to obtain higher than 85% accuracy, which is simply not good enough. After all, if I asked people to rely on my numbers when I am only accurate 85% of the time with a ±4 degree variance, I doubt anyone would consider my work valid.
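For anyone curious how a one-day-ahead setup like this could be wired, here is a minimal sketch in plain Python. It assumes hourly rows for a single city, sorted by time with no gaps, pairs each row's features with the temperature 24 rows later, and holds out the most recent 1% for testing. The function names, the chronological split, and the toy data are all assumptions for illustration, not the actual script.

```python
def make_day_ahead_pairs(rows, horizon=24):
    """Pair each hourly row's features with the temperature `horizon` rows later.

    `rows` is a list of (features, temperature) tuples sorted by time.
    Assumes one row per hour with no gaps, so 24 rows == one day.
    """
    pairs = []
    for i in range(len(rows) - horizon):
        features, _ = rows[i]
        _, future_temp = rows[i + horizon]
        pairs.append((features, future_temp))
    return pairs


def split_train_test(pairs, test_fraction=0.01):
    """Hold out the most recent `test_fraction` of pairs for testing."""
    n_test = max(1, int(len(pairs) * test_fraction))
    return pairs[:-n_test], pairs[-n_test:]


# Toy data: 1,024 hourly rows; features are (pressure, humidity), made-up values.
rows = [((1008 + i % 5, 50 + i % 10), 40.0 + (i % 24) * 0.5) for i in range(1024)]
pairs = make_day_ahead_pairs(rows)
train, test = split_train_test(pairs)
print(len(pairs), len(train), len(test))
```

The last 24 rows have no "tomorrow" to learn from, so they drop out of the pair list before the split.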
Then it hit me. I was trying to feed my model too much data. My first thought was to simply give it every city in the United States and see what happened. Then I realized while working with the data that the same cloud coverage, pressure, humidity, etc. occurs quite frequently in different parts of the country but with wildly different temperature results.
For example, in Seattle when the Cloud Coverage was 90%, Humidity was 56%, and the Pressure was 1008, the temperature was 46 degrees. Those same conditions occur all over the United States with differing temperatures, like in Crook County, Oregon, where the temperature was 40 degrees. Six degrees may not seem like a huge swing, but it is more than enough to confuse a linear regression model.
That was when it hit me. Rather than giving the model the entire United States, I needed to give it a more refined data set specific to each city. I have the latitude and longitude for every city I track, and knowing that each degree of latitude equals approximately 69 miles (a degree of longitude is about the same at the equator and shrinks as you move toward the poles), I could easily reduce the data set to a geographic area, although it means one model truly does not fit all.
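As a rough check on the degrees-to-miles conversion, here is a small sketch using a spherical-Earth approximation. A degree of latitude is close to 69 miles everywhere, while a degree of longitude is scaled by the cosine of the latitude, so it covers fewer miles the farther you are from the equator. The constant and function name are assumptions for illustration.

```python
import math

MILES_PER_DEGREE = 69.0  # approximate; assumes a spherical Earth


def miles_per_degree_longitude(latitude_deg):
    """Approximate east-west miles covered by one degree of longitude."""
    return MILES_PER_DEGREE * math.cos(math.radians(latitude_deg))


for latitude in (0.0, 30.0, 45.0):
    print(latitude, round(miles_per_degree_longitude(latitude), 1))
```

In practice this means a box measured in degrees is narrower east-to-west than north-to-south for cities in the northern states.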
With this in mind I ran a script that generates a csv file for every city and includes the data for only cities within 5 degrees of latitude and longitude in all directions. In essence, if the city is located at 44.16 by -120.08 (Crook County, Oregon), I say give me all cities and their data between 39.16 and 49.16 latitude and between -125.08 and -115.08 longitude. The average number of cities within such a region is around 500; in the case of Crook County it is 987 cities' worth of data.
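The filter described above might look something like the following sketch. The `City` class, the `cities_in_box` function, and the sample list are hypothetical, and the Seattle, Boise, and Miami coordinates are approximate values added only for illustration.

```python
from dataclasses import dataclass


@dataclass
class City:
    name: str
    lat: float
    lon: float


def cities_in_box(center, cities, half_width_deg=5.0):
    """Return cities whose latitude and longitude are both within
    half_width_deg of the center city (the center itself is included)."""
    return [
        c for c in cities
        if abs(c.lat - center.lat) <= half_width_deg
        and abs(c.lon - center.lon) <= half_width_deg
    ]


crook = City("Crook County, OR", 44.16, -120.08)
cities = [
    crook,
    City("Seattle, WA", 47.61, -122.33),   # approximate coordinates
    City("Boise, ID", 43.62, -116.20),     # approximate coordinates
    City("Miami, FL", 25.76, -80.19),      # approximate coordinates
]
nearby = cities_in_box(crook, cities)
print([c.name for c in nearby])
```

This simple comparison ignores longitude wrap-around at ±180 degrees, which should not matter for cities in the continental United States.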
This process took a long time: nearly 9 hours to create nearly 20,000 csv files full of data and then another 6 hours to generate the linear regression pickles needed for each city. What I am left with is a model unique to each city and its surrounding cities, and on average the accuracy is 97% and the margin of error dropped to just under 1 degree when predicting one day into the future.
Now these are the results I was looking for, although I expected it to be a LOT more than it was. There are going to be some issues though. I only have about one month of data on each city and that means all of the data points are relatively well clustered because it is winter everywhere. What I anticipate happening when Spring/Summer/Fall roll around is that the accuracy of the models will decrease and the margin of error will increase.
The driving factor of this is that things like barometric pressure do not waver much throughout the year, but the temperature does. The barometric pressure in Seattle has been averaging 1015 over the last month with an average temperature of 40 degrees Fahrenheit. In the Spring/Summer/Fall the temperature will easily hit 90+ degrees with a similar pressure, which will mean the model will need to rely on other factors much more heavily than it does now.
Knowing this, I can start to create models based on the time of year. Each city will end up with twelve models, one for each month of the year, which should help keep the relative factors the same year over year, with few exceptions. Generally speaking the temperature in Seattle in December should be consistent between 30–50 degrees, which is much easier to model than a swing of 20–100 degrees if I tried to model the entire year together.
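To give a feel for the "one pickle per city" idea, here is a deliberately tiny sketch. It swaps in a hand-rolled one-feature least-squares fit instead of a full multi-feature regression, and the file naming (`city_0001.pickle`), function names, and data are all made up for the example.

```python
import pickle
import tempfile
from pathlib import Path


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def save_model(model, city_id, directory):
    """Write one pickle file per city and return its path."""
    path = Path(directory) / f"{city_id}.pickle"
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path


def load_model(path):
    with open(path, "rb") as f:
        return pickle.load(f)


# Toy data: humidity (%) vs. next-day temperature (F), made-up values.
humidity = [40.0, 50.0, 60.0, 70.0]
next_day_temp = [50.0, 48.0, 46.0, 44.0]

with tempfile.TemporaryDirectory() as tmp:
    path = save_model(fit_line(humidity, next_day_temp), "city_0001", tmp)
    slope, intercept = load_model(path)

prediction = slope * 56.0 + intercept
print(path.name, round(prediction, 1))
```

Keeping each city's model in its own file means a nightly job only has to load the one model it needs rather than everything at once.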
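Splitting a city's history into monthly buckets, so that each bucket can train its own model, could be sketched like this. The row layout, the function name, and the sample values (including the dates) are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime


def group_rows_by_month(rows):
    """Bucket (timestamp, features, temperature) rows by calendar month (1-12)."""
    by_month = defaultdict(list)
    for timestamp, features, temperature in rows:
        by_month[timestamp.month].append((features, temperature))
    return dict(by_month)


# Made-up rows: features are (pressure, humidity), temperature in degrees F.
rows = [
    (datetime(2018, 12, 1, 9), (1015, 80), 41.0),
    (datetime(2018, 12, 15, 14), (1012, 75), 45.0),
    (datetime(2019, 7, 4, 15), (1016, 40), 88.0),
]
monthly = group_rows_by_month(rows)
print({month: len(samples) for month, samples in monthly.items()})
```

Because the key is the month number alone, December rows from different years land in the same bucket, which is what a per-month model would want once several years of data exist.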
Lastly, in time, it will be possible to create more refined models down to a week or even per day, but I will need years of data to prove the effectiveness and I simply cannot afford that expense right now. The end-game for this, in about five years, will be to create a model for every day, week, and month for every city in my database, and then each night calculate the forecast using an average of the day, week, and month models. By then, the models should be so well refined that the margin of error should stay well within acceptable levels and the confidence of the predictions should be strong.
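The nightly averaging step could be as simple as the sketch below. It assumes an unweighted mean of the three predictions; the function name and the sample values are hypothetical, and a weighted average would be an equally reasonable choice.

```python
from statistics import mean


def combined_forecast(day_pred, week_pred, month_pred):
    """Unweighted average of the day, week, and month model predictions (degrees F)."""
    return mean([day_pred, week_pred, month_pred])


print(combined_forecast(41.0, 43.0, 45.0))
```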
I estimate the cost of obtaining the last 6 years' worth of data from the Open [...] The exact pricing is unknown, although the smaller plans start at 950 USD per month and only allow 50,000 calls per day. Each city will require 2,190 calls to obtain the last six years of data, which means I will need to make over 43 million calls before obtaining all of the data for the 20,000 cities currently tracked. At that pace it will take 876 days, or about 29 months, to obtain it all. That means the cost would be approximately 28,000 USD over that period. They do have bulk download options, but the price is negotiated and I can assume it would be near that same estimate, although I may be mistaken. I do plan on reaching out to them to see if they can give me a “student” discount. :)
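The arithmetic behind that estimate can be reproduced in a few lines. The sketch assumes one call per city per day of history (which is how 2,190 calls per city works out), ignores leap days, and treats a billing month as 30 days; the variable names are my own.

```python
import math

YEARS = 6
CITIES = 20_000
CALLS_PER_DAY_LIMIT = 50_000
PRICE_PER_MONTH_USD = 950

# Assumption: one call per city per day of history, ignoring leap days.
calls_per_city = YEARS * 365
total_calls = calls_per_city * CITIES
days_needed = math.ceil(total_calls / CALLS_PER_DAY_LIMIT)
months_needed = days_needed / 30  # assumes a 30-day billing month
estimated_cost = months_needed * PRICE_PER_MONTH_USD

print(calls_per_city, total_calls, days_needed)
print(round(months_needed, 1), round(estimated_cost))
```

Under these assumptions the total lands a little under 28,000 USD, which is where the rounded figure comes from.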
That is enough for today. The next steps will be to start visualizing this data and making it available for everyone to see and use. I have purchased the domain name MeteoML.com and surrounding domains and will begin putting a website together that will allow exploration of the data.
As always, thank you for following along. I hope you are enjoying this as much as I am.