I left you with the knowledge that I had expanded to the entirety of the United States and that I was waiting for a full data set to accumulate before proceeding. The time has come to start working with the data and getting more intimate with it.
Before going too far, I should mention that I have also started capturing daily carbon monoxide, nitrogen dioxide, ozone, and sulfur dioxide readings in case they play a role later. I doubt they will, but I would rather have the data and not need it than need it and not have it.
My first step is to get more familiar with the data and attempt to isolate anything that truly stands out when statistics are applied. To do this, I am running linear regression against the data in different configurations to see if anything jumps out. The idea is that certain indicators will play a larger role than others when predicting the temperature. For example, if the wind speed and direction change, does that truly have a major impact on the temperature?
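One way to compare "configurations" is to fit the model on all features, then refit with one feature left out at a time and see how much the score drops. The following is a minimal sketch of that idea in Python with scikit-learn, not the actual script behind this post. The data is synthetic and the column names (`humidity`, `cloud_cover`, `wind_speed`) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; column names are hypothetical.
rng = np.random.default_rng(1)
n = 2000
X = pd.DataFrame({
    "humidity": rng.uniform(20, 100, n),
    "cloud_cover": rng.uniform(0, 100, n),
    "wind_speed": rng.uniform(0, 15, n),
})
# By construction the target depends strongly on humidity,
# weakly on cloud cover, and not at all on wind speed.
y = 30 - 0.2 * X["humidity"] - 0.02 * X["cloud_cover"] + rng.normal(0, 1, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

def r2_for(columns):
    """Fit on the given columns only and return R^2 on the held-out rows."""
    model = LinearRegression().fit(X_train[columns], y_train)
    return model.score(X_test[columns], y_test)

baseline = r2_for(list(X.columns))

# How much the score falls when each feature is left out.
drops = {
    col: baseline - r2_for([c for c in X.columns if c != col])
    for col in X.columns
}

for col, drop in sorted(drops.items(), key=lambda kv: -kv[1]):
    print(f"{col:12s} drop in R^2: {drop:.4f}")
```

A feature whose removal barely changes the score is contributing little to the linear fit, which is the kind of signal this exploratory pass is looking for.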
A disclaimer: I know this is not going to show me a whole lot, and I know that it will not actually predict the temperature. I honestly doubt I will achieve 20% accuracy, which statistically is worse than guessing. I just know that a bunch of you following my work will say something along the lines of “why do that…it’s not helpful”. On a normal day I would agree, but remember, the point of this exercise is to get more familiar with my data rather than to predict things right now.
With that out of the way, I decided to use Python for this test. scikit-learn, pandas, NumPy, and Matplotlib were my libraries of choice since I have worked with each of them in the past and like their APIs.
The first step was to export nearly 3 GB of data that I have collected over the last three weeks or so. Being selective, I decided to export Cloud Coverage, Humidity, Min Temperature, Max Temperature, Wind Speed and Direction, and the Temperature. I then dumped all of this data into a CSV file, which I used as my training and testing data. The file was just over 75 MB when exported.
Using pandas, I imported that file with the read_csv method, which surprisingly did not error out. I was afraid I would run out of memory in a development environment, but luckily that did not happen. Once the data was imported, I set up my features and labels. In my case, Temp was the label and everything else was a feature. Since all of the imported data was an integer or a float, I had no problems doing this. This is partly why I chose the fields above: none of them are strings, which cannot be included in a linear regression, at least not the way I am setting it up.
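A minimal sketch of the import and the feature/label setup, assuming hypothetical column names and a tiny in-memory CSV in place of the real 75 MB export:

```python
import io
import pandas as pd

# Hypothetical stand-in for the exported file. The real export is far larger
# and its column names may differ; these are assumed for illustration.
csv_text = """cloud_cover,humidity,temp_min,temp_max,wind_speed,wind_dir,temp
40,65,10.5,18.2,3.4,180,14.1
75,80,8.0,12.3,5.1,220,10.2
10,35,15.0,27.5,2.0,90,22.8
"""

# With a real file this would be pd.read_csv("path/to/export.csv").
df = pd.read_csv(io.StringIO(csv_text))

# Linear regression needs numeric inputs, so check that no string
# columns slipped into the export.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print("non-numeric columns:", non_numeric)

# Everything except the temperature is a feature; the temperature is the label.
X = df.drop(columns=["temp"])
y = df["temp"]
print(X.shape, y.shape)
```

If memory does become a problem on a larger export, `read_csv` also accepts `usecols` and `dtype` arguments to load fewer columns or smaller numeric types.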
I decided that since my data set was so large, I would use 5% of it for testing and the rest for training. I have over 3 million rows of data, so 5% may seem small, but in reality it is about 150,000 rows used to validate the model, which should be plenty. The last thing I set up was to predict one day into the future. That may not seem like a lot, and it is not, but it is a starting point.
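The post does not say how the one-day-ahead target was built, so the following is only one plausible sketch: shift each station's temperature back by one row so that today's readings line up with tomorrow's temperature, then hold out 5% of the rows with scikit-learn's `train_test_split`. The `station_id` and `day` columns, the feature names, and the synthetic data are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: two hypothetical stations with 200 daily rows each.
rng = np.random.default_rng(0)
n_days = 200
df = pd.DataFrame({
    "station_id": np.repeat(["A", "B"], n_days),
    "day": np.tile(np.arange(n_days), 2),
    "humidity": rng.uniform(20, 100, 2 * n_days),
    "wind_speed": rng.uniform(0, 15, 2 * n_days),
    "temp": rng.normal(15, 8, 2 * n_days),
})

# Pair each day's readings with the NEXT day's temperature at the same station.
df = df.sort_values(["station_id", "day"])
df["temp_next_day"] = df.groupby("station_id")["temp"].shift(-1)

# The last day of each station has no "tomorrow", so drop it.
df = df.dropna(subset=["temp_next_day"])

X = df[["humidity", "wind_speed"]]
y = df["temp_next_day"]

# Hold out 5% of the rows for testing and train on the remaining 95%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.05, random_state=42
)
print(len(X_train), "training rows,", len(X_test), "test rows")
```

On a data set of a little over 3 million rows, the same `test_size=0.05` yields roughly 150,000 test rows, matching the figure above.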
With everything set it was time to run the script and see what my accuracy was. Anyone want to take a guess before I tell you? Remember, I do not expect it to be above 20%.
OK, enough suspense. Using linear regression, the model came up with an accuracy of 79.709 percent! I ran it multiple times to make sure that was right! That means nearly 80% of the time the model was able to successfully predict the next day's temperature given a set of inputs. That is far better than I had anticipated for a simple linear regression.
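The post does not show how the 79.709 figure was computed. One caveat worth keeping in mind: if it came from scikit-learn's `LinearRegression.score`, that method returns R², the share of variance in the target that the model explains, rather than the fraction of predictions that were correct. The sketch below, on synthetic data with assumed coefficients, computes R² alongside two other measures that are easier to read as "how close was the prediction". The 2-degree tolerance is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: the target is a noisy linear mix of two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.0, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.05, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# score() on a regressor is R^2, identical to r2_score on the predictions.
r2 = model.score(X_test, y_test)
r2_check = r2_score(y_test, predictions)

# Average size of the error, in the same units as the target.
mae = mean_absolute_error(y_test, predictions)

# Fraction of predictions within an (arbitrary) tolerance of 2 units.
within_2 = float(np.mean(np.abs(predictions - y_test) <= 2.0))

print(f"R^2: {r2:.3f}  MAE: {mae:.3f}  within 2 units: {within_2:.1%}")
```

Reporting the mean absolute error or a within-tolerance rate next to R² makes it easier to say how many degrees off a typical forecast is.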
As a refresher for those of you who have not been following along from the beginning: my goal with this project was to see whether, given enough variables, we could accurately predict the temperature, and other variables, based on previous weather patterns. The fact that this model has such a high accuracy tells me that it is in fact possible.
I believe that given enough variables, more than I currently have available, it will be possible to predict the temperature for any given point on Earth at least one day in advance, given the current weather conditions. If I start to consume data points like upper and lower level winds, high and low pressure fronts, time of year, Doppler radar, and satellite data, and incorporate them into the model I develop, I have no doubt that the accuracy of the model will easily rise into the 90% range.
I know that in the long term a simple linear regression model will not suffice, but it does help validate my theory that weather is nothing more than a series of patterns that can be predicted accurately given enough data points.
I want to give a big thanks to InterSystems for giving me a license to their software to use for this research. Without them, this would have been much more difficult and costly.