When I last left you, I had completed some Linear Regression models and found that weather does in fact have predictable patterns associated with it. As time passes my data sets keep growing, and I am starting to get a good series of data to work with across the nearly 20,000 cities I track on an hourly basis.
I know that Linear Regression will only go so far, so it is time to move on to other types of analysis. Before going too far, I should note that I predict one day into the future every hour for every city, and my predictions land within ±4 degrees for approximately 80% of the cities I track. If I widen the margin to ±8 degrees, that rises to around 90% of cities. The remaining 10% are so far off that they are not even worth mentioning. The main reason is that those cities are isolated and have only a small number of surrounding cities, which means a smaller data set to train with.
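To make the ±4 and ±8 degree figures concrete, here is a minimal sketch of how such a share could be computed. The function name `share_within` and the sample values are made up for illustration and are not my real results.

```python
import numpy as np

def share_within(predicted, actual, tolerance):
    """Fraction of predictions whose absolute error is at most `tolerance` degrees."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    errors = np.abs(predicted - actual)
    return float(np.mean(errors <= tolerance))

# Made-up values purely for illustration, one prediction per city.
predicted = [50.0, 61.0, 43.0, 70.0, 55.0]
actual = [52.0, 60.0, 50.0, 85.0, 58.0]

print(share_within(predicted, actual, 4))  # share of cities within +/-4 degrees
print(share_within(predicted, actual, 8))  # share of cities within +/-8 degrees
```

Widening the tolerance can only keep or raise the share, which is why the ±8 figure is at least as high as the ±4 figure.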
Time to move to Time Series Prediction/Analysis, which I believe will provide more insight into the data. It will help isolate anomalies in the data as well as account for the date and time. With Linear Regression the prediction has no way to account for the time of day, so overnight or early morning predictions were usually the farthest off from the actual temperature.
Time Series analysis should help prevent that type of issue from arising since it knows the time of day and can use that in the prediction process. For this exercise I have reduced my data set to a single city rather than a cluster of them. The largest reason for this was so that I could analyze a single city without clouding the graphs and such with noise from other cities.
I also want to note that 100% of my time series analysis is done in Python using a variety of packages, namely pandas, NumPy, Matplotlib, and scikit-learn. I use a few others as well, like datetime and os, but those are not worth mentioning.
To begin with I wanted to plot the max/min/actual temperature for a single city and see how those three values trended with one another. This should be a good way to determine if the min/max is accurate. If the actual temp climbs above the max or drops below the min, then we know the min/max is a bad field to use. The graph of the min/max/temp is below, starting in mid-November and running to today.
What you see above is about 90% training data with 10% reserved for testing data that is appended to the end. If nothing else it shows that there are definite trends in weather over the course of a month that we should be able to work with. This particular city has experienced about 1 week of cooler weather followed by around 1 week of warmer weather.
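For a time series, the 90/10 split is taken in time order rather than shuffled, so the test data really is the most recent stretch. A minimal sketch, assuming the readings are already sorted by time; `chronological_split` and `hourly_temps` are illustrative names, not my actual code.

```python
def chronological_split(rows, train_fraction=0.9):
    """Split time-ordered rows into train and test parts without shuffling."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

# Stand-in for time-ordered hourly readings (values are just positions here).
hourly_temps = list(range(100))

train, test = chronological_split(hourly_temps)
print(len(train), len(test))
```

Keeping the test block at the end avoids leaking future readings into the training data.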
The good news, though, is that the actual temperature was ALWAYS between the min/max that was predicted by the Open Weather API. This will come in handy later, I am sure.
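The same check can be done in code instead of by eye. This is a minimal sketch using pandas; the column names `temp`, `temp_min`, and `temp_max` and the readings are assumptions for illustration.

```python
import pandas as pd

# Hypothetical column names and made-up readings for illustration only.
readings = pd.DataFrame({
    "temp": [41.0, 44.5, 47.0, 39.0],
    "temp_min": [40.0, 42.0, 45.0, 38.0],
    "temp_max": [45.0, 46.0, 49.0, 43.0],
})

# True for every row where the actual temperature sits inside [min, max].
within_bounds = readings["temp"].between(readings["temp_min"], readings["temp_max"])

# Rows that break the bounds; an empty frame means min/max held everywhere.
violations = readings[~within_bounds]
print(bool(within_bounds.all()), len(violations))
```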
The next step was to start calculating things like a rolling mean and a rolling standard deviation. The chart below shows the results of that analysis.
What is helpful here is the rolling mean of this data set. Knowing the rolling mean gives my algorithm a place to start when predicting values. If we know the rolling mean we know that in the next 24 hours, unless something drastic happens, the weather should fall close to that line. In my custom modelling I believe this will be a good starting point before taking other factors into account.
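A minimal sketch of the rolling statistics with pandas. The 24-reading window (one day of hourly data) and the synthetic temperatures are assumptions for illustration, not the settings or data behind my chart.

```python
import numpy as np
import pandas as pd

# Made-up hourly temperatures with a daily cycle, for illustration only.
hours = np.arange(72)
temps = pd.Series(50 + 10 * np.sin(2 * np.pi * hours / 24))

window = 24  # assumed window: 24 hourly readings, i.e. one day

# Each value summarizes the current reading and the 23 before it.
rolling_mean = temps.rolling(window=window).mean()
rolling_std = temps.rolling(window=window).std()

print(rolling_mean.iloc[-1], rolling_std.iloc[-1])
```

The first `window - 1` entries are NaN because a full window is not yet available. With a one-day window the daily cycle averages out, which is what makes the rolling mean a steady starting point.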
Part of the Stationarity test is to run a Dickey-Fuller test, which will help us test the null hypothesis. Let me explain here because this was new to me and I presume it will be for many of you as well. In general, a null hypothesis is the default position that there is no effect or relationship in the data. For the Dickey-Fuller test specifically, the null hypothesis is that the series has a unit root, meaning it is non-stationary, so rejecting it is evidence that the series is stationary. There is a lot of math and a variety of different ways to administer this test, so I will let you explore it further if you like. My results are below:
Results of Dickey-Fuller Test:
Test Statistic                 -3.523081
#Lags Used                      1.000000
Number of Observations Used    25.000000
Critical Value (1%)            -3.723863
Critical Value (5%)            -2.986489
Critical Value (10%)           -2.632800
I chose to stick with 1 lag since there is roughly 1 month of data in the data set being tested. My results are a little disheartening, though, since they are only borderline. The test statistic of -3.52 is below the 5% and 10% critical values, so I can reject the null hypothesis at those levels, but it is above the 1% critical value of -3.72, so I cannot reject it at the strictest level. I did not stop here, though, because my test size is extremely small: only 25 observations, which is less than 10% of the data set. Additionally, weather obviously has seasonality, even on a small scale of a month, and that works against stationarity.
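As a cross-check on how this output is read, here is a minimal sketch of the usual decision rule: the unit-root null is rejected at a given level when the test statistic is more negative than that level's critical value. The numbers are copied from the results above; the variable names are just for illustration.

```python
# Values copied from the Dickey-Fuller output above.
test_statistic = -3.523081
critical_values = {"1%": -3.723863, "5%": -2.986489, "10%": -2.632800}

# Reject the unit-root null at a level when the statistic is below its critical value.
decisions = {level: test_statistic < cv for level, cv in critical_values.items()}

for level, reject in decisions.items():
    print(level, "reject" if reject else "fail to reject")
```

With these numbers the null is rejected at the 5% and 10% levels but not at the 1% level.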
The results above really did not surprise me much, so I decided to try a few different ways of analyzing the time series. Below are the results of a Naive, Simple Average, Moving Average, and Holt-Winters analysis of this data set.
Moving Rolling Average:
Going into this series of tests I expected the Holt-Winters method to be the most accurate based on my research, but it was nowhere near accurate. I do believe that the analysis could be refined by adding more data, or potentially using less data, but I have not yet completed that analysis.
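For reference, the three simpler baselines can be sketched in a few lines of NumPy. This is an illustrative sketch with made-up numbers and an assumed moving-average window of 3, not the code or data behind my charts. Holt-Winters is left out here because it typically comes from a separate library such as statsmodels.

```python
import numpy as np

def rmse(actual, forecast):
    """Root mean squared error between actual values and a forecast."""
    diff = np.asarray(actual, dtype=float) - np.asarray(forecast, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Made-up temperatures for illustration only.
train = np.array([40.0, 42.0, 44.0, 46.0, 48.0, 50.0])
test = np.array([51.0, 49.0, 50.0])
window = 3  # assumed moving-average window

forecasts = {
    # Naive: repeat the last observed value.
    "naive": np.full(len(test), train[-1]),
    # Simple average: repeat the mean of all training values.
    "simple_average": np.full(len(test), train.mean()),
    # Moving average: repeat the mean of the last `window` training values.
    "moving_average": np.full(len(test), train[-window:].mean()),
}

scores = {name: rmse(test, forecast) for name, forecast in forecasts.items()}
for name, score in scores.items():
    print(name, round(score, 3))
```

Each baseline produces a flat forecast, so comparing their errors mostly shows how much recent history matters compared with the whole series.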
Ultimately, I knew that an out-of-the-box time series analysis would not be the final answer, although I did expect somewhat better results even with my limited data set. Nonetheless, I do believe that a combination of the above will make it into my final algorithm. I believe that by using a combination of the Simple Average and the Moving Rolling Average I can further narrow down a starting point for any given day.
Lastly, before I let you go today, I realize that my techniques are nothing out of the ordinary. The true purpose of these tests is twofold. One, I have never done this type of statistical analysis before and I am loving it. Two, I want to uncover ideas and inspiration for the ultimate algorithm that I hope to discover. I also realize that the analysis above is incomplete, since it does not factor in any elements other than temperature, and that alone is definitely not enough to make an accurate prediction.