When I last left you, I had chosen a technology stack and a city to conduct research on. After many failed attempts I realized I simply did not have enough data for a single city and its surrounding areas to successfully predict the weather farther than a few hours in advance. With that knowledge I decided it was time to gather more data and expand to the entire United States.
Oh, and I gave my project an official name: MeteoML. Not the most clever name ever, but the domain name was available. :)
Knowing that my data source was the Open Weather API, mainly due to costs, I imported their cities list, which gives me their internal CityID along with the latitude and longitude of each city. The current weather API they expose returns a range of data, from the current temperature to cloud cover, humidity, pressure, wind speed, and more. I chose to import all of this data and normalize it into a related set of classes. Remember, I am using InterSystems IRIS Data Platform, which behaves much differently than a more traditional database would.
In IRIS I chose to create Persistent classes, which allow me to create a class, or a model, that mimics the data I want. For those who are unaware of the technology, think of a class as a table in a database on steroids. Out-of-the-box functionality allows me to create a class, populate its properties with data, and then simply call %Save() on the instance, which persists it to disk, much like committing in SQL.
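For readers who think in tables, here is a rough analogy in Python rather than IRIS code. It is only a sketch: the City class, its field names, and the save() method are hypothetical stand-ins, and sqlite3 is used purely to show the "populate an object, then save it" pattern.

```python
import sqlite3
from dataclasses import dataclass

# In-memory database used purely for illustration (not IRIS).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city (city_id INTEGER PRIMARY KEY, lat REAL, lon REAL)"
)


@dataclass
class City:
    # Hypothetical fields; the real class may hold more.
    city_id: int
    lat: float
    lon: float

    def save(self) -> None:
        """Write this object to its table, loosely like calling %Save()."""
        conn.execute(
            "INSERT OR REPLACE INTO city (city_id, lat, lon) VALUES (?, ?, ?)",
            (self.city_id, self.lat, self.lon),
        )
        conn.commit()


# Populate the properties, then persist the object.
City(city_id=1, lat=40.0, lon=-75.0).save()
```

In IRIS the table definition and the save logic come with the class itself, which is the "on steroids" part; here they have to be written by hand.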
With that brief lesson out of the way, I chose to create classes that represented each type of data and related them all together on the CityID. The classes are Cities, Clouds, Temperature, Weather, and Wind. If anyone is interested I am happy to share the specific details, but for the sake of brevity I won’t put you through that here.
With my classes in place, I then wrote a routine that runs hourly and imports the current weather conditions for every city I track. Currently, that is 19,972 cities located in the United States. Every hour I reach out to the Open Weather API, ask for the current conditions, populate the classes discussed above with the data, and save it after checking that it is not a duplicate entry.
Since I only pay for updates every two hours, running the import hourly could store the same reading twice and skew my results. To avoid that, I check that the CityID and DateTime combination does not already exist. If it does, I skip it for that hour and move on to the next city. Surprisingly, this process takes an average of 45 minutes to complete.
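The duplicate check can be sketched in a few lines of Python. This is a minimal illustration, not the actual routine: the seen set stands in for a lookup against the stored data, and store_if_new is a hypothetical name.

```python
from datetime import datetime

# Hypothetical in-memory stand-in for the stored observations,
# keyed on (city_id, observation time).
seen: set[tuple[int, datetime]] = set()


def store_if_new(city_id: int, observed_at: datetime) -> bool:
    """Return True if the reading was stored, False if it was a duplicate."""
    key = (city_id, observed_at)
    if key in seen:
        # Same city and timestamp already stored: skip it for this hour.
        return False
    seen.add(key)
    return True
```

Because the key is the pair of city and timestamp, an hourly run that receives an unchanged two-hour reading simply skips it.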
Since I am allowed 600 API calls per minute and I currently only use about 350, I decided to start incorporating other data as well. This data includes current Carbon Dioxide, Nitrogen Dioxide, Sulfur Dioxide, Ozone levels, and the UV Index for each city. All in all, I bring in nearly 1 GB of data per day when everything runs as intended.
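The headroom is simple arithmetic, shown here as a small Python sketch. The function names are made up for illustration, and the only inputs are the 600 and 350 calls per minute mentioned above.

```python
def spare_calls_per_minute(limit: int, used: int) -> int:
    """Calls per minute still available under the rate limit."""
    return max(limit - used, 0)


def min_interval_seconds(calls_per_minute: int) -> float:
    """Smallest average gap between calls that stays within the limit."""
    return 60 / calls_per_minute


spare = spare_calls_per_minute(limit=600, used=350)  # 250 calls per minute
gap = min_interval_seconds(600)                      # 0.1 seconds per call
```

Roughly 250 calls per minute are left over, which is the room the additional data has to fit into.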
This process has been running daily, with minor outages here and there due to, well, my bad code and reboots of my system. I plan on allowing this data to accumulate for another week, which will give me two solid weeks for every city. At that point I will use those two weeks as my training set while the next week accumulates. That third week will then become my new training set while I wait for the following week to populate.
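One way to picture this weekly rotation is the Python sketch below. It reflects one possible reading of the plan (train on everything collected before a cutoff, test on the week that follows), and the observed_at key and function name are hypothetical.

```python
from datetime import datetime, timedelta


def split_by_week(rows: list[dict], cutoff: datetime) -> tuple[list[dict], list[dict]]:
    """Train on rows before the cutoff, test on the 7 days starting at it."""
    test_end = cutoff + timedelta(days=7)
    train = [r for r in rows if r["observed_at"] < cutoff]
    test = [r for r in rows if cutoff <= r["observed_at"] < test_end]
    return train, test
```

Moving the cutoff forward by seven days each week grows the training set while always leaving a fresh, unseen week to measure accuracy against.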
I plan on continuing this process week over week, which will not only allow my Machine Learning model to become smarter as more data arrives, but also let me test the model weekly for accuracy. Currently, I am using ML.NET, although I am finding many limitations with how it is built and functions. After all, it is built for the masses, not for my project specifically. This means that I will need to build my own algorithm that is tailored to my data and models.
That is where I will leave you for now. In Part 3 I will start to get into how I am building my machine learning models, some of the math and logic behind them, and how accurate they are based on this extremely small data set.