Working with real data - baby steps
How’s the weather?
Let’s do a really simple analysis of the daily temperatures for a particular region, say an area in LA. Our goal is to collect a year’s worth of daily temperatures and find the dataset’s mean and variance.
Capturing the data
If you head over to https://dev.meteostat.net/, you can get full data dumps of individual weather stations. A little browsing around tells you that the weather station at Los Angeles/Jefferson has the id KCQT0.
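To get a csv file of the daily data from this weather station, you can type the following commands in your terminal. The download is gzip-compressed, so the second command unpacks it into KCQT0.csv:
curl "https://bulk.meteostat.net/daily/KCQT0.csv.gz" --output "KCQT0.csv.gz"
gunzip KCQT0.csv.gz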
A typical entry in the csv file looks like
2000-04-28,16.9,15.6,20.0, , , ,4.7, ,1014.3,
The API docs again tell you that the fields are, in order:
date, average air temperature in celsius, minimum air temperature in celsius, maximum air temperature in celsius, daily precipitation total in mm, snow depth in mm, average wind direction in degrees, average wind speed in km/h, peak wind gust in km/h, average sea-level air pressure in hPa, and the daily sunshine total in minutes.
Framing the data
Let’s read the csv file and put the data into a pandas dataframe.
Code for creating dataframe:
#imports
import pandas as pd
# Make a list of column names
col_names = ['date', 'avgtemp', 'mintemp', 'pp', 'snow', 'wind-dir',
             'wind-speed', 'wind-gut', 'air-pressure', 'sunshine']
# Read the comma separated csv into a pandas dataframe
daily_weather_df = pd.read_csv('KCQT0.csv', sep=',', names=col_names, header=None)
print("Weather dataframe looks like:")
print(daily_weather_df.head())
Output:
Weather dataframe looks like:
date avgtemp mintemp pp snow wind-dir wind-speed wind-gut
2000-01-01 10.4 7.8 13.9 NaN NaN NaN 2.0 NaN
2000-01-02 12.0 7.2 15.6 NaN NaN NaN 8.1 NaN
2000-01-03 11.4 5.6 18.9 NaN NaN NaN 1.3 NaN
2000-01-04 12.6 7.2 20.0 NaN NaN NaN 3.0 NaN
2000-01-05 13.3 5.6 21.7 NaN NaN NaN 1.9 NaN
air-pressure sunshine
2000-01-01 1018.9 NaN
2000-01-02 1021.0 NaN
2000-01-03 1026.5 NaN
2000-01-04 1024.9 NaN
2000-01-05 1018.0 NaN
Now, wait a minute. That doesn’t seem right … the dates have turned into the row index and every value sits one column to the left of its label! What is going on?
This brings us to lesson number one: pay attention to the formatting of the input. Let’s look again at a typical row in the csv file.
2000-04-28,16.9,15.6,20.0, , , ,4.7, ,1014.3,
Count the fields. Ten commas separate eleven values (the last one, after the trailing comma, is an empty sunshine entry), but we supplied only ten column names: our list skipped the maximum air temperature, which is the fourth value (20.0 here). When the rows have one more field than there are names, pandas treats the first field as the row index, which is why the column labels were off.
Corrected code for creating the dataframe:
#imports
import pandas as pd
# Make a list of column names
# Each row in the csv file has eleven fields, so we need eleven names,
# including maxtemp, which was missing from the first attempt
col_names = ['date', 'avgtemp', 'mintemp', 'maxtemp', 'pp', 'snow', 'wind-dir',
             'wind-speed', 'wind-gut', 'air-pressure', 'sunshine']
# Read the comma separated csv into a pandas dataframe
daily_weather_df = pd.read_csv('KCQT0.csv', sep=',', names=col_names, header=None)
print("Weather dataframe looks like:")
print(daily_weather_df.head())
Output:
Weather dataframe looks like:
date avgtemp mintemp maxtemp pp snow wind-dir wind-speed \
0 2000-01-01 10.4 7.8 13.9 NaN NaN NaN 2.0
1 2000-01-02 12.0 7.2 15.6 NaN NaN NaN 8.1
2 2000-01-03 11.4 5.6 18.9 NaN NaN NaN 1.3
3 2000-01-04 12.6 7.2 20.0 NaN NaN NaN 3.0
4 2000-01-05 13.3 5.6 21.7 NaN NaN NaN 1.9
wind-gut air-pressure sunshine
0 NaN 1018.9 NaN
1 NaN 1021.0 NaN
2 NaN 1026.5 NaN
3 NaN 1024.9 NaN
4 NaN 1018.0 NaN
OK, that looks better.
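A quick way to catch a mismatch like this before loading the whole file is to count the fields in one raw row and compare that with the number of column names. The sketch below does this with Python's csv module on the sample row shown earlier and the ten names from the first attempt; the helper name count_fields is made up for illustration.

```python
import csv
import io

def count_fields(raw_line):
    """Return the number of comma-separated fields in one raw csv line."""
    reader = csv.reader(io.StringIO(raw_line))
    return len(next(reader))

# The sample row from the csv file, trailing comma included
sample_row = "2000-04-28,16.9,15.6,20.0, , , ,4.7, ,1014.3,"

# The ten column names used in the first attempt
first_attempt_names = ['date', 'avgtemp', 'mintemp', 'pp', 'snow', 'wind-dir',
                       'wind-speed', 'wind-gut', 'air-pressure', 'sunshine']

n_fields = count_fields(sample_row)
n_names = len(first_attempt_names)
if n_fields != n_names:
    print(f"Mismatch: the row has {n_fields} fields but {n_names} names were given")
```

Running a check like this on the first line of a new file takes a few seconds and can save a lot of head scratching later.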
To numpy and beyond
We’ll now take the avgtemp column of our daily_weather_df data frame and turn it into a numpy array. Since we only want a year’s worth of temperatures, we’ll only keep the last 365 values from the numpy array.
Code:
#imports
import numpy as np
#Get a numpy array of the temperatures of the last 365 days
daily_temp = daily_weather_df['avgtemp'].to_numpy()[-365:]
print(f"The first 10 entries in daily_temp are {daily_temp[:10]}")
Output:
The first 10 entries in daily_temp are [22.2 21.9 21.2 21.2 21.7 21.9 21.8 23.2 24. 23.4]
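Taking the last 365 rows assumes the file has exactly one row per day with no gaps. If that is in doubt, the year can be selected by date instead. The sketch below is one possible way to do it, using a small made-up dataframe in place of daily_weather_df; the helper name last_year_of and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def last_year_of(df, date_col='date'):
    """Keep the rows whose date falls within the 365 days ending at the latest date."""
    dates = pd.to_datetime(df[date_col])
    cutoff = dates.max() - pd.Timedelta(days=365)
    return df[dates > cutoff]

# Toy stand-in for daily_weather_df: 400 consecutive days of made-up temperatures
toy_df = pd.DataFrame({
    'date': pd.date_range('2020-01-01', periods=400).strftime('%Y-%m-%d'),
    'avgtemp': np.linspace(10.0, 25.0, 400),
})

recent = last_year_of(toy_df)
daily_temp = recent['avgtemp'].to_numpy()
print(f"Kept {len(daily_temp)} rows, from {recent['date'].iloc[0]} to {recent['date'].iloc[-1]}")
```

With gap-free daily data this keeps the same rows as slicing with [-365:]; if days are missing, it keeps fewer rows rather than silently reaching further back in time.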
Let’s now find the mean and variance.
Code:
mean = np.mean(daily_temp)
variance = np.var(daily_temp)
print(f"Mean temperature is {mean} celsius")
print(f"Variance is {variance}")
Output:
Mean temperature is nan celsius
Variance is nan
Oh dear, now what happened?
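This brings us to lesson number two: pay attention to missing or invalid input values! If you examine the numpy array daily_temp that we created, you will find NaN entries (see footnote 1). Any arithmetic involving a NaN yields NaN, so a single missing reading is enough to turn both the mean and the variance into NaN.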
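To see how many readings are missing, np.isnan can be used to build a mask of the NaN positions. The sketch below uses a short made-up array as a stand-in for daily_temp, so the values and the name toy_temp are assumptions rather than the real data.

```python
import numpy as np

# Toy stand-in for daily_temp with two missing readings (made-up values)
toy_temp = np.array([22.2, np.nan, 21.2, 21.7, np.nan, 23.4])

nan_mask = np.isnan(toy_temp)        # True where a reading is missing
n_missing = int(nan_mask.sum())      # each True counts as 1
plain_mean = np.mean(toy_temp)       # NaN propagates through the plain mean

print(f"{n_missing} of {toy_temp.size} entries are NaN")
print(f"Plain mean is {plain_mean}")
```

The same two lines applied to daily_temp would show how many of the 365 days have no recorded average temperature.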
Ignoring NaNs
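To find the mean and variance, we need to work around the NaN values that are present in our numpy array. Fortunately, numpy has built-in functions that do exactly that: np.nanmean and np.nanvar skip the NaN entries and compute the statistic over the remaining values.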
Corrected Code:
#Find the mean and variance (ignore the NaN values)
mean = np.nanmean(daily_temp)
variance = np.nanvar(daily_temp)
print(f"Mean temperature is {mean} celsius")
print(f"Variance is {variance}")
Output:
Mean temperature is 19.301098901098904 celsius
Variance is 14.313350440768023
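One detail worth knowing: np.nanvar, like np.var, divides by the number of valid values by default, which gives the population variance. If the year of temperatures is treated as a sample, passing ddof=1 gives the sample variance instead. A minimal sketch with made-up values (toy_temp is not the real data):

```python
import numpy as np

# Three valid readings and one missing one (made-up values)
toy_temp = np.array([20.0, np.nan, 22.0, 24.0])

population_var = np.nanvar(toy_temp)        # divides by n = 3 valid values
sample_var = np.nanvar(toy_temp, ddof=1)    # divides by n - 1 = 2
std_dev = np.nanstd(toy_temp)               # square root of the population variance

print(f"Population variance is {population_var}")
print(f"Sample variance is {sample_var}")
print(f"Standard deviation is {std_dev} celsius")
```

With a full year of readings the two variances differ only slightly, but the standard deviation is often the easier number to interpret because it is in the same unit as the temperatures.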
Footnotes
1. NaN means Not a Number.