Pandas Time Series example with some historical land temperatures
Monthly averaged historical temperatures in France and over the global land surface
The aim of this notebook is just to play with time series along with a couple of statistical and plotting libraries.
Imports
%matplotlib inline
import pandas as pd # 0.23.0
import numpy as np
import matplotlib.pyplot as plt # 2.2.2
import seaborn as sns # 0.8.1
from statsmodels.tsa.stattools import adfuller # 0.9.0
import statsmodels.api as sm
Loading the data
Historical monthly average temperature in France from the World Bank Group.
Historical data to understand the seasonal CYCLE: This gridded historical dataset is derived from observational data, and provides quality controlled temperature and rainfall values from thousands of weather stations worldwide, as well as derivative products including monthly climatologies and long term historical climatologies. The dataset is produced by the Climatic Research Unit (CRU) of University of East Anglia (UEA), and reformatted by International Water Management Institute (IWMI). CRU-(Gridded Product). CRU data can be mapped to show the baseline climate and seasonality by month, for specific years, and for rainfall and temperature.
The data was loaded into a spreadsheet and then exported as a CSV file, after changing the decimal separator from comma to point and removing unneeded columns (country code).
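Note that the spreadsheet round trip is not strictly necessary: read_csv can handle a comma decimal separator directly through its decimal parameter. A minimal sketch, assuming a hypothetical raw export tas_1901_2015_France_raw.csv holding the same columns:
# Hypothetical alternative: parse the raw file directly, treating ',' as the
# decimal separator and keeping only the columns we need.
temperature = pd.read_csv('data/tas_1901_2015_France_raw.csv', decimal=',',
                          usecols=['tas', 'Year', 'Month'])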
temperature = pd.read_csv('data/tas_1901_2015_France.csv')
temperature.head()
|   | tas | Year | Month |
|---|---|---|---|
| 0 | 3.25950 | 1901 | 1 |
| 1 | 0.32148 | 1901 | 2 |
| 2 | 4.82373 | 1901 | 3 |
| 3 | 9.38568 | 1901 | 4 |
| 4 | 12.69960 | 1901 | 5 |
print(len(temperature), "rows")
print("column types :\n", temperature.dtypes)
1380 rows
column types :
tas float64
Year int64
Month int64
dtype: object
Creating a DatetimeIndex
Pandas' `to_datetime()` method requires at least the `year`, `month` and `day` columns (in lower case). So we rename the `Year` and `Month` columns and create a `day` column arbitrarily set to the 15th of each month.
temperature.rename(columns={'Year': 'year', 'Month': 'month'}, inplace=True)
temperature["day"] = 15
temperature.head()
|   | tas | year | month | day |
|---|---|---|---|---|
| 0 | 3.25950 | 1901 | 1 | 15 |
| 1 | 0.32148 | 1901 | 2 | 15 |
| 2 | 4.82373 | 1901 | 3 | 15 |
| 3 | 9.38568 | 1901 | 4 | 15 |
| 4 | 12.69960 | 1901 | 5 | 15 |
Now we can apply the `to_datetime()` method, change the index and clean the DataFrame.
temperature["Date"] = pd.to_datetime(temperature[['year', 'month', 'day']])
temperature.set_index("Date", inplace=True)
temperature.drop(['year', 'month', 'day'], axis=1, inplace=True)
temperature.head()
| Date | tas |
|---|---|
| 1901-01-15 | 3.25950 |
| 1901-02-15 | 0.32148 |
| 1901-03-15 | 4.82373 |
| 1901-04-15 | 9.38568 |
| 1901-05-15 | 12.69960 |
Plotting these monthly averaged temperatures is rather messy because of seasonal fluctuations.
temperature.plot(figsize=(12, 6));
Note that we could also have created a `PeriodIndex` with a monthly frequency:
temperature = temperature.to_period(freq='M')
However, this seems to be less handy for resampling operations.
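For the record, here is a small sketch of the round trip between the two index types (assuming the temperature DataFrame defined above); note that `to_timestamp()` lands on the period start, not on the original 15th:
temp_p = temperature.to_period(freq='M')  # DatetimeIndex -> monthly PeriodIndex
temp_back = temp_p.to_timestamp()         # back to a DatetimeIndex (period start, e.g. 1901-01-01)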
Resampling
The first thing we can do is to compute the annual `mean`, `min` and `max`. The `resample()` method is used with the 'AS' offset alias, which corresponds to the year-start frequency (see the pandas documentation for the full list of offset aliases).
temp_annual = temperature.resample("AS").agg(['mean', 'min', 'max'])
temp_annual.head(5)
| Date | tas (mean) | tas (min) | tas (max) |
|---|---|---|---|
| 1901-01-01 | 10.177163 | 0.32148 | 20.1921 |
| 1902-01-01 | 9.991733 | 2.26860 | 18.1870 |
| 1903-01-01 | 10.070277 | 2.95230 | 17.7461 |
| 1904-01-01 | 10.307586 | 2.31437 | 18.9831 |
| 1905-01-01 | 9.982693 | 0.49251 | 18.8757 |
temp_annual.tas.plot(figsize=(12, 6))
ax = plt.gca()
ax.grid(color= (0.1, 0.1, 0.1), linestyle='--', linewidth=1, alpha=0.5)
plt.autoscale(enable=True, axis='x', tight=True)
plt.ylabel("Temperature (°C)")
sns.despine()
Rolling window mean
Next we compute a centered rolling-window mean with a 10-year window and an exponentially weighted moving average with a 2-year half-life.
temp_annual[('tas', 'mean')].plot(label='yearly temperature', figsize=(12, 6))
temp_annual[('tas', 'mean')].rolling(10, center=True).mean().plot(label='10 years rolling window mean')
temp_annual[('tas', 'mean')].ewm(halflife=2).mean().plot(label='EWMA(halflife=2)') # Exponentially-weighted moving average
ax = plt.gca()
ax.grid(color= (0.1, 0.1, 0.1), linestyle='--', linewidth=1, alpha=0.5)
plt.autoscale(enable=True, axis='x', tight=True)
plt.legend()
plt.ylabel("Temperature (°C)")
sns.despine()
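As a side note, `halflife` is just a convenient way of specifying the smoothing factor: pandas converts it to `alpha = 1 - 0.5**(1/halflife)`, so the two calls below should give the same result (a quick check on the data above):
hl = 2
alpha = 1 - 0.5 ** (1 / hl)  # ~0.293 for a 2-year half-life
s1 = temp_annual[('tas', 'mean')].ewm(halflife=hl).mean()
s2 = temp_annual[('tas', 'mean')].ewm(alpha=alpha).mean()
print(np.allclose(s1, s2))  # True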
We can also compute and plot the standard deviation of the data over the rolling window.
m = temp_annual[('tas', 'mean')].rolling(10, center=True).agg(['mean', 'std'])
ax = m['mean'].plot(label='10 years rolling window mean', figsize=(12, 6))
ax.grid(color= (0.1, 0.1, 0.1), linestyle='--', linewidth=1, alpha=0.5)
ax.fill_between(m.index, m['mean'] - m['std'], m['mean'] + m['std'], alpha=.25)
plt.autoscale(enable=True, axis='x', tight=True)
plt.ylabel("Temperature (°C)")
sns.despine()
Checking for Stationarity
Well, I am not very familiar with statistics related to time series, so I just grabbed something that looks fancy on this page :) This is the (augmented) Dickey-Fuller test from statsmodels. Its null hypothesis is that the series has a unit root, i.e. that it is non-stationary.
Let us apply this test to the original monthly temperature DataFrame. First we convert it into a time `Series`.
ts = pd.Series(temperature.tas, index=temperature.index)
ts.head()
Date
1901-01-15 3.25950
1901-02-15 0.32148
1901-03-15 4.82373
1901-04-15 9.38568
1901-05-15 12.69960
Name: tas, dtype: float64
Then we perform the test:
dftest = adfuller(ts, autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key, value in dftest[4].items():
dfoutput['Critical Value (%s)'%key] = value
print(dfoutput)
Test Statistic -1.651531
p-value 0.456242
#Lags Used 24.000000
Number of Observations Used 1355.000000
Critical Value (1%) -3.435185
Critical Value (5%) -2.863675
Critical Value (10%) -2.567907
dtype: float64
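Since we will run exactly the same test again later, it may be worth wrapping the boilerplate above in a small helper (a sketch; the name run_adf is just a choice made here):
def run_adf(series):
    # Augmented Dickey-Fuller test with a readable summary
    res = adfuller(series, autolag='AIC')
    out = pd.Series(res[0:4], index=['Test Statistic', 'p-value',
                                     '#Lags Used', 'Number of Observations Used'])
    for key, value in res[4].items():
        out['Critical Value (%s)' % key] = value
    return out

print(run_adf(ts))  # same output as above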
We already saw that the mean is clearly increasing from the 1980s on, while the standard deviation seems to be uniform. Now we can observe that the Test Statistic is significantly greater than the critical values and that the p-value is not so small: we cannot reject the null hypothesis, which also seems to indicate that the data is not stationary.
Let us try to make this time series artificially stationary by removing the rolling mean from the data and run the test again. We start by computing the mean over a 120-month rolling window.
ts_mean = ts.rolling(window=120).mean().rename("t_mean")
ts_mean.plot();
Note that there are `NaN` values at the beginning of the series because there is not enough data in the rolling window.
ts_mean.head()
Date
1901-01-15 NaN
1901-02-15 NaN
1901-03-15 NaN
1901-04-15 NaN
1901-05-15 NaN
Name: t_mean, dtype: float64
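If those leading `NaN` values were a problem, `rolling()` accepts a `min_periods` argument so that the mean is computed from however many observations are available, at the price of much noisier values at the start of the series:
ts.rolling(window=120, min_periods=1).mean().head(3)  # no NaN: averages over 1, 2, 3 points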
Now we merge the mean temperature with the original monthly temperature and compute the “corrected” temperature `t_modified`.
stat_temp = pd.merge(temperature, ts_mean.to_frame(),
left_index=True, right_index=True, how='inner').dropna()
stat_temp["t_modified"] = stat_temp["tas"]-stat_temp["t_mean"]+stat_temp["t_mean"][0]
Let us re-compute the rolling mean for original and modified data.
stat_temp.tas.rolling(window=120).mean().plot(label='original', figsize=(12, 6))
stat_temp.t_modified.rolling(window=120).mean().plot(label='modified')
ax = plt.gca()
ax.grid(color=(0.1, 0.1, 0.1), linestyle='--', linewidth=1, alpha=0.5)
plt.autoscale(enable=True, axis='x', tight=True)
plt.legend()
plt.ylabel("Temperature (°C)")
sns.despine()
Now we run the Dickey–Fuller test again, on the modified data.
ts2 = pd.Series(stat_temp.t_modified, index=stat_temp.index)
ts2.head()
Date
1910-12-15 4.385380
1911-01-15 2.021325
1911-02-15 3.879265
1911-03-15 6.911458
1911-04-15 8.259301
Name: t_modified, dtype: float64
dftest = adfuller(ts2, autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key, value in dftest[4].items():
dfoutput['Critical Value (%s)'%key] = value
print(dfoutput)
Test Statistic -5.913709e+00
p-value 2.602534e-07
#Lags Used 2.300000e+01
Number of Observations Used 1.237000e+03
Critical Value (1%) -3.435647e+00
Critical Value (5%) -2.863879e+00
Critical Value (10%) -2.568015e+00
dtype: float64
This time the data is found to be stationary (the Test Statistic is lower than the critical values and the p-value is small), which is not a big surprise…
We should probably consider a longer time span when testing for the non-stationarity of temperature.
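Subtracting a rolling mean is only one way of detrending; differencing is the more standard transformation used to stationarize a series. A quick sketch using a 12-month seasonal difference of the original monthly data, reusing the run_adf helper sketched above:
ts_diff = ts.diff(12).dropna()  # year-over-year change removes the seasonal cycle
print(run_adf(ts_diff))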
Comparison with global land temperature
We are now going to compare France's temperatures with global land temperatures, taken from Berkeley Earth's website.
glt = pd.read_csv('data/Complete_TAVG_complete.txt', delim_whitespace=True, skiprows=33).iloc[:,0:4]
glt.columns = ['year', 'month', 'Anomaly', 'Unc']
glt.head()
|   | year | month | Anomaly | Unc |
|---|---|---|---|---|
| 0 | 1750 | 1 | 0.382 | NaN |
| 1 | 1750 | 2 | 0.539 | NaN |
| 2 | 1750 | 3 | 0.574 | NaN |
| 3 | 1750 | 4 | 0.382 | NaN |
| 4 | 1750 | 5 | NaN | NaN |
Let us list the years with `NaN` values in the Anomaly column.
gby = glt[['year', 'Anomaly']].groupby('year').count()
gby[gby.Anomaly<12]
| year | Anomaly |
|---|---|
| 1750 | 4 |
| 1751 | 1 |
| 1752 | 6 |
| 2017 | 7 |
As we can see, the monthly Anomaly data is complete between 1753 and 2016.
glt = glt.loc[(glt.year >= 1753) & (glt.year <= 2016)]
Estimated Jan 1951-Dec 1980 monthly absolute temperature, copied from the file header:
mat = pd.DataFrame({
'month': np.arange(1, 13),
't_abs': [2.62, 3.23, 5.33, 8.36, 11.37, 13.53, 14.41, 13.93, 12.12, 9.26, 6.12, 3.67]})
# 'unc': [0.08, 0.07, 0.06, 0.07, 0.07, 0.07, 0.07, 0.07, 0.07, 0.06, 0.07, 0.08]})
Let us merge the absolute temperature with the Anomaly.
glt = pd.merge(glt, mat, on='month', how='left')
glt['t'] = glt.t_abs + glt.Anomaly
And create a `DatetimeIndex`.
glt['day'] = 15
glt["Date"] = pd.to_datetime(glt[['year', 'month', 'day']])
glt.set_index("Date", inplace=True)
glt.drop(['year', 'month', 'day'], axis=1, inplace=True)
glt.head()
| Date | Anomaly | Unc | t_abs | t |
|---|---|---|---|---|
| 1753-01-15 | -1.108 | 3.646 | 2.62 | 1.512 |
| 1753-02-15 | -1.652 | 4.461 | 3.23 | 1.578 |
| 1753-03-15 | 1.020 | 3.623 | 5.33 | 6.350 |
| 1753-04-15 | -0.557 | 3.898 | 8.36 | 7.803 |
| 1753-05-15 | 0.399 | 1.652 | 11.37 | 11.769 |
glt.t.plot(figsize=(12, 6), linewidth=0.5)
ax = plt.gca()
ax.grid(color= (0.1, 0.1, 0.1), linestyle='--', linewidth=1, alpha=0.25)
plt.autoscale(enable=True, axis='x', tight=True)
plt.ylabel("Temperature (°C)")
sns.despine()
Now we resample the data to a yearly frequency.
glty = glt.resample("AS").agg(['mean', 'min', 'max'])
Let us plot these yearly averaged temperatures along with the 95% confidence interval.
glty[('t', 'mean')].plot(figsize=(12, 6), label='Yearly mean')
ax = plt.gca()
ax.fill_between(glty.index, glty[('t', 'mean')] - 0.95*glty[('Unc', 'mean')],
glty[('t', 'mean')] + 0.95*glty[('Unc', 'mean')], alpha=.25, label='95% confidence interval')
ax.grid(color= (0.1, 0.1, 0.1), linestyle='--', linewidth=1, alpha=0.25)
plt.autoscale(enable=True, axis='x', tight=True)
plt.ylabel("Temperature (°C)")
plt.title('Global land temperature')
plt.legend()
sns.despine()
Now we plot both France’s and global land’s yearly mean.
temp_annual[('tas', 'mean')].plot(figsize=(12, 6), label='France')
glty.loc[glty.index >= temp_annual.index.min(), ('t', 'mean')].plot(label='Global land')
ax = plt.gca()
ax.grid(color= (0.1, 0.1, 0.1), linestyle='--', linewidth=1, alpha=0.25)
plt.autoscale(enable=True, axis='x', tight=True)
plt.ylabel("Temperature (°C)")
plt.legend()
sns.despine()
Well, it seems like France is getting warmer faster than the average land surface.
Average increasing rates
Now let's try to perform a linear regression on both temperature series, from 1975 to the end of each dataset (2015 for France, 2016 for the global land).
In order to use OLS from statsmodels, we need to convert the `datetime` objects into real numbers. We do so using the `to_julian_date` method. We also add a constant term since we need an intercept.
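To get a feel for the scale of these numbers: a Julian date counts days (and fractions of days) since noon on January 1, 4713 BC, so one unit of jDate is one day. For example:
pd.Timestamp('1975-01-01').to_julian_date()  # 2442413.5
This is also why the regression slope below is expressed in degrees per day, and why the condition number reported by statsmodels is so large.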
df_france = temp_annual.loc[temp_annual.index >= pd.Timestamp('1975-01-01'), ('tas', 'mean')].copy(deep=True).reset_index()
df_france.columns = ['Date', 't']
df_france.set_index('Date', drop=False, inplace=True)
df_france['jDate'] = df_france['Date'].map(pd.Timestamp.to_julian_date)
df_france['const'] = 1.0
#epoch = pd.to_datetime(0, unit='d').to_julian_date()
#df_france['Date2'] = pd.to_datetime(df_france['jDate']- epoch, unit='D')
df_land = glty.loc[glty.index >= pd.Timestamp('1975-01-01'), ('t', 'mean')].copy(deep=True).reset_index()
df_land.columns = ['Date', 't']
df_land.set_index('Date', drop=False, inplace=True)
df_land['jDate'] = df_land['Date'].map(pd.Timestamp.to_julian_date)
df_land['const'] = 1.0
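Manually adding a column of ones works fine; note that statsmodels also provides `sm.add_constant` for exactly this purpose. An equivalent sketch for the France data:
X = sm.add_constant(df_france['jDate'])  # adds a 'const' column of ones
mod_alt = sm.OLS(df_france['t'], X)      # same model as the one fitted below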
mod = sm.OLS(df_france['t'], df_france[['jDate', 'const']])
res = mod.fit()
print(res.summary())
df_france['pred'] = res.predict(df_france[['jDate', 'const']])
OLS Regression Results
==============================================================================
Dep. Variable: t R-squared: 0.761
Model: OLS Adj. R-squared: 0.755
Method: Least Squares F-statistic: 124.4
Date: Thu, 24 May 2018 Prob (F-statistic): 1.07e-13
Time: 16:31:45 Log-Likelihood: -24.189
No. Observations: 41 AIC: 52.38
Df Residuals: 39 BIC: 55.81
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
jDate 0.0002 1.62e-05 11.152 0.000 0.000 0.000
const -430.3375 39.620 -10.861 0.000 -510.477 -350.198
==============================================================================
Omnibus: 3.133 Durbin-Watson: 1.971
Prob(Omnibus): 0.209 Jarque-Bera (JB): 1.820
Skew: -0.254 Prob(JB): 0.403
Kurtosis: 2.101 Cond. No. 1.39e+09
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.39e+09. This might indicate that there are
strong multicollinearity or other numerical problems.
print('Temperature change per year: ', res.params[0]*365.0)
Temperature change per year: 0.06583231229893599
mod = sm.OLS(df_land['t'], df_land[['jDate', 'const']])
res = mod.fit()
print(res.summary())
df_land['pred'] = res.predict(df_land[['jDate', 'const']])
OLS Regression Results
==============================================================================
Dep. Variable: t R-squared: 0.793
Model: OLS Adj. R-squared: 0.788
Method: Least Squares F-statistic: 153.0
Date: Thu, 24 May 2018 Prob (F-statistic): 3.00e-15
Time: 16:31:45 Log-Likelihood: 14.544
No. Observations: 42 AIC: -25.09
Df Residuals: 40 BIC: -21.61
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
jDate 7.56e-05 6.11e-06 12.368 0.000 6.32e-05 8.8e-05
const -176.0015 14.975 -11.753 0.000 -206.267 -145.736
==============================================================================
Omnibus: 4.505 Durbin-Watson: 2.032
Prob(Omnibus): 0.105 Jarque-Bera (JB): 1.852
Skew: 0.062 Prob(JB): 0.396
Kurtosis: 1.979 Cond. No. 1.36e+09
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.36e+09. This might indicate that there are
strong multicollinearity or other numerical problems.
print('Temperature change per year: ', res.params[0]*365.0)
Temperature change per year: 0.027594017104916324
df_france.t.plot(figsize=(12, 6), label='France')
df_france.pred.plot(label='France linear regression')
df_land.t.plot(label='Land')
df_land.pred.plot(label='Land linear regression')
ax = plt.gca()
ax.grid(color= (0.1, 0.1, 0.1), linestyle='--', linewidth=1, alpha=0.25)
plt.autoscale(enable=True, axis='x', tight=True)
plt.ylabel("Temperature (°C)")
plt.legend()
sns.despine()
So over the last 40 years, France has been getting hotter by 0.0658 degree per year on average (about 0.66 °C per decade), against 0.0276 degree per year (about 0.28 °C per decade) for the global land.