Chapter 20 Simple phenology analysis

Learning goals for this lesson

  • Understand why phenology modeling is tricky
  • Appreciate the dangers of p-hacking
  • Appreciate the importance of a causal theory in data analysis

20.1 Phenology analysis

Now we’ve finally reached the point in this module where we seriously start analyzing phenology. We’ll use the example of bloom dates of the pear cultivar Alexander Lucas that we already worked with in the frost analysis. As you may remember, we have a time series of first, full and last bloom dates recorded at Campus Klein-Altendorf between 1958 and 2019. For this analysis, I’ll just use the first bloom dates:

If you want to work with this example on your own computer, you can download the file here:

I suggest that you save this file in the data folder of your home directory: data/Alexander_Lucas_bloom_1958_2019.csv. Otherwise, you’ll have to adjust the path below. Note that the data are provided as a .csv file. To avoid any “comma vs. semicolon”-related problems, I recommend that you open it with the read_tab function.

Alex <- read_tab("data/Alexander_Lucas_bloom_1958_2019.csv")

Alex <- pivot_longer(Alex,
                     cols = c(First_bloom:Last_bloom),
                     names_to = "Stage",
                     values_to = "YEARMODA")

Alex_first <- Alex %>%
  mutate(Year = as.numeric(substr(YEARMODA, 1, 4)),
         Month = as.numeric(substr(YEARMODA, 5, 6)),
         Day = as.numeric(substr(YEARMODA, 7, 8))) %>%
  make_JDay() %>%
  filter(Stage == "First_bloom")

20.2 Time series analysis

The first analysis people often do with such a dataset is a simple analysis over time. Let’s first plot the data:

           JDay)) +
  geom_point() +
  ylab("First bloom date (day of the year)") +
  xlab ("Year") +
  theme_bw(base_size = 15)

It’s a bit hard to see a clear pattern here, but let’s check if there is a trend over time. We’ve already encountered the Kendall test in an earlier lesson, and we can also apply this here:

## Warning: package 'Kendall' was built under R version 4.3.2
Kendall(x = Alex_first$Pheno_year,
        y = Alex_first$JDay)
## tau = -0.186, 2-sided pvalue =0.03533

I’m getting a p-value of 0.035. The tau value is negative, so it seems we have a significant trend towards earlier bloom.

To determine how strong this trend is, many researchers use regression analysis, with model coefficients fitted to data. Let’s try this. We first assume a linear relationship between time and bloom dates: \[Bloomdate = a \cdot Phenoyear + b\] To make the code easier to read, I’ll name the Pheno_year x and the Bloom date (JDay) y. I’ll also plot the data using ggplot:

x <- Alex_first$Pheno_year
y <- Alex_first$JDay

summary(lm(y ~ x))
## Call:
## lm(formula = y ~ x)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.0959  -6.3591  -0.5959   6.6468  20.1238 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 429.16615  142.06000   3.021   0.0037 **
## x            -0.16184    0.07144  -2.266   0.0271 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 10.07 on 60 degrees of freedom
## Multiple R-squared:  0.0788, Adjusted R-squared:  0.06345 
## F-statistic: 5.133 on 1 and 60 DF,  p-value: 0.0271
           JDay)) +
  geom_point() +
  geom_smooth(method = 'lm',
              formula = y ~ x) +
  ylab("First bloom date (day of the year)") +
  xlab ("Year") +
  theme_bw(base_size = 15)

If you examine the outputs carefully, you’ll see that we’re getting a best estimate for the slope of -0.16. We also get a p-value here, but for a time series we’re better off using the one returned by the Kendall test. If you look at the diagram, you’ll notice that many of the data points are still pretty far from the regression line, with many of them way outside the confidence interval that is shown.

Maybe we can find another model that better describes the data. Many people would now start on a process of gradually making the model more complex, e.g. by fitting a second-order polynomial, or maybe a third-order polynomial… Let’s cut this process short and immediately fit a 25th-order polynomial:

summary(lm(y ~ poly(x, 25)))
## Call:
## lm(formula = y ~ poly(x, 25))
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.7311  -4.5098  -0.1227   2.8640  15.4590 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   107.3387     1.0549 101.753 < 0.0000000000000002 ***
## poly(x, 25)1  -22.8054     8.3063  -2.746              0.00937 ** 
## poly(x, 25)2   -5.8672     8.3063  -0.706              0.48451    
## poly(x, 25)3   14.7725     8.3063   1.778              0.08377 .  
## poly(x, 25)4   -5.3974     8.3063  -0.650              0.51995    
## poly(x, 25)5  -11.6801     8.3063  -1.406              0.16825    
## poly(x, 25)6    2.1928     8.3063   0.264              0.79329    
## poly(x, 25)7   -0.3034     8.3063  -0.037              0.97107    
## poly(x, 25)8    6.0115     8.3063   0.724              0.47391    
## poly(x, 25)9  -22.2895     8.3063  -2.683              0.01094 *  
## poly(x, 25)10   5.9522     8.3063   0.717              0.47825    
## poly(x, 25)11  -6.1217     8.3063  -0.737              0.46590    
## poly(x, 25)12   3.2676     8.3063   0.393              0.69636    
## poly(x, 25)13 -14.8467     8.3063  -1.787              0.08229 .  
## poly(x, 25)14  13.5180     8.3063   1.627              0.11237    
## poly(x, 25)15  10.1544     8.3063   1.222              0.22946    
## poly(x, 25)16 -12.6116     8.3063  -1.518              0.13767    
## poly(x, 25)17  -1.3315     8.3063  -0.160              0.87354    
## poly(x, 25)18  -6.3438     8.3063  -0.764              0.45000    
## poly(x, 25)19  14.9753     8.3063   1.803              0.07978 .  
## poly(x, 25)20   3.4573     8.3063   0.416              0.67972    
## poly(x, 25)21 -29.1997     8.3063  -3.515              0.00121 ** 
## poly(x, 25)22  10.4145     8.3063   1.254              0.21799    
## poly(x, 25)23   2.9898     8.3063   0.360              0.72100    
## poly(x, 25)24 -14.3045     8.3063  -1.722              0.09363 .  
## poly(x, 25)25 -20.9324     8.3063  -2.520              0.01631 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 8.306 on 36 degrees of freedom
## Multiple R-squared:  0.6237, Adjusted R-squared:  0.3623 
## F-statistic: 2.386 on 25 and 36 DF,  p-value: 0.008421
           JDay)) +
  geom_point() +
              formula = y ~ poly(x, 25)) +
  ylab("First bloom date (day of the year)") +
  xlab ("Year") +
  theme_bw(base_size = 15)

It’s a bit hard now of course to make a clear statement about the temporal trend, since there is no single number for the slope. The slope varies over time, but it can be calculated by taking the first derivative of the complex 25th-order polynomial we computed. But we got a stellar p-value, and almost all of the data points are contained in the confidence interval!

It’s probably obvious to you that this is not a useful model. If we make the model equation increasingly more complex, we’ll eventually find a structure that can perfectly fit all data points. This is not desirable, however, because we have no reason to expect that there is such a perfect relationship between our independent and dependent variables. Each measurement (such as a bloom date recording) comes with an error, and, more importantly, there may also be other factors in addition to time that affect bloom dates. If we make our model more complex than we can reasonably expect the actual relationship to be, we run the risk of overfitting our model. This means that we may get a very good fit, but our regression equation isn’t very useful for explaining the actual process we’re trying to model.

In the case of our 25th-order polynomial, we’re obviously looking at an overfit. In many other cases, however, overfits aren’t quite so easy to detect. The scientific literature is full of studies where authors took the structure of their regression model far beyond what could be considered a reasonable description of the underlying process. Especially in the field of machine-learning, overfitting is a very serious risk, in particular where the learning engines are basically black boxes that the analysts may not even fully understand.

20.3 p-hacking

A concept that is closely related to overfitting and that also often occurs in machine-learning is that of p-hacking, also known as a fishing expedition. The idea behind these terms is that when you look at a sufficiently large dataset, with lots of variables, you’re bound to find correlations somewhere in this dataset, regardless of whether your data involves actual causal relationships. Screening large datasets for ‘significant’ relationships and then presenting these as meaningful correlations is bad scientific practice (Nuzzo, 2014)! Possibly, such screening can lead to hypotheses that can then be tested through further studies, but we should not rely on this to generate ‘facts’.

20.4 The process that generates the data

Many researchers are focused on finding structure in the datasets they work with. While this can often produce useful insights, it can also lead to us neglecting an arguably more important objective of science - to find out how things work! What we really want to understand is the process that generates the data, rather than the data themselves.

While this argument may seem like semantics, it involves a change in perspective that can have important implications for how we do research. Rather than digging straight into the numbers, we first need to sit back and think about what may actually be going on in the system we’re analyzing. We can then use causal diagrams to sketch out how we think our system works:

\[A \to B\] With this realization, we quickly come to the conclusion that our analysis so far has been a bit beside the point.

Up to now, we were looking for the following relationship:

\[Time \to Phenology\] Can this possibly be what’s going on here? Can time drive phenology? While I won’t completely rule out that trees are somehow able to track the passage of time within their annual cycle, I’m pretty sure they have no idea what the current year is…

20.5 An ecological theory to guide our analysis

Before we start on an analysis of the kind we’re dealing with here, we should have a theory of what’s going on. In this case, this would be an ecological theory about what drives tree phenology. Let’s start with a very simple theory: Temperature drives phenology. I guess we already knew this, but it may still be useful to state it explicitly:

\[Temperature \to Phenology\] As you can see, this theory doesn’t involve time. The reason why we were able to find statistical relationships earlier is not that time (as in the calendar year) actually affects phenology - it is that temperatures have been rising over the years, mainly in response to anthropogenic global warming.

\[Time \to Temperature\] Well, actually this is \[Time \to GreenhouseGasEmissions \to ClimateForcing \to Temperature\] Since temperature, especially at the local scale, is subject to considerable random variation, this relationship may not be particularly strong. The noise in this relationship can easily keep us from detecting the actual response of tree phenology during our time series.

The full causal diagram of these three variables (without the greenhouse effect part) is \[Time \to Temperature \to Phenology\]

Sometimes we have no data on the intermediate steps in such a diagram. In such cases, we either have to live with a model that doesn’t actually get to the direct causal mechanisms in play, or we have to get a bit more creative with our analysis. Fortunately, here we actually do have data on temperature, so we can focus on the direct causal relationship between temperature and phenology.

20.6 Temperature correlations

Now that we clarified that it would make more sense to narrow in on the temperature-phenology relationship, we gained some conceptual clarity, but we are faced with a much greater statistical challenge. In correlating years with bloom dates, we had just two series of numbers to look at, since we had exactly one bloom date per year. Temperature, in contrast, is a variable we can measure at different temporal scales. If we just look at the average annual temperature, the statistical challenge is similar to using just the year number. Let’s see if we can find a correlation for this. You may have to click the button to download weather data first. I’d suggest you save it then under data/TMaxTMin1958-2019_patched.csv.