Chapter 10 Getting temperature data

Learning goals for this lesson

  • Appreciate the need for daily temperature data
  • Know how to get a list of promising weather stations contained in an international database
  • Be able to download weather data using chillR functions
  • Know how to convert downloaded data into chillR format

10.1 Temperature data needs

Obviously, without temperature data we can’t do much phenology and chill modeling. Temperature records are a critical input to virtually every model we may want to build or run. They also seem like an easy-to-find resource, don’t they? Well, you may be surprised by how difficult it can be to get such data. While all countries in the world have official weather stations that record precisely the type of information we need, many are very protective of these data. Many national weather services sell such information (the collection of which was likely funded by taxpayer money) at rather high prices. If you only want to do a study on a single location, you may be able to shell out that money, but this quickly becomes unrealistic when you’re targeting a larger-scale analysis.

On a personal note, I must say that I find it pretty outrageous that, at a time when we should be making every effort to understand the impacts of climate change on our world and to find ways to adapt, weather services are putting up such access barriers. I really wonder how many climate-related studies turned out less useful than they could have been, had more data been easily and freely available. Well, back to the main story…

To be clear, it’s of course preferable to have a high-quality dataset collected in the exact place that you want to analyze. If we don’t have such data, however, there are a few databases out there that we can draw on as an alternative option. chillR currently has the capability to access one global database, as well as one for California. There is certainly scope for expanding this capability, but let’s start working with what’s available now.

10.2 The Global Summary of the Day database

An invaluable source of temperature data is the National Centers for Environmental Information (NCEI), formerly the National Climatic Data Center (NCDC) of the United States National Oceanic and Atmospheric Administration (NOAA), in particular their Global Summary of the Day (GSOD) database. That was a pretty long name, so let’s stick with the abbreviation GSOD.

Check out the GSOD website to take a look at the interface: https://www.ncei.noaa.gov/access/search/data-search/global-summary-of-the-day. This interface used to be pretty confusing - and I almost find it more confusing now. Fortunately, if you click on the Bulk downloads button, you get to a place where you can directly access the weather data: https://www.ncei.noaa.gov/data/global-summary-of-the-day/access/. What we find here is, at first glance, even less accessible than the web interface, but at least we can recognize some structure now: all records are stored in separate files for each station and year, with the files named according to a code assigned to each weather station. You could download these records by hand, if you wanted to, but this would take a long time (if you want data for many years), and you’d first have to find out which station is of interest to you.
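
Just to illustrate what such a manual download would involve: each of these files is a plain CSV that R can read directly. The code below is only a sketch - the year/station-code path pattern is my reading of the directory listing, and the station code is one we’ll meet again later in this chapter. chillR will soon handle all of this for us.

# illustration only: read one station-year file straight from the GSOD bulk download area
# (assumed path pattern: .../access/<year>/<station code>.csv)
gsod_url<-paste0("https://www.ncei.noaa.gov/data/global-summary-of-the-day/",
                 "access/1990/10513099999.csv")
one_file<-read.csv(gsod_url)
head(one_file)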

Fortunately, I found a list of all the weather stations somewhere on NOAA’s website: ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-history.csv, and I automated the tedious downloading and assembly process in chillR. My attempt resulted in a reliable but fairly slow procedure, but a former participant of this module, Adrian Fülle, found a much more elegant - and much faster - way to achieve this.

Let’s see how this works:

There’s a single chillR function, handle_gsod(), that can take care of all data retrieval steps. Since there are multiple steps involved, we have to use the function’s action parameter to tell it what to do:

10.2.1 action="list_stations"

When used with this action, handle_gsod() retrieves the station list and sorts the stations by their proximity to a set of coordinates we specify. Let’s look for stations around Bonn (latitude 50.73, longitude 7.10). Note that the location argument expects the coordinates in the order longitude, then latitude. I’ll also add the time interval we’re interested in (1990-2020), so that the function can tell us how well each station covers that period.

library(chillR)
library(kableExtra)

# retrieve the GSOD station list, sorted by distance to our target location
# (note that location is given as c(longitude, latitude))
station_list<-handle_gsod(action="list_stations",
                          location=c(7.10,50.73),
                          time_interval=c(1990,2020))

# display the result as a formatted table
kable(station_list) %>%
  kable_styling("striped", position = "left", font_size = 8)
| chillR_code | STATION.NAME | CTRY | Lat | Long | BEGIN | END | Distance | Overlap_years | Perc_interval_covered |
|---|---|---|---|---|---|---|---|---|---|
| 10517099999 | BONN/FRIESDORF(AUT) | GM | 50.700 | 7.150 | 19360102 | 19921231 | 4.86 | 3.00 | 10 |
| 10518099999 | BONN-HARDTHOEHE | GM | 50.700 | 7.033 | 19750523 | 19971223 | 5.79 | 7.98 | 26 |
| 10519099999 | BONN-ROLEBER | GM | 50.733 | 7.200 | 20010705 | 20081231 | 7.07 | 7.49 | 24 |
| 10513099999 | KOLN BONN | GM | 50.866 | 7.143 | 19310101 | 20230729 | 15.43 | 31.00 | 100 |
| 10509099999 | BUTZWEILERHOF(BAFB) | GM | 50.983 | 6.900 | 19780901 | 19950823 | 31.47 | 5.64 | 18 |
| 10502099999 | NORVENICH | GM | 50.831 | 6.658 | 19730101 | 20230729 | 33.14 | 31.00 | 100 |
| 10514099999 | MENDIG | GM | 50.366 | 7.315 | 19730102 | 19971231 | 43.26 | 8.00 | 26 |
| 10506099999 | NUERBURG-BARWEILER | GM | 50.367 | 6.867 | 19950401 | 19971231 | 43.63 | 2.75 | 9 |
| 10508099999 | BLANKENHEIM | GM | 50.450 | 6.650 | 19781002 | 19840504 | 44.56 | 0.00 | 0 |
| 10510099999 | NUERBURG | GM | 50.333 | 6.950 | 19300901 | 19921231 | 45.42 | 3.00 | 10 |
| 10515099999 | BENDORF | GM | 50.417 | 7.583 | 19310102 | 20030816 | 48.82 | 13.62 | 44 |
| 10504099999 | EIFEL | GM | 50.650 | 6.283 | 20040501 | 20040501 | 58.41 | 0.00 | 0 |
| 10526099999 | BAD MARIENBERG | GM | 50.667 | 7.967 | 19730101 | 20030816 | 61.65 | 13.62 | 44 |
| 10613099999 | BUCHEL | GM | 50.174 | 7.063 | 19730101 | 20230729 | 61.90 | 31.00 | 100 |
| 10503099999 | AACHEN/MERZBRUCK | GM | 50.817 | 6.183 | 19780901 | 19971212 | 65.40 | 7.95 | 26 |
| 10419099999 | LUDENSCHEID & | GM | 51.233 | 7.600 | 19270906 | 20030306 | 66.06 | 13.18 | 43 |
| 10400099999 | DUSSELDORF | GM | 51.289 | 6.767 | 19310102 | 20230729 | 66.43 | 31.00 | 100 |
| 10616299999 | SIEGERLAND | GM | 50.708 | 8.083 | 20040510 | 20230729 | 69.46 | 16.65 | 54 |
| 10418099999 | LUEDENSCHEID | GM | 51.250 | 7.650 | 19940301 | 19971231 | 69.55 | 3.84 | 12 |
| 10437499999 | MONCHENGLADBACH | GM | 51.230 | 6.504 | 19960715 | 20230729 | 69.61 | 24.47 | 79 |
| 10403099999 | MOENCHENGLADBACH | GM | 51.233 | 6.500 | 19381001 | 19421031 | 70.05 | 0.00 | 0 |
| 10501099999 | AACHEN | GM | 50.783 | 6.100 | 19280101 | 20030816 | 70.81 | 13.62 | 44 |
| 6496099999 | ELSENBORN (MIL) | BE | 50.467 | 6.183 | 19840501 | 20230729 | 71.21 | 31.00 | 100 |
| 10409099999 | ESSEN/MUELHEIM | GM | 51.400 | 6.967 | 19300414 | 19431231 | 75.12 | 0.00 | 0 |
| 10410099999 | ESSEN/MULHEIM | GM | 51.400 | 6.967 | 19310101 | 20220408 | 75.12 | 31.00 | 100 |

This list contains the 25 stations closest to the location we entered, ordered by their distance (in km) to the target coordinates, which is shown in the Distance column. The Overlap_years column shows the number of years of overlap between the station’s record and our target interval, and the Perc_interval_covered column shows the percentage of the target interval that this corresponds to. Note that these numbers are only based on the BEGIN and END dates in the table - it’s quite possible (and usually the case) that the dataset contains gaps, which can sometimes cover almost the entire record.
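
To quickly narrow this list down to promising candidates, we can filter it by coverage. Here’s a small sketch (the 90% threshold is an arbitrary choice of mine):

# keep only stations whose BEGIN/END dates cover at least 90% of the target interval
promising_stations<-subset(station_list, Perc_interval_covered >= 90)
promising_stations[,c("chillR_code","STATION.NAME","Distance")]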

10.2.2 action="download_weather"

When used with this action, the handle_gsod() function downloads the weather data for a particular station, identified by its station-specific chillR_code (shown in the respective column of the table above). Rather than typing the code manually, we can refer to it in the station_list. Let’s download the data for the 4th entry in the list (KOLN BONN), which, judging by its BEGIN and END dates, covers our entire period of interest.

# download daily weather data for the 4th station in the list (KOLN BONN)
weather<-handle_gsod(action="download_weather",
                     location=station_list$chillR_code[4],
                     time_interval=c(1990,2020))

The result of this operation is a list containing the downloaded records. The element we care about - the actual dataset for our station - can be accessed as weather[[1]], which we can see here:

weather[[1]][1:20,]
| X | DATE | Date | Year | Month | Day | Tmin | Tmax | Tmean | Prec | YEARMODA | Tmin_source | Tmax_source | no_Tmin | no_Tmax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1990-01-01 12:00:00 | 1990-01-01 12:00:00 | 1990 | 1 | 1 | -1.000 | 1.000 | 0.000 | 0.000 | 19900101 | original | original | FALSE | FALSE |
| 2 | 1990-01-02 12:00:00 | 1990-01-02 12:00:00 | 1990 | 1 | 2 | 0.000 | 2.000 | 1.000 | 0.000 | 19900102 | original | original | FALSE | FALSE |
| 3 | 1990-01-03 12:00:00 | 1990-01-03 12:00:00 | 1990 | 1 | 3 | -0.389 | 2.000 | 0.722 | 0.000 | 19900103 | original | original | FALSE | FALSE |
| 4 | 1990-01-04 12:00:00 | 1990-01-04 12:00:00 | 1990 | 1 | 4 | -1.111 | 2.000 | -0.056 | 0.000 | 19900104 | original | original | FALSE | FALSE |
| 5 | 1990-01-05 12:00:00 | 1990-01-05 12:00:00 | 1990 | 1 | 5 | -1.111 | 3.111 | 1.556 | 0.000 | 19900105 | original | original | FALSE | FALSE |
| 6 | 1990-01-06 12:00:00 | 1990-01-06 12:00:00 | 1990 | 1 | 6 | 0.000 | 2.389 | 1.333 | 0.000 | 19900106 | original | original | FALSE | FALSE |
| 7 | 1990-01-07 12:00:00 | 1990-01-07 12:00:00 | 1990 | 1 | 7 | -0.111 | 4.278 | 1.056 | 0.000 | 19900107 | original | original | FALSE | FALSE |
| 8 | 1990-01-08 12:00:00 | 1990-01-08 12:00:00 | 1990 | 1 | 8 | -0.111 | 7.000 | 3.278 | 0.000 | 19900108 | original | original | FALSE | FALSE |
| 9 | 1990-01-09 12:00:00 | 1990-01-09 12:00:00 | 1990 | 1 | 9 | 3.778 | 8.000 | 5.333 | 0.508 | 19900109 | original | original | FALSE | FALSE |
| 10 | 1990-01-10 12:00:00 | 1990-01-10 12:00:00 | 1990 | 1 | 10 | 3.000 | 6.000 | 4.556 | 1.016 | 19900110 | original | original | FALSE | FALSE |
| 11 | 1990-01-11 12:00:00 | 1990-01-11 12:00:00 | 1990 | 1 | 11 | 3.278 | 7.000 | 5.167 | 0.254 | 19900111 | original | original | FALSE | FALSE |
| 12 | 1990-01-12 12:00:00 | 1990-01-12 12:00:00 | 1990 | 1 | 12 | -1.000 | 5.222 | 1.778 | 0.000 | 19900112 | original | original | FALSE | FALSE |
| 13 | 1990-01-13 12:00:00 | 1990-01-13 12:00:00 | 1990 | 1 | 13 | -1.278 | 4.000 | 1.389 | 0.000 | 19900113 | original | original | FALSE | FALSE |
| 14 | 1990-01-14 12:00:00 | 1990-01-14 12:00:00 | 1990 | 1 | 14 | -0.222 | 5.000 | 3.167 | 0.000 | 19900114 | original | original | FALSE | FALSE |
| 15 | 1990-01-15 12:00:00 | 1990-01-15 12:00:00 | 1990 | 1 | 15 | 0.889 | 9.000 | 4.556 | 1.016 | 19900115 | original | original | FALSE | FALSE |
| 16 | 1990-01-16 12:00:00 | 1990-01-16 12:00:00 | 1990 | 1 | 16 | 6.222 | 11.000 | 9.944 | 0.000 | 19900116 | original | original | FALSE | FALSE |
| 17 | 1990-01-17 12:00:00 | 1990-01-17 12:00:00 | 1990 | 1 | 17 | 1.000 | 11.000 | 8.500 | 0.000 | 19900117 | original | original | FALSE | FALSE |
| 18 | 1990-01-18 12:00:00 | 1990-01-18 12:00:00 | 1990 | 1 | 18 | -1.000 | 7.000 | 2.722 | 0.254 | 19900118 | original | original | FALSE | FALSE |
| 19 | 1990-01-19 12:00:00 | 1990-01-19 12:00:00 | 1990 | 1 | 19 | 2.000 | 7.111 | 4.611 | 0.000 | 19900119 | original | original | FALSE | FALSE |
| 20 | 1990-01-20 12:00:00 | 1990-01-20 12:00:00 | 1990 | 1 | 20 | 4.000 | 8.500 | 6.056 | 2.286 | 19900120 | original | original | FALSE | FALSE |

This still looks pretty complicated, and it contains a lot of information we don’t need. chillR therefore offers a way to simplify this record (using the same handle_gsod() function, as shown below). Note, however, that this removes a lot of variables you may be interested in. More importantly, it also removes quality flags, which may indicate that particular records aren’t reliable. I’ve generously ignored this so far, but there’s room for improvement here.

10.2.3 downloaded weather as action argument

This way of calling handle_gsod() - with the downloaded weather object supplied as the action argument - cleans the dataset and converts it into a format that chillR can easily handle:

# pass the downloaded object back to handle_gsod() to clean it and
# convert it into chillR's standard daily format
cleaned_weather<-handle_gsod(weather)
cleaned_weather[[1]][1:20,]
| Date | Year | Month | Day | Tmin | Tmax | Tmean | Prec |
|---|---|---|---|---|---|---|---|
| 1990-01-01 12:00:00 | 1990 | 1 | 1 | -1.000 | 1.000 | 0.000 | 0.000 |
| 1990-01-02 12:00:00 | 1990 | 1 | 2 | 0.000 | 2.000 | 1.000 | 0.000 |
| 1990-01-03 12:00:00 | 1990 | 1 | 3 | -0.389 | 2.000 | 0.722 | 0.000 |
| 1990-01-04 12:00:00 | 1990 | 1 | 4 | -1.111 | 2.000 | -0.056 | 0.000 |
| 1990-01-05 12:00:00 | 1990 | 1 | 5 | -1.111 | 3.111 | 1.556 | 0.000 |
| 1990-01-06 12:00:00 | 1990 | 1 | 6 | 0.000 | 2.389 | 1.333 | 0.000 |
| 1990-01-07 12:00:00 | 1990 | 1 | 7 | -0.111 | 4.278 | 1.056 | 0.000 |
| 1990-01-08 12:00:00 | 1990 | 1 | 8 | -0.111 | 7.000 | 3.278 | 0.000 |
| 1990-01-09 12:00:00 | 1990 | 1 | 9 | 3.778 | 8.000 | 5.333 | 0.508 |
| 1990-01-10 12:00:00 | 1990 | 1 | 10 | 3.000 | 6.000 | 4.556 | 1.016 |
| 1990-01-11 12:00:00 | 1990 | 1 | 11 | 3.278 | 7.000 | 5.167 | 0.254 |
| 1990-01-12 12:00:00 | 1990 | 1 | 12 | -1.000 | 5.222 | 1.778 | 0.000 |
| 1990-01-13 12:00:00 | 1990 | 1 | 13 | -1.278 | 4.000 | 1.389 | 0.000 |
| 1990-01-14 12:00:00 | 1990 | 1 | 14 | -0.222 | 5.000 | 3.167 | 0.000 |
| 1990-01-15 12:00:00 | 1990 | 1 | 15 | 0.889 | 9.000 | 4.556 | 1.016 |
| 1990-01-16 12:00:00 | 1990 | 1 | 16 | 6.222 | 11.000 | 9.944 | 0.000 |
| 1990-01-17 12:00:00 | 1990 | 1 | 17 | 1.000 | 11.000 | 8.500 | 0.000 |
| 1990-01-18 12:00:00 | 1990 | 1 | 18 | -1.000 | 7.000 | 2.722 | 0.254 |
| 1990-01-19 12:00:00 | 1990 | 1 | 19 | 2.000 | 7.111 | 4.611 | 0.000 |
| 1990-01-20 12:00:00 | 1990 | 1 | 20 | 4.000 | 8.500 | 6.056 | 2.286 |
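
If you’re curious which variables the cleaning step discarded, you can compare the column names of the raw and the cleaned dataset (a quick check, not part of the standard workflow):

# columns present in the raw download but dropped from the cleaned dataset
setdiff(colnames(weather[[1]]),colnames(cleaned_weather[[1]]))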

Note that many of the strange-looking numbers in these records arise because the original database stores temperatures in degrees Fahrenheit, so they had to be converted to degrees Celsius. That often produces ugly numbers, but the conversion itself is simple:

\(Temperature[°C]=(Temperature[°F]-32)\cdot\frac{5}{9}\)
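
As a quick plausibility check (just an illustration), we can reproduce one of those ugly numbers: a recorded minimum of 30 °F corresponds to the -1.111 °C we see in the table above.

# (30 °F - 32) * 5/9 = -1.111 °C
(30-32)*5/9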

We now have a temperature record in a format that we can easily work with in chillR.
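
Before we move on, it’s worth getting a rough idea of how complete this record actually is. Here’s a quick sketch using base R (not part of the standard chillR workflow):

# how many calendar days of the 1990-2020 interval are missing from the data frame?
length(seq(as.Date("1990-01-01"),as.Date("2020-12-31"),by="day"))-
  nrow(cleaned_weather[[1]])

# how many rows lack a Tmin or Tmax value?
sum(is.na(cleaned_weather[[1]]$Tmin))
sum(is.na(cleaned_weather[[1]]$Tmax))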

Upon closer inspection, however, you’ll notice that this dataset has pretty substantial gaps, including several entire years of missing data. How can we deal with this? Let’s find out in the lesson on Filling gaps in temperature records.

Note that chillR has a pretty similar function, handle_cimis(), to download data from the California Irrigation Management Information System (CIMIS).
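
Assuming the call pattern mirrors handle_gsod() - treat this as a sketch rather than a tested recipe - listing nearby CIMIS stations would look something like this:

# look for CIMIS stations around Davis, California (longitude first, then latitude)
cimis_stations<-handle_cimis(action="list_stations",
                             location=c(-121.74,38.54),
                             time_interval=c(1990,2020))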

There’s surely room for improvement here. There’s a lot more data out there that chillR could have a download function for.

Now let’s save the files we generated here, so that we can use them in the upcoming chapters:

write.csv(station_list,"data/station_list.csv",row.names=FALSE)
write.csv(weather[[1]],"data/Bonn_raw_weather.csv",row.names=FALSE)
write.csv(cleaned_weather[[1]],"data/Bonn_chillR_weather.csv",row.names=FALSE)

Exercises on getting temperature data

Please document all results of the following assignments in your learning logbook.

  1. Choose a location of interest and find the 25 closest weather stations using the handle_gsod function
  2. Download weather data for the most promising station on the list
  3. Convert the weather data into chillR format