Lukas Püttmann    About    Research    Blog

All the news that's fit to parse

I’ve uploaded a new working paper called “Patterns of Panic: Financial Crisis Language in Historical Newspapers”, available here. In the paper, I’m analyzing titles of these five major US newspapers since the 19th century to construct a new indicator of financial stress:

Newspaper             Since   Titles
Chicago Tribune       1853    9.0m
Boston Globe          1872    6.7m
Washington Post       1877    7.7m
Los Angeles Times     1881    7.9m
Wall Street Journal   1889    3.9m

For the paper, I looked at a whole lot of trends in newspaper language. It led to some cool figures that don’t fit in the paper, so I’m putting them here instead. The figures show the number of titles per quarter that contain a given word, averaged across the five newspapers.
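As a rough illustration of how such a series could be computed, here is a minimal Python sketch. The toy titles, the function name `titles_per_quarter` and the whole-word, case-insensitive matching are all assumptions made for this example and not necessarily what the paper does.

```python
import re
from collections import defaultdict
from datetime import date

# Hypothetical toy data: (newspaper, publication date, title)
titles = [
    ("Chicago Tribune", date(1918, 11, 12), "War ends as armistice is signed"),
    ("Chicago Tribune", date(1918, 11, 13), "City celebrates peace"),
    ("Boston Globe", date(1918, 11, 12), "The war is over"),
    ("Boston Globe", date(1918, 12, 1), "Warm weather expected"),
    ("Chicago Tribune", date(1919, 1, 5), "Peace conference opens in Paris"),
]


def quarter(d):
    """Map a date to a (year, quarter) pair."""
    return (d.year, (d.month - 1) // 3 + 1)


def titles_per_quarter(titles, word):
    """Number of titles per quarter containing `word`, averaged across newspapers.

    Assumption: whole-word, case-insensitive matching. Newspapers without a
    match in a quarter count as zero; quarters without any match are omitted.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    counts = defaultdict(int)  # quarter -> matching titles, summed over newspapers
    papers = set()
    for paper, d, title in titles:
        papers.add(paper)
        if pattern.search(title):
            counts[quarter(d)] += 1
    return {q: n / len(papers) for q, n in counts.items()}


print(titles_per_quarter(titles, "war"))
print(titles_per_quarter(titles, "peace"))
```

The word boundary in the pattern keeps “Warm weather expected” from counting as a “war” title.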

Here is the figure for the words “war” and “peace” (the gray bars are the world wars):

War and peace in newspaper titles

The “war” series jumps, unsurprisingly, around the major wars. Besides the world wars, the American Civil War (starting 1861) stands out. The series reaches a new high in the first quarter of 1863 on Lincoln’s signing of the Emancipation Proclamation. The usage of “peace” has a correlation of 0.51 with “war”.

The next figure shows the occurrences of the names of U.S. presidents:

Presidents in newspaper titles

“lincoln” spikes first in the third quarter of 1858 during the Lincoln–Douglas debates, then again during the presidential election of 1860 and finally when the Confederate States surrendered in 1865. For “roosevelt” and “bush”, two presidents shared a surname. The name “kennedy” appears during John F. Kennedy’s presidency and then spikes again when Ted Kennedy tried to run for president in the first quarter of 1980.

And, staying within my home country:

German chancellors in newspaper titles

The fake Hitler diaries were published by the magazine Stern in 1983, so that probably explains that spike. The 1912 spike in “Brandt” might be due to the founding of the German company of the same name in that year.

The next figure shows the occurrence of “earthquake(s)” in titles.

Earthquake in newspaper titles

Large spikes occur around the following earthquakes:

  • Djijelli (1856)
  • Hayward (1868)
  • Charleston (1886)
  • San Francisco (1906)
  • Kantō (1923)
  • San Fernando (1971)
  • Loma Prieta (1989)
  • Sichuan (2008)
  • Haiti (2010)
  • Nepal (2015)

In 1994, the Northridge earthquake devastated Los Angeles. The big spike in this series is driven by reporting in the Los Angeles Times in which 501 (out of 27,669) titles in the first quarter of 1994 contained the term “earthquake(s)”.

Here are the terms “soviet” and “russia”:

Russia and soviet in newspaper titles

The series for “russia” spikes during the Russo-Japanese War and at the beginning and end of World War I. During the Russian Revolution, the word “soviet” appears for the first time. The two words move together for three decades, but their paths diverge after 1960. The Cold War pitted the West against the Soviet empire and the importance of stand-alone “russia” declined. This changed in the autumn of 1991 when the Soviet Union dissolved and nation states, including Russia, took its place.

Check out the paper, if you like. I’d be grateful for any comments you might have.

Holes, kinks and corners

As Pudney (1989) has observed, microdata exhibit “holes, kinks and corners.” The holes correspond to nonparticipation in the activity of interest, kinks correspond to the switching behavior, and corners to the incidence of nonconsumption or nonparticipation at specific points of time.

This is from “Microeconometrics”, by Colin Cameron and Pravin Trivedi and refers here.

Collected links

  1. Fred: “Of places and patents”. And this.
  2. Language Log: “Replicate vs. reproduce”
  3. A nice quantile regression use case
  4. The Guardian on the Oxford English Dictionary
  5. Geopolitical Hedging as a Service
  6. “How Making Something Better Can Make Everything Worse”, which links here
  7. Cool NYT figure
  8. stop() / return() Early for Shorter Code
  9. Martin Thoma:

    Example 2: This one is actually confusing. Likely also with context.

    EN: Professors say, students are doing well.
    DE: Professoren sagen, Studenten haben es gut.
    DE: Professoren, sagen Studenten, haben es gut.
    EN: Professors, students say, are doing well.

  10. Jason Collins: “Angela Duckworth’s ‘Grit: The Power of Passion and Perseverance’”

Term spreads and business cycles 3: The view from history

In the first installment in this series, I documented that term spreads tend to fall before recessions. In the second part I looked at monetary policy and showed that it’s the endogenous reaction of monetary policy that investors predict. I worked with post-war data for the US in both parts, but we can extend this analysis across countries and use longer time series.

Start a new analysis and load the packages used below: tidyverse, haven, ggthemes, scales, plm and stargazer.


The dataset by Jordà, Schularick and Taylor (2017) is great and provides annual macroeconomic and financial variables for 17 countries since 1870. We can use the Stata dataset from their website like this:

# Download dataset
mh <- read_dta("")

# Extract labels
lbls <- tibble(var = names(mh), 
               label = vapply(mh, attr, FUN.VALUE = "character", 
                              "label", USE.NAMES = FALSE))

# Make tidy "long" dataset and merge with label names
mh <- mh %>% 
  gather(var, value, -year, -country, -iso, -ifs) %>% 
  left_join(lbls, by = "var") %>% 
  select(year, country, iso, ifs, var, label, value) %>% 
  mutate(value = ifelse(is.nan(value), NA, value))

Check out the dataset:


# A tibble: 61,200 x 7
#     year country   iso     ifs var   label      value
#    <dbl> <chr>     <chr> <dbl> <chr> <chr>      <dbl>
#  1 1870. Australia AUS    193. pop   Population 1775.
#  2 1871. Australia AUS    193. pop   Population 1675.
#  3 1872. Australia AUS    193. pop   Population 1722.
#  4 1873. Australia AUS    193. pop   Population 1769.
#  5 1874. Australia AUS    193. pop   Population 1822.
#  6 1875. Australia AUS    193. pop   Population 1874.
#  7 1876. Australia AUS    193. pop   Population 1929.
#  8 1877. Australia AUS    193. pop   Population 1995.
#  9 1878. Australia AUS    193. pop   Population 2062.
# 10 1879. Australia AUS    193. pop   Population 2127.
# ... with 61,190 more rows

Importing Stata datasets with the haven package is pretty neat. In the RStudio environment, the columns even display the original Stata labels.

There are two interest rate series in the data and the documentation explains that most short-term rates are a mix of money market rates, bank lending rates and government bonds. Long-term rates are mostly government bonds.

Homer and Sylla (2005) explain why we usually study safe rates:

The method of using minimum rates to determine interest rate trends is informative. Today the use of “prime rates” and AAA averages is customary to indicate interest rate trends. There is a very large range of rates higher than minimum rates at all times, and there is no top limit except legal maxima. Averages of rates, if they did exist, might be merely averages of good credits with bad credits. The lowest regularly reported rates, excluding eccentric rates, comprise a practical limit comparable over time. Minimum rates will not show us where most funds were lending, but they should provide a fair index number for measuring long-term interest rate trends. (p.140)


The level of interest rates is a more complex concept than the trend of interest rates. (p.555)

So let’s plot those interest rates:

mh %>% 
  filter(var %in% c("stir", "ltrate")) %>% 
  ggplot(aes(year, value / 100, color = var)) +
  geom_hline(yintercept = 0, size = 0.3, color = "grey80") +
  geom_line() +
  facet_wrap(~country, scales = "free_y", ncol = 3) +
  theme_tufte(base_family = "Helvetica") +
  scale_y_continuous(labels=percent) +
  labs(y = "Interest rate", 
       title = "Long-term and short-term interest rates over the long run",
       subtitle = "1870-2013, nominal rates in percent per year.",
       caption = "Source:",
       color = "Interest rate:") +
  theme(legend.position = c(0.85, 0.05)) +
  scale_color_manual(labels = c("long-term", "short-term"), 
                     values = c("#ef8a62", "#67a9cf"))

Which gets us:

Long-term and short-term interest rates for 17 countries since 1870

Interest rates were high everywhere during the 1980s when inflation ran high. Rates were quite low in the 19th century. There are also some interesting movements in the 1930s.

Next, select GDP and interest rates, calculate the term spread and lag it:

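df <- mh %>% 
  select(-label) %>% 
  filter(var %in% c("rgdpmad", "stir", "ltrate")) %>% 
  spread(var, value) %>% 
  mutate(trm_spr = ltrate - stir) %>% 
  arrange(country, year) %>% 
  # Lag within each country, so that the first year of one country does
  # not pick up the last observation of the previous country
  group_by(country) %>% 
  mutate(trm_spr_l = dplyr::lag(trm_spr, 1),
         gr_real = 100*(rgdpmad - dplyr::lag(rgdpmad, 1)) / dplyr::lag(rgdpmad, 1)) %>% 
  ungroup()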
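One detail worth spelling out is that the lag should be taken within each country, so that a country’s first observation has no lagged value. The following is a minimal Python sketch of that logic on hypothetical round numbers; the function name `add_lagged_spread` is made up for this illustration, and unlike a plain positional lag it also checks that the previous observation really is from the year before.

```python
def add_lagged_spread(rows):
    """Add the term spread and its one-year lag, computed within each country.

    rows: list of dicts with keys "country", "year", "ltrate" and "stir".
    """
    out = []
    previous = {}  # country -> (year, spread) of that country's last observation
    for r in sorted(rows, key=lambda r: (r["country"], r["year"])):
        spread = r["ltrate"] - r["stir"]
        last = previous.get(r["country"])
        # Only use the previous observation if it is exactly one year earlier
        lagged = last[1] if last is not None and last[0] == r["year"] - 1 else None
        out.append({**r, "trm_spr": spread, "trm_spr_l": lagged})
        previous[r["country"]] = (r["year"], spread)
    return out


# Hypothetical toy data, not taken from the dataset
rows = [
    {"country": "Australia", "year": 1870, "ltrate": 5.0, "stir": 4.5},
    {"country": "Australia", "year": 1871, "ltrate": 5.0, "stir": 4.0},
    {"country": "Belgium", "year": 1870, "ltrate": 4.0, "stir": 3.0},
]

for r in add_lagged_spread(rows):
    print(r["country"], r["year"], r["trm_spr"], r["trm_spr_l"])
```

The first Belgian observation gets no lagged spread rather than inheriting the last Australian one.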

Add a column of 1, 2, … T for numerical dates:

df <- df %>% 
  left_join(tibble(year = min(df$year):max(df$year), 
                   numdate = 1:length(min(df$year):max(df$year))))

Check for extreme GDP events (to potentially exclude them):

df <- df %>%
  # Flag years in which real GDP grew or shrank by more than 15 percent
  mutate(outlier_gr_real = abs(gr_real) > 15)

Print the data:


# A tibble: 2,448 x 12
#     year country iso     ifs ltrate rgdpmad  stir trm_spr trm_spr_l
#    <dbl> <chr>   <chr> <dbl>  <dbl>   <dbl> <dbl>   <dbl>     <dbl>
#  1 1870. Austra… AUS    193.   4.91   3273.  4.88  0.0318   NA     
#  2 1871. Austra… AUS    193.   4.84   3299.  4.60  0.245     0.0318
#  3 1872. Austra… AUS    193.   4.74   3553.  4.60  0.137     0.245 
#  4 1873. Austra… AUS    193.   4.67   3824.  4.40  0.272     0.137 
#  5 1874. Austra… AUS    193.   4.65   3835.  4.50  0.153     0.272 
#  6 1875. Austra… AUS    193.   4.51   4138.  4.60 -0.0927    0.153 
#  7 1876. Austra… AUS    193.   4.57   4007.  4.60 -0.0341   -0.0927
#  8 1877. Austra… AUS    193.   4.39   4036.  4.50 -0.111    -0.0341
#  9 1878. Austra… AUS    193.   4.44   4277.  4.80 -0.357    -0.111 
# 10 1879. Austra… AUS    193.   4.60   4205.  4.90 -0.297    -0.357 
# ... with 2,438 more rows, and 3 more variables: gr_real <dbl>,
#   numdate <int>, outlier_gr_real <lgl>

Scatter the one-year-before term spread against subsequent real GDP growth:

df %>% 
  filter(!outlier_gr_real) %>% 
  ggplot(aes(gr_real / 100, trm_spr_l / 100)) +
  geom_hline(yintercept = 0, size = 0.3, color = "grey50") +
  geom_vline(xintercept = 0, size = 0.3, color = "grey50") +
  geom_smooth(method = "lm", size = 0.3, color = "#b2182b", fill = "#fddbc7") +
  geom_jitter(aes(color = numdate), size = 1, alpha = 0.6, stroke = 0,
              show.legend = FALSE) +
  facet_wrap(~country, scales = "free", ncol = 3) +
  theme_tufte(base_family = "Helvetica") +
  scale_y_continuous(labels=percent) +
  scale_x_continuous(labels=percent) +
  labs(title = "Lagged term spread vs. real GDP growth",
       subtitle = "1870-2013, annually.",
       caption = "Source:",
       x = "Real GDP growth",
       y = "Term spread")

Which creates:

Lagged term spread against real GDP growth

Lighter shades of blue in the markers signal later dates (ggplot2’s default continuous color scale runs from dark to light blue as values increase). The clouds are quite mixed, so correlations don’t seem to just portray time trends in the variables.

The relationship doesn’t look as clearly positive as in our previous analysis. Let’s dig in further using a panel regression. I estimate two models where the second excludes outliers as defined above. I also control for time and country fixed effects.

r1 <- plm(gr_real ~ trm_spr_l, 
          data = df, 
          index = c("country", "year"), 
          model = "within", 
          effect = 'twoways')

r2 <- df %>% 
  filter(!outlier_gr_real) %>% 
  plm(gr_real ~ trm_spr_l, 
      data = ., 
      index = c("country", "year"), 
      model = "within", 
      effect = 'twoways')

stargazer(r1, r2, type = "html")

This produces:

                       Dependent variable: Real GDP growth
                       (1)                      (2)
Term spread (lagged)   0.108*                   0.120**
Adjusted R2            -0.075                   -0.074
F Statistic            3.474* (df = 1; 2097)    6.627** (df = 1; 2055)
Note: *p<0.1; **p<0.05; ***p<0.01

So with this dataset, too, we find that lower term spreads tend to be followed by lower real GDP growth.
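To make “controlling for time and country fixed effects” concrete: for a balanced panel, the two-way within estimator subtracts country means and year means from each variable (adding back the grand mean) and then runs a regression through the origin on the transformed data. Below is a minimal Python sketch under that balanced-panel assumption. The function name `twoway_within_slope` and the toy numbers are made up for illustration, and this is not how plm handles the unbalanced panel above.

```python
def twoway_within_slope(panel):
    """Slope of y on x after removing unit and period fixed effects.

    panel: dict mapping (unit, period) -> (x, y). Assumes a balanced panel,
    i.e. every unit is observed in every period.
    """
    units = sorted({u for u, _ in panel})
    periods = sorted({t for _, t in panel})

    def demean(idx):
        unit_mean = {u: sum(panel[(u, t)][idx] for t in periods) / len(periods)
                     for u in units}
        period_mean = {t: sum(panel[(u, t)][idx] for u in units) / len(units)
                       for t in periods}
        grand_mean = sum(unit_mean.values()) / len(units)
        return {(u, t): panel[(u, t)][idx] - unit_mean[u] - period_mean[t] + grand_mean
                for (u, t) in panel}

    x, y = demean(0), demean(1)
    return sum(x[k] * y[k] for k in panel) / sum(x[k] ** 2 for k in panel)


# Toy data built as y = unit effect + period effect + 0.5 * x (no noise)
alpha = {"A": 1.0, "B": -2.0, "C": 0.5}
gamma = {1: 0.0, 2: 3.0, 3: -1.0}
xs = {"A": [1, 2, 4], "B": [0, 3, 1], "C": [2, 2, 5]}
panel = {(u, t): (xs[u][t - 1], alpha[u] + gamma[t] + 0.5 * xs[u][t - 1])
         for u in xs for t in gamma}

slope = twoway_within_slope(panel)
print(round(slope, 6))
```

Because the toy data contain no noise, the estimate recovers the slope of 0.5 up to floating-point error, whatever the unit and period effects are.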


Homer, S. and R. Sylla (2005). A History of Interest Rates, Fourth Edition. Wiley Finance.

Jordà, Ò., M. Schularick and A. M. Taylor (2017). “Macrofinancial History and the New Business Cycle Facts”. NBER Macroeconomics Annual 2016, volume 31, edited by Martin Eichenbaum and Jonathan A. Parker. Chicago: University of Chicago Press. (link)

Models and surprise

Hadley Wickham is a statistician and programmer and the creator of popular R packages such as ggplot2 and dplyr. His status in the R community has risen to such mythical levels that the set of packages he created was called the hadleyverse (since renamed the tidyverse).

In a talk, he describes what he considers a sensible workflow and explains the following dichotomy between data visualization and quantitative modeling:

But visualization fundamentally is a human activity. This is making the most of your brain. In visualization, this is both a strength and a weakness […]. You can see something in a visualization that you did not expect and no computer program could have told you about. But because a human is involved in the loop, visualization fundamentally does not scale.

And so to me the complementary tool to visualization is modeling. I think of modeling very broadly. This is data mining, this is machine learning, this is statistical modeling. But basically, whenever you’ve made a question sufficiently precise, you can answer it with some numerical summary, summarize it with some algorithm, I think of this as a model. And models are fundamentally computational tools which means that they can scale much, much better. As your data gets bigger and bigger and bigger, you can keep up with that by using more sophisticated computation or simply just more computation.

But every model makes assumptions about the world and a model – by its very nature – cannot question those assumptions. So that means: on some fundamental level, a model cannot surprise you.

That definition excludes many economic models. I think of the insights of models such as Akerlof’s Lemons and Peaches, Schelling’s segregation model or the “true and non-trivial” theory of comparative advantage as surprising.

Term spreads and business cycles 2: The role of monetary policy

In the first part of this series I showed that term spreads can be used to predict real GDP about a year out. This pattern comes about because investors expect the central bank to lower short-term interest rates.

But we don’t know what’s causing what. Is the central bank driving business cycles or is it just responding to a change in the economic environment?

This matters for how we interpret the pattern we found. Investors could either have expectations about the business cycle or about arbitrary decisions by the central bank.

The central bank’s main tool is changing the interest rate at which banks lend to each other, the federal funds rate. In this post, I will look at how the Fed Funds rate comoves with the term spread and how the unexpected component in that rate (the “shock”) is related to it.


First, run all the code from the previous post.

Get the Fed Funds rate and calculate how it changes between this month and the same month next year:

fd <- fred$series.observations(series_id = "FEDFUNDS") %>%
  mutate(date = as.yearmon(date)) %>% 
  select(date, ff = value) %>% 
  mutate(ff = as.numeric(ff)) %>% 
  full_join(fd) %>% 
  mutate(ff_ch = dplyr::lead(ff, 12) - ff)

Make the same scatterplot as before:

fd %>% 
  filter(date <= 2008) %>% 
  ggplot(., aes(ff_ch, trm_spr)) +
  geom_hline(yintercept = 0, size = 0.3, color = "grey50") +
  geom_vline(xintercept = 0, size = 0.3, color = "grey50") +
  geom_point(alpha = 0.7, stroke = 0, size = 3) +
  geom_smooth(method = "lm", size = 0.2, color = "#ef8a62", fill = "#fddbc7") +
  theme_tufte(base_family = "Helvetica") +
  labs(title = "Term Spread 1 Year Earlier and Change in Fed Funds Rate",
       subtitle = "1953-2017, monthly.",
       x = "Change in Fed Funds (compared to 12 months ago)",
       y = "Term spread (lagged 1 year)")

Which gets:

Term spread against changes in the Fed Funds rate

So the pattern is still there. Term spreads drop a year before the Fed Funds rate falls.


Identifying plausibly exogenous variation in monetary policy is the gold standard of monetary economics. A host of other ways have been proposed, but basically every course on empirical macroeconomics starts with the shock series by Romer and Romer (2004).1 This paper filters out the endogenous response of monetary policy to movements in other economic variables using a regression of the fed funds rate on variables that are important for the central bank’s decision, such as GDP, inflation and the unemployment rate.

I won’t reproduce their analysis here, but just take their shock series from the journal page. For this, we also need the readxl package to read Excel data.


The following code goes to the AER website, downloads the files into a temporary folder (so we don’t have to manually delete them again), unzips them and extracts the relevant part:

td <- tempdir() 
tf <- tempfile(tmpdir=td, fileext=".zip") 
download.file("", tf) 
unzip(tf, files="RomerandRomerDataAppendix.xls", exdir=td, overwrite=TRUE) 
fpath <- file.path(td, "RomerandRomerDataAppendix.xls")
rr <- read_excel(fpath, "DATA BY MONTH") 

Plot the shock series:

ggplot(rr, aes(DATE, RESID)) +
  geom_line() +
  theme_minimal() +
  labs(title = "Romer and Romer (2004) monetary shock",
       subtitle = "1966-1996, monthly.",
       x = "month",
       y = "Romer-Romer shock")

Romer-Romer monetary policy shocks

Merge the rr dataframe with our previous fd dataset:

fd <- fd %>% 
  left_join(rr %>% 
              gather(var, val, -DATE) %>% 
              mutate(val = ifelse(val == "NA", NA, val),
                     val = as.numeric(val),
                     DATE = as.yearmon(DATE)) %>% 
              filter(var == "RESID") %>% 
              spread(var, val) %>% 
              select(date = DATE, shock = RESID))

Make the plot:

ggplot(fd, aes(shock, trm_spr)) +
  geom_hline(yintercept = 0, size = 0.3, color = "grey50") +
  geom_vline(xintercept = 0, size = 0.3, color = "grey50") +
  geom_point(alpha = 0.7, stroke = 0, size = 3) +
  geom_smooth(method = "lm", size = 0.2, color = "#ef8a62", fill = "#fddbc7") +
  theme_tufte(base_family = "Helvetica") +
  labs(title = "Term Spread 1 Year Earlier and Romer-Romer Monetary Policy Shock",
       subtitle = "1966-1996, monthly.",
       x = "Romer-Romer shock",
       y = "Term spread (lagged 1 year)")

Which creates:

Term spread against Romer-Romer monetary policy shocks

Now the pattern is gone.

What I’m learning from this is that term spreads are informative about the endogenous component of monetary policy. Investors have sensible expectations about when the central bank will lower interest rates due to slowing economic activity.


Bernanke, B., J. Boivin and P. Eliasz (2005). “Measuring the Effects of Monetary Policy: A Factor-Augmented Vector Autoregressive (FAVAR) Approach”, Quarterly Journal of Economics. (link)

Christiano, L., M. Eichenbaum and C. Evans (1996). “The Effects of Monetary Policy Shocks: Some Evidence from the Flow of Funds”, Review of Economics and Statistics. (link)

Nakamura, E. and J. Steinsson (2018). “High-Frequency Identification of Monetary Non-Neutrality: The Information Effect”, Quarterly Journal of Economics. (link)

Romer, C. D. and D. H. Romer (2004). “A New Measure of Monetary Shocks: Derivation and Implications”, American Economic Review. (link)

Uhlig, H. (2005). “What Are the Effects of Monetary Policy on Output? Results from an Agnostic Identification Procedure”, Journal of Monetary Economics. (link)

  1. Some other approaches are: orderings in a vector autoregression (VAR), sign restrictions, high-frequency identification and factor-augmented VARs.

Cosine similarity

A typical problem when analyzing large amounts of text is trying to measure the similarity of documents. An established measure for this is cosine similarity.


It’s the cosine of the angle between two vectors. In general it ranges from -1 to 1. For word-count vectors, whose entries cannot be negative, it lies between 0 and 1: it is 1 if the two vectors are parallel and 0 if they are perpendicular to each other.

Say you have two documents \(A\) and \(B\). Write these documents as vectors \(\boldsymbol{x} = (x_{1}, x_{2}, ..., x_{n})'\), where \(n\) is the length of the pooled dictionary of all words that show up in either document. An entry \(x_i\) is the number of occurrences of a particular word in a document. Cosine similarity is then (Manning et al. 2008):

\[\begin{align} \text{sim}(\boldsymbol{x_A}, \boldsymbol{x_B}) &= \frac{\boldsymbol{x_A}' \cdot \boldsymbol{x_B}}{\lVert \boldsymbol{x_A} \rVert \cdot \lVert \boldsymbol{x_B} \rVert} \nonumber \\ &= \frac{\sum_{i=1}^n x_{i,A} \, x_{i,B}}{\sqrt{\sum_{i=1}^n x_{i,A}^2} \cdot \sqrt{\sum_{i=1}^n x_{i,B}^2}} \label{cosine_sim} \end{align}\]

Given that entries cannot be negative, cosine similarity is never negative either. The denominator normalizes for document length, so values are bounded between 0 and 1.
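To make the step from documents to vectors concrete, here is a minimal Python sketch. Splitting on whitespace and lowercasing are simplifying assumptions, and the two toy documents and the helper name `count_vectors` are made up for this illustration.

```python
import math
from collections import Counter


def count_vectors(doc_a, doc_b):
    """Word-count vectors of two documents over their pooled dictionary."""
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())
    vocab = sorted(set(counts_a) | set(counts_b))
    # Counter returns 0 for words that do not occur in a document
    return vocab, [counts_a[w] for w in vocab], [counts_b[w] for w in vocab]


def cosine_similarity(x, y):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms


vocab, x_a, x_b = count_vectors("bank panic bank run", "bank holiday")
print(vocab)  # ['bank', 'holiday', 'panic', 'run']
print(x_a)    # [2, 0, 1, 1]
print(x_b)    # [1, 1, 0, 0]
print(cosine_similarity(x_a, x_b))
```

The two documents share only the word “bank”, so the similarity is 2 / (√6 · √2) ≈ 0.58. The measure is undefined for an empty document, whose vector has length zero.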

Cosine similarity is equal to the usual (Pearson’s) correlation coefficient if we first demean the word vectors.
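To see this, write \(\bar{x}_A\) and \(\bar{x}_B\) for the average entries of the two vectors and \(\boldsymbol{1}\) for a vector of ones (notation introduced here for this step only). Pearson’s correlation coefficient is then just the cosine similarity formula applied to the demeaned vectors:

\[\begin{align} \text{corr}(\boldsymbol{x_A}, \boldsymbol{x_B}) &= \frac{\sum_{i=1}^n (x_{i,A} - \bar{x}_A) \, (x_{i,B} - \bar{x}_B)}{\sqrt{\sum_{i=1}^n (x_{i,A} - \bar{x}_A)^2} \cdot \sqrt{\sum_{i=1}^n (x_{i,B} - \bar{x}_B)^2}} \nonumber \\ &= \text{sim}(\boldsymbol{x_A} - \bar{x}_A \boldsymbol{1}, \, \boldsymbol{x_B} - \bar{x}_B \boldsymbol{1}) \nonumber \end{align}\]

Demeaned vectors can have negative entries, which is why correlations range from -1 to 1 while the cosine similarity of raw word counts stays between 0 and 1.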


Consider a dictionary of three words. Let’s define (in Matlab) three documents that contain some of these words:

w1 = [1; 0; 0];
w2 = [0; 1; 1];
w3 = [1; 0; 10];

W = [w1, w2, w3];

Calculate the correlation between these:

corrcoef(W)

Which gets us:

ans =

    1.0000   -1.0000   -0.4193
   -1.0000    1.0000    0.4193
   -0.4193    0.4193    1.0000

Documents 1 and 2 have the lowest possible correlation (-1), while documents 2 and 3 are moderately positively correlated (0.42) and documents 1 and 3 moderately negatively correlated (-0.42).

Define a function for cosine similarity:

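function cs = cosine_similarity(x1, x2)
  % Cosine similarity of two column vectors of equal length
  l1 = sqrt(sum(x1 .^ 2));  % Euclidean norm of x1
  l2 = sqrt(sum(x2 .^ 2));  % Euclidean norm of x2

  cs = (x1' * x2) / (l1 * l2);
end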

And calculate the values for our word vectors:

cosine_similarity(w1, w2)
cosine_similarity(w2, w3)
cosine_similarity(w1, w3)

Which gets us:

ans =

     0

ans =

    0.7036

ans =

    0.0995

Documents 1 and 2 again have the lowest possible similarity. The association between documents 2 and 3 is especially high, as both contain the third word in the dictionary which also happens to be of particular importance in document 3.

Demean the vectors and then run the same calculation:

cosine_similarity(w1 - mean(w1), w2 - mean(w2))
cosine_similarity(w2 - mean(w2), w3 - mean(w3))
cosine_similarity(w1 - mean(w1), w3 - mean(w3))


ans =

   -1.0000

ans =

    0.4193

ans =

   -0.4193

They’re indeed the same as the correlations.


Manning, C. D., P. Raghavan and H. Schütze (2008). Introduction to Information Retrieval. Cambridge University Press. (link)

Leonardo da Vinci

In Walter Isaacson’s new Leonardo da Vinci biography:

The more than 7,200 pages now extant probably represent about one-quarter of what Leonardo actually wrote, but that is a higher percentage after five hundred years than the percentage of Steve Jobs’s emails and digital documents from the 1990s that he and I were able to retrieve.

I also liked this:

Leonardo’s Vitruvian Man embodies a moment when art and science combined to allow mortal minds to probe timeless questions about who we are and how we fit into the grand order of the universe. It also symbolizes an ideal of humanism that celebrates the dignity, value, and rational agency of humans as individuals. Inside the square and the circle we can see the essence of Leonardo da Vinci, and the essence of ourselves, standing naked at the intersection of the earthly and the cosmic.