Lukas Püttmann    About    Research    Blog

Cosine similarity

A typical problem when analyzing large amounts of text is trying to measure the similarity of documents. An established measure for this is cosine similarity.


It’s the cosine of the angle between two vectors. In general it ranges from -1 to 1, but for vectors with non-negative entries, such as word counts, it reaches its maximum of 1 if the vectors are parallel and its minimum of 0 if they are perpendicular to each other.

Say you have two documents \(A\) and \(B\). Write these documents as vectors \(\boldsymbol{x} = (x_{1}, x_{2}, ..., x_{n})'\), where \(n\) is the length of the pooled dictionary of all words that show up in either document. An entry \(x_i\) is the number of occurrences of a particular word in a document. Cosine similarity is then (Manning et al. 2008):

\[\begin{align} \text{sim}(\boldsymbol{x_A}, \boldsymbol{x_B}) &= \frac{\boldsymbol{x_A}' \cdot \boldsymbol{x_B}}{\lVert \boldsymbol{x_A} \rVert \cdot \lVert \boldsymbol{x_B} \rVert} \nonumber \\ &= \frac{\sum_{i=1}^n x_{i,A} \, x_{i,B}}{\sqrt{\sum_{i=1}^n x_{i,A}^2} \cdot \sqrt{\sum_{i=1}^n x_{i,B}^2}} \label{cosine_sim} \end{align}\]

Given that entries can only be non-negative, cosine similarity will never be negative. The denominator normalizes for document length and bounds values between 0 and 1.

Cosine similarity is equal to the usual (Pearson’s) correlation coefficient if we first demean the word vectors.


Consider a dictionary of three words. Let’s define (in Matlab) three documents that contain some of these words:

w1 = [1; 0; 0];
w2 = [0; 1; 1];
w3 = [1; 0; 10];

W = [w1, w2, w3];

Calculate the correlation between these:

corrcoef(W)

Which gets us:

ans =

    1.0000   -1.0000   -0.4193
   -1.0000    1.0000    0.4193
   -0.4193    0.4193    1.0000

Documents 1 and 2 have the lowest possible correlation, while documents 2 and 3 are somewhat positively correlated and documents 1 and 3 somewhat negatively correlated.

Define a function for cosine similarity:

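function cs = cosine_similarity(x1, x2)
  % Euclidean lengths of the two column vectors
  l1 = sqrt(sum(x1 .^ 2));
  l2 = sqrt(sum(x2 .^ 2));

  % Dot product divided by the product of the lengths
  cs = (x1' * x2) / (l1 * l2);
end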

And calculate the values for our word vectors:

cosine_similarity(w1, w2)
cosine_similarity(w2, w3)
cosine_similarity(w1, w3)

Which gets us:

ans =

     0

ans =

    0.7036

ans =

    0.0995

Documents 1 and 2 again have the lowest possible similarity. The association between documents 2 and 3 is especially high, as both contain the third word in the dictionary which also happens to be of particular importance in document 3.

Demean the vectors and then run the same calculation:

cosine_similarity(w1 - mean(w1), w2 - mean(w2))
cosine_similarity(w2 - mean(w2), w3 - mean(w3))
cosine_similarity(w1 - mean(w1), w3 - mean(w3))


ans =

   -1.0000

ans =

    0.4193

ans =

   -0.4193

They’re indeed the same as the correlations.
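
For readers without Matlab, the same check can be sketched in plain Python. This is a minimal illustration and not the code used above: the helper names `demean` and `pearson` are my own, and the correlation is deliberately computed as the cosine similarity of the demeaned vectors.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def demean(x):
    """Subtract the vector's mean from every entry."""
    m = sum(x) / len(x)
    return [a - m for a in x]

def pearson(x, y):
    """Pearson correlation as cosine similarity of demeaned vectors."""
    return cosine_similarity(demean(x), demean(y))

# The three word-count vectors from the example
w1 = [1, 0, 0]
w2 = [0, 1, 1]
w3 = [1, 0, 10]

pairs = {"1-2": (w1, w2), "2-3": (w2, w3), "1-3": (w1, w3)}
for name, (a, b) in pairs.items():
    print(name, round(cosine_similarity(a, b), 4), round(pearson(a, b), 4))
```

The first printed number per pair is the raw cosine similarity, the second the correlation.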


Manning, C. D., P. Raghavan and H. Schütze (2008). Introduction to Information Retrieval. Cambridge University Press. (link)

Leonardo da Vinci

In Walter Isaacson’s new Leonardo da Vinci biography:

The more than 7,200 pages now extant probably represent about one-quarter of what Leonardo actually wrote, but that is a higher percentage after five hundred years than the percentage of Steve Jobs’s emails and digital documents from the 1990s that he and I were able to retrieve.

I also liked this:

Leonardo’s Vitruvian Man embodies a moment when art and science combined to allow mortal minds to probe timeless questions about who we are and how we fit into the grand order of the universe. It also symbolizes an ideal of humanism that celebrates the dignity, value, and rational agency of humans as individuals. Inside the square and the circle we can see the essence of Leonardo da Vinci, and the essence of ourselves, standing naked at the intersection of the earthly and the cosmic.

Term spreads and business cycles 1

This is the first part of a series of posts on term spreads and business cycles. There'll probably be three parts.

The term spread is the return differential between a long-term and a short-term safe bond. We can use this to learn about how market participants expect the economy to perform over the next year or so.

My explanation loosely follows Stephen Cecchetti and Kermit Schoenholtz’s book “Money, Banking, and Financial Markets”.


Consider an investor who either buys a long-run bond that runs for two years and pays \(i_{2,t}\) per year or invests in two successive short-run bonds. If he chooses the second option, the bond pays \(i_{1,t}\) in the first year and he expects to earn \(i^e_{1,t+1}\) in the second year. If we neglect any risk, it’s plausible to assume that interest rates adjust such that the investor earns the same using either strategy:

\[(1 + i_{2,t})(1 + i_{2,t}) = (1 + i_{1,t})(1 + i^e_{1,t+1}).\]

When we evaluate this expression and omit all terms that multiply two rates (these will typically be small), we get:

\[i_{2,t} = \frac{i_{1,t} + i^e_{1,t+1}}{2}\]
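
To spell out the omitted step, multiply out both sides of the equal-returns condition:

\[\begin{align} 1 + 2 i_{2,t} + i_{2,t}^2 &= 1 + i_{1,t} + i^e_{1,t+1} + i_{1,t} \, i^e_{1,t+1} \nonumber \\ 2 i_{2,t} &\approx i_{1,t} + i^e_{1,t+1} \nonumber \end{align}\]

The second line drops \(i_{2,t}^2\) and \(i_{1,t} \, i^e_{1,t+1}\). As an illustration with made-up numbers, for \(i_{1,t} = 2\%\) and \(i^e_{1,t+1} = 4\%\) the exact two-year rate is about 2.995% against 3% from the approximation.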

We can extend this to any number of years \(n\):

\[i_{n,t} = \frac{i_{1,t} + i^e_{1,t+1} + \dots + i^e_{1,t+n-1}}{n}\]

So long-term interest rates are composed of expectations about future short-term interest rates.

Combine this with what Cecchetti and Schoenholtz call the Liquidity Premium Theory. Returns that accrue farther into the future are more risky, as the bond issuer may go bankrupt and we don’t know what inflation will be. This means that rates on bonds with longer maturities are usually higher.

The authors add a factor \(rp_n\) (the risk premium of a bond running \(n\) years) to the original equation:

\[i_{n,t} = rp_n + \frac{i_{1,t} + i^e_{1,t+1} + \dots + i^e_{1,t+n-1}}{n}\]

As \(rp_n\) is higher for greater \(n\), interest rates will tend to be higher for longer maturities.

The term spread \(ts_{n,t}\) is then

\[\begin{align} ts_{n,t} &= i_{n,t} - i_{1,t} \nonumber \\ &= rp_n - rp_1 + \frac{i_{1,t} + i^e_{1,t+1} + \dots + i^e_{1,t+n-1}}{n} - i_{1,t} \label{term_spreads} \end{align}\]

A positive term spread can mean two things. Either we expect the average future short-term interest rate to rise or the difference between the two risk premia (\(rp_n - rp_1\)) has increased. We would expect this difference to be positive anyway, but it might widen even more when inflation becomes more uncertain or debt becomes riskier. But disentangling the two explanations is difficult.

It’s more interesting when the term spread turns negative. The difference between the risk premia probably stays positive, so investors expect short-term interest rates to decrease.
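
A small numerical sketch may help here. The rate paths and the risk premium below are made-up numbers, the function names are my own, and \(rp_1\) is set to zero for simplicity.

```python
def long_rate(expected_short_rates, risk_premium):
    """n-year rate: risk premium plus the average of the current
    and expected future one-year rates."""
    n = len(expected_short_rates)
    return risk_premium + sum(expected_short_rates) / n

def term_spread(expected_short_rates, risk_premium):
    """Long rate minus the current short rate (first element of the path)."""
    return long_rate(expected_short_rates, risk_premium) - expected_short_rates[0]

# Hypothetical paths for the one-year rate over four years, in percent
flat = [3.0, 3.0, 3.0, 3.0]
falling = [5.0, 4.0, 3.0, 2.0]
rp = 0.5  # assumed risk premium on the four-year bond (rp_1 = 0)

print(term_spread(flat, rp))     # 0.5
print(term_spread(falling, rp))  # -1.0
```

With a flat expected path the spread equals the risk premium. It only turns negative once short rates are expected to fall by enough to outweigh that premium.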

Short-term interest rates are mostly under the control of the central bank, so this probably means that people expect monetary policy to loosen and that the central bank lends more liberally to banks.

And why would the central bank do that? That’s usually to avert a looming recession and buffer negative shocks. Given that the central bank also responds to changes in the economic environment, it’s not clear what’s causing what here.

But either way, when term spreads turn negative, investors expect bad things. This is why the term spread tells us something about investors’ expectations.


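Let’s look at the empirical evidence for this, as argued and presented by the authors. They kindly provide FRED codes with all their plots, so they’re easy to reproduce. First get some packages in R:

library(tidyverse)  # dplyr, tidyr and ggplot2
library(lubridate)  # year(), quarter()
library(zoo)        # as.yearmon(), as.yearqtr()
library(ggthemes)   # theme_tufte()
library(FredR)      # FRED API client

Insert your FRED API key below: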

api_key <- "yourkeyhere"
fred    <- FredR(api_key)

Pull data on 10-year Treasury bond and 3-month Treasury bill rates:

# Long-term interest rates
lt <- fred$series.observations(series_id = "GS10") %>%
  mutate(date = as.Date(date)) %>% 
  select(date, t10 = value) %>% 
  mutate(t10 = as.numeric(t10))

# Short-term interest rates, merged with the long-term series
# (assumed: full join on date)
fd <- fred$series.observations(series_id = "TB3MS") %>%
  mutate(date = as.Date(date)) %>% 
  select(date, tb3m = value) %>% 
  mutate(tb3m = as.numeric(tb3m)) %>% 
  full_join(lt, by = "date")

# Turn dates into month-years and order by dates
fd <- fd %>% 
  mutate(date = as.yearmon(date)) %>% 
  arrange(date)

Plot the two series:

fd %>% 
  gather(var, val, -date) %>% 
  ggplot(aes(date, val, color = var)) +
  geom_hline(yintercept = 0, size = 0.3, color = "grey80") +
  geom_line() +
  labs(title = "The Term Structure of Treasury Interest Rates",
       subtitle = "1934-2017, monthly",
       x = "month",
       y = "yield",
       color = "Treasury bill:") +
  scale_color_manual(labels = c("10-year", "3-month"),
                     values = c("#ef8a62", "#67a9cf"))  +
  theme_tufte(base_family = "Helvetica")

Which produces:

Treasury bond yields, 10 year and 3 month since 1934 monthly

The series on 3-month Treasury bill yields starts in 1934 and the series on 10-year yields starts in 1953. Both series peak in the early 1980s when inflation ran high. As argued before, long-run interest rates are usually above short-run interest rates. The term spread is the difference between the two:

fd <- fd %>% 
  mutate(trm_spr = t10 - tb3m)

Things become more interesting when we compare the behavior of the term spread with real GDP growth rates:

# Get quarterly GDP
qt <- fred$series.observations(series_id = "GDPC1") %>%
  mutate(yq = as.yearqtr(paste0(year(date), " Q", quarter(date)))) %>% 
  select(yq, rgdp = value) %>% 
  mutate(rgdp = as.numeric(rgdp),
         gr = 100*(rgdp - dplyr::lag(rgdp, 4)) / dplyr::lag(rgdp, 4))

# Aggregate monthly to quarterly and get lag
qt <- fd %>% 
  mutate(yq = as.yearqtr(date)) %>% 
  group_by(yq) %>% 
  summarize(trm_spr = mean(trm_spr)) %>% 
  mutate(trm_spr_l = dplyr::lag(trm_spr, 4)) %>% 
  full_join(qt, by = "yq") %>% 
  filter(yq >= "1947 Q1")
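
For readers who don’t use R, the three data steps here (quarterly averages, a four-quarter lag and year-over-year growth) can be sketched in plain Python. This is only an illustration under my own assumptions about the data layout: the function names and the toy numbers are hypothetical.

```python
from collections import defaultdict

def quarterly_mean(monthly):
    """Average monthly observations {(year, month): value} within each quarter."""
    buckets = defaultdict(list)
    for (year, month), value in monthly.items():
        quarter = (month - 1) // 3 + 1
        buckets[(year, quarter)].append(value)
    return {q: sum(v) / len(v) for q, v in sorted(buckets.items())}

def lag(values, k):
    """Shift a list back by k periods, padding the start with None."""
    return ([None] * k + list(values))[:len(values)]

def yoy_growth(levels):
    """Percent change relative to the same quarter one year earlier."""
    return [None if p is None else 100 * (x - p) / p
            for x, p in zip(levels, lag(levels, 4))]

# Toy data: four months of a spread and six quarters of real GDP
spread = {(2000, 1): 1.0, (2000, 2): 2.0, (2000, 3): 3.0, (2000, 4): 4.0}
gdp = [100.0, 101.0, 102.0, 103.0, 104.0, 106.0]

print(quarterly_mean(spread))
print(lag([0.5, 0.4, 0.3, 0.2, 0.1], 4))
print(yoy_growth(gdp))
```

The first four growth rates are missing because there is no observation a year earlier, which is also why the lagged spread starts one year late.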

Make a plot of the two series:

qt %>% 
  select(-rgdp, -trm_spr_l) %>% 
  drop_na(trm_spr) %>% 
  gather(var, val, -yq) %>% 
  ggplot(aes(yq, val, color = var)) +
  geom_hline(yintercept = 0, size = 0.3, color = "grey80") +
  geom_line() +
  labs(title = "Current Term Spread and GDP Growth",
       subtitle = "1953-2017, quarterly",
       x = "quarter",
       y = "percentage points",
       color = "Variables:") +
  theme_tufte(base_family = "Helvetica") +
  scale_color_manual(labels = c("Real GDP growth", "Term spread"),
                     values = c("#ef8a62", "#67a9cf"))

We get:

Term spread against GDP growth

The term spread often falls before GDP growth does. When the term spread turns negative, recessions tend to happen. Compare the lagged term spread to GDP growth:

qt %>% 
  filter(yq >= 1953) %>% 
  ggplot(aes(gr, trm_spr_l)) +
  geom_hline(yintercept = 0, size = 0.3, color = "grey50") +
  geom_vline(xintercept = 0, size = 0.3, color = "grey50") +
  geom_point(alpha = 0.7, stroke = 0, size = 3) +
  labs(title = "Term Spread 1 Year Earlier and GDP Growth",
       subtitle = paste0("1953-2017, quarterly. Correlation: ", 
                         round(cor(qt$gr, qt$trm_spr_l, 
                                   use = "complete.obs"), 2), "."),
       x = "growth rate of real GDP",
       y = "Term spread (lagged 1 year)") +
  geom_smooth(method = "lm", size = 0.2, color = "#ef8a62", fill = "#fddbc7") +
  theme_tufte(base_family = "Helvetica")

Which gets us:

Term spread 1 year earlier and GDP growth

This is exactly the relationship that makes people think of term spreads as a good predictor of GDP in the near future.


Cecchetti and Schoenholtz summarize it like this (p.180):

[…] [I]nformation on the term structure – particularly the slope of the yield curve – helps us to forecast general economic conditions. Recall that according to the expectations hypothesis, long-term interest rates contain information about expected future short-term interest rates. And according to the liquidity premium theory, the yield curve usually slopes upward. The key statement is usually. On rare occasions, short-term interest rates exceed long-term yields. When they do, the term structure is said to be inverted, and the yield curve slopes downward.

[…] Because the yield curve slopes upward even when short-term yields are expected to remain constant – it’s the average of expected future short-term interest rates plus a risk premium – an inverted yield curve signals an expected fall in short-term interest rates. […] When the yield curve slopes downward, it indicates that [monetary] policy is tight because policymakers are attempting to slow economic growth and inflation.

We still don’t know what’s causing what. Is the central bank the driver of business cycles or is it just responding to a change in the economic environment?

Stay tuned for the next installment in this series in which I’ll look at the role of monetary policy.


Cecchetti, S. G. and K. L. Schoenholtz (2017). “Money, Banking, and Financial Markets”. 5th edition, McGraw-Hill Education. (link)

Presentations at the AEA

I presented at this year’s AEA in a session on automation. Apart from the papers in that session, I also enjoyed the following:

  • Lifecycle of Inventors
  • Susan Athey presented “Online Intermediaries and the Consumption of Polarized and Inaccurate News During the 2016 Presidential Election” in this session. They tried to estimate what the political leanings of media were that people consumed during the 2016 Presidential Election. They document that most of the media is left of center. But the strongest result is that they show a lack of reliable right-wing media. She explained that if they hadn’t coded Fox News as at least moderately accurate, then there would be no such right-wing media outlet.
  • Automation and the Workforce
  • John Horton’s “Labor Market Equilibration: Evidence from Uber” (pdf).
  • Trade and Innovation
  • “Shopping in Macroeconomics” (best session title!)
  • “Credit Booms, Aggregate Demand, and Financial Crises”, which included a new paper by Matthew Baron, Emil Verner and Wei Xiong. They painstakingly digitized historical bank equity returns to create a new financial crisis indicator for 47 countries since 1800.

Collected links

  1. Google Maps’s Moat, by Justin O’Beirne (who runs one of the best blogs I know):

    At the rate it’s going, how long until Google has every structure on Earth?

  2. John Horton simulates the new Goldsmith-Pinkham, Sorkin and Swift paper on Bartik instruments.
  3. Thomas Leeper:

    What have I learned from this? First, everything takes time. Coming up with compelling research takes time. Data collection takes time. Data analysis takes time. Writing takes time. Peer review takes time. Rejection takes time. Recovery from rejection takes time. Responding to reviewers takes time. Typesetting takes time. Email takes time. Writing this blog post takes time. Everything takes time. It’s been eight years but that time has brought a publication I’m quite proud of.

  4. Brian Hayes provides a readable introduction to net neutrality:

    In round numbers, the web has something like a billion sites and four billion users—an extraordinarily close match of producers to consumers. […] Yet the ratio for the web is also misleading. Three fourths of those billion web sites have no content and no audience (they are “parked” domain names), and almost all the rest are tiny. Meanwhile, Facebook gets the attention of roughly half of the four billion web users. Google and Facebook together, along with their subsidiaries such as YouTube, account for 70 percent of all internet traffic.

  5. 100 Jahre DIN - Ein Urgestein der deutschen Wirtschaft (in German)
  6. Where did all my probability go?

Growing and shrinking

Stephen Broadberry and John Joseph Wallis have written an interesting new paper (pdf). They argue that much of modern economic growth is due not only to a higher trend growth rate, but also to fewer periods of shrinking.

It’s easy to reproduce some of their results using the nice maddison package.

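Get some packages:

library(tidyverse)  # dplyr, tidyr and ggplot2
library(lubridate)  # year()
library(ggthemes)   # theme_tufte()
library(broom)      # tidy()
library(maddison)   # Maddison Project data

Let’s concentrate on some countries with good data coverage, keep only data since 1800 and calculate real GDP growth rates. We also add a column for decades, to be able to look at some statistics for those separately.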

df <- maddison %>% 
  filter(aggregate == 0,
         iso2c %in% c("DE", "FR", "IT", "JP", "NL", "ES", "SE", "GB", "US",
                      "AU", "DK", "ID", "NO", "GR", "ZA", "BE", "CH", "FI", 
                      "PT", "AT", "CA"),
         year >= as.Date("1800-01-01")) %>%
  group_by(iso3c) %>% 
  mutate(gr = 100*(gdp_pc - dplyr::lag(gdp_pc, 1)) / dplyr::lag(gdp_pc, 1)) %>% 
  ungroup() %>% 
  drop_na(gr) %>% 
  mutate(decade = paste0((year(year) %/% 10) * 10, "-", 
                         (year(year) %/% 10) * 10 + 9))

Define a data frame with annotations for the World Wars:

df_annotate <- tibble(
  xmin = as.Date(c("1914-01-01", "1939-01-01")),
  xmax = as.Date(c("1918-01-01", "1945-01-01")),
  ymin = -Inf, ymax = Inf,
  label = c("WW1", "WW2")
)

Check out data for some countries:

df %>% 
  filter(iso2c %in% c("DE", "FR", "IT", "JP", "NL", "ES", "SE", "GB", "US")) %>%
  ggplot() + 
  geom_hline(yintercept = 0, size = 0.3, color = "grey80") +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
            data = df_annotate, fill = "grey50", alpha = 0.25) +
  geom_line(aes(year, gr), size = 0.4) +
  facet_wrap(~country, scales = "free_y") +
  ggthemes::theme_tufte(base_family = "Helvetica", ticks = FALSE) +
  labs(x = NULL, 
       y = "Percentage points\n", 
       title = "Growth rate of real GDP per capita",
       subtitle = "1800-2010")

Real GDP growth rates for nine countries since 1800 with Maddison data

Growth rates wobble around a mean slightly larger than zero, as expected. There are some extreme events around the World Wars. But it’s not immediately apparent from looking at these figures whether shrinking has become less frequent. If anything, it’s macroeconomic volatility that has decreased for most of these countries.

Next, for each decade we calculate the number of periods that countries were growing and shrinking and the average growth rates in those periods.

stats <- df %>% 
  group_by(iso3c, decade) %>% 
  mutate(category = ifelse(gr > 0, "grow", "shrink"),
         gr_av = mean(gr, na.rm = TRUE)) %>% 
  group_by(decade, iso3c, country_original, country, category, gr_av) %>% 
  summarise(prds = n(),
            delta = mean(gr, na.rm = TRUE)) %>% 
  ungroup()

stats <- stats %>% 
  left_join(stats %>% 
              group_by(iso3c, decade) %>% 
              mutate(totper = sum(prds))) %>% 
  mutate(frq = prds / totper) %>% 
  select(-prds, -totper) %>% 
  mutate(comp = frq * delta)

We’re interested in how much growing and shrinking years contributed to overall growth in a decade. So this is just the absolute value of changes by either shrinking and growing years divided by the total variation.

stats <- stats %>% 
  full_join(stats %>% 
              group_by(decade, iso3c, country_original, country, gr_av) %>% 
              summarise(tvar = sum(abs(comp)),
                        gr_comp = sum(comp))) %>% 
  mutate(contr = abs(comp) / tvar) %>% 
  mutate(dfac = factor(decade)) %>% 
  filter(decade != "2010-2019")

ggplot(stats, aes(dfac, comp, color = category, 
                  shape = category, fill = category)) +
  geom_hline(yintercept = 0, size = 0.3, color = "grey80") +
  geom_jitter(alpha = 0.5) +
  coord_flip() +
  ggthemes::theme_tufte(base_family = "Helvetica", ticks = FALSE) +
  scale_shape_manual(values = c(17, 25)) +
  scale_colour_manual(values = c("#00BFC4", "#F8766D")) +
  scale_fill_manual(values = c("#00BFC4", "#F8766D")) +
  labs(title = "Contributions of growing and shrinking",
       subtitle = "1800-2009, every point is a country-decade observation",
       y = "Contribution to real GDP growth",
       x = NULL,
       caption = "Source: Maddison Project and own calculations.") +
  scale_x_discrete(limits = rev(levels(stats$dfac)))

When we plot those contributions, we get the following picture:

Contribution of growing and shrinking

It looks as if the contributions of shrinking have clustered closer to zero since WW2.
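
To make the decomposition behind this figure concrete, here is a minimal Python sketch for a single country-decade. The ten growth rates are invented, and the function is my own restatement of the steps (frequency times average growth per category, then shares of total absolute variation), not a port of the R pipeline.

```python
def contributions(growth_rates):
    """Split a decade's growth into growing and shrinking components.

    comp:  frequency of such years times their average growth rate
    contr: absolute component divided by the sum of absolute components
    """
    groups = {
        "grow": [g for g in growth_rates if g > 0],
        "shrink": [g for g in growth_rates if g <= 0],
    }
    total = len(growth_rates)
    comp = {
        name: (len(vals) / total) * (sum(vals) / len(vals)) if vals else 0.0
        for name, vals in groups.items()
    }
    tvar = sum(abs(c) for c in comp.values())
    contr = {name: abs(c) / tvar if tvar else 0.0 for name, c in comp.items()}
    return comp, contr

# Hypothetical decade: eight growing years and two shrinking years (percent)
decade = [3.0, 2.0, -1.0, 4.0, 1.0, -3.0, 2.0, 3.0, 2.0, 3.0]
comp, contr = contributions(decade)
print(comp)
print(contr)
```

The two components add up to the decade’s average growth rate, so a shrinking component closer to zero directly raises average growth.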

Let’s check out the trends in the contribution of shrinking periods:

r <- stats %>% 
  mutate(trend = row_number()) %>% 
  filter(category == "shrink") %>% 
  group_by(iso3c, country_original, country) %>% 
  do(fitDecades = lm(contr ~ trend, data = .)) %>% 
  tidy(fitDecades, conf.int = TRUE) %>% 
  mutate(sig = (p.value < 0.05))

And plot the regression coefficients of the trend line:

r %>% 
  filter(term == "trend") %>% 
  arrange(estimate) %>% 
  ggplot(aes(reorder(country, -estimate), estimate)) +
  geom_hline(yintercept = 0, size = 0.8, color = "grey10", linetype = "dotted") +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.3) +
  geom_point(aes(y = estimate, fill = sig), shape = 21, size = 1.0, color = "black") + 
  coord_flip() +
  theme_tufte(base_family = "Helvetica", ticks = FALSE) +
  labs(title = "Trend of the contribution of shrinking episodes",
       subtitle = "1800-2009 by decades",
       x = NULL,
       y = "OLS estimate and 95% confidence bands",
       caption = "Source: Maddison Project and own calculations.") +
  scale_fill_manual(name = "Significant at 5% level?",
                    labels = c("No", "Yes"),
                    values = c("white", "black"))

Contribution of growing and shrinking, regression coefficients

The trend was negative for many countries and for some there was no significant trend. None of the trends was significantly positive. It certainly looks as if episodes of growing output have become more important than episodes of shrinking output.

The question remains whether this might not just be mechanically driven by the fact that a higher trend real GDP growth rate reduces the probability of the growth rate hitting zero. And a fall in macroeconomic volatility would also make shrinking episodes less likely.
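
A back-of-the-envelope sketch shows how strong this mechanical effect could be. It assumes, purely for illustration, that annual growth rates are normally distributed. The means and standard deviations are made-up numbers, not estimates from the Maddison data.

```python
from statistics import NormalDist

def shrink_probability(mean_growth, volatility):
    """P(growth < 0) if annual growth were normal with this mean and std. dev."""
    return NormalDist(mu=mean_growth, sigma=volatility).cdf(0.0)

# Hypothetical scenarios, in percentage points
baseline = shrink_probability(2.0, 4.0)
higher_trend = shrink_probability(4.0, 4.0)
lower_volatility = shrink_probability(2.0, 2.0)

print(round(baseline, 3), round(higher_trend, 3), round(lower_volatility, 3))
```

Under these assumptions, doubling trend growth and halving volatility reduce the share of shrinking years by the same amount, because only the ratio of mean to standard deviation matters.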


Broadberry, S. and J. J. Wallis (2017). “Growing, Shrinking, and Long Run Economic Performance: Historical Perspectives on Economic Development”, NBER Working Paper No. 23343. doi: 10.3386/w23343

Collected links

  1. John Coltrane – Alabama
  2. “Is AI Riding a One-Trick Pony?” (through Kaiser Fung)
  3. British Brexit Secretary, David Davis:

    The assessment of that effect … is not as straightforward as many people think. And I’m not a fan of economic models, because they have all proven wrong. (link)

  4. Consumer Inflation Uncertainty
  5. Why Are Some Good Old Ideas Buried in the History of Statistical Graphics?
  6. Tom Breloff:

    Instead of rushing to Software 2.0, lets view neural networks in proper context: they are models, not magic.

  7. Roman Cheplyaka: “Explained variance in PCA”
  8. Project-oriented workflow or how not to “SET YOUR COMPUTER ON FIRE 🔥.”
  9. “Cat Person”, by Kristen Roupenian in The New Yorker. About which was written:

    This past weekend, the biggest story on social media was not about a powerful man who had sexually assaulted someone, or something the president said on Twitter. Charmingly, as if we were all at a Paris salon in the 1920s, everyone had an opinion about a short story.

  10. Beautiful Testing (seen here)
  11. Sam Altman thinks he can speak more freely in Beijing than in San Francisco. And Tyler Cowen has a good explanation.

VoxEU gobbledygook

We’ve written a VoxEU column for our patent paper, which you can find here.

Overview statistics

When I started writing this article I became curious what a typical VoxEU column looks like. So I scraped the archives and looked at some statistics. Here they are (as of November 15 2017):

  • After some cleaning, there are 5633 columns from January 2008 to November 2017.
  • The mean number of page reads of columns is 20,600 (median: 16,600).
  • The mean number of authors is 2.1 (median: 2). There are about 1800 single-authored columns.
  • The teaser text at the top of the columns contained 68 words on average (median: 66).
  • The main part of the column is a little harder to count, because it also contains tables, figure captions and references. When I just count all words before the first appearance of “References” in the text, I get a mean of 1383 words (median: 1327). That seems well within the recommended range of 1000-1500 words.
  • The most prolific writers have written up to 50 columns and the mean number of columns per author is 2.2 (median: 1).

Every column is assigned to one topic and several tags. I aggregated the 49 topics to one of 19 categories (e.g., I counted “EU institutions” and “EU policies” as “Europe” and “Microeconomic regulation” and “Competition policy” as “Industrial organisation”). This produces the following figure:

Categories of economic fields in VoxEU columns, quarterly 2008-2017

Some observations:

  • The top categories are “International economics” (950), “Europe” (930), “Development” (590) and “Financial markets” (580).
  • Microeconomic theory and econometrics are only rarely covered.
  • The spike of the “Europe” category around 2012 might be related to the euro area sovereign debt crisis around that time.
  • The topic “Frontiers of economic research” is a bit more vague.
  • “Labor” and “Economic history” columns have become more important and columns with the topic “Global crisis” have become rarer.

Measuring complexity of text in columns

One fun exercise I’ve run is inspired by this blog post by Julia Silge. She explains how to use the “Simple Measure of Gobbledygook” (SMOG) by McLaughlin (1969) to find out which texts are hard to read. It works by counting how many polysyllabic words (words with three or more syllables) a text contains relative to its number of sentences. Texts with fewer such words are seen as easier to understand. The SMOG value is meant to show how many years of education somebody needs to understand a text.

I’m running this analysis separately on the columns’ teaser texts and their main bodies. Our own teaser text has 16 polysyllabic words in four sentences and we calculate the SMOG value like this:

\[\text{SMOG} = 1.043 \cdot \sqrt{\left(16 \cdot \frac{30}{4}\right)} + 3.1291 \approx 14.6\]

The rest of the column has 251 polysyllables in 65 sentences, which yields a SMOG of 14.4.

The VoxEU teaser text with the lowest SMOG value belongs to this column by Jeffrey Frankel. It has a SMOG of 6.4, so taking the measure literally we would expect a kid fresh out of primary school to be able to understand it.

The column with the lowest SMOG value in its main column text is this column by James Andreoni and Laura Gee. It has 147 polysyllables spread out over 79 sentences, which yields a SMOG of 10.9.
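
The SMOG values quoted so far can be reproduced with a few lines of Python. This is a minimal sketch of the formula only: it takes the counts of polysyllabic words and sentences as given and does not attempt to count syllables. The function name is my own.

```python
import math

def smog(polysyllables, sentences):
    """SMOG grade from the number of polysyllabic words and sentences.

    The polysyllable count is scaled to a 30-sentence sample before
    taking the square root.
    """
    return 1.043 * math.sqrt(polysyllables * 30 / sentences) + 3.1291

print(round(smog(16, 4), 1))    # our teaser text
print(round(smog(251, 65), 1))  # the rest of our column
print(round(smog(147, 79), 1))  # column by Andreoni and Gee
```

Because of the square root, doubling the density of long words raises the score by less than half.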

I won’t name any offenders, but the highest SMOG score is 26.8. Taken literally, understanding that text would require a substantial amount of education, such as: 12 (school) + 3 (undergrad) + 1 (master) + 5 (PhD) + 6 (assistant professor) years.

The overall average SMOG value is 14.8 on teasers and 16.0 on main column texts. So it seems that economists write on a level that college graduates can understand. SMOG doesn’t vary much by field, but it takes its highest values (on full columns) in “Industrial organisation” (16.9) and “Monetary economics” (16.4) and its lowest in “Economic History” (15.8) and “Global Crisis” (15.7).

The SMOG on the two column parts has a correlation of 0.27.

SMOG values of teaser vs. main part of VoxEU column

The OLS line is flatter than the 45 degree line which is probably a sign of the more accessible language in the teasers.

Interestingly, when we compare articles’ SMOG values with the number of times the page was read, we get the following negative relationship:

Number of page reads against SMOG values of VoxEU columns

This also holds in a regression of log(page reads) on the SMOG values of both main text and teaser text, the number of authors, number of authors squared and dummies for the day of the week, quarter, year and – most importantly – the literature category (e.g. “Taxation”, “Financial markets” or “Innovation”). It’s not driven by outliers either and there is also a significantly negative relationship if I measure SMOG on the teasers only.

Writing columns that take an additional year of schooling to understand (SMOG + 1) is associated with 3 percent fewer page reads. Maybe that’s a reason to use fewer big words in our papers!

One explanation might be that more complex papers require the use of more big words, and that users prefer clicking on articles that don’t sound too complicated. But better-written papers might also just be inherently better in other dimensions. And because they’re more important, people read them more often.


McLaughlin, G. H. (1969). “SMOG Grading - a New Readability Formula”. Journal of Reading. 12(8): 639–646.

My 10 favorite books 2017

Here are the books that I most enjoyed reading in the last 12 months in reverse order:

  1. “Just Mercy: A Story of Justice and Redemption”, by Bryan Stevenson. A book about the weakest in society and the weaknesses of society.
  2. “Hitler’s Soldiers: The German Army in the Third Reich”, by Ben Shepherd. I find it hard to say I enjoyed this book, but it impressed me. Facts really stand out when Shepherd is able to put numbers on them. For example, did you know that, “During the winter of 1941-2, 360,000 Greeks died of famine”?
  3. “Submission”, by Michel Houellebecq. What I found eerie is the psychological plausibility of the decisions in this story.
  4. “Rules for Radicals”, by Saul Alinsky. “All issues are controversial.” More here and here.
  5. “Folding Beijing”, by Hao Jingfang. A fantastic (in both senses of the word) short story about reality and inequality by a Chinese macroeconomist.
  6. “Hans Fallada: Die Biographie”, by Peter Walther (in German). I really enjoyed reading Fallada’s book “Alone in Berlin” and Fallada’s life was equally interesting. This superb biography spares nothing by simplifying too much.
  7. “Commonwealth”, by Ann Patchett. We accompany the lives of six siblings over several decades. I started not expecting to finish, but after the first chapter I couldn’t stop.
  8. “Gorbachev: His Life and Times”, by William Taubman. So much was new to me, which mostly just reveals my ignorance about the topic.
  9. “At the Existentialist Café: Freedom, Being, and Apricot Cocktails”, by Sarah Bakewell. Beautifully written and insightful.
  10. “The Secret of Our Success: How Culture Is Driving Human Evolution, Domesticating Our Species, and Making Us Smarter”, by Joseph Henrich. I found this book mind-boggling and impressive with things to say about anything from culture, evolution, causal reasoning to religion.
