A typical problem when analyzing large amounts of text is trying to measure the similarity of documents. An established measure for this is cosine similarity.
It’s the cosine of the angle between two vectors. Vectors with non-negative entries, such as word counts, have a maximum cosine similarity of 1 if they are parallel and a minimum of 0 if they are perpendicular to each other.
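Say you have two documents $d_1$ and $d_2$. Write these documents as vectors $x_1, x_2 \in \mathbb{R}^V$, where $V$ is the length of the pooled dictionary of all words that show up in either document. An entry $x_{iv}$ is the number of occurrences of word $v$ in document $i$. Cosine similarity is then (Manning et al. 2008):

$$\mathrm{sim}(d_1, d_2) = \frac{x_1 \cdot x_2}{\lVert x_1 \rVert \, \lVert x_2 \rVert} = \frac{\sum_{v=1}^{V} x_{1v} x_{2v}}{\sqrt{\sum_{v=1}^{V} x_{1v}^2} \, \sqrt{\sum_{v=1}^{V} x_{2v}^2}}$$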
Given that entries can only be non-negative, cosine similarity is never negative. The denominator normalizes for document length, which bounds values between 0 and 1.
Cosine similarity is equal to the usual (Pearson’s) correlation coefficient if we first demean the word vectors.
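To see why, a short derivation may help. The symbols $x$, $y$, $\bar{x}$, $\bar{y}$, $\tilde{x}$ and $\tilde{y}$ are notation assumed here for two word vectors of length $V$, their means, and their demeaned versions. Pearson's correlation coefficient is

$$r_{xy} = \frac{\sum_{v=1}^{V} (x_v - \bar{x})(y_v - \bar{y})}{\sqrt{\sum_{v=1}^{V} (x_v - \bar{x})^2} \, \sqrt{\sum_{v=1}^{V} (y_v - \bar{y})^2}} = \frac{\tilde{x} \cdot \tilde{y}}{\lVert \tilde{x} \rVert \, \lVert \tilde{y} \rVert}, \qquad \tilde{x}_v = x_v - \bar{x}, \quad \tilde{y}_v = y_v - \bar{y},$$

which is the cosine similarity formula applied to the demeaned vectors $\tilde{x}$ and $\tilde{y}$.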
Consider a dictionary of three words. Let’s define (in Matlab) three documents that contain some of these words:
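A minimal sketch with illustrative counts. The specific numbers are assumptions chosen for this example: documents 1 and 2 share no words, and the third word dominates document 3.

```matlab
% Word-count vectors over a three-word dictionary.
% The counts are illustrative assumptions, not taken from a real corpus.
doc1 = [1 1 0];   % contains words 1 and 2 only
doc2 = [0 0 1];   % contains word 3 only
doc3 = [1 0 2];   % contains word 1 once and word 3 twice
```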
Calculate the correlation between these:
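One way to do this is with the built-in `corrcoef`, which treats the columns of a matrix as variables, so the row vectors are transposed first. The sketch repeats the illustrative (assumed) vectors so that it runs on its own.

```matlab
% Illustrative word-count vectors (assumed values for this example).
doc1 = [1 1 0];
doc2 = [0 0 1];
doc3 = [1 0 2];

% corrcoef correlates columns, so each document becomes one column.
R = corrcoef([doc1' doc2' doc3'])

% With these vectors R is approximately:
%    1.0000   -1.0000   -0.8660
%   -1.0000    1.0000    0.8660
%   -0.8660    0.8660    1.0000
```

Entry `R(i,j)` is the correlation between documents `i` and `j`.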
Which gets us:
Documents 1 and 2 have the lowest possible correlation (−1), while documents 2 and 3 and documents 1 and 3 are somewhat correlated.
Define a function for cosine similarity:
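A possible one-line version as an anonymous function. The name `cossim` is a hypothetical choice for this sketch, and it assumes both inputs are row vectors of equal length with at least one nonzero entry each.

```matlab
% Cosine similarity of two row vectors: dot product over product of norms.
% x * y' is the dot product when x and y are row vectors.
cossim = @(x, y) (x * y') / (norm(x) * norm(y));
```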
And calculate the values for our word vectors:
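The following sketch repeats the illustrative (assumed) vectors and the hypothetical `cossim` helper so that it runs on its own:

```matlab
% Illustrative word-count vectors (assumed values for this example).
doc1 = [1 1 0];
doc2 = [0 0 1];
doc3 = [1 0 2];

% Cosine similarity of two row vectors (hypothetical helper name).
cossim = @(x, y) (x * y') / (norm(x) * norm(y));

s12 = cossim(doc1, doc2)   % 0: the documents share no words
s13 = cossim(doc1, doc3)   % 1/sqrt(10), about 0.3162
s23 = cossim(doc2, doc3)   % 2/sqrt(5), about 0.8944
```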
Which gets us:
Documents 1 and 2 again have the lowest possible similarity. The association between documents 2 and 3 is especially high, as both contain the third word in the dictionary, which also happens to be of particular importance in document 3.
Demean the vectors and then run the same calculation:
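A sketch of that step, again with the illustrative (assumed) vectors; `cossim` and `demean` are hypothetical helper names introduced here:

```matlab
% Illustrative word-count vectors (assumed values for this example).
doc1 = [1 1 0];
doc2 = [0 0 1];
doc3 = [1 0 2];

% Hypothetical helpers: cosine similarity and mean subtraction.
cossim = @(x, y) (x * y') / (norm(x) * norm(y));
demean = @(x) x - mean(x);

c12 = cossim(demean(doc1), demean(doc2))   % about -1
c13 = cossim(demean(doc1), demean(doc3))   % -sqrt(3)/2, about -0.8660
c23 = cossim(demean(doc2), demean(doc3))   % sqrt(3)/2, about 0.8660
```

Up to floating-point rounding, these are the off-diagonal entries that `corrcoef` returns for the same vectors.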
They’re indeed the same as the correlations.
Manning, C. D., P. Raghavan and H. Schütze (2008). Introduction to Information Retrieval. Cambridge University Press.