An excellent course on machine learning

Following Olli’s recommendation I took a crack at Andrew Ng’s machine learning course on Coursera. It’s pedagogically well designed and well taught, and I highly recommend it. I’d like to share my code for the course, but I think that would be against the spirit of such an online class.

Some observations:

• It’s an introductory course, so there are no proofs. The focus is on graphical intuition, on implementing the algorithms in Octave/Matlab, and on the practicalities of large-scale machine learning projects.
• The goal is not to estimate $\beta$ accurately and quantify the uncertainty around it, but to predict $\hat{y}$ precisely out of sample. (See also here.)
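
A toy sketch of the difference (my own illustration, not from the course): with two nearly collinear regressors, OLS recovers noisy individual coefficients, while ridge regression deliberately biases $\beta$ towards zero and can still predict well.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly collinear regressors: the individual OLS betas are then
# poorly identified, even though good prediction is still possible.
n = 40
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1 + 2 * x1 + 2 * x2 + rng.normal(size=n)

# OLS: unbiased, but the x1/x2 coefficients can swing wildly.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge: shrinks beta towards zero -- biased on purpose -- yet the
# fitted y-hats are typically at least as good out of sample.
lam = 5.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print("OLS:  ", np.round(beta_ols, 2))
print("ridge:", np.round(beta_ridge, 2))
```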
• The course is refreshingly different from what we normally study and I think best taken not as a substitute, but as a complement to econometrics classes.
• It uses a different vocabulary:

| Machine learning | Econometrics |
| --- | --- |
| example | observation |
| (to) learn | (to) estimate |
| hypothesis | estimation equation |
| feature/input | variable |
| output/outcome | dependent variable |
| bias | constant/intercept |
| bias (yeah, twice 🤔) | bias |
• Linear regression is introduced through the cost function and its numerical minimization. Ng shows the analytical solution on the side, but adds that it is only practical in a “low-dimensional” problem with up to 10,000 or so features.
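
A minimal sketch of the two routes in Python (my own illustration; the course uses Octave): the normal equation solves $(X'X)\theta = X'y$ directly, while gradient descent only ever touches $X$ through matrix-vector products.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + 0.1 * rng.normal(size=100)

# Normal equation: exact, but the solve scales roughly cubically in
# the number of features, hence the ~10,000-feature rule of thumb.
theta_exact = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the squared-error cost: cheap per step, so it
# stays feasible when the feature count gets large.
theta, alpha = np.zeros(3), 0.1
for _ in range(2000):
    theta -= alpha * X.T @ (X @ theta - y) / len(y)

print(np.round(theta_exact, 3), np.round(theta, 3))  # should agree closely
```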
• I liked the introduction to neural networks in week 4 and the explanation of them as stacked logical operators.
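
Reimplemented in Python (the lecture builds it from hand-picked weights like these): a single sigmoid unit with large weights acts as a logical gate, and stacking gates in a second layer yields XNOR, which no single unit can represent.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# One sigmoid unit per gate, with weights chosen to saturate it:
AND      = lambda a, b: sigmoid(-30 + 20 * a + 20 * b)
OR       = lambda a, b: sigmoid(-10 + 20 * a + 20 * b)
NOT_BOTH = lambda a, b: sigmoid( 10 - 20 * a - 20 * b)  # (NOT a) AND (NOT b)

# Stacking the gates gives XNOR: a two-layer network of logical operators.
XNOR = lambda a, b: OR(AND(a, b), NOT_BOTH(a, b))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(float(XNOR(a, b))))  # 1 iff a == b
```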
• There is an insightful discussion of how training and cross-validation errors, plotted against the regularization parameter and against the number of observations, reveal whether bias or variance is the problem.
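
A sketch of the diagnostic with scikit-learn on synthetic data (the course does this in Octave): sweep the regularization parameter and compare training error against cross-validation error.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))                 # many features, few of them useful
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)
X_tr, X_cv, y_tr, y_cv = train_test_split(X, y, test_size=0.5, random_state=0)

for lam in [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]:
    m = Ridge(alpha=lam).fit(X_tr, y_tr)
    e_tr = mean_squared_error(y_tr, m.predict(X_tr))
    e_cv = mean_squared_error(y_cv, m.predict(X_cv))
    # Large gap (e_cv >> e_tr): variance problem -- regularize more, get data.
    # Both errors high and close: bias problem -- add features, richer model.
    print(f"lambda={lam:7.2f}  train={e_tr:.3f}  cv={e_cv:.3f}")
```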
• The video on error analysis made me realize that in my patent project I had spent a very large amount of time thinking about appropriate error metrics, but little time actually inspecting the misclassified patents.
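
In its simplest form, error analysis is just this (a sketch on synthetic data; a real classifier such as the patent one would stand in for the toy model):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-in for a real classifier and its cross-validation set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_cv, y_tr, y_cv = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)

# The error-analysis step: pull out the misclassified examples and
# eyeball them for common patterns, instead of staring only at metrics.
wrong = clf.predict(X_cv) != y_cv
print(f"{wrong.sum()} of {len(y_cv)} misclassified")
for features, label in zip(X_cv[wrong][:5], y_cv[wrong][:5]):
    print(label, np.round(features, 2))
```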
• In a presentation I once attended, Tom Sargent said:

A little linear algebra goes a long way.

Similarly here: with just a little clustering, for example, we can compress the file size of this image by a factor of six while preserving much of the information (exercise 7):

Figure: Image compression with K-means clustering
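
In scikit-learn the idea looks roughly like this (the exercise itself is in Octave, and "bird.png" is a placeholder filename). With 16 centroids, each pixel stores a 4-bit centroid index instead of 24 bits of color, hence the factor of six:

```python
import numpy as np
from matplotlib import image
from sklearn.cluster import KMeans

# K-means color quantization: cluster the pixel colors, then replace
# each pixel with its nearest centroid.
img = image.imread("bird.png")[..., :3]   # drop alpha channel if present
pixels = img.reshape(-1, 3)

km = KMeans(n_clusters=16, n_init=10, random_state=0).fit(pixels)
compressed = km.cluster_centers_[km.labels_].reshape(img.shape)

image.imsave("bird_16colors.png", np.clip(compressed, 0, 1))
```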

• I hadn’t previously thought of dimension reduction as a debugging method: If you get thousands of features from different sources and you’re not sure if some might be constructed similarly, then dimension reduction weeds out the redundant features.
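
A sketch of that idea in plain numpy (hypothetical setup): if one of five supposedly distinct features is nearly a linear copy of another, the data matrix has a near-zero singular value, which the PCA machinery exposes immediately.

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose two of five 'different' features were in fact constructed
# from the same underlying source.
base = rng.normal(size=(500, 4))
redundant = 2 * base[:, 0] + 0.01 * rng.normal(size=500)
X = np.column_stack([base, redundant])

# PCA via the SVD of the centered data: a singular value far smaller
# than the rest flags a redundant direction, i.e. a feature adding
# (almost) no new information.
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)
print(np.round(singular_values, 2))
```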
• Time series data is suspiciously missing from the course.
• He mentions this folk wisdom:

[…] [I]t’s not who has the best algorithm that wins. It’s who has the most data.