## Monday, October 31, 2011

### Machine Learning 10-37

Life caught up to me and I had several other deadlines pop up over this week.  Let's see if I can jam through the material before the homework is due.  This means I probably won't have time to work out the quizzes, sorry.

Laplace smoothing:  Smoothing is a funny word for it, no?  The idea is that when we have no data for a particular word, the distribution looks "sharp" because we have this sudden drop to 0.  By hallucinating some data, we get some probability in that case, and the distribution looks smooth.

The question in 15 shows us why the denominator has $$+ k |x|$$ in it.  Recall that the $$P(M)$$ values must sum to 1.  $$|x|$$ is the number of distinct values we're summing over, and $$k$$ is what we add to the count of each of them, so this way the distribution still sums to 1 after smoothing.
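To make the denominator argument concrete, here is a minimal sketch of Laplace smoothing over a tiny vocabulary. The function name, the word list, and the counts are all made up for illustration; only the formula (count + k) / (total + k|x|) comes from the lecture.

```python
from collections import Counter

def laplace_smoothed(counts, vocabulary, k=1):
    """Return smoothed probabilities (count + k) / (total + k * |vocabulary|)."""
    total = sum(counts.get(word, 0) for word in vocabulary)
    denominator = total + k * len(vocabulary)
    return {word: (counts.get(word, 0) + k) / denominator for word in vocabulary}

# Hypothetical data: "hello" never appears, so without smoothing it gets probability 0.
counts = Counter(["offer", "offer", "money"])
vocabulary = ["offer", "money", "hello"]

probs = laplace_smoothed(counts, vocabulary, k=1)
print(probs)
print(sum(probs.values()))  # should be 1 up to floating point error
```

We added k to each of the 3 numerators, so we have to add 3k to the denominator or the total would drift above 1.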

Naive Bayes:  Sort of odd to tell us NOW that what we're doing is Naive Bayes.  What makes this approach naive?  The key is the shape of the network.  See how y is the parent of all the x's?  That is great because then we know $$x_i \bot x_j | y$$, so $$P(x_{1...m}|y)$$ factors into the product of the individual $$P(x_i|y)$$, and we can use Bayes' rule easily.  This is the natural characterization for spam, which is why it doesn't look naive, but take a more complex example.  Let's say we want to determine the value of a hidden variable $$y_0$$.  This variable affects our observed x's as before, but there are other hidden variables $$y_{1...n}$$ that also affect our x's.  Now, when we try to compute $$P(x_{1...m}|y_0)$$, those x's are no longer conditionally independent!  (draw the picture, I promise)  Naive Bayes is simply the assumption that there is only 1 hidden variable, so we can use the conditional independence.
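Here is a rough sketch of what the conditional independence buys us in the spam setting: the score for a label is just the log prior plus a sum of per-word log likelihoods. The training messages, the function names, and the choice of k = 1 smoothing are my own assumptions, not the lecture's code.

```python
import math
from collections import Counter

def train(messages):
    """messages: list of (label, list_of_words) pairs."""
    label_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for label, words in messages:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def log_score(words, label, model, k=1):
    """log P(label) + sum_i log P(word_i | label), both Laplace smoothed."""
    label_counts, word_counts, vocabulary = model
    n_messages = sum(label_counts.values())
    score = math.log((label_counts[label] + k) / (n_messages + k * len(label_counts)))
    total_words = sum(word_counts[label].values())
    for word in words:
        # Independence given the label: each word contributes its own factor.
        likelihood = (word_counts[label][word] + k) / (total_words + k * len(vocabulary))
        score += math.log(likelihood)
    return score

def classify(words, model):
    label_counts = model[0]
    return max(label_counts, key=lambda label: log_score(words, label, model))

# Hypothetical training set.
model = train([
    ("spam", ["offer", "money"]),
    ("spam", ["money", "money"]),
    ("ham", ["hello", "friend"]),
    ("ham", ["lunch", "friend"]),
])
print(classify(["money", "offer"], model))
print(classify(["hello", "friend"], model))
```

If the x's were not conditionally independent given the label, the loop over words would not be valid, because we could no longer multiply the per-word terms.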

Cross validation: interesting conversation about experimental technique.  I wish someone had sat me down and told me that when I started graduate school.  You may wonder what the deal is with 10-fold cross validation.  The problem is that if you get unlucky, your cv set might be misleading, so by trying multiple cv sets and averaging the results, we might get a better value for our parameters.
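A minimal sketch of the k-fold idea, assuming the data is already shuffled. `k_fold_splits`, `cross_validate`, and the `fit_and_score` callback are hypothetical names I made up; the callback stands in for whatever "train on this, score on that" step you have.

```python
def k_fold_splits(n_examples, k=10):
    """Yield (train_indices, cv_indices); every example is in exactly one cv set."""
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        cv_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_examples))
        yield train_idx, cv_idx
        start += size

def cross_validate(data, fit_and_score, k=10):
    """Average the score over k different train/cv splits."""
    scores = []
    for train_idx, cv_idx in k_fold_splits(len(data), k):
        train = [data[i] for i in train_idx]
        cv = [data[i] for i in cv_idx]
        scores.append(fit_and_score(train, cv))
    return sum(scores) / len(scores)

# Show the splits for 6 examples and 3 folds.
for train_idx, cv_idx in k_fold_splits(6, k=3):
    print("train:", train_idx, "cv:", cv_idx)
```

To pick a parameter (say the smoothing k), you would run `cross_validate` once per candidate value and keep the one with the best average score.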

Loss:  Why squared?  There are a few reasons, but one practical one is that we want loss to never be negative, because it should measure a distance between a point and a function.  Otherwise 3-5 would be different from 5-3, even though they're the same distance apart.  Why not absolute value?  Because it isn't differentiable at 0, while the square is differentiable everywhere and easy to differentiate.
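A small sketch of why the derivative matters: with squared loss we can write the slope down directly and run gradient descent on it. The function names, step count, and learning rate are arbitrary choices of mine, and the "model" here is just a single constant prediction to keep things short.

```python
def squared_loss(target, prediction):
    return (target - prediction) ** 2

def squared_loss_slope(target, prediction):
    """Derivative of squared_loss with respect to the prediction."""
    return -2 * (target - prediction)

def absolute_loss(target, prediction):
    # Also symmetric, but its slope jumps from -1 to +1 at target == prediction.
    return abs(target - prediction)

def fit_constant(targets, steps=200, lr=0.1):
    """Gradient descent on the mean squared loss of a constant prediction."""
    c = 0.0
    for _ in range(steps):
        grad = sum(squared_loss_slope(t, c) for t in targets) / len(targets)
        c -= lr * grad
    return c

print(squared_loss(3, 5), squared_loss(5, 3))  # same distance either way
print(fit_constant([1.0, 2.0, 3.0, 6.0]))      # approaches the mean of the targets
```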

Wow, we're going awfully fast here, essentially just mentioning other techniques like logistic regression or regularization.  There's no way you could implement logistic regression from that discussion.  When he talks about sparsity in L1 vs. L2, the basic idea is that L1 tends to push weights all the way to zero, which works better when there are fewer data samples (ICML 2004).
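One way to see the sparsity difference is a one-weight toy problem, which is my own illustration and not from the lecture. Suppose the unregularized best weight is `a` and we minimize `(w - a)**2 / 2` plus a penalty. With an L1 penalty `lam * abs(w)` the minimizer is a soft threshold; with an L2 penalty `lam * w**2 / 2` it is a simple rescaling.

```python
def l1_shrink(a, lam):
    """Minimizer of (w - a)**2 / 2 + lam * abs(w): soft thresholding."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0

def l2_shrink(a, lam):
    """Minimizer of (w - a)**2 / 2 + lam * w**2 / 2: proportional shrinkage."""
    return a / (1.0 + lam)

for a in [2.0, 0.3, -0.1]:
    print(a, l1_shrink(a, 0.5), l2_shrink(a, 0.5))
```

Small weights get set to exactly zero under L1, while L2 only makes them smaller. That is the "sparsity" being talked about.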

No time to go into it now, but the point of the question at the end of perceptrons is about a very powerful technique called Support Vector Machines.  They try (among other things) to find the best separating line, as opposed to just some separating line as in perceptrons.
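For reference, here is a bare-bones sketch of the perceptron update, assuming labels in {-1, +1} and linearly separable data. The points, the function name, and the learning rate are invented for the example. Notice that it stops at the first line that separates the data, which is exactly the "just some separating line" behavior.

```python
def train_perceptron(points, labels, epochs=100, lr=1.0):
    """Return (weights, bias) after running the perceptron update rule."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified (or on the line): nudge toward y
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:  # every point is on the correct side, so stop
            break
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Hypothetical separable data.
points = [(1.0, 1.0), (2.0, 2.0), (-1.0, -1.0), (-2.0, -1.0)]
labels = [1, 1, -1, -1]
w, b = train_perceptron(points, labels)
print(w, b)
```

An SVM would instead pick the separating line with the largest margin to the nearest points, which this sketch makes no attempt to do.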

Wow, that's confusing: non-parametric techniques have a growing number of parameters?  What he means is that we save examples from the data, and those act like parameters in our model.  I prefer not to call the saved data "parameters".
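A nearest-neighbor classifier is a handy illustration of this (my choice of example, with made-up data). The "model" is nothing but the stored examples, so it grows with every data point we keep.

```python
def nearest_neighbor_predict(examples, query):
    """examples: list of (point, label). Return the label of the closest stored point."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(examples, key=lambda example: squared_distance(example[0], query))
    return label

# Hypothetical stored examples; there are no fitted weights anywhere.
examples = [((0.0, 0.0), "ham"), ((1.0, 0.0), "ham"), ((5.0, 5.0), "spam")]
print(nearest_neighbor_predict(examples, (4.0, 4.5)))
print(nearest_neighbor_predict(examples, (0.2, 0.1)))
```

Compare this with the perceptron, where the model is a fixed-size weight vector no matter how much data we train on.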

Not looking forward to the homework.  I only have an hour and a half left.