Tuesday, November 15, 2011

HMMs and Filters

Yes, I've been AWOL for a week or so, while life got the better of me.   Picking back up with HMMs.  I'll try to get to the sections I missed before the midterm.

Something he hasn't said yet, but HMMs  can be thought of as a special-case of Bayes nets.  OK, now he's said it.  This special-case-ness took us a while to notice.  For years they were studied separately.

In the underground robot example, Sebastian mentions "noise in [the] motors."  That might be an odd phrase to hear, because we think of noise as something applying to a signal, and motors don't have a signal.  He's using the word noise metaphorically, referring to the noise in sensors.  Errors in motors occur when wheels slip, or bumps in the floor nudge the robot off its path.

Markov Chain questions:  note how using the law of total probability at each state is actually really efficient compared to doing the computation by treating the sequence as the outcome.  That is, we could say that the probability of rain on day three is the sum of all possible ways we could get rain on day 3:
\[P(R_0R_1R_2R_3) +
P(R_0S_1R_2R_3) +
P(R_0R_1S_2R_3) +
The nice things about the Markov model is that we only need to look at the previous day and the transitions, to compute the current probability.  This reduces the number of mutliplications you need to make.

Nice discussion of particle filters.  I'm going to need to update what I do in Robotics next semester.  Overall, this lecture was pretty tight.

Tuesday, November 1, 2011


I usually find that people who like probabilities hate logic, and vice versa.  I was a logic person to start with, and it took years of training to finally learn to love probabilities.  This might be an unpleasant switch for some.

The first quiz is a little misleading because he appeals to what this means in English.  The sentence "If 5 is an odd number, then Paris is the capital of France," is false, since we know there is no connection.  He is actually asking about the propositional logic truth values.

It is kind of a cop out to say that implication \(P \Rightarrow Q\) is just defined that way.  There are really good reasons for it.  To understand, you first need to accept that the truth value of a sentence has nothing to do with whether it is a coherent idea in the real world, and it says nothing about the truth of it's component  pieces., it only checks to see if it is consistent with every else we know.  If I tell you that if it is raining, I bring my umbrella, it is not raining, and I did bring my umbrella anyway:
&R \Rightarrow U\\
&\neg R\\

What do we want to say about the truth of \(R \Rightarrow U\)?  is that sentence suddenly false just because I brought my umbrella on a sunny day?  Does that mean I won't the next time it rains?  Of course not, just because \(R\) is false doesn't make that rule suddenly untrue.  It is still consistent with the other stuff we know.  The only way we can say for sure that the rule is false is if we see a counter-example: \(R \wedge \neg U\).  Now, if that's true, then we know the rule as stated must be false.

Another way to think about this is to look at the truth tables.  What if we said that for the implication \(P \Rightarrow Q\) when P is false, the whole thing is false?  In that case, the truth table would be identical to the "and" truth table.  Do we really want to say that \(P \Rightarrow Q\) means the same thing as \(P \wedge Q\)?  "If P is true then Q is true" means the same thing as "both P and Q are true?"  What about if we said that the implication is true when both P and Q are false, but false when P is false and Q is true?  Then if we look at the truth table, that's the same as equivalence: \(P \Leftrightarrow Q\).  Do we really want to say that "If P is true then Q is true" means the same thing as "P and Q are the same"?  The table we have is our only choice.

First order logic: Logics are deceptive.  They're really easy to describe but they have lots of thorny implications.  It's easy to watch this and thing you get it perfectly, but really you know nothing yet.

He just said \(\forall_x Vowel(x) \Rightarrow NumberOf(x) = 1\) is a valid statement.  He means valid in the loose English sense.  It is not always true for all models.

That final quiz is a good example of how logic seems simple but gets tricky pretty fast.

Unsupervised learning 18-30

What he didn't say about the eigenvectors is that they are pulled from the covariance matrix that describes the shape of the gaussian.

He just dropped the phrase "principle component analysis."  That's just the name for finding that largest eigenvector.

I notice he isn't telling us how to map the data down to the smaller space once we find it.  Maybe that will come later?  Guess not.

I honestly had no idea about spectral clustering.  That was pretty cool.

Monday, October 31, 2011

Unsupervised learning 1-17

I'm a bit tired, but want to finish this off tonight.

Do you notice how defensive S. is getting about his quizzes? Some of the criticism hit home.

Finally, we have some technology to animate the algorithms.  That should really help.

"for the sake of this class, let's just care about it."  That one made me laugh.

I'm excited about Expectation Maximization  It's a cool algorithm, but generally pretty hard to understand.  I wonder if he'll just give the gaussian version... yep.

That gaussian looks like it crosses the x axis.  Be careful, the tails approach but do not cross the x axis.

Anyway,  if you don't already  know about gaussians, I would recommend just hitting the I believe button regarding the formula (but do pay attention to the derivative).

It should be noted that what he's showing and calling EM, is not the general EM algorithm, but a special version just for these "mixtures of gaussians."  There are many other versions, depending on the underlying distribution you use.

I'm stopping here because I'm not paying very good attention.  I'll finish in the morning.

Homework #3

Jumped to the homework because I'm running out of time.  Mostly pretty straightforward, but measuring distances for knn was a royal pain on a video screen.

One tip on regression: when I went back to the lecture, I thought the denominator for \(w_1\) was \(M \sum y_i^2 - (\sum x_i)^2\)  It is of course \(M \sum x_i^2 - (\sum x_i)^2\).

Machine Learning 10-37

Life caught up to me and I had several other deadlines pop up over this week.  Let's see if I can jam through the material before the homework is due.  This means I probably won't have time to work out the quizzes, sorry.

Laplace smoothing:  Smoothing is a funny word for it, no?  The idea is that when we have no data for a particular word, the distribution looks "sharp" because we have this sudden drop to 0.  By hallucinating some data, we get some probability in that case, and the distribution looks smooth.

The question in 15 shows us why the denominator has \(+ k |x|\) in it.  Recall that the \(P(M)\) must sum up to 1.  \(|x|\) is the number of things we're summing, and the \(k\) is what we add to each of them, so this way, after smoothing the distribution still sums to 1.

Naive Bayes:  Sort of odd to tell us NOW that what we're doing is Naive Bayes.  What makes this approach naive?  The key is the shape of the network.  See how y is the parent of all the x's?  That is great because then we know \(x_i \bot x_j | y\) and we can use Bayes rule easily.  This is the natural characterization for spam, which is why it doesn't look naive, but take a more complex example.  Let's say we want to determine the value for a hidden variable \(y_0\).  Now this variable affects our observed x's as before, but there are other hidden variables \(y_{1...n}\) that also affect our x's.  Now, when we try to compute \(P(x_{1...m}|y)\) those x's are no longer conditionally independent!  (draw the picture, I promise)  Naive Bayes is simply the assumption that there is only 1 hidden variable so we can use the conditional independence.

Cross validation: interesting conversation about experimental technique.  I wish someone sat me down and told me that when I started graduate school.  You may wonder  what the deal is with 10-fold cross validation is.  The problem is that if you get unlucky, your cv set might be misleading, so by trying multiple cv sets, we might get better value for our parameters.

Loss:  Why squared?  There are a few reasons, but one practical one is that we want loss to always be positive, because it should measure a distance between a point and a function.  Otherwise 3-5 would be different from 5-3, even though they're the same distance apart.  Why not absolute value?  Because you can't differentiate it, but square is easy to differentiate.

Wow, we're going awfully fast here, essentially just mentioning other techniques like logistic regression, or regularization.  There's no way you could implement logistic from that discussion. When he talks about sparsity in L1 vs. L2, the basic idea is that L1 works better when there are fewer data samples.ICML 2004.

No time to go in to it now, but the point of the question at the end of perceptions is about a very powerful technique called Support Vector Machines.  They try (among other things) to find the best separating line as opposed to just some separating line in perceptrons.

wow, that's confusing: non-parametric techniques have a growing number of parameters?  What he means is we save examples from the data, and those act like parameters in our model.  I prefer not to call the saved data "parameters".

Not looking forward to the homework.  I only have an hour and a half left.

Wednesday, October 26, 2011

Machine Learning 1- 9

ML is a reasonable place to go from probability, but I'm still a bit weirded out by having it so soon.

Someone else complained to me that Sebastian bobs his head too much when he talks.  I hadn't noticed it, but now that he mentioned it, I can't not.  Now I have passed on this little gift to you.

What is:  yes it is building models from data, but not just diagnostic models. We can also learn models that tell us consequences of actions that we use in search.

Let me guess, this question will be "all of the above."  Yep.

Stanley: This actually a good example of something that bothered me about robots in the mid-90s to early 2000s:  A lot of times what it seemed like we were doing was coming up with algorithms whose main objective was to overcome crappy sensors.  Look at the image from Stanley's camera- I can barely tell where the road is, so of course the robot is going to have a hard time.  The solution of overlaying with the laser data is great, until the cameras improve and we can find the road based just on that.  That is why I've always preferred working in simulation.

Taxonomy: That is a really good list, and a good point at that end, that this list encompasses a great deal and it would take years to fully understand all of what's written on that page.

Supervised learning:  Interesting, he gave us a warning this time that we haven't seen the answer to the question, but he wants to check our intuition.  I wonder if he received a lot of negative feedback about that.

SPAM: Something that should have been made clear from the outset.  you pick your dictionary first, and process all messages with that same dictionary.  This leaves you with very long, mostly 0 input vectors.  Notice the caveat at the bottom where he says "sport" and "sports".  That's very common.  It's called stemming: reducing all variations of a single word to a single word.  The idea being that since sport and sports and sporting and sporty all have similar meanings, it is useful to count them the same in the input vector.

Maximum Likelihood:  I know what maximum likelihood is, and I have no idea what he's talking about, entirely because he keeps changing his notation and his meanings of symbols.  Let's see if we can figure it out:
\(P(S) = \pi\) means for \(\forall_i P(y_i = S) = \pi\)
The next bit means:
\(\forall_i P(y_i=S) = \pi \wedge P(y_i=H) = 1- \pi\)
I have never seen his notation to say this before.

The next line is a funny way to restate the previous one.  How can we use the value or \(y_i\) to compute the \(P(y_i)\)? If we know the value of \(y_i\), then the distribution is 0s and 1s for a particular i.  What he's trying to do is use \(y_i\) as a switch, so he doesn't have to say =S or =H.  That is,:

if \(y_i=S, P(y_i=S) = \pi\) and if  \(y_i=H, P(y_i=H) = 1-\pi\).  This notation is drifting to perversion.

Now, the next line can be derived from the previous line, but it also can be derived from basic probability.  The probability of a set of independent trials is the product of the probability of the outcomes of the individual trials.

"we can also maximize the logarithm of this expression." True statement, not actually obvious how that works.  It is true because log is always increasing, so the in the existing function f(), if \(f(a)>f(b)\) then \(\log f(a) > \log f(b)\).  This is handy because dealing with logs is easier to take derivative.

I think I'll publish this post now to get it out.