## Monday, October 31, 2011

### Unsupervised learning 1-17

I'm a bit tired, but want to finish this off tonight.

Do you notice how defensive S. is getting about his quizzes? Some of the criticism hit home.

Finally, we have some technology to animate the algorithms.  That should really help.

"for the sake of this class, let's just care about it."  That one made me laugh.

I'm excited about Expectation Maximization  It's a cool algorithm, but generally pretty hard to understand.  I wonder if he'll just give the gaussian version... yep.

That gaussian looks like it crosses the x axis.  Be careful, the tails approach but do not cross the x axis.

Anyway,  if you don't already  know about gaussians, I would recommend just hitting the I believe button regarding the formula (but do pay attention to the derivative).

It should be noted that what he's showing and calling EM, is not the general EM algorithm, but a special version just for these "mixtures of gaussians."  There are many other versions, depending on the underlying distribution you use.

I'm stopping here because I'm not paying very good attention.  I'll finish in the morning.

### Homework #3

Jumped to the homework because I'm running out of time.  Mostly pretty straightforward, but measuring distances for knn was a royal pain on a video screen.

One tip on regression: when I went back to the lecture, I thought the denominator for $$w_1$$ was $$M \sum y_i^2 - (\sum x_i)^2$$  It is of course $$M \sum x_i^2 - (\sum x_i)^2$$.

### Machine Learning 10-37

Life caught up to me and I had several other deadlines pop up over this week.  Let's see if I can jam through the material before the homework is due.  This means I probably won't have time to work out the quizzes, sorry.

Laplace smoothing:  Smoothing is a funny word for it, no?  The idea is that when we have no data for a particular word, the distribution looks "sharp" because we have this sudden drop to 0.  By hallucinating some data, we get some probability in that case, and the distribution looks smooth.

The question in 15 shows us why the denominator has $$+ k |x|$$ in it.  Recall that the $$P(M)$$ must sum up to 1.  $$|x|$$ is the number of things we're summing, and the $$k$$ is what we add to each of them, so this way, after smoothing the distribution still sums to 1.

Naive Bayes:  Sort of odd to tell us NOW that what we're doing is Naive Bayes.  What makes this approach naive?  The key is the shape of the network.  See how y is the parent of all the x's?  That is great because then we know $$x_i \bot x_j | y$$ and we can use Bayes rule easily.  This is the natural characterization for spam, which is why it doesn't look naive, but take a more complex example.  Let's say we want to determine the value for a hidden variable $$y_0$$.  Now this variable affects our observed x's as before, but there are other hidden variables $$y_{1...n}$$ that also affect our x's.  Now, when we try to compute $$P(x_{1...m}|y)$$ those x's are no longer conditionally independent!  (draw the picture, I promise)  Naive Bayes is simply the assumption that there is only 1 hidden variable so we can use the conditional independence.

Cross validation: interesting conversation about experimental technique.  I wish someone sat me down and told me that when I started graduate school.  You may wonder  what the deal is with 10-fold cross validation is.  The problem is that if you get unlucky, your cv set might be misleading, so by trying multiple cv sets, we might get better value for our parameters.

Loss:  Why squared?  There are a few reasons, but one practical one is that we want loss to always be positive, because it should measure a distance between a point and a function.  Otherwise 3-5 would be different from 5-3, even though they're the same distance apart.  Why not absolute value?  Because you can't differentiate it, but square is easy to differentiate.

Wow, we're going awfully fast here, essentially just mentioning other techniques like logistic regression, or regularization.  There's no way you could implement logistic from that discussion. When he talks about sparsity in L1 vs. L2, the basic idea is that L1 works better when there are fewer data samples.ICML 2004.

No time to go in to it now, but the point of the question at the end of perceptions is about a very powerful technique called Support Vector Machines.  They try (among other things) to find the best separating line as opposed to just some separating line in perceptrons.

wow, that's confusing: non-parametric techniques have a growing number of parameters?  What he means is we save examples from the data, and those act like parameters in our model.  I prefer not to call the saved data "parameters".

Not looking forward to the homework.  I only have an hour and a half left.

## Wednesday, October 26, 2011

### Machine Learning 1- 9

ML is a reasonable place to go from probability, but I'm still a bit weirded out by having it so soon.

Someone else complained to me that Sebastian bobs his head too much when he talks.  I hadn't noticed it, but now that he mentioned it, I can't not.  Now I have passed on this little gift to you.

What is:  yes it is building models from data, but not just diagnostic models. We can also learn models that tell us consequences of actions that we use in search.

Let me guess, this question will be "all of the above."  Yep.

Stanley: This actually a good example of something that bothered me about robots in the mid-90s to early 2000s:  A lot of times what it seemed like we were doing was coming up with algorithms whose main objective was to overcome crappy sensors.  Look at the image from Stanley's camera- I can barely tell where the road is, so of course the robot is going to have a hard time.  The solution of overlaying with the laser data is great, until the cameras improve and we can find the road based just on that.  That is why I've always preferred working in simulation.

Taxonomy: That is a really good list, and a good point at that end, that this list encompasses a great deal and it would take years to fully understand all of what's written on that page.

Supervised learning:  Interesting, he gave us a warning this time that we haven't seen the answer to the question, but he wants to check our intuition.  I wonder if he received a lot of negative feedback about that.

SPAM: Something that should have been made clear from the outset.  you pick your dictionary first, and process all messages with that same dictionary.  This leaves you with very long, mostly 0 input vectors.  Notice the caveat at the bottom where he says "sport" and "sports".  That's very common.  It's called stemming: reducing all variations of a single word to a single word.  The idea being that since sport and sports and sporting and sporty all have similar meanings, it is useful to count them the same in the input vector.

Maximum Likelihood:  I know what maximum likelihood is, and I have no idea what he's talking about, entirely because he keeps changing his notation and his meanings of symbols.  Let's see if we can figure it out:
$$P(S) = \pi$$ means for $$\forall_i P(y_i = S) = \pi$$
The next bit means:
$$\forall_i P(y_i=S) = \pi \wedge P(y_i=H) = 1- \pi$$
I have never seen his notation to say this before.

The next line is a funny way to restate the previous one.  How can we use the value or $$y_i$$ to compute the $$P(y_i)$$? If we know the value of $$y_i$$, then the distribution is 0s and 1s for a particular i.  What he's trying to do is use $$y_i$$ as a switch, so he doesn't have to say =S or =H.  That is,:

if $$y_i=S, P(y_i=S) = \pi$$ and if  $$y_i=H, P(y_i=H) = 1-\pi$$.  This notation is drifting to perversion.

Now, the next line can be derived from the previous line, but it also can be derived from basic probability.  The probability of a set of independent trials is the product of the probability of the outcomes of the individual trials.

"we can also maximize the logarithm of this expression." True statement, not actually obvious how that works.  It is true because log is always increasing, so the in the existing function f(), if $$f(a)>f(b)$$ then $$\log f(a) > \log f(b)$$.  This is handy because dealing with logs is easier to take derivative.

I think I'll publish this post now to get it out.

## Wednesday, October 19, 2011

### Homework #2

1. Much easier than the in class quizzes were.
2. Much clearer questions than last homework, most likely because they are numeric.
3. Weird things happen when the priors are 0.5.
4. Big problem with this model of evaluation: no partial credit.  It's easy to have a small error in knowledge but get the problem wrong.  Usually this is made up for in automatically graded exams by asking lots of questions, so if you know 80% of the material you get 80% of the questions right.  But with few questions, you need to look at them more closely to evaluate the level of knowledge.

### Probabilistic Inference

Peter, that network is only familiar if you've read your or Judea Pearl's book.  The presentation doesn't really explain what's going on.  What do you mean by John or Mary "calls"?  For the record, the idea is that these are neighbors who call you at work when your home alarm goes off. This alarm could be caused either by a burglary or by an earthquake.

I don't understand this first question.  You just told us which were which, and you've even labeled them in the picture!  Obviously, you've just changed your definition of what these things mean and haven't told us.

Oh great, he's using different notation from Sebastian.  It would nice to be consistent.

Wait!  that's an outright error!

He says $$P(+b,+j,+m) = \sum_e \sum_a P(+b,+j,+m)$$  What he means is
$$P(+b,+j,+m) = \sum_e \sum_a P(+b,+j,+m,E,A)$$

After the last module being so much better, I'm really disappointed that this module is just as sloppy as the earlier ones.

ha ha!  moving around slips of paper?  Couldn't we use technology there?

What do you mean by "put the nodes together in a different order?"  There has been nothing so far that refers to the order of things in construction. Sebastian just assumed it was causal.  The idea of iterative construction of the network is rather non-intuitive.  We naturally want to build the network causally, so what we're doing here is showing that putting the network in a way that we didn't want to in the first place, results in a bad network structure.  And notice that through all that we didn't speed out our example in the slightest because it was already the best structure.

Quiz 9 is broken.  1 - 0.134 = 0.866 not 0.886.

So sampling will give us an estimate of a joint, but what does that have to do with a Bayes net?  It's presented as if sampling is a way to perform inference on a Bayes net fast, but what he showed was using sampling instead of a Bayes net to answer the same question.

"Sampling has an advantage over inference in that we know a procedure for coming up with at least an approximate value for the joint probability distribution, as opposed to exact inference where the computation may be very complex." What? Isn't a complex computation still a known procedure? He means that sampling get get us an approximate answer in a reasonable amount of time, whereas inference may not return in our lifetime.

In rejection sampling, why don't we just not generate the samples we don't want, rather than generate them and reject them?  That would actually work in the P(W|C) example he has there because C has no parents.   If C had parents, then its value would depend on them, so we can't just generate samples with -w.  See liklihood weighting coming up.

That Gibbs sampling discussion was really cursory.  He describes the process correctly, but doesn't tell us how to use it to answer these conditional questions we're talking about.  The key is if we want to know P(C|+r,+s), S and R are the "evidence" variables, so we don't vary them, and vary only C and W over time.  But wait, don't we have the same likelihood weighting problem that sampling C only reveals the prior?  No, that's what he left out.  We compute a new value for C based on its Markov blanket, in this case P(C|+r,+s).  But wait, that's what we're trying to compute!  yep, Gibbs sampling is no help here.  Not really much of an example is it?

I'm always a little bothered with the Monty Hall problem.  Not because I don't understand it, but because it just isn't related to anything else in the section.  We always present it because it's neat, not because it gives us any insight into AI.  I question our motives.  At least Peter tells us this up front.

As I said before, I'm a bit disappointed with this unit- it was too sloppy, and some important things were glossed over.

## Tuesday, October 18, 2011

### how AI parts are related (or: avoiding Bayes net whiplash)

I mentioned in today's post that it's hard to see how we get from search to Bayes nets, and the student can feel a little lost.  This post tries to explain AI as I see it.

Fundamentally, I see top-level AI as all Search.  We have a world state, actions that can be taken, a possible goal, and possible heuristics.  So where do diagnostic-style things, like Bayes nets fit in?  Bayes nets are all about figuring out what state we're in.   In the 'travel Romania' example, we know what out start state is, so we can begin searching immediately.  But what if we didn't?  We would need to figure out where we were before we could start to search.

Or to turn this upside-down, look at the car starting network.  Sure, the network tells us what is wrong (or at least what is most likely wrong) but it doesn't tell us what to do about it.  That's where search comes in.  Now that we know it's the alternator, we can start to search the actions we can take to formulate a plan of action.  Often in applications of Bayes nets, the only part of the system is the network itself, how is that related to search?  State space search is still being done, it's just by the humans involved, not the computer.  When we use a medical diagnosis system, it spits out a diagnosis, and then the doctor and patient think about various options for care (sort of an AI cyborg: intelligent decisions made part by humans, part by machines).  We could have let the treatment plan also be computer generated, and that would likely require good old state-space search.

So diagnosis is a tool so we know where we are in our state-space search.

(Warning: not all AI people agree with me on this, but they're wrong.)

### Probability in AI

This is interesting, probability in the second week.  Usually that comes much later, and we roll into games now, followed by planning.  Very curious to see how this will work.

Oh, not just probabilities, but opening with Bayes networks.  Bold move.  One issue I have with this is that there is no attempt to connect this material with last weeks material.  A little bit of topic whiplash is inflicted on the student.  Some of this is a reflection of how many AI people see the AI world: a collection of loosely related topics and techniques for solving problems that require "smart".  Personally I think that is a mistake.  I'll put up another posting on the topic.

He let the 2^16 bit slide by.  I think that deserves more attention as it is the primary motivation for these things.  In classical probability, you can ask questions about conclusions given all the data, like, what is the probability that the alternator is broken, given that there is gas, and no oil, and no lights, etc., but to do that, you have to look at all possible combinations of the data: gas=true, oil=true, light=true, and gas=true, oil=true, light=false, and gas=true, oil=false, light=true, etc.  That's where the 2^n comes from, where n is the number of variables.  The key thing about Bayes nets is that it encodes information about the problem in the form of irrelevance.  Whether or not there is oil in the  affects whether or not it will start, not not necessarily whether or not there is gas.  This informations allows us to consider smaller tables

Probability/Coin Flip.  Starting really basic here.  Sebastian really likes this "quiz before he tells you how to do something.  Am I the only person driven nuts by this technique?  It always strikes me the the teacher is trying to show off how much more he knows.

Wait, we went from kindergarten probability to dropping words like "joint" and "marginals" around?  Define your terms!  Joint probability: a table describing the probability of outcomes of one event AND another event: P(X,Y)={P(X=true and Y= true, P(X=true and Y= false, P(X=false and Y= true, P(X=false and Y= false}.

He's going really fast.  Was a probs and stats course required for this class?  A friend in the math department once told me that it took them weeks to just get across what a random variable is, and we've just  jumped to week 6.  Ah, I see, yes, all this is a prerequisite and we're just doing "review" right now.  Sorry, back to the show.

Actually, I think this is going rather well considering we're already supposed to know this.

Bayes Rule:  big point right there that I think should be emphasized- the relationship between causal and diagnostic inference.  if A "causes" B (within some probability) then P(B|A) is the causal conditional probability, and P(A|B) is the diagnostic (because we're trying to determine what caused B to take the value it did- what is making the car not start).  The diagnostic is often the question we're after, but the causal is the value we know, because we can run experiments on it.  Take people both with and without cancer and give them the test many times, now we know P(+|C).  One reason Bayes rule is helpful is that it turns a question we don't know the number for (diagnostic) into one that we do (causal).

This number of parameters question has two answers.  Does he want the number of parameters in the system or the minimal number we need?  For instance we don't need to store P(~B|A)  if we store P(B|A).  The answer is either 3 or 6, let's see what he means... ah 3.

ew, what just happened to his hand? Is that a camera effect?

Ack, I hate that normalizer.  It's never made anything easier for me and it always confuses students.  At least he's explaining it; the usual approach is the just throw it out there with an "obviously we can omit the denominator and just normalize" comment.

The 2-test questions require that you understand conditional independence, which I'm pretty sure wouldn't be in a basic probability class.  Now he's explaining it.  I'd be pretty ticked off if I spent 10 minutes trying the quiz, only to be told afterwards what I need to figure it out.

I just realized that since I'm getting the questions right, I'm missing some of the material in terms of the answer explanations.  Maybe I should start getting them wrong.

Ha! they got me on the P(R|S) question.  I spent 5 minutes trying to derive the formula before I remember they're independent! lol.

Interestingly, I did  P(R|H,S)  a different way.  I computed the joint probability, which can be read off the network: P(R|H,S) = (eta)P(R,H,S).  The joint is just P(R,H,S) = P(R)P(S)P(H|R,S)

(yeah, so eta is useful when I can't type fractions. sue me)

P(R|H) is a difficult question.  I used the same trick of computing the joint:
P(R|H) = (eta)P(R,H).  P(R,H) = P(R,H,S)+P(R,H,~S) = P(R)P(S)P(H|R,S) + P(R)P(~S)P(H|R,~S)

Apparently I have deeply impressed Sebastian but perhaps that's not really fair :)

General Bayes Net: now we see the trick I used above.  Another important point is that the compactness of the network not just saves space, but means we need to gather less data to figure out what the numbers are.  That's actually more important than space.

So why do I get all the hard questions right but am unable to add up numbers to 13?

D separation is actually a really useful idea in daily life.  There are many times in meetings where we all agree on some fact B, but get lost discussing how A affects C.  In certain crowds you can just say, "we've established B, A and C are D-separated, let's move on."

In summary, this unit was way harder than than the first two.  I wonder if this is supposed to be the weeder section to flick off some of the less dedicated students.  I wouldn't blame them if it was.  On the other hand, I felt that the presentation quality was much higher than before.  It will be interesting to see if this is a function of Sebastian caring more about this topic, or just them getting their acts together.  I still think a little more professionalism in the presentation, with animations etc., would really improve things
$$\sum_\alpha$$

## Wednesday, October 12, 2011

### Homework #1

The questions on problem formulation are underspecified, so there's no way to answer them.  These same horseshit questions are in the book as well, and they always cause more problems than they are worth.

Let's be precise about the ambiguity.  Loaded coin:  We know nothing about what the states or the actions are, or even how this pertains to search.  The definition of fully observable is "What the agent can sense at any point in time is sufficient to make the optimal decisions."  But there are no decisions to make: you keep flipping a coin until you get bored, and voila you have your estimate: (See here.) Thus the definition of fully observable does not even apply unless we have some sort of (unstated) cost function to optimize.

For the maze problem, he does not state what the states or actions are so we can't know if it is continuous or not.   Is this a robot in the maze? then it's continuous.  Can we assume actions take us to junctions in the maze? then it is discrete

Problem 6 is a trick question.  What do they hope to find out by asking that?

Problem 7 does not have the correct answer listed for first node to expand.  The first node to expand is a1.  Oh, btw, he didn't state what the actions are (again).

### Problem solving

Introduction:  Couldn't find an image of fog on a street?  Arad to Bucharest?  This example is straight out of the text.  It would be better to give a different example, so the student can see two examples instead of the same one twice.

What is a problem:  text getting a little cramped there.  How is this better than a power point slide?

Tree Search:  There are numbers on the edges, we know they are distances, so why are the lengths just # of hops?  Needs more discussion, especially with breadth first search-  You say it's shortest first, but in the algorithm you ignore the path cost- why?.  Wait, did we just re-write the tree search algorithm on the page to make it more cramped?  "The reason we keep track of the explored state..."  we just covered that and you argued that in tree search it is a natural way of looking at the problem to allow duplicates, but now you're saying we prune them out?  Which is it?

Graph Search: This whole transition from tree to graph was really awkward.

Say wouldn't it be great if I could see the code as you walk through the algorithm? Also would be helpful when I'm trying to answer the quiz questions.  Still haven't discussed the difference between path length and path cost.

Is unit 11 broken?  I can only play half of it.

Is he playing with candy wrappers?

Uniform cost: woah he just totally missed the discussion of whether to add Oradea from Sibiu. That's important.  Omitting it is a flat out error.

Search Comparison:  He said I was supposed to fill in the numbers of node expansion order, but I never got a chance to do so.  Wait, we're looking at the node expansion order of DFS w/o first looking at what the algorithm is?

More on Uniform Cost:  I like this.  The topological map model of how the search expands was a nice explanation for greedy best first search.

A*: no proof of optimality, just a hand wave?  Ooops, I made an arithmetic error on the quiz :-(

Peter needs a better 15 puzzle toy.  Those tiles don't move very well.

So, how would you relax the problem in the path finding problem?  how could the program get those numbers?  The argument that we can automatically generate heuristics only works in the sliding tile puzzle.

### Welcome to AI

The first class is the intro.  This usually sets the stage, and tries to get the student excited about AI.  Let's see how it goes...

Introduction: No new information, but they sure don't seem excited to be there.

Course Overview:  So this is going to be "board work" by writing on a piece of paper?  Rather low tech, but I'll roll with it.

Intelligent Agents: OK, I'm already annoyed.  There's a quiz that asked a question on material that has not been covered yet.  Yes, it's easy, but that is always super obnoxious.  There is no good purpose to that.

Applications of AI:  These diagrams are cramped and ugly.  Wait, I am the environment in a game agent example?  So the computer's move changes me in some way?  Wouldn't it be way more natural to say the game board is the environment?  or at least the game board + me?

The AI in medicine is in fact a great example of how this agent-based model is not sufficient for encompassing all of AI.  Do you really claim the the diagnosis output is an "actuator?"  You can claim that, it's a pretty big stretch from the natural definition of the word.
Web crawlers and web search page rankings are an example of AI?  hmmm...not sure the designers of them would agree at all.

Terminology:  Finally some real meat.  Why does state get an arrow?  does it somehow move information from the environment to the environment?  All the other arrows represented information transfer.  Game of "Pokes?"  really? This quiz is really poorly designed: what do you mean by poker?  There are many versions?  Is it really stochastic or is it that the deck is part of the state and not observable?  What do you mean by "robot car?"  that could be a million different things.  Student is supposed to guess what you're thinking?

AI and Uncertainty.  Where did this come from?

Machine Translation:  OK, Sebastian is out and Peter is in.  This is cooler.

Chinese Translation: Shout out to Adam Lopez! Hey, zoom in a little, I can't read the characters.

Summary: Overall I'm unimpressed so far.  There wasn't much here to get me excited about AI, and Sebastian seems to be phoning it.

### Mission

This semester, Stanford is trying an educational experiment by putting their Artificial Intelligence class entirely on-line (interest was so strong, they also added a Machine Learning and a Natural Language Processing class).  As an educator and AI person, I was naturally interested in the course and have joined the 58,000 or so other people to take the class.  Since I know the material pretty well, I'll be focusing on course structure and presentation style, to see what I can learn about teaching AI.  I will take all the tests and do all the homeworks because they are part of the entire package.

The class started yesterday and as I was watching the video, I realized I had things to say about the content, and thus, the creation of this blog.  Hopefully I will be able to keep this up through the whole semester.