Following the world trend, I decided to complete a course in machine learning, and chose Natural Language Processing (NLP) from the Udacity nano-degree program for the following reasons:

  • The course required an intermediate level of machine learning techniques, including deep learning, so I didn’t have to start by learning the basics.
  • The subject was completely new to me, making it really “learning something new”.

I enjoyed the course throughout and learned a lot. In particular, deep learning felt like magic: transform data into tensors, apply a chain of non-linear functions to produce output tensors, minimise the error against the true values through gradient descent and back-propagation, and abracadabra… it works!

I was most intrigued by the fact that we can now tackle non-linear problems through deep learning. Those educated in classical subjects such as maths and theoretical physics may argue that the deep learning approach is not elegant, since it is difficult to see “why” the algorithm behaves in certain ways. In my opinion, however, it is a matter of perspective - in the end, what do I really understand? I don’t even know how the keyboard I am typing on works. But I know it works.

What matters most is that I have enjoyed and completed this course. Here is the certificate.


Review notes below (never complete)


Spam Classifier

Based on this and this

  • probabilistic classifier
    • generative classifier (e.g. Naive Bayes): models how a class could generate some input data; given an observation, it returns the class most likely to have generated that observation.
    • discriminative classifier (e.g. logistic regression): learns which features of the input are most useful to discriminate between the possible classes.

Naive Bayes classifier

  • About the multinomial naive Bayes classifier.
  • A document is turned into a bag of words: a set of (word, count) pairs.
  • Method: choose the class with the maximum posterior probability given document $d$:

    \[\hat{c} = \arg \max_{c\in C} P(c|d)\]
  • Bayes inference:

    \[\hat{c} = \arg \max_{c\in C} P(c|d) = \arg \max_{c} \frac{P(d|c)P(c)}{P(d)} = \arg \max_{c} P(d|c)P(c)\]
  • generative model because: (i) a class is sampled from $P(c)$ and (ii) words are generated by sampling from $P(d\vert c)$.
    • $P(d\vert c)$: likelihood
    • $P(c)$: prior
  • naive -> assume the word positions are conditionally independent given the class, so the likelihood factorises:

    \[\arg \max_{c} P(c) \prod_{i \in \mathrm{positions}} P(w_i | c)\]
  • as usual, take the log to avoid floating-point underflow and turn the product into a sum:
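
    \[\hat{c} = \arg \max_{c} \left[ \log P(c) + \sum_{i \in \mathrm{positions}} \log P(w_i | c) \right]\]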

  • Training for $P(c)$ and $P(w_i\vert c)$: use maximum likelihood estimates with $\color{red}\textrm{Laplace smoothing}$ (see the sketch after this list):

    \[\hat{P}(c) = \frac{N_c}{N_{\mathrm{doc}}}, \quad \hat{P}(w_i|c) = \frac{\mathrm{count}(w_i,c) + \color{red} 1}{\sum_{w\in V} \mathrm{count}(w, c) + \color{red} |V|}\]
  • comments:
    • unknown words (words in the test data but not in the training vocabulary)? remove them from the test doc.
    • stop words: remove them.
  • optimisations:
    • binary multinomial naive Bayes (binary NB): remove duplicate words from each doc, since occurrence is more important than frequency.
    • treatment of negation: prepend ‘NOT_’ to every word following ‘didn’t’, ‘doesn’t’, etc., until the next punctuation mark (see the sketch below).
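
A minimal sketch of the classifier described above, assuming documents are pre-tokenised lists of words; the class name `NaiveBayes` and its methods are my own choices, not code from the course:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, docs, labels):
        """Estimate log P(c) and per-class word counts from training data."""
        self.classes = set(labels)
        self.counts = defaultdict(Counter)   # count(w, c)
        self.vocab = set()
        for doc, c in zip(docs, labels):
            self.counts[c].update(doc)       # bag-of-words update per class
            self.vocab.update(doc)
        self.log_prior = {c: math.log(labels.count(c) / len(labels))
                          for c in self.classes}   # log(N_c / N_doc)

    def predict(self, doc):
        """Return arg max_c of log P(c) + sum_i log P(w_i | c)."""
        best_c, best_score = None, -math.inf
        for c in self.classes:
            denom = sum(self.counts[c].values()) + len(self.vocab)  # + |V|
            score = self.log_prior[c]
            for w in doc:
                if w not in self.vocab:      # unknown words: drop them
                    continue
                score += math.log((self.counts[c][w] + 1) / denom)  # Laplace smoothing
            if score > best_score:
                best_c, best_score = c, score
        return best_c
```

For binary NB it would be enough to change `self.counts[c].update(doc)` to `self.counts[c].update(set(doc))`, so each word counts at most once per document. The negation treatment could be a pre-processing step along these lines (the negation list here is illustrative, not exhaustive):

```python
import re

NEGATIONS = {"not", "no", "never", "didn't", "doesn't", "isn't"}

def mark_negation(tokens):
    """Prefix 'NOT_' to every token after a negation word, up to the next punctuation."""
    out, negated = [], False
    for tok in tokens:
        if re.fullmatch(r"[.,!?;:]+", tok):
            negated = False                  # punctuation ends the negated span
            out.append(tok)
        elif tok.lower() in NEGATIONS:
            negated = True
            out.append(tok)
        else:
            out.append("NOT_" + tok if negated else tok)
    return out
```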

Evaluation: Precision, Recall, F-measure

confusion matrix

|                 | gold positive       | gold negative  |                                  |
|-----------------|---------------------|----------------|----------------------------------|
| system positive | true positive       | false positive | Precision = tp/(tp+fp)           |
| system negative | false negative      | true negative  |                                  |
|                 | Recall = tp/(tp+fn) |                | accuracy = (tp+tn)/(tp+fp+tn+fn) |
  • precision & recall focus on the true positives, which are what we are looking for (unlike accuracy, which can look good on imbalanced data just by predicting the majority class).
  • F-measure (harmonic mean of precision and recall; there is a more general form, $F_\beta$, that weights the two):

    \[F = \frac{2PR}{P+R}\]
  • can be extended to any number of classes (e.g. by macro- or micro-averaging per-class scores).
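
The definitions above fit in a few lines of Python; this is a minimal sketch for binary labels (the function name `evaluate` and the 0/1 encoding, 1 = positive, are my own choices):

```python
def evaluate(gold, predicted):
    """Return (precision, recall, F1, accuracy) for binary 0/1 labels."""
    tp = fp = fn = tn = 0
    for g, p in zip(gold, predicted):
        if g == 1 and p == 1:
            tp += 1
        elif g == 0 and p == 1:
            fp += 1
        elif g == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(gold)
    return precision, recall, f1, accuracy
```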

Statistical Significance Testing

Unlike the standard tests, one can’t assume any particular distribution for an evaluation metric, so non-parametric approaches such as bootstrapping are required to compare two systems through p-values. The difference observed on the original test set (the bias) should be incorporated when counting bootstrap samples. A nice section to read.
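
A minimal sketch of a paired bootstrap test in this spirit, assuming each system has a per-example score (e.g. 1/0 correctness) on the same test set; the function name and default sample count are my own choices:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=10_000, seed=0):
    """p-value for the observed advantage of system A over system B."""
    rng = random.Random(seed)
    n = len(scores_a)
    delta = (sum(scores_a) - sum(scores_b)) / n          # observed difference (the bias)
    exceed = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]       # resample examples with replacement
        d = sum(scores_a[i] - scores_b[i] for i in idx) / n
        if d >= 2 * delta:   # i.e. d - delta >= delta: beats the bias already observed
            exceed += 1
    return exceed / n_samples
```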

Logistic Regression

  • discriminative model: attempts to compute $P(c \vert d)$ directly (in comparison, a generative model goes through the likelihood and the prior via Bayes’ rule). A minimal sketch follows below.
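
As a minimal illustration of computing $P(c \vert d)$ directly, here is the sigmoid-over-features form of binary logistic regression (the feature names and weight values below are made up for the example):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """P(c = 1 | d) = sigmoid(w . x + b) over hand-built document features."""
    z = bias + sum(weights[f] * x for f, x in features.items())
    return sigmoid(z)

# hypothetical features of one document and hypothetical learned weights
features = {"count_free": 3, "has_exclamation": 1}
weights = {"count_free": 1.2, "has_exclamation": 0.8}
print(predict_proba(features, weights, bias=-2.0))   # P(spam | d)
```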

Part of Speech Tagging with HMMs