The term “big data” was everywhere in 2017, and it is going to stay that way in the coming years. In our previous post, I introduced some concepts about big data, machine learning, and data mining (see post: Understanding Big data, Data mining, and Machine Learning in 5 Minutes). Now let's dig deeper into machine learning with a brief walk-through of some of the most commonly used ML algorithms: no abstract theories, just pictures and some examples of how they are used.
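This article covers the following algorithms: decision tree, random forest, logistic regression, support vector machine, naive Bayes, k-nearest neighbors, k-means, Adaboost, neural network, and Markov chain.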
1. Decision Tree
A decision tree classifies a set of data into different groups using certain attributes. A test is executed at each node, and each branch decision splits the data further into two distinct groups, and so on. The tests are learned from existing data, so when new data comes in, the computer can sort it into the right leaf.
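As a minimal sketch of how new data travels through such a tree, the snippet below hard-codes a small tree and walks a sample from the root to a leaf. The feature names (`age`, `income`), the thresholds, and the group labels are hypothetical and chosen only for illustration; a real tree would learn its tests from existing data.

```python
# A hand-written tree: every inner node holds one test, every leaf holds a label.
# All features, thresholds, and labels here are hypothetical.
TREE = {
    "feature": "age", "threshold": 30,
    "left": {"label": "group A"},                      # age <= 30
    "right": {                                         # age > 30
        "feature": "income", "threshold": 50000,
        "left": {"label": "group B"},                  # income <= 50000
        "right": {"label": "group C"},                 # income > 50000
    },
}

def classify(node, sample):
    """Follow the test at each node until a leaf is reached."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]
```

Each call answers one yes/no question per level, which is why a new sample only needs as many tests as the tree is deep.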
2. Random Forest
Select samples randomly from the original data to form different subsets.
Matrix S is the original data. It contains data rows 1 to N; A, B, and C are the features, and the last column, C, stands for the category.
Create random subsets from S; let’s say we get M subsets.
We then build M decision trees from these subsets.
Feed new data into these trees to get M results, count which result appears most often across the M trees, and take that as the final result.
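The two steps that distinguish a random forest from a single tree, drawing random subsets and taking a majority vote, can be sketched as follows. This is an illustration under assumptions: rows are sampled with replacement, and the "trees" are stand-in functions rather than trained decision trees.

```python
import random
from collections import Counter

def random_subsets(rows, m, size, seed=0):
    """Draw m random subsets of the rows (assumption: sampling with replacement)."""
    rng = random.Random(seed)
    return [[rng.choice(rows) for _ in range(size)] for _ in range(m)]

def majority_vote(trees, sample):
    """Ask every tree for its result and return the most common one."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Stand-in "trees": any callable that maps a sample to a category would work here.
example_trees = [lambda s: "cat", lambda s: "dog", lambda s: "cat"]
example_result = majority_vote(example_trees, {"claws": 1})
```

With three stand-in trees voting cat, dog, cat, the forest returns "cat", because that result appears most often.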
3. Logistic Regression
When the prediction target is a probability, its value must be greater than or equal to 0 and less than or equal to 1, which a simple linear model cannot guarantee: if the input domain is not restricted, the output falls outside that interval.
We are better off with a model whose output always stays within that interval.
So how can we get this model?
The model needs to satisfy two conditions: its output must be greater than or equal to 0, and less than or equal to 1.
Transforming the formula to satisfy both conditions gives the logistic regression model.
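One common way to carry out this transformation is sketched below, assuming a single input x with coefficients w_0 and w_1 (these symbols are not in the original). Exponentiating takes care of the lower bound, and dividing by a slightly larger quantity takes care of the upper bound:

```latex
\begin{aligned}
z &= w_0 + w_1 x && \text{(linear model, unbounded)} \\
e^{z} &> 0 && \text{(exponentiating keeps the value non-negative)} \\
p &= \frac{e^{z}}{e^{z} + 1} = \frac{1}{1 + e^{-z}} && \text{(dividing by a larger quantity keeps it below 1)} \\
\ln\frac{p}{1 - p} &= z = w_0 + w_1 x && \text{(equivalent log-odds form)}
\end{aligned}
```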
Fitting the model to the original data gives the corresponding coefficients.
Plotting the fitted model gives the logistic curve.
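As a rough sketch of how the coefficients could be obtained, the snippet below fits a one-feature logistic model with plain gradient descent on the log-loss. The data, learning rate, and number of epochs are hypothetical, and gradient descent is only one of several possible fitting methods.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p = sigmoid(w0 + w1 * x) by gradient descent on the mean log-loss."""
    w0, w1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w0 + w1 * x) - y   # prediction minus label
            g0 += err
            g1 += err * x
        w0 -= lr * g0 / n
        w1 -= lr * g1 / n
    return w0, w1

# Hypothetical data: one feature and a 0/1 outcome.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
w0, w1 = fit_logistic(xs, ys)
p_low, p_high = sigmoid(w0 + w1 * 0.5), sigmoid(w0 + w1 * 4.0)
```

Whatever coefficients come out, every prediction stays strictly between 0 and 1, which is the property the linear model could not offer.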
4. Support Vector Machine
To separate two classes with a hyperplane, the best choice is the hyperplane that leaves the maximum margin to both classes. Since Z2 > Z1, the green one is better.
Express the hyperplane with a linear equation: for the class above the line the value is greater than or equal to 1, and for the other class it is less than or equal to -1.
Calculate the distance from a point to the hyperplane using the equation in the graph.
This gives the expression for the total margin. The aim is to maximize the margin, which means we need to minimize the denominator.
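In standard notation, and assuming the hyperplane is written with a weight vector w and an offset w_0 (symbols that the figures are assumed to use), these steps read:

```latex
\begin{aligned}
g(\mathbf{x}) &= \mathbf{w}^{\top}\mathbf{x} + w_0, \qquad
g(\mathbf{x}) \ge 1 \ \text{for one class}, \quad g(\mathbf{x}) \le -1 \ \text{for the other} \\
\text{distance}(\mathbf{x}) &= \frac{|g(\mathbf{x})|}{\lVert \mathbf{w} \rVert} \\
\text{total margin} &= \frac{1}{\lVert \mathbf{w} \rVert} + \frac{1}{\lVert \mathbf{w} \rVert}
= \frac{2}{\lVert \mathbf{w} \rVert}
\end{aligned}
```

Maximizing the total margin is therefore the same as minimizing the norm of w, which is the denominator.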
For example, we use 3 points to find the optimal hyperplane. Define the weight vector along the direction (2, 3) - (1, 1) = (1, 2).
This gives the weight vector (a, 2a). Substitute these two points into the equation.
Once a is determined, substituting it into (a, 2a) gives the weight vector.
Substituting a and w0 into the equation gives the support vector machine.
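The arithmetic of this example can be checked with a few lines. The sketch assumes that (1, 1) lies on the side where the equation equals -1 and (2, 3) on the side where it equals 1; the text does not state which point belongs to which class, so that assignment is an assumption.

```python
from fractions import Fraction

def solve_two_point_svm(neg, pos):
    """Solve w . neg + w0 = -1 and w . pos + w0 = 1 with w = a * (pos - neg)."""
    d = (pos[0] - neg[0], pos[1] - neg[1])       # direction of w, here (1, 2)
    s_neg = d[0] * neg[0] + d[1] * neg[1]        # d . neg
    s_pos = d[0] * pos[0] + d[1] * pos[1]        # d . pos
    a = Fraction(2, s_pos - s_neg)               # subtracting the two equations
    w0 = -1 - a * s_neg                          # back-substitute into the first
    return a, w0, (a * d[0], a * d[1])

a, w0, w = solve_two_point_svm((1, 1), (2, 3))
```

Under these assumptions the solution is a = 2/5 and w0 = -11/5, so the weight vector is (2/5, 4/5) and the hyperplane can be written as x1 + 2*x2 - 5.5 = 0.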
5. Naive Bayes
Here’s an example from NLP:
Given a piece of text, determine whether its attitude is positive or negative.
To solve the problem, we can look at only some of the words.
The text is then represented by only those words and their counts.
The original question is: given a sentence, which category does it belong to?
Using Bayes’ rule, this becomes an easy question.
The question becomes: within a given class, what is the probability that this sentence occurs? Remember not to forget the other two probabilities in the equation, the prior probability of the class and the overall probability of the sentence.
Example: the probability of occurrence of the word “love” is 0.1 in the positive class, and 0.001 in the negative class.
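A minimal sketch of this calculation follows. The two probabilities for “love” are taken from the example above; the entry for “boring” and the equal class priors are hypothetical values added only so the code has something to compare against.

```python
# P(word | class). "love" comes from the text; "boring" is a hypothetical entry.
LIKELIHOOD = {
    "love": {"positive": 0.1, "negative": 0.001},
    "boring": {"positive": 0.001, "negative": 0.05},
}
# P(class). Equal priors are an assumption.
PRIOR = {"positive": 0.5, "negative": 0.5}

def posterior(words):
    """Return P(class | words) using Bayes' rule with independent words."""
    scores = {}
    for label, prior in PRIOR.items():
        score = prior
        for word in words:
            if word in LIKELIHOOD:          # words outside the table are ignored
                score *= LIKELIHOOD[word][label]
        scores[label] = score
    total = sum(scores.values())            # P(words), the normalizing term
    return {label: score / total for label, score in scores.items()}

def classify(words):
    probs = posterior(words)
    return max(probs, key=probs.get)
```

For instance, `classify("i love this film".split())` picks the positive class, because “love” is a hundred times more likely there and the priors are equal.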
6. K-Nearest Neighbors
When a new data point comes in, it is assigned to the category that has the most points among its nearest neighbors.
For example, to distinguish “dog” from “cat”, we judge by two features, “claws” and “sound”. Circles and triangles are the known categories; which one does the “star” belong to?
When K=3, the three lines connect the star to its 3 nearest points. Circles are in the majority, so the “star” belongs to “cat”.
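The vote among the K nearest points can be sketched as below. The coordinates are hypothetical stand-ins for the “claws” and “sound” features, arranged so that the nearest three points contain two cats and one dog.

```python
import math
from collections import Counter

def knn_classify(known, query, k=3):
    """Return the most common label among the k known points closest to query."""
    nearest = sorted(known, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (claws, sound) measurements with known categories.
KNOWN = [
    ((1.0, 1.0), "cat"), ((1.5, 2.0), "cat"), ((2.0, 1.5), "cat"),
    ((3.0, 2.0), "dog"), ((5.0, 5.0), "dog"), ((6.0, 5.5), "dog"),
]
STAR = (2.5, 2.0)
star_label = knn_classify(KNOWN, STAR, k=3)
```

Note that the choice of K matters: with these made-up points the single nearest neighbor is a dog, while the vote among three neighbors goes to cat.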
7. K-means
Suppose we want to separate the data into 3 classes; the pink part is the largest, while the yellow is the smallest.
Pick 3, 2, and 1 as the initial centers, calculate the distance between each remaining data point and these centers, and assign each point to the class whose center is nearest.
After the assignment, calculate the mean of each class and set it as the new center.
After some rounds, we can stop when the classes no longer change.
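The loop described above can be sketched for one-dimensional data as follows. The initial centers 3, 2, and 1 come from the text; the data values themselves are hypothetical.

```python
def kmeans_1d(values, centers, max_rounds=100):
    """Alternate between assigning points to the nearest center and re-centering."""
    centers = list(centers)
    labels = None
    for _ in range(max_rounds):
        # Assignment step: index of the closest center for every value.
        new_labels = [
            min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            for v in values
        ]
        if new_labels == labels:      # classes no longer change, so stop
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for i in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == i]
            if members:               # keep the old center if a class is empty
                centers[i] = sum(members) / len(members)
    return labels, centers

DATA = [1.0, 1.2, 0.8, 2.0, 2.1, 1.9, 3.0, 3.3, 3.6, 4.0]   # hypothetical values
labels, centers = kmeans_1d(DATA, centers=[3, 2, 1])
```

On this small data set the assignment is already stable after the first update, so the loop stops in its second round.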
8. Adaboost
Adaboost is one method of boosting.
Boosting gathers classifiers that do not give satisfactory results individually and combines them into a classifier that may perform better.
As shown below, tree 1 and tree 2 do not perform well individually, but if we feed them the same data and sum up their results, the final result is more convincing.
As an example of Adaboost, in handwriting recognition the panel can extract many features, such as the starting direction of a stroke, the distance between the starting point and the ending point, and so on.
During training, the machine learns a weight for each feature. For example, the digits 2 and 3 start in a very similar way, so this feature does little for classification and its weight is small.
But this alpha angle is highly distinctive, so the weight of this feature will be large. The final outcome is a result of considering all of these features.
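The final combination step can be sketched as a weighted vote. The weight formula below is the one commonly used in AdaBoost, 0.5 * ln((1 - error) / error); the votes and error rates are hypothetical, and the training loop that produces the error rates is left out.

```python
import math

def classifier_weight(error_rate):
    """Weight of a weak classifier: the lower its error, the larger its say."""
    return 0.5 * math.log((1.0 - error_rate) / error_rate)

def weighted_vote(votes, error_rates):
    """Combine +1/-1 votes from weak classifiers into one final decision."""
    score = sum(classifier_weight(e) * v for v, e in zip(votes, error_rates))
    return 1 if score >= 0 else -1

# Hypothetical case: one accurate classifier says +1, two weak ones say -1.
final = weighted_vote(votes=[1, -1, -1], error_rates=[0.1, 0.4, 0.45])
```

A plain majority vote would return -1 here, but the weighted vote returns +1 because the accurate classifier carries more weight than the two weak ones combined.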
9. Neural Network
In a neural network (NN), an input may end up in one of at least two classes.
A neural network is formed of neurons and the connections between them.
The first layer is the input layer, and the last layer is the output layer.
The hidden layers and the output layer each have their own classifiers.
When an input enters the network and a node is activated, the calculated score is passed on to the next layer. The scores shown in the output layer are the scores for each class. The example below gets the result of class 1.
The same input passed to different nodes generates different scores, because each node has its own weights and bias. This is forward propagation.
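A minimal sketch of this forward pass, with one hidden layer of two nodes and two output classes, is shown below. All weights, biases, and inputs are hypothetical, and the sigmoid activation is an assumption, since the text does not name an activation function.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """Each node computes its own weighted sum plus bias, then activates."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

# Hypothetical parameters: one row of weights and one bias per node.
HIDDEN_W, HIDDEN_B = [[0.5, -0.2], [0.3, 0.8]], [0.1, -0.1]
OUTPUT_W, OUTPUT_B = [[1.2, 0.4], [-0.6, 0.3]], [0.0, 0.0]

def forward(inputs):
    hidden = layer_forward(inputs, HIDDEN_W, HIDDEN_B)    # input -> hidden
    scores = layer_forward(hidden, OUTPUT_W, OUTPUT_B)    # hidden -> output
    return hidden, scores

hidden, scores = forward([1.0, 0.5])
predicted_class = scores.index(max(scores)) + 1           # classes numbered from 1
```

Both hidden nodes receive the same input but produce different values because their weights and biases differ, and with these made-up parameters the first output score is the larger one.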
10. Markov Chain
A Markov chain consists of states and transitions.
For example, let’s build a Markov chain based on “the quick brown fox jumps over the lazy dog”.
First, we set every word as a state, and then we calculate the probabilities of the state transitions.
These are the probabilities calculated from one single sentence. When you train the computer on a massive amount of text, you get a bigger state transition matrix, for example all the words that can follow “the” and their corresponding probabilities.
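The transition probabilities for the example sentence can be computed with a short sketch like this one. Splitting on whitespace and lowercasing are simplifying assumptions.

```python
from collections import Counter, defaultdict

def transition_probabilities(text):
    """Count which word follows which, then turn the counts into probabilities."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }

probs = transition_probabilities("the quick brown fox jumps over the lazy dog")
```

In this sentence “the” is followed once by “quick” and once by “lazy”, so each of those transitions gets probability 0.5, while every other word has a single follower with probability 1.0.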