Name is Random Forest also called decision forest or also called ensemble
----. In general, Forest means collection of trees right. The same case here. Random means which is not taken in continuous manner, the name itself says randomly or randomness.
Before going to random forest, it is necessary to know about decision trees, I am giving a brief description about it below because if you want to know about forest you should know what is tree right so:
lets explain with an example--
lets say you have 5 data points (3 red and 2 green)that is 3 are positive class and 2 are negative class and you can assume height and weight. start with the root node(number 0) , as we go down ,you need to take a decision like height>2m goes to left node (number 1) and height <2m goes to right node (number 2) Now, here again you need to take decision at node 2 like weight <200 goes to left node and greater than goes to right node. keep on taking decision at every node until you reach leaf node. you will end up with red points or green points. Now, if you feed the tree with new point and take decision at every node, you end up with node having red point or green point and the new point belongs to that.
decision tree can be constructed either breadth wise or depth wise. here I am talking about breadth wise. so you know how classification tree is built. lets see how random tree is built:
----------Random Forest:
say you have d=3 features, h= 5 data points .pick 2 features at random from the data and point out that in graph(for easiness) observe the image below:
For every split point, evaluate information gain using the formula of entropy.you choose the split point that has the highest information gain for splitting. if there is a tie,break it random. why we are saying random is. here i mentioned d= 3. so here no problem but if d has thousand and more. for every point you need to calculate information gain. computation power is high. but by choosing subsets at random, it reduces the computation time.
we construct a tree randomly.tree in one machine and another tree in other machine and so on its completely a distributed algorithm. Here goes the Random Forest algorithm:
The algorithm has two sources of randomness one is randomness in the data and randomness in the split. lets see discuss algorithm step by step:
l. let b is a node, a single tree, we are going to build a forest that is B. Now, how we build each tree
a) we have a data set of N points.(data = (X1 Y1).......(Xn Yn) In order to get more randomness, each tree will get different data-sets. we take N data and with replacement, we draw uniformly at random this N data. this is something called bootstrapping or we can say related to bagging.(something complexity control) bagging means you construct tree randomly on different subsets of data
suppose you have 3 trees in the forest t=1, t=2 and t=3. each tree is constructed on a different data-set. now say you got a new point v(you don't know what this is like red point or green point) just follow the tree that makes decisions and you end up with say green point in first tree. second tree also gives green point and third tree gives red point-- you take majority of all.
p(c/v) = 1/T sigma t=1 to T Pt(c/v)
here c means it is of which color, v is new point, sigma of something it is individual trees and averaging all. averaging is done at the bottom of the tree. so in technical terms, In a forest with T trees we have t belongs to 1...T. All trees are trained independently(and possibly in parallel). During testing, each test point v is simultaneously pushed through all trees(starting a t root node) until it reaches the corresponding leaves.
b) this I already discussed with d=3 features here in algorithm it is p features. we choose subset of size m of p features we then choose the split point the best split point we use to split the two daughter nodes.
2. the output of the algorithm is just all the trees. combine the trees by averaging.
I mentioned above the averaging and splitting.
If you have iPhone. you will see an app called kinect. This app uses this algorithm :)