Thursday, August 9, 2018

Machine Learning 1. Decision Tree (also Classification Tree or Regression Tree).

Decision Tree (DT) - a decision support tool built from "nodes" and "branches". Each internal node tests one parameter of an object; such a test is called a "split", and in a binary tree each split produces two branches. A branch either leads to another split or ends in a "leaf", and each leaf holds one of the possible predicted values.
DT construction uses a heuristic algorithm, i.e. a practical method that is not guaranteed to produce an accurate or optimal tree, but is sufficient to solve the problem at hand.
Decision Tree Classification - helps to find the class of an object using the available characteristics of that object, i.e. we know in advance which classes exist and which parameters are used to classify objects. For example:
  1. to predict whether an incoming e-mail is spam or not, we use a DT classification model, because we have multiple characteristics (object parameters), want to learn the class of the object, and have exactly two classes - spam and not-spam.
  2. to predict which number is on a sign (for simplicity, assume each sign shows only a single digit from 0 to 9), we also use a DT classification model, because we have characteristics of each object (a pixel matrix, with each pixel having its own position and color) and try to predict which of the 10 classes (the digits 0 to 9) the sign belongs to.
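
To make the classification case concrete, here is a minimal sketch using scikit-learn's DecisionTreeClassifier; the spam features (number of links, number of exclamation marks, whether the sender is known) and the tiny training set are invented for illustration:

  from sklearn.tree import DecisionTreeClassifier

  # invented toy features per e-mail: [links, exclamation_marks, known_sender (0/1)]
  X = [[5, 3, 0], [0, 0, 1], [8, 10, 0], [1, 0, 1]]
  y = ["spam", "not-spam", "spam", "not-spam"]

  clf = DecisionTreeClassifier(random_state=0)
  clf.fit(X, y)
  print(clf.predict([[7, 4, 0]]))  # prints one of the two classes, here ['spam']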


Decision Tree Regression - helps to find a parameter of an object using known object characteristics. In contrast to classification, the predicted value is not drawn from a finite set of classes but from the set of real numbers. For example:
  1. to predict a house price, we have a set of parameters such as house size, number of floors, location etc. The prediction is not one of a predefined set of classes but a number, so we use a DT regression model (a code sketch follows).
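
Here is a matching minimal sketch for the regression case, again using scikit-learn's DecisionTreeRegressor; the house data (size in square meters, number of floors) and the prices are invented for illustration:

  from sklearn.tree import DecisionTreeRegressor

  # invented toy features per house: [size_m2, floors]; the target is a real-valued price
  X = [[30, 1], [45, 1], [90, 2], [120, 2], [200, 3]]
  y = [50000, 70000, 150000, 200000, 350000]

  reg = DecisionTreeRegressor(max_depth=3, random_state=0)
  reg.fit(X, y)
  print(reg.predict([[100, 2]]))  # prints a number, not a class label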
There are several methods (algorithms) used to build ("draw") a decision tree:
  1. C&RT (also CART - Classification and Regression Trees):
    1. this method builds only binary trees (each split has exactly two branches)
    2. on each iteration, for the selected set of parameters:
      1. the root of the DT is the set of all members of the model (e.g. all houses)
      2. a rule forming the split is selected (e.g. "the house has exactly one room")
      3. members satisfying the rule go to the right branch (true - "has one room") and the rest go to the left branch (false - "has more than one room")
    3. iterations continue until a node can no longer be split or until a given depth is reached
    4. this algorithm is good for initial data analysis (a single split step is sketched in code after this list)
  2. Random Forest - the built forest consists of CART trees; training uses the Bootstrap Aggregation ("bagging") method, where a combination of learning models is put into one "bag", which improves the overall result. In statistics, a bootstrap is any test or metric that relies on random sampling with replacement (an element may appear multiple times in a sample, which helps to estimate each element's weight). Random Forest can be used for both classification and regression problems. The logic of Random Forest:
    1. build many CART trees, using a different set of parameters for each DT
    2. choose the value predicted most often (for regression, the predictions of the trees are averaged)
    3. Random Forest also automatically measures the importance of the parameters (assigns a score to each parameter); the sum of all scores equals 1 (see the second sketch below).
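
To make the CART split step concrete, here is a small self-contained sketch in plain Python. It finds the single best binary split by Gini impurity, the criterion CART commonly uses for classification; the house data is invented:

  # one CART split step: try every (parameter, threshold) rule and keep the
  # one that minimizes the weighted Gini impurity of the two branches

  def gini(labels):
      # Gini impurity of a list of class labels
      n = len(labels)
      if n == 0:
          return 0.0
      return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

  def best_split(rows, labels):
      best = None  # (impurity, parameter_index, threshold)
      for f in range(len(rows[0])):
          for t in sorted({row[f] for row in rows}):
              true_side = [lab for row, lab in zip(rows, labels) if row[f] <= t]
              false_side = [lab for row, lab in zip(rows, labels) if row[f] > t]
              if not true_side or not false_side:
                  continue  # a split must produce two non-empty branches
              w = len(true_side) / len(labels)
              impurity = w * gini(true_side) + (1 - w) * gini(false_side)
              if best is None or impurity < best[0]:
                  best = (impurity, f, t)
      return best

  houses = [[1, 30], [1, 45], [3, 90], [4, 120]]  # invented: [rooms, size_m2]
  prices = ["cheap", "cheap", "expensive", "expensive"]
  print(best_split(houses, prices))  # (0.0, 0, 1): split on "rooms <= 1"

And a minimal sketch of the Random Forest logic using scikit-learn's RandomForestClassifier; the toy house data is again invented, and feature_importances_ is the attribute holding the per-parameter scores that sum to 1:

  from sklearn.ensemble import RandomForestClassifier

  # invented toy features per house: [rooms, size_m2]; classes: 0 = cheap, 1 = expensive
  X = [[1, 30], [1, 45], [2, 60], [3, 90], [4, 120], [5, 200]]
  y = [0, 0, 0, 1, 1, 1]

  forest = RandomForestClassifier(n_estimators=100, random_state=0)
  forest.fit(X, y)

  print(forest.predict([[2, 50]]))         # majority vote across all trees
  print(forest.feature_importances_)       # per-parameter importance scores
  print(sum(forest.feature_importances_))  # sums to 1 (up to float rounding)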
