Friday, 31 August 2012

Statistics of classification algorithms

Decision Tree Algorithm Usage Frequency

    CLS               9 %
    ID3              68 %
    ID3+            4.5 %
    C4.5          54.55 %
    C5.0              9 %
    CART           40.9 %
    Random Tree     4.5 %
    Random Forest     9 %
    SLIQ          27.27 %
    PUBLIC         13.6 %
    OC1             4.5 %
    CLOUDS          4.5 %
    SPRINT        31.84 %

Classification Algorithms

D)  SLIQ

SLIQ (Supervised Learning In Quest) was introduced by Mehta et al. (1996). It is a fast, scalable
decision tree algorithm that can be implemented in both serial and parallel form. It is not based on
Hunt's algorithm for decision tree classification. Instead, it partitions the training data set recursively using a
breadth-first greedy strategy integrated with a pre-sorting technique during the tree-building
phase (Mehta et al., 1996). With the pre-sorting technique, repeated sorting at the decision tree nodes is
eliminated and replaced with a single one-time sort, using a separate list data structure for each attribute to
determine the best split point (Mehta et al., 1996; Shafer et al., 1996). In building a decision
tree model, SLIQ handles both numeric and categorical attributes. One of the disadvantages of
SLIQ is that it uses a class list data structure that must stay memory-resident, thereby imposing memory
restrictions on the size of the data (Shafer et al., 1996). After constructing the tree, SLIQ prunes it using the
Minimum Description Length (MDL) principle, an inexpensive, bottom-up pruning technique that favours the
tree requiring the least amount of coding and so produces trees that are small in size
(Anyanwu et al., 2009; Mehta et al., 1996).
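To make the pre-sorting idea concrete, here is a minimal sketch in Python (an illustration of the concept, not the actual SLIQ implementation, and the Gini-based split scan is a simplification): each attribute's values are paired with their class labels and sorted exactly once, and every later search for a split point just scans that sorted list.

```python
def build_attribute_list(values, labels):
    """SLIQ-style pre-sort: pair every attribute value with its class
    label and sort the list once, before any tree node is built."""
    return sorted(zip(values, labels))

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(attr_list):
    """Scan the pre-sorted (value, class) list for the threshold with the
    lowest weighted Gini impurity; no re-sorting happens at this point."""
    n = len(attr_list)
    best_threshold, best_score = None, float("inf")
    for i in range(1, n):
        if attr_list[i][0] == attr_list[i - 1][0]:
            continue  # no valid split between two equal values
        left = [c for _, c in attr_list[:i]]
        right = [c for _, c in attr_list[i:]]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best_score:
            best_threshold = (attr_list[i - 1][0] + attr_list[i][0]) / 2
            best_score = score
    return best_threshold, best_score
```

For example, with ages [23, 45, 30, 50] and classes A, A, A, B, the one-time sort yields the list once and `best_split` finds the pure split at threshold 47.5.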

E) SPRINT

SPRINT (Scalable Parallelizable Induction of Decision Trees) was introduced by Shafer
et al. (1996). It is a fast, scalable decision tree classifier. It is not based on Hunt's algorithm in
constructing the decision tree; rather, it partitions the training data set recursively using a breadth-first
greedy technique until each partition belongs to the same leaf node or class (Anyanwu et al.,
2009; Shafer et al., 1996). It is an enhancement of SLIQ in that it can be implemented in both
serial and parallel form for good data placement and load balancing (Shafer et al., 1996). In this
paper we focus on the serial implementation of SPRINT. Like SLIQ, it uses a one-time sort of
the data items, and it places no restriction on the input data size. Unlike SLIQ, it uses two data
structures, an attribute list and a histogram, neither of which needs to be memory-resident, making SPRINT suitable for
large data sets and removing all memory restrictions on the data (Shafer et al., 1996). It
handles both continuous and categorical attributes.
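The histogram side of SPRINT can be sketched as follows (simplified, illustrative Python with invented names; the real SPRINT works on disk-resident attribute lists). While scanning a pre-sorted attribute list, two class histograms are maintained, one for the records below the candidate split and one for those above it, and each scan step just moves a single record between them instead of recounting both sides:

```python
from collections import Counter

def gini_from_histogram(hist):
    """Gini impurity computed directly from a class-count histogram."""
    total = sum(hist.values())
    return 1.0 - sum((count / total) ** 2 for count in hist.values())

def best_split_incremental(attr_list):
    """Scan a pre-sorted (value, class) attribute list once, keeping a
    'below' and an 'above' class histogram.  Each step moves exactly one
    record between the histograms, so no partition is recounted."""
    n = len(attr_list)
    below = Counter()
    above = Counter(cls for _, cls in attr_list)
    best_threshold, best_gini = None, float("inf")
    for i in range(1, n):
        value, cls = attr_list[i - 1]
        below[cls] += 1
        above[cls] -= 1
        if attr_list[i][0] == value:
            continue  # cannot split between two equal values
        score = sum(sum(h.values()) / n * gini_from_histogram(h)
                    for h in (below, above))
        if score < best_gini:
            best_threshold = (value + attr_list[i][0]) / 2
            best_gini = score
    return best_threshold, best_gini
```

On the toy list built from ages [23, 45, 30, 50] with classes A, A, A, B, the incremental scan finds the same pure split at 47.5 that a full recount would.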

Classification Algorithms

B)  C4.5

The C4.5 algorithm is an improvement of the ID3 algorithm, developed by Ross Quinlan (1993). It is
based on Hunt's algorithm and, like ID3, it is serially implemented. Pruning takes place in
C4.5 by replacing an internal node with a leaf node, thereby reducing the error rate (Podgorelec
et al., 2002). Unlike ID3, C4.5 accepts both continuous and categorical attributes in building the
decision tree. It has an enhanced tree-pruning method that reduces misclassification errors due
to noise or excessive detail in the training data set. Like ID3, the data is sorted at every node of
the tree in order to determine the best splitting attribute. It uses the gain ratio impurity measure to
evaluate the splitting attribute (Quinlan, 1993).
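As a small illustration of the gain ratio measure (a sketch of the formula from Quinlan 1993, not C4.5 itself), the information gain of a candidate split is divided by its "split information", which penalises attributes that fragment the data into many small subsets:

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def gain_ratio(labels, partitions):
    """Information gain of a candidate split, normalised by the split
    information so that many-valued attributes are not unduly favoured."""
    n = len(labels)
    gain = entropy(labels) - sum(
        len(part) / n * entropy(part) for part in partitions)
    split_info = -sum(
        (len(part) / n) * log2(len(part) / n) for part in partitions if part)
    return gain / split_info if split_info else 0.0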

C) C5.0

The C5.0 algorithm is an extension of the C4.5 algorithm. C5.0 is a classification algorithm that applies to big data sets, and it is better than C4.5 in both efficiency and memory use. A C5.0 model works by splitting the sample on the field that provides the maximum information gain. Each sample subset obtained from a split is then split again, usually on a different field, and the process continues until the sample subsets cannot be split any further. Finally, the lowest-level splits are examined, and the sample subsets that do not make a remarkable contribution to the model are rejected.

Thursday, 30 August 2012

Classification Algorithm


 DECISION TREE

                             A decision tree is a flow-chart-like tree structure, in which each internal node is drawn as a rectangle and each leaf node as an oval. Every internal node has two or more child nodes and contains a split, which tests the value of an expression of the attributes. Arcs from an internal node to its
children are labeled with the distinct outcomes of the test. Each leaf node has a class label associated with it.
Decision trees are commonly used for gaining information for the purpose of decision-making. A decision tree starts with a root node at which users begin taking actions. From this node, each node is split recursively according to the decision tree learning algorithm. The final result is a decision tree in which each branch represents a possible scenario of a decision and its outcome.
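The structure described above can be sketched very compactly in Python (the attribute names such as "outlook" are invented purely for illustration): internal nodes map an attribute to its outcome arcs, leaves are plain class labels, and classifying a record is just a walk from the root to a leaf.

```python
# A decision tree as nested dicts: each internal node maps the attribute
# it tests to {outcome: subtree}; leaves are plain class-label strings.
tree = {
    "outlook": {
        "sunny": {"humidity": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rain": {"windy": {"true": "no", "false": "yes"}},
    }
}

def classify(node, record):
    """Walk from the root, following the arc whose label matches the
    record's value for the tested attribute, until a leaf is reached."""
    while isinstance(node, dict):
        attribute = next(iter(node))  # the attribute tested at this node
        node = node[attribute][record[attribute]]
    return node
```

A record with outlook "overcast" reaches a leaf immediately, while one with outlook "sunny" is routed through a second test on humidity.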

The three widely used decision tree learning algorithms are:

  • ID3,
  • C4.5, and
  • CART.

A. ID3 (Iterative Dichotomiser 3)

This decision tree algorithm was introduced in 1986 by Ross Quinlan. It is based on Hunt's algorithm. ID3 uses the information gain measure to choose the splitting attribute, and it accepts only categorical attributes in building a tree model. It does not give accurate results when there is noise, so a pre-processing technique has to be used to
remove the noise.

To build the decision tree, information gain is calculated for each and every attribute, and the attribute with the highest information gain is designated as the root node. The attribute is labeled as the root node, and the possible values of the attribute are represented as arcs. Then all possible outcome instances are tested to check whether they fall under the same class or not. If all the instances fall under the same class, the node is represented with a single class name; otherwise, a splitting attribute is chosen to classify the instances further.

Continuous attributes can be handled with the ID3 algorithm by discretizing them, or directly, by considering the values to find the best split point, taking a threshold on the attribute values. ID3 does not support pruning.
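The root-selection step described above can be sketched in a few lines of Python (the toy weather attributes are made up for illustration): compute the information gain of every attribute and pick the highest.

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(records, labels, attr):
    """Entropy of the whole set minus the weighted entropy of the
    partitions obtained by splitting on attr."""
    n = len(labels)
    partitions = {}
    for record, label in zip(records, labels):
        partitions.setdefault(record[attr], []).append(label)
    return entropy(labels) - sum(
        len(part) / n * entropy(part) for part in partitions.values())

def choose_root(records, labels):
    """Designate the attribute with the highest information gain as root."""
    return max(records[0], key=lambda a: information_gain(records, labels, a))

# Hypothetical toy data: 'outlook' separates the classes perfectly,
# while 'windy' carries no information at all.
records = [{"outlook": "sunny", "windy": "true"},
           {"outlook": "sunny", "windy": "false"},
           {"outlook": "rain", "windy": "true"},
           {"outlook": "rain", "windy": "false"}]
labels = ["no", "no", "yes", "yes"]
```

Here "outlook" has a gain of one full bit while "windy" has zero, so ID3 would label "outlook" as the root node.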

Wednesday, 29 August 2012

Data mining category

  • Data mining category :=

    Data mining is a broad concept, and there are many categories inside data mining.
    The diagram below shows a clear classification of the categories inside data mining.

Monday, 27 August 2012

Data Mining Techniques

Data Mining Techniques :=

  • Several major data mining techniques have been developed and used in data mining projects recently, including association, classification, clustering, prediction and sequential patterns. We will briefly examine these data mining techniques with examples to get a good overview of them.

      1) Association

Association is one of the best-known data mining techniques. In association, a pattern is discovered based on a relationship of a particular item to other items in the same transaction. For example, the association technique is used in market basket analysis to identify which products customers frequently purchase together. Based on this data, businesses can run corresponding marketing campaigns to sell more products and make more profit.
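A minimal sketch of the market basket idea (just pair counting with a support threshold, not a full association rule miner such as Apriori; the item names are made up):

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Count how often each pair of items appears in the same basket and
    keep the pairs whose support (fraction of all transactions) reaches
    min_support."""
    counts = Counter()
    for basket in transactions:
        counts.update(combinations(sorted(set(basket)), 2))
    n = len(transactions)
    return {pair: count / n
            for pair, count in counts.items() if count / n >= min_support}
```

With three baskets in which bread and butter appear together twice, a minimum support of 0.5 keeps only the (bread, butter) pair.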

      2) Classification

Classification is a classic data mining technique based on machine learning. Basically, classification is used to classify each item in a set of data into one of a predefined set of classes or groups. Classification methods make use of mathematical techniques such as decision trees, linear programming, neural networks and statistics. In classification, we build software that can learn how to classify data items into groups. For example, we can apply classification in an application that, given all past records of employees who left the company, predicts which current employees are likely to leave in the future. In this case, we divide the employees' records into two groups, "leave" and "stay", and then ask our data mining software to classify the employees into these groups.
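The leave/stay example can be sketched with the simplest possible classifier, a one-level "decision stump" (the employee records and the "overtime" attribute are entirely made up for illustration):

```python
from collections import Counter, defaultdict

def train_stump(records, labels, attr):
    """One-level 'decision stump': for each observed value of attr,
    predict the majority class seen in the training data."""
    per_value = defaultdict(Counter)
    for record, label in zip(records, labels):
        per_value[record[attr]][label] += 1
    return {value: counts.most_common(1)[0][0]
            for value, counts in per_value.items()}

# Made-up past employee records and their outcomes.
employees = [{"overtime": "high"}, {"overtime": "high"},
             {"overtime": "low"}, {"overtime": "low"}]
history = ["leave", "leave", "stay", "stay"]
model = train_stump(employees, history, "overtime")
```

The learned model maps each attribute value to a predicted group, so a current employee with high overtime would be classified as "leave".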

     3) Clustering

Clustering is a data mining technique that makes meaningful or useful clusters of objects with similar characteristics using an automatic technique. Unlike classification, clustering defines the classes itself and then puts objects into them, whereas in classification objects are assigned to predefined classes. To make the concept clearer, we can take a library as an example. In a library, books cover a wide range of topics. The challenge is how to keep those books in such a way that readers can take several books on a specific topic without hassle. By using a clustering technique, we can keep books that have some kind of similarity in one cluster, or one shelf, and label it with a meaningful name. If readers want to grab books on a topic, they need only go to that shelf instead of searching the whole library.
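One simple automatic technique of this kind is k-means; here is a one-dimensional sketch (the points could stand for any numeric feature of the books, an assumption made purely for illustration):

```python
def kmeans_1d(points, centres, iterations=10):
    """Plain one-dimensional k-means: repeatedly assign every point to
    its nearest centre, then move each centre to the mean of its cluster."""
    clusters = [[] for _ in centres]
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)),
                          key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters
```

Starting from rough initial centres, the two natural groups in [1, 2, 10, 11] are recovered automatically, with no predefined class labels.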

    4) Prediction

Prediction, as its name implies, is a data mining technique that discovers the relationship between independent variables and the relationship between dependent and independent variables. For instance, the prediction technique can be used in sales to predict future profit: if we consider sales an independent variable, profit could be a dependent variable. Then, based on historical sales and profit data, we can draw a fitted regression curve that is used for profit prediction.
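The fitted-curve step can be sketched with an ordinary least-squares line (the sales and profit figures below are made up purely for illustration):

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Made-up historical sales (independent) and profit (dependent) figures.
sales = [10, 20, 30, 40]
profit = [3, 5, 7, 9]
slope, intercept = fit_line(sales, profit)
forecast = slope * 50 + intercept  # predicted profit for sales of 50
```

Here the fitted line is profit = 0.2 × sales + 1, so a future sales figure of 50 predicts a profit of 11.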

     5) Sequential Patterns

Sequential patterns analysis is a data mining technique that seeks to discover similar patterns in transaction data over a business period. The uncovered patterns are used for further business analysis to recognize relationships among data.

Data Mining and Application of Data Mining

Data Mining (DM) :=

  • Data mining, also known as "knowledge discovery," refers to computer-assisted tools and techniques for sifting through and analyzing vast data stores in order to find trends, patterns, and correlations that can guide decision-making and increase understanding. Data mining covers a wide variety of uses, from analyzing customer purchases to discovering galaxies. In essence, data mining is the equivalent of finding gold nuggets in a mountain of data, and the monumental task of finding that hidden gold depends heavily upon the power of computers.

 Data Mining Applications :=

Data mining includes a variety of interesting applications. A few examples are listed below:
  • By recording the activity of shoppers in an online store, such as Amazon.com, over time, retailers learn shoppers' browsing and purchasing patterns and can use that knowledge to improve the placement of items in the layout of a mail-order catalog page or Web page.
  • Telephone companies mine customer billing data to identify customers who spend considerably more than average on their monthly phone bill. The company can then target these customers to sell additional services.
  • Marketers can effectively target the wants and needs of specific consumer groups by analyzing data about customer preferences and buying patterns.
  • Hospitals use data mining to identify groups of people whose healthcare costs are likely to increase in the near future so that preventative steps can be taken.