Semester 3 - Data Mining (Week 11-14)
Classification and Prediction
- Definition and Purpose
- Classification: A technique for extracting models, or “classifiers”, that describe data classes.
- Goal: To predict categorical, unordered class labels.
- Methodology
- Various techniques exist for classification, including decision tree, Bayesian, and rule-based classifiers.
- Scalable techniques for large, disk-resident data have been developed recently.
- Applications
- Diverse fields such as fraud detection, target marketing, performance prediction, manufacturing, and medical diagnosis.
- Accuracy and Evaluation
- Various measures, such as accuracy, precision, recall, and F-score, evaluate classifiers.
- Techniques such as cross-validation and the bootstrap yield reliable accuracy estimates.
- Dealing with Challenges
- Methods exist for increasing accuracy, including ensemble techniques and strategies for handling class-imbalanced data.
- Process
- Classification typically involves a two-step process (sketched in code after this list):
- Building a classification model from training data.
- Evaluating the model’s accuracy on held-out test data before deploying it to classify new data.
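A minimal sketch of these two steps, assuming scikit-learn is available; the iris dataset and the decision-tree classifier are illustrative choices, not the only options:

```python
# Step 1: build a classifier from training data.
# Step 2: estimate its accuracy on held-out data before deployment.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)  # step 1
print(accuracy_score(y_test, model.predict(X_test)))                   # step 2
```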
Decision Trees
- Definition and Purpose
- Decision Trees: A supervised machine learning technique that recursively splits the data on attribute values to form a tree of decision rules.
- Utility: Suitable for both Classification and Regression problems.
- Features and Limitations
- Transparent and easy to interpret.
- Prone to overfitting, particularly with deeper trees.
- Algorithms and Techniques
- Algorithms include ID3, C4.5, and CART, among others.
- They use different attribute-selection measures for splitting, such as information gain (ID3), gain ratio (C4.5), and the Gini index (CART).
- Pruning: A technique to reduce overfitting by removing branches that contribute little to classification accuracy.
- Data Compatibility
- Handles both categorical and numerical data.
- Can handle missing values, for example by substituting the most probable value or distributing an instance across branches.
- Further Utilization
- Serve as the foundation for more powerful ensemble methods such as Random Forests and Gradient Boosting (see the sketch below).
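A short sketch of fitting and inspecting a depth-limited tree with scikit-learn (the entropy criterion and the depth limit are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()

# criterion="entropy" corresponds to an information-gain-style split measure;
# max_depth is a simple pre-pruning guard against overfitting.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(data.data, data.target)

# The fitted tree is transparent: print its split rules.
print(export_text(tree, feature_names=data.feature_names))
```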
Bayesian Classification
- Bayesian classification is a form of statistical classification based on Bayes’ theorem.
- Naive Bayesian classifiers assume class-conditional independence: attribute values are independent of one another given the class.
- This assumption greatly simplifies computation; when it holds, the naive Bayesian classifier is optimal.
- In real-world data, however, attributes often depend on one another, which can reduce accuracy (see the sketch below).
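A minimal naive Bayes sketch, assuming scikit-learn; GaussianNB (which models each attribute as class-conditionally Gaussian) and the iris data are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# GaussianNB encodes the class-conditional independence assumption.
nb = GaussianNB().fit(X, y)

# The predicted class maximizes the posterior P(class | x).
print(nb.predict(X[:1]), nb.predict_proba(X[:1]))
```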
Bayesian Belief Networks
- Belief networks specify joint conditional probability distributions.
- They allow class-conditional independencies to be defined between subsets of variables.
- They provide a graphical model of causal relationships, on which learning can be performed.
- Once trained, they can be used to classify data.
Components of Bayesian Belief Networks
- A belief network is defined by two components: a directed acyclic graph and a set of conditional probability tables (CPTs).
- Each node in the graph represents a random variable.
- Variables may be discrete or continuous.
- They may correspond to actual attributes in the data or to “hidden variables” believed to form a relationship.
- Each arc represents a probabilistic dependence: if an arc is drawn from node Y to node Z, then Y is a parent of Z and Z is a descendant of Y.
Applications
- They can classify data and show dependencies among attribute groups.
- For instance, in medical data, a “hidden variable” could represent a syndrome: a collection of symptoms that together characterize a particular disease (see the sketch below).
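A tiny pure-Python sketch of these components: three variables in the classic FamilyHistory/Smoker/LungCancer structure, with hypothetical CPT values, combined via the chain rule P(F, S, L) = P(F) · P(S) · P(L | F, S):

```python
# Hypothetical (illustrative) probability tables for a 3-node belief network.
p_family_history = {True: 0.1, False: 0.9}  # P(FamilyHistory)
p_smoker = {True: 0.3, False: 0.7}          # P(Smoker)
p_cancer = {                                # CPT: P(LungCancer | F, S)
    (True, True): 0.8, (True, False): 0.5,
    (False, True): 0.7, (False, False): 0.1,
}

def joint(f: bool, s: bool, c: bool) -> float:
    """Joint probability of one assignment to all three variables."""
    p_c = p_cancer[(f, s)] if c else 1.0 - p_cancer[(f, s)]
    return p_family_history[f] * p_smoker[s] * p_c

# e.g. a smoker with family history who develops lung cancer:
print(joint(True, True, True))  # 0.1 * 0.3 * 0.8 = 0.024
```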
K-Nearest Neighbors (KNN) Algorithm
- What is KNN?
- KNN is a simple machine learning algorithm.
- It classifies an object based on a majority vote of its neighbors.
- The object is assigned to the class most common among its ‘K’ closest neighbors.
- ‘K’ is usually a small positive integer. If K = 1, the object simply takes the class of its single nearest neighbor.
- Uses of KNN
- KNN is used for both classification and regression problems.
- In industry, it is mostly used for classification.
- It uses existing data points classified into groups to predict the classification of new points.
- Relation to Real Life
- Like learning about a person by knowing their friends, KNN uses known data to classify unknown data.
- Things to consider before using KNN
- KNN is computationally expensive at prediction time, since it must compute the distance from a query point to every stored point.
- Variables should be normalized; otherwise, features on larger scales dominate the distance and bias the result.
- Outliers and noise should be removed before applying KNN (see the sketch below).
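A brief KNN sketch that handles the normalization concern, assuming scikit-learn; k = 5 and the dataset are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features first so no single variable dominates the distance metric.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # accuracy of majority-vote classification
```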
Clustering
What is Clustering?
- Clustering groups similar data objects together.
- Objects in the same cluster are more alike than those in different clusters.
Challenges in Clustering
- Scalability:
- Clustering algorithms should work well on large databases with millions or billions of objects.
- Different Types of Attributes:
- Clustering should handle various types of data, such as numeric, binary, categorical, and ordinal.
- Clusters with Arbitrary Shape:
- Clusters can have any shape, so algorithms should detect clusters of different shapes.
- Domain Knowledge:
- Clustering algorithms may require user input, like the number of desired clusters. Finding optimal parameters can be challenging.
- Noisy Data:
- Real-world data often has outliers, missing values, or errors. Robust clustering methods are needed to handle noise.
- Incremental Clustering and Insensitivity to Input Order:
- Clustering algorithms should handle incremental updates and be insensitive to the order of data objects.
Types of clustering methods
- Partitioning methods:
- Create initial partitions and iteratively improve them.
- Examples: k-means, k-medoids, CLARANS.
- Hierarchical methods:
- Create hierarchical decomposition of data objects.
- Agglomerative (bottom-up) or divisive (top-down) approaches.
- Examples: Chameleon, BIRCH.
- Density-based methods:
- Cluster based on density of objects.
- Grow clusters using neighborhood density or density function.
- Examples: DBSCAN, DENCLUE, OPTICS (DBSCAN is sketched after this list).
- Grid-based methods:
- Quantize object space into a grid structure.
- Perform clustering on the grid.
- Examples: STING, CLIQUE.
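As a quick illustration of the density-based family, a sketch using scikit-learn's DBSCAN on a two-moons dataset whose arbitrary cluster shapes defeat simple partitioning methods (eps and min_samples are illustrative parameter choices):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-circles: non-spherical clusters.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print(set(labels))  # cluster ids; -1 marks noise points
```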
K-means Algorithm
- The k-means algorithm partitions data into K non-overlapping clusters based on the nearest mean (a from-scratch sketch follows this list):
- Randomly choose K cluster centers.
- Assign each data point to the closest cluster center.
- Update cluster centers as the mean of assigned points.
- Repeat the assignment and update steps until cluster assignments stabilize or a maximum number of iterations is reached.
- The result is a set of clusters in which each point is closest to its own cluster center.
- Simple and efficient; scales well to large numeric datasets.
- Requires K to be specified in advance and is sensitive to the initial choice of cluster centers.
- Assumes spherical and evenly sized clusters, which may not always hold.
- Sensitive to outliers, may require data preprocessing.
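A from-scratch sketch of this loop, assuming NumPy; the random initialization and the empty-cluster guard are simple illustrative choices:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly choose k data points as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assign each point to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each center as the mean of its assigned points
        # (keep the old center if a cluster ends up empty).
        centers_new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(centers_new, centers):  # assignments have stabilized
            break
        centers = centers_new
    return labels, centers

labels, centers = kmeans(np.random.default_rng(1).normal(size=(100, 2)), k=3)
```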
Hierarchical Clustering Algorithm
- Hierarchical clustering builds a hierarchy of clusters in which each node is a cluster formed from the clusters of its child nodes.
- Two types: agglomerative (bottom-up) and divisive (top-down).
- Results presented in a dendrogram.
- No pre-specified number of clusters required.
- Disjoint cluster partitions can be obtained by cutting the dendrogram at a chosen level.
- Results are sensitive to the distance metric and to the linkage method used to compute inter-cluster distance, e.g., single, complete, or average (see the sketch below).
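A brief sketch with SciPy, where the average linkage method and the cut into three clusters are illustrative choices:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.random.default_rng(0).normal(size=(10, 2))

Z = linkage(X, method="average")                 # agglomerative: build the hierarchy bottom-up
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 disjoint clusters
print(labels)
```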