Semester 3 - Data Mining (Week 11-14)


Classification and Prediction

  • Definition and Purpose
    • Classification: Technique for extracting models or “classifiers” from data.
    • Goal: To predict categorical, unordered class labels.
  • Methodology
    • Various techniques exist for classification, including decision tree, Bayesian, and rule-based classifiers.
    • Scalable techniques have been developed for classifying large, disk-resident datasets.
  • Applications
    • Diverse fields such as fraud detection, target marketing, performance prediction, manufacturing, and medical diagnosis.
  • Accuracy and Evaluation
    • Measures such as accuracy, precision, recall, and F-measure evaluate classifier performance.
    • Techniques such as holdout, cross-validation, and bootstrap yield reliable accuracy estimates.
  • Dealing with Challenges
    • Methods exist for increasing accuracy, including strategies for handling imbalanced class data.
  • Process
    • Classification typically involves a two-step process:
      • Building a classification model from labelled training data.
      • Estimating the model’s accuracy on test data before deploying it to classify new data (see the sketch below).
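
A minimal sketch of this two-step process using scikit-learn; the iris dataset and the decision tree classifier are illustrative choices, not part of the notes:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: build (learn) a classifier from labelled training data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2: estimate accuracy on held-out test data before deploying the model.
print(accuracy_score(y_test, clf.predict(X_test)))
```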

Decision Trees

  • Definition and Purpose
    • Decision Trees: A supervised machine learning technique that builds a tree by recursively splitting the data on attribute values.
    • Utility: Suitable for both Classification and Regression problems.
  • Features and Limitations
    • Transparent and easy to interpret.
    • Prone to overfitting, particularly with deeper trees.
  • Algorithms and Techniques
    • Algorithms include ID3, C4.5, CART, among others.
    • Use different metrics for attribute split, such as information gain, gain ratio, Gini index.
    • Pruning: Technique to reduce overfitting by removing less impactful branches.
  • Data Compatibility
    • Handles both categorical and numerical data.
    • Can handle missing values, e.g., by sending an instance down the most likely branch or distributing it fractionally among branches.
  • Further Utilization
    • Serves as the foundation for more powerful ensemble methods such as Random Forest and Gradient Boosting.
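
A minimal sketch of fitting and inspecting a decision tree with scikit-learn; the dataset, split criterion, and depth limit are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()

# criterion="gini" uses the Gini index; "entropy" uses information gain.
# max_depth acts as simple pre-pruning to limit overfitting in deep trees.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(data.data, data.target)

# Print the learned splits, illustrating how transparent decision trees are.
print(export_text(tree, feature_names=list(data.feature_names)))
```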

Bayesian Classification

  • Bayesian classification is a type of statistical classification based on Bayes’ theorem.
  • The naive Bayesian classifier assumes that attributes are conditionally independent of one another, given the class label.
  • This assumption simplifies the computation, and when it holds, naive Bayes is theoretically the classifier with the minimum error rate.
  • In practice, however, attributes are often dependent on one another, which can reduce accuracy.
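
A minimal naive Bayes sketch using scikit-learn’s GaussianNB; the dataset and train/test split are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# GaussianNB treats each attribute as conditionally independent given the class,
# modelling each attribute with a per-class normal distribution.
model = GaussianNB().fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```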

Bayesian Belief Networks

  • Belief networks specify joint conditional probability distributions.
  • They allow conditional independencies to be defined between subsets of variables.
  • They provide a graphical model of causal relationships, on which learning can be performed.
  • Once trained, they can be used for classification.

Components of Bayesian Belief Networks

  • A belief network is defined by a directed acyclic graph and a conditional probability table (CPT) for each variable.
  • Each node in the graph represents a random variable.
  • Variables can be discrete or continuous.
  • They may correspond to actual attributes in the data or to “hidden variables”.
  • Each arc represents a probabilistic dependence: if an arc goes from Y to Z, then Y is a parent of Z and Z is a descendant of Y.
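
A minimal sketch of how a belief network factorises a joint probability, using a hypothetical two-node network (Rain → WetGrass) with made-up CPT values:

```python
# Hypothetical CPTs for a two-node network: Rain -> WetGrass.
p_rain = {True: 0.2, False: 0.8}            # P(Rain)
p_wet_given_rain = {True: 0.9, False: 0.1}  # P(WetGrass=True | Rain)

# The joint probability factorises over the graph:
# P(Rain, WetGrass) = P(Rain) * P(WetGrass | Rain)
p_joint = p_rain[True] * p_wet_given_rain[True]   # P(Rain=T, WetGrass=T) = 0.18

# Marginalising over the parent gives P(WetGrass=True) = 0.26.
p_wet = sum(p_rain[r] * p_wet_given_rain[r] for r in (True, False))
print(p_joint, p_wet)
```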

Applications

  • They can classify data and show dependencies among attribute groups.
  • For instance, in medical data, a “hidden variable” could represent a syndrome, i.e., a set of symptoms that together characterize a particular disease.

K-Nearest Neighbors (KNN) Algorithm

  • What is KNN?
    • KNN is a simple machine learning algorithm.
    • It classifies objects based on the majority vote of its neighbors.
    • The object is assigned to the class most common among its ‘K’ closest neighbors.
    • ‘K’ is usually a small positive number. If K = 1, the object takes the class of its closest neighbor.
  • Uses of KNN
    • KNN is used for both classification and regression problems.
    • In practice, it is used mostly for classification.
    • It uses existing data points classified into groups to predict the classification of new points.
  • Relation to Real Life
    • Like learning about a person by knowing their friends, KNN uses known data to classify unknown data.
  • Things to consider before using KNN
    • KNN is computationally expensive at prediction time, since it computes the distance from the query point to every stored training point.
    • Variables should be normalized so that attributes with large ranges do not dominate the distance measure (see the sketch below).
    • Outliers and noise should be removed before applying KNN.
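
A minimal KNN sketch with scikit-learn that scales the attributes before computing distances; the wine dataset and K = 5 are illustrative assumptions:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Normalize first so attributes with large ranges do not dominate the distance,
# then classify each point by majority vote among its 5 nearest neighbors.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```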

Clustering

What is Clustering?

  • Clustering groups similar data objects together.
  • Objects in the same cluster are more alike than those in different clusters.

Challenges in Clustering

  1. Scalability:
    • Clustering algorithms should work well on large databases with millions or billions of objects.
  2. Different Types of Attributes:
    • Clustering should handle various types of data, such as numeric, binary, categorical, and ordinal.
  3. Clusters with Arbitrary Shape:
    • Clusters can have any shape, so algorithms should detect clusters of different shapes.
  4. Domain Knowledge:
    • Clustering algorithms may require user input, like the number of desired clusters. Finding optimal parameters can be challenging.
  5. Noisy Data:
    • Real-world data often has outliers, missing values, or errors. Robust clustering methods are needed to handle noise.
  6. Incremental Clustering and Insensitivity to Input Order:
    • Clustering algorithms should handle incremental updates and be insensitive to the order of data objects.

Types of clustering methods

  1. Partitioning methods:
    • Create initial partitions and iteratively improve them.
    • Examples: k-means, k-medoids, CLARANS.
  2. Hierarchical methods:
    • Create hierarchical decomposition of data objects.
    • Agglomerative (bottom-up) or divisive (top-down) approaches.
    • Examples: Chameleon, BIRCH.
  3. Density-based methods:
    • Cluster based on density of objects.
    • Grow clusters using neighborhood density or density function.
    • Examples: DBSCAN, DENCLUE, OPTICS (see the sketch after this list).
  4. Grid-based methods:
    • Quantize object space into a grid structure.
    • Perform clustering on the grid.
    • Examples: STING, CLIQUE.
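
A minimal sketch of a density-based method (scikit-learn’s DBSCAN) finding arbitrarily shaped clusters; the two-moons dataset and the eps/min_samples values are illustrative assumptions:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Two interleaving half-moons: non-spherical clusters that a partitioning
# method like k-means would typically split incorrectly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Grow clusters from dense neighborhoods; points labelled -1 are noise.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))
```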

K-means Algorithm

  1. The K-means algorithm partitions data into K non-overlapping clusters, assigning each point to the cluster with the nearest mean.
  2. Randomly choose K cluster centers.
  3. Assign each data point to the closest cluster center.
  4. Update cluster centers as the mean of assigned points.
  5. Repeat steps 3 and 4 until cluster assignments stabilize or reach maximum iterations.
  6. The result is a set of clusters in which each data point is closer to its own cluster center than to any other center.
  7. Simple and efficient, and applicable to a wide range of numeric data.
  8. Requires prior specification of K and sensitive to initial cluster center selection.
  9. Assumes spherical and evenly sized clusters, which may not always hold.
  10. Sensitive to outliers, may require data preprocessing.
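
A minimal NumPy sketch of the steps above (random initialisation, assignment, mean update, repeat); the toy data and K = 2 are illustrative assumptions:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly choose K cluster centers from the data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to the closest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: update each center as the mean of its assigned points.
        # (This sketch does not handle clusters that become empty.)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centers (and hence the assignments) stabilise.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy usage: two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
```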

Hierarchical Clustering Algorithm

  1. Hierarchical clustering builds a hierarchy of clusters in which each node is a cluster made up of the clusters of its child nodes.
  2. Two types: agglomerative (bottom-up) and divisive (top-down).
  3. Results presented in a dendrogram.
  4. No pre-specified number of clusters required.
  5. A flat, disjoint partition can be obtained by cutting the dendrogram at a chosen level.
  6. Results are sensitive to the distance metric and to the linkage method used to compute inter-cluster distances (e.g., single, complete, average).
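
A minimal agglomerative sketch with SciPy; the toy data, average linkage, and the cut into two clusters are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])

# Bottom-up (agglomerative) clustering; the linkage method and distance
# metric both influence the resulting hierarchy.
Z = linkage(X, method="average", metric="euclidean")

# Cut the dendrogram to obtain a flat partition into 2 disjoint clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```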