Semester 3 - Data Mining (Week 1-4)

2 minute read

Data Mining, also known as Knowledge Discovery in Database (KDD), is a process used to extract valuable information from large sets of data.
It involves various aspects such as:
- Data Types: Includes relational, transactional, data warehouse data, and complex data types like time-series, sequences, data streams, spatiotemporal data, multimedia data, text data, graphs, social networks, and Web data.
- Knowledge Mined: Involves discovering patterns, associations, correlations, and causal structures.
- Technologies Used: Incorporates machine learning, statistics, pattern recognition, neural networks, and visualization.
- Applications: Extensively used in various fields such as business, science, engineering, and healthcare.
Major challenges
- scalability,
- handling different types of attributes,
- dealing with noisy data, and
- developing incremental clustering algorithms.

image-center

Knowledge Discovery Process

Data Cleaning: Remove noise and inconsistent data.
Data Integration: Combine multiple data sources.
Data Selection: Data relevant to the analysis task are retrieved from the database.
Data Transformation: Consolidate data into mining-friendly formats.
Data Mining: Apply intelligent methods to uncover patterns.
Pattern Evaluation: Identify valuable patterns via interestingness measures.
Knowledge Presentation: Visualization and knowledge representation techniques are used to present the mined knowledge.

Refers to the data that records an exchange, agreement or transfer between entities.
Captures every system event detail.
Examples: order delivery, purchase orders, invoices.

image-center

Three tier data warehousing architecture

Data Warehouse

Important reasons for using Data warehouse

Integrates many sources of data and helps to decrease stress on a production system.
Optimized Data for reading access and consecutive disk scans.
Data Warehouse helps to protect Data from the source system upgrades.
Allows users to perform master Data Management.
Improve data quality in source systems.

Mining Methodology: Managing diverse data types, noise, uncertainty, scalability.
User Interaction: Maintaining simplicity, transparency, and user engagement.
Efficiency and Scalability: Ensuring fast, scalable data processing.
Diversity of Database Types: Handling various data types and sources.
Data Mining and Society: Navigating information misuse, privacy, security.

Semester 4 - Distributed System