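Course Syllabus: Data Mining and Management Decision
(Revised 2016)
Course number: 20157
English title: Data Mining and Management Decision
Course category: Core major course (bilingual)
Prerequisites: Statistics, Linear Algebra, Management
Follow-on course: Enterprise Resource Planning
Credits: 3
Class hours: 51
Instructors: Chen Xin (陈欣), Wu Shiliang (吴士亮), Wang Yuehu (王月虎), et al.
Textbook: Margaret H. Dunham. Data Mining: Introductory and Advanced Topics (reprint edition). Tsinghua University Press, October 2003.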
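Course overview:
Data mining is a data processing and analysis technology that has developed in recent years alongside the large-scale deployment of database systems and the widespread use of the World Wide Web. It is an emerging technology formed at the intersection of three fields: databases, machine learning, and statistics. This course systematically introduces the basic concepts, methods, and algorithms of data mining, and combines software demonstrations with management decision case studies so that students learn both data mining and its application. The course material is organized in parts. The first is an introduction, covering the background of data mining, related concepts, and the main techniques that data mining uses. The second covers the core algorithms of data mining, with an in-depth treatment of the common algorithms for classification, clustering, and association rules. The third covers advanced topics: Web mining, spatial data mining, and time-series and sequence data mining. Data mining uncovers useful information hidden in data and, from it, previously undiscovered knowledge. It supplies information and knowledge for business competition, enterprise production and management, government decision-making, and scientific exploration, and is of significant value in helping managers make sound decisions.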
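Teaching objectives:
After more than a decade of development, data mining has produced a number of important results, and its basic concepts, principles, and algorithms in particular have become increasingly clear. The technical groundwork for offering this course is therefore in place. The course focuses on basic concepts and basic algorithms. Treating data mining as an advanced data processing and analysis technology, it aims to give students an understanding of where information processing technology is heading and of the concepts, principles, and methods of data mining itself. Teaching is combined with management decision cases and supplemented by discussion and exploration of frontier problems, giving students a knowledge base for future research and study and preparing them for the management needs of the big data era.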
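Teaching methods:
Classroom teaching mainly uses multimedia lectures, supplemented by case teaching, class discussion, and software practice.
Teaching Requirements and Key Points by Chapter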
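Chapter 1 Introduction
Class hours: 3
Teaching requirements:
In this chapter, students learn the basic concepts of data mining and its basic tasks, including classification, regression, time series analysis, prediction, clustering, association rules, and sequence discovery. They also learn how data mining relates to knowledge discovery in databases (KDD) and how data mining affects future management decisions and social development.
Course content: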
1.1 Basic Data Mining Tasks
1.2 Data Mining Versus Knowledge Discovery in Databases
1.3 Data Mining Issues
1.4 Data Mining Metrics
1.5 Social Implications of Data Mining
1.6 Data Mining from a Database Perspective
1.7 The Future
Discussion questions:
1. Identify and describe the phases in the KDD process. How does KDD differ from data mining?
2. Find at least three examples of data mining applications that have appeared in the business section of your local publication, and describe the data mining application involved in each.
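Chapter 2 Related Concepts
Class hours: 4
Teaching requirements:
In this chapter, students learn the concepts related to data processing and become familiar with database/OLTP systems, fuzzy sets and fuzzy logic, information retrieval, decision support systems, dimensional modeling, multidimensional schemas, indexing, data warehousing, OLAP, Web search engines, machine learning, and pattern matching, together with their applications.
Course content: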
2.1 Database/OLTP Systems
2.2 Fuzzy Sets and Fuzzy Logic
2.3 Information Retrieval
2.4 Decision Support Systems
2.5 Dimensional Modeling
2.6 Indexing
2.7 Data Warehousing
2.8 OLAP
2.9 Web Search Engines
2.10 Statistics
2.11 Machine Learning
Discussion questions:
1. Compare and contrast database, information retrieval, and data mining queries. What metrics are used to measure the performance of each type of query?
2. Data warehouses are often viewed as containing relatively static data. Investigate techniques that have been proposed to update this data from the operational data. How often should these updates occur?
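Illustration for Question 1 (a minimal sketch, not a model answer): for information retrieval queries, the usual pair of metrics is precision and recall. The function below computes both for a single query. The document IDs and the function name `precision_recall` are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: set of document IDs the system returned
    relevant:  set of document IDs that are actually relevant
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
retrieved = {"d1", "d2", "d3", "d7"}
relevant = {"d1", "d2", "d3", "d4", "d5", "d6"}
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Precision penalizes irrelevant results that were returned, while recall penalizes relevant results that were missed, which is why the two are normally reported together.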
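Chapter 3 Data Mining Techniques
Class hours: 4
Teaching requirements:
In this chapter, students learn the basic formulas and computational steps of the main data mining techniques: statistical methods, Bayes' theorem, regression and correlation, decision trees, similarity measures, neural networks, activation functions, and genetic algorithms.
Course content: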
3.1 Introduction
3.2 A Statistical Perspective on Data Mining
3.3 Similarity Measures
3.4 Decision Trees
3.5 Neural Networks
3.6 Genetic Algorithms
Discussion questions:
1. Given the following set of values {1, 3, 9, 15, 20}, determine the jackknife estimate for both the mean and the standard deviation of the mean.
2. Find the similarity between <0, 1, 0.5, 0.3, 1> and <1, 0, 0.5, 0, 0> using the Dice, Jaccard, and cosine similarity measures.
3. Given the decision tree in Fig. 3.5, classify each of the following students: <Mary, 20, F, 2 m, Senior, Math>, <Dave, 19, M, 1.7 m, Sophomore, Computer Science>, and <Martha, 18, F, 1.2 m, Freshman, English>.
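Illustration for Exercise 2 (a sketch under stated assumptions): the code below uses the commonly cited vector forms of the three measures, Dice = 2Σxy / (Σx² + Σy²), Jaccard = Σxy / (Σx² + Σy² − Σxy), and cosine = Σxy / sqrt(Σx² · Σy²). If the textbook defines any of them differently, follow the textbook. Both vectors are assumed to be non-zero.

```python
import math


def dot(x, y):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def dice(x, y):
    return 2 * dot(x, y) / (dot(x, x) + dot(y, y))


def jaccard(x, y):
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)


def cosine(x, y):
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))


# The two vectors from Exercise 2.
t1 = [0, 1, 0.5, 0.3, 1]
t2 = [1, 0, 0.5, 0, 0]
for name, measure in [("Dice", dice), ("Jaccard", jaccard), ("Cosine", cosine)]:
    print(f"{name}: {measure(t1, t2):.4f}")
```

All three measures share the same numerator structure (the dot product) and differ only in how they normalize it, so they rank pairs similarly but produce different absolute values.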
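Chapter 4 Classification
Class hours: 8
Teaching requirements:
Students learn the issues that arise in classification and the associated data analysis methods, including statistical-based algorithms (such as regression and Bayesian classification), distance-based algorithms (K nearest neighbors), decision tree-based algorithms, neural networks, rule-based algorithms, and other combining techniques.
Course content: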
4.1 Introduction
4.2 Statistical-Based Algorithms
4.3 Distance-Based Algorithms
4.4 Decision Tree-Based Algorithms
4.5 Neural Network-Based Algorithms
4.6 Rule-Based Algorithms
4.7 Combining Techniques
Discussion questions:
1. Apply the method of least squares to determine the division between medium and tall persons using the training data in Table 4.1 and the classification shown in the Output1 column (see Example 4.3). You may use either the division technique or the prediction technique.
2. Explain the difference between P(ti|Cj) and P(Cj|ti).
3. Compare at least three different guidelines that have been proposed for determining the optimal number of hidden nodes in an NN.
4. Various classification algorithms can be found online. Apply these programs to the height example in Table 4.1 using the training classification shown in the Output2 column.
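Illustration for Exercise 2 (a minimal sketch): P(ti|Cj) is the likelihood of observing tuple ti within class Cj, while P(Cj|ti) is the posterior probability of the class once ti has been observed. Bayes' theorem links the two through the class priors. The class names and all probabilities below are hypothetical and are not taken from Table 4.1.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem: P(Cj | ti) = P(ti | Cj) * P(Cj) / P(ti).

    priors:      dict class -> P(Cj)
    likelihoods: dict class -> P(ti | Cj) for one fixed observation ti
    """
    # P(ti) is the total probability of the observation over all classes.
    evidence = sum(likelihoods[c] * priors[c] for c in priors)
    return {c: likelihoods[c] * priors[c] / evidence for c in priors}


# Hypothetical numbers for a single observed tuple ti.
priors = {"short": 0.2, "medium": 0.5, "tall": 0.3}
likelihoods = {"short": 0.1, "medium": 0.4, "tall": 0.7}
post = posterior(priors, likelihoods)
for label, prob in post.items():
    print(f"P({label} | ti) = {prob:.3f}")
```

With these numbers the likelihood for "tall" is 0.7 but its posterior is only about 0.49, because the prior for "tall" is lower than for "medium". The gap between the two quantities is exactly what the exercise asks about.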
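Chapter 5 Clustering
Class hours: 6
Teaching requirements:
Students master similarity and distance measures, outliers, hierarchical algorithms, partitional algorithms (minimum spanning tree, squared error clustering, K-means clustering, the nearest neighbor algorithm, and others), clustering of large databases (the BIRCH, DBSCAN, and CURE algorithms), and clustering with categorical attributes.
Course content: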
5.1 Introduction
5.2 Similarity and Distance Measures
5.3 Outliers
5.4 Hierarchical Algorithms
5.5 Partitional Algorithms
5.6 Clustering Large Databases
5.7 Clustering with Categorical Attributes
5.8 Comparison
Discussion questions:
1. Show the dendrogram created by the single, complete, and average link clustering algorithms using the following adjacency matrix.
Item | A | B | C | D
A | 0 | 1 | 4 | 5
B | 1 | 0 | 2 | 6
C | 4 | 2 | 0 | 3
D | 5 | 6 | 3 | 0
2. A major problem with the single link algorithm is that clusters consisting of long chains may be created. Describe and illustrate this concept.
3. Trace the use of the nearest neighbor algorithm on the data of Exercise 1 assuming a threshold of 3.
4. Perform a survey of recently proposed clustering algorithms. Identify where they fit in the classification tree in Figure 5.2. Try to describe their approach and performance.
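Illustration for Exercise 1 (a sketch, not a model answer): agglomerative clustering repeatedly merges the two closest clusters, and the link type only changes how the distance between two clusters is defined. The code uses the upper triangle of the Exercise 1 matrix as a symmetric distance table. The function name `agglomerative` and its interface are assumptions made for this example, and ties are broken by whichever pair is examined first.

```python
from itertools import combinations


def agglomerative(dist, linkage=min):
    """Merge clusters bottom-up and return the list of merges.

    dist:    dict mapping frozenset({p, q}) -> distance between items p and q
    linkage: min for single link, max for complete link
    Each merge is recorded as (members_a, members_b, merge_distance).
    """
    items = sorted({p for pair in dist for p in pair})
    clusters = [frozenset([p]) for p in items]
    merges = []
    while len(clusters) > 1:
        def cluster_dist(a, b):
            # Distance between clusters = linkage over all cross-cluster pairs.
            return linkage(dist[frozenset([p, q])] for p in a for q in b)

        a, b = min(combinations(clusters, 2), key=lambda ab: cluster_dist(*ab))
        merges.append((sorted(a), sorted(b), cluster_dist(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


# Upper triangle of the Exercise 1 matrix.
pairs = {("A", "B"): 1, ("A", "C"): 4, ("A", "D"): 5,
         ("B", "C"): 2, ("B", "D"): 6, ("C", "D"): 3}
d = {frozenset(k): v for k, v in pairs.items()}

for name, link in [("single", min), ("complete", max)]:
    print(name)
    for left, right, height in agglomerative(d, link):
        print(f"  merge {left} + {right} at distance {height}")
```

The recorded merge distances are the heights at which the branches of the dendrogram join. Average link can be tried by passing a function that returns the mean of the cross-cluster distances, though with this matrix it produces a tie that the sketch resolves arbitrarily.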
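Chapter 6 Association Rules
Class hours: 8
Teaching requirements:
In this chapter, students learn large itemsets, the basic algorithms (Apriori, sampling, partitioning), parallel and distributed algorithms, comparison of approaches, incremental rules, advanced association rule techniques, and how to measure the quality of rules, and apply these methods to real cases.
Course content: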
6.1 Introduction
6.2 Large Itemsets
6.3 Basic Algorithms
6.4 Parallel and Distributed Algorithms
6.5 Comparing Approaches
6.6 Incremental Rules
6.7 Advanced Association Rule Techniques
6.8 Measuring the Quality of Rules
Discussion questions:
1. Trace the results of using the Apriori algorithm on the grocery store example with s = 20% and α = 40%. Be sure to show the candidate and large itemsets for each database scan. Also indicate the association rules that will be generated.
2. Trace the results of using the sampling algorithm on the clothing store example with s = 20% and α = 40%. Be sure to show the use of the negative border function as well as the candidate and large itemsets for each database scan.
3. Calculate the lift and conviction for the rules shown in Table 6.3. Compare these to the support and confidence shown.
4. Perform a survey of recent research examining techniques to generate rules incrementally.
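Illustration for Exercise 1 (a simplified sketch of the level-wise idea behind Apriori, not the textbook's exact pseudocode): each pass builds candidate k-itemsets from the large (k-1)-itemsets, prunes candidates that have a non-large subset, and then counts support. The transactions and the 40% support threshold are made up for this example and are not the grocery store data.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every large (frequent) itemset.

    transactions: list of sets of items
    min_support:  minimum fraction of transactions that must contain the itemset
    """
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Pass 1: large 1-itemsets.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support}
    large = {s: support(s) for s in current}
    k = 2
    while current:
        # Join step: combine large (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be large.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        # Count step: keep only the candidates that meet the threshold.
        current = {c for c in candidates if support(c) >= min_support}
        large.update({s: support(s) for s in current})
        k += 1
    return large


# Hypothetical transactions.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "tea"},
    {"tea", "butter"},
]
large = apriori(transactions, min_support=0.4)
for itemset, s in sorted(large.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Association rules are then read off the large itemsets: for {milk, bread} the rule bread → milk has confidence support({milk, bread}) / support({bread}), and it is kept only if that value reaches the confidence threshold.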
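Chapter 7 Web Mining
Class hours: 6
Teaching requirements:
In this chapter, students learn advanced data mining techniques and methods for the Web: Web content mining (crawlers, the Harvest system, virtual Web views), Web structure mining (PageRank, Clever), and Web usage mining (preprocessing, data structures, pattern discovery, pattern analysis).
Course content: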
7.1 Introduction
7.2 Web Content Mining
7.3 Web Structure Mining
7.4 Web Usage Mining
Discussion questions:
1. Construct the trie for the string <A B A C>.
2. The use of a Web server through a proxy (such as an ISP) complicates the collection of frequent sequence statistics. Suppose that two users use one proxy and have the following sessions:
User 1: <1,3,1,3,4,3,6,8,2,3,6>
User 2: <2,3,4,3,6,8,6,3,1>
When these are viewed together by the Web server (taking into account the time stamps), one large session is generated:
<1,2,3,3,4,1,3,6,3,8,4,3,6,3,6,1,8,2,3,6>
Identify the maximal frequent sequences assuming a minimum support of 2. What are the maximal frequent sequences if the two users could be separated?
3. Perform a literature survey concerning current research into solutions to the proxy problem identified in Exercise 2.
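Illustration for Exercise 1 (a sketch under an assumption): the code below reads the exercise as asking for a trie over all suffixes of the sequence, which is a common structure for counting subsequences in Web usage mining. If the textbook intends a different construction, follow the textbook. The nested-dictionary representation and the function names are choices made for this example.

```python
def build_suffix_trie(sequence):
    """Insert every suffix of `sequence` into a trie of nested dicts."""
    root = {}
    for start in range(len(sequence)):
        node = root
        for symbol in sequence[start:]:
            # Reuse the child if it exists, otherwise create it.
            node = node.setdefault(symbol, {})
    return root


def count_nodes(node):
    """Number of trie nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.values())


def contains(trie, pattern):
    """True if `pattern` occurs as a contiguous subsequence."""
    node = trie
    for symbol in pattern:
        if symbol not in node:
            return False
        node = node[symbol]
    return True


trie = build_suffix_trie(["A", "B", "A", "C"])
print(trie)
print(count_nodes(trie), contains(trie, ["B", "A"]), contains(trie, ["C", "A"]))
```

Because the suffixes ABAC and AC share their first symbol, they share the node for A, which is the space saving a trie offers over storing each suffix separately.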
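Chapter 8 Spatial Mining
Class hours: 6
Teaching requirements:
In this chapter, students learn the basic concepts of spatial data (spatial queries, spatial data structures, thematic maps, and image databases), spatial data mining primitives, generalization and specialization (progressive refinement, generalization, nearest neighbor, STING), spatial rules (spatial association rules, spatial classification algorithms, extensions of ID3, spatial decision trees), and spatial clustering algorithms (extensions of CLARANS, SD(CLARANS), DBCLASD, BANG, WaveCluster, and approximation).
Course content: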
8.1 Introduction
8.2 Spatial Data Overview
8.3 Spatial Data Mining Primitives
8.4 Generalization and Specialization
8.5 Spatial Rules
8.6 Spatial Classification Algorithm
8.7 Spatial Clustering Algorithms
Discussion questions:
1. Compare the R-tree to the R*-tree.
2. Another commonly used spatial index is the grid file. Define a grid file. Compare it to a k-D tree and a quad tree. Show the grid file that would be used to index the data found in Figure 8.5.
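Chapter 9 Temporal Mining
Class hours: 6
Teaching requirements:
In this chapter, students learn temporal event modeling, time series (time series analysis, trend analysis, transformation, similarity, prediction), pattern detection, sequences (AprioriAll, SPADE, feature extraction), and temporal association rules (intertransaction rules, episode rules, trend dependencies, sequence association rules, calendric association rules). The data analysis methods are explained with an emphasis on management cases.
Course content: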
9.1 Introduction
9.2 Modeling Temporal Events
9.3 Time Series
9.4 Pattern Detection
9.5 Sequences
9.6 Temporal Association Rules
Discussion questions:
1. Assume that you are given the following temperature values, Zt, taken at 5-minute time intervals: {50, 52, 55, 58, 60, 57, 66, 62, 60}. Plot both Zt+2 and Zt. Does there appear to be an autocorrelation? Calculate the correlation coefficient.
2. Plot the following time series values as well as the moving average found by replacing a given value with the average of it and the values immediately preceding and following it: {5, 15, 7, 20, 13, 5, 8, 10, 12, 11, 9, 15}. For the first and last values, use only the two values available to calculate the average.
3. Investigate and describe two techniques which have been used to predict future stock prices.
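Illustration for Exercises 1 and 2 (a minimal sketch, plotting omitted): `moving_average` follows the rule stated in Exercise 2, and `lag_correlation` computes the Pearson correlation between Zt and Zt+k, which is one common definition of the lag-k correlation coefficient. The textbook's formula may normalize differently. Both function names are chosen for this example, and k must be at least 1.

```python
import math


def moving_average(values):
    """Centered 3-point moving average.

    The first and last points average only the two values available.
    """
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - 1): i + 2]
        smoothed.append(sum(window) / len(window))
    return smoothed


def lag_correlation(z, k):
    """Pearson correlation between z[t] and z[t + k], for k >= 1."""
    x, y = z[:-k], z[k:]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Data from Exercise 2 and Exercise 1 respectively.
series = [5, 15, 7, 20, 13, 5, 8, 10, 12, 11, 9, 15]
temps = [50, 52, 55, 58, 60, 57, 66, 62, 60]

print([round(v, 2) for v in moving_average(series)])
print(round(lag_correlation(temps, 2), 3))
```

A correlation near 1 at lag 2 would indicate that a reading is a good linear predictor of the reading taken 10 minutes later, while a value near 0 would indicate little linear relationship at that lag.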
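Appendix: References
1. Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining (complete edition). Translated by Fan Ming, Fan Hongjian, et al. Posts & Telecom Press, 2016.
2. Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. Translated by Fan Ming, Meng Xiaofeng, et al. China Machine Press, 2007.
Prepared by: Chen Xin (陈欣)
Reviewed by: Yu Rong (于荣)
Head of school (department): Liu Jun (刘军)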