
Copyright © 2010, SAS Institute Inc. All rights reserved. Introduction to Clustering

2 Clustering procedures in SAS/STAT
– Variable selection: VARCLUS
– Plotting the data: PRINCOMP, MDS
– Preprocessing: ACECLUS, STDIZE, DISTANCE
– Hierarchical clustering: CLUSTER
– Partitive clustering: parametric (FASTCLUS) and non-parametric (MODECLUS)
SAS/STAT offers a rich set of procedures for preparing data, a wide choice of clustering algorithms, and tools for evaluating the resulting models. BUT... WHY IS CLUSTERING NEEDED AT ALL?

3 Example: Clustering for Customer Types
While you have thousands of customers, there are really only a handful of major types into which most of your customers can be grouped:
– Bargain hunter
– Man/woman on a mission
– Impulse shopper
– Weary parent
– DINK (dual income, no kids)

4 Example: Clustering for Store Location
You want to open new grocery stores in the U.S. based on demographics. Where should you locate the following types of new stores?
– low-end budget grocery stores
– small boutique grocery stores
– large full-service supermarkets

5 Cluster profiling
Profiling is an attempt to express, in plain everyday terms, why particular objects were grouped into a cluster. The goal is to identify the unique traits (or combinations of traits) that characterize the objects in the cluster.

6 Types of clustering

7 Hierarchical clustering: agglomerative (bottom-up) and divisive (top-down)

8 Partitive clustering
(Figure: observations and reference vectors (seeds), shown in their initial state and final state.)
PROBLEMS! Partitive methods:
– make you guess the number of clusters present
– make assumptions about the shape of the clusters
– are influenced by seed location, outliers, and the order in which observations are read
– cannot guarantee the optimal grouping, due to the combinatorial explosion of potential solutions

9 Heuristic Search
1. Generate an initial partitioning of the observations into clusters (based on the seeds).
2. Calculate the change in error produced by moving each observation from its own cluster to each of the other clusters.
3. Make the move that produces the greatest reduction in error.
4. Repeat steps 2 and 3 until no move reduces the error.

10 Similarity measures between objects

11 Principles of a Good Similarity Metric

12 The DISTANCE Procedure
General form of the DISTANCE procedure:
PROC DISTANCE DATA=SAS-data-set METHOD=similarity-metric <options>;
   VAR level (variables);
RUN;
A distance method must be specified (there is no default), and all input variables are identified by measurement level.
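A minimal sketch of a typical call; the data set customers and the variables income, age, and visits are hypothetical:
PROC DISTANCE DATA=customers METHOD=EUCLID OUT=dist;   /* write pairwise Euclidean distances to work.dist */
   VAR INTERVAL(income age visits);                    /* interval-level inputs */
RUN;
The OUT= data set is a TYPE=DISTANCE data set that can be fed to PROC MDS or to PROC CLUSTER methods that accept distance data.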

13 Simple, popular distance metrics: Euclidean distance, city-block distance, correlation.
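For reference, the standard definitions of these three metrics for observations x and y with p variables (the slide's own formulas did not survive extraction; these are the usual textbook forms):
d_{Euclidean}(x,y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}
d_{CityBlock}(x,y) = \sum_{i=1}^{p} |x_i - y_i|
d_{Correlation}(x,y) = 1 - r_{xy}, where r_{xy} is the Pearson correlation between x and y.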

14 Going beyond: density-based similarity
Density-based methods define similarity as the distance between derived density bubbles (hyper-spheres).
(Figure: density estimate 1 (cluster 1) and density estimate 2 (cluster 2), with the similarity measured between them.)

15 Evaluating clustering quality

16 From clusters to classes
(Figure panels: typical clusters, ideal clusters, and a grouping that is not clustering at all.)
If some objects in the sample are known to belong to different classes, this information can be used to assess the quality of the clustering.

17 From clusters to class probabilities
The probability that a cluster represents a given class is given by the cluster's proportion of the row total.
(Figure: a frequency crosstab and the corresponding probability table.)

18 Measures of clustering quality
The chi-square statistic is used to determine whether an association exists. Because the chi-square value grows with sample size, it does not measure the strength of the association. Cramér's V normally ranges from 0 (weak) to 1 (strong); for 2x2 tables only, it ranges between -1 and 1.
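A brief sketch of how both statistics could be requested in SAS; the data set clusout with variables cluster and class is hypothetical:
PROC FREQ DATA=clusout;
   TABLES cluster*class / CHISQ;   /* CHISQ prints the chi-square test and Cramer's V */
RUN;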

19 Data preparation and exploratory analysis

20 The Challenge of Opportunistic Data: getting anything useful out of tons of data

21 Data preparation and analysis
1. Data and sample selection (What am I clustering?)
2. Variable selection (Which object characteristics matter?)
3. Graphical exploration (What shape are the clusters, and how many?)
4. Variable standardization (Are the variable scales comparable?)
5. Variable transformation (Are the variables correlated? Are the clusters non-spherical?)

22 Data and Sample Selection
It is not necessary to cluster a large population if you use clustering techniques that lend themselves to scoring (for example, Ward's method or k-means). It is useful to take a random sample for clustering and then score the remainder of the larger population.

23 Data preparation and exploratory analysis: variable selection

24 Data preparation and analysis
1. Data and sample selection (What am I clustering?)
2. Variable selection (Which object characteristics matter?)
3. Graphical exploration (What shape are the clusters, and how many?)
4. Variable standardization (Are the variable scales comparable?)
5. Variable transformation (Are the variables correlated? Are the clusters non-spherical?)

25 Dimensionality reduction
Is it really necessary to analyze all of the data?
(Figure: one panel shows redundancy between inputs X1 and X2; the other shows an input that is irrelevant to E(Target).)

26 Selecting relevant variables
Regression models determine variable relevance automatically, based on each variable's influence on the target. But in cluster analysis there is NO target variable. Therefore all irrelevant variables should be removed before clustering, by:
– analyzing variable importance on a specially prepared sample that does include a target
– bringing in a priori domain knowledge

27 The secret of good clustering
(Figure: legitimate (OK) and fraudulent transactions plotted by transaction amount and time of day.)

28 The secret of good clustering (continued)
(Figure: the same OK and fraud observations viewed in the time-of-day vs. transaction-amount plane.)

29 The secret of good clustering (continued)
(Figure: adding a third, uncorrelated variable, Excitement, to time of day and transaction amount separates OK from fraud.)
More uncorrelated variables = better clusters!

30 Removing redundant variables
PROC VARCLUS DATA=SAS-data-set <options>;
   BY variables;
   VAR variables;
RUN;
PROC VARCLUS groups redundant variables. One representative is then chosen from each group and the remaining variables are discarded, reducing both collinearity and the number of variables.
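A hedged example of a typical call; the data set survey and the inputs q1-q25 are hypothetical:
PROC VARCLUS DATA=survey MAXEIGEN=0.7 SHORT;   /* split any cluster whose second eigenvalue exceeds 0.7 */
   VAR q1-q25;
RUN;
From each resulting variable cluster one would typically keep the variable with the lowest 1-R**2 ratio in the printed output and drop the rest.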

31 Divisive Clustering
PROC VARCLUS uses divisive clustering to create variable subgroups that are as dissimilar as possible. The method is based on principal component analysis.

32 Demonstration: clus02d01.sas
(Output: the variable-cluster listing shows which variables are ignored and which are kept.)

33 Data preparation and exploratory analysis: graphical exploration

34 Data preparation and analysis
1. Data and sample selection (What am I clustering?)
2. Variable selection (Which object characteristics matter?)
3. Graphical exploration (What shape are the clusters, and how many?)
4. Variable standardization (Are the variable scales comparable?)
5. Variable transformation (Are the variables correlated? Are the clusters non-spherical?)

35 Graphical exploration of the data
Visualization helps establish key parameters of the problem, such as:
– the shape of the clusters
– the dispersion of the clusters
– the approximate number of clusters

36 Principal Component Plots
(Figure: eigenvectors 1 and 2 of the x1-x2 point cloud; the eigenvalues give the variance along each eigenvector.)
PROC PRINCOMP DATA=SAS-data-set <options>;
   BY variables;
   VAR variables;
RUN;
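A small sketch (hypothetical data set and variables) that computes the first two components and plots them:
PROC PRINCOMP DATA=customers OUT=scores N=2;   /* keep two principal components */
   VAR income age visits frequency;
RUN;
PROC SGPLOT DATA=scores;
   SCATTER X=Prin1 Y=Prin2;                    /* Prin1/Prin2 are the default score names */
RUN;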

37 Multidimensional Scaling Plots
PROC MDS DATA=distance_matrix <options>;
   VAR variables;
RUN;
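A sketch of the usual pipeline, assuming the distance data set dist was produced by PROC DISTANCE as above:
PROC MDS DATA=dist LEVEL=ABSOLUTE OUT=coords;    /* two dimensions by default */
RUN;
PROC SGPLOT DATA=coords(WHERE=(_TYPE_='CONFIG'));
   SCATTER X=Dim1 Y=Dim2;                        /* plot the configuration points */
RUN;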

38 Data preparation and exploratory analysis: variable standardization

39 Data preparation and analysis
1. Data and sample selection (What am I clustering?)
2. Variable selection (Which object characteristics matter?)
3. Graphical exploration (What shape are the clusters, and how many?)
4. Variable standardization (Are the variable scales comparable?)
5. Variable transformation (Are the variables correlated? Are the clusters non-spherical?)

40 PROC STDIZE
General form of the STDIZE procedure:
PROC STDIZE DATA=SAS-data-set METHOD=method <options>;
   VAR variables;
RUN;
(Comparing oranges and elephants again? Enough of that: standardize first!)
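A minimal sketch (hypothetical data set and variables):
PROC STDIZE DATA=customers METHOD=STD OUT=customers_std;   /* center to mean 0, scale to std 1 */
   VAR income age visits;
RUN;
METHOD=RANGE (rescale to [0,1]) is a common alternative when outliers are not a concern.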

41 Countless standardization methods

42 Countless standardization methods: the best of the best of the best!

43 Data preparation and exploratory analysis: variable transformation

44 Data preparation and analysis
1. Data and sample selection (What am I clustering?)
2. Variable selection (Which object characteristics matter?)
3. Graphical exploration (What shape are the clusters, and how many?)
4. Variable standardization (Are the variable scales comparable?)
5. Variable transformation (Are the variables correlated? Are the clusters non-spherical?)

45 PROC ACECLUS
General form of the ACECLUS procedure:
PROC ACECLUS DATA=SAS-data-set <options>;
   VAR variables;
RUN;
(Figure: the data before ACECLUS and after ACECLUS.)
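A hedged sketch: ACECLUS estimates the within-cluster covariance and writes canonical variables that can then be clustered; the data set names are hypothetical:
PROC ACECLUS DATA=customers_std OUT=ace PROPORTION=0.03;   /* assume roughly 3% of pairs are within-cluster */
   VAR income age visits;
RUN;
PROC FASTCLUS DATA=ace MAXCLUSTERS=4 OUT=clusters;
   VAR can1-can3;                                          /* cluster the transformed (canonical) variables */
RUN;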

46 PROC ACECLUS (continued)

47 PROC ACECLUS

48 Partitive clustering

49 Partitive clustering: the k-means algorithm

50 Partitive (optimization-based) clustering
Partitive clustering optimizes some penalty function, for example:
– the between-cluster distance
– the within-cluster homogeneity (similarity)

51 Families of algorithms
– Algorithms that optimize a natural grouping criterion (k-means)
– The parametric family (Expectation-Maximization)
– The non-parametric family (kernel-based)

52 The natural grouping criterion
Borrowing ideas from least squares, the search for the best partition of a set of objects into clusters can be reduced to optimizing a natural grouping criterion:
– maximize the between-cluster sum of squared distances, or
– minimize the within-cluster sum of squared distances between objects.
A large between-cluster distance indicates well-separated clusters; a small within-cluster distance indicates that the objects within a group are homogeneous.
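Stated in symbols (a standard restatement; the notation is not from the slides): with clusters C_1, ..., C_k, cluster means \bar{x}_j, and within-cluster scatter matrix W,
\min_{C_1,\ldots,C_k} \mathrm{trace}(W) \;=\; \min_{C_1,\ldots,C_k} \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \bar{x}_j \rVert^2
Because the total scatter T = W + B is fixed for a given data set, minimizing the within-cluster term W is equivalent to maximizing the between-cluster term B.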

53 Cross-Cluster Variation Matrix

54 The Trace Function
Trace summarizes the within-cluster variation matrix W as a single number by adding together its diagonal (variance) elements. Simply adding matrix elements makes the trace very efficient to compute, but it also makes it scale-dependent. Because it ignores the off-diagonal elements, variables are treated as if they were independent (uncorrelated), which diminishes the impact of information from correlated variables.

55 Basic Trace(W) Problems
– Spherical structure problem: because the trace function looks only at the diagonal elements of W, it tends to form spherical clusters. Remedy: use data transformation techniques.
– Similar size problem: Trace(W) also tends to produce clusters with about the same number of observations. Alternative clustering techniques exist to manage this problem.

56 Partitive clustering: the k-means algorithm and PROC FASTCLUS

57-59 The k-Means Methodology (three build slides)
The three-step k-means methodology:
1. Select (or specify) an initial set of cluster seeds.
2. Read the observations and update the seeds (known after the update as reference vectors). Repeat until convergence is attained.
3. Make one final pass through the data, assigning each observation to its nearest reference vector.

60 k-Means Clustering Algorithm
1. Select inputs.
2. Select k cluster centers.
3. Assign cases to the closest center.
4. Update cluster centers.
5. Reassign cases.
6. Repeat steps 4 and 5 until convergence.

61-69 k-Means Clustering Algorithm (animation frames repeating the same six steps: cluster centers are updated and cases are reassigned until convergence)

70 Segmentation Analysis When no clusters exist, use the k-means algorithm to partition cases into contiguous groups.

71 The FASTCLUS Procedure
General form of the FASTCLUS procedure:
PROC FASTCLUS DATA=SAS-data-set MAXCLUSTERS=n | RADIUS=t <options>;
   VAR variables;
RUN;
Because PROC FASTCLUS produces relatively little printed output, it is often a good idea to create an output data set and then use other procedures such as PROC MEANS, PROC SGPLOT, PROC DISCRIM, or PROC CANDISC to study the clusters.
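A sketch in the spirit of the grocery-store example; the data set census and its variables are hypothetical:
PROC FASTCLUS DATA=census MAXCLUSTERS=5 MAXITER=100 OUT=segments;
   VAR income hhsize density;
RUN;
PROC MEANS DATA=segments MEAN;   /* profile the segments using the CLUSTER variable FASTCLUS writes */
   CLASS cluster;
   VAR income hhsize density;
RUN;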

72 The MAXITER= Option
The MAXITER= option sets the maximum number of k-means iterations (the default is 1).
(Figure: reference vectors moving from Time 0 through Time 1 to Time n as the iterations proceed.)

73 The DRIFT Option
The DRIFT option adjusts the nearest reference vector as each observation is assigned.
(Figure: the nearest reference vector drifts after each assignment, from Time 0 through Time 2.)

74 The LEAST= Option
The LEAST= option supplies the argument for the Minkowski distance metric; it also changes the default number of iterations and the convergence criterion.
Option   | Distance   | Max Iterations | CONVERGE=
default  | EUCLIDEAN  | 1              | 0.02
LEAST=1  | CITY BLOCK | …              | …
LEAST=2  | EUCLIDEAN  | …              | …

75 What Value of k to Use?
The number of seeds, k, typically translates to the final number of clusters obtained. The choice of k can be made using a variety of methods:
– Subject-matter knowledge ("There are most likely five groups.")
– Convenience ("It is convenient to market to three or four groups.")
– Constraints ("You have six products and need six segments.")
– Arbitrarily ("Always pick 20.")
– Based on the data (for example, combined with Ward's method).

76 Problems with k-means
– The resulting partition of the space is not always optimal.
– Sample density? Never heard of it! (k-means takes no account of the density structure of the data.)

77 Grocery Store Case Study: Census Data
Analysis goal: Where should you open new grocery store locations? Group geographic regions based on income, household size, and population density.
Analysis plan:
– Explore the data.
– Select the number of segments to create.
– Create segments with a clustering procedure.
– Interpret the segments.
– Map the segments.

78 k-Means Clustering for Segmentation
This demonstration illustrates the concepts discussed previously.

79-81 (Demonstration: output screenshots.)

82 Partitive clustering: non-parametric clustering

83 Parametric vs. Non-Parametric Clustering
Parametric algorithms perform poorly on density-based clusters.
(Figure: a case where Expectation-Maximization works (+) and a case where it fails (-).)

84 Developing Kernel Intuition
(Figure: a kernel density estimate with its modes marked.)

85 Advantages of Nonparametric Clustering
– It still obtains good results on compact clusters.
– It is capable of detecting clusters of unequal size and dispersion, even if they have irregular shapes.
– It is less sensitive (but not insensitive) to changes in scale than most clustering methods.
– It does not require that you guess the number of clusters present in the data.
PROC MODECLUS DATA=SAS-data-set METHOD=method <options>;
   VAR variables;
RUN;
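A hedged example; the radius R=2 is an arbitrary smoothing choice and the data set is hypothetical:
PROC MODECLUS DATA=customers_std METHOD=1 R=2 OUT=clusout;   /* fixed-radius kernel density clustering */
   VAR income age visits;
RUN;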

86 Significance Tests
If requested (the JOIN= option), PROC MODECLUS can hierarchically join non-significant clusters. Although a fixed-radius kernel (R=) must be specified, the choice of smoothing parameter is not critical.

87 Valley-Seeking Method
(Figure: modal region 1 (cluster 1) and modal region 2 (cluster 2), separated by a valley in the density.)

88 Saddle Density Estimation
(Figure: a saddle with no density difference vs. one with a density difference.)

89 Hierarchically Joining Non-Significant Clusters
This demonstration illustrates the concepts discussed previously.

90-91 (Demonstration: output screenshots.)

92 Hierarchical clustering

93 Hierarchical Clustering

94 The CLUSTER Procedure
General form of the CLUSTER procedure:
PROC CLUSTER DATA=SAS-data-set METHOD=method <options>;
   VAR variables;
   FREQ variable;
   RMSSTD variable;
RUN;
The required METHOD= option specifies the hierarchical technique to be used to cluster the observations.
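A sketch using the three questionnaire variables chosen later in the case study (HH5, HH10, HH11); the standardized data set survey_std is hypothetical:
PROC CLUSTER DATA=survey_std METHOD=WARD CCC PSEUDO OUTTREE=tree PRINT=15;
   VAR hh5 hh10 hh11;   /* CCC and PSEUDO request the statistics used to pick the number of clusters */
RUN;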

95 Cluster and Data Types
Hierarchical Method    | Distance Data Required?
Average Linkage        | Yes
Two-Stage Linkage      | Some options
Ward's Method          | Yes
Centroid Linkage       | Yes
Complete Linkage       | Yes
Density Linkage        | Some options
EML                    | No
Flexible-Beta Method   | Yes
McQuitty's Similarity  | Yes
Median Linkage         | Yes
Single Linkage         | Yes

96 The TREE Procedure
General form of the TREE procedure:
PROC TREE DATA=SAS-data-set <options>;
RUN;
The TREE procedure either displays the dendrogram (LEVEL= option) or assigns the observations to a specified number of clusters (NCLUSTERS= option).
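Continuing the earlier sketch, cutting the tree written by OUTTREE= into seven clusters:
PROC TREE DATA=tree NCLUSTERS=7 OUT=assignments;
   COPY hh5 hh10 hh11;   /* carry the inputs along with the cluster IDs */
RUN;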

97 Hierarchical clustering: algorithm parameters

98 Average Linkage
The distance between clusters C_K and C_L is the average of the pairwise distances d(x_i, x_j) between observations, one from each cluster.

99 Two-Stage Density Linkage
A nonparametric density estimate is used to determine distances and to recover irregularly shaped clusters:
1. Form modal clusters.
2. Apply single linkage to the modal clusters (modal clusters K and L are joined at distance D_KL).

100 The Two Stages of Two-Stage
The first stage, known as density linkage, constructs a distance measure, d*, based on kernel density estimates and creates modal clusters. The second stage ensures that a cluster has at least n members before it can be fused. Clusters are fused using single linkage (joins based on the nearest points between two clusters). The measure d* can be based on three methods; this course uses the k-nearest-neighbor method.

101 Ward's Method
Ward's method uses ANOVA at each fusion point to determine whether the proposed fusion is warranted.

102 Additional Clustering Methods
– Centroid linkage
– Complete linkage
– Density linkage
– Single linkage

103 Centroid Linkage
The distance D_KL between clusters is the squared Euclidean distance between the centroids of clusters C_K and C_L.

104 Complete Linkage
The distance D_KL between clusters is the maximum distance between two observations, one in each cluster.

105 Density Linkage
1. Calculate a new distance metric, d*(x_i, x_j), using the k-nearest-neighbor, uniform-kernel, or Wong's hybrid method.
2. Perform single-linkage clustering with d*.

106 Single Linkage
The distance D_KL between clusters is the distance between the two nearest observations, one in each cluster.

107 Evaluating clustering results: the optimal number of clusters

108 Interpreting Dendrograms
For any hierarchical clustering method, look for a large change in the fusion level; in this example the jump suggests preferring 3 clusters.

109 Cubic Clustering Criterion
Sarle's cubic clustering criterion (CCC) compares observed and expected R² values. It tests the null hypothesis (H0) that the data were sampled from a uniform distribution over a hyper-box. CCC values greater than 2 suggest sufficient evidence of cluster structure (reject H0). Prefer cluster counts at local MAXIMA of the CCC.

110 Other Useful Statistics
– Pseudo-F statistic: prefer cluster counts where the statistic reaches a local MAXIMUM.
– Pseudo-T² statistic: prefer cluster counts where the statistic reaches a local MINIMUM.

111 Interpreting PSF and PST2
(Figure: pseudo-F and pseudo-T² statistics plotted against the number of clusters, read in the indicated direction; peaks and dips mark candidate cluster counts.)

112 Evaluating clustering results: cluster profiling

113 Cluster Profiling
Profiling generates unique cluster descriptions from the input variables. It can be implemented using several approaches:
– Generate the typical member of each cluster.
– Use ANOVA to determine the inputs that uniquely define each of the typical members.
– Use graphs to compare and describe the clusters.
In addition, each cluster can be compared against the whole cluster population.

114 One-Against-All Comparison
1. For cluster k, classify each observation as a member of cluster k (value 1) or not a member (value 0).
2. Use logistic regression to rank the input variables by their ability to distinguish cluster k from the others.
3. Generate a comparative plot of cluster k against the rest of the data.
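A minimal sketch of steps 1 and 2 for a hypothetical cluster 3; the data set and inputs are illustrative:
DATA flagged;
   SET segments;
   in3 = (cluster = 3);   /* 1 = member of cluster 3, 0 = everyone else */
RUN;
PROC LOGISTIC DATA=flagged;
   MODEL in3(EVENT='1') = income hhsize density;   /* effects can be ranked by their chi-square values */
RUN;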

115 Evaluating clustering results: applying the clustering model to new observations

116 Scoring PROC FASTCLUS Results
1. Perform the cluster analysis and save the centroids:
PROC FASTCLUS OUTSTAT=centroids;
2. Load the saved centroids and score a new file:
PROC FASTCLUS INSTAT=centroids OUT=SAS-data-set;
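Filled in with hypothetical names, the two steps might look like this:
PROC FASTCLUS DATA=train MAXCLUSTERS=5 OUTSTAT=centroids;   /* step 1: cluster and save centroids */
   VAR income hhsize density;
RUN;
PROC FASTCLUS DATA=newcases INSTAT=centroids OUT=scored;    /* step 2: assign new cases, no re-clustering */
   VAR income hhsize density;
RUN;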

117 Scoring PROC CLUSTER Results
1. Perform the hierarchical cluster analysis:
PROC CLUSTER METHOD=method OUTTREE=tree;
   VAR variables;
RUN;
2. Generate the cluster assignments:
PROC TREE DATA=tree N=nclusters OUT=treeout;
RUN;
continued...

118 Scoring PROC CLUSTER Results (continued)
3. Calculate the cluster centroids:
PROC MEANS DATA=treeout;
   CLASS cluster;
   OUTPUT MEAN= OUT=centroids;
RUN;
4. Read the centroids and score the new file:
PROC FASTCLUS DATA=newdata SEED=centroids MAXCLUSTERS=n MAXITER=0 OUT=results;
RUN;

119 Case study: the Happy Household study

120 The Happy Household Catalog
A retail catalog company with a strong online presence monitors quarterly purchasing behavior for its customers, including sales figures summarized across departments and quarterly totals for 5.5 years of sales. HH wants to improve customer relations by tailoring promotions to customers based on their preferred type of shopping experience. Customer preferences are difficult to ascertain based solely on opportunistic data.

121 Cluster Analysis as a Predictive Modeling Tool
The marketing team gathers questionnaire data to:
– Identify patterns in customer attitudes toward shopping.
– Generate attitude profiles (clusters) and tie them to specific marketing promotions.
– Use attitude profiles as the target variable in a predictive model with shopping behavior as inputs.
– Score the large customer database (n=48K) using the predictive model, and assign promotions based on predicted cluster groupings.

122 Preparation for Clustering
1. Data and Sample Selection
2. Variable Selection (What characteristics matter?)
3. Graphical Exploration (What shape / how many clusters?)
4. Variable Standardization (Are variable scales comparable?)
5. Variable Transformation (Are variables correlated? Are clusters elongated?)

123 Data and Sample Selection
A study is conducted to identify patterns in customer attitudes toward shopping. Online customers are asked to complete a questionnaire during a visit to the company's retail Web site. A sample of 200 completed questionnaires is analyzed.

124 Preparation for Clustering
1. Data and Sample Selection (Who am I clustering?)
2. Variable Selection
3. Graphical Exploration (What shape / how many clusters?)
4. Variable Standardization (Are variable scales comparable?)
5. Variable Transformation (Are variables correlated? Are clusters elongated?)

125 Variable Selection This demonstration illustrates the concepts discussed previously. clus06d01.sas

126 What Have You Learned?
Three variables will be used for cluster analysis:
– HH5: "I prefer to shop online rather than offline."
– HH10: "I believe that good service is the most important thing a company can provide."
– HH11: "Good value for the money is hard to find."

127 Preparation for Clustering
1. Data and Sample Selection (Who am I clustering?)
2. Variable Selection (What characteristics matter?)
3. Graphical Exploration
4. Variable Standardization (Are variable scales comparable?)
5. Variable Transformation (Are variables correlated? Are clusters elongated?)

128 Graphical Exploration of Selected Variables This demonstration illustrates the concepts discussed previously. clus06d02.sas

129 Preparation for Clustering
1. Data and Sample Selection (Who am I clustering?)
2. Variable Selection (What characteristics matter?)
3. Graphical Exploration (What shape / how many clusters?)
4. Variable Standardization
5. Variable Transformation

130 What Have You Learned?
Standardization is unnecessary in this example because all variables are on the same scale of measurement. Transformation might also be unnecessary because the plots show no evidence of elongated cluster structure and the variables have low correlation.

131 Selecting a Clustering Method
With 200 observations, it is practical to use a hierarchical clustering technique. Ward's method is selected for ease of interpretation. Select the number of clusters with the CCC, PSF, and PST2 statistics, and use cluster plots to assist in assigning cluster labels.

132 Hierarchical Clustering and Determining the Number of Clusters This demonstration illustrates the concepts discussed previously. clus06d03.sas

133 Profiling the Clusters
There are seven clusters and three marketing promotions. Determine whether the seven cluster profiles are good complements to the three marketing promotions; otherwise, try another number of clusters.

134 Profiling the Seven-Cluster Solution This demonstration illustrates the concepts discussed previously. clus06d04.sas

135 What Have You Learned?

136 What Have You Learned?

137 What Will You Offer?
Offers:
– Offer 1: Coupon for free shipping if more than 6 months since last purchase.
– Offer 2: Fee-based membership in an exclusive club, with valet service and a personal (online) shopper.
– Offer 3: Coupon for a product of a brand different from those previously purchased.
Clusters:
1. Discriminating online tastes
2. Savings and service anywhere
3. Values in-store service
4. Seeks in-store savings
5. Reluctant shopper, online
6. Reluctant shopper, in-store
7. Seeks online savings

138 What Will You Offer? (continued)
Offers will be made based on cluster classification combined with a high customer lifetime value score.

139 Predictive Modeling
The marketing team can choose from a variety of predictive modeling tools, including logistic regression, decision trees, neural networks, and discriminant analysis. Logistic regression and neural networks are ruled out because of the small sample and the large number of input variables. Discriminant analysis is used in this example:
PROC DISCRIM DATA=data-set-1;
   CLASS cluster-variable;
   VAR input-variables;
RUN;

140 Modeling Cluster Membership
This demonstration illustrates the concepts discussed previously. clus06d05.sas

141 Scoring the Database
Once a model has been developed to predict cluster membership from purchasing data, the full customer database can be scored. Customers are offered specific promotions based on predicted cluster membership.
PROC DISCRIM DATA=data-set-1 TESTDATA=data-set-2 TESTOUT=scored-data;
   PRIORS priors-specification;
   CLASS cluster-variable;
   VAR input-variables;
RUN;
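With hypothetical names substituted, the scoring call might look like this:
PROC DISCRIM DATA=sample TESTDATA=customers48k TESTOUT=scored;
   PRIORS PROPORTIONAL;   /* prior probabilities taken from the training-sample proportions */
   CLASS cluster;
   VAR q1-q12;            /* the behavioral inputs */
RUN;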

142 Let's Cluster the World!