diff --git a/.RData b/.RData new file mode 100644 index 00000000..6c7dd810 Binary files /dev/null and b/.RData differ diff --git a/CourseSessions/InClassProcess/MarketSegmentationProcessInClass.html b/CourseSessions/InClassProcess/MarketSegmentationProcessInClass.html new file mode 100644 index 00000000..9a985733 --- /dev/null +++ b/CourseSessions/InClassProcess/MarketSegmentationProcessInClass.html @@ -0,0 +1,6385 @@ + + + + + + + + + + + + + + +A Market Segmentation and Purchase Drivers Process + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+
+
+
+ +
+ + + + + + + +
+

IMPORTANT: Please make sure you create a copy of this file with a customized name, so that your work (e.g. answers to the questions) is not over-written when you pull the latest content from the course github. This is a template process for market segmentation based on survey data, using the Boats cases A and B.

+
+

All material and code is available at the INSEAD Data Analytics for Business website and github. Before starting, make sure you have pulled the course files on your github repository. As always, you can use the help command in Rstudio to find out about any R function (e.g. type help(list.files) to learn what the R function list.files does).

+
+

+
+

The Business Questions

+

This process can be used as a (starting) template for projects like the one described in the Boats cases A and B. For example (but not only), in this case some of the business questions were:

+
    +
  • What are the main purchase drivers of the customers (and prospects) of this company?

  • +
  • Are there different market segments? Which ones? Do the purchase drivers differ across segments?

  • +
  • What (possibly market segment specific) product development or brand positioning strategy should the company follow in order to increase its sales?

  • +
+

See for example some of the analysis of this case in these slides: part 1 and part 2.

+
+

+
+
+

The Process

+

The “high level” process template is split into 3 parts, corresponding to the course sessions 3-4, 5-6, and an optional last part:

+
    +
  1. Part 1: We use some of the survey questions (e.g. in this case the first 29 “attitude” questions) to find key customer descriptors (“factors”) using dimensionality reduction techniques described in the Dimensionality Reduction reading of Sessions 3-4.

  2. +
  3. Part 2: We use the selected customer descriptors to segment the market using cluster analysis techniques described in the Cluster Analysis reading of Sessions 5-6.

  4. +
  5. Part 3: For the market segments we create, we will use classification analysis to classify people based on whether or not they have purchased a product, and find the key purchase drivers for each segment. For this part we will use classification analysis techniques.

  6. +
+

Finally, we will use the results of this analysis to make business decisions, e.g. about brand positioning, product development, etc., depending on the market segments and key purchase drivers we find at the end of this process.

+
+

+
+
+

The Data

+

First we load the data to use (see the raw .Rmd file to change the data file as needed):

+
# Please ENTER the name of the file with the data used. The file should be a
+# .csv with one row per observation (e.g. person) and one column per
+# attribute. Do not add .csv at the end, make sure the data are numeric.
+datafile_name = "../Sessions23/data/Boats.csv"
+
+# Please enter the minimum number below which you would like not to print -
+# this makes the readability of the tables easier. Default values are either
+# 10e6 (to print everything) or 0.5. Try both to see the difference.
+MIN_VALUE = 0.5
+
+# Please enter the maximum number of observations to show in the report and
+# slides.  DEFAULT is 10. If the number is large the report may be slow.
+max_data_report = 10
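The data are then read along these lines (this is the loading chunk from the raw .Rmd, included later in this diff; `datafile_name` is the parameter set above):

```r
# Load the survey data and keep a numeric matrix copy, plus an untouched original
ProjectData <- read.csv(datafile_name)
ProjectData <- data.matrix(ProjectData)
ProjectData_INITIAL <- ProjectData
```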
+
+

+
+
+

Part 1: Key Customer Characteristics

+

The code used here is along the lines of the code in the session 3-4 reading FactorAnalysisReading.Rmd. We follow the process described in the Dimensionality Reduction reading.

+

In this part we also become familiar with:

+
    +
  1. Some visualization tools;
  2. +
  3. Principal Component Analysis and Factor Analysis;
  4. +
  5. Introduction to machine learning methods;
  6. +
+

(All user inputs for this part should be selected in the code chunk in the raw .Rmd file)

+
# Please ENTER the original raw attributes to use.  Please use numbers, not
+# column names, e.g. c(1:5, 7, 8) uses columns 1,2,3,4,5,7,8
+factor_attributes_used = c(2:30)
+
+# Please ENTER the selection criterion for the factors to use.  Choices:
+# 'eigenvalue', 'variance', 'manual'
+factor_selectionciterion = "manual"
+
+# Please ENTER the desired minimum variance explained (Only used in case
+# 'variance' is the factor selection criterion used).
+minimum_variance_explained = 65  # between 1 and 100
+
+# Please ENTER the number of factors to use (Only used in case 'manual' is
+# the factor selection criterion used).
+manual_numb_factors_used = 15
+
+# Please ENTER the rotation eventually used (e.g. 'none', 'varimax',
+# 'quartimax', 'promax', 'oblimin', 'simplimax', and 'cluster' - see
+# help(principal)). Default is 'varimax'
+rotation_used = "varimax"
+
+

Steps 1-2: Check the Data

+

Start with some basic visual exploration of, say, a few data points:

+
Obs.01 Obs.02 Obs.03 Obs.04 Obs.05 Obs.06 Obs.07 Obs.08 Obs.09 Obs.10
Q1.1 5 3 3 5 4 4 4 4 4 2
Q1.2 1 2 1 2 2 2 3 3 3 5
Q1.3 5 4 4 3 4 2 4 4 4 3
Q1.4 5 4 4 4 5 4 4 4 4 2
Q1.5 5 4 5 5 5 5 5 4 4 1
Q1.6 5 4 4 5 4 5 5 5 4 3
Q1.7 5 5 4 3 5 4 5 4 4 5
Q1.8 3 3 2 3 4 3 5 4 3 4
Q1.9 5 4 4 3 5 2 3 5 4 3
Q1.10 4 4 3 4 4 4 1 3 3 4
Q1.11 2 3 2 4 5 2 5 4 5 1
Q1.12 1 2 2 2 1 2 3 1 1 3
Q1.13 5 4 5 5 5 3 4 4 4 1
Q1.14 5 4 5 5 4 4 4 4 4 4
Q1.15 5 5 5 5 5 4 5 3 5 3
Q1.16 4 3 4 4 5 4 4 3 4 2
Q1.17 4 3 4 3 5 4 5 4 4 3
Q1.18 5 5 4 5 5 4 4 4 4 5
Q1.19 5 4 4 4 5 5 5 4 4 5
Q1.20 4 3 3 3 4 3 4 3 4 4
Q1.21 5 4 3 5 4 5 5 4 4 5
Q1.22 5 4 5 4 4 5 5 4 4 4
Q1.23 5 3 4 5 5 4 5 3 4 5
Q1.24 5 4 4 3 4 5 5 5 4 4
Q1.25 5 4 4 5 4 4 5 4 4 5
Q1.26 5 4 5 4 5 4 5 5 5 4
Q1.27 3 4 3 3 4 4 5 3 5 4
Q1.28 4 4 3 3 4 4 5 3 5 4
Q1.29 5 4 4 5 4 4 5 4 3 5
+

The data we use here have the following descriptive statistics:

+
min 25 percent median mean 75 percent max std
Q1.1 1 4 4 4.03 5 5 0.82
Q1.2 1 2 3 2.89 4 5 1.01
Q1.3 1 2 3 3.12 4 5 1.02
Q1.4 1 3 4 3.89 4 5 0.82
Q1.5 1 3 4 3.55 4 5 0.93
Q1.6 1 4 4 3.95 4 5 0.82
Q1.7 1 3 4 3.67 4 5 0.90
Q1.8 1 3 4 3.74 4 5 0.82
Q1.9 1 2 3 2.89 4 5 1.08
Q1.10 1 3 3 3.37 4 5 0.93
Q1.11 1 3 4 3.46 4 5 1.15
Q1.12 1 2 3 2.86 4 5 1.01
Q1.13 1 2 3 3.02 4 5 0.98
Q1.14 1 3 3 3.25 4 5 0.97
Q1.15 1 3 4 3.63 4 5 0.89
Q1.16 1 2 3 3.10 4 5 1.05
Q1.17 1 2 3 3.08 4 5 0.98
Q1.18 1 4 4 4.12 5 5 0.74
Q1.19 1 4 4 4.20 5 5 0.72
Q1.20 1 2 3 3.16 4 5 0.97
Q1.21 1 4 4 4.25 5 5 0.73
Q1.22 1 4 4 4.01 4 5 0.74
Q1.23 1 3 4 3.56 4 5 1.02
Q1.24 1 4 4 4.11 5 5 0.76
Q1.25 1 3 4 3.79 4 5 0.91
Q1.26 1 2 3 2.95 4 5 1.05
Q1.27 1 2 3 3.16 4 5 1.05
Q1.28 1 3 3 3.31 4 5 0.98
Q1.29 1 4 4 4.03 4 5 0.73
+
+
+

Step 3: Check Correlations

+

This is the correlation matrix of the customer responses to the 29 attitude questions - which are the only questions that we will use for the segmentation (see the case):
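A table like the one below can be computed along these lines (a minimal sketch, assuming `ProjectData` holds the survey responses and `factor_attributes_used` is the vector of attitude columns selected above; the exact code in the course reading may differ):

```r
# Correlation matrix of the 29 attitude questions, rounded for readability
attitude_data <- ProjectData[, factor_attributes_used]
correlations  <- round(cor(attitude_data, use = "pairwise.complete.obs"), 2)
correlations  # e.g. the Q1.3 / Q1.9 entry is about 0.58 in the table below
```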

+
Q1.1 Q1.2 Q1.3 Q1.4 Q1.5 Q1.6 Q1.7 Q1.8 Q1.9 Q1.10 Q1.11 Q1.12 Q1.13 Q1.14 Q1.15 Q1.16 Q1.17 Q1.18 Q1.19 Q1.20 Q1.21 Q1.22 Q1.23 Q1.24 Q1.25 Q1.26 Q1.27 Q1.28 Q1.29
Q1.1 1.00 0.01 0.11 0.20 0.18 0.27 0.18 0.09 0.08 0.11 0.14 -0.05 0.12 0.18 0.26 0.16 0.15 0.25 0.27 0.19 0.24 0.23 0.19 0.21 0.23 0.10 0.13 0.18 0.20
Q1.2 0.01 1.00 -0.03 -0.21 -0.21 -0.04 0.02 0.20 0.09 0.16 0.04 0.37 0.01 -0.03 -0.08 -0.02 0.04 -0.04 -0.04 0.05 -0.10 -0.08 0.00 -0.08 0.01 0.07 0.05 0.02 -0.03
Q1.3 0.11 -0.03 1.00 0.26 0.40 0.34 0.44 -0.05 0.58 0.14 0.10 -0.09 0.48 0.46 0.38 0.39 0.38 0.24 0.14 0.39 0.18 0.28 0.34 0.23 0.36 0.47 0.40 0.43 0.17
Q1.4 0.20 -0.21 0.26 1.00 0.37 0.20 0.18 0.00 0.17 0.10 0.06 -0.16 0.27 0.29 0.30 0.18 0.17 0.18 0.19 0.18 0.18 0.23 0.16 0.23 0.22 0.19 0.17 0.21 0.19
Q1.5 0.18 -0.21 0.40 0.37 1.00 0.29 0.29 -0.03 0.33 0.14 0.07 -0.17 0.45 0.46 0.42 0.36 0.32 0.23 0.18 0.32 0.19 0.27 0.29 0.25 0.29 0.34 0.29 0.33 0.18
Q1.6 0.27 -0.04 0.34 0.20 0.29 1.00 0.55 0.04 0.35 0.12 0.15 -0.12 0.29 0.31 0.31 0.27 0.24 0.44 0.36 0.35 0.42 0.41 0.32 0.37 0.42 0.31 0.34 0.39 0.27
Q1.7 0.18 0.02 0.44 0.18 0.29 0.55 1.00 -0.01 0.49 0.12 0.12 -0.11 0.35 0.36 0.34 0.31 0.29 0.40 0.28 0.36 0.33 0.39 0.30 0.33 0.42 0.39 0.37 0.40 0.24
Q1.8 0.09 0.20 -0.05 0.00 -0.03 0.04 -0.01 1.00 -0.09 0.09 0.14 0.24 -0.05 -0.02 0.06 0.02 0.05 0.07 0.09 0.04 0.06 0.05 0.10 0.02 0.10 -0.04 0.03 0.05 0.10
Q1.9 0.08 0.09 0.58 0.17 0.33 0.35 0.49 -0.09 1.00 0.14 0.06 -0.04 0.48 0.43 0.33 0.39 0.37 0.22 0.07 0.37 0.14 0.23 0.29 0.23 0.32 0.50 0.40 0.40 0.11
Q1.10 0.11 0.16 0.14 0.10 0.14 0.12 0.12 0.09 0.14 1.00 -0.09 0.12 0.16 0.11 0.11 -0.03 -0.03 0.14 0.09 0.10 0.08 0.09 0.07 0.13 0.08 0.13 0.08 0.07 0.05
Q1.11 0.14 0.04 0.10 0.06 0.07 0.15 0.12 0.14 0.06 -0.09 1.00 0.09 0.08 0.13 0.20 0.32 0.31 0.11 0.12 0.25 0.13 0.17 0.19 0.08 0.25 0.09 0.16 0.18 0.17
Q1.12 -0.05 0.37 -0.09 -0.16 -0.17 -0.12 -0.11 0.24 -0.04 0.12 0.09 1.00 -0.11 -0.17 -0.17 -0.02 0.02 -0.12 -0.09 0.01 -0.17 -0.11 -0.03 -0.17 -0.05 -0.06 0.00 -0.01 -0.04
Q1.13 0.12 0.01 0.48 0.27 0.45 0.29 0.35 -0.05 0.48 0.16 0.08 -0.11 1.00 0.64 0.46 0.43 0.43 0.20 0.11 0.39 0.14 0.23 0.32 0.20 0.32 0.48 0.40 0.40 0.19
Q1.14 0.18 -0.03 0.46 0.29 0.46 0.31 0.36 -0.02 0.43 0.11 0.13 -0.17 0.64 1.00 0.50 0.43 0.40 0.25 0.18 0.41 0.21 0.29 0.36 0.21 0.35 0.46 0.39 0.40 0.21
Q1.15 0.26 -0.08 0.38 0.30 0.42 0.31 0.34 0.06 0.33 0.11 0.20 -0.17 0.46 0.50 1.00 0.41 0.39 0.32 0.26 0.41 0.21 0.33 0.35 0.27 0.43 0.37 0.35 0.38 0.24
Q1.16 0.16 -0.02 0.39 0.18 0.36 0.27 0.31 0.02 0.39 -0.03 0.32 -0.02 0.43 0.43 0.41 1.00 0.63 0.20 0.14 0.52 0.16 0.30 0.40 0.19 0.39 0.40 0.48 0.50 0.20
Q1.17 0.15 0.04 0.38 0.17 0.32 0.24 0.29 0.05 0.37 -0.03 0.31 0.02 0.43 0.40 0.39 0.63 1.00 0.17 0.12 0.45 0.13 0.26 0.36 0.15 0.36 0.40 0.44 0.46 0.21
Q1.18 0.25 -0.04 0.24 0.18 0.23 0.44 0.40 0.07 0.22 0.14 0.11 -0.12 0.20 0.25 0.32 0.20 0.17 1.00 0.49 0.28 0.47 0.44 0.29 0.42 0.37 0.24 0.25 0.31 0.30
Q1.19 0.27 -0.04 0.14 0.19 0.18 0.36 0.28 0.09 0.07 0.09 0.12 -0.09 0.11 0.18 0.26 0.14 0.12 0.49 1.00 0.21 0.44 0.38 0.24 0.37 0.32 0.14 0.18 0.23 0.28
Q1.20 0.19 0.05 0.39 0.18 0.32 0.35 0.36 0.04 0.37 0.10 0.25 0.01 0.39 0.41 0.41 0.52 0.45 0.28 0.21 1.00 0.23 0.33 0.40 0.24 0.41 0.40 0.50 0.52 0.25
Q1.21 0.24 -0.10 0.18 0.18 0.19 0.42 0.33 0.06 0.14 0.08 0.13 -0.17 0.14 0.21 0.21 0.16 0.13 0.47 0.44 0.23 1.00 0.42 0.24 0.42 0.30 0.15 0.24 0.26 0.29
Q1.22 0.23 -0.08 0.28 0.23 0.27 0.41 0.39 0.05 0.23 0.09 0.17 -0.11 0.23 0.29 0.33 0.30 0.26 0.44 0.38 0.33 0.42 1.00 0.34 0.38 0.37 0.23 0.35 0.38 0.34
Q1.23 0.19 0.00 0.34 0.16 0.29 0.32 0.30 0.10 0.29 0.07 0.19 -0.03 0.32 0.36 0.35 0.40 0.36 0.29 0.24 0.40 0.24 0.34 1.00 0.23 0.32 0.33 0.39 0.44 0.23
Q1.24 0.21 -0.08 0.23 0.23 0.25 0.37 0.33 0.02 0.23 0.13 0.08 -0.17 0.20 0.21 0.27 0.19 0.15 0.42 0.37 0.24 0.42 0.38 0.23 1.00 0.31 0.21 0.24 0.25 0.27
Q1.25 0.23 0.01 0.36 0.22 0.29 0.42 0.42 0.10 0.32 0.08 0.25 -0.05 0.32 0.35 0.43 0.39 0.36 0.37 0.32 0.41 0.30 0.37 0.32 0.31 1.00 0.34 0.35 0.40 0.23
Q1.26 0.10 0.07 0.47 0.19 0.34 0.31 0.39 -0.04 0.50 0.13 0.09 -0.06 0.48 0.46 0.37 0.40 0.40 0.24 0.14 0.40 0.15 0.23 0.33 0.21 0.34 1.00 0.45 0.47 0.15
Q1.27 0.13 0.05 0.40 0.17 0.29 0.34 0.37 0.03 0.40 0.08 0.16 0.00 0.40 0.39 0.35 0.48 0.44 0.25 0.18 0.50 0.24 0.35 0.39 0.24 0.35 0.45 1.00 0.62 0.23
Q1.28 0.18 0.02 0.43 0.21 0.33 0.39 0.40 0.05 0.40 0.07 0.18 -0.01 0.40 0.40 0.38 0.50 0.46 0.31 0.23 0.52 0.26 0.38 0.44 0.25 0.40 0.47 0.62 1.00 0.26
Q1.29 0.20 -0.03 0.17 0.19 0.18 0.27 0.24 0.10 0.11 0.05 0.17 -0.04 0.19 0.21 0.24 0.20 0.21 0.30 0.28 0.25 0.29 0.34 0.23 0.27 0.23 0.15 0.23 0.26 1.00
+

Questions

+
    +
  1. Do you see any high correlations between the responses? Do they make sense?
  2. +
  3. What do these correlations imply?
  4. +
+

Answers:

+
    +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
+
+
+

Step 4: Choose number of factors

+

Clearly the survey asked many redundant questions (can you think of some reasons why?), so we may be able to actually “group” these 29 attitude questions into only a few “key factors”. Not only will this simplify the data, it will also greatly facilitate our understanding of the customers.

+

To do so, we use methods called Principal Component Analysis and factor analysis as also discussed in the Dimensionality Reduction readings. We can use two different R commands for this (they make slightly different information easily available as output): the command principal (check help(principal) from R package psych), and the command PCA from R package FactoMineR - there are more packages and commands for this, as these methods are very widely used.
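A minimal sketch of how such results could be obtained with the psych package (assuming the `ProjectData` and `factor_attributes_used` objects from above; the exact code in the course reading may differ):

```r
library(psych)

factor_data <- ProjectData[, factor_attributes_used]

# Unrotated principal components (one per original variable),
# used to inspect eigenvalues and variance explained
unrotated <- principal(factor_data, nfactors = ncol(factor_data),
                       rotate = "none", scores = TRUE)

# Eigenvalues and cumulative percentage of variance explained (as in the table below)
eigenvalues <- unrotated$values
variance_explained <- data.frame(
  Eigenvalue      = eigenvalues,
  Pct_of_variance = 100 * eigenvalues / sum(eigenvalues),
  Cumulative_pct  = cumsum(100 * eigenvalues / sum(eigenvalues))
)
round(variance_explained, 2)

# A scree plot of the eigenvalues (the plot referred to in the questions below)
plot(eigenvalues, type = "b", xlab = "Component", ylab = "Eigenvalue")
abline(h = 1, lty = 2)  # common "eigenvalue greater than 1" rule of thumb
```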

+

Let’s look at the variance explained as well as the eigenvalues (see session readings):

+
Eigenvalue Pct of explained variance Cumulative pct of explained variance
Component 1 8.43 29.08 29.08
Component 2 2.33 8.05 37.12
Component 3 1.86 6.42 43.55
Component 4 1.46 5.03 48.57
Component 5 1.21 4.16 52.74
Component 6 0.90 3.10 55.84
Component 7 0.82 2.82 58.65
Component 8 0.79 2.71 61.36
Component 9 0.78 2.69 64.05
Component 10 0.74 2.56 66.61
Component 11 0.69 2.37 68.98
Component 12 0.65 2.25 71.23
Component 13 0.65 2.23 73.47
Component 14 0.62 2.13 75.60
Component 15 0.61 2.10 77.70
Component 16 0.58 1.99 79.69
Component 17 0.56 1.94 81.62
Component 18 0.54 1.85 83.47
Component 19 0.52 1.81 85.28
Component 20 0.51 1.76 87.04
Component 21 0.50 1.72 88.77
Component 22 0.49 1.69 90.45
Component 23 0.46 1.59 92.04
Component 24 0.46 1.57 93.61
Component 25 0.41 1.42 95.03
Component 26 0.38 1.32 96.36
Component 27 0.37 1.28 97.63
Component 28 0.35 1.22 98.85
Component 29 0.33 1.15 100.00
+
+ +

Questions:

+
    +
  1. Can you explain what this table and the plot are? What do they indicate? What can we learn from these?
  2. +
  3. Why does the plot have this specific shape? Could the plotted line be increasing?
  4. +
  5. What characteristics of these results would we prefer to see? Why?
  6. +
+

Answers

+
    +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
+
+
+

Step 5: Interpret the factors

+

Let’s now see what the “top factors” look like.

+

To better visualize them, we will use what is called a “rotation”. There are many rotation methods. In this case we selected the varimax rotation. For our data, the 15 selected factors look as follows after this rotation:
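The rotated loadings in the table below could be produced along these lines (a sketch, reusing `factor_data` and the `manual_numb_factors_used` and `rotation_used` parameters entered above):

```r
# 15 factors with a varimax rotation (parameters chosen in the setup chunk)
rotated <- principal(factor_data, nfactors = manual_numb_factors_used,
                     rotate = rotation_used, scores = TRUE)

# Rounded factor loadings, one row per original question
rotated_loadings <- round(unclass(rotated$loadings), 2)
rotated_loadings
```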

+
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
Q1.27 0.81 0.18 0.11 0.18 0.03 -0.02 -0.04 0.03 -0.02 -0.01 0.07 0.04 0.05 0.05 0.03
Q1.28 0.76 0.18 0.18 0.21 0.00 -0.01 0.00 -0.01 0.05 0.03 0.07 0.03 0.11 0.14 -0.01
Q1.20 0.65 0.22 0.13 0.14 -0.03 0.16 0.04 0.11 0.21 0.09 0.08 -0.03 -0.02 0.10 0.03
Q1.16 0.55 0.39 0.01 0.09 -0.14 0.27 0.15 -0.16 0.20 0.08 0.02 -0.03 -0.08 0.12 0.25
Q1.17 0.48 0.43 -0.02 0.08 -0.06 0.25 0.22 -0.21 0.20 0.08 0.06 -0.01 -0.05 0.09 0.24
Q1.26 0.44 0.44 0.07 0.31 0.25 -0.03 -0.10 0.04 0.08 -0.05 -0.05 -0.10 0.15 0.11 0.03
Q1.23 0.32 0.18 0.20 0.14 -0.03 0.09 -0.03 0.02 0.04 0.07 0.06 0.08 0.00 0.84 0.02
Q1.22 0.29 0.03 0.46 0.22 -0.17 0.08 -0.03 0.07 0.09 0.04 0.28 0.01 0.07 0.17 0.16
Q1.25 0.29 0.12 0.30 0.32 0.00 0.15 -0.01 0.00 0.65 0.04 0.02 0.09 0.09 -0.02 0.05
Q1.6 0.25 0.07 0.43 0.60 -0.14 0.08 -0.08 0.08 0.03 0.20 0.08 0.09 -0.02 -0.04 -0.06
Q1.13 0.25 0.78 0.05 0.20 0.03 0.00 -0.06 0.08 0.05 0.00 0.08 -0.03 0.06 0.02 0.02
Q1.14 0.24 0.77 0.16 0.16 0.02 0.07 -0.16 0.03 0.05 0.05 0.07 0.02 0.08 0.06 -0.08
Q1.9 0.23 0.40 -0.04 0.65 0.12 -0.02 0.06 0.02 0.04 -0.03 -0.03 -0.13 0.05 0.12 0.21
Q1.15 0.22 0.48 0.15 0.08 -0.07 0.09 -0.19 0.09 0.53 0.14 0.11 0.08 0.08 0.13 0.06
Q1.3 0.21 0.42 0.02 0.57 -0.04 0.02 0.06 0.02 0.07 -0.06 0.01 -0.08 0.20 0.23 0.12
Q1.7 0.20 0.15 0.27 0.74 -0.01 0.02 -0.06 0.03 0.16 0.06 0.11 0.01 -0.03 0.02 0.03
Q1.21 0.17 0.07 0.71 0.16 -0.09 0.16 -0.18 0.02 -0.27 0.07 0.07 0.11 0.02 -0.07 0.16
Q1.5 0.14 0.58 0.14 0.15 -0.49 -0.05 0.12 0.11 0.07 0.12 -0.01 0.02 0.15 0.07 0.12
Q1.29 0.13 0.11 0.21 0.07 0.00 0.06 0.01 -0.01 0.04 0.07 0.92 0.04 0.06 0.04 0.06
Q1.11 0.12 0.04 0.08 0.03 0.03 0.94 0.04 -0.05 0.09 0.04 0.06 0.06 0.03 0.06 -0.01
Q1.24 0.12 0.04 0.41 0.17 -0.01 -0.01 -0.17 0.10 0.04 0.06 0.10 0.03 0.11 0.02 0.77
Q1.18 0.10 0.11 0.71 0.19 0.00 -0.02 -0.02 0.06 0.17 0.05 0.09 0.01 -0.01 0.10 0.13
Q1.4 0.09 0.20 0.11 0.06 -0.17 0.03 -0.04 0.04 0.07 0.10 0.07 0.02 0.90 0.01 0.07
Q1.1 0.07 0.06 0.18 0.06 0.03 0.05 -0.03 0.04 0.06 0.94 0.07 0.03 0.09 0.05 0.04
Q1.12 0.06 -0.13 -0.07 -0.03 0.23 0.05 0.85 0.11 -0.06 -0.03 0.01 0.14 -0.03 -0.02 -0.11
Q1.19 0.04 0.08 0.82 -0.03 0.04 -0.02 0.07 -0.03 0.20 0.10 0.05 -0.03 0.11 0.11 0.01
Q1.2 0.03 0.03 -0.04 0.03 0.82 0.02 0.28 0.10 -0.01 0.06 -0.01 0.13 -0.14 -0.01 0.02
Q1.10 0.02 0.10 0.06 0.06 0.06 -0.05 0.10 0.95 0.03 0.04 0.00 0.02 0.03 0.02 0.06
Q1.8 0.01 -0.02 0.05 -0.04 0.10 0.06 0.13 0.02 0.06 0.03 0.04 0.96 0.01 0.06 0.02
+

To better visualize and interpret the factors we often “suppress” loadings with small values, e.g. with absolute values smaller than 0.5. In this case our factors look as follows after suppressing the small numbers:
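Suppressing the small loadings is a purely cosmetic step; a sketch (assuming the `rotated_loadings` matrix from the previous sketch and the `MIN_VALUE` threshold of 0.5 set above):

```r
# Hide loadings with absolute value below the chosen threshold (e.g. 0.5)
suppressed_loadings <- rotated_loadings
suppressed_loadings[abs(suppressed_loadings) < MIN_VALUE] <- NA
suppressed_loadings
```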

+
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
Q1.27 0.81
Q1.28 0.76
Q1.20 0.65
Q1.16 0.55
Q1.17
Q1.26
Q1.23 0.84
Q1.22
Q1.25 0.65
Q1.6 0.60
Q1.13 0.78
Q1.14 0.77
Q1.9 0.65
Q1.15 0.53
Q1.3 0.57
Q1.7 0.74
Q1.21 0.71
Q1.5 0.58
Q1.29 0.92
Q1.11 0.94
Q1.24 0.77
Q1.18 0.71
Q1.4 0.9
Q1.1 0.94
Q1.12 0.85
Q1.19 0.82
Q1.2 0.82
Q1.10 0.95
Q1.8 0.96
+

Questions

+
    +
  1. What do the first couple of factors mean? Do they make business sense?
  2. +
  3. How many factors should we choose for this data/customer base? Please try a few and explain your final choice based on a) statistical arguments, b) interpretation arguments, and c) business arguments (you need to consider all three types of arguments)
  4. +
  5. How would you interpret the factors you selected?
  6. +
  7. What lessons about data science do you learn when doing this analysis? Please comment.
  8. +
+

Answers

+
    +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
+
+
+

Step 6: Save factor scores

+

We can now either replace all the initial variables used in this part with the factor scores, or select one of the initial variables for each of the selected factors to represent that factor. Here are the factor scores for the first few respondents:
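The scores shown below could be extracted along these lines (a sketch, assuming the `rotated` object from the earlier sketch):

```r
# Factor scores: one new derived variable ("factor") per selected component,
# one row per respondent
factor_scores <- round(rotated$scores, 2)
head(factor_scores, 10)  # scores of the first few respondents
```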

+
Obs.01 Obs.02 Obs.03 Obs.04 Obs.05 Obs.06 Obs.07 Obs.08 Obs.09 Obs.10
DV (Factor) 1 -0.34 0.14 -0.47 -0.67 0.26 1.23 1.43 -0.55 1.82 1.13
DV (Factor) 2 1.81 0.89 2.36 2.39 1.99 0.41 0.81 1.10 0.82 -1.43
DV (Factor) 3 0.70 -0.04 -0.69 0.64 -0.07 1.28 0.75 -0.67 -0.69 1.56
DV (Factor) 4 1.38 1.00 0.26 -0.43 0.37 -0.90 0.61 1.74 0.06 0.13
DV (Factor) 5 -1.05 -0.30 -1.45 -0.91 -0.41 -1.47 -0.03 0.78 0.72 3.24
DV (Factor) 6 -1.31 -0.65 -1.34 0.65 0.98 -1.19 0.47 0.43 0.98 -2.12
DV (Factor) 7 -1.11 -1.01 -0.22 -0.89 -0.79 -0.40 0.00 -1.43 -2.10 -0.86
DV (Factor) 8 0.16 0.74 -0.41 0.80 0.09 0.51 -2.61 -0.73 -0.19 0.20
DV (Factor) 9 0.75 1.00 1.45 0.31 0.85 -0.25 0.31 -0.62 0.37 1.10
DV (Factor) 10 0.63 -1.57 -1.35 1.02 -0.40 0.15 -0.41 -0.13 -0.15 -3.05
DV (Factor) 11 0.81 0.07 0.42 0.92 -0.73 -0.32 0.80 -0.37 -1.72 1.24
DV (Factor) 12 -0.93 -0.71 -2.01 -0.59 -0.12 -0.51 1.78 0.32 -0.80 0.00
DV (Factor) 13 0.59 -0.04 -0.35 -0.76 0.62 -0.61 0.10 0.20 0.26 -1.81
DV (Factor) 14 1.01 -0.63 0.72 0.58 1.22 -0.44 0.36 -0.73 0.21 1.46
DV (Factor) 15 0.52 -0.58 -0.12 -2.26 0.24 0.61 0.26 1.42 -0.26 -0.77
+

Questions

+
    +
  1. Can you describe some of the people using the new derived variables (factor scores)?
  2. +
  3. Which of the 29 initial variables would you select to represent each of the factors you selected?
  4. +
+

Answers

+
    +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
+
+

+
+
+
+

Part 2: Customer Segmentation

+

The code used here is along the lines of the code in the session 5-6 reading ClusterAnalysisReading.Rmd. We follow the process described in the Cluster Analysis reading.

+

In this part we also become familiar with:

+
    +
  1. Some clustering Methods;
  2. +
  3. How these tools can be used in practice.
  4. +
+

A key family of methods used for segmentation is what is called clustering methods. This is a very important problem in statistics and machine learning, used in all sorts of applications such as Amazon’s pioneering work on recommender systems. There are many mathematical methods for clustering. We will use two very standard methods, hierarchical clustering and k-means. While the “math” behind all these methods can be complex, the R functions used are relatively simple to use, as we will see.

+

(All user inputs for this part should be selected in the code chunk in the raw .Rmd file)

+
# Please ENTER the original raw attributes to use for the segmentation (the
+# 'segmentation attributes') Please use numbers, not column names, e.g.
+# c(1:5, 7, 8) uses columns 1,2,3,4,5,7,8
+segmentation_attributes_used = c(28, 25, 27, 14, 20, 8, 3, 12, 13, 5, 9, 11, 
+    2, 30, 24)  #c(10,19,5,12,3) 
+
+# Please ENTER the original raw attributes to use for the profiling of the
+# segments (the 'profiling attributes') Please use numbers, not column
+# names, e.g. c(1:5, 7, 8) uses columns 1,2,3,4,5,7,8
+profile_attributes_used = c(2:82)
+
+# Please ENTER the number of clusters to eventually use for this report
+numb_clusters_used = 7  # for boats possibly use 5, for Mall_Visits use 3
+
+# Please enter the method to use for the segmentation:
+profile_with = "hclust"  #  'hclust' or 'kmeans'
+
+# Please ENTER the distance metric eventually used for the clustering in
+# case of hierarchical clustering (e.g. 'euclidean', 'maximum', 'manhattan',
+# 'canberra', 'binary' or 'minkowski' - see help(dist)).  DEFAULT is
+# 'euclidean'
+distance_used = "euclidean"
+
+# Please ENTER the hierarchical clustering method to use (options are:
+# 'ward', 'single', 'complete', 'average', 'mcquitty', 'median' or
+# 'centroid').  DEFAULT is 'ward'
+hclust_method = "ward.D"
+
+# Please ENTER the kmeans clustering method to use (options are:
+# 'Hartigan-Wong', 'Lloyd', 'Forgy', 'MacQueen').  DEFAULT is 'Lloyd'
+kmeans_method = "Lloyd"
+
+

Steps 1-2: Explore the data

+

(This was done above, so we skip it)

+
+
+

Step 3: Select Segmentation Variables

+

For simplicity we will use one representative question for each of the factors we found in Part 1 (we can also use the “factor scores” for each respondent) to represent our survey respondents. These are the segmentation_attributes_used selected below. We can choose the question with the highest absolute factor loading for each factor. For example, when we use 5 factors with the varimax rotation we can select questions Q1.9 (I see my boat as a status symbol), Q1.18 (Boating gives me a feeling of adventure), Q1.4 (I only consider buying a boat from a reputable brand), Q1.11 (I tend to perform minor boat repairs and maintenance on my own) and Q1.2 (When buying a boat getting the lowest price is more important than the boat brand) - try it. These are columns 10, 19, 5, 12, and 3, respectively, of the data matrix ProjectData.

Step 4: Define similarity measure

+

We need to define a distance metric that measures how different people (observations in general) are from each other. This can be an important choice. Here are the differences between the observations using the distance metric we selected:
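Pairwise distances like the ones below can be computed along these lines (a sketch, assuming `segmentation_attributes_used`, `distance_used` and `max_data_report` are the parameters set above):

```r
segmentation_data <- ProjectData[, segmentation_attributes_used]

# Pairwise distances between respondents; the metric is e.g. "euclidean"
distances <- dist(segmentation_data, method = distance_used)

# Distances between the first 10 respondents, rounded as in the table below
round(as.matrix(distances)[1:max_data_report, 1:max_data_report], 0)
```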

+
Obs.01 Obs.02 Obs.03 Obs.04 Obs.05 Obs.06 Obs.07 Obs.08 Obs.09 Obs.10
Obs.01 0
Obs.02 4 0
Obs.03 4 3 0
Obs.04 4 4 4 0
Obs.05 4 4 5 4 0
Obs.06 4 3 3 4 4 0
Obs.07 6 5 6 6 4 5 0
Obs.08 4 3 4 4 4 4 5 0
Obs.09 5 4 5 4 3 4 4 3 0
Obs.10 8 6 7 7 8 5 7 7 7 0
+
+
+

Step 5: Visualize Pair-wise Distances

+

We can see the histogram of, say, the first 2 variables (can you change the code chunk in the raw .Rmd file to see other variables?)

+
+
+ +
+ +
+

or the histogram of all pairwise distances for the euclidean distance:

+
+ +
+
+

Step 6: Method and Number of Segments

+

We need to select the clustering method to use, as well as the number of clusters. It may be useful to see the dendrogram from Hierarchical Clustering, to have a quick idea of how the data may be segmented and how many segments there may be. Here is the dendrogram for our data:

+
+ +

We can also plot the “distances” traveled before we need to merge any of the smaller clusters into larger ones - the heights of the tree branches that link the clusters as we traverse the tree from its leaves to its root. If we have n observations, this plot has n-1 numbers; we show the first 20 here.
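A sketch of the clustering steps that produce the dendrogram, the heights plot and the segment memberships shown below (using the `distances`, `hclust_method`, `numb_clusters_used` and `kmeans_method` objects/parameters from above; the exact reading code may differ):

```r
# Hierarchical clustering on the pairwise distances
hclust_result <- hclust(distances, method = hclust_method)
plot(hclust_result)  # dendrogram

# Heights of the last merges (the "distances traveled"); plot e.g. the largest 20
heights <- sort(hclust_result$height, decreasing = TRUE)
plot(heights[1:20], type = "b",
     xlab = "Number of merges from the top", ylab = "Height")

# Segment memberships for the chosen number of clusters
hclust_membership <- cutree(hclust_result, k = numb_clusters_used)

# k-means memberships on the same segmentation data
kmeans_result <- kmeans(segmentation_data, centers = numb_clusters_used,
                        iter.max = 100, algorithm = kmeans_method)
kmeans_membership <- kmeans_result$cluster

head(cbind(hclust_membership, kmeans_membership), 10)
```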

+

+

Here is the segment membership of the first 10 respondents if we use hierarchical clustering:

+
Observation Number Cluster_Membership
1 1
2 2
3 1
4 3
5 4
6 1
7 4
8 2
9 4
10 3
+

while this is the segment membership if we use k-means:

+
Observation Number Cluster_Membership
1 7
2 7
3 7
4 7
5 7
6 7
7 7
8 7
9 7
10 3
+
+
+

Step 7: Profile and interpret the segments

+

In market segmentation one may use variables to profile the segments that are not (necessarily) the same as those used to segment the market: the latter may be, for example, attitude/needs related (you define segments based on what the customers “need”), while the former may be any information that allows a company to identify the defined customer segments (e.g. demographics, location, etc). Of course, deciding which variables to use for segmentation and which to use for profiling (and then activation of the segmentation for business purposes) is largely subjective. In this case we can use all survey questions for profiling for now - the profile_attributes_used variables selected below.

+

There are many ways to do the profiling of the segments. For example, here we show how the average answers of the respondents in each segment compare to the average answer of all respondents, using the ratio of the two. The idea is that if in a segment the average response to a question is very different from the overall average (e.g. a ratio far from 1), then that question may indicate something about the segment relative to the total population.
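A sketch of how such profiles can be built (assuming `profile_attributes_used` from above and the `hclust_membership` vector from the clustering sketch; the reading code may differ):

```r
profile_data <- ProjectData[, profile_attributes_used]

# Average answer per question for the whole population and for each segment
population_average <- colMeans(profile_data)
segment_averages   <- sapply(sort(unique(hclust_membership)), function(s)
  colMeans(profile_data[hclust_membership == s, , drop = FALSE]))
colnames(segment_averages) <- paste0("Seg.", sort(unique(hclust_membership)))

# First profiling table: averages per segment next to the population average
round(cbind(Population = population_average, segment_averages), 2)

# Second profiling table: ratio of each segment average to the population average, minus 1
round(segment_averages / population_average - 1, 2)
```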

+

Here are for example the profiles of the segments using the clusters found above. First let’s see just the average answer people gave to each question for the different segments as well as the total population:

+
Population Seg.1 Seg.2 Seg.3 Seg.4 Seg.5 Seg.6 Seg.7
Q1.1 4.03 4.01 4.20 3.84 4.41 4.41 3.73 3.83
Q1.2 2.89 2.29 2.74 3.77 2.63 4.33 2.90 3.04
Q1.3 3.12 3.56 3.03 3.52 3.92 4.23 2.71 2.37
Q1.4 3.89 4.26 3.98 3.65 4.23 4.36 3.56 3.57
Q1.5 3.55 3.96 3.56 3.61 4.16 4.28 3.24 2.93
Q1.6 3.95 4.27 4.00 3.89 4.69 4.47 3.57 3.51
Q1.7 3.67 4.17 3.66 3.74 4.53 4.45 3.29 2.94
Q1.8 3.74 3.25 3.85 3.73 3.81 4.41 3.59 3.94
Q1.9 2.89 3.36 2.73 3.46 3.67 4.24 2.58 1.98
Q1.10 3.37 3.16 3.28 3.65 3.47 4.31 3.47 3.08
Q1.11 3.46 3.04 3.98 3.52 4.24 4.26 2.08 4.06
Q1.12 2.86 2.16 2.74 3.68 2.11 4.42 2.85 3.42
Q1.13 3.02 3.56 2.87 3.56 3.86 4.38 2.70 2.06
Q1.14 3.25 3.67 3.19 3.64 4.13 4.31 2.79 2.51
Q1.15 3.63 3.94 3.67 3.75 4.43 4.37 3.16 3.11
Q1.16 3.10 3.34 3.20 3.48 3.95 4.32 2.43 2.52
Q1.17 3.08 3.20 3.14 3.42 3.87 4.30 2.52 2.63
Q1.18 4.12 4.38 4.17 3.91 4.72 4.43 3.82 3.84
Q1.19 4.20 4.44 4.28 3.79 4.71 4.45 3.87 4.09
Q1.20 3.16 3.40 3.23 3.47 3.99 4.26 2.57 2.59
Q1.21 4.25 4.50 4.32 3.88 4.79 4.44 3.96 4.06
Q1.22 4.01 4.27 4.08 3.84 4.60 4.37 3.64 3.74
Q1.23 3.56 3.75 3.69 3.70 4.67 4.19 3.00 2.94
Q1.24 4.11 4.47 4.16 3.77 4.64 4.45 3.83 3.78
Q1.25 3.79 4.08 3.87 3.85 4.54 4.49 3.24 3.42
Q1.26 2.95 3.45 2.81 3.63 4.00 4.36 2.49 1.94
Q1.27 3.16 3.58 3.14 3.87 4.25 4.40 2.52 2.25
Q1.28 3.31 3.64 3.32 3.71 4.29 4.37 2.79 2.55
Q1.29 4.03 4.20 4.07 3.80 4.56 4.53 3.70 3.90
Q2 0.90 0.93 0.91 0.97 0.92 1.18 0.77 0.92
Q2.Cluster 0.74 0.75 0.77 0.73 0.78 0.75 0.65 0.81
Q3 4.15 4.25 4.14 4.25 4.46 4.60 4.02 3.88
Q4 3.92 4.39 3.67 4.60 4.40 4.45 3.90 3.16
Q5 3.25 3.80 3.05 3.84 4.37 4.77 2.78 2.35
Q6 22.83 24.48 22.19 22.84 26.09 24.35 22.75 20.07
Q7.1 2.23 1.95 2.21 2.84 2.32 3.11 2.00 2.28
Q7.2 4.00 4.25 4.00 3.73 4.17 4.00 3.95 3.88
Q7.3 3.80 3.95 3.85 3.70 4.10 3.98 3.68 3.52
Q7.4 3.67 3.84 3.69 3.72 3.96 4.03 3.53 3.37
Q8 2.31 2.44 2.46 2.08 2.66 2.32 1.95 2.27
Q9.1 3.57 3.28 3.70 3.60 3.87 4.03 3.23 3.75
Q9.2 3.41 3.63 3.38 3.60 3.72 3.92 3.20 3.11
Q9.3 3.72 4.05 3.66 3.72 3.90 4.07 3.68 3.35
Q9.4 3.19 3.38 3.16 3.52 3.51 3.92 3.05 2.68
Q9.5 3.51 3.84 3.45 3.56 3.87 3.94 3.47 2.93
Q10 46.25 52.82 48.54 40.06 55.10 45.82 41.96 38.15
Q11 1.45 1.57 1.41 1.34 1.38 1.40 1.57 1.35
Q12 13.42 14.08 13.34 12.89 13.99 12.66 13.56 12.88
Q13 2.08 2.23 1.99 2.28 2.17 2.60 2.09 1.80
Q14 2.27 2.21 2.32 1.78 1.86 1.68 2.45 2.65
Q15 2.54 2.38 2.66 1.79 2.32 1.74 2.67 3.06
Q16 24.77 25.42 24.30 21.25 23.82 23.27 26.38 25.75
Q16.1 3.66 3.72 3.71 3.81 3.88 4.19 3.44 3.41
Q16.2 3.56 3.70 3.53 3.75 3.88 4.28 3.30 3.34
Q16.3 3.72 3.87 3.71 3.76 4.13 4.38 3.43 3.51
Q16.4 3.76 3.98 3.74 3.78 4.18 4.31 3.46 3.59
Q16.5 3.71 3.83 3.71 3.85 4.04 4.21 3.47 3.48
Q16.6 3.82 4.04 3.81 3.92 4.21 4.41 3.55 3.56
Q16.7 3.91 4.13 3.91 3.96 4.21 4.48 3.65 3.67
Q16.8 3.91 4.03 3.91 3.90 4.25 4.38 3.66 3.79
Q16.9 3.91 4.05 3.92 3.88 4.20 4.32 3.71 3.77
Q16.10 3.83 4.06 3.84 3.74 4.17 3.92 3.64 3.69
Q16.11 3.65 3.74 3.62 3.84 3.99 4.30 3.45 3.41
Q16.12 3.56 3.85 3.54 3.55 3.97 3.69 3.45 3.17
Q16.13 3.66 3.93 3.62 3.79 4.02 4.33 3.45 3.31
Q16.14 3.75 4.04 3.71 3.77 4.19 4.36 3.49 3.46
Q16.15 3.88 4.14 3.88 3.81 4.28 4.33 3.63 3.65
Q16.16 3.67 3.94 3.62 3.75 4.09 4.34 3.45 3.31
Q16.17 3.85 4.02 3.84 3.83 4.23 4.35 3.60 3.69
Q16.18 3.88 4.04 3.88 3.85 4.23 4.35 3.65 3.74
Q16.19 3.89 4.06 3.90 3.83 4.29 4.29 3.63 3.74
Q16.20 3.97 4.14 3.99 3.85 4.33 4.38 3.74 3.82
Q16.21 3.91 4.08 3.90 3.82 4.27 4.37 3.69 3.82
Q16.22 3.93 3.98 3.96 3.81 4.29 4.36 3.67 3.91
Q16.23 3.99 4.14 4.01 3.90 4.33 4.33 3.75 3.88
Q16.24 3.31 3.28 3.27 3.65 3.31 4.15 3.17 3.23
Q16.25 3.65 3.85 3.66 3.74 4.09 4.27 3.34 3.34
Q16.26 3.90 4.07 3.90 3.86 4.26 4.38 3.66 3.72
Q16.27 3.63 3.81 3.60 3.78 4.05 4.36 3.37 3.31
Q17 0.33 0.41 0.36 0.30 0.52 0.45 0.19 0.28
Q18 0.50 0.45 0.53 0.30 0.41 0.31 0.55 0.62
+

We can also “visualize” the segments using snake plots for each cluster. For example, we can plot the means of the profiling variables for each of our clusters to better visualize differences between segments. For better visualization we plot the standardized profiling variables.
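One way such a snake plot might be drawn (a sketch, reusing `profile_data` and `hclust_membership` from the profiling sketch above):

```r
# Standardize the profiling variables, then average per segment
standardized <- scale(profile_data)
segment_profiles <- sapply(sort(unique(hclust_membership)), function(s)
  colMeans(standardized[hclust_membership == s, , drop = FALSE]))

# One line ("snake") per segment across all profiling variables
matplot(segment_profiles, type = "l", lty = 1,
        col = seq_len(ncol(segment_profiles)),
        xlab = "Profiling variable", ylab = "Standardized mean")
legend("topright", legend = paste("Segment", sort(unique(hclust_membership))),
       col = seq_len(ncol(segment_profiles)), lty = 1, cex = 0.7)
```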

+
+ +

We can also compare the averages of the profiling variables of each segment relative to the average of the variables across the whole population. This can also help us better understand whether there are indeed clusters in our data (e.g. if all segments are much like the overall population, there may be no segments). For example, we can measure the ratio of the average for each cluster to the average of the population, minus 1 (i.e. avg(cluster)/avg(population) - 1), for each segment and variable:

+
Seg.1 Seg.2 Seg.3 Seg.4 Seg.5 Seg.6 Seg.7
Q1.1 -0.01 0.04 -0.05 0.09 0.10 -0.07 -0.05
Q1.2 -0.21 -0.05 0.30 -0.09 0.50 0.01 0.05
Q1.3 0.14 -0.03 0.13 0.26 0.36 -0.13 -0.24
Q1.4 0.10 0.02 -0.06 0.09 0.12 -0.08 -0.08
Q1.5 0.11 0.00 0.02 0.17 0.20 -0.09 -0.17
Q1.6 0.08 0.01 -0.02 0.19 0.13 -0.10 -0.11
Q1.7 0.13 0.00 0.02 0.23 0.21 -0.10 -0.20
Q1.8 -0.13 0.03 0.00 0.02 0.18 -0.04 0.05
Q1.9 0.16 -0.05 0.20 0.27 0.47 -0.11 -0.31
Q1.10 -0.06 -0.03 0.08 0.03 0.28 0.03 -0.09
Q1.11 -0.12 0.15 0.02 0.23 0.23 -0.40 0.17
Q1.12 -0.24 -0.04 0.29 -0.26 0.55 0.00 0.20
Q1.13 0.18 -0.05 0.18 0.28 0.45 -0.11 -0.32
Q1.14 0.13 -0.02 0.12 0.27 0.33 -0.14 -0.23
Q1.15 0.09 0.01 0.03 0.22 0.20 -0.13 -0.14
Q1.16 0.08 0.03 0.12 0.27 0.39 -0.22 -0.19
Q1.17 0.04 0.02 0.11 0.25 0.39 -0.18 -0.15
Q1.18 0.06 0.01 -0.05 0.15 0.07 -0.07 -0.07
Q1.19 0.06 0.02 -0.10 0.12 0.06 -0.08 -0.03
Q1.20 0.08 0.02 0.10 0.26 0.35 -0.19 -0.18
Q1.21 0.06 0.02 -0.09 0.13 0.04 -0.07 -0.05
Q1.22 0.07 0.02 -0.04 0.15 0.09 -0.09 -0.07
Q1.23 0.05 0.03 0.04 0.31 0.18 -0.16 -0.17
Q1.24 0.09 0.01 -0.08 0.13 0.08 -0.07 -0.08
Q1.25 0.08 0.02 0.02 0.20 0.18 -0.15 -0.10
Q1.26 0.17 -0.05 0.23 0.36 0.48 -0.16 -0.34
Q1.27 0.14 -0.01 0.22 0.35 0.40 -0.20 -0.29
Q1.28 0.10 0.00 0.12 0.30 0.32 -0.16 -0.23
Q1.29 0.04 0.01 -0.06 0.13 0.12 -0.08 -0.03
Q2 0.03 0.01 0.08 0.03 0.31 -0.14 0.02
Q2.Cluster 0.01 0.04 -0.02 0.05 0.01 -0.13 0.09
Q3 0.02 0.00 0.02 0.07 0.11 -0.03 -0.07
Q4 0.12 -0.06 0.18 0.12 0.14 0.00 -0.19
Q5 0.17 -0.06 0.18 0.34 0.47 -0.15 -0.28
Q6 0.07 -0.03 0.00 0.14 0.07 0.00 -0.12
Q7.1 -0.13 -0.01 0.27 0.04 0.39 -0.10 0.02
Q7.2 0.06 0.00 -0.07 0.04 0.00 -0.01 -0.03
Q7.3 0.04 0.01 -0.03 0.08 0.05 -0.03 -0.07
Q7.4 0.04 0.00 0.01 0.08 0.10 -0.04 -0.08
Q8 0.06 0.06 -0.10 0.15 0.00 -0.15 -0.02
Q9.1 -0.08 0.03 0.01 0.08 0.13 -0.10 0.05
Q9.2 0.06 -0.01 0.06 0.09 0.15 -0.06 -0.09
Q9.3 0.09 -0.02 0.00 0.05 0.09 -0.01 -0.10
Q9.4 0.06 -0.01 0.10 0.10 0.23 -0.04 -0.16
Q9.5 0.10 -0.02 0.01 0.10 0.12 -0.01 -0.16
Q10 0.14 0.05 -0.13 0.19 -0.01 -0.09 -0.18
Q11 0.08 -0.03 -0.07 -0.05 -0.03 0.09 -0.07
Q12 0.05 -0.01 -0.04 0.04 -0.06 0.01 -0.04
Q13 0.07 -0.04 0.09 0.04 0.25 0.00 -0.14
Q14 -0.03 0.02 -0.21 -0.18 -0.26 0.08 0.17
Q15 -0.06 0.05 -0.30 -0.09 -0.32 0.05 0.20
Q16 0.03 -0.02 -0.14 -0.04 -0.06 0.07 0.04
Q16.1 0.02 0.01 0.04 0.06 0.15 -0.06 -0.07
Q16.2 0.04 -0.01 0.05 0.09 0.20 -0.07 -0.06
Q16.3 0.04 0.00 0.01 0.11 0.18 -0.08 -0.06
Q16.4 0.06 -0.01 0.01 0.11 0.15 -0.08 -0.04
Q16.5 0.03 0.00 0.04 0.09 0.14 -0.06 -0.06
Q16.6 0.06 0.00 0.03 0.10 0.15 -0.07 -0.07
Q16.7 0.06 0.00 0.01 0.08 0.15 -0.07 -0.06
Q16.8 0.03 0.00 0.00 0.09 0.12 -0.06 -0.03
Q16.9 0.03 0.00 -0.01 0.07 0.10 -0.05 -0.03
Q16.10 0.06 0.00 -0.02 0.09 0.02 -0.05 -0.04
Q16.11 0.02 -0.01 0.05 0.09 0.18 -0.05 -0.07
Q16.12 0.08 0.00 0.00 0.11 0.04 -0.03 -0.11
Q16.13 0.07 -0.01 0.03 0.10 0.18 -0.06 -0.10
Q16.14 0.08 -0.01 0.01 0.12 0.16 -0.07 -0.08
Q16.15 0.07 0.00 -0.02 0.10 0.12 -0.06 -0.06
Q16.16 0.07 -0.01 0.02 0.12 0.18 -0.06 -0.10
Q16.17 0.05 0.00 -0.01 0.10 0.13 -0.06 -0.04
Q16.18 0.04 0.00 -0.01 0.09 0.12 -0.06 -0.04
Q16.19 0.04 0.00 -0.02 0.10 0.10 -0.07 -0.04
Q16.20 0.04 0.00 -0.03 0.09 0.10 -0.06 -0.04
Q16.21 0.04 0.00 -0.02 0.09 0.12 -0.06 -0.02
Q16.22 0.01 0.01 -0.03 0.09 0.11 -0.07 -0.01
Q16.23 0.04 0.01 -0.02 0.09 0.08 -0.06 -0.03
Q16.24 -0.01 -0.01 0.10 0.00 0.25 -0.04 -0.02
Q16.25 0.06 0.00 0.03 0.12 0.17 -0.08 -0.08
Q16.26 0.05 0.00 -0.01 0.09 0.12 -0.06 -0.04
Q16.27 0.05 -0.01 0.04 0.12 0.20 -0.07 -0.09
Q17 0.24 0.08 -0.09 0.56 0.36 -0.44 -0.17
Q18 -0.10 0.06 -0.39 -0.18 -0.38 0.11 0.24
+

Questions

+
    +
  1. What do the numbers in the last table indicate? Which numbers are the most informative?
  2. +
  3. Based on the tables and snake plot above, what are some key features of each of the segments of this solution?
  4. +
+

Answers

+
    +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
+
+
+

Step 8: Robustness Analysis

+

We should also consider the robustness of our analysis as we change the clustering method and parameters. Once we are comfortable with the solution we can finally answer our first business questions:

+

Questions

+
    +
  1. How many segments are there in our market? How many do you select and why? Try a few and explain your final choice based on a) statistical arguments, b) interpretation arguments, and c) business arguments (you need to consider all three types of arguments)
  2. +
  3. Can you describe the segments you found based on the profiles?
  4. +
  5. What if you change the number of factors and in general you iterate the whole analysis? Iterations are key in data science.
  6. +
  7. Can you now answer the Boats case questions? What business decisions do you recommend to this company based on your analysis?
  8. +
+

Answers

+
    +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
+
+

+
+
+
+

Part 3: Purchase Drivers

+

We will now use classification analysis methods to understand the key purchase drivers for boats (a similar analysis can be done for recommendation drivers). For simplicity we do not follow the “generic” steps of classification discussed in that reading, and only consider the classification and purchase drivers analysis for the segments we found above.

+

We are interested in understanding the purchase drivers, hence our dependent variable is column 82 of the Boats data (Q18) - why is that? We will use only the subquestions of Question 16 of the case for now, and also select some of the parameters for this part of the analysis:

+
# Please ENTER the class (dependent) variable: Please use numbers, not
+# column names! e.g. 82 uses the 82nd column as the dependent variable.  YOU
+# NEED TO MAKE SURE THAT THE DEPENDENT VARIABLE TAKES ONLY 2 VALUES: 0 and
+# 1!!!
+dependent_variable = 82
+
+# Please ENTER the attributes to use as independent variables Please use
+# numbers, not column names! e.g. c(1:5, 7, 8) uses columns 1,2,3,4,5,7,8
+independent_variables = c(54:80)  # use 54-80 for boats
+
+# Please ENTER the profit/cost values for the correctly and wrong classified
+# data:
+actual_1_predict_1 = 100
+actual_1_predict_0 = -75
+actual_0_predict_1 = -50
+actual_0_predict_0 = 0
+
+# Please ENTER the probability threshold above which an observation is
+# predicted as class 1:
+Probability_Threshold = 50  # between 1 and 99%
+
+# Please ENTER the percentage of data used for estimation
+estimation_data_percent = 80
+validation_data_percent = 10
+
+# Please enter 0 if you want to 'randomly' split the data in estimation and
+# validation/test
+random_sampling = 0
+
+# Tree parameter PLEASE ENTER THE Tree (CART) complexity control cp (e.g.
+# 0.001 to 0.02, depending on the data)
+CART_cp = 0.01
+
+# Please enter the minimum size of a segment for the analysis to be done
+# only for that segment
+min_segment = 100
+

Questions

+
    +
  1. How do you select the profit/cost values for the analysis? Do the values 100, -75, -50 and 0 above relate to the final business decisions? How?
  2. +
  3. What does the probability threshold (here 50%) affect? Does it relate to the final business decisions? How?
  4. +
+

Answers

+
    +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
+

We will use two classification trees and logistic regression. You can select the “complexity” control for one of the classification trees in the code chunk of the raw .Rmd file here:

+
CART_control = 0.001
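For orientation, the kind of models being compared could look as follows. This is only a sketch: `estimation_data` and `test_data` are hypothetical names for the estimation and test splits, each assumed to contain the 0/1 purchase variable (here called `purchase`) plus the Q16 sub-questions; the course code may differ.

```r
library(rpart)

# "Small" classification tree (larger cp prunes more aggressively)
small_tree <- rpart(purchase ~ ., data = estimation_data, method = "class",
                    control = rpart.control(cp = CART_cp))

# "Large" classification tree (smaller complexity parameter, e.g. CART_control = 0.001)
large_tree <- rpart(purchase ~ ., data = estimation_data, method = "class",
                    control = rpart.control(cp = CART_control))

# Logistic regression on the same independent variables
logistic_model <- glm(purchase ~ ., data = estimation_data,
                      family = binomial(link = "logit"))

# Predicted purchase probabilities on the held-out data
test_probabilities <- predict(logistic_model, newdata = test_data, type = "response")
```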
+

Question

+
    +
  1. How can this parameter affect the final results? What business implications can this parameter choice have?
  2. +
+

Answer

+
    +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
+

For example, this is a “small tree” classification:

+

+

After also running the large tree and the logistic regression classifiers, we can then check how much “weight” these three methods put on the different purchase drivers (Q16 of the survey):

+
CART 1 CART 2 Logistic Regr.
Q16.1 0.00000000 0.011019687 0.09090909
Q16.2 -1.00000000 -0.899454252 -0.77272727
Q16.3 0.17810761 0.399446101 0.04545455
Q16.4 -0.27298417 -0.765357150 -0.20454545
Q16.5 0.19480519 0.657793803 0.25000000
Q16.6 -0.13206483 -0.103951708 -0.29545455
Q16.7 0.00000000 -0.054540259 -0.02272727
Q16.8 0.07231261 0.658198414 0.40909091
Q16.9 0.00000000 0.005692312 0.47727273
Q16.10 0.00000000 0.491786085 0.75000000
Q16.11 -0.32096475 -0.230192726 -0.45454545
Q16.12 0.14048073 0.736723792 0.72727273
Q16.13 -0.15083876 -0.335896900 -0.54545455
Q16.14 -0.21363429 -0.202656708 -0.31818182
Q16.15 0.13400696 0.096108462 0.29545455
Q16.16 -0.57551783 -0.640472508 -1.00000000
Q16.17 -0.50419287 -0.846807210 -0.38636364
Q16.18 -0.09940166 -0.404064789 -0.22727273
Q16.19 0.28629241 0.297394573 0.40909091
Q16.20 0.26520309 0.299989042 0.06818182
Q16.21 0.95306972 1.000000000 0.50000000
Q16.22 0.62914386 0.739418001 0.20454545
Q16.23 -0.43657168 -0.412043025 -0.11363636
Q16.24 0.51894572 0.372182710 0.34090909
Q16.25 -0.17068646 -0.938098365 -0.25000000
Q16.26 0.34208441 0.245339536 0.36363636
Q16.27 0.00000000 0.719935467 0.40909091
+

Finally, if we were to use the estimated classification models on the test data, we would get the following profit curves (see the raw .Rmd file to select the business profit parameters).
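A rough sketch of the idea behind a profit curve (not the course's actual code): observations are ranked by predicted purchase probability, and for every cut-off we compute the total profit of "targeting" the top x% using the profit/cost parameters entered above. `test_probabilities` and `test_actuals` are hypothetical names for the predicted probabilities and actual 0/1 outcomes on the test split.

```r
profit_curve <- function(probabilities, actuals) {
  # Sort observations from most to least likely to purchase
  ordering <- order(probabilities, decreasing = TRUE)
  sorted_actuals <- actuals[ordering]
  n <- length(sorted_actuals)

  # Profit per observation if we predict 1 (target) or 0 (do not target) for it
  profit_if_targeted     <- ifelse(sorted_actuals == 1, actual_1_predict_1, actual_0_predict_1)
  profit_if_not_targeted <- ifelse(sorted_actuals == 1, actual_1_predict_0, actual_0_predict_0)

  # Total profit when the top k observations are targeted and the rest are not
  cumulative_profit <- cumsum(profit_if_targeted) +
    (sum(profit_if_not_targeted) - cumsum(profit_if_not_targeted))
  percentile <- 100 * (1:n) / n

  plot(percentile, cumulative_profit, type = "l",
       xlab = "Percentile of test data targeted", ylab = "Total profit")
  c(best_percentile = percentile[which.max(cumulative_profit)],
    max_profit = max(cumulative_profit))
}

# e.g. for the logistic regression probabilities from the previous sketch
profit_curve(test_probabilities, test_actuals)
```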

+

The profit curve using the small classification tree:

+
+ +

The profit curve using the large classification tree:

+
+ +

The profit curve using the logistic regression classifier:

+
+ +

These are the maximum total profits achieved in the test data using the three classifiers (without any segment-specific analysis so far).

+
Percentile Profit
Small Tree 100.00 4650
Large Tree 95.04 4675
Logistic Regression 98.58 4850
+
+

+
+
+

Part 4: Business Decisions

+

We will now take the results of the overall process (Parts 1-3) and, based on them, make business decisions (e.g. answer the questions of the Boats case study). Specifically, we will study the purchase drivers for each segment we found and consider the profit curves of the developed models on our test data.

+

Final Solution: Segment Specific Analysis

+

Let’s first see how many observations we have in each of the segments we selected above:

+
Segment 1 Segment 2 Segment 3 Segment 4 Segment 5 Segment 6 Segment 7
Number of Obs. 365 921 201 252 119 605 350
+

This is our final segment-specific analysis and solution. We can now study the purchase drivers (average answers to Q16 of the survey) for each segment. They are as follows:

+
Segment 1 Segment 2 Segment 3 Segment 4 Segment 5 Segment 6 Segment 7
Q16.2 -0.32 -0.79 -0.27 -0.29 -0.15 -0.17 -0.80
Q16.3 0.04 -0.10 0.06 0.21 0.44 -0.04 -0.20
Q16.4 -0.54 -0.34 0.21 -0.08 0.76 -0.04 0.37
Q16.5 0.68 0.41 -0.45 0.58 -0.65 -0.25 0.57
Q16.6 -0.36 -0.17 -0.24 -0.17 0.24 0.25 -0.17
Q16.7 -0.04 0.52 0.03 -0.17 -0.09 -0.25 -0.40
Q16.8 0.32 0.14 0.03 -0.29 0.24 0.29 0.87
Q16.9 0.43 0.10 0.06 0.25 0.71 0.58 0.13
Q16.10 0.14 0.90 -0.30 0.17 0.44 -0.04 0.70
Q16.11 -0.04 -0.45 0.67 -0.67 -0.15 -0.58 0.17
Q16.12 0.36 0.59 1.00 0.71 1.00 0.17 -0.27
Q16.13 -0.43 -0.38 -0.33 -0.12 -0.79 0.00 -0.03
Q16.14 -0.25 -0.62 0.06 -0.33 0.68 -0.08 -0.40
Q16.15 0.79 -0.14 0.45 -0.04 -0.26 0.25 0.40
Q16.16 0.39 -0.62 -0.33 -1.00 -0.09 -0.46 -0.67
Q16.17 0.14 -0.21 -0.76 -0.21 -0.15 -0.42 0.03
Q16.18 -0.68 0.21 -0.45 -0.17 -0.47 -0.29 -0.30
Q16.19 -0.54 0.41 -0.24 0.67 0.29 1.00 0.07
Q16.20 0.21 -0.17 0.33 -0.50 -0.18 0.04 -0.10
Q16.21 0.39 1.00 0.30 -0.62 -0.35 0.42 0.60
Q16.22 -0.18 -0.07 0.06 0.25 0.68 0.96 -0.33
Q16.23 -0.32 0.17 -0.18 0.46 0.59 -0.79 0.10
Q16.24 0.25 0.83 -0.52 0.08 -0.15 0.12 0.63
Q16.25 -0.36 -0.83 0.06 0.75 -0.41 -0.83 1.00
Q16.26 0.54 0.34 -0.06 0.58 -0.18 0.67 0.00
Q16.27 1.00 0.34 0.03 0.96 -0.38 0.33 0.67
+

The profit curves for the test data in this case are as follows. The profit curve using the small classification tree is:

+
+ +

The profit curve using the large classification tree is:

+
+ +

The profit curve using the logistic regression classifier:

+
+ +

These are the maximum total profits achieved in the test data using the three classifiers, with the selected market segmentation solution.

+
Percentile Profit
Small Tree 100.00 4650
Large Tree 100.00 4650
Logistic Regression 87.94 5225
+

Questions:

+
    +
  1. What are the main purchase drivers for the segments and solution you found?
  2. +
  3. How different are the purchase drivers you find when you use segmentation versus when you study all customers as “one segment”? Why?
  4. +
  5. Based on the overall analysis, what segmentation would you choose?
  6. +
  7. What is the business profit the company can achieve (as measured with the test data) based on your solution?
  8. +
  9. What business decisions can the company make based on this analysis?
  10. +
+

Answers:

+
    +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
  • +
+

You have now completed your first market segmentation project. Do you have data from another survey you can use with this report now?

+

Extra question: explore and report a new segmentation analysis…

+
+ + + +
+
+ +
+ + + + + + + + diff --git a/CourseSessions/InClassProcess/nzf_MarketSegmentationProcessInClass.Rmd b/CourseSessions/InClassProcess/nzf_MarketSegmentationProcessInClass.Rmd new file mode 100644 index 00000000..d41a65b7 --- /dev/null +++ b/CourseSessions/InClassProcess/nzf_MarketSegmentationProcessInClass.Rmd @@ -0,0 +1,1064 @@ +--- +title: "A Market Segmentation and Purchase Drivers Process" +author: "T. Evgeniou" +output: + html_document: + css: ../../AnalyticsStyles/default.css + theme: paper + toc: yes + toc_float: + collapsed: no + smooth_scroll: yes + pdf_document: + includes: + in_header: ../../AnalyticsStyles/default.sty +always_allow_html: yes +--- + +> **IMPORTANT**: Please make sure you create a copy of this file with a customized name, so that your work (e.g. answers to the questions) is not over-written when you pull the latest content from the course github. +This is a **template process for market segmentation based on survey data**, using the [Boats cases A](http://inseaddataanalytics.github.io/INSEADAnalytics/Boats-A-prerelease.pdf) and [B](http://inseaddataanalytics.github.io/INSEADAnalytics/Boats-B-prerelease.pdf). + +All material and code is available at the [INSEAD Data Analytics for Business](http://inseaddataanalytics.github.io/INSEADAnalytics/) website and github. Before starting, make sure you have pulled the [course files](https://github.com/InseadDataAnalytics/INSEADAnalytics) on your github repository. As always, you can use the `help` command in Rstudio to find out about any R function (e.g. type `help(list.files)` to learn what the R function `list.files` does). + + +
\clearpage + +# The Business Questions + +This process can be used as a (starting) template for projects like the one described in the [Boats cases A](http://inseaddataanalytics.github.io/INSEADAnalytics/Boats-A-prerelease.pdf) and [B](http://inseaddataanalytics.github.io/INSEADAnalytics/Boats-B-prerelease.pdf). For example (but not only), in this case some of the business questions were: + +- What are the main purchase drivers of the customers (and prospects) of this company? + +- Are there different market segments? Which ones? Do the purchase drivers differ across segments? + +- What (possibly market segment specific) product development or brand positioning strategy should the company follow in order to increase its sales? + +See for example some of the analysis of this case in these slides: part 1 and part 2. + +
\clearpage + +# The Process + +The "high level" process template is split in 3 parts, corresponding to the course sessions 3-4, 5-6, and an optional last part: + +1. *Part 1*: We use some of the survey questions (e.g. in this case the first 29 "attitude" questions) to find **key customer descriptors** ("factors") using *dimensionality reduction* techniques described in the [Dimensionality Reduction](http://inseaddataanalytics.github.io/INSEADAnalytics/CourseSessions/Sessions23/FactorAnalysisReading.html) reading of Sessions 3-4. + +2. *Part 2*: We use the selected customer descriptors to **segment the market** using *cluster analysis* techniques described in the [Cluster Analysis ](http://inseaddataanalytics.github.io/INSEADAnalytics/CourseSessions/Sessions45/ClusterAnalysisReading.html) reading of Sessions 5-6. + +3. *Part 3*: For the market segments we create, we will use *classification analysis* to classify people based on whether or not they have purchased a product and find what are the **key purchase drivers per segment**. For this part we will use [classification analysis ](http://inseaddataanalytics.github.io/INSEADAnalytics/CourseSessions/Sessions67/ClassificationAnalysisReading.html) techniques. + +Finally, we will use the results of this analysis to make business decisions e.g. about brand positioning, product development, etc depending on our market segments and key purchase drivers we find at the end of this process. + + +```{r setuplibraries, echo=FALSE, message=FALSE} +suppressWarnings(source("../../AnalyticsLibraries/library.R")) +# Package options +suppressWarnings(ggthemr('fresh')) # ggplot theme +opts_knit$set(progress=FALSE, verbose=FALSE) +opts_chunk$set(echo=FALSE, fig.align="center", fig.width=10, fig.height=6.35, results="asis") +options(knitr.kable.NA = '') +``` + +
\clearpage + +# The Data + +First we load the data to use (see the raw .Rmd file to change the data file as needed): + +```{r setupdata1E, echo=TRUE, tidy=TRUE} +# Please ENTER the name of the file with the data used. The file should be a .csv with one row per observation (e.g. person) and one column per attribute. Do not add .csv at the end, make sure the data are numeric. +datafile_name = "../Sessions23/data/Boats.csv" + +# Please enter the minimum number below which you would like not to print - this makes the readability of the tables easier. Default values are either 10e6 (to print everything) or 0.5. Try both to see the difference. +MIN_VALUE = 0.5 + +# Please enter the maximum number of observations to show in the report and slides. +# DEFAULT is 10. If the number is large the report may be slow. +max_data_report = 10 +``` + +```{r} +ProjectData <- read.csv(datafile_name) +ProjectData <- data.matrix(ProjectData) +ProjectData_INITIAL <- ProjectData + +``` + +
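Before moving on, it can help to confirm that the data loaded as expected. The following optional sanity check is not part of the original template (a minimal sketch only); `data.matrix` silently converts any non-numeric column, so a quick look at the dimensions and at possible `NA` values is worthwhile:

```{r echo=TRUE, eval=FALSE, tidy=TRUE}
# Optional sanity checks on the loaded data (uses the ProjectData object created above)
dim(ProjectData)          # expect one row per respondent and one column per survey question
sum(is.na(ProjectData))   # NAs here often mean some columns were not numeric in the .csv
summary(ProjectData[, 1:min(5, ncol(ProjectData))])  # descriptive stats for the first few columns
```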
\clearpage + +# Part 1: Key Customer Characteristics + +The code used here is along the lines of the code in the session 3-4 reading [FactorAnalysisReading.Rmd](https://github.com/InseadDataAnalytics/INSEADAnalytics/blob/master/CourseSessions/Sessions23/FactorAnalysisReading.Rmd). We follow the process described in the [Dimensionality Reduction ](http://inseaddataanalytics.github.io/INSEADAnalytics/CourseSessions/Sessions23/FactorAnalysisReading.html) reading. + +In this part we also become familiar with: + +1. Some visualization tools; +2. Principal Component Analysis and Factor Analysis; +3. Introduction to machine learning methods; + +(All user inputs for this part should be selected in the code chunk in the raw .Rmd file) + +```{r setupfactor, echo=TRUE, tidy=TRUE} +# Please ENTER then original raw attributes to use. +# Please use numbers, not column names, e.g. c(1:5, 7, 8) uses columns 1,2,3,4,5,7,8 +factor_attributes_used = c(1:30) + +# Please ENTER the selection criterions for the factors to use. +# Choices: "eigenvalue", "variance", "manual" +factor_selectionciterion = "manual" + +# Please ENTER the desired minumum variance explained +# (Only used in case "variance" is the factor selection criterion used). +minimum_variance_explained = 65 # between 1 and 100 + +# Please ENTER the number of factors to use +# (Only used in case "manual" is the factor selection criterion used). +manual_numb_factors_used = 15 + +# Please ENTER the rotation eventually used (e.g. "none", "varimax", "quatimax", "promax", "oblimin", "simplimax", and "cluster" - see help(principal)). Default is "varimax" +rotation_used = "varimax" + +``` + +```{r} +factor_attributes_used <- intersect(factor_attributes_used, 1:ncol(ProjectData)) +ProjectDataFactor <- ProjectData[,factor_attributes_used] +ProjectDataFactor <- ProjectData <- data.matrix(ProjectDataFactor) +``` + +## Steps 1-2: Check the Data + +Start by some basic visual exploration of, say, a few data: + +```{r} +rownames(ProjectDataFactor) <- paste0("Obs.", sprintf("%02i", 1:nrow(ProjectDataFactor))) +iprint.df(t(head(round(ProjectDataFactor, 2), max_data_report))) +``` + +The data we use here have the following descriptive statistics: + +```{r} +iprint.df(round(my_summary(ProjectDataFactor), 2)) +``` + +## Step 3: Check Correlations + +This is the correlation matrix of the customer responses to the `r ncol(ProjectDataFactor)` attitude questions - which are the only questions that we will use for the segmentation (see the case): + +```{r} +thecor = round(cor(ProjectDataFactor),2) +iprint.df(round(thecor,2), scale=TRUE) +``` + +**Questions** + +1. Do you see any high correlations between the responses? Do they make sense? +2. What do these correlations imply? + +**Answers:** + +* +1. assuming the correlations of 0.5 is considered high. +3&9 +6&7 +9&26 +* +* +2. Correlations imply that the you can group +* +* +* +* +* +* +* + +## Step 4: Choose number of factors + +Clearly the survey asked many redundant questions (can you think some reasons why?), so we may be able to actually "group" these 29 attitude questions into only a few "key factors". This not only will simplify the data, but will also greatly facilitate our understanding of the customers. 
+ +To do so, we use methods called [Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) and [factor analysis](https://en.wikipedia.org/wiki/Factor_analysis) as also discussed in the [Dimensionality Reduction readings](http://inseaddataanalytics.github.io/INSEADAnalytics/CourseSessions/Sessions23/FactorAnalysisReading.html). We can use two different R commands for this (they make slightly different information easily available as output): the command `principal` (check `help(principal)` from R package [psych](http://personality-project.org/r/psych/)), and the command `PCA` from R package [FactoMineR](http://factominer.free.fr) - there are more packages and commands for this, as these methods are very widely used. + +```{r} +# Here is how the `principal` function is used +UnRotated_Results<-principal(ProjectDataFactor, nfactors=ncol(ProjectDataFactor), rotate="none",score=TRUE) +UnRotated_Factors<-round(UnRotated_Results$loadings,2) +UnRotated_Factors<-as.data.frame(unclass(UnRotated_Factors)) +colnames(UnRotated_Factors)<-paste("Comp",1:ncol(UnRotated_Factors),sep="") +``` + +```{r} +# Here is how we use the `PCA` function +Variance_Explained_Table_results<-PCA(ProjectDataFactor, graph=FALSE) +Variance_Explained_Table<-Variance_Explained_Table_results$eig +Variance_Explained_Table_copy<-Variance_Explained_Table + +rownames(Variance_Explained_Table) <- paste("Component", 1:nrow(Variance_Explained_Table), sep=" ") +colnames(Variance_Explained_Table) <- c("Eigenvalue", "Pct of explained variance", "Cumulative pct of explained variance") +``` + +Let's look at the **variance explained** as well as the **eigenvalues** (see session readings): + +```{r} +iprint.df(round(Variance_Explained_Table, 2)) +``` + +```{r} +eigenvalues <- Variance_Explained_Table[, "Eigenvalue"] +df <- cbind(as.data.frame(eigenvalues), c(1:length(eigenvalues)), rep(1, length(eigenvalues))) +colnames(df) <- c("eigenvalues", "components", "abline") +iplot.df(melt(df, id="components")) +``` + +**Questions:** + +1. Can you explain what this table and the plot are? What do they indicate? What can we learn from these? +2. Why does the plot have this specific shape? Could the plotted line be increasing? +3. What characteristics of these results would we prefer to see? Why? + +**Answers** + +* +* +* +* +* +* +* +* +* +* + +## Step 5: Interpret the factors + +Let's now see how the "top factors" look like. + +```{r} +if (factor_selectionciterion == "eigenvalue") + factors_selected = sum(Variance_Explained_Table_copy[,1] >= 1) +if (factor_selectionciterion == "variance") + factors_selected = 1:head(which(Variance_Explained_Table_copy[,"cumulative percentage of variance"]>= minimum_variance_explained),1) +if (factor_selectionciterion == "manual") + factors_selected = manual_numb_factors_used +``` + +To better visualize them, we will use what is called a "rotation". There are many rotations methods. In this case we selected the `r rotation_used` rotation. 
For our data, the `r factors_selected` selected factors look as follows after this rotation: + +```{r} +Rotated_Results<-principal(ProjectDataFactor, nfactors=max(factors_selected), rotate=rotation_used,score=TRUE) +Rotated_Factors<-round(Rotated_Results$loadings,2) +Rotated_Factors<-as.data.frame(unclass(Rotated_Factors)) +colnames(Rotated_Factors)<-paste("Comp.",1:ncol(Rotated_Factors),sep="") + +sorted_rows <- sort(Rotated_Factors[,1], decreasing = TRUE, index.return = TRUE)$ix +Rotated_Factors <- Rotated_Factors[sorted_rows,] + +iprint.df(Rotated_Factors, scale=TRUE) +``` + +To better visualize and interpret the factors we often "suppress" loadings with small values, e.g. with absolute values smaller than 0.5. In this case our factors look as follows after suppressing the small numbers: + +```{r} +Rotated_Factors_thres <- Rotated_Factors +Rotated_Factors_thres[abs(Rotated_Factors_thres) < MIN_VALUE]<-NA +colnames(Rotated_Factors_thres)<- colnames(Rotated_Factors) +rownames(Rotated_Factors_thres)<- rownames(Rotated_Factors) + +iprint.df(Rotated_Factors_thres, scale=TRUE) +``` + +**Questions** + +1. What do the first couple of factors mean? Do they make business sense? +2. How many factors should we choose for this data/customer base? Please try a few and explain your final choice based on a) statistical arguments, b) on interpretation arguments, c) on business arguments (**you need to consider all three types of arguments**) +3. How would you interpret the factors you selected? +4. What lessons about data science do you learn when doing this analysis? Please comment. + +**Answers** + +* +* +* +* +* +* +* +* +* +* + +## Step 6: Save factor scores + +We can now either replace all initial variables used in this part with the factors scores or just select one of the initial variables for each of the selected factors in order to represent that factor. Here is how the factor scores are for the first few respondents: + +```{r} +NEW_ProjectData <- round(Rotated_Results$scores[,1:factors_selected,drop=F],2) +colnames(NEW_ProjectData)<-paste("DV (Factor)",1:ncol(NEW_ProjectData),sep=" ") + +iprint.df(t(head(NEW_ProjectData, 10)), scale=TRUE) +``` + +**Questions** + +1. Can you describe some of the people using the new derived variables (factor scores)? +2. Which of the 29 initial variables would you select to represent each of the factors you selected? + +**Answers** + +* +* +* +* +* +* +* +* +* +* + +
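One possible starting point for the last question above is to pick, for each factor, the original question with the largest absolute loading. This is only a sketch that reuses the `Rotated_Factors` table computed above (ties and cross-loading questions still require judgment):

```{r echo=TRUE, eval=FALSE, tidy=TRUE}
# For each factor (column), find the original question (row) with the largest absolute loading
representative_rows <- apply(abs(Rotated_Factors), 2, which.max)
data.frame(Factor   = colnames(Rotated_Factors),
           Question = rownames(Rotated_Factors)[representative_rows],
           Loading  = sapply(seq_along(representative_rows),
                             function(j) Rotated_Factors[representative_rows[j], j]))
```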
\clearpage + +# Part 2: Customer Segmentation + +The code used here is along the lines of the code in the session 5-6 reading [ClusterAnalysisReading.Rmd](https://github.com/InseadDataAnalytics/INSEADAnalytics/blob/master/CourseSessions/Sessions45/ClusterAnalysisReading.Rmd). We follow the process described in the [Cluster Analysis ](http://inseaddataanalytics.github.io/INSEADAnalytics/CourseSessions/Sessions45/ClusterAnalysisReading.html) reading. + +In this part we also become familiar with: + +1. Some clustering Methods; +2. How these tools can be used in practice. + +A key family of methods used for segmentation is what is called **clustering methods**. This is a very important problem in statistics and **machine learning**, used in all sorts of applications such as in [Amazon's pioneer work on recommender systems](http://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf). There are many *mathematical methods* for clustering. We will use two very standard methods, **hierarchical clustering** and **k-means**. While the "math" behind all these methods can be complex, the R functions used are relatively simple to use, as we will see. + +(All user inputs for this part should be selected in the code chunk in the raw .Rmd file) + +```{r setupcluster, echo=TRUE, tidy=TRUE} +# Please ENTER then original raw attributes to use for the segmentation (the "segmentation attributes") +# Please use numbers, not column names, e.g. c(1:5, 7, 8) uses columns 1,2,3,4,5,7,8 +segmentation_attributes_used = c(28,25,27,14,20,8,3,12,13,5,9,11,2,30,24) #c(10,19,5,12,3) + +# Please ENTER then original raw attributes to use for the profiling of the segments (the "profiling attributes") +# Please use numbers, not column names, e.g. c(1:5, 7, 8) uses columns 1,2,3,4,5,7,8 +profile_attributes_used = c(2:82) + +# Please ENTER the number of clusters to eventually use for this report +numb_clusters_used = 7 # for boats possibly use 5, for Mall_Visits use 3 + +# Please enter the method to use for the segmentation: +profile_with = "hclust" # "hclust" or "kmeans" + +# Please ENTER the distance metric eventually used for the clustering in case of hierarchical clustering +# (e.g. "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski" - see help(dist)). +# DEFAULT is "euclidean" +distance_used = "euclidean" + +# Please ENTER the hierarchical clustering method to use (options are: +# "ward", "single", "complete", "average", "mcquitty", "median" or "centroid"). +# DEFAULT is "ward" +hclust_method = "ward.D" + +# Please ENTER the kmeans clustering method to use (options are: +# "Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"). +# DEFAULT is "Lloyd" +kmeans_method = "Lloyd" + +``` + +```{r} +# Same as the initial data +ProjectData <- ProjectData_INITIAL + +segmentation_attributes_used <- intersect(segmentation_attributes_used, 1:ncol(ProjectData)) +profile_attributes_used <- intersect(profile_attributes_used, 1:ncol(ProjectData)) + +ProjectData_segment <- ProjectData[,segmentation_attributes_used] +ProjectData_profile <- ProjectData[,profile_attributes_used] + +ProjectData_scaled <- apply(ProjectData, 2, function(r) if (sd(r)!=0) (r-mean(r))/sd(r) else 0*r) +``` + +## Steps 1-2: Explore the data + +(This was done above, so we skip it) + +## Step 3. Select Segmentation Variables + +For simplicity will use one representative question for each of the factor we found in Part 1 (we can also use the "factor scores" for each respondent) to represent our survey respondents. 
These are the `segmentation_attributes_used` selected below. We can choose the question with the highest absolute factor loading for each factor. For example, when we use 5 factors with the varimax rotation we can select questions Q.1.9 (I see my boat as a status symbol), Q1.18 (Boating gives me a feeling of adventure), Q1.4 (I only consider buying a boat from a reputable brand), Q1.11 (I tend to perform minor boat repairs and maintenance on my own) and Q1.2 (When buying a boat getting the lowest price is more important than the boat brand) - try it. These are columns 10, 19, 5, 12, and 3, respectively of the data matrix `Projectdata`. +## Step 4: Define similarity measure + +We need to define a distance metric that measures how different people (observations in general) are from each other. This can be an important choice. Here are the differences between the observations using the distance metric we selected: + +```{r} +euclidean_pairwise <- as.matrix(dist(head(ProjectData_segment, max_data_report), method="euclidean")) +euclidean_pairwise <- euclidean_pairwise*lower.tri(euclidean_pairwise) + euclidean_pairwise*diag(euclidean_pairwise) + 10e10*upper.tri(euclidean_pairwise) +euclidean_pairwise[euclidean_pairwise==10e10] <- NA +rownames(euclidean_pairwise) <- colnames(euclidean_pairwise) <- sprintf("Obs.%02d", 1:max_data_report) + +iprint.df(round(euclidean_pairwise)) +``` + +## Step 5: Visualize Pair-wise Distances + +We can see the histogram of, say, the first 2 variables (can you change the code chunk in the raw .Rmd file to see other variables?) + +```{r} +variables_to_plot = 1:2 +do.call(iplot.grid, lapply(variables_to_plot, function(n){ + iplot.hist(ProjectData_segment[, n], breaks=5, xlab = paste("Variable", n)) +})) +``` + +or the histogram of all pairwise distances for the `r distance_used` distance: + +```{r} +Pairwise_Distances <- dist(ProjectData_segment, method = distance_used) +iplot.hist(Pairwise_Distances, breaks=10) +``` + +## Step 6: Method and Number of Segments + +We need to select the clustering method to use, as well as the number of cluster. It may be useful to see the dendrogram from Hierarchical Clustering, to have a quick idea of how the data may be segmented and how many segments there may be. Here is the dendrogram for our data: + +```{r} +Hierarchical_Cluster_distances <- dist(ProjectData_segment, method=distance_used) +Hierarchical_Cluster <- hclust(Hierarchical_Cluster_distances, method=hclust_method) +# Display dendogram +iplot.dendrogram(Hierarchical_Cluster) +# TODO: Draw dendogram with red borders around the 3 clusters +# rect.hclust(Hierarchical_Cluster, k=numb_clusters_used, border="red") +``` + +We can also plot the "distances" traveled before we need to merge any of the lower and smaller in size clusters into larger ones - the heights of the tree branches that link the clusters as we traverse the tree from its leaves to its root. If we have n observations, this plot has n-1 numbers, we see the first 20 here. 
+```{r} +num <- nrow(ProjectData) - 1 +df1 <- cbind(as.data.frame(Hierarchical_Cluster$height[length(Hierarchical_Cluster$height):1]), c(1:num)) +colnames(df1) <- c("distances","index") +iplot.df(melt(head(df1, 20), id="index"), xlab="Number of Components") +``` + +Here is the segment membership of the first `r max_data_report` respondents if we use hierarchical clustering: + +```{r} +cluster_memberships_hclust <- as.vector(cutree(Hierarchical_Cluster, k=numb_clusters_used)) # cut tree into 3 clusters +cluster_ids_hclust=unique(cluster_memberships_hclust) + +ProjectData_with_hclust_membership <- cbind(1:length(cluster_memberships_hclust),cluster_memberships_hclust) +colnames(ProjectData_with_hclust_membership)<-c("Observation Number","Cluster_Membership") + +iprint.df(round(head(ProjectData_with_hclust_membership, max_data_report), 2)) +``` + +while this is the segment membership if we use k-means: + +```{r} +kmeans_clusters <- kmeans(ProjectData_segment,centers= numb_clusters_used, iter.max=2000, algorithm=kmeans_method) + +ProjectData_with_kmeans_membership <- cbind(1:length(kmeans_clusters$cluster),kmeans_clusters$cluster) +colnames(ProjectData_with_kmeans_membership)<-c("Observation Number","Cluster_Membership") + +iprint.df(round(head(ProjectData_with_kmeans_membership, max_data_report), 2)) +``` + +## Step 7: Profile and interpret the segments + +In market segmentation one may use variables to **profile** the segments which are not the same (necessarily) as those used to **segment** the market: the latter may be, for example, attitude/needs related (you define segments based on what the customers "need"), while the former may be any information that allows a company to identify the defined customer segments (e.g. demographics, location, etc). Of course deciding which variables to use for segmentation and which to use for profiling (and then **activation** of the segmentation for business purposes) is largely subjective. In this case we can use all survey questions for profiling for now - the `profile_attributes_used` variables selected below. + +There are many ways to do the profiling of the segments. For example, here we show how the *average* answers of the respondents *in each segment* compare to the *average answer of all respondents* using the ratio of the two. The idea is that if in a segment the average response to a question is very different (e.g. away from ratio of 1) than the overall average, then that question may indicate something about the segment relative to the total population. + +Here are for example the profiles of the segments using the clusters found above. 
First let's see just the average answer people gave to each question for the different segments as well as the total population: + +```{r} +cluster_memberships_kmeans <- kmeans_clusters$cluster +cluster_ids_kmeans <- unique(cluster_memberships_kmeans) + +if (profile_with == "hclust"){ + cluster_memberships <- cluster_memberships_hclust + cluster_ids <- cluster_ids_hclust +} +if (profile_with == "kmeans"){ + cluster_memberships <- cluster_memberships_kmeans + cluster_ids <- cluster_ids_kmeans +} + +# WE WILL USE THESE IN THE CLASSIFICATION PART LATER +NewData = matrix(cluster_memberships,ncol=1) + +population_average = matrix(apply(ProjectData_profile, 2, mean), ncol=1) +colnames(population_average) <- "Population" +Cluster_Profile_mean <- sapply(sort(cluster_ids), function(i) apply(ProjectData_profile[(cluster_memberships==i), ], 2, mean)) +if (ncol(ProjectData_profile) <2) + Cluster_Profile_mean=t(Cluster_Profile_mean) +colnames(Cluster_Profile_mean) <- paste("Seg.", 1:length(cluster_ids), sep="") +cluster.profile <- cbind (population_average,Cluster_Profile_mean) + +iprint.df(round(cluster.profile, 2)) +``` + +We can also "visualize" the segments using **snake plots** for each cluster. For example, we can plot the means of the profiling variables for each of our clusters to better visualize differences between segments. For better visualization we plot the standardized profiling variables. + +```{r} +ProjectData_scaled_profile = ProjectData_scaled[, profile_attributes_used,drop=F] + +Cluster_Profile_standar_mean <- sapply(sort(cluster_ids), function(i) apply(ProjectData_scaled_profile[(cluster_memberships==i), ,drop = F], 2, mean)) +if (ncol(ProjectData_scaled_profile) < 2) + Cluster_Profile_standar_mean = t(Cluster_Profile_standar_mean) +colnames(Cluster_Profile_standar_mean) <- paste("Seg ", 1:length(cluster_ids), sep="") + +iplot.df(melt(cbind.data.frame(idx=as.numeric(1:nrow(Cluster_Profile_standar_mean)), Cluster_Profile_standar_mean), id="idx"), xlab="Profiling variables (standardized)", ylab="Mean of cluster") +``` + +We can also compare the averages of the profiling variables of each segment relative to the average of the variables across the whole population. This can also help us better understand whether there are indeed clusters in our data (e.g. if all segments are much like the overall population, there may be no segments). For example, we can measure the ratios of the average for each cluster to the average of the population, minus 1, (e.g. `avg(cluster)` `/` `avg(population)` `-1`) for each segment and variable: + +```{r} +population_average_matrix <- population_average[,"Population",drop=F] %*% matrix(rep(1,ncol(Cluster_Profile_mean)),nrow=1) +cluster_profile_ratios <- (ifelse(population_average_matrix==0, 0,Cluster_Profile_mean/population_average_matrix)) +colnames(cluster_profile_ratios) <- paste("Seg.", 1:ncol(cluster_profile_ratios), sep="") +rownames(cluster_profile_ratios) <- colnames(ProjectData)[profile_attributes_used] +## printing the result in a clean-slate table +iprint.df(round(cluster_profile_ratios-1, 2)) +``` + +**Questions** + +1. What do the numbers in the last table indicate? What numbers are the more informative? +2. Based on the tables and snake plot above, what are some key features of each of the segments of this solution? + +**Answers** + +* +* +* +* +* +* +* +* +* +* + +## Step 8: Robustness Analysis + +We should also consider the robustness of our analysis as we change the clustering method and parameters. 
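For example, one simple (and only indicative) robustness check using objects already computed in this report is to cross-tabulate the hierarchical and k-means segment memberships. Since cluster labels are arbitrary, what matters is whether most observations of each hierarchical segment fall into a single k-means segment:

```{r echo=TRUE, eval=FALSE, tidy=TRUE}
# Cross-tabulation of segment memberships from the two clustering methods used above.
# Rows: hierarchical clustering segments; columns: k-means segments.
membership_overlap <- table(hclust = cluster_memberships_hclust,
                            kmeans = cluster_memberships_kmeans)
membership_overlap
```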
Once we are comfortable with the solution we can finally answer our first business questions: + +**Questions** + +1. How many segments are there in our market? How many do you select and why? Try a few and explain your final choice based on a) statistical arguments, b) on interpretation arguments, c) on business arguments (**you need to consider all three types of arguments**) +2. Can you describe the segments you found based on the profiles? +3. What if you change the number of factors and in general you *iterate the whole analysis*? **Iterations** are key in data science. +4. Can you now answer the [Boats case questions](http://inseaddataanalytics.github.io/INSEADAnalytics/Boats-A-prerelease.pdf)? What business decisions do you recommend to this company based on your analysis? + +**Answers** + +* +* +* +* +* +* +* +* +* +* + +
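For the first question above, one quick statistical aid (not a substitute for the interpretation and business arguments) is to compare the k-means total within-cluster sum of squares for several candidate numbers of segments; this is a minimal sketch reusing the segmentation data and parameters defined earlier:

```{r echo=TRUE, eval=FALSE, tidy=TRUE}
# Total within-cluster sum of squares for k = 1..10 segments; an "elbow" in this curve
# is sometimes used as a statistical hint for the number of segments
wss <- sapply(1:10, function(k)
  kmeans(ProjectData_segment, centers = k, iter.max = 2000, algorithm = kmeans_method)$tot.withinss)
round(wss, 1)
```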
\clearpage + +# Part 3: Purchase Drivers + +We will now use the [classification analysis ](http://inseaddataanalytics.github.io/INSEADAnalytics/CourseSessions/Sessions67/ClassificationAnalysisReading.html) methods to understand the key purchase drivers for boats (a similar analysis can be done for recommendation drivers). For simplicity we do not follow the "generic" steps of classification discussed in that reading, and only consider the classification and purchase drivers analysis for the segments we found above. + +We are interested in understanding the purchase drivers, hence our **dependent** variable is column 82 of the Boats data (`r colnames(ProjectData)[82]`) - why is that? We will use only the subquestions of **Question 16** of the case for now, and also select some of the parameters for this part of the analysis: + +```{r setupclassification, echo=TRUE, tidy=TRUE} +# Please ENTER the class (dependent) variable: +# Please use numbers, not column names! e.g. 82 uses the 82nd column are dependent variable. +# YOU NEED TO MAKE SURE THAT THE DEPENDENT VARIABLES TAKES ONLY 2 VALUES: 0 and 1!!! +dependent_variable= 82 + +# Please ENTER the attributes to use as independent variables +# Please use numbers, not column names! e.g. c(1:5, 7, 8) uses columns 1,2,3,4,5,7,8 +independent_variables= c(54:80) # use 54-80 for boats + +# Please ENTER the profit/cost values for the correctly and wrong classified data: +actual_1_predict_1 = 100 +actual_1_predict_0 = -75 +actual_0_predict_1 = -50 +actual_0_predict_0 = 0 + +# Please ENTER the probability threshold above which an observations +# is predicted as class 1: +Probability_Threshold=50 # between 1 and 99% + +# Please ENTER the percentage of data used for estimation +estimation_data_percent = 80 +validation_data_percent = 10 + +# Please enter 0 if you want to "randomly" split the data in estimation and validation/test +random_sampling = 0 + +# Tree parameter +# PLEASE ENTER THE Tree (CART) complexity control cp (e.g. 0.001 to 0.02, depending on the data) +CART_cp = 0.01 + +# Please enter the minimum size of a segment for the analysis to be done only for that segment +min_segment = 100 +``` + +```{r} +ProjectData = ProjectData_INITIAL # Just to initialize the data + +Probability_Threshold = Probability_Threshold/100 # make it between 0 and 1 + +dependent_variable = unique(sapply(dependent_variable,function(i) min(ncol(ProjectData), max(i,1)))) +independent_variables = unique(sapply(independent_variables,function(i) min(ncol(ProjectData), max(i,1)))) + +if (length(unique(ProjectData[,dependent_variable])) !=2){ + cat("\n*****\n BE CAREFUL, THE DEPENDENT VARIABLE TAKES MORE THAN 2 VALUES...") + cat("\nSplitting it around its median...\n*****\n ") + new_dependent = ProjectData[,dependent_variable] >= median(ProjectData[,dependent_variable]) + ProjectData[,dependent_variable] <- 1*new_dependent +} + +Profit_Matrix = matrix(c(actual_1_predict_1, actual_0_predict_1, actual_1_predict_0, actual_0_predict_0), ncol=2) +colnames(Profit_Matrix)<- c("Predict 1", "Predict 0") +rownames(Profit_Matrix) <- c("Actual 1", "Actual 0") +test_data_percent = 100-estimation_data_percent-validation_data_percent +CART_control = rpart.control(cp = CART_cp) +``` + +**Questions** + +1. How do you select the profit/cost values for the analysis? Does the variable `r Profit_Matrix` above relate to the final business decisions? How? +2. What does the variable `r Probability_Threshold` affect? Does it relate to the final business decisions? How? 
+ +**Answers** + +* +* +* +* +* +* +* +* +* +* + +We will use two classification trees and logistic regression. You can select "complexity" control for one of the classification trees in the code chunk of the raw .Rmd file here + +```{r CART_control, echo=TRUE, tidy=TRUE} +CART_control = 0.001 +``` + +**Question** + +1. How can this parameter affect the final results? What business implications can this parameter choice have? + +**Answer** + +* +* +* +* +* +* +* +* +* +* + +This is a "small tree" classification for example: + +```{r} +# FIrst we split the data in estimation, validation, and test + +if (random_sampling){ + estimation_data_ids=sample.int(nrow(ProjectData),floor(estimation_data_percent*nrow(ProjectData)/100)) + non_estimation_data = setdiff(1:nrow(ProjectData),estimation_data_ids) + validation_data_ids=non_estimation_data[sample.int(length(non_estimation_data), floor(validation_data_percent/(validation_data_percent+test_data_percent)*length(non_estimation_data)))] + } else { + estimation_data_ids=1:floor(estimation_data_percent*nrow(ProjectData)/100) + non_estimation_data = setdiff(1:nrow(ProjectData),estimation_data_ids) + validation_data_ids = (tail(estimation_data_ids,1)+1):(tail(estimation_data_ids,1) + floor(validation_data_percent/(validation_data_percent+test_data_percent)*length(non_estimation_data))) + } + +test_data_ids = setdiff(1:nrow(ProjectData), union(estimation_data_ids,validation_data_ids)) + +estimation_data=ProjectData[estimation_data_ids,] +validation_data=ProjectData[validation_data_ids,] +test_data=ProjectData[test_data_ids,] +``` + +```{r} +# just name the variables numerically so that they look ok on the tree plots +independent_variables_nolabel = paste("IV", 1:length(independent_variables), sep="") + +estimation_data_nolabel = cbind(estimation_data[,dependent_variable], estimation_data[,independent_variables]) +colnames(estimation_data_nolabel)<- c(colnames(estimation_data)[dependent_variable],independent_variables_nolabel) + +validation_data_nolabel = cbind(validation_data[,dependent_variable], validation_data[,independent_variables]) +colnames(validation_data_nolabel)<- c(dependent_variable,independent_variables_nolabel) + +test_data_nolabel = cbind(test_data[,dependent_variable], test_data[,independent_variables]) +colnames(test_data_nolabel)<- c(dependent_variable,independent_variables_nolabel) + +estimation_data_nolabel = data.frame(estimation_data_nolabel) +validation_data_nolabel = data.frame(validation_data_nolabel) +test_data_nolabel = data.frame(test_data_nolabel) + +estimation_data = data.frame(estimation_data) +validation_data = data.frame(validation_data) +test_data = data.frame(test_data) +``` + +```{r} +formula=paste(colnames(estimation_data)[dependent_variable],paste(Reduce(paste,sapply(head(independent_variables_nolabel,-1), function(i) paste(i,"+",sep=""))),tail(independent_variables_nolabel,1),sep=""),sep="~") +CART_tree<-rpart(formula, data= estimation_data_nolabel,method="class", control=CART_control) + +rpart.plot(CART_tree, box.palette="OrBu", type=3, extra=1, fallen.leaves=F, branch.lty=3) +``` + +```{r} +CART_tree_large<-rpart(formula, data= estimation_data_nolabel,method="class", control=rpart.control(cp = 0.005)) +``` + +```{r} +# Let's first calculate all probabilites for the estimation, validation, and test data +estimation_Probability_class1_tree<-predict(CART_tree, estimation_data_nolabel)[,2] +estimation_Probability_class1_tree_large<-predict(CART_tree_large, estimation_data_nolabel)[,2] + 
+validation_Probability_class1_tree<-predict(CART_tree, validation_data_nolabel)[,2] +validation_Probability_class1_tree_large<-predict(CART_tree_large, validation_data_nolabel)[,2] + +test_Probability_class1_tree<-predict(CART_tree, test_data_nolabel)[,2] +test_Probability_class1_tree_large<-predict(CART_tree_large, test_data_nolabel)[,2] + + +estimation_prediction_class_tree=1*as.vector(estimation_Probability_class1_tree > Probability_Threshold) +estimation_prediction_class_tree_large=1*as.vector(estimation_Probability_class1_tree_large > Probability_Threshold) + +validation_prediction_class_tree=1*as.vector(validation_Probability_class1_tree > Probability_Threshold) +validation_prediction_class_tree_large=1*as.vector(validation_Probability_class1_tree_large > Probability_Threshold) + +test_prediction_class_tree=1*as.vector(test_Probability_class1_tree > Probability_Threshold) +test_prediction_class_tree_large=1*as.vector(test_Probability_class1_tree_large > Probability_Threshold) +``` + +```{r} +formula_log=paste(colnames(estimation_data[,dependent_variable,drop=F]),paste(Reduce(paste,sapply(head(independent_variables,-1), function(i) paste(colnames(estimation_data)[i],"+",sep=""))),colnames(estimation_data)[tail(independent_variables,1)],sep=""),sep="~") + +logreg_solution <- glm(formula_log, family=binomial(link="logit"), data=estimation_data) + +log_coefficients = round(summary(logreg_solution)$coefficients,1) +``` + +```{r} +estimation_Probability_class1_log<-predict(logreg_solution, type="response", newdata=estimation_data[,independent_variables]) +validation_Probability_class1_log<-predict(logreg_solution, type="response", newdata=validation_data[,independent_variables]) +test_Probability_class1_log<-predict(logreg_solution, type="response", newdata=test_data[,independent_variables]) + +estimation_prediction_class_log=1*as.vector(estimation_Probability_class1_log > Probability_Threshold) +validation_prediction_class_log=1*as.vector(validation_Probability_class1_log > Probability_Threshold) +test_prediction_class_log=1*as.vector(test_Probability_class1_log > Probability_Threshold) +``` + +After also running the large tree and the logistic regression classifiers, we can then check how much "weight" these three methods put on the different purchase drivers (Q16 of the survey): + +```{r} +log_importance = tail(log_coefficients[,"z value", drop=F],-1) # remove the intercept +log_importance = log_importance/max(abs(log_importance)) + +tree_importance = CART_tree$variable.importance +tree_ordered_drivers = as.numeric(gsub("\\IV"," ",names(CART_tree$variable.importance))) +tree_importance_final = rep(0,length(independent_variables)) +tree_importance_final[tree_ordered_drivers] <- tree_importance +tree_importance_final <- tree_importance_final/max(abs(tree_importance_final)) +tree_importance_final <- tree_importance_final*sign(log_importance) + +large_tree_importance = CART_tree_large$variable.importance +large_tree_ordered_drivers = as.numeric(gsub("\\IV"," ",names(CART_tree_large$variable.importance))) +large_tree_importance_final = rep(0,length(independent_variables)) +large_tree_importance_final[large_tree_ordered_drivers] <- large_tree_importance +large_tree_importance_final <- large_tree_importance_final/max(abs(large_tree_importance_final)) +large_tree_importance_final <- large_tree_importance_final*sign(log_importance) + +Importance_table <- cbind(tree_importance_final,large_tree_importance_final, log_importance) +colnames(Importance_table) <- c("CART 1", "CART 2", "Logistic 
Regr.") +rownames(Importance_table) <- rownames(log_importance) + +iprint.df(Importance_table) +``` + +Finally, if we were to use the estimated classification models on the test data, we would get the following profit curves (see the raw .Rmd file to select the business profit parameters). + +The profit curve using the small classification tree: + +```{r} +actual_class<- test_data[,dependent_variable] + +probs = test_Probability_class1_tree +xaxis = sort(unique(c(0,1,probs)), decreasing = TRUE) +res = Reduce(cbind,lapply(xaxis, function(prob){ + useonly = which(probs >= prob) + predict_class = 1*(probs >= prob) + theprofit = Profit_Matrix[1,1]*sum(predict_class==1 & actual_class ==1)+ + Profit_Matrix[1,2]*sum(predict_class==0 & actual_class ==1)+ + Profit_Matrix[2,1]*sum(predict_class==1 & actual_class ==0)+ + Profit_Matrix[2,2]*sum(predict_class==0 & actual_class ==0) + + c(100*length(useonly)/length(actual_class), theprofit) + })) +xaxis = res[1,]; yaxis = res[2,] +df<-data.frame(Percentile = xaxis, Profit = yaxis) +iplot.df(df, x="Percentile", y="Profit", v=NULL) + +best_profits_small_tree = df[which.max(df$Profit),] +``` + +The profit curve using the large classification tree: + +```{r} +probs = test_Probability_class1_tree_large +xaxis = sort(unique(c(0,1,probs)), decreasing = TRUE) +res = Reduce(cbind,lapply(xaxis, function(prob){ + useonly = which(probs >= prob) + predict_class = 1*(probs >= prob) + theprofit = Profit_Matrix[1,1]*sum(predict_class==1 & actual_class ==1)+ + Profit_Matrix[1,2]*sum(predict_class==0 & actual_class ==1)+ + Profit_Matrix[2,1]*sum(predict_class==1 & actual_class ==0)+ + Profit_Matrix[2,2]*sum(predict_class==0 & actual_class ==0) + + c(100*length(useonly)/length(actual_class), theprofit) + })) +xaxis = res[1,]; yaxis = res[2,] +names(xaxis)<- NULL; names(yaxis) <- NULL +df<-data.frame(Percentile = xaxis, Profit = yaxis) +iplot.df(df, x="Percentile", y="Profit", v=NULL) + +best_profits_large_tree = df[which.max(df$Profit),] +``` + +The profit curve using the logistic regression classifier: + +```{r} +probs = test_Probability_class1_log +xaxis = sort(unique(c(0,1,probs)), decreasing = TRUE) +res = Reduce(cbind,lapply(xaxis, function(prob){ + useonly = which(probs >= prob) + predict_class = 1*(probs >= prob) + theprofit = Profit_Matrix[1,1]*sum(predict_class==1 & actual_class ==1)+ + Profit_Matrix[1,2]*sum(predict_class==0 & actual_class ==1)+ + Profit_Matrix[2,1]*sum(predict_class==1 & actual_class ==0)+ + Profit_Matrix[2,2]*sum(predict_class==0 & actual_class ==0) + + c(100*length(useonly)/length(actual_class), theprofit) + })) +xaxis = res[1,]; yaxis = res[2,] +names(xaxis)<- NULL; names(yaxis) <- NULL +df<-data.frame(Percentile = xaxis, Profit = yaxis) +iplot.df(df, x="Percentile", y="Profit", v=NULL) + +best_profits_logistic = df[which.max(df$Profit),] +``` + +These are the maximum total profit achieved in the test data using the three classifiers (without any segment specific analysis so far). + +```{r} +best_profits = rbind(best_profits_small_tree, best_profits_large_tree, best_profits_logistic) +rownames(best_profits) <- c("Small Tree", "Large Tree", "Logistic Regression") +iprint.df(round(best_profits, 2)) +``` + +
\clearpage + +# Part 4: Business Decisions + +We will now get the results of the overall process (parts 1-3) and based on them make business decisions (e.g. answer the questions of the Boats case study). Specifically, we will study the purchase drivers for each segment we found and consider the profit curves of the developed models on our test data. + +**Final Solution: Segment Specific Analysis** + +Let's see first how many observations we have in each segment, for the segments we selected above: + +```{r} +# Let's rename cluster_memberships as cluster_ids +cluster_ids <- cluster_memberships + +cluster_size = NULL +for (i in sort(unique(cluster_ids))){ + cluster_size = c(cluster_size,sum(cluster_ids == i)) + } +cluster_size = matrix(cluster_size, nrow=1) +colnames(cluster_size) <- paste("Segment", 1:length(cluster_size), sep=" ") +rownames(cluster_size) <- "Number of Obs." +iprint.df(cluster_size, scale=TRUE) +``` + +This is our final segment specific analysis and solution. We can study now the purchase drivers (average answers to Q16 of the survey) for each segment. They are as follows: + +```{r} +actual_class<- test_data[,dependent_variable] +probs_tree = 0*test_Probability_class1_tree +probs_tree_large = 0*test_Probability_class1_tree_large +probs_log = 0*test_Probability_class1_log +Log_Drivers = NULL + +for (i in sort(unique(cluster_ids))){ + useonly = which(cluster_ids==i) + if (length(useonly) >= min_segment){ + + test_ids_used = intersect(test_data_ids,useonly) + probs_to_fill = which(sapply(test_data_ids, function(i) sum(test_ids_used==i)) !=0) + estimation_data_clus=ProjectData[intersect(estimation_data_ids,useonly) ,] + test_data_clus=ProjectData[intersect(test_data_ids,useonly),] + + ### + estimation_data_clus_nolabel = cbind(estimation_data_clus[,dependent_variable], estimation_data_clus[,independent_variables]) + colnames(estimation_data_clus_nolabel)<- c(colnames(estimation_data_clus)[dependent_variable],independent_variables_nolabel) + + test_data_clus_nolabel = cbind(test_data_clus[,dependent_variable], test_data_clus[,independent_variables]) + colnames(test_data_clus_nolabel)<- c(dependent_variable,independent_variables_nolabel) + + estimation_data_clus = data.frame(estimation_data_clus) + test_data_clus = data.frame(test_data_clus) + estimation_data_clus_nolabel = data.frame(estimation_data_clus_nolabel) + test_data_clus_nolabel = data.frame(test_data_clus_nolabel) + ### + + CART_tree<-rpart(formula, data= estimation_data_clus_nolabel,method="class", control=CART_control) + CART_tree_large<-rpart(formula, data= estimation_data_clus_nolabel,method="class", control=rpart.control(cp = 0.005)) + logreg_solution <- glm(formula_log, family=binomial(link="logit"), data=estimation_data_clus) + + ##### + + test_Probability_class1_tree<-predict(CART_tree, test_data_clus_nolabel)[,2] + test_Probability_class1_tree_large<-predict(CART_tree_large, test_data_clus_nolabel)[,2] + test_Probability_class1_log<-predict(logreg_solution, type="response", newdata=test_data_clus[,independent_variables]) + + ####### + probs_tree[probs_to_fill] <- test_Probability_class1_tree + probs_tree_large[probs_to_fill] <- test_Probability_class1_tree + probs_log[probs_to_fill] <- test_Probability_class1_log + + + log_coefficients = round(summary(logreg_solution)$coefficients,1) + Log_Drivers_segment = tail(log_coefficients[,"z value", drop=F],-1) # remove the intercept + Log_Drivers_segment = Log_Drivers_segment/max(abs(Log_Drivers_segment)) + + tree_importance = CART_tree$variable.importance + 
tree_ordered_drivers = as.numeric(gsub("\\IV"," ",names(CART_tree$variable.importance))) + tree_importance_final = rep(0,length(independent_variables)) + tree_importance_final[tree_ordered_drivers] <- tree_importance + tree_importance_final <- tree_importance_final/max(abs(tree_importance_final)) + tree_importance_final <- tree_importance_final*sign(Log_Drivers_segment) + + #Log_Drivers = cbind(Log_Drivers,tree_importance_final) + Log_Drivers = cbind(Log_Drivers,Log_Drivers_segment) + + + } + } +colnames(Log_Drivers) <- paste("Segment", 1:length(unique(cluster_ids)), sep = " ") +iprint.df(round(tail(Log_Drivers,-1), 2)) +``` + +The profit curves for the test data in this case are as follows. The profit curve using the small classification tree is: + +```{r} +actual_class<- test_data[,dependent_variable] + +probs = probs_tree +xaxis = sort(unique(c(0,1,probs)), decreasing = TRUE) +res = Reduce(cbind,lapply(xaxis, function(prob){ + useonly = which(probs >= prob) + predict_class = 1*(probs >= prob) + theprofit = Profit_Matrix[1,1]*sum(predict_class==1 & actual_class ==1)+ + Profit_Matrix[1,2]*sum(predict_class==0 & actual_class ==1)+ + Profit_Matrix[2,1]*sum(predict_class==1 & actual_class ==0)+ + Profit_Matrix[2,2]*sum(predict_class==0 & actual_class ==0) + + c(100*length(useonly)/length(actual_class), theprofit) + })) +xaxis = res[1,]; yaxis = res[2,] +names(xaxis)<- NULL; names(yaxis) <- NULL +df<-data.frame(Percentile = xaxis, Profit = yaxis) +iplot.df(df, x="Percentile", y="Profit", v=NULL) + +best_profits_small_tree = df[which.max(df$Profit),] +``` + +The profit curve using the large classification tree is: + +```{r} +probs = probs_tree_large +xaxis = sort(unique(c(0,1,probs)), decreasing = TRUE) +res = Reduce(cbind,lapply(xaxis, function(prob){ + useonly = which(probs >= prob) + predict_class = 1*(probs >= prob) + theprofit = Profit_Matrix[1,1]*sum(predict_class==1 & actual_class ==1)+ + Profit_Matrix[1,2]*sum(predict_class==0 & actual_class ==1)+ + Profit_Matrix[2,1]*sum(predict_class==1 & actual_class ==0)+ + Profit_Matrix[2,2]*sum(predict_class==0 & actual_class ==0) + + c(100*length(useonly)/length(actual_class), theprofit) + })) +xaxis = res[1,]; yaxis = res[2,] +names(xaxis)<- NULL; names(yaxis) <- NULL +df<-data.frame(Percentile = xaxis, Profit = yaxis) +iplot.df(df, x="Percentile", y="Profit", v=NULL) + +best_profits_large_tree = df[which.max(df$Profit),] +``` + +The profit curve using the logistic regression classifier: + +```{r} +probs = probs_log +xaxis = sort(unique(c(0,1,probs)), decreasing = TRUE) +res = Reduce(cbind,lapply(xaxis, function(prob){ + useonly = which(probs >= prob) + predict_class = 1*(probs >= prob) + theprofit = Profit_Matrix[1,1]*sum(predict_class==1 & actual_class ==1)+ + Profit_Matrix[1,2]*sum(predict_class==0 & actual_class ==1)+ + Profit_Matrix[2,1]*sum(predict_class==1 & actual_class ==0)+ + Profit_Matrix[2,2]*sum(predict_class==0 & actual_class ==0) + + c(100*length(useonly)/length(actual_class), theprofit) + })) +xaxis = res[1,]; yaxis = res[2,] +names(xaxis)<- NULL; names(yaxis) <- NULL +df<-data.frame(Percentile = xaxis, Profit = yaxis) +iplot.df(df, x="Percentile", y="Profit", v=NULL) + +best_profits_logistic = df[which.max(df$Profit),] +``` + +These are the maximum total profit achieved in the test data using the three classifiers with the selected market segmentation solution. 
+ +```{r} +best_profits = rbind(best_profits_small_tree, best_profits_large_tree, best_profits_logistic) +rownames(best_profits) <- c("Small Tree", "Large Tree", "Logistic Regression") +iprint.df(round(best_profits, 2)) +``` + +**Questions:** + +1. What are the main purchase drivers for the segments and solution you found? +2. How different are the purchase drivers you find when you use segmentation versus when you study all customers as "one segment"? Why? +3. Based on the overall analysis, what segmentation would you choose? +4. What is the business profit the company can achieve (as measured with the test data) based on your solution? +5. What business decisions can the company make based on this analysis? + +**Answers:** + +* +* +* +* +* +* +* +* +* +* + +**You have now completed your first market segmentation project.** Do you have data from another survey you can use with this report now? + +**Extra question**: explore and report a new segmentation analysis... \ No newline at end of file diff --git a/CourseSessions/Session1/Session1inclass.Rmd b/CourseSessions/Session1/Session1inclass.Rmd index a19f9ca7..f93dd549 100644 --- a/CourseSessions/Session1/Session1inclass.Rmd +++ b/CourseSessions/Session1/Session1inclass.Rmd @@ -1,6 +1,6 @@ --- title: "Session 1" -author: "T. Evgeniou" +author: "T. Evgeniou really??" output: html_document --- @@ -142,9 +142,9 @@ The correlation matrix does not look pretty for now, but we will see example way #### Questions -1. Can you find which column asks about the name of the brand rated? -2. Can you find the average rating responded gave to question "`r gsub("_", " ", gsub("\\.", " ", "Q1_3_The.brand.of.boat.I.buy.says.a.lot.about.who.I.am"))`" (note the use of the R command `gsub` in the code to manipulate character strings). -3. *(Extra points)* What is the percentage of male in this population? How many of them responded that they plan to purchase a boat in the future? +1. Can you find which column asks about the name of the brand rated? QN 1_3 & 1_4 +2. Can you find the average rating responded gave to question "`r gsub("_", " ", gsub("\\.", " ", "Q1_3_The.brand.of.boat.I.buy.says.a.lot.about.who.I.am"))`" (note the use of the R command `gsub` in the code to manipulate character strings). 3.1135 +3. *(Extra points)* What is the percentage of male in this population? How many of them responded that they plan to purchase a boat in the future? 55%, 784 of them 4. *(Extra points)* Explore the R command `table` and use it to present in this report some of the data. 5. Explore the R packages or github repositories you found interesting and prepare to discuss one of them in class, and if possible show some demo. diff --git a/CourseSessions/Sessions23/Session2inclass.Rmd b/CourseSessions/Sessions23/Session2inclass.Rmd new file mode 100644 index 00000000..9bb734e8 --- /dev/null +++ b/CourseSessions/Sessions23/Session2inclass.Rmd @@ -0,0 +1,426 @@ +--- +title: "Sessions 3-4" +author: "T. Evgeniou" +output: html_document +--- + +
+ +The purpose of this session is to become familiar with: + +1. Some visualization tools; +2. Principal Component Analysis and Factor Analysis; +3. Clustering Methods; +4. Introduction to machine learning methods; +5. A market segmentation case study. + +As always, before starting, make sure you have pulled the [session 3-4 files](https://github.com/InseadDataAnalytics/INSEADAnalytics/tree/master/CourseSessions/Sessions23) (yes, I know, it says session 2, but it is 3-4 - need to update all filenames some time, but till then we use common sense and ignore a bit the filenames) on your github repository (if you pull the course github repository you also get the session files automatically). Moreover, make sure you are in the directory of this exercise. Directory paths may be complicated, and sometimes a frustrating source of problems, so it is recommended that you use these R commands to find out your current working directory and, if needed, set it where you have the main files for the specific exercise/project (there are other ways, but for now just be aware of this path issue). For example, assuming we are now in the "MYDIRECTORY/INSEADAnalytics" directory, we can do these: + +```{r echo=TRUE, eval=FALSE, tidy=TRUE} +getwd() +setwd("CourseSessions/Sessions23") +list.files() +rm(list=ls()) # Clean up the memory, if we want to rerun from scratch +``` +As always, you can use the `help` command in Rstudio to find out about any R function (e.g. type `help(list.files)` to learn what the R function `list.files` does). + +Let's start. + +
+
+ +### Survey Data for Market Segmentation + +We will be using the [boats case study](http://inseaddataanalytics.github.io/INSEADAnalytics/Boats-A-prerelease.pdf) as an example. At the end of this class we will be able to develop (from scratch) the readings of sessions 3-4 as well as understand the tools used and the interpretation of the results in practice - in order to make business decisions. The code used here is along the lines of the code in the session directory, e.g. in the [RunStudy.R](https://github.com/InseadDataAnalytics/INSEADAnalytics/blob/master/CourseSessions/Sessions23/RunStudy.R) file and the report [doc/Report_s23.Rmd.](https://github.com/InseadDataAnalytics/INSEADAnalytics/blob/master/CourseSessions/Sessions23/doc/Report_s23.Rmd) There may be a few differences, as there are many ways to write code to do the same thing. + +Let's load the data: + +```{r echo=FALSE, message=FALSE, prompt=FALSE, results='asis'} +source("R/library.R") +``` + +```{r echo=TRUE, eval=TRUE, tidy=TRUE} +ProjectData <- read.csv("data/Boats.csv", sep=";", dec=",") # this contains only the matrix ProjectData +ProjectData=data.matrix(ProjectData) +colnames(ProjectData)<-gsub("\\."," ",colnames(ProjectData)) +ProjectDataFactor=ProjectData[,c(2:30)] +``` +
+and do some basic visual exploration of the first 50 respondents (it is always necessary to see the data first): +
+ +```{r echo=FALSE, message=FALSE, prompt=FALSE, results='asis'} +show_data = data.frame(round(ProjectData,2))[1:50,] +show_data$Variables = rownames(show_data) +m1<-gvisTable(show_data,options=list(showRowNumber=TRUE,width=1220, height=min(400,27*(nrow(show_data)+1)),allowHTML=TRUE,page='disable')) +print(m1,'chart') +``` +
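A complementary, non-visual way to explore the same data is to compute a few descriptive statistics. This is just an optional sketch (the columns shown are an arbitrary choice):

```{r echo=TRUE, eval=FALSE, tidy=TRUE}
# Quick descriptive statistics for some of the attitude questions
summary(ProjectDataFactor[, 1:5])
round(apply(ProjectDataFactor, 2, sd), 2)  # spread of the responses per question
```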
+ +This is the correlation matrix of the customer responses to the `r ncol(ProjectDataFactor)` attitude questions - which are the only questions that we will use for the segmentation (see the case): +
+ +```{r echo=FALSE, comment=NA, warning=FALSE, message=FALSE, results='asis'} +show_data = data.frame(cbind(colnames(ProjectDataFactor), round(cor(ProjectDataFactor),2))) +m1<-gvisTable(show_data,options=list(width=1920, height=min(400,27*(nrow(show_data)+1)),allowHTML=TRUE)) +print(m1,'chart') +``` +
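To help answer the questions below, one can also list the most correlated question pairs programmatically. The 0.5 cutoff here is an arbitrary illustration, not a rule:

```{r echo=TRUE, eval=FALSE, tidy=TRUE}
# List question pairs whose absolute correlation exceeds an (arbitrary) 0.5 cutoff
cor_matrix <- cor(ProjectDataFactor)
high_pairs <- which(abs(cor_matrix) > 0.5 & upper.tri(cor_matrix), arr.ind = TRUE)
data.frame(question_1  = rownames(cor_matrix)[high_pairs[, "row"]],
           question_2  = colnames(cor_matrix)[high_pairs[, "col"]],
           correlation = round(cor_matrix[high_pairs], 2))
```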
+ +#### Questions + +1. Do you see any high correlations between the responses? Do they make sense? +2. What do these correlations imply? + +##### Answers: +
+
+
+
+ +
+ +### Key Customer Attitudes + +Clearly the survey asked many redundant questions (can you think of some reasons why?), so we may be able to actually "group" these 29 attitude questions into only a few "key factors". This not only will simplify the data, but will also greatly facilitate our understanding of the customers. + +To do so, we use methods called [Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) and [factor analysis](https://en.wikipedia.org/wiki/Factor_analysis) as discussed in the [session readings](http://inseaddataanalytics.github.io/INSEADAnalytics/Report_s23.html). We can use two different R commands for this (they make slightly different information easily available as output): the command `principal` (check `help(principal)` from R package [psych](http://personality-project.org/r/psych/)), and the command `PCA` from R package [FactoMineR](http://factominer.free.fr) - there are more packages and commands for this, as these methods are very widely used. + +Here is how the `principal` function is used: +
+```{r echo=TRUE, eval=TRUE, tidy=TRUE} +UnRotated_Results<-principal(ProjectDataFactor, nfactors=ncol(ProjectDataFactor), rotate="none",score=TRUE) +UnRotated_Factors<-round(UnRotated_Results$loadings,2) +UnRotated_Factors<-as.data.frame(unclass(UnRotated_Factors)) +colnames(UnRotated_Factors)<-paste("Component",1:ncol(UnRotated_Factors),sep=" ") +``` + +
+
+ +Here is how the `PCA` function is used: +
+ +```{r echo=TRUE, eval=TRUE, tidy=TRUE} +Variance_Explained_Table_results<-PCA(ProjectDataFactor, graph=FALSE) +Variance_Explained_Table<-Variance_Explained_Table_results$eig +Variance_Explained_Table_copy<-Variance_Explained_Table +row=1:nrow(Variance_Explained_Table) +name<-paste("Component No:",row,sep="") +Variance_Explained_Table<-cbind(name,Variance_Explained_Table) +Variance_Explained_Table<-as.data.frame(Variance_Explained_Table) +colnames(Variance_Explained_Table)<-c("Components", "Eigenvalue", "Percentage_of_explained_variance", "Cumulative_percentage_of_explained_variance") + +eigenvalues <- Variance_Explained_Table[,2] +``` + +
+Let's look at the **variance explained** as well as the **eigenvalues** (see session readings): +
+
+ +```{r echo=FALSE, comment=NA, warning=FALSE, error=FALSE,message=FALSE,results='asis'} +show_data = Variance_Explained_Table +m<-gvisTable(Variance_Explained_Table,options=list(width=1200, height=min(400,27*(nrow(show_data)+1)),allowHTML=TRUE,page='disable'),formats=list(Eigenvalue="#.##",Percentage_of_explained_variance="#.##",Cumulative_percentage_of_explained_variance="#.##")) +print(m,'chart') +``` +
+ +```{r Fig1, echo=FALSE, comment=NA, results='asis', message=FALSE, fig.align='center', fig=TRUE} +df <- cbind(as.data.frame(eigenvalues), c(1:length(eigenvalues)), rep(1, length(eigenvalues))) +colnames(df) <- c("eigenvalues", "components", "abline") +Line <- gvisLineChart(as.data.frame(df), xvar="components", yvar=c("eigenvalues","abline"), options=list(title='Scree plot', legend="right", width=900, height=600, hAxis="{title:'Number of Components', titleTextStyle:{color:'black'}}", vAxes="[{title:'Eigenvalues'}]", series="[{color:'green',pointSize:3, targetAxisIndex: 0}]")) +print(Line, 'chart') +``` +
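Before answering the questions below, it may also help to summarize the eigenvalue table numerically. This is only a sketch of the "eigenvalue of at least 1" rule of thumb from the readings, using the numeric copy of the PCA output kept above:

```{r echo=TRUE, eval=FALSE, tidy=TRUE}
# Number of components with eigenvalue >= 1 (a common rule of thumb, not a law)
sum(Variance_Explained_Table_copy[, 1] >= 1)
# Cumulative percentage of variance explained by the first 10 components
round(Variance_Explained_Table_copy[1:10, 3], 1)
```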
+ +#### Questions: + +1. Can you explain what this table and the plot are? What do they indicate? What can we learn from these? +2. Why does the plot have this specific shape? Could the plotted line be increasing? +3. What characteristics of these results would we prefer to see? Why? + +**Your Answers here:** +
+
+
+
+ +#### Visualization and Interpretation + +Let's now see what the "top factors" look like.
+ +```{r echo=TRUE, eval=TRUE, tidy=TRUE} +# Choose one of these options: +factors_selected = sum(Variance_Explained_Table_copy[,1] >= 1) +# minimum_variance_explained = 0.5; factors_selected = 1:head(which(Variance_Explained_Table_copy[,"cumulative percentage of variance"]>= minimum_variance_explained),1) +#factors_selected = 10 +``` +
+ +To better visualize them, we will use what is called a "rotation". There are many rotation methods; here we use what is called the [varimax](http://stats.stackexchange.com/questions/612/is-pca-followed-by-a-rotation-such-as-varimax-still-pca) rotation:
+ +```{r echo=TRUE, eval=TRUE, tidy=TRUE} +# Please ENTER the rotation eventually used (e.g. "none", "varimax", "quartimax", "promax", "oblimin", "simplimax", and "cluster" - see help(principal)). Default is "varimax" +rotation_used="varimax" +``` + +```{r echo=TRUE, eval=TRUE, tidy=TRUE} +Rotated_Results<-principal(ProjectDataFactor, nfactors=max(factors_selected), rotate=rotation_used,score=TRUE) +Rotated_Factors<-round(Rotated_Results$loadings,2) +Rotated_Factors<-as.data.frame(unclass(Rotated_Factors)) +colnames(Rotated_Factors)<-paste("Component",1:ncol(Rotated_Factors),sep=" ") +sorted_rows <- sort(Rotated_Factors[,1], decreasing = TRUE, index.return = TRUE)$ix +Rotated_Factors <- Rotated_Factors[sorted_rows,] +``` + +```{r echo=FALSE, comment=NA, warning=FALSE, message=FALSE,results='asis'} +show_data <- Rotated_Factors +show_data$Variables <- rownames(show_data) +m1<-gvisTable(show_data,options=list(showRowNumber=TRUE,width=1220, height=min(400,27*(nrow(show_data)+1)),allowHTML=TRUE,page='disable')) +print(m1,'chart') +```

+ +To better visualize and interpret the factors we often "suppress" loadings with small values, e.g. with absolute values smaller than 0.5. In this case our factors look as follows after suppressing the small numbers:
+ +```{r echo=TRUE, eval=TRUE, tidy=TRUE} +MIN_VALUE = 0.5 +Rotated_Factors_thres <- Rotated_Factors +Rotated_Factors_thres[abs(Rotated_Factors_thres) < MIN_VALUE]<-NA +colnames(Rotated_Factors_thres)<- colnames(Rotated_Factors) +rownames(Rotated_Factors_thres)<- rownames(Rotated_Factors) +``` + +```{r echo=FALSE, comment=NA, warning=FALSE, message=FALSE,results='asis'} +show_data <- Rotated_Factors_thres +#show_data = show_data[1:min(max_data_report,nrow(show_data)),] +show_data$Variables <- rownames(show_data) +m1<-gvisTable(show_data,options=list(showRowNumber=TRUE,width=1220, height=min(400,27*(nrow(show_data)+1)),allowHTML=TRUE,page='disable')) +print(m1,'chart') +``` +
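+
+The Market Segmentation section below notes that one can also work with the respondents' "factor scores" instead of individual questions. Here is a minimal sketch (an addition to this template) of how to extract them; the `scores` component is available because the `principal` call above requested scores (`score=TRUE`):
+
+```{r echo=TRUE, eval=TRUE, tidy=TRUE}
+# Factor scores: one row per respondent, one column per rotated factor
+factor_scores <- Rotated_Results$scores
+round(head(factor_scores, 3), 2)  # quick look at the first few respondents
+```
+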

+ + +#### Questions + +1. What do the first couple of factors mean? Do they make business sense? +2. How many factors should we choose for this data/customer base? Please try a few and explain your final choice based on a) statistical arguments, b) interpretation arguments, and c) business arguments (**you need to consider all three types of arguments**) +3. How would you interpret the factors you selected? +4. What lessons about data science do you learn when doing this analysis? Please comment. +5. (Extra/Optional) Can you make this report "dynamic" using shiny and then post it on [shinyapps.io](http://www.shinyapps.io)? (see for example exercise set 1 and interactive exercise set 2) + +**Your Answers here:**
+
+
+
+ +
+
+ +### Market Segmentation + +Let's now use one representative question for each factor (we can also use the "factor scores" for each respondent - see [session readings](http://inseaddataanalytics.github.io/INSEADAnalytics/Report_s23.html)) to represent our survey respondents. We can choose the question with the highest absolute factor loading for each factor. For example, when we use 5 factors with the varimax rotation we can select questions Q.1.9 (I see my boat as a status symbol), Q1.18 (Boating gives me a feeling of adventure), Q1.4 (I only consider buying a boat from a reputable brand), Q1.11 (I tend to perform minor boat repairs and maintenance on my own) and Q1.2 (When buying a boat getting the lowest price is more important than the boat brand) - try it. These are columns 10, 19, 5, 12, and 3, respectively, of the data matrix `ProjectData`. + +In market segmentation, the variables one uses to **profile** the segments are not necessarily the same as those used to **segment** the market: the latter may be, for example, attitude/needs related (you define segments based on what the customers "need"), while the former may be any information that allows a company to identify the defined customer segments (e.g. demographics, location, etc.). Of course, deciding which variables to use for segmentation and which to use for profiling (and then for the **activation** of the segmentation for business purposes) is largely subjective. So in this case we will use all survey questions for profiling for now:
+ +```{r echo=TRUE, eval=TRUE, tidy=TRUE} +segmentation_attributes_used = c(10,19,5,12,3) +profile_attributes_used = 2:ncol(ProjectData) +ProjectData_segment=ProjectData[,segmentation_attributes_used] +ProjectData_profile=ProjectData[,profile_attributes_used] +``` + +A key family of methods used for segmentation is what is called **clustering methods**. This is a very important problem in statistics and **machine learning**, used in all sorts of applications such as in [Amazon's pioneering work on recommender systems](http://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf). There are many *mathematical methods* for clustering. We will use two very standard methods, **hierarchical clustering** and **k-means**. While the "math" behind all these methods can be complex, the R functions used are relatively simple to use, as we will see. + +For example, to use hierarchical clustering we first define some parameters (see session readings) and then call the command `hclust`: + +```{r echo=TRUE, eval=TRUE, tidy=TRUE} +# Please ENTER the distance metric eventually used for the clustering in case of hierarchical clustering +# (e.g. "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski" - see help(dist)). +# DEFAULT is "euclidean" +distance_used="euclidean" +# Please ENTER the hierarchical clustering method to use (options are: +# "ward", "single", "complete", "average", "mcquitty", "median" or "centroid") +# DEFAULT is "ward.D" +hclust_method = "ward.D" +# Define the number of clusters: +numb_clusters_used = 3 +``` + +```{r echo=TRUE, eval=TRUE, tidy=TRUE} +Hierarchical_Cluster_distances <- dist(ProjectData_segment, method=distance_used) +Hierarchical_Cluster <- hclust(Hierarchical_Cluster_distances, method=hclust_method) + +# Assign observations (e.g. people) in their clusters +cluster_memberships_hclust <- as.vector(cutree(Hierarchical_Cluster, k=numb_clusters_used)) +cluster_ids_hclust=unique(cluster_memberships_hclust) +ProjectData_with_hclust_membership <- cbind(1:length(cluster_memberships_hclust),cluster_memberships_hclust) +colnames(ProjectData_with_hclust_membership)<-c("Observation Number","Cluster_Membership") +``` + +Finally, we can see the **dendrogram** (see class readings and online resources for more information) to have a first rough idea of what segments (clusters) we may have - and how many.
+ +```{r echo=FALSE, comment=NA, warning=FALSE, message=FALSE, fig.align='center', results='asis'} +# Display dendogram +plot(Hierarchical_Cluster, main = NULL, sub=NULL, labels = 1:nrow(ProjectData_segment), xlab="Our Observations", cex.lab=1, cex.axis=1) +# Draw dendogram with red borders around the 3 clusters +rect.hclust(Hierarchical_Cluster, k=numb_clusters_used, border="red") +``` +
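+
+Before going further, it is worth checking how many respondents fall in each candidate segment (a small sketch using the memberships computed above); very small segments are usually hard to interpret and act on:
+
+```{r echo=TRUE, eval=TRUE, tidy=TRUE}
+# Number of respondents assigned to each hierarchical-clustering segment
+table(cluster_memberships_hclust)
+```
+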
+ We can also plot the "distances" traveled before we need to merge any of the smaller, lower-level clusters into larger ones - these are the heights of the tree branches that link the clusters as we traverse the tree from its leaves to its root. If we have n observations, this plot has n-1 numbers.
+ + +```{r echo=FALSE, comment=NA, warning=FALSE, message=FALSE, fig.align='center', results='asis'} +df1 <- cbind(as.data.frame(Hierarchical_Cluster$height[length(Hierarchical_Cluster$height):1]), c(1:(nrow(ProjectData)-1))) +colnames(df1) <- c("distances","index") +Line <- gvisLineChart(as.data.frame(df1), xvar="index", yvar="distances", options=list(title='Distances plot', legend="right", width=900, height=600, hAxis="{title:'Number of Components', titleTextStyle:{color:'black'}}", vAxes="[{title:'Distances'}]", series="[{color:'green',pointSize:3, targetAxisIndex: 0}]")) +print(Line,'chart') +``` +
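+
+One heuristic (and only suggestive) way to read this plot programmatically is to look at the drops between consecutive merge heights: a large early drop means that the corresponding merge joined quite dissimilar clusters, which is one informal input into choosing the number of segments. A minimal sketch:
+
+```{r echo=TRUE, eval=TRUE, tidy=TRUE}
+# Merge heights from the largest (last merge) down, and the size of the first few drops
+merge_heights <- sort(Hierarchical_Cluster$height, decreasing = TRUE)
+round(head(-diff(merge_heights), 10), 2)
+```
+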
+ +To use k-means, on the other hand, one needs to define the number of segments a priori (which of course one can change, re-clustering as needed). K-means also requires the choice of a few more parameters, but this is beyond our scope for now. Here is how to run k-means:
+ +```{r echo=TRUE, eval=TRUE, tidy=TRUE} +# Please ENTER the kmeans clustering method to use (options are: +# "Hartigan-Wong", "Lloyd", "Forgy", "MacQueen") +# DEFAULT is "Lloyd" +kmeans_method = "Lloyd" +# Define the number of clusters: +numb_clusters_used = 3 +kmeans_clusters <- kmeans(ProjectData_segment,centers= numb_clusters_used, iter.max=2000, algorithm=kmeans_method) +ProjectData_with_kmeans_membership <- cbind(1:length(kmeans_clusters$cluster),kmeans_clusters$cluster) +colnames(ProjectData_with_kmeans_membership)<-c("Observation Number","Cluster_Membership") + +# Assign observations (e.g. people) in their clusters +cluster_memberships_kmeans <- kmeans_clusters$cluster +cluster_ids_kmeans <- unique(cluster_memberships_kmeans) +``` + +The raw k-means output does not, by itself, tell us much about what the segments mean. However, when we profile the segments we can start getting a better (business) understanding of what is happening. **Profiling** is a central part of segmentation: this is where we really get to mix technical and business creativity. + + +### Profiling + +There are many ways to do the profiling of the segments. For example, here we show how the *average* answers of the respondents *in each segment* compare to the *average answer of all respondents* using the ratio of the two. The idea is that if in a segment the average response to a question is very different from the overall average (i.e. the ratio is far from 1), then that question may indicate something about the segment relative to the total population. + +Here are, for example, the profiles of the segments using the clusters found above:
+ First let's see just the average answer people gave to each question for the different segments as well as the total population: +
+ +```{r echo=TRUE, eval=TRUE, tidy=TRUE} +# Select whether to use the Hhierarchical clustering or the k-means clusters: + +cluster_memberships <- cluster_memberships_hclust +cluster_ids <- cluster_ids_hclust +# here is the k-means: uncomment these 2 lines +#cluster_memberships <- cluster_memberships_kmeans +#cluster_ids <- cluster_ids_kmeans + +population_average = matrix(apply(ProjectData_profile, 2, mean), ncol=1) +colnames(population_average) <- "Population" +Cluster_Profile_mean <- sapply(sort(cluster_ids), function(i) apply(ProjectData_profile[(cluster_memberships==i), ], 2, mean)) +if (ncol(ProjectData_profile) <2) + Cluster_Profile_mean=t(Cluster_Profile_mean) +colnames(Cluster_Profile_mean) <- paste("Segment", 1:length(cluster_ids), sep=" ") +cluster.profile <- cbind(population_average,Cluster_Profile_mean) +``` + + +```{r echo=FALSE, comment=NA, warning=FALSE, message=FALSE, results='asis'} +show_data = data.frame(round(cluster.profile,2)) +#show_data = show_data[1:min(max_data_report,nrow(show_data)),] +row<-rownames(show_data) +dfnew<-cbind(row,show_data) +change<-colnames(dfnew) +change[1]<-"Variables" +colnames (dfnew)<-change +m1<-gvisTable(dfnew,options=list(showRowNumber=TRUE,width=1220, height=min(400,27*(nrow(show_data)+1)),allowHTML=TRUE,page='disable')) +print(m1,'chart') + +``` +
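+
+As an aside (not part of the original template), it can be useful to check how much the hierarchical and k-means solutions agree before profiling further; a simple cross-tabulation of the two membership vectors gives a first impression (note that k-means segment labels are arbitrary, so only the pattern of overlap matters):
+
+```{r echo=TRUE, eval=TRUE, tidy=TRUE}
+# Cross-tabulation of hierarchical clustering vs k-means segment memberships
+table(hclust = cluster_memberships_hclust, kmeans = cluster_memberships_kmeans)
+```
+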
+ +Let's now see the relative ratios, which we can also save in a .csv and explore if (absolutely) necessary - e.g. for collaboration with people using other tools. + +```{r echo=TRUE, eval=TRUE, tidy=TRUE} +ratio_limit = 0.1 +``` +Let's see only ratios that are larger or smaller than 1 by, say, at least `r ratio_limit`. +
+ +```{r echo=TRUE, eval=TRUE, tidy=TRUE} +population_average_matrix <- population_average[,"Population",drop=F] %*% matrix(rep(1,ncol(Cluster_Profile_mean)),nrow=1) +cluster_profile_ratios <- (ifelse(population_average_matrix==0, 0,Cluster_Profile_mean/population_average_matrix)) +colnames(cluster_profile_ratios) <- paste("Segment", 1:ncol(cluster_profile_ratios), sep=" ") +rownames(cluster_profile_ratios) <- colnames(ProjectData)[profile_attributes_used] +## printing the result in a clean-slate table +``` + +```{r echo=TRUE, eval=TRUE, tidy=TRUE} +# Save the segment profiles in a file: enter the name of the file! +profile_file = "my_segmentation_profiles.csv" +write.csv(cluster_profile_ratios,file=profile_file) +# We can also save the cluster membership of our respondents: +data_with_segment_membership = cbind(cluster_memberships,ProjectData) +colnames(data_with_segment_membership)[1] = "Segment" +cluster_file = "my_segments.csv" +write.csv(data_with_segment_membership,file=cluster_file) +``` + +```{r echo=FALSE, comment=NA, warning=FALSE, message=FALSE, results='asis'} +#library(shiny) # need this library for heatmaps to work! +# Please enter the minimum distance from "1" the profiling values should have in order to be colored +# (e.g. using heatmin = 0 will color everything - try it) +#heatmin = 0.1 +#source("R/heatmapOutput.R") +#cat(renderHeatmapX(cluster_profile_ratios, border=1, center = 1, minvalue = heatmin)) +``` + +```{r echo=FALSE, comment=NA, warning=FALSE, message=FALSE,results='asis'} +cluster_profile_ratios[abs(cluster_profile_ratios-1) < ratio_limit] <- NA +show_data = data.frame(round(cluster_profile_ratios,2)) +show_data$Variables <- rownames(show_data) +m1<-gvisTable(show_data,options=list(showRowNumber=TRUE,width=1220, height=min(400,27*(nrow(show_data)+1)),allowHTML=TRUE,page='disable')) +print(m1,'chart') +``` + +
+
+**The further a ratio is from 1, the more important that attribute is for a segment relative to the total population.** + +
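+
+Building on this, here is a small sketch (an addition to the template) that lists, for each segment, the profiling questions whose segment average deviates most from the population average - a quick starting point for naming the segments (showing 5 questions per segment is an arbitrary choice):
+
+```{r echo=TRUE, eval=TRUE, tidy=TRUE}
+# Absolute deviation of each segment/population ratio from 1
+deviation_from_population <- abs(Cluster_Profile_mean/population_average_matrix - 1)
+rownames(deviation_from_population) <- colnames(ProjectData)[profile_attributes_used]
+# The most "distinctive" questions per segment
+top_questions <- sapply(1:ncol(deviation_from_population), function(s)
+  rownames(deviation_from_population)[order(deviation_from_population[, s], decreasing = TRUE)[1:5]])
+colnames(top_questions) <- colnames(Cluster_Profile_mean)
+top_questions
+```
+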
+ +#### Questions + +1. How many segments are there in our market? Why did you choose that number of segments? Again, try a few and explain your final choice based on a) statistical arguments, b) interpretation arguments, and c) business arguments (**you need to consider all three types of arguments**). +2. Can you describe the segments you found based on the profiles? +3. What if you change the number of factors and, in general, *iterate the whole analysis*? **Iterations** are key in data science. +4. Can you now answer the [Boats case questions](http://inseaddataanalytics.github.io/INSEADAnalytics/Boats-A-prerelease.pdf)? What business decisions do you recommend to this company based on your analysis?
+ +**Your Answers here:** +
+
+
+
+ +
+ +**You have now completed your first market segmentation project.** Do you have data from another survey you can use with this report now? + +**Extra question**: explore and report a new segmentation analysis... + +... and as always **Have Fun** diff --git a/CourseSessions/Sessions23/Session2inclass.html b/CourseSessions/Sessions23/Session2inclass.html new file mode 100644 index 00000000..990e768d --- /dev/null +++ b/CourseSessions/Sessions23/Session2inclass.html @@ -0,0 +1,19532 @@ + + + + + + + + + + + + + + +Sessions 3-4 + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + +


+

The purpose of this session is to become familiar with:

+
    +
  1. Some visualization tools;
  2. +
  3. Principal Component Analysis and Factor Analysis;
  4. +
  5. Clustering Methods;
  6. +
  7. Introduction to machine learning methods;
  8. +
  9. A market segmentation case study.
  10. +
+

As always, before starting, make sure you have pulled the session 3-4 files (yes, I know, it says session 2, but it is 3-4 - we need to update the filenames at some point, but until then we use common sense and ignore them) on your github repository (if you pull the course github repository you also get the session files automatically). Moreover, make sure you are in the directory of this exercise. Directory paths may be complicated, and sometimes a frustrating source of problems, so it is recommended that you use these R commands to find out your current working directory and, if needed, set it where you have the main files for the specific exercise/project (there are other ways, but for now just be aware of this path issue). For example, assuming we are now in the “MYDIRECTORY/INSEADAnalytics” directory, we can do these:

+
getwd()
+setwd("CourseSessions/Sessions23")
+list.files()
+rm(list = ls())  # Clean up the memory, if we want to rerun from scratch
+

As always, you can use the help command in Rstudio to find out about any R function (e.g. type help(list.files) to learn what the R function list.files does).

+

Let’s start.

+
+
+
+

Survey Data for Market Segmentation

+

We will be using the boats case study as an example. At the end of this class we will be able to develop (from scratch) the readings of sessions 3-4 as well as understand the tools used and the interpretation of the results in practice - in order to make business decisions. The code used here is along the lines of the code in the session directory, e.g. in the RunStudy.R file and the report doc/Report_s23.Rmd. There may be a few differences, as there are many ways to write code to do the same thing.

+

Let’s load the data:

+
## Warning: package 'knitr' was built under R version 3.3.2
+
## Warning: package 'Hmisc' was built under R version 3.3.2
+
## Warning: package 'ggplot2' was built under R version 3.3.2
+
## Warning: package 'FactoMineR' was built under R version 3.3.2
+
ProjectData <- read.csv("data/Boats.csv", sep = ";", dec = ",")  # this contains only the matrix ProjectData
+ProjectData = data.matrix(ProjectData)
+colnames(ProjectData) <- gsub("\\.", " ", colnames(ProjectData))
+ProjectDataFactor = ProjectData[, c(2:30)]
+


and do some basic visual exploration of the first 50 respondents (it is always necessary to look at the data first):

+ + + + + + + +
+ +
+


+

This is the correlation matrix of the customer responses to the 29 attitude questions - which are the only questions that we will use for the segmentation (see the case):

+ + + + + + + +
+ +
+


+
+

Questions

+
    +
  1. Do you see any high correlations between the responses? Do they make sense?
  2. +
  3. What do these correlations imply?
  4. +
+
+
Answers:
+





+
+
+
+
+
+

Key Customer Attitudes

+

Clearly the survey asked many redundant questions (can you think of some reasons why?), so we may be able to “group” these 29 attitude questions into only a few “key factors”. This will not only simplify the data, but will also greatly facilitate our understanding of the customers.

+

To do so, we use methods called Principal Component Analysis and factor analysis as discussed in the session readings. We can use two different R commands for this (they make slightly different information easily available as output): the command principal (check help(principal) from R package psych), and the command PCA from R package FactoMineR - there are more packages and commands for this, as these methods are very widely used.

+

Here is how the principal function is used:

+
UnRotated_Results <- principal(ProjectDataFactor, nfactors = ncol(ProjectDataFactor), 
+    rotate = "none", score = TRUE)
+UnRotated_Factors <- round(UnRotated_Results$loadings, 2)
+UnRotated_Factors <- as.data.frame(unclass(UnRotated_Factors))
+colnames(UnRotated_Factors) <- paste("Component", 1:ncol(UnRotated_Factors), 
+    sep = " ")
+



+

Here is how the PCA function is used:

+
Variance_Explained_Table_results <- PCA(ProjectDataFactor, graph = FALSE)
+Variance_Explained_Table <- Variance_Explained_Table_results$eig
+Variance_Explained_Table_copy <- Variance_Explained_Table
+row = 1:nrow(Variance_Explained_Table)
+name <- paste("Component No:", row, sep = "")
+Variance_Explained_Table <- cbind(name, Variance_Explained_Table)
+Variance_Explained_Table <- as.data.frame(Variance_Explained_Table)
+colnames(Variance_Explained_Table) <- c("Components", "Eigenvalue", "Percentage_of_explained_variance", 
+    "Cumulative_percentage_of_explained_variance")
+
+eigenvalues <- Variance_Explained_Table[, 2]
+


Let’s look at the variance explained as well as the eigenvalues (see session readings):

+ + + + + + + +
+ +
+


+ + + + + + + +
+ +
+


+
+

Questions:

+
    +
  1. Can you explain what this table and the plot are? What do they indicate? What can we learn from these?
  2. +
  3. Why does the plot have this specific shape? Could the plotted line be increasing?
  4. +
  5. What characteristics of these results would we prefer to see? Why?
  6. +
+

Your Answers here:



+
+
+

Visualization and Interpretation

+

Let’s now see what the “top factors” look like.

+
# Choose one of these options:
+factors_selected = sum(Variance_Explained_Table_copy[, 1] >= 1)
+# minimum_variance_explained = 0.5; factors_selected =
+# 1:head(which(Variance_Explained_Table_copy[,'cumulative percentage of
+# variance']>= minimum_variance_explained),1) factors_selected = 10
+


+

To better visualize them, we will use what is called a “rotation”. There are many rotation methods; here we use what is called the varimax rotation:

+
# Please ENTER the rotation eventually used (e.g. 'none', 'varimax',
+# 'quartimax', 'promax', 'oblimin', 'simplimax', and 'cluster' - see
+# help(principal)). Default is 'varimax'
+rotation_used = "varimax"
+
Rotated_Results <- principal(ProjectDataFactor, nfactors = max(factors_selected), 
+    rotate = rotation_used, score = TRUE)
+Rotated_Factors <- round(Rotated_Results$loadings, 2)
+Rotated_Factors <- as.data.frame(unclass(Rotated_Factors))
+colnames(Rotated_Factors) <- paste("Component", 1:ncol(Rotated_Factors), sep = " ")
+sorted_rows <- sort(Rotated_Factors[, 1], decreasing = TRUE, index.return = TRUE)$ix
+Rotated_Factors <- Rotated_Factors[sorted_rows, ]
+ + + + + + + +
+ +
+



+

To better visualize and interpret the factors we often “suppress” loadings with small values, e.g. with absolute values smaller than 0.5. In this case our factors look as follows after suppressing the small numbers:

+
MIN_VALUE = 0.5
+Rotated_Factors_thres <- Rotated_Factors
+Rotated_Factors_thres[abs(Rotated_Factors_thres) < MIN_VALUE] <- NA
+colnames(Rotated_Factors_thres) <- colnames(Rotated_Factors)
+rownames(Rotated_Factors_thres) <- rownames(Rotated_Factors)
+ + + + + + + +
+ +
+



+
+
+

Questions

+
    +
  1. What do the first couple of factors mean? Do they make business sense?
  2. +
  3. How many factors should we choose for this data/customer base? Please try a few and explain your final choice based on a) statistical arguments, b) interpretation arguments, and c) business arguments (you need to consider all three types of arguments)
  4. +
  5. How would you interpret the factors you selected?
  6. +
  7. What lessons about data science do you learn when doing this analysis? Please comment.
  8. +
  9. (Extra/Optional) Can you make this report “dynamic” using shiny and then post it on shinyapps.io? (see for example exercise set 1 and interactive exercise set 2)
  10. +
+

Your Answers here:



+
+
+
+
+
+

Market Segmentation

+

Let’s now use one representative question for each factor (we can also use the “factor scores” for each respondent - see session readings) to represent our survey respondents. We can choose the question with the highest absolute factor loading for each factor. For example, when we use 5 factors with the varimax rotation we can select questions Q.1.9 (I see my boat as a status symbol), Q1.18 (Boating gives me a feeling of adventure), Q1.4 (I only consider buying a boat from a reputable brand), Q1.11 (I tend to perform minor boat repairs and maintenance on my own) and Q1.2 (When buying a boat getting the lowest price is more important than the boat brand) - try it. These are columns 10, 19, 5, 12, and 3, respectively of the data matrix Projectdata.

+

In market segmentation, the variables one uses to profile the segments are not necessarily the same as those used to segment the market: the latter may be, for example, attitude/needs related (you define segments based on what the customers “need”), while the former may be any information that allows a company to identify the defined customer segments (e.g. demographics, location, etc.). Of course, deciding which variables to use for segmentation and which to use for profiling (and then for the activation of the segmentation for business purposes) is largely subjective. So in this case we will use all survey questions for profiling for now:

+


+
segmentation_attributes_used = c(10, 19, 5, 12, 3)
+profile_attributes_used = 2:ncol(ProjectData)
+ProjectData_segment = ProjectData[, segmentation_attributes_used]
+ProjectData_profile = ProjectData[, profile_attributes_used]
+

A key family of methods used for segmentation is what is called clustering methods. This is a very important problem in statistics and machine learning, used in all sorts of applications such as in Amazon’s pioneering work on recommender systems. There are many mathematical methods for clustering. We will use two very standard methods, hierarchical clustering and k-means. While the “math” behind all these methods can be complex, the R functions used are relatively simple to use, as we will see.

+

For example, to use hierarchical clustering we first define some parameters (see session readings) and then call the command hclust:

+
# Please ENTER the distance metric eventually used for the clustering in
+# case of hierarchical clustering (e.g. 'euclidean', 'maximum', 'manhattan',
+# 'canberra', 'binary' or 'minkowski' - see help(dist)).  DEFAULT is
+# 'euclidean'
+distance_used = "euclidean"
+# Please ENTER the hierarchical clustering method to use (options are:
+# 'ward', 'single', 'complete', 'average', 'mcquitty', 'median' or
+# 'centroid') DEFAULT is 'ward.D'
+hclust_method = "ward.D"
+# Define the number of clusters:
+numb_clusters_used = 3
+
Hierarchical_Cluster_distances <- dist(ProjectData_segment, method = distance_used)
+Hierarchical_Cluster <- hclust(Hierarchical_Cluster_distances, method = hclust_method)
+
+# Assign observations (e.g. people) in their clusters
+cluster_memberships_hclust <- as.vector(cutree(Hierarchical_Cluster, k = numb_clusters_used))
+cluster_ids_hclust = unique(cluster_memberships_hclust)
+ProjectData_with_hclust_membership <- cbind(1:length(cluster_memberships_hclust), 
+    cluster_memberships_hclust)
+colnames(ProjectData_with_hclust_membership) <- c("Observation Number", "Cluster_Membership")
+

Finally, we can see the dendrogram (see class readings and online resources for more information) to have a first rough idea of what segments (clusters) we may have - and how many.

+


We can also plot the “distances” traveled before we need to merge any of the smaller, lower-level clusters into larger ones - these are the heights of the tree branches that link the clusters as we traverse the tree from its leaves to its root. If we have n observations, this plot has n-1 numbers.

+ + + + + + + +
+ +
+


+

To use k-means, on the other hand, one needs to define the number of segments a priori (which of course one can change, re-clustering as needed). K-means also requires the choice of a few more parameters, but this is beyond our scope for now. Here is how to run k-means:

+
# Please ENTER the kmeans clustering method to use (options are:
+# 'Hartigan-Wong', 'Lloyd', 'Forgy', 'MacQueen' DEFAULT is 'Lloyd'
+kmeans_method = "Lloyd"
+# Define the number of clusters:
+numb_clusters_used = 3
+kmeans_clusters <- kmeans(ProjectData_segment, centers = numb_clusters_used, 
+    iter.max = 2000, algorithm = kmeans_method)
+ProjectData_with_kmeans_membership <- cbind(1:length(kmeans_clusters$cluster), 
+    kmeans_clusters$cluster)
+colnames(ProjectData_with_kmeans_membership) <- c("Observation Number", "Cluster_Membership")
+
+# Assign observations (e.g. people) in their clusters
+cluster_memberships_kmeans <- kmeans_clusters$cluster
+cluster_ids_kmeans <- unique(cluster_memberships_kmeans)
+

The raw k-means output does not, by itself, tell us much about what the segments mean. However, when we profile the segments we can start getting a better (business) understanding of what is happening. Profiling is a central part of segmentation: this is where we really get to mix technical and business creativity.

+
+
+

Profiling

+

There are many ways to do the profiling of the segments. For example, here we show how the average answers of the respondents in each segment compare to the average answer of all respondents using the ratio of the two. The idea is that if in a segment the average response to a question is very different from the overall average (i.e. the ratio is far from 1), then that question may indicate something about the segment relative to the total population.

+

Here are for example the profiles of the segments using the clusters found above:

+


First let’s see just the average answer people gave to each question for the different segments as well as the total population:

+
# Select whether to use the hierarchical clustering or the k-means
+# clusters:
+
+cluster_memberships <- cluster_memberships_hclust
+cluster_ids <- cluster_ids_hclust
+# here is the k-means: uncomment these 2 lines cluster_memberships <-
+# cluster_memberships_kmeans cluster_ids <- cluster_ids_kmeans
+
+population_average = matrix(apply(ProjectData_profile, 2, mean), ncol = 1)
+colnames(population_average) <- "Population"
+Cluster_Profile_mean <- sapply(sort(cluster_ids), function(i) apply(ProjectData_profile[(cluster_memberships == 
+    i), ], 2, mean))
+if (ncol(ProjectData_profile) < 2) Cluster_Profile_mean = t(Cluster_Profile_mean)
+colnames(Cluster_Profile_mean) <- paste("Segment", 1:length(cluster_ids), sep = " ")
+cluster.profile <- cbind(population_average, Cluster_Profile_mean)
+ + + + + + + +
+ +
+


+

Let’s now see the relative ratios, which we can also save in a .csv and explore if (absolutely) necessary - e.g. for collaboration with people using other tools.

+
ratio_limit = 0.1
+

Let’s see only ratios that are larger or smaller than 1 by, say, at least 0.1.

+
population_average_matrix <- population_average[, "Population", drop = F] %*% 
+    matrix(rep(1, ncol(Cluster_Profile_mean)), nrow = 1)
+cluster_profile_ratios <- (ifelse(population_average_matrix == 0, 0, Cluster_Profile_mean/population_average_matrix))
+colnames(cluster_profile_ratios) <- paste("Segment", 1:ncol(cluster_profile_ratios), 
+    sep = " ")
+rownames(cluster_profile_ratios) <- colnames(ProjectData)[profile_attributes_used]
+## printing the result in a clean-slate table
+
# Save the segment profiles in a file: enter the name of the file!
+profile_file = "my_segmentation_profiles.csv"
+write.csv(cluster_profile_ratios, file = profile_file)
+# We can also save the cluster membership of our respondents:
+data_with_segment_membership = cbind(cluster_memberships, ProjectData)
+colnames(data_with_segment_membership)[1] = "Segment"
+cluster_file = "my_segments.csv"
+write.csv(data_with_segment_membership, file = cluster_file)
+ + + + + + + +
+ +
+



The further a ratio is from 1, the more important that attribute is for a segment relative to the total population.

+


+
+

Questions

+
    +
  1. How many segments are there in our market? Why did you choose that number of segments? Again, try a few and explain your final choice based on a) statistical arguments, b) interpretation arguments, and c) business arguments (you need to consider all three types of arguments)
  2. +
  3. Can you describe the segments you found based on the profiles?
  4. +
  5. What if you change the number of factors and in general you iterate the whole analysis? Iterations are key in data science.
  6. +
  7. Can you now answer the Boats case questions? What business decisions do you recommend to this company based on your analysis?
  8. +
+


+

Your Answers here:



+
+

You have now completed your first market segmentation project. Do you have data from another survey you can use with this report now?

+

Extra question: explore and report a new segmentation analysis…

+

… and as always Have Fun

+
+
+ + + + +
+ + + + + + + + diff --git a/Exercises/Exerciseset1/ExerciseSet1.Rmd b/Exercises/Exerciseset1/ExerciseSet1.Rmd index a8349376..6817a092 100644 --- a/Exercises/Exerciseset1/ExerciseSet1.Rmd +++ b/Exercises/Exerciseset1/ExerciseSet1.Rmd @@ -2,6 +2,7 @@ --- title: "Exercise Set 1" author: "T. Evgeniou" +runtime: shiny output: html_document --- @@ -60,6 +61,7 @@ For more information on downloading finance data from the internet as well as on 1. Can you find some interesting finance related R package or github repository? **Your Answers here:**
+[1. R for Quantitative Finance](https://github.com/cheerzzh/R_for_Quantitative_Finance)

@@ -83,17 +85,36 @@ pnl_plot(SPY) #### Questions 1. Notice that the code also downloads the returns of Apple during the same period. Can you explain where this is done in the code (including the .R files used)? +
+**Your Answers here:** +
+This is defined in dataSet1.R via the line `mytickers = c("SPY", "AAPL")`. Because "AAPL" is in that ticker list, Apple's stock data gets downloaded along with SPY when dataSet1.R is sourced.
2. What are the cumulative, average daily returns, and the standard deviation of the daily returns of Apple in the same period? -3. *(Extra points)* What if we want to also see the returns of another company, say Yahoo!, in the same period? Can you get that data and report the statistics for Yahoo!'s stock, too? - +
**Your Answers here:**
+The cumulative returns of the AAPL index during this period is `r round(100*sum(StockReturns[,2]),1)`%.
+The average daily returns of the AAPL index during this period is `r round(100*mean(StockReturns[,2]),3)`%;
+The standard deviation of the daily returns of the AAPL index during this period is `r round(100*sd(StockReturns[,2]),3)`%;
-
-
+3. *(Extra points)* What if we want to also see the returns of another company, say Yahoo!, in the same period? Can you get that data and report the statistics for Yahoo!'s stock, too? +
+**Your Answers here:** +
+Just add the ticker symbol "YHOO" to the list of tickers pulled in dataSet1.R.
+For Yahoo +
+The cumulative returns of the YHOO index during this period is `r round(100*sum(StockReturns[,3]),1)`%. +
+The average daily returns of the YHOO index during this period is `r round(100*mean(StockReturns[,3]),3)`%; +
+The standard deviation of the daily returns of the YHOO index during this period is `r round(100*sd(StockReturns[,3]),3)`%; +
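+
+More generally (a small sketch going a bit beyond the question), the same three statistics can be computed for every ticker in StockReturns at once using apply (the apply function is discussed further in Part II below):
+
+```{r echo=TRUE, eval=TRUE, comment=NA, warning=FALSE, message=FALSE}
+# Cumulative return, average daily return and daily standard deviation (all in %) per ticker
+round(100 * apply(StockReturns, 2, function(r) c(cumulative = sum(r), mean_daily = mean(r), sd_daily = sd(r))), 3)
+```
+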
### Part II: Simple Matrix Manipulations @@ -119,17 +140,27 @@ names(portfolio) <- rownames(StockReturns) pnl_plot(portfolio) ``` - We can also transpose the matrix of returns to create a new "horizontal" matrix. Let's call this matrix (variable name) transposedData. We can do so using this command: `transposedData = t(StockReturns)`. #### Questions 1. What R commands can you use to get the number of rows and number of columns of the new matrix called transposedData? + 2. Based on the help for the R function *apply* (`help(apply)`), can you create again the portfolio of S&P and Apple and plot the returns in a new figure below? **Your Answers here:** +`r transposedData = t(StockReturns)`
+The number of rows `r nrow(transposedData)` and number of columns is `r ncol(transposedData)`
+Create the new transposed matrix as such: +
+```{r echo=TRUE, comment=NA, warning=FALSE,message=FALSE,results='asis',fig.align='center', fig=TRUE} +transposedData = t(StockReturns) +t_portfolio= apply(transposedData,2,mean) +names(t_portfolio) <- rownames(StockReturns) +pnl_plot(t_portfolio) +```

@@ -147,8 +178,10 @@ This is an important step and will get you to think about the overall process on **Your Answers here:**
+1. Just set startDate = "2001-01-01" in dataSet1.R (one line, one number changed) and re-knit the report.

+2. You can keep adding ticker symbols to the mytickers list in dataSet1.R and then combine their returns into an equal-weighted portfolio - a sketch is shown below.
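+
+A minimal sketch of that extra exercise, assuming the extra tickers have already been added to mytickers in dataSet1.R and the data re-downloaded (it reuses the portfolio construction shown in Part II):
+
+```{r echo=TRUE, eval=TRUE, comment=NA, warning=FALSE, message=FALSE, fig.align='center', fig=TRUE}
+# Equal-weighted portfolio across whatever tickers are currently in StockReturns
+equal_weighted <- apply(StockReturns, 1, mean)
+names(equal_weighted) <- rownames(StockReturns)
+pnl_plot(equal_weighted)
+```
+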

@@ -181,8 +214,12 @@ myData + StockReturns[1:40,] **Your Answers here:**
+ +1. I get a value of 20. +It makes not sense but myData is just a lump of dates and data together.

+2. I should get all sorts of errors since the data doesnt even match up in the dimensions

@@ -194,11 +231,24 @@ Can you now load another dataset from some CSV file and report some basic statis
+I downloaded the csv of Hilary Clinton and realDonald Trump tweets from Kaggle. It has a lot of text instead of numbers. Link(https://www.kaggle.com/benhamner/clinton-trump-tweets) + +You can still do basic analysis of the data like dim but its also has challenges because its mostly text and more advanced functions will be needed to analyze. + +```{r eval = TRUE, echo=TRUE, comment=NA, warning=FALSE, message=FALSE,results='asis'} +tweetsData <- read.csv(file = "tweets.csv", header = TRUE, sep=";") +dim(tweetsData) +class(tweetsData) +dim(tweetsData) +nrow(tweetsData) +ncol(tweetsData) +tweetsData[1:2,] +``` ### Creating Interactive Documents Finally, just for fun, one can add some interactivity in the report using [Shiny](http://rmarkdown.rstudio.com/authoring_shiny.html).All one needs to do is set the eval flag of the code chunk below (see the .Rmd file) to "TRUE", add the line "runtime: shiny" at the very begining of the .Rmd file, make the markdown output to be "html_document", and then press "Run Document". -```{r, eval=FALSE, echo = TRUE} +```{r, eval=TRUE, echo = TRUE} sliderInput("startdate", "Starting Date:", min = 1, max = length(portfolio), value = 1) sliderInput("enddate", "End Date:", min = 1, max = length(portfolio), diff --git a/Exercises/Exerciseset1/ExerciseSet1_NZF.Rmd b/Exercises/Exerciseset1/ExerciseSet1_NZF.Rmd new file mode 100644 index 00000000..4abd7ab6 --- /dev/null +++ b/Exercises/Exerciseset1/ExerciseSet1_NZF.Rmd @@ -0,0 +1,274 @@ + +--- +title: "Exercise Set 1" +author: "Ng Zhenfu" +runtime: shiny +output: html_document +--- + + +
+ +The purpose of this exercise is to become familiar with: + +1. Basic statistics functions in R; +2. Simple matrix operations; +3. Simple data manipulations; +4. The idea of functions as well as some useful customized functions provided. + +While doing this exercise we will also see how to generate replicable and customizable reports. For this purpose the exercise uses the R Markdown capabilities (see [Markdown Cheat Sheet](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf) or a [basic introduction to R Markdown](http://rmarkdown.rstudio.com/authoring_basics.html)). These capabilities allow us to create dynamic reports. For example today's date is `r Sys.Date()` (you need to see the .Rmd to understand that this is *not* a static typed-in date but it changes every time you compile the .Rmd - if the date changed of course). + +Before starting, make sure you have pulled the [exercise files](https://github.com/InseadDataAnalytics/INSEADAnalytics/tree/master/Exercises/Exerciseset1) on your github repository (if you pull the course github repository you also get the exercise set files automatically). Moreover, make sure you are in the directory of this exercise. Directory paths may be complicated, and sometimes a frustrating source of problems, so it is recommended that you use these R commands to find out your current working directory and, if needed, set it where you have the main files for the specific exercise/project (there are other ways, but for now just be aware of this path issue). For example, assuming we are now in the "MYDIRECTORY/INSEADAnalytics" directory, we can do these: + +```{r echo=TRUE, eval=FALSE, tidy=TRUE} +getwd() +setwd("Exercises/Exerciseset1/") +list.files() +``` + +**Note:** you can always use the `help` command in Rstudio to find out about any R function (e.g. type `help(list.files)` to learn what the R function `list.files` does). + +Let's now see the exercise. + +**IMPORTANT:** You should answer all questions by simply adding your code/answers in this document through editing the file ExerciseSet1.Rmd and then clicking on the "Knit HTML" button in RStudio. Once done, please post your .Rmd and html files in your github repository. + +
+
+ +### Exercise Data + +We download daily prices (open, high, low, close, and adjusted close) and volume data of publicly traded companies and markets from the web (e.g. Yahoo! or Google, etc). This is done by sourcing the file data.R as well as some helper functions in herpersSet1.R which also installs a number of R libraries (hence the first time you run this code you will see a lot of red color text indicating the *download* and *installation* process): + +```{r eval = TRUE, echo=TRUE, error = FALSE, warning=FALSE,message=FALSE,results='asis'} +source("helpersSet1.R") +source("dataSet1.R") +``` + +For more information on downloading finance data from the internet as well as on finance related R tools see these starting points (there is a lot more of course available): + +* [Some finance data loading tools](http://www.r-bloggers.com/r-code-yahoo-finance-data-loading/) +* [Connecting directly to Bloomberg](http://www.r-bloggers.com/rblpapi-connecting-r-to-bloomberg/) +* [Some time series plot tools](http://www.r-bloggers.com/plotting-time-series-in-r-using-yahoo-finance-data/) +* [Various finance code links](https://cran.r-project.org/web/views/Finance.html) +* [More links](http://blog.revolutionanalytics.com/2013/12/quantitative-finance-applications-in-r.html) +* [Even more links](http://www.r-bloggers.com/financial-data-accessible-from-r-part-iv/) +* Of course endless available code (e.g. like this one that seems to [get companies' earnings calendars](https://github.com/gsee/qmao/blob/master/R/getCalendar.R)) + +#### Optional Question + +1. Can you find some interesting finance related R package or github repository? +**Your Answers here:** +
+[1. R for Quantitative Finance](https://github.com/cheerzzh/R_for_Quantitative_Finance) +
+ +
+
+ +### Part I: Statistics of S&P Daily Returns + +We have `r nrow(StockReturns)` days of data, starting from `r rownames(StockReturns)[1]` until `r tail(rownames(StockReturns),1)`. Here are some basic statistics about the S&P returns: + +1. The cumulative returns of the S&P index during this period is `r round(100*sum(StockReturns[,1]),1)`%. +2. The average daily returns of the S&P index during this period is `r round(100*mean(StockReturns[,1]),3)`%; +2. The standard deviation of the daily returns of the S&P index during this period is `r round(100*sd(StockReturns[,1]),3)`%; + +Here are returns of the S&P in this period (note the use of the helper function pnl_plot - defined in file helpersSet1.R): + +```{r echo=FALSE, comment=NA, warning=FALSE, message=FALSE,results='asis',fig.align='center', fig.height=4,fig.width= 6, fig=TRUE} +SPY = StockReturns[,"SPY"] +pnl_plot(SPY) +``` + +#### Questions + +1. Notice that the code also downloads the returns of Apple during the same period. Can you explain where this is done in the code (including the .R files used)? +
+**Your Answers here:** +
+This is defined in dataSet1.R via the line `mytickers = c("SPY", "AAPL")`. Because "AAPL" is in that ticker list, Apple's stock data gets downloaded along with SPY when dataSet1.R is sourced.
+2. What are the cumulative, average daily returns, and the standard deviation of the daily returns of Apple in the same period? +
+**Your Answers here:** +
+The cumulative returns of the AAPL index during this period is `r round(100*sum(StockReturns[,2]),1)`%. +
+The average daily returns of the AAPL index during this period is `r round(100*mean(StockReturns[,2]),3)`%; +
+The standard deviation of the daily returns of the AAPL index during this period is `r round(100*sd(StockReturns[,2]),3)`%; +
+ +3. *(Extra points)* What if we want to also see the returns of another company, say Yahoo!, in the same period? Can you get that data and report the statistics for Yahoo!'s stock, too? +
+**Your Answers here:** +
+Just add the ticker symbol "YHOO" to the list of tickers pulled in dataSet1.R.
+For Yahoo +
+The cumulative returns of the YHOO index during this period is `r round(100*sum(StockReturns[,3]),1)`%. +
+The average daily returns of the YHOO index during this period is `r round(100*mean(StockReturns[,3]),3)`%; +
+The standard deviation of the daily returns of the YHOO index during this period is `r round(100*sd(StockReturns[,3]),3)`%; +
+ +### Part II: Simple Matrix Manipulations + +For this part of the exercise we will do some basic manipulations of the data. First note that the data are in a so-called matrix format. If you run these commands in RStudio (use help to find out what they do) you will see how matrices work: + +```{r eval = FALSE, echo=TRUE} +class(StockReturns) +dim(StockReturns) +nrow(StockReturns) +ncol(StockReturns) +StockReturns[1:4,] +head(StockReturns,5) +tail(StockReturns,5) +``` + +We will now use an R function for matrices that is extremely useful for analyzing data. It is called *apply*. Check it out using help in R. + +For example, we can now quickly estimate the average returns of S&P and Apple (of course this can be done manually, too, but what if we had 500 stocks - e.g. a matrix with 500 columns?) and plot the returns of that 50-50 on S&P and Apple portfolio: + +```{r echo=FALSE, comment=NA, warning=FALSE, message=FALSE,results='asis',fig.align='center', fig=TRUE} +portfolio = apply(StockReturns,1,mean) +names(portfolio) <- rownames(StockReturns) +pnl_plot(portfolio) +``` + +We can also transpose the matrix of returns to create a new "horizontal" matrix. Let's call this matrix (variable name) transposedData. We can do so using this command: `transposedData = t(StockReturns)`. + +#### Questions + +1. What R commands can you use to get the number of rows and number of columns of the new matrix called transposedData? + +2. Based on the help for the R function *apply* (`help(apply)`), can you create again the portfolio of S&P and Apple and plot the returns in a new figure below? + +**Your Answers here:** +`r transposedData = t(StockReturns)` +
+The number of rows `r nrow(transposedData)` and number of columns is `r ncol(transposedData)` +
+Create the new transposed matrix as such: +
+```{r echo=TRUE, comment=NA, warning=FALSE,message=FALSE,results='asis',fig.align='center', fig=TRUE} +transposedData = t(StockReturns) +t_portfolio= apply(transposedData,2,mean) +names(t_portfolio) <- rownames(StockReturns) +pnl_plot(t_portfolio) +``` +
+
+ +
+
+ +### Part III: Reproducibility and Customization + +This is an important step and will get you to think about the overall process once again. + +#### Questions + +1. We want to re-do all this analysis with data since 2001-01-01: what change do we need to make in the code (hint: all you need to change is one line - exactly 1 number! - in data.R file), and how can you get the new exercise set with the data since 2001-01-01? +2. *(Extra Exercise)* Can you get the returns of a few companies and plot the returns of an equal weighted portfolio with those companies during some period you select? + +**Your Answers here:** +
+1. Just set startDate = "2001-01-01" in dataSet1.R (one line, one number changed) and re-knit the report.
+
+2. You can keep adding ticker symbols to the mytickers list in dataSet1.R and then combine their returns into an equal-weighted portfolio (e.g. with apply(StockReturns, 1, mean), as in Part II).
+ +
+
+ +### Part IV: Read/Write .CSV files + +Finally, one can read and write data in .CSV files. For example, we can save the first 20 days of data for S&P and Apple in a file using the command: + +```{r eval = TRUE, echo=TRUE, comment=NA, warning=FALSE, message=FALSE,results='asis'} +write.csv(StockReturns[1:20,c("SPY","AAPL")], file = "twentydays.csv", row.names = TRUE, col.names = TRUE) +``` + +Do not get surpsised if you see the csv file in your directories suddenly! You can then read the data from the csv file using the read.csv command. For example, this will load the data from the csv file and save it in a new variable that now is called "myData": + +```{r eval = TRUE, echo=TRUE, comment=NA, warning=FALSE, message=FALSE,results='asis'} +myData <- read.csv(file = "twentydays.csv", header = TRUE, sep=";") +``` + +Try it! + +#### Questions + +1. Once you write and read the data as described above, what happens when you run this command in the console of the RStudio: `sum(myData != StockReturns[1:20,])` +2. *(Extra exercise)* What do you think will happen if you now run this command, and why: + +```{r eval = FALSE, echo=TRUE} +myData + StockReturns[1:40,] +``` + +**Your Answers here:** +
+ +1. I get a value of 20. +At first this makes little sense, but the file was read back with sep=";" while write.csv wrote it with commas, so myData is just a lump of dates and data strung together in a single column and the comparison is not meaningful.
+
+2. I should get errors (or meaningless output), since the dimensions of myData and StockReturns[1:40,] don't even match up.
+ +
+
+ +### Extra Question + +Can you now load another dataset from some CSV file and report some basic statistics about that data? + +
+ +I downloaded the csv of Hillary Clinton and Donald Trump tweets from Kaggle (link: https://www.kaggle.com/benhamner/clinton-trump-tweets). It has a lot of text instead of numbers. + +You can still do basic analysis of the data (like dim), but it also has challenges because it is mostly text, so more advanced functions will be needed to analyze it. + +```{r eval = TRUE, echo=TRUE, comment=NA, warning=FALSE, message=FALSE,results='asis'} +tweetsData <- read.csv(file = "tweets.csv", header = TRUE, sep=";") +dim(tweetsData) +class(tweetsData) +dim(tweetsData) +nrow(tweetsData) +ncol(tweetsData) +tweetsData[1:2,] +``` +### Creating Interactive Documents + +Finally, just for fun, one can add some interactivity in the report using [Shiny](http://rmarkdown.rstudio.com/authoring_shiny.html). All one needs to do is set the eval flag of the code chunk below (see the .Rmd file) to "TRUE", add the line "runtime: shiny" at the very beginning of the .Rmd file, make the markdown output "html_document", and then press "Run Document". + +```{r, eval=TRUE, echo = TRUE} +sliderInput("startdate", "Starting Date:", min = 1, max = length(portfolio), + value = 1) +sliderInput("enddate", "End Date:", min = 1, max = length(portfolio), + value = length(portfolio)) + +renderPlot({ + pnl_plot(portfolio[input$startdate:input$enddate]) +}) +```
+ +
+
+ +### Endless explorations (optional homework) + +This is a [recent research article](http://poseidon01.ssrn.com/delivery.php?ID=851091091009083082092113118102076099034023058067019062072066007100008111081022102123034016097101060099003106125099002090116089026058012038004030005113111105079028059062024121067073126072090091089069014121102110107075029090001011087028011082124103085&EXT=pdf) that won an award in 2016. Can you implement a simple strategy as in Figure 1 of this paper? You may find these R commands useful: `names`, `which`, `str_sub`,`diff`,`as.vector`, `length`, `pmin`, `pmax`, `sapply`, `lapply`,`Reduce`,`unique`, `as.numeric`, `%in%` +![A Simple Trading Startegy](simpletrade.png) + +What if you also include information about bonds? (e.g. download the returns of the the ETF with ticker "TLT") Is there any relation between stocks and bonds? + + +**Have fun** + diff --git a/Exercises/Exerciseset2/ExerciseSet2_NZF.Rmd b/Exercises/Exerciseset2/ExerciseSet2_NZF.Rmd new file mode 100644 index 00000000..abafe021 --- /dev/null +++ b/Exercises/Exerciseset2/ExerciseSet2_NZF.Rmd @@ -0,0 +1,416 @@ +--- +title: "Exercise Set 2: A $300 Billion Strategy" +author: "Ng Zhen fu" +output: html_document +--- + +
+ +The purpose of this exercise is to become familiar with: + +1. Some time series analysis tools; +2. Correlation matrices and principal component analysis (PCA) (see [readings of sessions 3-4](http://inseaddataanalytics.github.io/INSEADAnalytics/Report_s23.html)); +3. More data manipulation and reporting tools (including Google Charts). + +As always, while doing this exercise we will also see how to generate replicable and customizable reports. For this purpose the exercise uses the R Markdown capabilities (see [Markdown Cheat Sheet](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf) or a [basic introduction to R Markdown](http://rmarkdown.rstudio.com/authoring_basics.html)). These capabilities allow us to create dynamic reports. For example today's date is `r Sys.Date()` (you need to see the .Rmd to understand that this is *not* a static typed-in date but it changes every time you compile the .Rmd - if the date changed of course). + +Before starting, make sure you have pulled the [exercise set 2 souce code files](https://github.com/InseadDataAnalytics/INSEADAnalytics/tree/master/Exercises/Exerciseset2) on your github repository (if you pull the course github repository you also get the exercise set files automatically). Moreover, make sure you are in the directory of this exercise. Directory paths may be complicated, and sometimes a frustrating source of problems, so it is recommended that you use these R commands to find out your current working directory and, if needed, set it where you have the main files for the specific exercise/project (there are other ways, but for now just be aware of this path issue). For example, assuming we are now in the "Data Analytics R version/INSEADAnalytics" directory, we can do these: + +```{r echo=TRUE, eval=FALSE, tidy=TRUE} +getwd() +setwd("Exercises/Exerciseset2/") +list.files() +``` + +**Note:** as always, you can use the `help` command in Rstudio to find out about any R function (e.g. type `help(list.files)` to learn what the R function `list.files` does). + +Let's now see the exercise. + +**IMPORTANT:** You should answer all questions by simply adding your code/answers in this document through editing the file ExerciseSet2.Rmd and then clicking on the "Knit HTML" button in RStudio. Once done, please post your .Rmd and html files in your github repository. + +
+ +### The Exercise: Introduction + +For this exercise we will use the Futures' daily returns to develop what is considered to be a *"classic" hedge fund trading strategy*, a **futures trend following strategy**. There is a lot written about this, so it is worth doing some online search about "futures trend following", or "Managed Futures", or "Commodity Trading Advisors (CTA)". There is about **[$300 billion](http://www.barclayhedge.com/research/indices/cta/Money_Under_Management.html)** invested on this strategy today, and is considered to be one of the **oldest hedge fund strategies**. Some example links are: + +* [A fascinating report on 2 centuries of trend following from the CFM hedge - a $6 billion fund](https://www.trendfollowing.com/whitepaper/Two_Centuries_Trend_Following.pdf) +* [Another fascinating report on 1 century of trend following investing from AQR - a $130 billion fund](https://www.aqr.com/library/aqr-publications/a-century-of-evidence-on-trend-following-investing) +* [Wikipedia on CTAs](https://en.wikipedia.org/wiki/Commodity_trading_advisor) +* [Morningstar on CTAs](http://www.morningstar.co.uk/uk/news/69379/commodity-trading-advisors-(cta)-explained.aspx) +* [A report](http://perspectives.pictet.com/wp-content/uploads/2011/01/Trading-Strategies-Final.pdf) +* [Man AHL (a leading hedge fund on CTAs - among others) - an $80 billion fund](https://www.ahl.com) + +Of course there are also many starting points for developing such a strategy (for example [this R bloggers one](http://www.r-bloggers.com/system-from-trend-following-factors/) (also on [github](https://gist.github.com/timelyportfolio/2855303)), or the [turtle traders website](http://turtletrader.com) which has many resources. + +In this exercise we will develop our own strategy from scratch. + +*Note (given today's market conditions):* **Prices of commodities, like oil or gold, can be excellent indicators of the health of the economy and of various industries, as we will also see below**. + +### Getting the Futures Data + +There are many ways to get futures data. For example, one can use the [Quandl package,](https://www.quandl.com/browse) or the [turtle traders resources,](http://turtletrader.com/hpd/) or (for INSEAD only) get data from the [INSEAD library finance data resources](http://sites.insead.edu/library/E_resources/ER_subject.cfm#Stockmarket) website. One has to pay attention on how to create continuous time series from underlying contracts with varying deliveries (e.g. see [here](https://www.quantstart.com/articles/Continuous-Futures-Contracts-for-Backtesting-Purposes) ). Using a combination of the resources above, we will use data for a number of commodities. + + +### Data description + +Let's load the data and see what we have. + +```{r echo=TRUE, eval=TRUE, comment=NA, warning=FALSE,error=FALSE, message=FALSE, prompt=FALSE, tidy=TRUE} +source("helpersSet2.R") +library(googleVis) +load("data/FuturesTrendFollowingData.Rdata") +``` + +
We have daily returns from `r head(rownames(futures_data),1)` to `r tail(rownames(futures_data),1)` for the following `r ncol(futures_data)` futures:
+ +```{r echo=TRUE, eval=TRUE, comment=NA, warning=FALSE,error=FALSE, message=FALSE, prompt=FALSE, tidy=TRUE, results='asis'} +show_data = data.frame(colnames(futures_data)) +m1<-gvisTable(show_data,options=list(showRowNumber=TRUE,width=1920, height=min(400,27*(nrow(show_data)+1)),allowHTML=TRUE,page='disable')) +print(m1,'chart') +``` +
### Basic data analysis

Let's see how these are correlated. Let's also make it look nicer (than, say, what we did in Exercise Set 1), using [Google Charts](https://code.google.com/p/google-motion-charts-with-r/wiki/GadgetExamples) (see examples online, e.g. these [examples](https://cran.r-project.org/web/packages/googleVis/vignettes/googleVis_examples.html) and the [R package used](https://cran.r-project.org/web/packages/googleVis/googleVis.pdf)). The correlation matrix is as follows (note that the table is "dynamic": for example, you can sort it based on each column by clicking on the column's header):
+ + +```{r echo=FALSE, comment=NA, warning=FALSE, message=FALSE, results='asis'} +show_data = data.frame(cbind(colnames(futures_data), round(cor(futures_data),2))) +m1<-gvisTable(show_data,options=list(width=1920, height=min(400,27*(nrow(show_data)+1)),allowHTML=TRUE)) +print(m1,'chart') +``` + +
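To make the questions below easier to answer, here is a small sketch (assuming the same `futures_data` object) that lists the most negative pairwise correlations; the -0.2 threshold is arbitrary and only for illustration.

```{r echo=TRUE, eval=FALSE, tidy=TRUE}
cmat <- cor(futures_data)
cmat[upper.tri(cmat, diag = TRUE)] <- NA          # keep each pair only once
neg_pairs <- which(cmat < -0.2, arr.ind = TRUE)   # indices of strongly negative pairs
data.frame(future_1    = rownames(cmat)[neg_pairs[, 1]],
           future_2    = colnames(cmat)[neg_pairs[, 2]],
           correlation = round(cmat[neg_pairs], 2))
```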
+ +We see quite high correlations among some of the futures. Does it make sense? Why? Do you see some negative correlations? Do those make sense? + +Given such high correlations, we can try to see whether there are some "principal components" (see [reading on dimensionality reduction](http://inseaddataanalytics.github.io/INSEADAnalytics/Report_s23.html)). This analysis can also indicate whether all futures (the global economy!) are driven by some common "factors" (let's call them **"risk factors"**). + +
+ +```{r echo=TRUE, eval=TRUE, comment=NA, warning=FALSE,error=FALSE, message=FALSE, prompt=FALSE, tidy=TRUE} +Variance_Explained_Table_results<-PCA(futures_data, graph=FALSE) +Variance_Explained_Table<-cbind(paste("component",1:ncol(futures_data),sep=" "),Variance_Explained_Table_results$eig) +Variance_Explained_Table<-as.data.frame(Variance_Explained_Table) +colnames(Variance_Explained_Table)<-c("Component","Eigenvalue", "Percentage_of_explained_variance", "Cumulative_percentage_of_explained_variance") +``` + +```{r echo=FALSE, comment=NA, warning=FALSE, message=FALSE, results='asis'} +show_data = data.frame(Variance_Explained_Table) +m1<-gvisTable(show_data,options=list(width=1920, height=min(400,27*(nrow(show_data)+1)),allowHTML=TRUE,page='disable'),formats=list(Eigenvalue="#.##",Percentage_of_explained_variance="#.##",Cumulative_percentage_of_explained_variance="#.##")) +print(m1,'chart') +``` +
+ +Here is the scree plot (see Sessions 3-4 readings): +
+ +```{r echo=TRUE, eval=TRUE, comment=NA, warning=FALSE,error=FALSE, message=FALSE, prompt=FALSE, tidy=TRUE} +eigenvalues <- Variance_Explained_Table[,2] +``` + +```{r Fig1, echo=FALSE, comment=NA, results='asis', message=FALSE, fig.align='center', fig=TRUE} +df <- cbind(as.data.frame(eigenvalues), c(1:length(eigenvalues)), rep(1, length(eigenvalues))) +colnames(df) <- c("eigenvalues", "components", "abline") +Line <- gvisLineChart(as.data.frame(df), xvar="components", yvar=c("eigenvalues","abline"), options=list(title='Scree plot', legend="right", width=900, height=600, hAxis="{title:'Number of Components', titleTextStyle:{color:'black'}}", vAxes="[{title:'Eigenvalues'}]", series="[{color:'green',pointSize:3, targetAxisIndex: 0}]")) +print(Line, 'chart') +``` + +
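The horizontal line in the scree plot is drawn at an eigenvalue of 1, which corresponds to the usual rule of thumb of keeping components with an eigenvalue above 1. A one-line sketch to count them, assuming the FactoMineR `PCA` object created above:

```{r echo=TRUE, eval=FALSE, tidy=TRUE}
# Number of components with eigenvalue greater than 1 (first column of $eig)
sum(Variance_Explained_Table_results$eig[, 1] > 1)
```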
Let's now see what the first 20 (**rotated**) principal components look like. We use the *rotated* factors (note that these are not really the "principal components", as explained in the [reading on dimensionality reduction](http://inseaddataanalytics.github.io/INSEADAnalytics/Report_s23.html)) and do not show any numbers less than 0.3 in absolute value, to avoid clutter. Note again that you can sort the table according to any column by clicking on the header of that column.
+ +```{r echo=TRUE, comment=NA, warning=FALSE, error=FALSE,message=FALSE,results='asis',tidy=TRUE} +corused = cor(futures_data[,apply(futures_data!=0,2,sum) > 10, drop=F]) +Rotated_Results<-principal(corused, nfactors=20, rotate="varimax",score=TRUE) +Rotated_Factors<-round(Rotated_Results$loadings,2) +Rotated_Factors<-as.data.frame(unclass(Rotated_Factors)) +colnames(Rotated_Factors)<-paste("Component",1:ncol(Rotated_Factors),sep=" ") + +sorted_rows <- sort(Rotated_Factors[,1], decreasing = TRUE, index.return = TRUE)$ix +Rotated_Factors <- Rotated_Factors[sorted_rows,] +Rotated_Factors[abs(Rotated_Factors) < 0.3]<-NA +``` + +```{r echo=FALSE, comment=NA, warning=FALSE, error=FALSE,message=FALSE,results='asis'} +show_data <- Rotated_Factors +show_data<-cbind(rownames(show_data),show_data) +colnames(show_data)<-c("Variables",colnames(Rotated_Factors)) +m1<-gvisTable(show_data,options=list(showRowNumber=TRUE,width=1220, height=min(400,27*(nrow(show_data)+1)),allowHTML=TRUE,page='disable')) +print(m1,'chart') +``` +
#### Questions:

1. How many principal components ("factors") do we need to explain at least 50% of the variance in this data?
2. What are the highest weights (in absolute value) of the first principal component portfolio above on the `r ncol(futures_data)` futures?
3. Can we interpret the first 10 components? What would you call these factors?
4. Can you now generate the principal components and scree plot using only: a) the pre-crisis bull market years (e.g. only using the data between November 1, 2002, and October 1, 2007); and b) the financial crisis years (e.g. only using the data between October 1, 2007 and March 1, 2009)? (Hint: you can select subsets of the data using for example the command `crisis_data = futures_data[as.Date(rownames(futures_data)) > "2007-10-01" & as.Date(rownames(futures_data)) < "2009-03-01", ]`)
5. Based on your analysis in question 4, please discuss any differences you observe about the futures returns during bull and bear markets. What implications may these results have? What do the results imply about how assets are correlated during bear years compared to bull years?
6. (Extra - optional) Can you create an interactive (shiny based) tool so that we can study how the "**risk factors**" change over time? (Hint: see [Exercise set 1](https://github.com/InseadDataAnalytics/INSEADAnalytics/blob/master/Exercises/Exerciseset1/ExerciseSet1.Rmd) and online resources on [Shiny](http://rmarkdown.rstudio.com/authoring_shiny.html) such as these [Shiny lessons](http://shiny.rstudio.com/tutorial/lesson1/). Note however that you may need to pay attention to various details e.g. about how to include Google Charts in Shiny tools - so keep this extra exercise for later!)
+ +**Your Answers here:** +
1. Six components are needed to explain at least 50% of the variance in this data.
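This can be read off the cumulative-variance column of the table above; here is a small sketch to compute it directly, assuming the `Variance_Explained_Table_results` object from the `PCA()` call earlier:

```{r echo=TRUE, eval=FALSE, tidy=TRUE}
# First component at which the cumulative percentage of explained variance reaches 50%
cumvar <- Variance_Explained_Table_results$eig[, 3]   # third column: cumulative % of variance
min(which(cumvar >= 50))
```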
2. The highest absolute loadings of the first (rotated) component are all on bond futures:
+10 yr T bonds: 0.93 +
+5 yr T bonds: 0.93 +
+Treasury bonds: 0.86 +
+2 yr T note US: 0.85 +
+Canadian Bond: 0.79 +
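These values can be pulled out of the rotated-loadings object directly; a sketch, assuming `Rotated_Results` from the `principal()` call above:

```{r echo=TRUE, eval=FALSE, tidy=TRUE}
# Five largest absolute loadings on the first rotated component
loadings_c1 <- Rotated_Results$loadings[, 1]
round(head(sort(abs(loadings_c1), decreasing = TRUE), 5), 2)
```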
3. Interpreting the first 10 rotated components by the futures that load most heavily on them (see the sketch after this list for one way to derive such labels):
Component 1: Bonds
+Component 2: Currencies +
Component 3: Stock market index futures (first group)
Component 4: Stock market index futures (second group)
+Component 5: Daily reference rate +
+Component 6: Oil +
+Component 7: Metals +
+Component 8: Agricultural commodities +
+Component 9: Precious metals +
+Component 10: Asian markets +
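As referenced above, here is a minimal sketch for deriving such labels, again assuming the `Rotated_Results` object from earlier: for each of the first 10 rotated components, list the futures with the largest absolute loadings and name the component after what they have in common.

```{r echo=TRUE, eval=FALSE, tidy=TRUE}
# For each of the first 10 rotated components, the three futures loading most heavily on it
L <- unclass(Rotated_Results$loadings)
apply(abs(L[, 1:10]), 2, function(x) names(sort(x, decreasing = TRUE))[1:3])
```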
+
+4. +```{r eval = TRUE, echo=TRUE, comment=NA, warning=FALSE, message=FALSE,results='asis'} +bull_data = futures_data[as.Date(rownames(futures_data)) > "2002-11-01" & as.Date(rownames(futures_data)) < "2007-10-01", ] +Variance_Explained_Table_results_bull<-PCA(bull_data, graph=FALSE) +Variance_Explained_Table_bull<-cbind(paste("component",1:ncol(futures_data),sep=" "),Variance_Explained_Table_results_bull$eig) +Variance_Explained_Table_bull<-as.data.frame(Variance_Explained_Table_bull) +colnames(Variance_Explained_Table_bull)<-c("Component","Eigenvalue", "Percentage_of_explained_variance", "Cumulative_percentage_of_explained_variance") +``` + +```{r echo=FALSE, comment=NA, warning=FALSE, message=FALSE, results='asis'} +show_data = data.frame(Variance_Explained_Table_bull) +m_bull<-gvisTable(show_data,options=list(width=1920, height=min(400,27*(nrow(show_data)+1)),allowHTML=TRUE,page='disable'),formats=list(Eigenvalue="#.##",Percentage_of_explained_variance="#.##",Cumulative_percentage_of_explained_variance="#.##")) +print(m_bull,'chart') +``` +
+Here is the Bull scree plot +```{r echo=TRUE, eval=TRUE, comment=NA, warning=FALSE,error=FALSE, message=FALSE, prompt=FALSE, tidy=TRUE} +eigenvalues_bull <- Variance_Explained_Table_bull[,2] +``` + +```{r FigBull, echo=FALSE, comment=NA, results='asis', message=FALSE, fig.align='center', fig=TRUE} +df_bull <- cbind(as.data.frame(eigenvalues_bull), c(1:length(eigenvalues_bull)), rep(1, length(eigenvalues_bull))) +colnames(df_bull) <- c("eigenvalues", "components", "abline") +Line_bull <- gvisLineChart(as.data.frame(df_bull), xvar="components", yvar=c("eigenvalues","abline"), options=list(title='Bull Scree plot', legend="right", width=900, height=600, hAxis="{title:'Number of Components', titleTextStyle:{color:'black'}}", vAxes="[{title:'Eigenvalues'}]", series="[{color:'green',pointSize:3, targetAxisIndex: 0}]")) +print(Line_bull, 'chart') +``` + + + + +```{r eval = TRUE, echo=TRUE, comment=NA, warning=FALSE, message=FALSE,results='asis'} +crisis_data = futures_data[as.Date(rownames(futures_data)) > "2007-10-01" & as.Date(rownames(futures_data)) < "2009-03-01", ] +Variance_Explained_Table_results_bear<-PCA(crisis_data, graph=FALSE) +Variance_Explained_Table_bear<-cbind(paste("component",1:ncol(futures_data),sep=" "),Variance_Explained_Table_results_bear$eig) +Variance_Explained_Table_bear<-as.data.frame(Variance_Explained_Table_bear) +colnames(Variance_Explained_Table_bear)<-c("Component","Eigenvalue", "Percentage_of_explained_variance", "Cumulative_percentage_of_explained_variance") +``` + +```{r echo=FALSE, comment=NA, warning=FALSE, message=FALSE, results='asis'} +show_data = data.frame(Variance_Explained_Table_bear) +m_bear<-gvisTable(show_data,options=list(width=1920, height=min(400,27*(nrow(show_data)+1)),allowHTML=TRUE,page='disable'),formats=list(Eigenvalue="#.##",Percentage_of_explained_variance="#.##",Cumulative_percentage_of_explained_variance="#.##")) +print(m_bear,'chart') +``` +
+Here is the Bear scree plot +```{r echo=TRUE, eval=TRUE, comment=NA, warning=FALSE,error=FALSE, message=FALSE, prompt=FALSE, tidy=TRUE} +eigenvalues_bear <- Variance_Explained_Table_bear[,2] +``` + +```{r FigCrisis, echo=FALSE, comment=NA, results='asis', message=FALSE, fig.align='center', fig=TRUE} +df_bear <- cbind(as.data.frame(eigenvalues_bear), c(1:length(eigenvalues_bear)), rep(1, length(eigenvalues_bear))) +colnames(df_bear) <- c("eigenvalues", "components", "abline") +Line_bear <- gvisLineChart(as.data.frame(df_bear), xvar="components", yvar=c("eigenvalues","abline"), options=list(title='Bear Scree plot', legend="right", width=900, height=600, hAxis="{title:'Number of Components', titleTextStyle:{color:'black'}}", vAxes="[{title:'Eigenvalues'}]", series="[{color:'green',pointSize:3, targetAxisIndex: 0}]")) +print(Line_bear, 'chart') +``` +
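To support the comparison asked for in question 5, here is a hedged sketch that contrasts the average pairwise correlation of the futures in the two periods (assuming the `bull_data` and `crisis_data` subsets defined above):

```{r echo=TRUE, eval=FALSE, tidy=TRUE}
# Average off-diagonal correlation in each sub-period
mean_offdiag <- function(x) { cm <- cor(x); mean(cm[lower.tri(cm)]) }
c(bull = mean_offdiag(bull_data), bear = mean_offdiag(crisis_data))
```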
5. The components have changed in both their composition and their assigned weights between the two periods. This implies that assets are correlated differently during bear years than during bull years.

### A Simple Futures Trend Following Strategy

We can now develop a simple futures trend following trading strategy, as outlined in the papers in the Exercise Introduction above. There are about $300 billion invested in such strategies! Of course we cannot develop here a sophisticated product, but with some more work...

We will do the following:

1. Calculate a number of moving averages with different "window lengths" for each of the `r ncol(futures_data)` futures - there are [many](http://www.r-bloggers.com/stock-analysis-using-r/) so called [technical indicators](http://www.investopedia.com/active-trading/technical-indicators/) one can use. We will use the "moving average" function `ma` for this (try for example `ma(1:10,2)` to see what it returns).
2. Add the signs of these moving averages (as if each "votes" +1 for long or -1 for short; one can also use the actual moving average values of course - try it!), and then scale the resulting sums so that the sum of their absolute values across all futures is 1 on each day (hence we invest $1 every day - you see why?).
3. Then invest every day in each of the `r ncol(futures_data)` futures an amount defined by the weights calculated in step 2, using however the weights calculated with data until 2 days ago (why 2 days and not 1 day?) - see the use of the helper function `shift` for this.
4. Finally see the performance of this strategy.

Here is the code.
+ +```{r echo=TRUE, eval=TRUE, comment=NA, warning=FALSE,error=FALSE, message=FALSE, prompt=FALSE, tidy=TRUE} +signal_used = 0*futures_data # just initialize the trading signal to be 0 +# Take many moving Average (MA) Signals and let them "vote" with their sign (+-1, e.g. long or short vote, for each signal) +MAfreq<-seq(10,250,by=20) +for (iter in 1:length(MAfreq)) + signal_used = signal_used + sign(apply(futures_data,2, function(r) ma(r,MAfreq[iter]))) +# Now make sure we invest $1 every day (so the sum of the absolute values of the weights is 1 every day) +signal_used = t(apply(signal_used,1,function(r) { + res = r + if ( sum(abs(r)) !=0 ) + res = r/sum(abs(r)) + res +})) +colnames(signal_used) <- colnames(futures_data) +# Now create the returns of the strategy for each futures time series +strategy_by_future <- scrub(shift(signal_used,2)*futures_data) # use the signal from 2 days ago +# finally, this is our futures trend following strategy +trading_strategy = apply(strategy_by_future,1,sum) +names(trading_strategy) <- rownames(futures_data) +``` + + +### Reporting the performance results + +Let's see how this strategy does: +
+
+ +```{r echo=FALSE, comment=NA, warning=FALSE, message=FALSE,results='asis',fig.align='center', fig.height=5,fig.width= 8, fig=TRUE} +pnl_plot(trading_strategy) +``` + +
+
+ +Here is how this strategy has performed during this period. +
+
+ +```{r echo=FALSE, comment=NA, warning=FALSE, message=FALSE, results='asis'} +show_data = data.frame(cbind(rownames(pnl_matrix(trading_strategy)), round(pnl_matrix(trading_strategy),2))) +m1<-gvisTable(show_data,options=list(width=1220, height=min(400,27*(nrow(show_data)+1)),allowHTML=TRUE)) +print(m1,'chart') +``` + +
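Beyond the cumulative P&L plot and the table above, a few summary statistics can be useful. A sketch, assuming the `trading_strategy` vector of daily returns created above and roughly 252 trading days per year:

```{r echo=TRUE, eval=FALSE, tidy=TRUE}
# Simple annualized performance statistics for the daily strategy returns
ann_return <- 252 * mean(trading_strategy)
ann_vol    <- sqrt(252) * sd(trading_strategy)
round(c(annualized_return = ann_return,
        annualized_volatility = ann_vol,
        sharpe_ratio = ann_return / ann_vol), 3)
```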
+
+ +How does this compare with **existing CTA products** such as [this one from Societe Generale?](https://cib.societegenerale.com/fileadmin/indices_feeds/SG_CTA_Monthly_Report.pdf) (Note: one can easily achieve a correlation of more than 0.8 with this specific product - as well as with many other ones) + +![Compare our strategy with this product](societegenerale.png) + +
#### Questions

1. Can you describe in more detail what the code above does?
2. What happens if you use different moving average technical indicators in the code above? Please explore and report below the returns of a trading strategy you build. (Hint: check for example what the command line `MAfreq<-seq(10,250,by=20)` above does - but not only that of course, the possibilities are endless.)
+ +**Your Answers here:** +
1. The code computes, for each of the futures, moving averages of the daily returns with window lengths of 10, 30, 50, ..., 250 days (`MAfreq<-seq(10,250,by=20)`). The sign of each moving average acts as a +1/-1 (long/short) "vote"; these votes are summed per future and per day into a signal. The signals are then rescaled so that their absolute values sum to 1 across futures on each day (so we always invest $1), and applied with a two-day lag (`shift(signal_used,2)`) to the futures returns. Summing the resulting per-future returns gives the daily return of the overall trend following strategy.
2. Changing the set of moving-average windows changes both the returns and the turnover of the strategy. For example, using a step of 2 days instead of 20 days in `MAfreq<-seq(10,250,by=...)` (so that many more moving averages get to "vote") gave slightly higher turnover, while the returns were broadly similar, as shown below.
+```{r echo=TRUE, eval=TRUE, comment=NA, warning=FALSE,error=FALSE, message=FALSE, prompt=FALSE, tidy=TRUE} +signal_used = 0*futures_data # just initialize the trading signal to be 0 +# Take many moving Average (MA) Signals and let them "vote" with their sign (+-1, e.g. long or short vote, for each signal) +MAfreq<-seq(10,250,by=2) +for (iter in 1:length(MAfreq)) + signal_used = signal_used + sign(apply(futures_data,2, function(r) ma(r,MAfreq[iter]))) +# Now make sure we invest $1 every day (so the sum of the absolute values of the weights is 1 every day) +signal_used = t(apply(signal_used,1,function(r) { + res = r + if ( sum(abs(r)) !=0 ) + res = r/sum(abs(r)) + res +})) +colnames(signal_used) <- colnames(futures_data) +# Now create the returns of the strategy for each futures time series +strategy_by_future <- scrub(shift(signal_used,2)*futures_data) # use the signal from 2 days ago +# finally, this is our futures trend following strategy +trading_strategy = apply(strategy_by_future,1,sum) +names(trading_strategy) <- rownames(futures_data) +``` +
+```{r echo=FALSE, comment=NA, warning=FALSE, message=FALSE,results='asis',fig.align='center', fig.height=5,fig.width= 8, fig=TRUE} +pnl_plot(trading_strategy) +``` +
+```{r echo=FALSE, comment=NA, warning=FALSE, message=FALSE, results='asis'} +show_data = data.frame(cbind(rownames(pnl_matrix(trading_strategy)), round(pnl_matrix(trading_strategy),2))) +m1<-gvisTable(show_data,options=list(width=1220, height=min(400,27*(nrow(show_data)+1)),allowHTML=TRUE)) +print(m1,'chart') +``` +
+
+ +### A class competition + +Now you have seen how to develop some trading strategies that hedge funds have been using for centuries. Clearly this is only the very first step - as many of the online resources on technical indicators also suggest. Can you now explore more such strategies? How good a **futures trend following hedge fund strategy** can you develop? Let's call this.... a **class competition**! Explore as much as you can and report your best strategy as we move along the course... + +Here is for example something that can be achieved relatively easily... +
+ +```{r echo=FALSE, comment=NA, warning=FALSE, message=FALSE,results='asis',fig.align='center', fig.height=5,fig.width= 8, fig=TRUE} +load("data/sample_strategy.Rdata") +pnl_plot(sample_strategy) +``` + +
+ +Here is how this strategy has performed during this period. +
+
+ +```{r echo=FALSE, comment=NA, warning=FALSE, message=FALSE, results='asis'} +show_data = data.frame(cbind(rownames(pnl_matrix(sample_strategy)), round(pnl_matrix(sample_strategy),2))) +m1<-gvisTable(show_data,options=list(width=1220, height=min(400,27*(nrow(show_data)+1)),allowHTML=TRUE)) +print(m1,'chart') +``` + +
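As one possible starting point for the competition, here is a hedged sketch of a variant that trades a future only when a fast and a slow moving average agree in sign - a crossover-style filter. It assumes the helper functions `ma`, `shift`, `scrub` and `pnl_plot` from `helpersSet2.R` are loaded, exactly as in the strategy code above; the object names (`signal_variant`, `variant_strategy`) and the window lengths are just for this illustration, and this is not the strategy behind the plot shown here.

```{r echo=TRUE, eval=FALSE, tidy=TRUE}
# Fast/slow moving-average agreement filter: trade only when both agree on direction
fast <- apply(futures_data, 2, function(r) sign(ma(r, 20)))
slow <- apply(futures_data, 2, function(r) sign(ma(r, 100)))
signal_variant <- ifelse(fast == slow, fast, 0)
# Scale so that the absolute weights sum to 1 each day (invest $1 per day), as before
signal_variant <- t(apply(signal_variant, 1, function(r) {
    if (sum(abs(r)) != 0) r/sum(abs(r)) else r
}))
colnames(signal_variant) <- colnames(futures_data)
variant_by_future <- scrub(shift(signal_variant, 2) * futures_data)  # 2-day lag, as before
variant_strategy <- apply(variant_by_future, 1, sum)
names(variant_strategy) <- rownames(futures_data)
# pnl_plot(variant_strategy)  # compare with trading_strategy above
```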
+
+ +**Finally**: One can develop (shiny based) interactive versions of this report and deploy them using `shinyapps::deployApp('ExerciseSet2.Rmd')` (you need a [shinyapps.io](https://www.shinyapps.io) account for this). This is for example an [interactive version of this exercise.](https://inseaddataanalytics.shinyapps.io/ExerciseSet2/) + +
+
As always, **have fun**!

diff --git a/Exercises/Exerciseset2/ExerciseSet2_NZF.html b/Exercises/Exerciseset2/ExerciseSet2_NZF.html
new file mode 100644
index 00000000..73fbf110
--- /dev/null
+++ b/Exercises/Exerciseset2/ExerciseSet2_NZF.html
@@ -0,0 +1,10272 @@
+Exercise Set 2: A $300 Billion Strategy
diff --git a/Exercises/Exerciseset2/rsconnect/documents/ExerciseSet2_NZF.Rmd/rpubs.com/rpubs/Publish Document.dcf b/Exercises/Exerciseset2/rsconnect/documents/ExerciseSet2_NZF.Rmd/rpubs.com/rpubs/Publish Document.dcf
new file mode 100644
index 00000000..7dea03ed
--- /dev/null
+++ b/Exercises/Exerciseset2/rsconnect/documents/ExerciseSet2_NZF.Rmd/rpubs.com/rpubs/Publish Document.dcf
@@ -0,0 +1,7 @@
+name: Publish Document
+account: rpubs
+server: rpubs.com
+appId: https://api.rpubs.com/api/v1/document/240864/b366b759482d4d579be7fb8553e312c9
+bundleId: https://api.rpubs.com/api/v1/document/240864/b366b759482d4d579be7fb8553e312c9
+url: http://rpubs.com/publish/claim/240864/c35620250224492c950ba82805d8e4ca
+when: 1484073105.23475
diff --git a/updatefork.R b/updatefork.R
new file mode 100644
index 00000000..f9be0b8b
--- /dev/null
+++ b/updatefork.R
@@ -0,0 +1,25 @@
+# Sync a forked copy of INSEADAnalytics with the master project. For details,
+# see https://github.com/InseadDataAnalytics/INSEADAnalytics/issues/7
+
+if (!require(git2r)) {
+  install.packages("git2r", repos="http://cran.r-project.org/", quiet=TRUE)
+  library(git2r)
+}
+
+repo <- repository(".")
+if (!("upstream" %in% remotes(repo))) {
+  remote_add(repo, "upstream", "https://github.com/InseadDataAnalytics/INSEADAnalytics")
+}
+
+fetch(repo, "upstream")
+checkout(repo, "master")
+merge(repo, "upstream/master")
+
+message("
+  Your local copy of INSEADAnalytics is now in sync with the master project.
+  You can update your remote copy by clicking 'Push' in the 'Git' panel or
+  by executing the following:
+
+  push(repo, credentials=cred_user_pass(readline('Github username: '), readline('Github password: ')))
+")
+# push(repo, credentials=cred_user_pass(readline("Github username: "), readline("Github password: ")))