diff --git a/CourseSessions/Sessions23/CopyOfSession2inclass_Ana.Rmd b/CourseSessions/Sessions23/CopyOfSession2inclass_Ana.Rmd
new file mode 100644
index 00000000..e2446404
--- /dev/null
+++ b/CourseSessions/Sessions23/CopyOfSession2inclass_Ana.Rmd
@@ -0,0 +1,439 @@
+---
+title: "Sessions 3-4"
+author: "T. Evgeniou"
+output: html_document
+---
+
+
+
+The purpose of this session is to become familiar with:
+
+1. Some visualization tools;
+2. Principal Component Analysis and Factor Analysis;
+3. Clustering Methods;
+4. Introduction to machine learning methods;
+5. A market segmentation case study.
+
+As always, before starting, make sure you have pulled the [session 3-4 files](https://github.com/InseadDataAnalytics/INSEADAnalytics/tree/master/CourseSessions/Sessions23) (yes, I know, it says session 2, but it is 3-4 - we need to update the filenames at some point, but until then let's use common sense and ignore them) on your github repository (if you pull the course github repository you also get the session files automatically). Moreover, make sure you are in the directory of this exercise. Directory paths may be complicated, and sometimes a frustrating source of problems, so it is recommended that you use these R commands to find out your current working directory and, if needed, set it where you have the main files for the specific exercise/project (there are other ways, but for now just be aware of this path issue). For example, assuming we are now in the "MYDIRECTORY/INSEADAnalytics" directory, we can run the following:
+
+```{r echo=TRUE, eval=FALSE, tidy=TRUE}
+getwd()
+setwd("CourseSessions/Sessions23")
+list.files()
+rm(list=ls()) # Clean up the memory, if we want to rerun from scratch
+```
+As always, you can use the `help` command in Rstudio to find out about any R function (e.g. type `help(list.files)` to learn what the R function `list.files` does).
+
+Let's start.
+
+
+We will be using the boats case study as an example. At the end of this class we will be able to develop (from scratch) the analyses in the readings of sessions 3-4, as well as understand the tools used and how to interpret the results in practice - in order to make business decisions. The code used here is along the lines of the code in the session directory, e.g. in the RunStudy.R file and the report doc/Report_s23.Rmd. There may be a few differences, as there are many ways to write code to do the same thing.
+Let’s load the data:
+```{r echo=TRUE, tidy=TRUE}
+ProjectData <- read.csv("data/Boats.csv", sep = ";", dec = ",") # this contains only the matrix ProjectData
+ProjectData = data.matrix(ProjectData)
+colnames(ProjectData) <- gsub("\\.", " ", colnames(ProjectData))
+ProjectDataFactor = ProjectData[, c(2:30)]
+```
+
+and do some basic visual exploration of the first 50 respondents (it is always necessary to look at the data first):
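+
+(A minimal sketch of such an exploration, assuming the data have been loaded as above; `summary` and a peek at the raw responses are just one of many ways to look at the data:)
+
+```{r echo=TRUE, eval=FALSE, tidy=TRUE}
+# Summary statistics per question for the first 50 respondents
+summary(ProjectData[1:50, ])
+# and a quick look at the first few raw responses themselves
+head(ProjectData[1:50, ], 10)
+```
+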
+This is the correlation matrix of the customer responses to the 29 attitude questions - which are the only questions that we will use for the segmentation (see the case):
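+
+(A sketch of how this correlation matrix can be computed; the rounding is just for readability:)
+
+```{r echo=TRUE, eval=FALSE, tidy=TRUE}
+# Correlation matrix of the 29 attitude questions (columns 2:30 of the data)
+round(cor(ProjectDataFactor), 2)
+```
+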
+1. I see high correlations between questions that ask about status / power / self-expression.
+2. The correlations mean that the people interviewed will probably answer similarly to questions related to these themes.
+Clearly the survey asked many redundant questions (can you think of some reasons why?), so we may be able to actually "group" these 29 attitude questions into only a few "key factors". This will not only simplify the data, but will also greatly facilitate our understanding of the customers.
+To do so, we use methods called Principal Component Analysis and Factor Analysis as discussed in the session readings. We can use two different R commands for this (they make slightly different information easily available as output): the command `principal` (check `help(principal)` from R package `psych`), and the command `PCA` from R package `FactoMineR` - there are more packages and commands for this, as these methods are very widely used.
+Here is how the `principal` function is used:
+```{r echo=TRUE, tidy=TRUE}
+library(psych)  # provides the principal() function
+UnRotated_Results <- principal(ProjectDataFactor, nfactors = ncol(ProjectDataFactor),
+    rotate = "none", score = TRUE)
+UnRotated_Factors <- round(UnRotated_Results$loadings, 2)
+UnRotated_Factors <- as.data.frame(unclass(UnRotated_Factors))
+colnames(UnRotated_Factors) <- paste("Component", 1:ncol(UnRotated_Factors),
+    sep = " ")
+```
+
+Here is how the `PCA` function is used:
+```{r echo=TRUE, tidy=TRUE}
+library(FactoMineR)  # provides the PCA() function
+Variance_Explained_Table_results <- PCA(ProjectDataFactor, graph = FALSE)
+Variance_Explained_Table <- Variance_Explained_Table_results$eig
+Variance_Explained_Table_copy <- Variance_Explained_Table
+row = 1:nrow(Variance_Explained_Table)
+name <- paste("Component No:", row, sep = "")
+Variance_Explained_Table <- cbind(name, Variance_Explained_Table)
+Variance_Explained_Table <- as.data.frame(Variance_Explained_Table)
+colnames(Variance_Explained_Table) <- c("Components", "Eigenvalue", "Percentage_of_explained_variance",
+    "Cumulative_percentage_of_explained_variance")
+
+eigenvalues <- Variance_Explained_Table[, 2]
+```
+
+Let's look at the variance explained as well as the eigenvalues (see session readings):
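+
+(A minimal sketch of how these can be displayed; showing the table and a base-R scree plot is just one possible choice:)
+
+```{r echo=TRUE, eval=FALSE, tidy=TRUE}
+# The variance explained table computed above
+Variance_Explained_Table
+# A simple scree plot of the eigenvalues (using the numeric copy of the table)
+plot(as.numeric(Variance_Explained_Table_copy[, 1]), type = "b",
+    xlab = "Component", ylab = "Eigenvalue")
+abline(h = 1, lty = 2)  # the eigenvalue = 1 rule of thumb
+```
+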
+Your Answers here:
+
+1. The table and the plot show the factors that explain most of the behaviour of the customers. Each factor is a combination of the survey questions. From the table and the plot we can see how many factors are needed to explain a certain percentage of the behaviour (for example, to explain at least 50% we need 5 factors).
+2. The plot has this shape because the first factor is always the one that explains the largest part of the behaviour.
+3. The smaller the number of factors, the easier the analysis and interpretation of the customers.
+Let's now see what the "top factors" look like.
+```{r echo=TRUE, tidy=TRUE}
+# Choose one of these options:
+factors_selected = sum(Variance_Explained_Table_copy[, 1] >= 1)
+# minimum_variance_explained = 0.5
+# factors_selected = 1:head(which(Variance_Explained_Table_copy[, "cumulative percentage of variance"] >= minimum_variance_explained), 1)
+# factors_selected = 10
+```
+
+To better visualise them, we will use what is called a "rotation". There are many rotation methods; we use what is called the varimax rotation:
+```{r echo=TRUE, tidy=TRUE}
+# Please ENTER the rotation eventually used (e.g. 'none', 'varimax',
+# 'quartimax', 'promax', 'oblimin', 'simplimax', or 'cluster' - see
+# help(principal)). Default is 'varimax'
+rotation_used = "varimax"
+Rotated_Results <- principal(ProjectDataFactor, nfactors = max(factors_selected),
+    rotate = rotation_used, score = TRUE)
+Rotated_Factors <- round(Rotated_Results$loadings, 2)
+Rotated_Factors <- as.data.frame(unclass(Rotated_Factors))
+colnames(Rotated_Factors) <- paste("Component", 1:ncol(Rotated_Factors), sep = " ")
+sorted_rows <- sort(Rotated_Factors[, 1], decreasing = TRUE, index.return = TRUE)$ix
+Rotated_Factors <- Rotated_Factors[sorted_rows, ]
+```
+
+To better visualize and interpret the factors we often "suppress" loadings with small values, e.g. with absolute values smaller than 0.5. In this case our factors look as follows after suppressing the small numbers:
+```{r echo=TRUE, tidy=TRUE}
+MIN_VALUE = 0.5
+Rotated_Factors_thres <- Rotated_Factors
+Rotated_Factors_thres[abs(Rotated_Factors_thres) < MIN_VALUE] <- NA
+colnames(Rotated_Factors_thres) <- colnames(Rotated_Factors)
+rownames(Rotated_Factors_thres) <- rownames(Rotated_Factors)
+```
+
+Your Answers here:
+
+1. The first factors are the ones that explain most of the customers' behaviour (for example, the first factor seems more related to status, lifestyle and power, and the second one to feeling adventurous and socializing). They make sense because we can use them to segment and target the market.
+2. As explained before, based on the eigenvalues and the cumulative explained variance, we should choose 5 factors, because they explain at least 50% of the behaviour. Besides, they don't overlap with each other, which means that with each factor we will be able to tackle one aspect of the customers' preferences.
+3. The first factor seems more related to status, lifestyle and power, and the second one to feeling adventurous and socializing, etc.
+Let's now use one representative question for each factor (we can also use the "factor scores" for each respondent - see session readings) to represent our survey respondents. We can choose the question with the highest absolute factor loading for each factor. For example, when we use 5 factors with the varimax rotation we can select questions Q1.9 (I see my boat as a status symbol), Q1.18 (Boating gives me a feeling of adventure), Q1.4 (I only consider buying a boat from a reputable brand), Q1.11 (I tend to perform minor boat repairs and maintenance on my own) and Q1.2 (When buying a boat getting the lowest price is more important than the boat brand) - try it. These are columns 10, 19, 5, 12, and 3, respectively, of the data matrix ProjectData.
In market segmentation one may use variables to profile the segments which are not the same (necessarily) as those used to segment the market: the latter may be, for example, attitude/needs related (you define segments based on what the customers “need”), while the former may be any information that allows a company to identify the defined customer segments (e.g. demographics, location, etc). Of course deciding which variables to use for segmentation and which to use for profiling (and then activation of the segmentation for business purposes) is largely subjective. So in this case we will use all survey questions for profiling for now:
+```{r echo=TRUE, tidy=TRUE}
+segmentation_attributes_used = c(10, 19, 5, 12, 3)
+profile_attributes_used = 2:ncol(ProjectData)
+ProjectData_segment = ProjectData[, segmentation_attributes_used]
+ProjectData_profile = ProjectData[, profile_attributes_used]
+```
+
+A key family of methods used for segmentation is what is called clustering methods. This is a very important problem in statistics and machine learning, used in all sorts of applications such as in Amazon's pioneering work on recommender systems. There are many mathematical methods for clustering. We will use two very standard methods, hierarchical clustering and k-means. While the "math" behind all these methods can be complex, the R functions used are relatively simple to use, as we will see.
+For example, to use hierarchical clustering we first define some parameters (see session readings) and then simply call the command `hclust`:
+```{r echo=TRUE, tidy=TRUE}
+# Please ENTER the distance metric eventually used for the clustering in
+# case of hierarchical clustering (e.g. 'euclidean', 'maximum', 'manhattan',
+# 'canberra', 'binary' or 'minkowski' - see help(dist)). DEFAULT is
+# 'euclidean'
+distance_used = "euclidean"
+# Please ENTER the hierarchical clustering method to use (options are:
+# 'ward.D', 'single', 'complete', 'average', 'mcquitty', 'median' or
+# 'centroid' - see help(hclust)). DEFAULT is 'ward.D'
+hclust_method = "ward.D"
+# Define the number of clusters:
+numb_clusters_used = 3
+Hierarchical_Cluster_distances <- dist(ProjectData_segment, method = distance_used)
+Hierarchical_Cluster <- hclust(Hierarchical_Cluster_distances, method = hclust_method)
+
+# Assign observations (e.g. people) to their clusters
+cluster_memberships_hclust <- as.vector(cutree(Hierarchical_Cluster, k = numb_clusters_used))
+cluster_ids_hclust = unique(cluster_memberships_hclust)
+ProjectData_with_hclust_membership <- cbind(1:length(cluster_memberships_hclust),
+    cluster_memberships_hclust)
+colnames(ProjectData_with_hclust_membership) <- c("Observation Number", "Cluster_Membership")
+```
+
+Finally, we can see the dendrogram (see class readings and online resources for more information) to have a first rough idea of what segments (clusters) we may have - and how many.
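+
+(A minimal sketch of how the dendrogram can be plotted with base R; the `rect.hclust` call just outlines the chosen number of clusters:)
+
+```{r echo=TRUE, eval=FALSE, tidy=TRUE}
+plot(Hierarchical_Cluster, labels = FALSE, main = "Dendrogram", xlab = "", ylab = "Distance")
+rect.hclust(Hierarchical_Cluster, k = numb_clusters_used, border = "red")  # outline the chosen clusters
+```
+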
+We can also plot the "distances" traveled before we need to merge any of the lower, smaller clusters into larger ones - the heights of the tree branches that link the clusters as we traverse the tree from its leaves to its root. If we have n observations, this plot has n-1 numbers.
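+
+(A sketch; only the last 20 merge heights are shown here simply to keep the plot readable:)
+
+```{r echo=TRUE, eval=FALSE, tidy=TRUE}
+heights <- rev(Hierarchical_Cluster$height)  # the n-1 merge heights, largest (root) first
+plot(heights[1:20], type = "b", xlab = "Merges from the root of the tree", ylab = "Distance travelled")
+```
+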
+To use k-means, on the other hand, one needs to define a priori the number of segments (which of course one can change and re-cluster). K-means also requires the choice of a few more parameters, but this is beyond our scope for now. Here is how to run k-means:
+```{r echo=TRUE, tidy=TRUE}
+# Please ENTER the kmeans clustering method to use (options are:
+# 'Hartigan-Wong', 'Lloyd', 'Forgy', 'MacQueen'). DEFAULT is 'Lloyd'
+kmeans_method = "Lloyd"
+# Define the number of clusters:
+numb_clusters_used = 3
+kmeans_clusters <- kmeans(ProjectData_segment, centers = numb_clusters_used,
+    iter.max = 2000, algorithm = kmeans_method)
+ProjectData_with_kmeans_membership <- cbind(1:length(kmeans_clusters$cluster),
+    kmeans_clusters$cluster)
+colnames(ProjectData_with_kmeans_membership) <- c("Observation Number", "Cluster_Membership")
+
+# Assign observations (e.g. people) to their clusters
+cluster_memberships_kmeans <- kmeans_clusters$cluster
+cluster_ids_kmeans <- unique(cluster_memberships_kmeans)
+```
+
+K-means does not provide much information about segmentation. However, when we profile the segments we can start getting a better (business) understanding of what is happening. Profiling is a central part of segmentation: this is where we really get to mix technical and business creativity.
+There are many ways to do the profiling of the segments. For example, here we show how the average answers of the respondents in each segment compare to the average answer of all respondents using the ratio of the two. The idea is that if in a segment the average response to a question is very different (e.g. away from ratio of 1) than the overall average, then that question may indicate something about the segment relative to the total population.
+Here are for example the profiles of the segments using the clusters found above:
+
+First let's see just the average answer people gave to each question for the different segments as well as the total population:
+```{r echo=TRUE, tidy=TRUE}
+# Select whether to use the hierarchical clustering or the k-means clusters:
+cluster_memberships <- cluster_memberships_hclust
+cluster_ids <- cluster_ids_hclust
+# Here is the k-means alternative: uncomment these 2 lines to use it
+# cluster_memberships <- cluster_memberships_kmeans
+# cluster_ids <- cluster_ids_kmeans
+
+population_average = matrix(apply(ProjectData_profile, 2, mean), ncol = 1)
+colnames(population_average) <- "Population"
+Cluster_Profile_mean <- sapply(sort(cluster_ids), function(i) apply(ProjectData_profile[(cluster_memberships ==
+    i), ], 2, mean))
+if (ncol(ProjectData_profile) < 2) Cluster_Profile_mean = t(Cluster_Profile_mean)
+colnames(Cluster_Profile_mean) <- paste("Segment", 1:length(cluster_ids), sep = " ")
+cluster.profile <- cbind(population_average, Cluster_Profile_mean)
+```
+
+Let’s now see the relative ratios, which we can also save in a .csv and explore if (absolutely) necessary - e.g. for collaboration with people using other tools.
+Let's see only ratios that are larger or smaller than 1 by, say, at least 0.1.
+
+```{r echo=TRUE, tidy=TRUE}
+ratio_limit = 0.1
+population_average_matrix <- population_average[, "Population", drop = F] %*%
+    matrix(rep(1, ncol(Cluster_Profile_mean)), nrow = 1)
+cluster_profile_ratios <- (ifelse(population_average_matrix == 0, 0, Cluster_Profile_mean/population_average_matrix))
+colnames(cluster_profile_ratios) <- paste("Segment", 1:ncol(cluster_profile_ratios),
+    sep = " ")
+rownames(cluster_profile_ratios) <- colnames(ProjectData)[profile_attributes_used]
+## printing the result in a clean-slate table
+# Save the segment profiles in a file: enter the name of the file!
+profile_file = "my_segmentation_profiles.csv"
+write.csv(cluster_profile_ratios, file = profile_file)
+# We can also save the cluster membership of our respondents:
+data_with_segment_membership = cbind(cluster_memberships, ProjectData)
+colnames(data_with_segment_membership)[1] = "Segment"
+cluster_file = "my_segments.csv"
+write.csv(data_with_segment_membership, file = cluster_file)
+```
+
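+Note that `ratio_limit` is not applied in the code above; here is a minimal sketch of how it could be used to blank out ratios close to 1 (the object name `cluster_profile_ratios_pruned` is just illustrative):
+
+```{r echo=TRUE, eval=FALSE, tidy=TRUE}
+# Keep only ratios that differ from 1 by at least ratio_limit
+cluster_profile_ratios_pruned <- cluster_profile_ratios
+cluster_profile_ratios_pruned[abs(cluster_profile_ratios_pruned - 1) < ratio_limit] <- NA
+round(cluster_profile_ratios_pruned, 2)
+```
+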
+The further a ratio is from 1, the more important that attribute is for a segment relative to the total population.
+
+Your Answers here:
+
+You have now completed your first market segmentation project. Do you have data from another survey you can use with this report now?
+Extra question: explore and report a new segmentation analysis…
+… and as always Have Fun
+The purpose of this exercise is to become familiar with:
+While doing this exercise we will also see how to generate replicable and customizable reports. For this purpose the exercise uses the R Markdown capabilities (see Markdown Cheat Sheet or a basic introduction to R Markdown). These capabilities allow us to create dynamic reports. For example today’s date is 2016-01-07 (you need to see the .Rmd to understand that this is not a static typed-in date but it changes every time you compile the .Rmd - if the date changed of course).
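+
+(A sketch of how such a dynamic value can be produced: a small R chunk - or the equivalent inline R expression - recomputes the date every time the report is knit.)
+
+```{r echo=TRUE, eval=FALSE}
+Sys.Date()  # today's date, recomputed on every knit
+```
+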
+Before starting, make sure you have pulled the exercise files on your github repository (if you pull the course github repository you also get the exercise set files automatically). Moreover, make sure you are in the directory of this exercise. Directory paths may be complicated, and sometimes a frustrating source of problems, so it is recommended that you use these R commands to find out your current working directory and, if needed, set it where you have the main files for the specific exercise/project (there are other ways, but for now just be aware of this path issue):
+```{r echo=TRUE, eval=FALSE, tidy=TRUE}
+getwd()
+setwd("Exercises/Exerciseset1/")
+list.files()
+```
+
+Note: you can always use the `help` command in Rstudio to find out about any R function (e.g. type `help(list.files)` to learn what the R function `list.files` does).
+
+Let's now see the exercise.
+IMPORTANT: You should answer all questions by simply adding your code/answers in this document through editing the file ExerciseSet1.Rmd and then clicking on the “Knit HTML” button in RStudio. Once done, please post your .Rmd and html files in your github repository.
+We download daily prices (open, high, low, close, and adjusted close) and volume data of publicly traded companies and markets from the web (e.g. Yahoo! or Google, etc). This is done by sourcing the file dataSet1.R as well as some helper functions in helpersSet1.R, which also installs a number of R libraries (hence the first time you run this code you will see a lot of red text indicating the download and installation process):
+source("helpersSet1.R")
+source("dataSet1.R")
+We have 2771 days of data, starting from 2005-01-04 until 2016-01-06.
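+
+(A sketch of how these numbers can be checked, assuming - as the helper scripts suggest - that `StockReturns` is the matrix of daily returns with the dates as row names:)
+
+```{r echo=TRUE, eval=FALSE, tidy=TRUE}
+nrow(StockReturns)               # number of trading days
+head(rownames(StockReturns), 1)  # first date in the sample
+tail(rownames(StockReturns), 1)  # last date in the sample
+```
+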
+Here are some basic statistics about the S&P returns:
+Here are returns of the S&P in this period (note the use of the helper function pnl_plot - defined in file helpersSet1.R):
+Your Answers here:
+In dataSet1.R, the variable `mytickers = c("SPY", "AAPL")` - line 9 - includes AAPL, which is the ticker of Apple. The lines from 13 to 46 download the data for the tickers in `mytickers`, which include the whole market (SPY) and Apple specifically.
+
+For this question I created a variable called `AAPL = StockReturns[, "AAPL"]`. The cumulative return is 384.1, the average daily return is 0.139 and the standard deviation is 2.188 (using the same functions as with SPY: `round(100*sum(AAPL),1)`, `round(100*mean(AAPL),3)` and `round(100*sd(AAPL),3)`). We can get the same results if we take the data from the second column of StockReturns, which stores the values of Apple (example for the cumulative return: `round(100*sum(StockReturns[,2]),1)`).
+
+We could get all the different companies traded from Yahoo. We would need to add their tickers to the variable `mytickers`. We would then store the data in additional columns in both StockPrices and StockReturns.
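+
+(A sketch of the commands described above; the variable name `AAPL` follows the answer's own choice:)
+
+```{r echo=TRUE, eval=FALSE, tidy=TRUE}
+AAPL <- StockReturns[, "AAPL"]
+round(100 * sum(AAPL), 1)   # cumulative return, in percent
+round(100 * mean(AAPL), 3)  # average daily return, in percent
+round(100 * sd(AAPL), 3)    # standard deviation of daily returns, in percent
+# The same cumulative return using the column index instead of the name:
+round(100 * sum(StockReturns[, 2]), 1)
+```
+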
+For this part of the exercise we will do some basic manipulations of the data. First note that the data are in a so-called matrix format. If you run these commands in RStudio (use help to find out what they do) you will see how matrices work:
+```{r echo=TRUE, eval=FALSE, tidy=TRUE}
+class(StockReturns)
+dim(StockReturns)
+nrow(StockReturns)
+ncol(StockReturns)
+StockReturns[1:4, ]
+head(StockReturns, 5)
+tail(StockReturns, 5)
+```
+
+We will now use an R function for matrices that is extremely useful for analyzing data. It is called `apply`. Check it out using `help(apply)` in R.
+
+For example, we can now quickly estimate the average returns of S&P and Apple (of course this can be done manually, too, but what if we had 500 stocks - e.g. a matrix with 500 columns?) and plot the returns of a 50-50 S&P and Apple portfolio:
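+
+(A minimal sketch: averaging the two return columns each day corresponds to a portfolio rebalanced daily to 50-50 weights; `portfolio` is the name also used in the Shiny example further below:)
+
+```{r echo=TRUE, eval=FALSE, tidy=TRUE}
+apply(StockReturns, 2, mean)               # average daily return of each column (SPY, AAPL)
+portfolio <- apply(StockReturns, 1, mean)  # 50-50 portfolio: the average of the two returns each day
+pnl_plot(portfolio)
+```
+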
+We can also transpose the matrix of returns to create a new “horizontal” matrix. Let’s call this matrix (variable name) transposedData. We can do so using this command: transposedData = t(StockReturns).
+Using the `apply` function (check `help(apply)`), can you create again the portfolio of S&P and Apple and plot the returns in a new figure below?
+
+Your Answers here:
+
+1. We use the command `dim(transposedData)` to get all the dimensions (which returns 2 2771). If we want only the number of rows we use `nrow(transposedData)`, and for the number of columns `ncol(transposedData)`, which give us again 2 and 2771.
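+
+(A sketch of how the same portfolio can be re-created from the transposed matrix; the name `portfolio2` is just illustrative, and note how rows and columns swap roles in the `apply` call:)
+
+```{r echo=TRUE, eval=FALSE, tidy=TRUE}
+transposedData <- t(StockReturns)             # now 2 rows (stocks) x 2771 columns (days)
+portfolio2 <- apply(transposedData, 2, mean)  # average across the rows of each column, i.e. per day
+pnl_plot(portfolio2)
+```
+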
+This is an important step and will get you to think about the overall process once again.
+Your Answers here:
+
+1. We should change line 11 in dataSet1.R, which sets the variable `startDate = "2005-01-01"`. We would then assign `startDate = "2001-01-01"`.
+Finally, one can read and write data in .csv files. For example, we can save the first 20 days of data for S&P and Apple in a file using the command:
+
+```{r echo=TRUE, eval=FALSE, tidy=TRUE}
+write.csv(StockReturns[1:20, c("SPY", "AAPL")], file = "twentydays.csv", row.names = TRUE, col.names = TRUE)
+```
+
+Do not get surprised if you see the csv file in your directories suddenly! You can then read the data from the csv file using the read.csv command. For example, this will load the data from the csv file and save it in a new variable that now is called "myData":
+
+```{r echo=TRUE, eval=FALSE, tidy=TRUE}
+myData <- read.csv(file = "twentydays.csv", header = TRUE, sep = ";")
+```
+Try it!
+
+```{r echo=TRUE, eval=FALSE, tidy=TRUE}
+sum(myData != StockReturns[1:20, ])
+myData + StockReturns[1:40, ]
+```
+
+Your Answers here:
+
+1. It returns a value of 20.
+Can you now load another dataset from some CSV file and report some basic statistics about that data?
+Finally, just for fun, one can add some interactivity in the report using Shiny. All one needs to do is set the eval flag of the code chunk below (see the .Rmd file) to TRUE, add the line "runtime: shiny" at the very beginning of the .Rmd file, set the markdown output to "html_document", and then press "Run Document".
+sliderInput("startdate", "Starting Date:", min = 1, max = length(portfolio),
+ value = 1)
+sliderInput("enddate", "End Date:", min = 1, max = length(portfolio),
+ value = length(portfolio))
+
+renderPlot({
+ pnl_plot(portfolio[input$startdate:input$enddate])
+})
+Have fun.
+The purpose of this exercise is to become familiar with:
+As always, while doing this exercise we will also see how to generate replicable and customizable reports. For this purpose the exercise uses the R Markdown capabilities (see Markdown Cheat Sheet or a basic introduction to R Markdown). These capabilities allow us to create dynamic reports. For example today’s date is 2016-01-20 (you need to see the .Rmd to understand that this is not a static typed-in date but it changes every time you compile the .Rmd - if the date changed of course).
+Before starting, make sure you have pulled the exercise set 2 source code files on your github repository (if you pull the course github repository you also get the exercise set files automatically). Moreover, make sure you are in the directory of this exercise. Directory paths may be complicated, and sometimes a frustrating source of problems, so it is recommended that you use these R commands to find out your current working directory and, if needed, set it where you have the main files for the specific exercise/project (there are other ways, but for now just be aware of this path issue). For example, assuming we are now in the "Data Analytics R version/INSEADAnalytics" directory, we can do these:
+```{r echo=TRUE, eval=FALSE, tidy=TRUE}
+getwd()
+setwd("Exercises/Exerciseset2/")
+list.files()
+```
+
+Note: as always, you can use the `help` command in Rstudio to find out about any R function (e.g. type `help(list.files)` to learn what the R function `list.files` does).
+
+Let's now see the exercise.
+IMPORTANT: You should answer all questions by simply adding your code/answers in this document through editing the file ExerciseSet2.Rmd and then clicking on the “Knit HTML” button in RStudio. Once done, please post your .Rmd and html files in your github repository.
+For this exercise we will use futures' daily returns to develop what is considered to be a "classic" hedge fund trading strategy, a futures trend following strategy. There is a lot written about this, so it is worth doing some online search about "futures trend following", or "Managed Futures", or "Commodity Trading Advisors (CTA)". There is about $300 billion invested in this strategy today, and it is considered to be one of the oldest hedge fund strategies. Some example links are:
+Of course there are also many starting points for developing such a strategy (for example this R bloggers one (also on github), or the turtle traders website, which has many resources).
+In this exercise we will develop our own strategy from scratch.
+Note (given today’s market conditions): Prices of commodities, like oil or gold, can be excellent indicators of the health of the economy and of various industries, as we will also see below.
+There are many ways to get futures data. For example, one can use the Quandl package, or the turtle traders resources, or (for INSEAD only) get data from the INSEAD library finance data resources website. One has to pay attention to how to create continuous time series from underlying contracts with varying deliveries (e.g. see here). Using a combination of the resources above, we will use data for a number of commodities.
+Let’s load the data and see what we have.
+source("helpersSet2.R")
+library(googleVis)
+load("data/FuturesTrendFollowingData.Rdata")
+
+We have data from 2001-01-02 to 2015-09-24 of daily returns for the following 64 futures:
+```{r echo=TRUE, eval=FALSE, results='asis'}
+show_data = data.frame(colnames(futures_data))
+m1 <- gvisTable(show_data, options = list(showRowNumber = TRUE, width = 1920,
+    height = min(400, 27 * (nrow(show_data) + 1)), allowHTML = TRUE, page = "disable"))
+print(m1, "chart")
+```
+
+Let's see how these are correlated. Let's also make it look nicer (than, say, what we did in Exercise Set 1), using Google Charts (see examples online, e.g. examples and the R package used). The correlation matrix is as follows (note that the table is "dynamic": for example you can sort it based on each column by clicking on the column's header).
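+
+(A sketch of how such a table can be produced with googleVis; the object names `show_cor` and `m2` and the display options are just illustrative:)
+
+```{r echo=TRUE, eval=FALSE, results='asis'}
+show_cor <- data.frame(cbind(Futures = colnames(futures_data), round(cor(futures_data), 2)))
+m2 <- gvisTable(show_cor, options = list(showRowNumber = FALSE, width = 1920,
+    height = min(400, 27 * (nrow(show_cor) + 1)), allowHTML = TRUE, page = "disable"))
+print(m2, "chart")
+```
+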
+We see quite high correlations among some of the futures. Does it make sense? Why? Do you see some negative correlations? Do those make sense?
+Given such high correlations, we can try to see whether there are some “principal components” (see reading on dimensionality reduction). This analysis can also indicate whether all futures (the global economy!) are driven by some common “factors” (let’s call them “risk factors”).
+```{r echo=TRUE, eval=FALSE, tidy=TRUE}
+Variance_Explained_Table_results <- PCA(futures_data, graph = FALSE)
+Variance_Explained_Table <- cbind(paste("component", 1:ncol(futures_data), sep = " "),
+    Variance_Explained_Table_results$eig)
+Variance_Explained_Table <- as.data.frame(Variance_Explained_Table)
+colnames(Variance_Explained_Table) <- c("Component", "Eigenvalue", "Percentage_of_explained_variance",
+    "Cumulative_percentage_of_explained_variance")
+```
+
+Here is the scree plot (see Sessions 3-4 readings):
+```{r echo=TRUE, eval=FALSE, tidy=TRUE}
+eigenvalues <- Variance_Explained_Table[, 2]
+```
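+
+(A sketch of a simple scree plot; the eigenvalues are taken here directly from the numeric PCA output, since the table built above was converted to text by the `cbind` with the component names:)
+
+```{r echo=TRUE, eval=FALSE, tidy=TRUE}
+plot(Variance_Explained_Table_results$eig[, 1], type = "b",
+    xlab = "Component", ylab = "Eigenvalue")
+abline(h = 1, lty = 2)
+```
+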
+Let's now see what the first 20 (rotated) principal components look like. Let's also use the rotated factors (note that these are not really the "principal components", as explained in the reading on dimensionality reduction) and not show any numbers less than 0.3 in absolute value, to avoid cluttering. Note again that you can sort the table according to any column by clicking on the header of that column.
+```{r echo=TRUE, eval=FALSE, tidy=TRUE}
+corused = cor(futures_data[, apply(futures_data != 0, 2, sum) > 10, drop = F])
+Rotated_Results <- principal(corused, nfactors = 20, rotate = "varimax", score = TRUE)
+Rotated_Factors <- round(Rotated_Results$loadings, 2)
+Rotated_Factors <- as.data.frame(unclass(Rotated_Factors))
+colnames(Rotated_Factors) <- paste("Component", 1:ncol(Rotated_Factors), sep = " ")
+
+sorted_rows <- sort(Rotated_Factors[, 1], decreasing = TRUE, index.return = TRUE)$ix
+Rotated_Factors <- Rotated_Factors[sorted_rows, ]
+Rotated_Factors[abs(Rotated_Factors) < 0.3] <- NA
+```
+
+Your Answers here:
+We need 6 factors (cumulative percentage of variance explained = 52.29%).
+
+In absolute value, the highest weights are 0.93 (5yr T-Notes US and 10yr T-Notes), 0.86 (Treasury bonds), 0.85 (2yr T-Note US), 0.79, 0.74, …
+
+If we look at the first 10 factors, we can see that, since the overlap of values between the factors is low, we can explain the variables with these 10 factors.
+For the pre-crisis data:
+
+For the crisis data:
+We can now develop a simple futures trend following trading strategy, as outlined in the papers in the Exercise Introduction above. There are about $300 billion invested in such strategies! Of course we cannot develop here a sophisticated product, but with some more work…
+We will do the following:
+We will use the helper function `ma` to compute moving averages of the returns (try for example to see what this returns: `ma(1:10,2)`), and the helper function `shift` to use the signal from a couple of days earlier. Here is the code.
+```{r echo=TRUE, eval=FALSE, tidy=TRUE}
+signal_used = 0 * futures_data  # just initialize the trading signal to be 0
+# Take many Moving Average (MA) signals and let them 'vote' with their sign
+# (+-1, e.g. long or short vote, for each signal)
+MAfreq <- seq(20, 250, by = 30)
+for (iter in 1:length(MAfreq)) signal_used = signal_used + sign(apply(futures_data,
+    2, function(r) ma(r, MAfreq[iter])))
+# Now make sure we invest $1 every day (so the sum of the absolute values of
+# the weights is 1 every day)
+signal_used = t(apply(signal_used, 1, function(r) {
+    res = r
+    if (sum(abs(r)) != 0)
+        res = r/sum(abs(r))
+    res
+}))
+colnames(signal_used) <- colnames(futures_data)
+# Now create the returns of the strategy for each futures time series
+strategy_by_future <- scrub(shift(signal_used, 2) * futures_data)  # use the signal from 2 days ago
+# finally, this is our futures trend following strategy
+trading_strategy = apply(strategy_by_future, 1, sum)
+names(trading_strategy) <- rownames(futures_data)
+```
+
+Let's see how this strategy does:
+
+Here is how this strategy has performed during this period.
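+
+(A sketch of how the performance can be plotted, using the `pnl_plot` helper as in the modified strategy further below:)
+
+```{r echo=TRUE, eval=FALSE, tidy=TRUE}
+pnl_plot(trading_strategy)  # cumulative performance of the strategy
+```
+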
+How does this compare with existing CTA products such as this one from Societe Generale? (Note: one can easily achieve a correlation of more than 0.8 with this specific product - as well as with many other ones)
+MAfreq<-seq(10,250,by=20) above does for example - but not only of course, the possibilities are endless)Your Answers here:
+The code generates a new variable (`signal_used`) which has the same format as `futures_data`. Then, it fills this variable with the "votes" of the moving averages whose lengths are given by another new variable (`MAfreq`). These votes indicate the investment (long or short) for each of the futures.
+If we change the values of the moving average lengths (for example, using a broader range), the returns will be slightly different (although the overall trend is similar).
+```{r echo=TRUE, eval=FALSE, tidy=TRUE}
+signal_used = 0 * futures_data
+MAfreq <- seq(10, 400, by = 10)
+for (iter in 1:length(MAfreq)) signal_used = signal_used + sign(apply(futures_data,
+    2, function(r) ma(r, MAfreq[iter])))
+signal_used = t(apply(signal_used, 1, function(r) {
+    res = r
+    if (sum(abs(r)) != 0)
+        res = r/sum(abs(r))
+    res
+}))
+colnames(signal_used) <- colnames(futures_data)
+strategy_by_future <- scrub(shift(signal_used, 2) * futures_data)
+trading_strategy = apply(strategy_by_future, 1, sum)
+names(trading_strategy) <- rownames(futures_data)
+pnl_plot(trading_strategy)
+```
+
+Now you have seen how to develop some trading strategies that hedge funds have been using for decades. Clearly this is only the very first step - as many of the online resources on technical indicators also suggest. Can you now explore more such strategies? How good a futures trend following hedge fund strategy can you develop? Let's call this… a class competition! Explore as much as you can and report your best strategy as we move along the course…
+Here is for example something that can be achieved relatively easily…
+Here is how this strategy has performed during this period.
+
+Finally: One can develop (shiny-based) interactive versions of this report and deploy them using `shinyapps::deployApp('ExerciseSet2.Rmd')` (you need a shinyapps.io account for this). This is for example an interactive version of this exercise.
+
+As always, have fun
+