Causal-Inference-R-Code-Peng-Ding/CHAP-1_REPLICATION.Rmd at main · ttasdique-matadora/Causal-Inference-R-Code-Peng-Ding · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
---
title: "CHAP-1_REPLICATION"
author: "Tawhid Tasdique"
date: "2026-04-08"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

setwd("F:/MASTERS/CAUSAL/Peng Ding Replication")

pkg <- c("tidyverse", "ggthemes", "readr", "datasets")

lapply(X = pkg, FUN = library, character.only = T)

cps1re74 <- read_table("dataverse_files/cps1re74.csv")

resume <- read_csv("dataverse_files/resume.csv")

```

# Pg-4

```{r}

dat <- cps1re74 %>%
  reframe(
    treat = `"treat"`,
    age = `"age"`,
    educ = `"educ"`,
    black = `"black"`,
    hispan = `"hispan"`,
    married = `"married"`,
    nodegree = `"nodegree"`,
    re74 = `"re74"`,
    re75 = `"re75"`,
    re78 = `"re78"`
  )

dat$u74 <- as.numeric(dat$re74 == 0) # those unpaid in 1974
dat$u75 <- as.numeric(dat$re75 == 0) # those unpaid in 1975

# linear regression on outcome
lmoutcome <- lm(data = dat,
                formula = re78 ~ .) # unrestricted model

# textbook's way of extracting stats for "treat" variable
summary(lmoutcome)$coeff[2, 1:2]

# tidy way to extract the same
lmoutcome %>%
  summary() %>%
  broom::tidy() %>%
  dplyr::filter(term == "treat") %>%
  dplyr::select(estimate, std.error)

# restricted model (simple linear regression)
lmoutcome <- lm(data = dat,
                formula = re78 ~ treat) # unrestricted model

# textbook's way of extracting stats for "treat" variable
summary(lmoutcome)$coeff[2, 1:2]

# tidy way to extract the same
lmoutcome %>%
  summary() %>%
  broom::tidy() %>%
  dplyr::filter(term == "treat") %>%
  dplyr::select(estimate, std.error)

# The effects between the two models differ because of the inclusion and exclusion of confounders.

```

# Pg-6

```{r}

Alltable <- table(resume$race, resume$call)

# Fisher's exact test
fisher.test(Alltable)
# Why not do the Chi-square test of independence? It's not like we lack observation in any of the slots. The Chi-square test would be better for our purposes.

```

## Conclusion

There is a significant association between race and call-back status. But remember, this is not a cohort study (where participants are followed over a length of time to record the desired outcome).

This is actually a randomized experiment. The racially-suggestive names were actually randomly assigned. Hence, we can infer that you're less likely to be called back for a position if you are African-American.

**Note:** There are equal sample sizes of both races in the study. This allocation of sample is ideal for exposure-based sampling/cohort studies as cohort studies are most powerful when the exposure-based sample sizes are equal. I proved this is in my epidemiology notes. Also, it eradicates the chance that the observed effect/association is due to varying sample sizes.

# Pg-11

This problem was part of assignment-1 as well.

```{r}

# aperm() is a very interesting function, glad I came across
dta <- aperm(a = UCBAdmissions, perm = c(2, 1, 3))

# dta %>% View()

###############################

# Now the data is being aggregated over departments i.e. departments will no longer be distinguishable, it will no longer be used as a factor. That is, the department variable will be "marginalized" in terms of probability theory.
dta.sum <- apply(
  X = UCBAdmissions,
  MARGIN = c(1, 2), # all observations having common admission and gender will be operated on together i.e. they will be treated as a vector/list.
  FUN = sum # relevant observations will undergo summation
)

###############################

# The following function will estimate risk difference and compute the p-value corresponding to the chi-square test of independence for any given contingency (which acts as the input).
rd.and.pvalue <- function(tb2){

  # This table is not oriented in the classical manner but whatever.
  # As long as I'm getting the same results, it's fine
  `p[Disease | Exposure]` = tb2[1, 2]/(tb2[1, 2] + tb2[2, 2])

  `p[Disease | Non-exposure]` = tb2[1, 1]/(tb2[1, 1] + tb2[2, 1])

  # exposure-based risk difference
  risk.difference = (
    `p[Disease | Exposure]` - `p[Disease | Non-exposure]`
  )

  # p-value for chi-square statistic
  p_value = chisq.test(x = tb2) %>%
    broom::tidy() %>%
    pull(p.value)

  # final output
  return(list(
    `Risk Difference` = risk.difference,
    `P-value` = p_value
  ))

}

###############################

# applying above function on the aggregated data (a.k.a. department-agnostic data)
rd.and.pvalue(tb2 = dta.sum)
# Interpreting p-value : There is significant association between admission status and gender, overall, at 5% level of significance.
# Interpreting risk difference: You are 14.16% less likely to be admitted to any of the listed departments if you are female than otherwise.

###############################

# stratum-wise application of the function (each department is a stratum in this context)
for(dept in 1:6){

  rd.and.pvalue(tb2 = dta[ , , dept]) %>%
    print()

}

# simpler alternative
sapply(
  X = 1:6,
  FUN = function(i) {
    rd.and.pvalue(tb2 = dta[, , i])
  }
)
# Conclusion: Only for department-A the chi-square statistic is significant at 5% level. There is a 9% higher chance of admission if the candidate is female than otherwise, in department-A.

# Note: If only department-A had agreed with the other department regarding the null hypothesis, then we could have witnessed the Yule-Simpson paradox where the overall/department-agnostic data disagrees with each and every stratum with regards to the null hypothesis.

```

# Textbook Exercises

## Ex-1.4

```{r}

# Sex-specific contingency tables

female.table <- resume %>%
  filter(sex == "female") %>%
  dplyr::select(call, race) %>%
  table() %>%  # reorder the rows
  {.[1:nrow(.), 1:ncol(.)]} # correct orientation of contingency table

male.table <- resume %>%
  filter(sex == "male") %>%
  dplyr::select(call, race) %>%
  table() %>% # reorder the rows
  {.[1:nrow(.), 1:ncol(.)]} # correct orientation of contingency table

########################

# Chi-Square Test of Independence

## analysis for females
chisq.test(x = female.table)
# There is association between race and call-back status among females, at 5% level of significance.

## analysis for males
chisq.test(x = male.table)
# There is no association between race and call-back status among males, at 5% level of significance.

```