set.seed(123)
Importing the dataset as in 01
Introduction.rmd, but using the R script available in import_data/munich_rent_index.R:
source("import_data/munich_rent_index.R")
Filtering by apartments build after 1966:
rent_1966 <- subset(munich_rent_index, yearc >= 1966)
summary(rent_1966)
rent rentsqm area yearc location
Min. : 133.3 Min. : 1.926 Min. : 20.00 Min. :1966 1:768
1st Qu.: 368.2 1st Qu.: 6.618 1st Qu.: 48.00 1st Qu.:1971 2:346
Median : 482.6 Median : 8.216 Median : 63.00 Median :1973 3: 37
Mean : 508.7 Mean : 8.242 Mean : 63.92 Mean :1978
3rd Qu.: 610.1 3rd Qu.: 9.783 3rd Qu.: 80.00 3rd Qu.:1984
Max. :1517.9 Max. :17.722 Max. :154.00 Max. :1997
bath kitchen cheating district
Mode :logical Mode :logical Mode :logical 2024 : 28
FALSE:1050 FALSE:1070 FALSE:7 711 : 24
TRUE :101 TRUE :81 TRUE :1144 1012 : 21
1013 : 21
2221 : 21
1942 : 16
(Other):1020
And use only average locations:
av_rent_1966 <- subset(rent_1966, location == 1)
summary(av_rent_1966)
rent rentsqm area yearc location
Min. : 133.3 Min. : 2.077 Min. : 20.00 Min. :1966 1:768
1st Qu.: 359.4 1st Qu.: 6.344 1st Qu.: 48.00 1st Qu.:1971 2: 0
Median : 466.3 Median : 7.892 Median : 63.00 Median :1972 3: 0
Mean : 484.8 Mean : 7.902 Mean : 63.62 Mean :1977
3rd Qu.: 578.6 3rd Qu.: 9.291 3rd Qu.: 79.00 3rd Qu.:1984
Max. :1459.2 Max. :14.872 Max. :154.00 Max. :1997
bath kitchen cheating district
Mode :logical Mode :logical Mode :logical 2024 : 28
FALSE:714 FALSE:710 FALSE:1 1012 : 21
TRUE :54 TRUE :58 TRUE :767 1013 : 21
2221 : 20
1942 : 16
1641 : 15
(Other):647
Then the simple linear regression model is given by
\[ \mathbb{E}({\text{rent}}) = \beta_{0} + \beta_{1} \cdot \text{area} \]
simple_model <- lm(rent ~ area, data = av_rent_1966)
summary(simple_model)
Call:
lm(formula = rent ~ area, data = av_rent_1966)
Residuals:
Min 1Q Median 3Q Max
-434.36 -81.23 -2.23 80.08 549.11
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 130.2334 14.2884 9.115 <2e-16 ***
area 5.5729 0.2133 26.128 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 123.9 on 766 degrees of freedom
Multiple R-squared: 0.4712, Adjusted R-squared: 0.4705
F-statistic: 682.7 on 1 and 766 DF, p-value: < 2.2e-16
par(mfrow = c(1, 2))
plot(
x = av_rent_1966$area,
y = av_rent_1966$rent,
xlab = 'area in sqm',
ylab = 'net rent in Euro'
)
plot(
x = av_rent_1966$area,
y = av_rent_1966$rent,
xlab = 'area in sqm',
ylab = 'net rent in Euro'
)
abline(simple_model, col = "red", lwd=2)
Choosing the rent per square meter as response variable, we have the model:
\[ \mathbb{E}(\text{rent per sqm}) = \beta_{0} + \beta_{1} \cdot \text{area} \]
per_sqm_model <- lm(rentsqm ~ area, data = av_rent_1966)
summary(per_sqm_model)
Call:
lm(formula = rentsqm ~ area, data = av_rent_1966)
Residuals:
Min 1Q Median 3Q Max
-5.4874 -1.4274 -0.0734 1.2813 6.3396
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.497332 0.219984 47.72 <2e-16 ***
area -0.040797 0.003284 -12.42 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.908 on 766 degrees of freedom
Multiple R-squared: 0.1677, Adjusted R-squared: 0.1666
F-statistic: 154.3 on 1 and 766 DF, p-value: < 2.2e-16
plot(
x = av_rent_1966$area,
y = av_rent_1966$rentsqm,
xlab = 'area in sqm',
ylab = 'net rent per sqm'
)
abline(per_sqm_model, col = "red", lwd=2)
We change the variable
\[ x = \frac{1}{\text{area}} \]
Giving us the model:
\[ \mathbb{E}(\text{rent per sqm}) = \beta_{0} + \beta_{1} \cdot x = \beta_{0} + \beta_{1} \cdot \frac{1}{\text{area}} \]
av_rent_1966$inv_area <- 1 / av_rent_1966$area
per_sqm_inv_model <- lm(rentsqm ~ inv_area, data = av_rent_1966)
summary(per_sqm_inv_model)
Call:
lm(formula = rentsqm ~ inv_area, data = av_rent_1966)
Residuals:
Min 1Q Median 3Q Max
-5.5730 -1.4152 -0.0508 1.3200 6.2068
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.4449 0.1901 28.64 <2e-16 ***
inv_area 138.3266 10.0067 13.82 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.871 on 766 degrees of freedom
Multiple R-squared: 0.1997, Adjusted R-squared: 0.1986
F-statistic: 191.1 on 1 and 766 DF, p-value: < 2.2e-16
plot(
x = av_rent_1966$area,
y = av_rent_1966$rentsqm,
xlab = 'area in sqm',
ylab = 'net rent per sqm'
)
x_seq <- seq(min(1 / av_rent_1966$area), max(1 / av_rent_1966$area), length.out = 100)
predicted <- predict(per_sqm_inv_model, newdata = data.frame(inv_area = x_seq))
lines(1 / x_seq, predicted, col = "red", lwd = 2)
We add the variable \(\text{good location}\) to the model:
\[ \mathbb{E}(\text{rent}) = \beta_{0} + \beta_{1} \cdot \text{area} + \beta_{2} \cdot \text{good localtion} \]
ag_rent_1966 <- subset(rent_1966, location %in% c(1, 2))
summary(ag_rent_1966)
rent rentsqm area yearc location
Min. : 133.3 Min. : 1.926 Min. : 20.00 Min. :1966 1:768
1st Qu.: 367.8 1st Qu.: 6.578 1st Qu.: 48.00 1st Qu.:1971 2:346
Median : 479.3 Median : 8.166 Median : 63.00 Median :1973 3: 0
Mean : 502.0 Mean : 8.201 Mean : 63.53 Mean :1978
3rd Qu.: 601.1 3rd Qu.: 9.727 3rd Qu.: 79.00 3rd Qu.:1984
Max. :1517.9 Max. :17.722 Max. :154.00 Max. :1997
bath kitchen cheating district
Mode :logical Mode :logical Mode :logical 2024 : 28
FALSE:1020 FALSE:1035 FALSE:6 711 : 24
TRUE :94 TRUE :79 TRUE :1108 1012 : 21
1013 : 21
2221 : 21
1942 : 16
(Other):983
ag_rent_1966$glocation <- ag_rent_1966$location == 2
summary(ag_rent_1966)
rent rentsqm area yearc location
Min. : 133.3 Min. : 1.926 Min. : 20.00 Min. :1966 1:768
1st Qu.: 367.8 1st Qu.: 6.578 1st Qu.: 48.00 1st Qu.:1971 2:346
Median : 479.3 Median : 8.166 Median : 63.00 Median :1973 3: 0
Mean : 502.0 Mean : 8.201 Mean : 63.53 Mean :1978
3rd Qu.: 601.1 3rd Qu.: 9.727 3rd Qu.: 79.00 3rd Qu.:1984
Max. :1517.9 Max. :17.722 Max. :154.00 Max. :1997
bath kitchen cheating district glocation
Mode :logical Mode :logical Mode :logical 2024 : 28 Mode :logical
FALSE:1020 FALSE:1035 FALSE:6 711 : 24 FALSE:768
TRUE :94 TRUE :79 TRUE :1108 1012 : 21 TRUE :346
1013 : 21
2221 : 21
1942 : 16
(Other):983
good_model <- lm(rent ~ area + glocation, data = ag_rent_1966)
summary(good_model)
Call:
lm(formula = rent ~ area + glocation, data = ag_rent_1966)
Residuals:
Min 1Q Median 3Q Max
-497.96 -82.28 0.42 81.37 611.07
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 112.686 12.973 8.686 < 2e-16 ***
area 5.849 0.189 30.942 < 2e-16 ***
glocationTRUE 57.261 8.727 6.561 8.15e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 134.8 on 1111 degrees of freedom
Multiple R-squared: 0.4732, Adjusted R-squared: 0.4722
F-statistic: 498.9 on 2 and 1111 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
colors <- c("black", "red")
plot(
x = ag_rent_1966$area,
y = ag_rent_1966$rent,
xlab = 'area in sqm',
ylab = 'net rent in Euro',
col = colors[as.numeric(ag_rent_1966$location)]
)
legend(
"topleft",
legend = c('average', 'good'),
col = colors,
pch = 19
)
g_rent_1966 <- subset(rent_1966, location == 2)
simple_good_model <- lm(rent ~ area, data = g_rent_1966)
plot(NA, xlim = range(rent_1966$area), ylim = range(rent_1966$rent), xlab = 'area in sqm', ylab = 'estimated net rent')
abline(simple_model)
abline(simple_good_model, col = "red", lwd = 2)
legend(
"topleft",
legend = c('average', 'good'),
col = c('black', 'red'),
lwd = 2
)
plot(NA, xlim = range(rent_1966$area), ylim = range(rent_1966$rent), xlab = 'area in sqm', ylab = 'esimated net rent')
beta <- good_model$coefficients
abline(beta[1], beta[2])
abline(beta[1] + beta[3], beta[2], col = "red", lwd = 2)
legend(
"topleft",
legend = c('average', 'good'),
col = c('black', 'red'),
lwd = 2
)
We add the good location dummy variable to the model of example 2.2:
\[ \mathbb{E}(\text{rent per sqm}) = \beta_{0} + \beta_{1} \cdot \frac{1}{\text{area}} + \beta_{2} \cdot \text{good location} \]
g_multiple_model <- lm(rentsqm ~ I(1/area) + glocation, data = ag_rent_1966)
summary(g_multiple_model)
Call:
lm(formula = rentsqm ~ I(1/area) + glocation, data = ag_rent_1966)
Residuals:
Min 1Q Median 3Q Max
-6.1795 -1.4137 -0.0329 1.3489 8.5672
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.5090 0.1668 33.028 < 2e-16 ***
I(1/area) 134.7204 8.4686 15.908 < 2e-16 ***
glocationTRUE 0.8961 0.1295 6.922 7.53e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.998 on 1111 degrees of freedom
Multiple R-squared: 0.2173, Adjusted R-squared: 0.2159
F-statistic: 154.2 on 2 and 1111 DF, p-value: < 2.2e-16
par(mfrow = c(1, 2))
colors <- c("black", "red")
plot(
x = ag_rent_1966$area,
y = ag_rent_1966$rentsqm,
xlab = 'area in sqm',
ylab = 'net rent in Euro',
col = colors[as.numeric(ag_rent_1966$location)]
)
legend(
"topleft",
legend = c('average', 'good'),
col = colors,
pch = 19
)
plot(NA, xlim = range(rent_1966$area), ylim = range(rent_1966$rentsqm), xlab = 'area in sqm', ylab = 'esimated net rent per sqm')
x_seq <- seq(min(ag_rent_1966$area), max(ag_rent_1966$area), length.out = 100)
beta <- g_multiple_model$coefficients
lines(x_seq, beta[1] + beta[2] * (1/x_seq), col = 'black', lwd = 2)
lines(x_seq, beta[1] + beta[2] * (1/x_seq) + beta[3], col = 'red', lwd = 2)
legend(
"topleft",
legend = c('average', 'good'),
col = c('black', 'red'),
lwd = 2
)
Adding the interaction term between area and good location:
\[ \mathbb{E}(\text{rent}) = \beta_{0} + \beta_{1} \cdot \text{area} + \beta_{2} \cdot \text{good location} + \beta_{3} \cdot \text{interaction} \]
ag_rent_1966$inter <- ag_rent_1966$area * as.numeric(ag_rent_1966$glocation)
summary(ag_rent_1966$inter)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.00 0.00 19.67 41.00 132.00
g_inter_model <- lm(rent ~ area + glocation + inter, data = ag_rent_1966)
summary(g_inter_model)
Call:
lm(formula = rent ~ area + glocation + inter, data = ag_rent_1966)
Residuals:
Min 1Q Median 3Q Max
-520.67 -81.88 -1.22 80.68 576.91
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 130.2334 15.5173 8.393 <2e-16 ***
area 5.5729 0.2316 24.058 <2e-16 ***
glocationTRUE 5.2053 26.7999 0.194 0.8460
inter 0.8208 0.3996 2.054 0.0402 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 134.6 on 1110 degrees of freedom
Multiple R-squared: 0.4752, Adjusted R-squared: 0.4737
F-statistic: 335 on 3 and 1110 DF, p-value: < 2.2e-16
plot(NA, xlim = range(ag_rent_1966$area), ylim = range(ag_rent_1966$rent), xlab = 'area in sqm', ylab = 'esimated net rent')
x_seq <- seq(min(ag_rent_1966$area), max(ag_rent_1966$area), length.out = 100)
beta <- g_inter_model$coefficients
lines(x_seq, beta[1] + beta[2] * x_seq, col = 'black', lwd = 2)
lines(x_seq, beta[1] + (beta[2] + beta[4]) * x_seq + beta[3], col = 'red', lwd = 2)
legend(
"topleft",
legend = c('average', 'good'),
col = c('black', 'red'),
lwd = 2
)
Now we add the top location (\(\text{location} = 3\)):
munich_rent_index$glocation <- munich_rent_index$location == 2
munich_rent_index$tlocation <- munich_rent_index$location == 3
summary(munich_rent_index)
rent rentsqm area yearc location
Min. : 40.51 Min. : 0.4158 Min. : 20.00 Min. :1918 1:1794
1st Qu.: 322.03 1st Qu.: 5.2610 1st Qu.: 51.00 1st Qu.:1939 2:1210
Median : 426.97 Median : 6.9802 Median : 65.00 Median :1959 3: 78
Mean : 459.44 Mean : 7.1113 Mean : 67.37 Mean :1956
3rd Qu.: 559.36 3rd Qu.: 8.8408 3rd Qu.: 81.00 3rd Qu.:1972
Max. :1843.38 Max. :17.7216 Max. :160.00 Max. :1997
bath kitchen cheating district glocation
Mode :logical Mode :logical Mode :logical 411 : 53 Mode :logical
FALSE:2891 FALSE:2951 FALSE:321 623 : 53 FALSE:1872
TRUE :191 TRUE :131 TRUE :2761 350 : 49 TRUE :1210
563 : 49
711 : 42
360 : 38
(Other):2798
tlocation
Mode :logical
FALSE:3004
TRUE :78
A multiple model with all available variables:
\[ \begin{matrix} \mathbb{E}(\text{rentsqm}) = & \beta_{0} + \beta_{1} \cdot \frac{1}{\text{area}} + \beta_{2} \cdot \text{yearc} + \beta_{3} \cdot \text{yearc}^2 + \beta_{4} \cdot \text{good location} + \beta_{5} \cdot \text{top location} \\ & + \beta_{6} \cdot \text{bath} + \beta_{7} \cdot \text{kitchen} + \beta_{8} \cdot \text{cheating} \\ \end{matrix} \]
multiple_model <- lm(rentsqm ~ I(1/area) + yearc + I(yearc^2) + glocation + tlocation + bath + kitchen + cheating, data = munich_rent_index)
summary(multiple_model)
Call:
lm(formula = rentsqm ~ I(1/area) + yearc + I(yearc^2) + glocation +
tlocation + bath + kitchen + cheating, data = munich_rent_index)
Residuals:
Min 1Q Median 3Q Max
-7.0080 -1.2651 -0.0558 1.2261 8.0095
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.685e+03 2.681e+02 13.745 < 2e-16 ***
I(1/area) 1.375e+02 5.303e+00 25.929 < 2e-16 ***
yearc -3.801e+00 2.747e-01 -13.838 < 2e-16 ***
I(yearc^2) 9.805e-04 7.033e-05 13.941 < 2e-16 ***
glocationTRUE 6.795e-01 7.276e-02 9.340 < 2e-16 ***
tlocationTRUE 1.519e+00 2.227e-01 6.819 1.10e-11 ***
bathTRUE 5.027e-01 1.472e-01 3.416 0.000644 ***
kitchenTRUE 8.665e-01 1.752e-01 4.945 8.02e-07 ***
cheatingTRUE 1.870e+00 1.229e-01 15.225 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.921 on 3073 degrees of freedom
Multiple R-squared: 0.38, Adjusted R-squared: 0.3783
F-statistic: 235.4 on 8 and 3073 DF, p-value: < 2.2e-16
par(mfrow = c(1, 2))
plot(
x = munich_rent_index$area,
y = munich_rent_index$rentsqm,
xlab = 'area in sqm',
ylab = 'estimated net rent in Euro',
)
area_seq <- seq(min(munich_rent_index$area), max(munich_rent_index$area), length.out = 100)
beta <- multiple_model$coefficients
constant <- beta[5] * mean(as.numeric(munich_rent_index$glocation)) +
beta[6] * mean(as.numeric(munich_rent_index$tlocation)) +
beta[7] * mean(as.numeric(munich_rent_index$bath)) +
beta[8] * mean(as.numeric(munich_rent_index$kitchen)) +
beta[9] * mean(as.numeric(munich_rent_index$cheating))
year_constant <- beta[3] * mean(munich_rent_index$yearc) + beta[4] * mean(munich_rent_index$yearc^2)
lines(area_seq, beta[1] + beta[2] * (1 / area_seq) + constant + year_constant, col = 'red', lwd = 2)
plot(
x = munich_rent_index$yearc,
y = munich_rent_index$rentsqm,
xlab = 'year of construction',
ylab = 'estimated net rent in Euro',
)
area_constant <- beta[2] * mean(1 / munich_rent_index$area)
year_seq <- seq(min(munich_rent_index$yearc), max(munich_rent_index$yearc), length.out = 100)
lines(year_seq, beta[1] + beta[3] * year_seq + beta[4] * year_seq^2 + constant + area_constant, col = 'red', lwd = 2)
Import the patent opposition from script:
source("import_data/patent_opposition.R")
We left the \(\text{opp}\) dataset as integer, so we can model the response variable as binary:
patent_opposition$opp <- as.integer(patent_opposition$opp == 1)
summary(patent_opposition$opp)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.4149 1.0000 1.0000
We explore the histogram of claims (\(\text{nclaims}\)) and citations (\(\text{ncit}\)):
par(mfrow = c(1, 2))
hist(patent_opposition$nclaims, ylab = 'estimated density', xlab = 'number of claims', freq = FALSE)
lines(density(patent_opposition$nclaims), col = 'red', lwd = 2)
hist(patent_opposition$ncit, ylab = 'estimated density', xlab = 'number of citations', freq = FALSE)
lines(density(patent_opposition$ncit), col = 'red', lwd = 2)
We then subset the dataset to exclude \(\text{nclaims} \leq 60\) and \(\text{ncit} \leq 15\). And also extract the biopharmacological ones (\(\text{biopharm} == 0\)):
deoutlier_patent_opp <- subset(patent_opposition, nclaims <= 60 & ncit <= 15)
not_biopharm_patent_opp <- subset(deoutlier_patent_opp, biopharm == 0)
summary(not_biopharm_patent_opp)
opp biopharm ustwin patus
Min. :0.000 Mode :logical Mode :logical Mode :logical
1st Qu.:0.000 FALSE:2702 FALSE:719 FALSE:1886
Median :0.000 TRUE :1983 TRUE :816
Mean :0.282
3rd Qu.:1.000
Max. :1.000
patgsgr year ncit ncountry
Mode :logical Min. :1980 Min. : 0.000 Min. : 1.000
FALSE:2118 1st Qu.:1988 1st Qu.: 0.000 1st Qu.: 3.000
TRUE :584 Median :1992 Median : 1.000 Median : 4.000
Mean :1991 Mean : 1.387 Mean : 5.329
3rd Qu.:1994 3rd Qu.: 2.000 3rd Qu.: 6.000
Max. :1997 Max. :14.000 Max. :17.000
nclaims
Min. : 1.00
1st Qu.: 6.00
Median : 9.00
Mean :10.84
3rd Qu.:13.00
Max. :60.00
We then fit the logistic regression model:
\[ P(\text{opp}_{i} = 1) = \frac{\exp(\eta_{i})}{1 + \exp(\eta_{i})} \]
\[ \eta_{i} = \beta_{0} + \beta_{1} \cdot \text{year}_{i} + \beta_{2} \cdot \text{ncit}_{i} + \beta_{3} \cdot \text{nclaims}_{i} + \beta_{4} \cdot \text{ustwin}_{i} + \beta_{5} \cdot \text{patus}_{i} + \beta_{6} \cdot \text{patgsgr}_{i} + \beta_{7} \cdot \text{ncountry}_{i} \]
pat_opp_model <- glm(opp ~ year + ncit + nclaims + ustwin + patus + patgsgr + ncountry, data = not_biopharm_patent_opp, family = binomial)
summary(pat_opp_model)
Call:
glm(formula = opp ~ year + ncit + nclaims + ustwin + patus +
patgsgr + ncountry, family = binomial, data = not_biopharm_patent_opp)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 201.739764 22.321279 9.038 < 2e-16 ***
year -0.102127 0.011217 -9.105 < 2e-16 ***
ncit 0.113558 0.022304 5.091 3.55e-07 ***
nclaims 0.026575 0.005922 4.488 7.20e-06 ***
ustwinTRUE -0.402600 0.099963 -4.028 5.64e-05 ***
patusTRUE -0.526343 0.112599 -4.675 2.95e-06 ***
patgsgrTRUE 0.196654 0.117388 1.675 0.0939 .
ncountry 0.097517 0.014897 6.546 5.91e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3214.5 on 2701 degrees of freedom
Residual deviance: 2977.1 on 2694 degrees of freedom
AIC: 2993.1
Number of Fisher Scoring iterations: 4
We then generate the table of odd ratios:
pat_opp_odd_ratios <- cbind(pat_opp_model$coefficients, 0)
colnames(pat_opp_odd_ratios) <- c('Estimated Coefficients', 'Estimated Odd Ratios')
pat_opp_odd_ratios[, 2] <- exp(pat_opp_odd_ratios[, 1])
pat_opp_odd_ratios[1, 2] <- NA
pat_opp_odd_ratios
Estimated Coefficients Estimated Odd Ratios
(Intercept) 201.73976353 NA
year -0.10212685 0.9029150
ncit 0.11355777 1.1202566
nclaims 0.02657537 1.0269316
ustwinTRUE -0.40260010 0.6685794
patusTRUE -0.52634315 0.5907614
patgsgrTRUE 0.19665401 1.2173228
ncountry 0.09751730 1.1024305
Then we repeeat the process with the \(\text{ncountry}\) variable as a polynomial:
\[ \begin{matrix} \eta_{i} = & \beta_{0} + \beta_{1} \cdot \text{year}_{i} + \beta_{2} \cdot \text{ncit}_{i} + \beta_{3} \cdot \text{nclaims}_{i} + \beta_{4} \cdot \text{ustwin}_{i} + \beta_{5} \cdot \text{patus}_{i} + \beta_{6} \cdot \text{patgsgr}_{i} \\ & + \beta_{7} \cdot \text{ncountry}_{i} + \beta_{8} \cdot \text{ncountry}_{i}^{2} + \beta_{9} \cdot \text{ncountry}_{i}^{3} \end{matrix} \]
pat_opp_poly_model <- glm(opp ~ year + ncit + nclaims + ustwin + patus + patgsgr + ncountry + I(ncountry ^ 2) + I(ncountry ^ 3), data = not_biopharm_patent_opp, family = binomial)
summary(pat_opp_poly_model)
Call:
glm(formula = opp ~ year + ncit + nclaims + ustwin + patus +
patgsgr + ncountry + I(ncountry^2) + I(ncountry^3), family = binomial,
data = not_biopharm_patent_opp)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.981e+02 2.274e+01 8.713 < 2e-16 ***
year -1.006e-01 1.141e-02 -8.821 < 2e-16 ***
ncit 1.134e-01 2.231e-02 5.085 3.68e-07 ***
nclaims 2.639e-02 5.936e-03 4.445 8.78e-06 ***
ustwinTRUE -4.090e-01 1.001e-01 -4.085 4.40e-05 ***
patusTRUE -5.393e-01 1.131e-01 -4.769 1.85e-06 ***
patgsgrTRUE 1.805e-01 1.191e-01 1.516 0.130
ncountry 3.938e-01 1.836e-01 2.145 0.032 *
I(ncountry^2) -3.782e-02 2.387e-02 -1.584 0.113
I(ncountry^3) 1.404e-03 9.373e-04 1.498 0.134
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3214.5 on 2701 degrees of freedom
Residual deviance: 2974.5 on 2692 degrees of freedom
AIC: 2994.5
Number of Fisher Scoring iterations: 4
par(mfrow = c(1, 2))
beta <- pat_opp_model$coefficients
constant <- beta[1] +
beta[2] * mean(not_biopharm_patent_opp$year) +
beta[3] * mean(not_biopharm_patent_opp$ncit) +
beta[4] * mean(not_biopharm_patent_opp$nclaims) +
beta[5] * mean(as.numeric(not_biopharm_patent_opp$ustwin)) +
beta[6] * mean(as.numeric(not_biopharm_patent_opp$patus)) +
beta[7] * mean(as.numeric(not_biopharm_patent_opp$patgsgr))
beta_poly <- pat_opp_poly_model$coefficients
poly_constant <- beta_poly[1] +
beta_poly[2] * mean(not_biopharm_patent_opp$year) +
beta_poly[3] * mean(not_biopharm_patent_opp$ncit) +
beta_poly[4] * mean(not_biopharm_patent_opp$nclaims) +
beta_poly[5] * mean(as.numeric(not_biopharm_patent_opp$ustwin)) +
beta_poly[6] * mean(as.numeric(not_biopharm_patent_opp$patus)) +
beta_poly[7] * mean(as.numeric(not_biopharm_patent_opp$patgsgr))
ncountry_seq <- seq(min(not_biopharm_patent_opp$ncountry), max(not_biopharm_patent_opp$ncountry), length.out = 100)
effect <- constant + beta[8] * ncountry_seq
poly_effect <- poly_constant + beta_poly[8] * ncountry_seq + beta_poly[9] * ncountry_seq ^ 2 + beta_poly[10] * ncountry_seq ^ 3
plot(
NA,
xlim = range(not_biopharm_patent_opp$ncountry),
ylim = range(poly_effect),
xlab = 'number of designated states for the patent',
ylab = 'estimated effect'
)
lines(ncountry_seq, poly_effect)
lines(ncountry_seq, effect, col = 'red', lwd = 2)
legend(
"topleft",
legend = c("polynomial", "linear"),
col = c("black", "red"),
lwd = 2
)
poly_prob <- exp(poly_effect) / (1 + exp(poly_effect))
plot(
NA,
xlim = range(not_biopharm_patent_opp$ncountry),
ylim = range(poly_prob),
xlab = 'number of designated states for the patent',
ylab = 'estimated probability'
)
lines(ncountry_seq, poly_prob)
prob <- exp(effect) / (1 + exp(effect))
lines(ncountry_seq, prob, col = 'red', lwd = 2)
legend(
"topleft",
legend = c("polynomial", "linear"),
col = c("black", "red"),
lwd = 2
)
We import the Hormone Therapy with Rats dataset from the oficial page of the dataset, as the other example:
rats_url <- "https://www.uni-goettingen.de/de/document/download/3ac5a038b19297fb8609dfc28319cbb7.raw/rats.raw"
rats_dataset <- read.table(
url(rats_url),
header = 1,
colClasses = c(
"factor", "factor", "numeric",
"integer", "numeric", "numeric",
"factor", "factor", "factor",
"numeric", "numeric", "numeric"
)
)
rats_dataset$low <- rats_dataset$low == 1
rats_dataset$high <- rats_dataset$high == 1
rats_dataset$control <- rats_dataset$control == 1
summary(rats_dataset)
subject group response time transf_time
1 : 7 1:126 Min. :66.60 Min. : 50 Min. :0.4055
10 : 7 2:119 1st Qu.:74.62 1st Qu.: 60 1st Qu.:0.9163
11 : 7 3:105 Median :77.88 Median : 80 Median :1.5041
12 : 7 Mean :77.53 Mean : 80 Mean :1.3814
13 : 7 3rd Qu.:81.02 3rd Qu.:100 3rd Qu.:1.8718
14 : 7 Max. :86.83 Max. :110 Max. :2.0149
(Other):308 NA's :98
eff low high control
Min. :-0.90260 Mode :logical Mode :logical Mode :logical
1st Qu.:-0.90260 FALSE:224 FALSE:231 FALSE:245
Median : 0.17801 TRUE :126 TRUE :119 TRUE :105
Mean :-0.00475
3rd Qu.: 0.72833
Max. : 0.72833
NA's :98
lowtime hightime controltime
Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.4973 Mean :0.4697 Mean :0.4144
3rd Qu.:1.2528 3rd Qu.:0.9163 3rd Qu.:0.9163
Max. :2.0149 Max. :2.0149 Max. :2.0149
We plot the timeseries of the rats divided in the three groups: control, low and high dose:
par(mfrow = c(3, 1))
control_rats <- subset(rats_dataset, control)
low_rats <- subset(rats_dataset, low)
high_rats <- subset(rats_dataset, high)
library(viridis)
Loading required package: viridisLite
plot_rats_subsets <- function(rats_subset, title) {
plot(
NA,
xlim = range(rats_dataset$time),
ylim = range(na.omit(rats_dataset$response)),
xlab = "age in days",
ylab = "response",
main = title
)
subjects <- unique(rats_subset$subject)
for (i in seq_along(subjects)) {
colors <- length(subjects)
lines(
subset(rats_subset, subject == subjects[i])$time,
subset(rats_subset, subject == subjects[i])$response,
col = viridis(colors)[i]
)
}
}
plot_rats_subsets(control_rats, "control group")
plot_rats_subsets(low_rats, "low dose")
plot_rats_subsets(high_rats, "high dose")
We add the transformed age:
\[ t = \log(1 + (\text{age} - 45) / 10) \]
rats_dataset$t <- log(1 + (rats_dataset$time - 45) / 10)
rats_dataset$C <- as.integer(rats_dataset$control)
rats_dataset$L <- as.integer(rats_dataset$low)
rats_dataset$H <- as.integer(rats_dataset$high)
summary(rats_dataset)
subject group response time transf_time
1 : 7 1:126 Min. :66.60 Min. : 50 Min. :0.4055
10 : 7 2:119 1st Qu.:74.62 1st Qu.: 60 1st Qu.:0.9163
11 : 7 3:105 Median :77.88 Median : 80 Median :1.5041
12 : 7 Mean :77.53 Mean : 80 Mean :1.3814
13 : 7 3rd Qu.:81.02 3rd Qu.:100 3rd Qu.:1.8718
14 : 7 Max. :86.83 Max. :110 Max. :2.0149
(Other):308 NA's :98
eff low high control
Min. :-0.90260 Mode :logical Mode :logical Mode :logical
1st Qu.:-0.90260 FALSE:224 FALSE:231 FALSE:245
Median : 0.17801 TRUE :126 TRUE :119 TRUE :105
Mean :-0.00475
3rd Qu.: 0.72833
Max. : 0.72833
NA's :98
lowtime hightime controltime t
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.4055
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.9163
Median :0.0000 Median :0.0000 Median :0.0000 Median :1.5041
Mean :0.4973 Mean :0.4697 Mean :0.4144 Mean :1.3814
3rd Qu.:1.2528 3rd Qu.:0.9163 3rd Qu.:0.9163 3rd Qu.:1.8718
Max. :2.0149 Max. :2.0149 Max. :2.0149 Max. :2.0149
C L H
Min. :0.0 Min. :0.00 Min. :0.00
1st Qu.:0.0 1st Qu.:0.00 1st Qu.:0.00
Median :0.0 Median :0.00 Median :0.00
Mean :0.3 Mean :0.36 Mean :0.34
3rd Qu.:1.0 3rd Qu.:1.00 3rd Qu.:1.00
Max. :1.0 Max. :1.00 Max. :1.00
And we fit the model:
\[ y_{ij} = \beta_{0} + \gamma_{0i} + \beta_{1} L_{i} \cdot t_{ij} + \beta_{2} H_{i} \cdot t_{ij} + \beta_{3} C_{i} \cdot t_{ij} + \gamma_{1i} \cdot t_{ij} + \epsilon_{ij} \]
With \(\epsilon_{ij} \sim N(0, \sigma^2)\) as in all these examples. And \(\gamma_{0i}\) and \(\gamma_{1i}\) are random effects associated with the subjects following:
\[ \gamma_{0i} \sim N(0, \tau_{0}^2) \wedge \gamma_{1i} \sim N(0, \tau_{1}^2) \]
library(lme4)
Loading required package: Matrix
rats_model <- lmer(
response ~ L:t + H:t + C:t + (1 + t | subject), data = rats_dataset
)
boundary (singular) fit: see help('isSingular')
summary(rats_model)
Linear mixed model fit by REML ['lmerMod']
Formula: response ~ L:t + H:t + C:t + (1 + t | subject)
Data: rats_dataset
REML criterion at convergence: 932.4
Scaled residuals:
Min 1Q Median 3Q Max
-2.25589 -0.66674 -0.01327 0.58310 2.89228
Random effects:
Groups Name Variance Std.Dev. Corr
subject (Intercept) 3.429240 1.85182
t 0.001081 0.03288 1.00
Residual 1.444602 1.20192
Number of obs: 252, groups: subject, 50
Fixed effects:
Estimate Std. Error t value
(Intercept) 68.6061 0.3272 209.70
L:t 7.5045 0.2274 32.99
t:H 6.8750 0.2301 29.88
t:C 7.3178 0.2836 25.80
Correlation of Fixed Effects:
(Intr) L:t t:H
L:t -0.336
t:H -0.325 0.109
t:C -0.316 0.106 0.103
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')
And the simpler model eliminating the subject-specific term \(\tau_{1i} \cdot t_{ij}\):
\[ y_{ij} = \beta_{0} + \gamma_{0i} + \beta_{1} L_{i} \cdot t_{ij} + \beta_{2} H_{i} \cdot t_{ij} + \beta_{3} C_{i} \cdot t_{ij} + \epsilon_{ij} \]
simple_rats_model <- lmer(
response ~ L:t + H:t + C:t + (1 | subject), data = rats_dataset
)
summary(simple_rats_model)
Linear mixed model fit by REML ['lmerMod']
Formula: response ~ L:t + H:t + C:t + (1 | subject)
Data: rats_dataset
REML criterion at convergence: 932.4
Scaled residuals:
Min 1Q Median 3Q Max
-2.25575 -0.65899 -0.01164 0.58358 2.88309
Random effects:
Groups Name Variance Std.Dev.
subject (Intercept) 3.565 1.888
Residual 1.445 1.202
Number of obs: 252, groups: subject, 50
Fixed effects:
Estimate Std. Error t value
(Intercept) 68.6074 0.3312 207.13
L:t 7.5069 0.2252 33.34
t:H 6.8711 0.2276 30.19
t:C 7.3139 0.2808 26.05
Correlation of Fixed Effects:
(Intr) L:t t:H
L:t -0.351
t:H -0.340 0.119
t:C -0.327 0.115 0.111
We then plot a kernel density estimator with a normal distribution for those values. Alongside a normal quantity plot:
par(mfrow = c(1, 2))
tau_0i <- ranef(rats_model)$subject[, 1]
kde <- density(tau_0i)
tau_seq <- seq(min(kde$x), max(kde$x), length.out = 100)
plot(
x = tau_seq,
y = dnorm(tau_seq, mean = mean(tau_0i), sd = sd(tau_0i)),
type = 'l',
main = 'random intercept',
xlab = 'random intercept',
ylab = 'density',
col = 'red',
lwd = 2
)
lines(
x = kde$x,
y = kde$y,
col = 'black',
lwd = 2
)
legend(
'topleft',
legend = c('kde', 'adapted normal'),
col = c('black', 'red'),
lwd = 2
)
qqnorm(
tau_0i,
main = 'random intercept',
xlab = 'quantiles of normal distribution',
ylab = 'random intercept',
pch = 19,
col = "darkblue"
)
# Add reference line
qqline(
tau_0i,
col = "black",
lwd = 2
)
Import data from script:
source("import_data/malnutrition_zambia.R")
To be continued…