On Nov 4th, 2022 the Agency Fund team got together and reviewed submissions to our second cohort of the Social Impact Fellowship. Submissions were collected via an Airtable form, and Airtable exposes an API, so we can pull the data directly into R for analysis.
library(tidyverse)
library(rairtable)
# https://airtable.com/appViLze3t4A0sY1M/tblHb9zc9nhzTjloG/viwR4yRZehqdL1iwo
rairtable::set_airtable_api_key(read_lines("at_key_af.txt"))
✔ AIRTABLE_API_KEY set for current session. To install your API key for use in future sessions, run this function with `install = TRUE`.
submissions_at <- airtable("tblHb9zc9nhzTjloG", "appViLze3t4A0sY1M")
submissions <- submissions_at %>% read_airtable()
Warning: This data may contain 'user' field types. This type is currently unsupported in `insert_records` and `update_records`
reviews <- submissions %>%
  # keep only submissions that have reviewers assigned
  filter(map(Reviewer, length) != 0) %>%
  select(Name, Reviewer, contains(c("[VC]", "[Tech]"))) %>%
  select(-contains("Rationale")) %>%
  mutate(
    # anonymize submissions with a short hash of the name
    `Name` = str_sub(map_chr(Name, digest::sha1), 1, 5),
    `Reviewer [VC]` = map_chr(Reviewer, ~ .x[[1]]),
    `Reviewer [Tech]` = map_chr(Reviewer, ~ .x[[2]])) %>%
  select(-Reviewer) %>%
  select(Name, starts_with("Reviewer"), everything()) %>%
  # keep only rows where both Rubric Decisions were recorded
  filter(if_all(starts_with("Rubric"), ~ !is.na(.x))) %>%
  # convert criteria strings to their leading digit and rubric decisions to a 0-2 score
  mutate(across(-starts_with(c("Name", "Reviewer", "Rubric")), ~ as.integer(str_sub(.x, 1, 1))),
         across(starts_with("Rubric"), ~ str_sub(.x, 6, 6)),
         across(starts_with("Rubric"), ~ abs(map_int(.x, function(.y) which(.y == LETTERS)) - 3)))
reviews %>%
  mutate(across(where(is.character), as.factor)) %>%
  summary()
Name Reviewer [VC] Reviewer [Tech] Rubric Decision [VC] Track Record [VC]
45a3d : 2 Albert : 1 Albert :120 Min. :0.0000 Min. :1.000
902b3 : 2 Aras :126 Jake :124 1st Qu.:0.0000 1st Qu.:2.000
932bb : 2 Richard:118 Richard: 1 Median :0.0000 Median :2.000
cf5f4 : 2 Temina :118 Rob :118 Mean :0.2424 Mean :2.223
00088 : 1 3rd Qu.:0.0000 3rd Qu.:3.000
02937 : 1 Max. :2.0000 Max. :4.000
(Other):353 NA's :179
Adaptability [VC] Hustle [VC] Idea + Impact [VC] Rubric Decision [Tech]
Min. :1.000 Min. :1.000 Min. :1.000 Min. :0.0000
1st Qu.:1.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:0.0000
Median :2.000 Median :2.000 Median :2.000 Median :0.0000
Mean :1.984 Mean :2.192 Mean :1.995 Mean :0.2534
3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:0.0000
Max. :4.000 Max. :4.000 Max. :5.000 Max. :2.0000
NA's :181 NA's :181 NA's :176
Track Record [Tech] Idea + Impact [Tech] Adaptability [Tech] Hustle [Tech]
Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
1st Qu.:2.000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:2.000
Median :2.000 Median :2.000 Median :2.000 Median :2.000
Mean :2.038 Mean :1.809 Mean :1.927 Mean :2.206
3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:3.000
Max. :4.000 Max. :4.000 Max. :3.000 Max. :3.000
NA's :44 NA's :54 NA's :131 NA's :120
The Rubric Decision is the most important outcome: it summarizes how each reviewer felt about a submission.
Let’s first see if everyone scored their candidates.
submissions %>%
  group_by(Reviewer) %>%
  summarise(total = n(),
            scored_tech = sum(!is.na(`Rubric Decision [Tech]`)),
            scored_vc = sum(!is.na(`Rubric Decision [VC]`))) %>%
  unnest(Reviewer) %>%
  mutate(scored = if_else(row_number() %% 2 == 0, scored_tech, scored_vc)) %>%
  select(-starts_with("scored_")) %>%
  group_by(Reviewer) %>%
  summarise(across(everything(), sum))
Gold stars all around. Good job team.
Our review process had two reviewers for each submission: one from the “VC” side of our staff (Temina, Richard, Aras) and another from our Tech team (Rob, Albert, Jake). This lets us check for some consistency between the two groups.
reviews %>%
  mutate(`Rubric Difference` = abs(`Rubric Decision [VC]` - `Rubric Decision [Tech]`)) %>%
  group_by(across(starts_with("Rubric"))) %>%
  count() %>%
  group_by(`Rubric Decision [VC]`) %>%
  mutate(p = round(n / sum(n), 2)) %>%
  group_by(`Rubric Decision [VC]`, `Rubric Difference`) %>%
  mutate(`Decision VC-Tech` = str_c(`Rubric Decision [VC]`,
                                    `Rubric Decision [Tech]`, sep = "-")) %>%
  ggplot(aes(`Rubric Decision [VC]`, p, color = `Decision VC-Tech`,
             fill = `Rubric Difference` == 0)) +
  geom_col(size = 2)
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
Please use `linewidth` instead.
There was fairly broad consensus about the worst proposals, but the best (according to the VCs) drew some disagreement from the Tech side.
Also, this might be the worst visualization I’ve ever made (says a lot); open to ideas for improvement.
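One possible alternative (a sketch I haven't run against the live data): a tile plot of the counts for each VC/Tech decision pair, which avoids stacking everything into a single bar.

# Count of submissions by VC x Tech decision pair, shown as a heatmap
reviews %>%
  count(`Rubric Decision [VC]`, `Rubric Decision [Tech]`) %>%
  ggplot(aes(factor(`Rubric Decision [VC]`), factor(`Rubric Decision [Tech]`), fill = n)) +
  geom_tile() +
  geom_text(aes(label = n), colour = "white") +
  labs(x = "Rubric Decision [VC]", y = "Rubric Decision [Tech]")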
A basic listing of the individual submissions with the greatest discrepancy, to flip through visually.
Ordered from least agreement to most agreement.
reviews %>%
  mutate(`Rubric Difference` = abs(`Rubric Decision [VC]` - `Rubric Decision [Tech]`),
         `Name` = str_sub(map_chr(Name, digest::sha1), 1, 5)) %>%
  arrange(desc(`Rubric Difference`)) %>%
  DT::datatable()
Let’s put some actual numbers on this: a correlation coefficient for each pair of VC+Tech reviewers.
# mutate(Reviewers = str_c(`Reviewer [VC]`, `Reviewer [Tech]`, sep = "+")) %>%
# select(Reviewers, starts_with("Rubric")) %>%
reviews %>%
  select(starts_with(c("Reviewer", "Rubric"))) %>%
  group_by(across(starts_with("Reviewer"))) %>%
  summarise(rho = round(cor(`Rubric Decision [VC]`, `Rubric Decision [Tech]`), 2),
            .groups = "drop") %>%
  arrange(desc(rho)) %>%
  # DT::datatable() %>%
  ggplot(aes(`Reviewer [VC]`, `Reviewer [Tech]`, color = rho, size = rho)) +
  geom_point()
This could be a good opportunity for small dot pairs to talk about their differences :)
Aras is agreeable overall, Temina less so with Jake, and Richard less so with Rob.
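Since the decisions are ordinal (0-2), Pearson correlation is a bit of a stretch; a rank-based version would be a quick robustness check (a sketch reusing the pipeline above, not run here):

# Same pairwise summary, but with Spearman's rank correlation
reviews %>%
  select(starts_with(c("Reviewer", "Rubric"))) %>%
  group_by(across(starts_with("Reviewer"))) %>%
  summarise(rho_spearman = round(cor(`Rubric Decision [VC]`, `Rubric Decision [Tech]`,
                                     method = "spearman"), 2),
            .groups = "drop") %>%
  arrange(desc(rho_spearman))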
Criteria are the set of measures we used to judge whether a submission was strong overall. In theory, these scores could directly determine whether a submission leads to an interview.
Since the criteria were not the actionable determinant of whether a submission was invited for an interview, they weren’t completed as regularly. Let’s see who completed what:
reviews %>%
  select(contains("[VC]")) %>%
  rename_with(~ str_remove(.x, " \\[VC\\]")) %>%
  bind_rows(
    reviews %>%
      select(contains("[Tech]")) %>%
      rename_with(~ str_remove(.x, " \\[Tech\\]"))) %>%
  group_by(Reviewer) %>%
  summarise(across(everything(), ~ round(mean(!is.na(.x)), 2))) %>%
  pivot_longer(-Reviewer, names_to = "Criteria") %>%
  ggplot(aes(Reviewer, value, fill = Criteria)) +
  geom_col() + facet_wrap(~ Criteria) + coord_flip() +
  theme(legend.position = "none")
Most reviewers were consistent in filling out every criterion for each submission, apart from Albert, who focused on Idea + Impact and Track Record vs. Adaptability and Hustle. Generally, it seemed that those with the most experience reviewing submissions filled out these Criteria least often. This will lead to some bias from the missing scores, so it’s something to keep in mind.
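One rough way to gauge that bias (a sketch, not something I've run here) is to compare the average Rubric Decision between submissions where the criteria are complete and those where some are missing:

# Are criteria skipped more often for weaker (or stronger) submissions?
reviews %>%
  mutate(n_missing = rowSums(is.na(across(contains(c("Track", "Idea", "Adaptability", "Hustle")))))) %>%
  group_by(any_criteria_missing = n_missing > 0) %>%
  summarise(n = n(),
            mean_decision_vc = mean(`Rubric Decision [VC]`),
            mean_decision_tech = mean(`Rubric Decision [Tech]`))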
Here we look at the average score for each Criteria from each reviewer. Because there’s some missingness in the scoring, we might expect some of these means to be biased towards higher scores, assuming that some reviewers didn’t bother to score the less appealing submissions.
reviews %>%
  select(where(is.numeric), starts_with("Reviewer")) %>%
  pivot_longer(where(is.numeric),
               names_pattern = "([^\\]]*)\\[([^\\]]*)\\]",
               names_to = c("Criteria", "type")) %>%
  mutate(Reviewer = if_else(type == "VC", `Reviewer [VC]`, `Reviewer [Tech]`)) %>%
  select(-`Reviewer [VC]`, -`Reviewer [Tech]`) %>%
  group_by(Reviewer, Criteria) %>%
  summarise(mean = mean(value, na.rm = TRUE),
            `non-missing` = mean(!is.na(value)),
            .groups = "drop") %>%
  ggplot(aes(Reviewer, mean, fill = Criteria, alpha = `non-missing`)) +
  geom_col() + facet_wrap(~ Criteria, scales = "free_x") +
  coord_flip() +
  theme(legend.position = "none")
We’d like to see how much variation in the submission decision is explained by these Criteria.
reviews %>%
  select(contains("[VC]")) %>%
  rename_with(~ str_remove(.x, " \\[VC\\]")) %>%
  bind_rows(
    reviews %>%
      select(contains("[Tech]")) %>%
      rename_with(~ str_remove(.x, " \\[Tech\\]"))) %>%
  filter(complete.cases(.)) %>%
  lm(`Rubric Decision` ~ ., data = .) %>%
  summary()
Call:
lm(formula = `Rubric Decision` ~ ., data = .)
Residuals:
Min 1Q Median 3Q Max
-0.84310 -0.29763 -0.02361 0.20132 1.29394
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.66180 0.09425 -7.022 9.61e-12 ***
ReviewerAras -0.20236 0.07894 -2.563 0.01073 *
ReviewerJake -0.09320 0.08003 -1.165 0.24489
ReviewerRichard -0.03951 0.11698 -0.338 0.73572
ReviewerRob -0.37483 0.08547 -4.385 1.49e-05 ***
ReviewerTemina -0.61384 0.10156 -6.044 3.48e-09 ***
Adaptability 0.20485 0.04305 4.759 2.74e-06 ***
Hustle 0.07271 0.04003 1.816 0.07010 .
`Idea + Impact` 0.17834 0.02979 5.986 4.84e-09 ***
`Track Record` 0.12500 0.03618 3.455 0.00061 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3763 on 395 degrees of freedom
Multiple R-squared: 0.5109, Adjusted R-squared: 0.4998
F-statistic: 45.85 on 9 and 395 DF, p-value: < 2.2e-16
Taking into account variation in reviewers and criteria, we’re explaining about half the variation in our Decision scores. Maybe we would explain more if the data were not missing, but we can dive into this for each reviewer:
reviews %>%
  select(contains("[VC]")) %>%
  rename_with(~ str_remove(.x, " \\[VC\\]")) %>%
  bind_rows(
    reviews %>%
      select(contains("[Tech]")) %>%
      rename_with(~ str_remove(.x, " \\[Tech\\]"))) %>%
  filter(complete.cases(.)) %>%
  group_by(Reviewer) %>%
  group_map(~ summary(lm(`Rubric Decision` ~ ., data = .x))) %>%
  set_names("Albert", "Aras", "Jake", "Richard", "Robert", "Temina")
$Albert
Call:
lm(formula = `Rubric Decision` ~ ., data = .x)
Residuals:
Min 1Q Median 3Q Max
-0.69796 -0.08470 0.02262 0.11423 0.47322
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.75523 0.25414 -2.972 0.006828 **
Adaptability 0.13499 0.13702 0.985 0.334781
Hustle 0.05489 0.14617 0.375 0.710731
`Idea + Impact` 0.41494 0.10660 3.892 0.000735 ***
`Track Record` 0.03619 0.11871 0.305 0.763232
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3259 on 23 degrees of freedom
Multiple R-squared: 0.6492, Adjusted R-squared: 0.5882
F-statistic: 10.64 on 4 and 23 DF, p-value: 4.964e-05
$Aras
Call:
lm(formula = `Rubric Decision` ~ ., data = .x)
Residuals:
Min 1Q Median 3Q Max
-0.62489 -0.29794 0.02613 0.25855 1.19944
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.81505 0.10170 -8.014 7.85e-13 ***
Adaptability 0.23529 0.08204 2.868 0.00487 **
Hustle 0.09165 0.06862 1.336 0.18420
`Idea + Impact` 0.14554 0.05879 2.476 0.01468 *
`Track Record` 0.08401 0.06272 1.340 0.18289
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3542 on 121 degrees of freedom
Multiple R-squared: 0.4994, Adjusted R-squared: 0.4828
F-statistic: 30.18 on 4 and 121 DF, p-value: < 2.2e-16
$Jake
Call:
lm(formula = `Rubric Decision` ~ ., data = .x)
Residuals:
Min 1Q Median 3Q Max
-0.75162 -0.27559 -0.00315 0.22335 1.12773
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.82002 0.10903 -7.521 1.17e-11 ***
Adaptability 0.12065 0.07796 1.548 0.1244
Hustle 0.15179 0.08071 1.881 0.0625 .
`Idea + Impact` 0.09774 0.07169 1.363 0.1754
`Track Record` 0.22649 0.07234 3.131 0.0022 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3653 on 118 degrees of freedom
Multiple R-squared: 0.5027, Adjusted R-squared: 0.4859
F-statistic: 29.83 on 4 and 118 DF, p-value: < 2.2e-16
$Richard
Call:
lm(formula = `Rubric Decision` ~ ., data = .x)
Residuals:
Min 1Q Median 3Q Max
-0.54431 -0.17130 -0.01268 0.07384 0.59451
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.579317 0.241414 -2.400 0.0289 *
Adaptability 0.102687 0.160679 0.639 0.5318
Hustle 0.009982 0.160737 0.062 0.9512
`Idea + Impact` 0.234189 0.095810 2.444 0.0265 *
`Track Record` 0.158618 0.145127 1.093 0.2906
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3327 on 16 degrees of freedom
Multiple R-squared: 0.7732, Adjusted R-squared: 0.7165
F-statistic: 13.64 on 4 and 16 DF, p-value: 5.03e-05
$Robert
Call:
lm(formula = `Rubric Decision` ~ ., data = .x)
Residuals:
Min 1Q Median 3Q Max
-0.79440 -0.16044 -0.07909 0.19193 0.99107
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.61114 0.27579 -5.842 1.38e-07 ***
Adaptability 0.57747 0.11589 4.983 4.16e-06 ***
Hustle -0.05649 0.09570 -0.590 0.556860
`Idea + Impact` 0.27101 0.06971 3.888 0.000223 ***
`Track Record` 0.08136 0.08674 0.938 0.351432
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4023 on 72 degrees of freedom
Multiple R-squared: 0.489, Adjusted R-squared: 0.4606
F-statistic: 17.23 on 4 and 72 DF, p-value: 5.917e-10
$Temina
Call:
lm(formula = `Rubric Decision` ~ ., data = .x)
Residuals:
Min 1Q Median 3Q Max
-0.37727 -0.09807 -0.02936 0.02073 1.53131
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.33982 0.27719 -1.226 0.232
Adaptability 0.01166 0.12220 0.095 0.925
Hustle 0.20699 0.14338 1.444 0.161
`Idea + Impact` 0.08182 0.07370 1.110 0.277
`Track Record` -0.12723 0.12883 -0.988 0.333
Residual standard error: 0.35 on 25 degrees of freedom
Multiple R-squared: 0.2079, Adjusted R-squared: 0.08121
F-statistic: 1.641 on 4 and 25 DF, p-value: 0.1953
It looks as though these Criteria values are informing decisions the most for Richard and the least for Temina. Everyone else is within coin-flip territory (~0.5). There’s some variation in what different reviewers found to be most important, but Hustle was never one of them.
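Rather than eyeballing the R-squared values in the printouts above, we could also pull the fit statistics out programmatically (a sketch, assuming the broom package is available):

# Per-reviewer fit statistics from the same reshaped data as above
reviews %>%
  select(contains("[VC]")) %>%
  rename_with(~ str_remove(.x, " \\[VC\\]")) %>%
  bind_rows(
    reviews %>%
      select(contains("[Tech]")) %>%
      rename_with(~ str_remove(.x, " \\[Tech\\]"))) %>%
  filter(complete.cases(.)) %>%
  group_by(Reviewer) %>%
  group_modify(~ broom::glance(lm(`Rubric Decision` ~ ., data = .x))) %>%
  select(Reviewer, r.squared, adj.r.squared, nobs)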
Maybe there’s some similarity between how these Criteria are being scored that makes Hustle appear to be unimportant.
reviews %>%
  select(-starts_with("Reviewer"), -Name) %>%
  pivot_longer(everything(), names_sep = " \\[", names_to = c("Criteria", "Group")) %>%
  pivot_wider(names_from = Criteria, values_from = value) %>%
  slice(2) %>% select(-Group) %>%
  unnest(everything()) %>%
  filter(complete.cases(.)) %>%
  corrr::correlate()
Warning: Values from `value` are not uniquely identified; output will contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
{data} %>%
  dplyr::group_by(Group, Criteria) %>%
  dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
  dplyr::filter(n > 1L)
Correlation computed with
• Method: 'pearson'
• Missing treated using: 'pairwise.complete.obs'
There’s some reasonably strong correlation between Adaptability and Hustle, which could explain why Hustle doesn’t show up in the regressions above.
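To make the collinearity story more concrete, a variance inflation factor check on the pooled model is one option (a sketch, assuming the car package is installed):

# GVIFs near 1 mean the criteria are close to independent; larger values flag overlap
pooled <- reviews %>%
  select(contains("[VC]")) %>%
  rename_with(~ str_remove(.x, " \\[VC\\]")) %>%
  bind_rows(
    reviews %>%
      select(contains("[Tech]")) %>%
      rename_with(~ str_remove(.x, " \\[Tech\\]"))) %>%
  filter(complete.cases(.))
car::vif(lm(`Rubric Decision` ~ ., data = pooled))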
Here’s how much agreement there was for each criteria between pairs of reviewers (requested by Jake!):
reviews %>%
  mutate(Reviewers = str_c(`Reviewer [VC]`, `Reviewer [Tech]`, sep = "+")) %>%
  select(Reviewers, where(is.numeric)) %>%
  filter(complete.cases(.)) %>%
  pivot_longer(where(is.numeric),
               names_pattern = "([^\\]]*)\\[([^\\]]*)\\]",
               names_to = c("Criteria", "type")) %>%
  pivot_wider(names_from = type, values_from = value) %>%
  mutate(rho = map2_dbl(VC, Tech, ~ cor(.x, .y))) %>%
  select(-VC, -Tech) %>%
  arrange(rho) %>%
  DT::datatable()
Finally, we’d like to provide some feedback to applicants. It helps when there are some notes in the Rationale section. Let’s see who helped:
submissions %>%
  unnest(Reviewer) %>%
  select(Reviewer, contains("Rationale")) %>%
  mutate(across(contains("Rationale"), ~ map_int(.x, nchar))) %>%
  pivot_longer(-Reviewer) %>%
  filter((Reviewer %in% c("Aras", "Temina", "Richard") & str_detect(name, "VC")) |
           (Reviewer %in% c("Rob", "Jake", "Albert") & str_detect(name, "Tech"))) %>%
  mutate(value = replace_na(value, 0)) %>%
  group_by(Reviewer) %>%
  summarise(num_characters = sum(value)) %>%
  arrange(desc(num_characters))
If we believe our rationales to be good explainers of feedback, I can imagine, at a minimum, automating an email that looks at which Criteria a candidate scored well on (ideally with notes) and which they scored less well on (again, ideally with notes), and sends that feedback to the candidate.
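As a rough sketch of what that might look like (hypothetical wording, using the Tech-side criteria columns from above; the actual email delivery would still need real work):

# Hypothetical: draft one feedback line per submission from its strongest and
# weakest Tech-side criteria
reviews %>%
  pivot_longer(contains("[Tech]") & !contains(c("Reviewer", "Rubric")),
               names_to = "Criteria", values_to = "score") %>%
  mutate(Criteria = str_remove(Criteria, " \\[Tech\\]")) %>%
  filter(!is.na(score)) %>%
  group_by(Name) %>%
  summarise(strongest = Criteria[which.max(score)],
            weakest = Criteria[which.min(score)],
            .groups = "drop") %>%
  mutate(feedback = str_glue(
    "Reviewers rated your {strongest} highly; {weakest} was the main area for growth."))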