library(tidyverse)
library(rairtable)
library(keras)
library(tfhub)

TLDR;

The notebook below demonstrates using NLP and deep learning to predict the probability that a proposal is declined. The model uses only the information submitted through the website and processes these texts to produce a score, trained on reviewer declination labels from previous rounds of proposals. It reaches a ROC AUC of 0.84 on held-out proposals, which suggests solid predictive performance. In the Analysis section below we demonstrate the empirical results of two use cases: (1) ordering proposals for review when reviewer resources are limited, and (2) highlighting declined proposals that had low predicted declination scores to ensure we did not mistakenly decline a worthy proposal.

Background

Funders seek to fund promising ideas, and doing this well is their core competency. Selecting which proposals get funded is challenging because indicators of a successful program more often rely on retrospective storytelling than on measured, objective impact.

When proposals are submitted to a funder, they go through a very human process of review and selection. In our case, a number of reviewers go through a series of review rounds with increasing scrutiny and attention and a decreasing number of proposals that make it to successive rounds.

The first round has the greatest number of proposals, and with an Open Call the proposals can vary quite a lot. Based on what is written in the proposal, some of these decisions can be made fairly quickly without much scrutiny. It is also hard to predict how many proposals will be submitted, and at times the volume can overwhelm the resources available to review them.

The notebook below attempts to use a history of labeled proposals and their textual content to produce a predictive algorithm that rates proposal quality. This quality score can be used to prioritize proposals when reviewer resources are constrained, or as a secondary robustness check to ensure high-scoring proposals are not missed due to reviewer fatigue.

Data prep

We’re using data from an Airtable that holds all the proposals submitted through the website.

Reading data from Airtable

Use the Airtable API to read from our submissions table:

rairtable::set_airtable_api_key(read_lines("at_key.txt"))
✔ AIRTABLE_API_KEY set for current session. To install your API key for use in future sessions, run this function with `install = TRUE`.
submissions_at <- airtable("tblheGxoCuoMOFC6K", "appSz3aSZX0uVXKbV")
submissions <- submissions_at %>% read_airtable()
dim(submissions)
[1] 933  83

A total of 933 submissions and 83 columns.

tibble(
  name = submissions %>% names(),
  type = sapply(submissions, class)) %>%
  DT::datatable()

Only some of these columns are derived from the original proposal submission; others are added or labeled by the reviewers. It’s important that we only consider the original content of the proposal, along with the prediction target, which is the key reviewer outcome we wish to estimate. The reviewer-added fields are likely to be related to that outcome and would make the algorithm appear more predictive than it really is.

Key outcomes

The key outcomes we’re focused on predicting are the decisions made about the proposal, which are ultimately captured in the Round 1 Outcome and Round 2 Outcome fields. There are more proposals in Round 1, and likely less nuance goes into decisions in this round compared to later ones. There’s a trade-off here between having more balanced labels (Round 1) and having a clearer, ultimately more important signal of success (Round 2). The nuance in Round 2 decision making may also be harder to capture with our features, since a good amount of extra research outside of the proposal content can go into it.

Let’s first have a look at the values in these two columns:

submissions %>%
  unnest(Call) %>%
  select(Call, ends_with("Outcome")) %>%
  pivot_longer(-Call, names_to = "round", values_to = "status") %>%
  group_by(Call, round, status) %>%
  count(name = "count") %>%
  ungroup() %>%
  # filter(round == "Round 1 Outcome") %>%
  # select(-round) %>%
  arrange(Call, desc(count)) %>%
  group_by(Call) %>%
  mutate(pct = count / sum(count),
         total = sum(count),
         Call = str_remove(Call, c(" and Mentoring")),
         Call = str_glue("{Call} ({total})")) %>%
  ggplot(aes(Call, pct, fill = status, label = count)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ round, ncol = 1)

NA values are always tricky to deal with. What do they mean in each round?

submissions %>%
  group_by(across(ends_with("Outcome"))) %>%
  count() %>% ungroup() %>%
  filter(is.na(`Round 1 Outcome`) | is.na(`Round 2 Outcome`))

An NA value in Round 1 directly translates to an NA value in Round 2, but an NA value in Round 2 can correspond to a variety of Round 1 values. We’re unsure whether the NA values in Round 2 map to a successful proposal or mean that the proposal didn’t succeed in Round 1 or Round 2.

submissions_outcomes <- submissions %>%
  mutate(
    r1_declined = `Round 1 Outcome` == "Decline",
    r2_declined = `Round 2 Outcome` == "Declined")
library(ggalluvial)
submissions_outcomes %>%
  unnest(Call) %>%
  mutate(Call = Call,
         across(ends_with(" Outcome"), replace_na, "NA"), .keep = "used") %>%
  filter(!`Round 1 Outcome` %in% c("Decline", "NA")) %>%
  group_by(across(everything())) %>%
  count() %>% ungroup() %>%
  mutate(Accepted = `Round 2 Outcome` != "Declined") %>%
  ggplot(aes(y = n, axis1 = `Round 1 Outcome`, axis2 = `Round 2 Outcome`)) +
  geom_alluvium(aes(fill = Accepted), width = 1/12) +
  geom_stratum(width = 1/12, fill = "black", color = "grey") +
  geom_label(stat = "stratum", aes(label = after_stat(stratum)), size = 2) +
  scale_x_discrete(limits = c("Round 1 Outcome", "Round 2 Outcome"), expand = c(.05, .05)) +
  scale_fill_brewer(type = "qual", palette = "Set1") +
  facet_wrap(~ Call, scales = "free")

Constructing data for NLP model

We’re treating this as a text classification problem. Most columns are text based, with the exception of a few columns that we convert to text below.

sub_text_all <- submissions_outcomes %>%
  # turn lists with texts into a text column
  # for anything else we just get the count
  mutate(across(where(is.list), function(any_list) {
    if (class(any_list[[1]]) == "character") {
      map_chr(any_list, ~ str_c(.x, collapse = " "))
    } else {
      map_chr(any_list, ~ str_c(cur_column(), length(.x)))
    }
  })) %>%
  select(airtable_record_id,
         `Your Name`,
         `Online profile`,
         Attachments,
         `Email of primary contact`,
         `Applicant organization`,
         Call,
         Progress,
         Activities,
         `Outputs & Outcomes`,
         Superpowers,
         `IP Waiver`,
         `Idea & TOC`,
         Evidence,
         `Primary Residence Country`,
         `Funding requested`,
         `Implementation Country`,
         `Incorporation Country`,
         `Executive Summary`,
         ends_with("_declined")) %>%
  select(airtable_record_id, ends_with("_declined"), where(is.character)) %>%
  unite(text, -c(airtable_record_id, ends_with("_declined")), sep = " ")

A count of declined and not-declined submissions at each of the two rounds:

sub_text_all %>%
  group_by(across(ends_with("_declined"))) %>%
  count()

What’s with the proposal that was declined in Round 1 but not declined in Round 2?

sub_text_all %>%
  filter(r1_declined & !r2_declined) %>%
  select(-text) %>%
  inner_join(submissions) %>%
  select(ends_with("Outcome"))
Joining, by = "airtable_record_id"
sub_text_r1 <- sub_text_all %>%
  filter(!is.na(r1_declined))

sub_text_r2 <- sub_text_all %>%
  filter(!is.na(r2_declined))

Can a proposal be declined in Round 1 and then referred to fellowship in Round 2?

Model building

Break up our proposals into train and test sets:

ind_train <- sample.int(nrow(sub_text_r1), 0.8*nrow(sub_text_r1))
train <- sub_text_r1[ind_train,]
test <- sub_text_r1[-ind_train,]
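The split above is unseeded, so it will differ from run to run; a minimal sketch of a reproducible version (the seed value is arbitrary, and the results reported below came from an unseeded split):

set.seed(2022)  # arbitrary seed, purely for reproducibility
ind_train <- sample.int(nrow(sub_text_r1), 0.8 * nrow(sub_text_r1))
train <- sub_text_r1[ind_train, ]
test <- sub_text_r1[-ind_train, ]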

In a previous version we used the simple text_vectorization layer supplied by Keras to transform our text into a tensor. This time we try something more advanced by leveraging a token-based text embedding trained on Google News articles. This is transfer learning: models trained on another corpus are reused as components of new models, as popularized by TensorFlow Hub.

This becomes the first layer of our network, followed by dense ReLU layers of 2048/512/128 units and a final sigmoid unit for the r1_declined classification:

# https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1
embeddings <- layer_hub(
  handle = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1",
  trainable = FALSE
)
input <- layer_input(shape = shape(), dtype = "string")
output <- input %>%
  embeddings() %>%
  layer_dense(units = 2048, activation = "relu") %>%
  layer_dense(units = 512, activation = "relu") %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

model <- keras_model(input, output)

We pay attention to both accuracy and AUC since the class labels are not balanced and accuracy alone can be misleading.

model %>%
  compile(
    loss = "binary_crossentropy",
    optimizer = "adam",
    metrics = c("accuracy", "AUC"))
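To see why accuracy alone is not enough, here is a quick sketch of the majority-class baseline the model needs to beat, computed on the training labels:

# Share of declined proposals in the training split; always predicting the
# majority class already achieves this level of accuracy (or 1 minus it).
p_declined <- mean(train$r1_declined)
c(share_declined = p_declined, majority_baseline = max(p_declined, 1 - p_declined))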

Fitting the model

history <- model %>%
  fit(x = train$text,
      y = as.matrix(train[, "r1_declined"]),
      epochs = 10,
      validation_split = 0.2)
Epoch  1/10 - loss: 0.6508 - accuracy: 0.7908 - auc: 0.5171 - val_loss: 0.4947 - val_accuracy: 0.8369 - val_auc: 0.5853
Epoch  2/10 - loss: 0.3999 - accuracy: 0.8440 - auc: 0.7335 - val_loss: 0.4805 - val_accuracy: 0.8440 - val_auc: 0.6412
Epoch  3/10 - loss: 0.3802 - accuracy: 0.8280 - auc: 0.7863 - val_loss: 0.4168 - val_accuracy: 0.8085 - val_auc: 0.7569
Epoch  4/10 - loss: 0.3597 - accuracy: 0.8369 - auc: 0.8219 - val_loss: 0.4609 - val_accuracy: 0.8369 - val_auc: 0.7254
Epoch  5/10 - loss: 0.3412 - accuracy: 0.8564 - auc: 0.8310 - val_loss: 0.4414 - val_accuracy: 0.8298 - val_auc: 0.8100
Epoch  6/10 - loss: 0.3315 - accuracy: 0.8635 - auc: 0.8465 - val_loss: 0.4335 - val_accuracy: 0.8227 - val_auc: 0.7728
Epoch  7/10 - loss: 0.3116 - accuracy: 0.8599 - auc: 0.8660 - val_loss: 0.5518 - val_accuracy: 0.8440 - val_auc: 0.7461
Epoch  8/10 - loss: 0.3340 - accuracy: 0.8528 - auc: 0.8398 - val_loss: 0.4710 - val_accuracy: 0.8298 - val_auc: 0.7961
Epoch  9/10 - loss: 0.3083 - accuracy: 0.8599 - auc: 0.8703 - val_loss: 0.4634 - val_accuracy: 0.8440 - val_auc: 0.8266
Epoch 10/10 - loss: 0.2950 - accuracy: 0.8475 - auc: 0.8824 - val_loss: 0.4490 - val_accuracy: 0.8298 - val_auc: 0.8040
model %>%
  evaluate(x = test$text, y = as.matrix(test[,"r1_declined"]))

     loss  accuracy       auc 
0.3363103 0.8248588 0.8411111 

An AUC greater than ~0.8 is an admittedly arbitrary industry benchmark for a passable prediction model, and we’re in that ballpark.
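As a cross-check outside of keras, the test AUC can be recomputed directly from the predicted scores; a sketch assuming the pROC package is installed:

library(pROC)
# Recompute ROC AUC on the held-out proposals from the raw predicted scores.
test_scores <- as.vector(predict(model, test$text))
roc_test <- roc(response = as.integer(test$r1_declined), predictor = test_scores)
auc(roc_test)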

plot(history)

We can see that the model stabilized after around 3 epochs, suggesting that more training would not improve performance much.
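Given that, one option (not used above) is to let keras stop training automatically once validation loss stops improving; a sketch with an arbitrary patience value:

# Stop once val_loss has not improved for 3 epochs and keep the best weights.
early_stop <- callback_early_stopping(monitor = "val_loss", patience = 3,
                                      restore_best_weights = TRUE)
# e.g. model %>% fit(x = train$text, y = as.matrix(train[, "r1_declined"]),
#                    epochs = 30, validation_split = 0.2,
#                    callbacks = list(early_stop))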

Prediction check

preds <- test %>%
  bind_cols(tibble(pred_r1_declined = predict(model, test$text))) %>%
  inner_join(submissions, by = "airtable_record_id")

preds %>%
  mutate(r1_declined = as.factor(r1_declined)) %>%
  ggplot(aes(pred_r1_declined, color = r1_declined)) +
  geom_density()

There’s a large mass in the high range for declined proposals, while those that are not declined are more spread out through the distribution, with a mode around 0.4.

Overall we’re seeing higher declination scores for proposals labeled as declined, so that’s a good sign.
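For a rough sense of the errors at a hard cutoff, here is a sketch of a confusion count at an illustrative 0.5 threshold (the threshold is arbitrary and would need tuning):

# Cross-tabulate the actual Round 1 label against the thresholded prediction.
preds %>%
  mutate(pred_class = as.vector(pred_r1_declined) > 0.5) %>%
  count(r1_declined, pred_class)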

Practical predictive diagnostics

We came up with two scenarios earlier that motivated our thinking for how we could use these predictions. The first was motivated by a lack of reviewer resources such that we would prioritize proposals with more promise:

preds %>%
  select(Created, pred_r1_declined, r1_declined) %>%
  mutate(total_accepted = sum(r1_declined == 0)) %>%
  # pred order
  arrange(pred_r1_declined) %>%
  mutate(total_read__pred = row_number(),
         total_accepted__pred = sum(r1_declined == 0),
         cum_accepted__pred = total_read__pred - cumsum(r1_declined == 1),
         missed_accepted__pred = total_accepted - cum_accepted__pred) %>%
  # created order
  arrange(Created) %>%
  mutate(total_read__created = row_number(),
         cum_accepted__created = total_read__created - cumsum(r1_declined == 1),
         missed_accepted__created = total_accepted - cum_accepted__created) %>%
  pivot_longer(starts_with(c("total_read", "missed_accepted")),
               names_sep = "__", names_to = c("name", "method")) %>%
  pivot_wider() %>%
  ggplot(aes(total_read, missed_accepted, color = method)) +
  geom_line()


Reading in predicted order, after about 110 proposals all of the remaining ones would have been declined, so no accepted proposals would have been missed; reading in order of submission time would have left 17 accepted proposals missed at that cutoff.

Prediction and write back

Let’s write these predictions back to the Airtable for human scrutiny. Note that we produce predictions for all submissions, even those that did not have a declined label.

sub_text_all$grants2vec <- model %>% predict(sub_text_all$text) %>% as.vector()
at_update <- sub_text_all %>% select(airtable_record_id, grants2vec) %>%
  column_to_rownames("airtable_record_id")

update_records(at_update, submissions_at, columns = grants2vec)
You are about to update 933 records of the following variables:
    grants2vec
Do you wish to proceed? 

1: ✔ Yes
2: ✖ No
1


Analysis

Let’s see how well these predictions perform against various outcome measures in our submissions Airtable. First we re-read our submissions with the newly updated grants2vec declination score:

submissions <- submissions_at %>% read_airtable()

We replicate the plots above demonstrating our two use cases for these scores.

The first strategy recommends ordering the reads by score in the event of limited reviewer resources or declining review quality:

submissions %>%
  unnest(Call) %>%
  select(Created, Call, pred_r1_declined = grants2vec, `Round 1 Outcome`) %>%
  mutate(r1_declined = `Round 1 Outcome` == "Decline") %>%
  filter(!is.na(r1_declined)) %>%
  group_by(Call) %>%
  mutate(total_accepted = sum(r1_declined == 0)) %>%
  # pred order
  arrange(pred_r1_declined) %>%
  mutate(total_read__pred = row_number(),
         total_accepted__pred = sum(r1_declined == 0),
         cum_accepted__pred = total_read__pred - cumsum(r1_declined == 1),
         missed_accepted__pred = total_accepted - cum_accepted__pred) %>%
  # created order
  arrange(Created) %>%
  mutate(total_read__created = row_number(),
         cum_accepted__created = total_read__created - cumsum(r1_declined == 1),
         missed_accepted__created = total_accepted - cum_accepted__created) %>%
  pivot_longer(starts_with(c("total_read", "missed_accepted")),
               names_sep = "__", names_to = c("name", "method")) %>%
  pivot_wider() %>%
  ggplot(aes(total_read, missed_accepted, color = method)) +
  geom_line() + facet_wrap(~ Call, scales = "free") +
  theme(legend.position = "bottom")

The blue (pred) curve shows the total number of missed_accepted proposals if the reviewer stopped reading after total_read proposals; it sits consistently below the created curve, which corresponds to reading the proposals in the order they were received. The gains are the distance between the two curves.

The second use case involves highlighting some proposals that had a very low declination score but ended up getting declined. This can be used as a second pass for proposals we may have missed.

(redacted)
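The redacted table above can be produced with a query along these lines (a sketch; the columns actually shown were redacted):

# Declined proposals with the lowest predicted declination scores,
# surfaced for a second look.
submissions %>%
  filter(`Round 1 Outcome` == "Decline") %>%
  arrange(grants2vec) %>%
  select(airtable_record_id, `Round 1 Outcome`, grants2vec) %>%
  head(10)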

Additional analysis

Any other ideas of how you want to see this data analyzed? Some ideas below:

We have a qualitative sense of how strong the submissions were for each call. Let’s see if this is reflected in the score:

submissions %>%
  unnest(Call) %>%
  group_by(Call) %>%
  summarise(decline_prob = mean(grants2vec, na.rm = TRUE)) %>%
  arrange(decline_prob)

It appears Mental Health and Digital Coaching had the best proposals while Livelihood Exploration may have had the worst. The other calls were somewhere in between.

submissions %>%
  select(matches("^R1.*[1-4]$"), grants2vec) %>%
  pivot_longer(-grants2vec) %>%
  filter(str_starts(value, "Tier")) %>%
  mutate(value = str_remove(value, " - .*$"),
         name = str_remove(name, "R1-")) %>%
  group_by(name, value) %>%
  summarise(decline_score = mean(grants2vec)) %>%
  ggplot(aes(name, decline_score, fill = value)) +
  geom_col(position = "dodge")

Only Reviewer3 had some inconsistent scoring between Tier 2 and Tier 3 with respect to the predicted score.

Temina’s thoughts

Great insights from Temina; follow-ups below:

R2 Prediction target

Which labels should we be using? We currently use “Round 1 outcome”, but as you’ve noted these proposals are reviewed quickly, by just 2 people per proposal, and the labels are noisy. Once we’ve had a chance to more thoroughly review the shortlist in Round 2, we have a set of proposals that is probably closer to the level of scrutiny I’d want in a training set. Within “Round 2 outcomes”, i think you could count anything that is not marked “decline” as a reasonably successful proposal. This means that you’ll end up with roughly 1/3 to 1/2 fewer labeled proposals. But the labels are more careful. If this loss of labels creates problems for training, perhaps we need to go through a few more rounds of grant-making?

Btw did you already try training on “R2 outcomes” or other outcome variables? Would be interesting to know whatever else you looked at, and how it performed.

ind_train <- sample.int(nrow(sub_text_r2), 0.8*nrow(sub_text_r2))
train <- sub_text_r2[ind_train,] %>%
  filter(!is.na(r1_declined))
test <- sub_text_r2[-ind_train,] %>%
  filter(!is.na(r1_declined))

There are fewer positive labels here, so the dataset is much more imbalanced.
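A quick count of the Round 2 labels in this training split shows the imbalance:

# Label counts and shares for the Round 2 target.
train %>%
  count(r2_declined) %>%
  mutate(share = n / sum(n))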

history <- model %>%
  fit(x = train$text,
      y = as.matrix(train[, "r2_declined"]),
      epochs = 10,
      validation_split = 0.2)
Epoch  1/10 - loss: 1.7872 - accuracy: 0.6144 - auc: 0.4016 - val_loss: 0.5214 - val_accuracy: 0.8462 - val_auc: 0.4040
Epoch  2/10 - loss: 0.6909 - accuracy: 0.8105 - auc: 0.3546 - val_loss: 0.4545 - val_accuracy: 0.8462 - val_auc: 0.4949
Epoch  3/10 - loss: 0.5200 - accuracy: 0.8301 - auc: 0.3979 - val_loss: 0.4388 - val_accuracy: 0.8462 - val_auc: 0.5505
Epoch  4/10 - loss: 0.4964 - accuracy: 0.8301 - auc: 0.5518 - val_loss: 0.4494 - val_accuracy: 0.8462 - val_auc: 0.5455
Epoch  5/10 - loss: 0.4649 - accuracy: 0.8301 - auc: 0.6084 - val_loss: 0.4135 - val_accuracy: 0.8462 - val_auc: 0.6414
Epoch  6/10 - loss: 0.4328 - accuracy: 0.8301 - auc: 0.6769 - val_loss: 0.3890 - val_accuracy: 0.8462 - val_auc: 0.7727
Epoch  7/10 - loss: 0.3929 - accuracy: 0.8301 - auc: 0.8122 - val_loss: 0.3635 - val_accuracy: 0.8462 - val_auc: 0.7980
Epoch  8/10 - loss: 0.3680 - accuracy: 0.8235 - auc: 0.8192 - val_loss: 0.3416 - val_accuracy: 0.8462 - val_auc: 0.8283
Epoch  9/10 - loss: 0.3529 - accuracy: 0.8105 - auc: 0.8398 - val_loss: 0.3195 - val_accuracy: 0.8462 - val_auc: 0.8409
Epoch 10/10 - loss: 0.3275 - accuracy: 0.8431 - auc: 0.8605 - val_loss: 0.3271 - val_accuracy: 0.7949 - val_auc: 0.8914
model %>%
  evaluate(x = test$text, y = as.matrix(test[,"r2_declined"]))

     loss  accuracy       auc 
0.6603321 0.6938776 0.4833333 

The test AUC is much worse here, essentially at the level of a model with no predictive value and well below the values we were getting earlier.
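One common mitigation for this imbalance (not tried above) is to weight the classes inversely to their frequency when fitting; a sketch using keras’s class_weight argument, with illustrative balanced weights:

# Balanced class weights for the Round 2 target (illustrative, not tuned).
n0 <- sum(!train$r2_declined)
n1 <- sum(train$r2_declined)
r2_class_weight <- list("0" = (n0 + n1) / (2 * n0), "1" = (n0 + n1) / (2 * n1))
# e.g. model %>% fit(x = train$text, y = as.matrix(train[, "r2_declined"]),
#                    epochs = 10, validation_split = 0.2,
#                    class_weight = r2_class_weight)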

sub_text_all$grants2vec_r2 <- model %>% predict(sub_text_all$text) %>% as.vector()

submissions %>%
  left_join(sub_text_all, by = "airtable_record_id") %>%
  unnest(Call) %>%
  select(Created, Call, pred_r2_declined = grants2vec_r2,
         ends_with("_declined")) %>%
  filter(!is.na(r2_declined)) %>%
  group_by(Call) %>%
  mutate(total_accepted = sum(r2_declined == 1)) %>%
  # pred order
  arrange(pred_r2_declined) %>%
  mutate(total_read__pred = row_number(),
         total_accepted__pred = sum(r2_declined == 1),
         cum_accepted__pred = total_read__pred - cumsum(r2_declined == 0),
         missed_accepted__pred = total_accepted - cum_accepted__pred) %>%
  # created order
  arrange(Created) %>%
  mutate(total_read__created = row_number(),
         cum_accepted__created = total_read__created - cumsum(r2_declined == 0),
         missed_accepted__created = total_accepted - cum_accepted__created) %>%
  pivot_longer(starts_with(c("total_read", "missed_accepted")),
               names_sep = "__", names_to = c("name", "method")) %>%
  pivot_wider() %>%
  ggplot(aes(total_read, missed_accepted, color = method)) +
  geom_line() + facet_wrap(~ Call, scales = "free") +
  theme(legend.position = "bottom")

As the lower test metrics suggest, these predictions don’t fare as well against reading in order of submission for many of the calls, with the possible exception of the Open Call and Livelihood Exploration.

Below are the top 10 proposals that were declined but had the lowest r2_declined score.

(redacted)

And next, the top 10 proposals that were declined in R2 with the highest r2_declined score.

(redacted)

A quick gut check that proposals declined in R1 have higher declination scores than those accepted in R1.

submissions %>%
  left_join(sub_text_all, by = "airtable_record_id") %>%
  group_by(r1_declined) %>%
  summarise(across(starts_with("grants2vec"), mean))

Leveraging the model for new submissions

new_submissions_at <- airtable("tblfNQfUsRCFAiIHd", "app1Cg2upg4u5RqTX")
new_submissions <- new_submissions_at %>% read_airtable()

new_sub_text_all <- new_submissions %>%
  # turn lists with texts into a text column
  # for anything else we just get the count
  mutate(across(where(is.list), function(any_list) {
    if (class(any_list[[1]]) == "character") {
      map_chr(any_list, ~ str_c(.x, collapse = " "))
    } else {
      map_chr(any_list, ~ str_c(cur_column(), length(.x)))
    }
  })) %>%
  select(airtable_record_id, ends_with("_declined"), where(is.character)) %>%
  unite(text, -c(airtable_record_id, ends_with("_declined")), sep = " ")

new_sub_text_all$grants2vec <- model %>% predict(new_sub_text_all$text) %>% as.vector()
at_update <- new_sub_text_all %>% select(airtable_record_id, grants2vec) %>%
  column_to_rownames("airtable_record_id")

update_records(at_update, new_submissions_at, columns = grants2vec)
You are about to update 162 records of the following variables:
    grants2vec
Do you wish to proceed? 

1: ✔ Yes
2: ✖ No
1

---
title: "Grants2Vec"
output: html_notebook
---

```{r setup}
library(tidyverse)
library(rairtable)
library(keras)
library(tfhub)
```

## TLDR;

The notebook below demonstrates the ability to use NLP and Deep Learning to predict the probability of declination. The algorithm uses only the information submitted through the website and processes these texts to produce a score based on reviewer declination labels from previous rounds of proposals. The algorithm produces a reasonably performant 0.84 ROC AUC that suggests solid predictive performance. In the [Analysis] section below we demonstrate the empirical results of two use cases around (1) ordering proposals due to limited reviewer resources or (2) highlighting declined proposals that had low predicted declination scores to ensure we did not mistakenly decline a worthy proposal.

## Background

Funders seek to fund promising ideas and it is their core competency to be able to do this well. The process of selecting which proposals get funded is a challenging one as indicators of a successful program more often rely on retrospective storytelling vs measured objective impact.

When proposals are submitted to a funder, they go through a very human process of review and selection. In our case, a number of reviewers go through a series of review rounds with increasing scrutiny and attention and a decreasing number of proposals that make to it successive rounds. 

The first round has the greatest number of proposals and with an Open Call the variation in proposals can vary quite a lot. Based on what is written in the proposal, some of these decisions can be made fairly quickly without too much scrutiny. It is also hard to predict how many proposals are submitted and there are times where the number of proposals can overwhelm the resources available to review them.

The notebook below will attempt to use a history of labeled proposals and their textual content to produce a predictive algorithm to rate proposal quality. This quality score can be leveraged to prioritize proposals when there is a reviewer resource constraint or be used as a secondary robustness check to ensure high scoring proposals are not missed due to reviewer fatigue.

## Data prep

We're using the data from an Airtable for all the proposals that are submitted through the website.

### Reading data from Airtable

Use the Airtable API to read from our submissions table:

```{r}
rairtable::set_airtable_api_key(read_lines("at_key.txt"))
submissions_at <- airtable("tblheGxoCuoMOFC6K", "appSz3aSZX0uVXKbV")
submissions <- submissions_at %>% read_airtable()
```

```{r}
dim(submissions)
```

A total of 933 submissions and 83 columns.

```{r}
tibble(
  name = submissions %>% names(),
  type = sapply(submissions, class)) %>%
  DT::datatable()
```
Only some of these columns are derived from the original proposal submission while others are added by or labeled by the reviewers. It's important that we only consider the original content of the proposal along with the prediction target which is the key reviewer outcome we wish to estimate. Other fields are likely to be related to the key outcome and can make the algorithm appear more predictive than it is.

### Key outcomes

The key outcomes we're focused on predicting are related to decisions made about the proposal which are ultimately captured in `Round 1 Outcome` and `Round 2 Outcome` fields. There are a greater number of proposals in Round 1 and there is likely less nuance that goes into decisions in this round compared to future rounds. There's a trade-off going into this between having a more balanced labels (Round 1) while potentially having a clearer and ultimately more important signal of success (Round 2). The nuance in decision making in Round 2 also may be more hard to capture with our features since there may be a good amount of extra research outside of the proposal content going into it.

Let's first have a look at the values in these two columns:

```{r}
submissions %>%
  unnest(Call) %>%
  select(Call, ends_with("Outcome")) %>%
  pivot_longer(-Call, names_to = "round", values_to = "status") %>%
  group_by(Call, round, status) %>%
  count(name = "count") %>%
  ungroup() %>%
  # filter(round == "Round 1 Outcome") %>%
  # select(-round) %>%
  arrange(Call, desc(count)) %>%
  group_by(Call) %>%
  mutate(pct = count / sum(count),
         total = sum(count),
         Call = str_remove(Call, c(" and Mentoring")),
         Call = str_glue("{Call} ({total})")) %>%
  ggplot(aes(Call, pct, fill = status, label = count)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ round, ncol = 1)
```
NA values are always tricky to deal with, what do they mean in each round?

```{r}
submissions %>%
  group_by(across(ends_with("Outcome"))) %>%
  count() %>% ungroup() %>%
  filter(is.na(`Round 1 Outcome`) | is.na(`Round 2 Outcome`))
```
Having an NA value in Round 1 directly translates to an NA value in Round 2 but an NA value in Round 2 infers a variety of things for Round 1. We're unsure if the NA values in Round 2 map to a successful proposal or if they mean that they didn't succeed Round 1 or Round 2.

```{r}
submissions_outcomes <- submissions %>%
  mutate(
    r1_declined = `Round 1 Outcome`== "Decline",
    r2_declined = `Round 2 Outcome` == "Declined")
```

```{r}
library(ggalluvial)
submissions_outcomes %>%
  unnest(Call) %>%
  mutate(Call = Call,
         across(ends_with(" Outcome"), replace_na, "NA"), .keep = "used") %>%
  filter(!`Round 1 Outcome` %in% c("Decline", "NA")) %>%
  group_by(across(everything())) %>%
  count() %>% ungroup() %>%
  mutate(Accepted = `Round 2 Outcome` != "Declined") %>%
  ggplot(aes(y = n, axis1 = `Round 1 Outcome`, axis2 = `Round 2 Outcome`)) +
  geom_alluvium(aes(fill = Accepted), width = 1/12) +
  geom_stratum(width = 1/12, fill = "black", color = "grey") +
  geom_label(stat = "stratum", aes(label = after_stat(stratum)), size = 2) +
  scale_x_discrete(limits = c("Round 1 Outcome", "Round 2 Outcome"), expand = c(.05, .05)) +
  scale_fill_brewer(type = "qual", palette = "Set1") +
  facet_wrap(~ Call, scales = "free")
```

### Constructing data for NLP model

We're treating this as a text classification problem and most columns are text based with the exception of a few columns which we will try to convert to text.

```{r}
sub_text_all <- submissions_outcomes %>%
  # turn lists with texts into a text column
  # for anything else we just get the count
  mutate(across(where(is.list), function(any_list) {
    if (class(any_list[[1]]) == "character") {
      map_chr(any_list, ~ str_c(.x, collapse = " "))
    } else {
      map_chr(any_list, ~ str_c(cur_column(), length(.x)))
    }
  })) %>%
  select(airtable_record_id,
         `Your Name`,
         `Online profile`,
         Attachments,
         `Email of primary contact`,
         `Applicant organization`,
         Call,
         Progress,
         Activities,
         `Outputs & Outcomes`,
         Superpowers,
         `IP Waiver`,
         `Idea & TOC`,
         Evidence,
         `Primary Residence Country`,
         `Funding requested`,
         `Implementation Country`,
         `Incorporation Country`,
         `Executive Summary`,
         ends_with("_declined")) %>%
  select(airtable_record_id, ends_with("_declined"), where(is.character)) %>%
  unite(text, -c(airtable_record_id, ends_with("_declined")), sep = " ")
```

A total count of number of declined and not declined submissions at the two rounds:

```{r}
sub_text_all %>%
  group_by(across(ends_with("_declined"))) %>%
  count()
```

What's with the proposal that was declined in Round 1 but not decined in Round 2?

```{r}
sub_text_all %>%
  filter(r1_declined & !r2_declined) %>%
  select(-text) %>%
  inner_join(submissions) %>%
  select(ends_with("Outcome"))

sub_text_r1 <- sub_text_all %>%
  filter(!is.na(r1_declined))

sub_text_r2 <- sub_text_all %>%
  filter(!is.na(r2_declined))
```

Can a proposal be declined in Round 1 and then referred to fellowship in Round 2?

## Model building

Break up our proposals into train and test sets:

```{r}
ind_train <- sample.int(nrow(sub_text_r1), 0.8*nrow(sub_text_r1))
train <- sub_text_r1[ind_train,]
test <- sub_text_r1[-ind_train,]
```

In a previous version we used a simple [text_vectorization](https://www.rdocumentation.org/packages/keras/versions/2.8.0/topics/layer_text_vectorization) supplied by Keras to transform our text into a tensor. This time we try something ore advacned by leveraging a token-based text embedding trained on Google News articles. This leverages transfer learning where models trained on another corpus can be leveraged as components of other models as popularized by [Tensorflow Hub](https://tfhub.dev/).

This becomes the first layer of our network learner, followed by 256/128/32 node ReLU nodes before classifying to a sigmoid function for `declined_r1` classification:

```{r}
# https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1
embeddings <- layer_hub(
  handle = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1",
  trainable = FALSE
)

input <- layer_input(shape = shape(), dtype = "string")

output <- input %>%
  embeddings() %>%
  layer_dense(units = 2048, activation = "relu") %>%
  layer_dense(units = 512, activation = "relu") %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

model <- keras_model(input, output)
```

We pay attention to accuracy and AUC since class labels are not balanced.

```{r}
model %>%
  compile(
    loss = "binary_crossentropy",
    optimizer = "adam",
    metrics = c("accuracy", "AUC"))
```

### Fitting the model 

```{r}
history <- model %>%
  fit(x = train$text,
      y = as.matrix(train[, "r1_declined"]),
      epochs = 10,
      validation_split = 0.2)

model %>%
  evaluate(x = test$text, y = as.matrix(test[,"r1_declined"]))

```

AUC greater than ~0.8 is sort of a arbitrary industry benchmark for passable prediction models which we're in the ballpark of.

```{r}
plot(history)
```

We can see that the model stabilized after around 3 epochs suggesting more training would not help improve model performance.

### Prediction check

```{r}
preds <- test %>%
  bind_cols(tibble(pred_r1_declined = predict(model, test$text))) %>%
  inner_join(submissions, by = "airtable_record_id")

preds %>%
  mutate(r1_declined = as.factor(r1_declined)) %>%
  ggplot(aes(pred_r1_declined, color = r1_declined)) +
  geom_density()
```
There's a large mass in the high range for declined proposals, while those that are not declined are more spread out through the distribution, with a mode around 0.4.

Overall we're seeing higher declination scores for proposals labeled as declined, so that's a good sign.

### Practical predictive diagnostics

We came up with two scenarios earlier that motivated our thinking for how we could use these predictions. The first was motivated by a lack of reviewer resources such that we would prioritize proposals with more promise:

```{r}
preds %>%
  select(Created, pred_r1_declined, r1_declined) %>%
  mutate(total_accepted = sum(r1_declined == 0)) %>%
  # pred order
  arrange(pred_r1_declined) %>%
  mutate(total_read__pred = row_number(),
         total_accepted__pred = sum(r1_declined == 0),
         cum_accepted__pred = total_read__pred - cumsum(r1_declined == 1),
         missed_accepted__pred = total_accepted - cum_accepted__pred) %>%
  # created order
  arrange(Created) %>%
  mutate(total_read__created = row_number(),
         cum_accepted__created = total_read__created - cumsum(r1_declined == 1),
         missed_accepted__created = total_accepted - cum_accepted__created) %>%
  pivot_longer(starts_with(c("total_read", "missed_accepted")),
               names_sep = "__", names_to = c("name", "method")) %>%
  pivot_wider() %>%
  ggplot(aes(total_read, missed_accepted, color = method)) +
  geom_line()
  
```
Reading in prediction order, after about 110 proposals every remaining one would have been declined, so no accepted proposals would have been missed; reading in order of submission time would have left 17 accepted proposals missed at that same cutoff.
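
To read that cutoff off the data rather than the plot, a small sketch that re-uses `preds` (the `remaining_accepted` column name is new here, introduced just for this snippet):

```{r}
# Sketch: in prediction order, the earliest point after which every remaining
# (unread) proposal was in fact declined
preds %>%
  arrange(pred_r1_declined) %>%
  mutate(total_read = row_number(),
         remaining_accepted = sum(r1_declined == 0) - cumsum(r1_declined == 0)) %>%
  filter(remaining_accepted == 0) %>%
  summarise(earliest_safe_cutoff = min(total_read))
```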

### Prediction and write back

Let's write these predictions back to the Airtable for human scrutiny. Note that we produce predictions for all submissions, even those that do not yet have a declination label.

```{r}
sub_text_all$grants2vec <- model %>% predict(sub_text_all$text) %>% as.vector()
at_update <- sub_text_all %>% select(airtable_record_id, grants2vec) %>%
  column_to_rownames("airtable_record_id")

update_records(at_update, submissions_at, columns = grants2vec)
```

## Analysis

Let's see how well these predictions perform against various outcome measures in our submissions Airtable. First we re-read our submissions with the newly updated `grants2vec` declination score:

```{r}
submissions <- submissions_at %>% read_airtable()
```

We replicate the plots above demonstrating our two use cases for these scores.

The first strategy recommends ordering the reads by score in the event of limited reviewer resources or declining review quality:

```{r}
submissions %>%
  unnest(Call) %>%
  select(Created, Call, pred_r1_declined = grants2vec, `Round 1 Outcome`) %>%
  mutate(r1_declined = `Round 1 Outcome` == "Decline") %>%
  filter(!is.na(r1_declined)) %>%
  group_by(Call) %>%
  mutate(total_accepted = sum(r1_declined == 0)) %>%
  # pred order
  arrange(pred_r1_declined) %>%
  mutate(total_read__pred = row_number(),
         total_accepted__pred = sum(r1_declined == 0),
         cum_accepted__pred = total_read__pred - cumsum(r1_declined == 1),
         missed_accepted__pred = total_accepted - cum_accepted__pred) %>%
  # created order
  arrange(Created) %>%
  mutate(total_read__created = row_number(),
         cum_accepted__created = total_read__created - cumsum(r1_declined == 1),
         missed_accepted__created = total_accepted - cum_accepted__created) %>%
  pivot_longer(starts_with(c("total_read", "missed_accepted")),
               names_sep = "__", names_to = c("name", "method")) %>%
  pivot_wider() %>%
  ggplot(aes(total_read, missed_accepted, color = method)) +
  geom_line() + facet_wrap(~ Call, scales = "free") +
  theme(legend.position = "bottom")
```
The blue (`pred`) curve shows the number of `missed_accepted` proposals if the reviewer stopped reading after `total_read` proposals; it sits consistently below the curve for the strategy of reading proposals in the order they were received. The gain is the vertical distance between the two curves.

The second use case involves highlighting some proposals that had a very low declination score but ended up getting declined. This can be used as a second pass for proposals we may have missed.

```{r include=FALSE}
submissions %>%
  mutate(r1_declined = `Round 1 Outcome` == "Decline") %>%
  arrange(grants2vec) %>%
  filter(r1_declined == 1) %>%
  head(10) %>%
  select(`Applicant organization`, grants2vec)
```

(redacted)

### Additional analysis

Any other ideas of how you want to see this data analyzed? Some ideas below:

We have a qualitative sense of the quality of the submissions in each call. Let's see if this is reflected in the score:

```{r}
submissions %>%
  unnest(Call) %>%
  group_by(Call) %>%
  summarise(decline_prob = mean(grants2vec, na.rm = TRUE)) %>%
  arrange(decline_prob)
```

It appears Mental Health and Digital Coaching had the best proposals while Livelihood Exploration may have had the worst. The other calls were somewhere in between.

```{r}
# Average predicted declination score by each reviewer's tier label
submissions %>%
  select(matches("^R1.*[1-4]$"), grants2vec) %>%
  pivot_longer(-grants2vec) %>%
  filter(str_starts(value, "Tier")) %>%
  mutate(value = str_remove(value, " - .*$"),
         name = str_remove(name, "R1-")) %>%
  group_by(name, value) %>%
  summarise(decline_score = mean(grants2vec)) %>%
  ggplot(aes(name, decline_score, fill = value)) +
  geom_col(position = "dodge")
  
```
Only Reviewer3 had some inconsistent scoring between Tier 2 and Tier 3 with respect to the predicted score.

## Temina thoughts

Great insights from Temina; follow-ups below:

### R2 Prediction target

Which labels should we be using? We currently use "Round 1 Outcome", but as you've noted these proposals are reviewed quickly, by just two people per proposal, and the labels are noisy. Once we've had a chance to more thoroughly review the shortlist in Round 2, we have a set of proposals that is probably closer to the level of scrutiny I'd want in a training set. Within "Round 2 Outcomes", I think you could count anything that is not marked "Decline" as a reasonably successful proposal. This means you'll end up with roughly one-third to one-half fewer labeled proposals, but the labels are more careful. If this loss of labels creates problems for training, perhaps we need to go through a few more rounds of grant-making?

Btw, did you already try training on "R2 outcomes" or other outcome variables? It would be interesting to know what else you looked at and how it performed.
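
As a sketch of the labeling Temina describes, assuming the column mirrors `Round 1 Outcome` and is named `Round 2 Outcome`, anything that reached Round 2 and is not marked "Decline" would count as a success:

```{r}
# Sketch: derive a Round 2 label where anything not marked "Decline" counts as
# a success (the `Round 2 Outcome` column name is an assumption here)
submissions %>%
  filter(!is.na(`Round 2 Outcome`)) %>%
  mutate(r2_declined = `Round 2 Outcome` == "Decline") %>%
  count(r2_declined)
```

The `sub_text_r2` table prepared earlier already carries a label along these lines, so below we re-split it into train and test sets: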

```{r}
# 80/20 split on the proposals that have a Round 2 label
ind_train <- sample.int(nrow(sub_text_r2), 0.8 * nrow(sub_text_r2))
train <- sub_text_r2[ind_train, ] %>%
  filter(!is.na(r1_declined))
test <- sub_text_r2[-ind_train, ] %>%
  filter(!is.na(r1_declined))
```

There are fewer positive labels here, so this is a much more imbalanced dataset.
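
As with Round 1, a quick tabulation makes the skew concrete (illustrative):

```{r}
# Illustrative: label balance for the Round 2 training split
train %>%
  count(r2_declined) %>%
  mutate(prop = n / sum(n))
```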

```{r}
# Note: this re-fits the same `model` object, so training continues from the
# weights learned on the Round 1 labels rather than starting from scratch
history <- model %>%
  fit(x = train$text,
      y = as.matrix(train[, "r2_declined"]),
      epochs = 10,
      validation_split = 0.2)

model %>%
  evaluate(x = test$text, y = as.matrix(test[, "r2_declined"]))
```

The AUC is worse: somewhere between a no-skill model (AUC 0.5) and the values we were getting earlier.

```{r}
sub_text_all$grants2vec_r2 <- model %>% predict(sub_text_all$text) %>% as.vector()

submissions %>%
  left_join(sub_text_all, by = "airtable_record_id") %>%
  unnest(Call) %>%
  select(Created, Call, pred_r2_declined = grants2vec_r2,
         ends_with("_declined")) %>%
  filter(!is.na(r2_declined)) %>%
  group_by(Call) %>%
  mutate(total_accepted = sum(r2_declined == 1)) %>%
  # pred order
  arrange(pred_r2_declined) %>%
  mutate(total_read__pred = row_number(),
         total_accepted__pred = sum(r2_declined == 1),
         cum_accepted__pred = total_read__pred - cumsum(r2_declined == 0),
         missed_accepted__pred = total_accepted - cum_accepted__pred) %>%
  # created order
  arrange(Created) %>%
  mutate(total_read__created = row_number(),
         cum_accepted__created = total_read__created - cumsum(r2_declined == 0),
         missed_accepted__created = total_accepted - cum_accepted__created) %>%
  pivot_longer(starts_with(c("total_read", "missed_accepted")),
               names_sep = "__", names_to = c("name", "method")) %>%
  pivot_wider() %>%
  ggplot(aes(total_read, missed_accepted, color = method)) +
  geom_line() + facet_wrap(~ Call, scales = "free") +
  theme(legend.position = "bottom")
```

Consistent with the lower AUC, these predictions don't fare as well against reading in order of submission for most of the calls, with the possible exception of the Open Call and Livelihood Exploration.

Below are the top 10 proposals that were declined but had the lowest predicted declination (`grants2vec_r2`) score.

```{r include=FALSE}
submissions %>%
  left_join(sub_text_all, by = "airtable_record_id") %>%
  arrange(grants2vec_r2) %>%
  filter(r2_declined == 0) %>%
  head(10) %>%
  select(`Idea & TOC`, `Applicant organization`, grants2vec_r2,
         everything())
```

(redacted)

And the top 10 proposals that were declined in R2 but had the highest predicted declination (`grants2vec_r2`) score.

```{r include=FALSE}
submissions %>%
  left_join(sub_text_all, by = "airtable_record_id") %>%
  arrange(desc(grants2vec_r2)) %>%
  filter(r2_declined == 0) %>%
  head(10) %>%
  select(`Idea & TOC`, `Applicant organization`, grants2vec_r2,
         everything())
```

(redacted)

A quick gut check that proposals declined in R1 have higher declination scores than those accepted in R1.

```{r}
submissions %>%
  left_join(sub_text_all, by = "airtable_record_id") %>%
  group_by(r1_declined) %>%
  summarise(across(starts_with("grants2vec"), mean))
```

## Leveraging model for new submissions

Finally, we apply the same text preparation to a new batch of submissions from a separate Airtable table, score them with the model, and write the `grants2vec` scores back:

```{r}
new_submissions_at <- airtable("tblfNQfUsRCFAiIHd", "app1Cg2upg4u5RqTX")
new_submissions <- new_submissions_at %>% read_airtable()

new_sub_text_all <- new_submissions %>%
  # turn lists with texts into a text column
  # for anything else we just get the count
  mutate(across(where(is.list), function(any_list) {
    if (is.character(any_list[[1]])) {
      map_chr(any_list, ~ str_c(.x, collapse = " "))
    } else {
      map_chr(any_list, ~ str_c(cur_column(), length(.x)))
    }
  })) %>%
  select(airtable_record_id, ends_with("_declined"), where(is.character)) %>%
  unite(text, -c(airtable_record_id, ends_with("_declined")), sep = " ")

new_sub_text_all$grants2vec <- model %>% predict(new_sub_text_all$text) %>% as.vector()
at_update <- new_sub_text_all %>% select(airtable_record_id, grants2vec) %>%
  column_to_rownames("airtable_record_id")

update_records(at_update, new_submissions_at, columns = grants2vec)
```


