Using Machine Learning (ML) and Natural Language Processing (NLP) to identify relationships and similarities in grant proposal texts

For each new grant proposal, the Office of Research and Innovation (OR&I) must manually search through thousands of files to check for significant overlap with past grants, which is a very slow process. A team from the 2022 Duke University Code+ Program was tasked with developing an algorithm to automate the search and speed it up. The product the team developed, SciVerify, flagged the correct file in every ground truth test provided by the OR&I.

While working on the project, the team requested OnDemand sessions on the Duke Compute Cluster to write and test the algorithm in Jupyter Notebooks. The shared virtual machine environment gave the whole team access to the approximately 4,500 documents made available by OR&I, and the easy access to powerful computing resources helped the team test demanding natural language processing models and algorithms.

To extract relevant features for the OR&I, the algorithm focuses on the grants’ abstracts and aims. It filters the text and then summarizes it using Transformer networks, a state-of-the-art natural language processing technique. The OR&I provided five ground truth pairs to train and test the algorithm. SciVerify flagged the correct file each time, resulting in a perfect score.
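
As a rough illustration of the comparison step, the sketch below filters the text of a new proposal and ranks past proposals by similarity to it. This is not SciVerify's code: it swaps the Transformer-based summarization for a plain bag-of-words cosine similarity so that it runs with the Python standard library alone. The function names, the stop-word list, and the sample texts are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real filter would be much larger.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "on", "with", "we", "is", "are"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_past_proposals(new_text, past_texts):
    """Return (name, score) pairs sorted from most to least similar."""
    new_vec = Counter(tokenize(new_text))
    scores = [
        (name, cosine_similarity(new_vec, Counter(tokenize(text))))
        for name, text in past_texts.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Made-up abstracts standing in for past grant proposals.
past = {
    "grant_A": "Aim 1: develop deep learning models to detect lung cancer in CT scans.",
    "grant_B": "Aim 1: study soil microbes and their effect on crop yield.",
}
new = "We propose deep learning models for detecting lung cancer from CT scans."

ranking = rank_past_proposals(new, past)
print(ranking[0][0])  # name of the most similar past proposal
```

In a ground truth check of the kind described above, the top-ranked name would be compared against the file the OR&I expects to be flagged.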

Once the algorithm was developed, it was packaged to provide several functions for the OR&I:

  • Database comparison – main use case – input one PDF file to compare against past grant proposals.
  • Batch comparison – input multiple files and compare them against each other.
  • PDF summary – the researchers can review the output of the algorithm.
  • Keyword flagging – searching for keywords in a PDF file; e.g. HIPAA, DoD.
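
Two of the functions listed above, keyword flagging and batch comparison, can be sketched in a few lines. The example assumes the text has already been extracted from each PDF, and it uses a simple Jaccard word overlap as a stand-in for SciVerify's actual similarity measure, which the source does not describe in detail. The names `flag_keywords`, `jaccard`, and `batch_compare`, the file names, and the sample texts are hypothetical.

```python
import re
from itertools import combinations

# Hypothetical watch list; HIPAA and DoD are the examples given in the text.
KEYWORDS = ["HIPAA", "DoD"]

def flag_keywords(text, keywords=KEYWORDS):
    """Return {keyword: count} for keywords found as whole words (case-sensitive)."""
    hits = {}
    for kw in keywords:
        count = len(re.findall(rf"\b{re.escape(kw)}\b", text))
        if count:
            hits[kw] = count
    return hits

def jaccard(a, b):
    """Jaccard overlap between the word sets of two texts (stand-in similarity)."""
    set_a = set(re.findall(r"[a-z]+", a.lower()))
    set_b = set(re.findall(r"[a-z]+", b.lower()))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def batch_compare(docs):
    """Score every unordered pair of documents in the batch, most similar first."""
    pairs = [(n1, n2, jaccard(docs[n1], docs[n2])) for n1, n2 in combinations(docs, 2)]
    return sorted(pairs, key=lambda p: p[2], reverse=True)

# Made-up extracted text keyed by hypothetical file names.
docs = {
    "p1.pdf": "Patient records will be handled under HIPAA. Funding is sought from the DoD.",
    "p2.pdf": "Patient records will be handled under HIPAA rules at all sites.",
    "p3.pdf": "We model river sediment transport with field sensors.",
}

flags = {name: flag_keywords(text) for name, text in docs.items()}
top_pair = batch_compare(docs)[0]
print(flags)
print(top_pair[:2])  # the two most similar files in the batch
```

Database comparison follows the same pattern, except that a single input file is scored against every stored proposal instead of every pair in the batch.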


The project was executed by 6 undergraduate students: Prince Ahmed, Rose DiPietro, Quan Doan, Rodrigo Guerreiro, Julie Ou, and Isabelle Xiong, with the help of Project Leads Mark McCahill and Andy Ingham, in collaboration with the Office of Research and Innovation (OR&I), the Office of Information Technology (OIT), and Duke Computer Science faculty. The code repository can be found here.