CICI: RSSD: Massive Internal System Traffic Research Analysis and Logging Dataset (MISTRAL Dataset)

  • Awarding Agency: National Science Foundation
  • Amount: $600,000
  • Grant Dates: 11/1/22 – 10/31/25
  • PI: Richard Biever
  • co-PIs: Tracy Futhey, Charles Kneifel, Neil Gong, Alexander Merck

This project creates a dataset (the MISTRAL Dataset) for cybersecurity researchers and network operators to use in identifying threats, thereby better protecting research-related resources. The sources of data contained in the Dataset reflect actual network activity to and from several scientific applications and their related cyberinfrastructure. These data are safely captured, securely stored, and made accessible, through authorized access, to associated cybersecurity researchers for the purpose of detecting abnormal or malicious activities that could represent threats to the identified science applications and cyberinfrastructure. Because the data are collected continuously and through automated means, the MISTRAL Dataset provides a realistic and relevant characterization of threats over time. The project also produces a public version of the Dataset.

The MISTRAL project encompasses an Infrastructure, the Dataset, and a set of proof-of-concept analytic endeavors. The Infrastructure includes a data storage pipeline for handling an estimated 1 TB/day of data stored on-premises and/or in the cloud, a reference monitoring framework, and tools for collecting, analyzing, and sharing the data and relevant metadata that characterize both north-south (Internet-facing) and east-west (lateral) data flows. The Dataset consists of safely captured domain science workflow behavior using production network flows (e.g., source/destination IP, port, protocol, date/time, and number and size of connections) from data centers and research labs, as well as supplemental data from DNS, authentication logs, intrusion detection alerts, and other security event alerts (e.g., threat intelligence data detailing Indicators of Compromise). The initial proof-of-concept analytics comprise various researcher and student (graduate and undergraduate course project) data analysis efforts to devise techniques for detecting abnormal or malicious activity or to study that activity; these collaborators also test the MISTRAL environment and Dataset to recommend refinements to the Infrastructure and the data collection process.
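
To make the flow fields and the north-south versus east-west distinction concrete, the following is a minimal Python sketch and not the project's actual schema or tooling. The record layout, the field names, the `INTERNAL_NETWORKS` prefixes, and the `flow_direction` helper are all assumptions introduced here for illustration; a real deployment would substitute the institution's own address ranges and flow export format.

```python
from dataclasses import dataclass
from datetime import datetime
from ipaddress import ip_address, ip_network

# Hypothetical internal prefixes (assumption); replace with the site's real ranges.
INTERNAL_NETWORKS = [ip_network("10.0.0.0/8"), ip_network("192.168.0.0/16")]


@dataclass(frozen=True)
class FlowRecord:
    """Illustrative flow summary using the kinds of fields the text lists."""
    src_ip: str
    dst_ip: str
    dst_port: int
    protocol: str
    start_time: datetime
    connections: int   # number of connections summarized by this record
    bytes_total: int   # total size of those connections, in bytes


def is_internal(addr: str) -> bool:
    """Return True if the address falls inside any assumed internal prefix."""
    ip = ip_address(addr)
    return any(ip in net for net in INTERNAL_NETWORKS)


def flow_direction(flow: FlowRecord) -> str:
    """Label a flow as east-west (lateral), north-south (Internet-facing), or external."""
    src_in = is_internal(flow.src_ip)
    dst_in = is_internal(flow.dst_ip)
    if src_in and dst_in:
        return "east-west"
    if src_in or dst_in:
        return "north-south"
    return "external"  # neither endpoint is internal, e.g. transit traffic
```

Under these assumptions, a flow between two addresses inside 10.0.0.0/8 would be labeled east-west, while a flow from an internal host to a public address would be labeled north-south.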