Roger French¹

¹ Case Western Reserve University, Cleveland, Ohio, United States

Data science arises from advances in computing, communication, and data, which together make it possible to develop data-driven models from large, petabyte-scale datasets. Distributed computing approaches, combined with the ease of acquiring datasets at this scale, are driving the digital transformation of industry, science and technology, and society itself. These approaches complement the “petaflop” computing characteristic of high-performance computing, which is used in materials science in efforts such as the Materials Genome Initiative and integrated computational materials engineering research. One challenge for materials data science is that materials science datasets have typically been small and sparse in comparison to, for example, epidemiological studies in the life sciences.
Data science combines advances in statistics, computer science, and a domain science (such as materials science) to enable new understanding through the application of statistical learning, machine learning, and, most recently, deep learning. “Pure” data scientists can be considered specialists in the academic fields of mathematics, statistics, and computer science. A need arises, however, to develop broader data science skills across the workforce and to produce T-shaped graduates, who have deep skills in a domain science such as materials science and, at the same time, broad skills in data science [1].
In 2013 we launched a one-year study to design an applied data science (ADS) undergraduate minor, available to students across our university. ADS students learn programming, inferential statistics, exploratory data analysis, and modeling and prediction, and complete a semester-long data science project [2]. The ADS minor started in 2015 and grew to include 100 undergraduate and graduate students last academic year. The ADS curriculum is taught using an open data science tool chain focused on open and reproducible science, based on R/RStudio, Python, Git, Markdown, and LaTeX, to produce compilable data analyses. In R, for example, advances such as the tidyverse collection of packages, with its pipes and pipelined code, and ggplot2, which implements the grammar of graphics for data visualization, are major steps towards realizing Donald Knuth’s vision of literate programming and are well matched to today’s multidisciplinary team research [3].
For materials data science, we now offer a data science concentration that focuses the ADS courses on materials problems while addressing the core challenge of integrating data science with the physical and chemical science foundations of materials science. Essential to the adoption of data-driven models is demonstrating that they do not replace our physical and chemical theories, models, and experimental experience. Instead, they are a new tool, adding statistical power and significance, with improved inference and prediction. These analyses must also be subject to robust validation, using training and testing splits of the data.
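As an illustration of validation with training and testing splits, the following is a minimal sketch in Python using only the standard library. The data are synthetic and purely illustrative (a response that declines linearly with exposure time, plus noise), and the helper names `train_test_split`, `fit_line`, and `rmse` are assumptions made for this example, not part of the ADS course materials.

```python
import math
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(rows):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def rmse(rows, intercept, slope):
    """Root-mean-square error of the fitted line on the given rows."""
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in rows]
    return math.sqrt(sum(squared) / len(rows))


# Synthetic data (assumption): response = 100 - 0.5 * exposure + Gaussian noise
noise = random.Random(42)
data = [(float(t), 100.0 - 0.5 * t + noise.gauss(0.0, 1.0))
        for t in range(0, 80, 2)]

# Fit on the training rows only, then score on the held-out test rows
train, test = train_test_split(data, test_fraction=0.25, seed=1)
intercept, slope = fit_line(train)
train_rmse = rmse(train, intercept, slope)
test_rmse = rmse(test, intercept, slope)

print(f"slope={slope:.3f} intercept={intercept:.2f}")
print(f"train RMSE={train_rmse:.3f} test RMSE={test_rmse:.3f}")
```

The model never sees the test rows during fitting, so a test error much larger than the training error would signal overfitting.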
Materials data science is not only an educational challenge; it also calls for advancing how we perform our research experiments and acquire data for analysis. A study protocol, encompassing the samples, their exposures, and the evaluations performed on them, constitutes the basis of the metadata, the predictors, and the responses of the experiment. In many experiments, it is possible to augment the experiment with additional predictors measured in sufficient numbers to provide statistically sound results. Having materials scientists who are knowledgeable about these data issues is important to advancing our research methods.
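One way to picture how a study protocol maps onto metadata, predictors, and responses is the sketch below. It is a hypothetical illustration only: the class names, field names, and example values (such as `damp_heat` and `yellowness_index`) are assumptions introduced here, not a schema described in this work.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    sample_id: str
    material: str       # metadata: what the sample is


@dataclass
class Exposure:
    sample_id: str
    stressor: str       # predictor: type of exposure (hypothetical label)
    hours: float        # predictor: cumulative exposure time


@dataclass
class Evaluation:
    sample_id: str
    hours: float        # exposure time at which the sample was measured
    measurement: str    # name of the measured property
    value: float        # response


def build_rows(samples, exposures, evaluations):
    """Join the protocol parts into flat rows for analysis."""
    material = {s.sample_id: s.material for s in samples}
    stressor = {(e.sample_id, e.hours): e.stressor for e in exposures}
    rows = []
    for ev in evaluations:
        rows.append({
            "sample_id": ev.sample_id,                              # metadata
            "material": material.get(ev.sample_id),                 # metadata
            "stressor": stressor.get((ev.sample_id, ev.hours)),     # predictor
            "hours": ev.hours,                                      # predictor
            ev.measurement: ev.value,                               # response
        })
    return rows


# Hypothetical example values, for illustration only
samples = [Sample("S1", "polymer_film")]
exposures = [Exposure("S1", "damp_heat", 0.0),
             Exposure("S1", "damp_heat", 500.0)]
evaluations = [Evaluation("S1", 0.0, "yellowness_index", 1.2),
               Evaluation("S1", 500.0, "yellowness_index", 3.4)]

rows = build_rows(samples, exposures, evaluations)
for row in rows:
    print(row)
```

Each row then carries its own metadata and predictors alongside the response, so adding a new predictor to the protocol amounts to adding a field rather than restructuring the dataset.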

[1] Debbie Hughes, Roger H. French, Crafting a Minor to Produce T-Shaped Graduates, (2016).
[2] Business Higher Education Forum, Creating a Minor in Applied Data Science, The Business Higher Education Forum, 2016.
[3] D.E. Knuth, Literate Programming, Computer Journal, 27 (1984) 97–111. doi:10.1093/comjnl/27.2.97.