A new paradigm for scientific discovery from data

Tamires Soares
2 min readJun 14, 2021

A common problem in scientific domains is to represent relationships among physical variables, e.g., the combustion pressure and launch velocity of a rocket or the shape of an aircraft wing and its resultant air drag. The conventional approach for representing such relationships is to use models based on scientific knowledge, i.e., theory based models, which encapsulate cause-effect relationships between variables that have either been empirically proven or theoretically deduced from first principles. These models can range from solving closed-form equations (e.g. using Navier–Stokes equation for studying laminar flow) to running computational simulations of dynamical systems (e.g. the use of numerical models in climate science, hydrology, and turbulence modeling). An alternate approach is to use a set of training examples involving input and output variables for learning a data science model that can automatically extract relationships between the variables. Theory-based and data science models represent the two extremes of knowledge discovery, which depend on only one of the two sources of information available in any scientific problem, i.e., scientific knowledge or data. They both enjoy unique strengths and have found success in different types of applications. Theory based models are wellsuited for representing processes that are conceptually well understood using known scientific principles. On the other hand, traditional data science models mainly rely on the information contained in the data . They have a wide range of applicability in domains where we have ample supply of representative data samples, e.g., in Internet-scale problems such as text mining and object recognition. Despite their individual strengths, theory-based and data science models suffer from certain deficiencies when applied in problems of great scientific relevance, where both theory and data are currently lacking. For example, a number of scientific problems involve processes that are not completely understood by our current body of knowledge, because of the inherent complexity of the processes. In such settings, theory-based models are often forced to make a number of simplifying assumptions about the physical processes, which not only leads to poor performance but also renders the model difficult to comprehend and analyze.

Reference:
Karpatne, Anuj, et al. “Theory-guided data science: A new paradigm for scientific discovery from data.” IEEE Transactions on knowledge and data engineering 29.10 (2017): 2318–2331

--

--

Tamires Soares

Tamires is technical advisor and instructor focusing on reservoir/production engineering and data analytics .