Abstract
Deep learning, and foundation models (FMs) in particular, has attracted a diverse range of researchers thanks to the generalization ability of these models. However, the advent of these techniques has also exposed a lack of transparency and rigor in how models are developed. In particular, the number of epochs and other hyperparameters cannot be determined in advance, which makes it difficult to identify the best model. To address this challenge, machine learning frameworks such as MLflow can automate the collection of this type of information. However, these tools capture data in proprietary formats and pay little attention to lineage. This paper proposes yProv4ML, a framework that captures the provenance information generated during machine learning processes in PROV-JSON format with minimal code modification.