-Looking at the practical utility of the different datasets created in this reference kit: for the industry-specific ones (utilities, e-commerce, etc.) we took a more targeted approach, which also allowed us to demonstrate a diverse set of Intel® optimized technologies, such as Intel® Distribution of Modin and the NumPy and SciPy optimizations in Intel® Distribution for Python*. That said, the industry-agnostic dataset is at least as valuable in practice, and probably more so, because it lets the user customize the details of the synthetic data, e.g. the number of columns, data types, distributions, weights, etc. We have also added the capability to generate a basic target value for regression, binary classification, or multi-class classification. Two additions would make the synthetic data even more realistic: (1) correlation between features and (2) correlation between the response and one or more features. Such correlations can take a very large number of forms in the real world, so supporting them would have made the scope of this reference kit harder to contain. As a result, this capability is not offered as part of the reference kit. We mention it here because it can easily be added at run time by drawing on the user's knowledge and subject-matter expertise.
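As a rough illustration of the two kinds of correlation described above, the sketch below uses plain NumPy and is not part of the reference kit. It assumes the features start out as independent standard-normal columns, and every name in it (`correlate_features`, `make_response`, `target_corr`, `weights`) is hypothetical. Features are mixed through a Cholesky factor of a user-chosen correlation matrix, and the response is a noisy linear combination of the features, so the weights decide which features it correlates with.

```python
import numpy as np


def correlate_features(independent, target_corr):
    """Mix independent, standardized columns so they follow target_corr.

    Assumes `independent` has uncorrelated, unit-variance columns and that
    `target_corr` is a symmetric positive-definite correlation matrix.
    """
    # Lower-triangular L with L @ L.T == target_corr
    chol = np.linalg.cholesky(target_corr)
    return independent @ chol.T


def make_response(features, weights, noise_scale, rng):
    """Linear response plus Gaussian noise; weights pick the driving features."""
    noise = rng.normal(scale=noise_scale, size=features.shape[0])
    return features @ weights + noise


rng = np.random.default_rng(seed=0)
n_rows = 10_000
independent = rng.standard_normal((n_rows, 3))

# Hypothetical domain knowledge: features 0 and 1 move together,
# feature 2 is unrelated to both.
target_corr = np.array([
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
features = correlate_features(independent, target_corr)

# The response depends directly on feature 0 only.
weights = np.array([2.0, 0.0, 0.0])
response = make_response(features, weights, noise_scale=1.0, rng=rng)

# A binary classification target can be derived by thresholding the response.
binary_target = (response > np.median(response)).astype(int)

print(np.corrcoef(features, rowvar=False).round(2))
```

The Cholesky approach only applies to Gaussian-like numeric columns. For categorical or heavily skewed features a different technique (for example, rank-based methods or copulas) would be needed, and the correlation matrix and weights would come from the user's own subject-matter knowledge.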