Idaka/DB Data/Practices.json at main · TheSoftwareDesignLab/Idaka · GitHub

{"_id":5,"Practice":"The distribution of the training data should reflect the real distribution.","Description":"The distribution of the training data (i.e., the data used to train the ML model) should reflect the real distribution (i.e., the original distribution of the population from which the data was extracted). This is because machine learning models try to approximate the real-world distribution based on the training data. If a completely different distribution is passed as input for training, then the model is going to approximate a distribution that does not exist in real life, and it is going to perform poorly.","Task":"Divide the data set for next steps"}
{"_id":3,"Practice":"If an ML model is published as a cloud service, specifically, when a classification/prediction task uses ML cloud-based services, it is important to define the use case and the model requirements in order to identify how frequently the service should be invoked.","Description":"If an ML model is published as a cloud service, specifically, when a classification/prediction task uses ML cloud-based services, it is important to define the use case and the model requirements in order to identify how frequently the service should be invoked. This is particularly important for selecting the model and for provisioning the infrastructure required to execute the model. For instance, if a cloud model is expected to constantly classify posts from social networks, then it requires a more performant model and infrastructure than a use case in which the model classifies a group of posts at the end of each day.","Task":"Define deployment environment (Storage, size,data retrieval, scalability, feedback)"}
{"_id":9,"Practice":"The images that are going to be used as instances of the negative class should have some common characteristics with the positive ones.","Description":"The images that are going to be used as instances of the negative class should have some common characteristics with the positive ones. This means that the negative samples should “look similar” to the positive ones. In this way, the model will learn special features that will help to differentiate a particular object from similar ones. For instance, when classifying images that contain buttons, a good idea would be to include, among the negative samples, images with components that look similar to buttons (e.g., cards).","Task":"Profiling data"}
{"_id":28,"Practice":"Transformations should take into account (i) the model that is going to be used and (ii) the nature of the data.","Description":"Transformations should take into account: \r\n(i) the model that is going to be used (e.g., model assumptions, model characteristics, model sensitivity to magnitude) \r\n(ii) the nature of the data (e.g., data distribution, data type). \r\n\r\nFor instance, when clustering data (e.g., when doing unsupervised categorization of software), the clustering algorithm is affected if the data is transformed, because the shapes of the clusters can change; e.g., when the axes are rescaled to larger scales, the cluster assignment could change. In addition, when the features/attributes have meaningful units, like distances, data transformations should be done carefully: when scaling a feature that has meaningful units (e.g., 1 liter, 1 meter), the interpretation could be lost.","Task":"Data transformation."}
{"_id":31,"Practice":"If an algorithm does not support categorical data, this type of feature should be encoded into multiple binary features or by counting frequencies.","Description":"If an algorithm (i.e., model) does not support categorical data (e.g., the algorithm does not accept a category as an input), this type of feature should be encoded into multiple binary features or by counting frequencies. In other words, if at least one categorical variable exists in a dataset that is going to be used by an algorithm that does not support this type of feature, then the variable could be transformed in two ways. Firstly, the categorical variable is transformed into multiple binary features. For example, let us assume we have a variable working area with the values: back-end, front-end, analytics. The variable could be transformed into three boolean variables: is-back-end, is-front-end, and is-analytics, and each variable takes the value 1 if the original working area corresponds to that area, or 0 otherwise. This encoding is known as one-hot encoding. Secondly, each category is replaced by the number of times it occurs in the dataset (frequency encoding).","Task":"Data transformation."}
{"_id":32,"Practice":"Datetime data could be partitioned into their corresponding components (e.g., if you have day-month-year then split it into day, month, year).","Description":"Datetime data could be partitioned into their corresponding components. For example, a datetime value such as 21/10/2020 could be transformed into three different variables: day (21), month (10), and year (2020). This division is done with the purpose of simplifying the date variable into simpler variables that could help to identify possible trends for each variable. For instance, it could be found that in a specific month (e.g., December), the average time to accept or reject a pull request (PR) in a repository is longer than in other months.","Task":"Data transformation."}
{"_id":37,"Practice":"When handling geographical data, the geodesic distance should be used to compute the distance between two points of interest.","Description":"When handling geographical data, the geodesic distance (i.e., the shortest path between two points on a surface, where the path between those points is constrained to move along two dimensions) should be used to compute the distance between two points of interest. This distance is suggested since it takes into account that the two points lie on a “sphere” (e.g., planet Earth), and the dependency between latitude and longitude is automatically considered. Note, however, that when calculating the shortest distance between two points in a city, the directions of the ways have to be taken into account, and a Manhattan distance could be considered.","Task":"Data transformation."}
{"_id":40,"Practice":"If noise is present in a dataset, then it should be removed with techniques such as smoothing.","Description":"If noise (i.e., random error or variance in a measured variable) is present in a dataset, then the noise should be removed with techniques such as smoothing. Keep in mind that data noise has two main sources (implicit and random errors) and two main types (class noise and attribute noise). Thus, understanding the noise and removing it is important, since noise hinders (i.e., obstructs) most types of data analysis. For removing noisy data, multiple techniques are used. For instance, the binning technique, in which numerical values are sorted, then grouped in bins, and the values are replaced by a characteristic of the bin (e.g., mean, median); distance-based outlier removal, a technique in which an outlier is determined based on a distance","Task":"Handle noise e.g., unwanted outliers"}
{"_id":41,"Practice":"If a rectified linear unit (ReLU) is being used as an activation function in neural networks (NNs), then the features should be scaled.","Description":"If a rectified linear unit (ReLU), i.e., the non-linear activation function f(x) = max(0, x), where x is an input value, is being used as an activation function in neural networks, the features should be scaled. Features should be scaled because a ReLU activation function has no upper limit for its output, and this output could be so large that the network could be overfitted.","Task":"Data transformation."}
{"_id":43,"Practice":"When working with a long short-term memory (LSTM) network, the sequences that are going to be fed into the NN should be padded in order to equalize the length of all the sequences.","Description":"When working with an LSTM network, the sequences that are going to be fed into the neural network should be padded in order to equalize the length of all the sequences. In other words, if the sequences used as input do not have the same length, “something” like a set of characters or tokens (e.g., 0) should be added at the end (i.e., post padding) or beginning (i.e., pre padding) of the sequence in order to homogenize the length of all the sequences. Note that the padding is only needed if the LSTM is trained in batches, since all the sequences in a batch need to have the same length to apply the proper calculations.","Task":"Data transformation."}
{"_id":54,"Practice":"Regardless of the classifier that is being used, mutual information techniques for feature selection should be considered.","Description":"Regardless of the classifier that is being used, mutual information techniques for feature selection should be considered. This kind of feature selection can be used regardless of the algorithm, since it is not tied to or built into any particular ML algorithm. It can be done as a separate step before training.","Task":"Define the methodology of extracting and selecting features."}
{"_id":55,"Practice":"Using backward and forward feature selection should be taken into account regardless of the classifier that is being used.","Description":"Using backward and forward feature selection should also be taken into account regardless of the classifier that is being used. In the forward feature selection (FFS) method, an iterative process starts with no variables and, in each iteration, adds the feature that best improves the model performance. To choose that feature, in each iteration a model is trained and tested for each not-yet-selected variable, and the feature that contributes the best improvement to the model is selected. This process continues until the addition of a variable does not improve the performance of the model. On the other hand, backward selection starts with all the features, and in each iteration the least significant feature is removed, until removing features yields no further improvement.","Task":"Define the methodology of extracting and selecting features."}
{"_id":58,"Practice":"If the objective is to build specialized models, then a specific model for each category in a categorical feature should be trained.","Description":"If the objective is to build specialized models, then a specific model for each category in a categorical feature should be trained. Note that this practice is the follow-up to splitting the data by category, which is what makes it possible to train specialized models at this stage.","Task":"Define which models are going to be use"}
{"_id":59,"Practice":"If the dataset that is going to be used has multiple features in different scale systems, then scale-invariant algorithms like tree-based algorithms should be used.","Description":"If the dataset that is going to be used has multiple features in different scale systems, then scale-insensitive algorithms (i.e., algorithms that are not affected by re-scaling one or all of the features), like tree-based algorithms, should be used. Some examples of those algorithms are Decision Trees, Gradient Boosting Trees, Extra Trees, and Naive Bayes.","Task":"Define which models are going to be use"}
{"_id":60,"Practice":"When training prediction models with time-series data, sliding windows should be used.","Description":"When training prediction models with time-series data, sliding windows should be used. Keep in mind that when using a sliding window, the n preceding values (d0, d1, ..., dn−1) are used to predict the next value (dn). In this way, the prediction model is trained with historical data to predict the present.","Task":"Train the model"}
{"_id":61,"Practice":"Test data should not be included in the training data in order to have a fair metric.","Description":"Test data should not be included in the training data in order to have a fair metric (i.e., a metric that is not biased by data leakage). By data leakage, we mean that test data is seen by the model before the testing phase.","Task":"Divide the data set for next steps"}
{"_id":64,"Practice":"If symmetry needs to be avoided in a neural network, its components should be initialized with different values.","Description":"If you want to avoid symmetry (i.e., weights in a neural network having the same value, so that neurons behave the same) in an NN, its components should be initialized with different values. In other words, if you want to prevent all or most of the nodes from learning the same concept/feature/abstraction (i.e., having the exact same value), then you have to initialize each of them with different values.","Task":"Optimize hyperparameters"}
{"_id":72,"Practice":"A model should be retrained with an online approach if the model needs to dynamically adapt to new patterns from the data.","Description":"A model should be retrained with an online approach (i.e., the model is retrained when new data arrives) if the model needs to dynamically adapt to new patterns from the data. In other words, if in previous steps (requirement definition) it is defined that the model needs to be updated constantly with new data, then the model has to be retrained each time new data arrives.","Task":"Update/Retrain the model"}
{"_id":75,"Practice":"If there is a large data variance, the number of layers in a neural network must be increased.","Description":"If there is a large data variance, the number of layers in an NN must be increased. This is because the variance of the data introduces more complexity, so more layers are needed in order to learn more details.","Task":"Optimize hyperparameters"}
{"_id":76,"Practice":"A neural network should be trained with different initialized weights several times, in order to first get an average of the results and its variance and then to get a more specific judgment of the overall performance.","Description":"An NN should be trained with different initialized weights several times in order to first get an average of the results and their variance, and then to get a more specific judgment of the overall performance. In this way, the judgment about the model is not biased by the particular seed used for initializing the weights.","Task":"Select hyper-parameters that should be consider/optimized in the process"}
{"_id":84,"Practice":"Only independent models should be combined in an ensemble.","Description":"Only independent models should be combined in an ensemble, meaning that after analyzing the correlation between predictions and determining the dependency between models, those that are independent should be used for an ensemble. This dependency between models’ predictions should be taken into account since predictions that are 100% correlated do not provide additional information. Keep in mind that this dependency analysis is useful when training parallel models for an ensemble, not for sequential techniques in which base learners are used in tandem (e.g., Adaptive Boosting).","Task":"Train the model"}
{"_id":86,"Practice":"When using clustering, the model should be re-trained with new data to build more clusters and increase its efficiency.","Description":"When using clustering, the model should be re-trained with new data to build more clusters and increase its efficiency. When new data that can be used for training a model arrives, it could provide new features to be taken into account, which could improve the performance of the separation between classes. In addition, re-training a model prevents model degradation.","Task":"Train the model"}
{"_id":101,"Practice":"When evaluating a model, bootstrapping or cross-validation should be tried.","Description":"When evaluating a model, use bootstrapping or multiple cross-validation (CV) executions. Executing CV multiple times could generate a robust evaluation since the model would be tested on different samples, which could give a better estimate of the generalization error than evaluating only a single time. In addition, bootstrapping provides the standard error of the estimates.","Task":"Validate the model with the validation set or cross-validation (Calculate the metrics previously defined on the training and the validating sets)."}
{"_id":102,"Practice":"When training/validating/testing a model, the process of cross-validation should be repeated with different seeds to prevent the impact of using a fixed seed.","Description":"When training/validating/testing a model, the process of cross-validation should be repeated with different seeds to prevent the impact of using a fixed seed. Using a fixed seed generates random numbers that are always the same; this may cause the measured performance of the model to depend on those starting numbers.","Task":"Validate the model with the validation set or cross-validation (Calculate the metrics previously defined on the training and the validating sets)."}
{"_id":110,"Practice":"The performance and the time required to train a model should be taken into account when comparing models","Description":"The performance and the time required to train a model should be taken into account when comparing models. In this way, not only the performance (e.g., precision, accuracy, recall) is taken into account, but also time, which is also related to resource usage.","Task":"Compare and select the model and hyperparameters that fits better to the problem."}
{"_id":120,"Practice":"For each trained model, a separate file should be used in order to keep track of all the possible experiments.","Description":"For each trained model, a separate file should be used in order to keep track of all the possible experiments. For example, if three different models are trained in the process of tuning and model selection, then each of those models should be saved in a separate file. It is important to note that some existing tools like version control systems facilitate keeping track of changes, even if the same file(s) are overwritten.","Task":"Implement the model."}
{"_id":125,"Practice":"It should be verified whether both the model and the dataset fit in memory.","Description":"It should be verified whether both the model and the dataset fit in memory. Note: this depends on the type of model; in particular, it does not apply to deep learning.","Task":"Experiment design."}
{"_id":140,"Practice":"Go beyond in-the-moment explanations. Help users better understand your product with deeper explanations outside immediate product flows.","Description":"\"It can benefit your users and your product to provide deeper explanations of how your product works. This type of explanation can help users strengthen their understanding of how your system works, and how they can use it to best meet their needs. Having this level of understanding can also help them give you more robust feedback on your product.\"","Task":"Present the system to the user","Cite":"People + AI Guidebook"}
{"_id":141,"Practice":"Automate more when risk is low. Consider user trust and the stakes of the situation when determining how much to automate.","Description":"\"When determining how much to automate your product flows, think about the stakes of your product, and the level of comfort that users may have with your type of product. In low risk, well-established products, like content recommendation systems, you might choose to prioritize a more heavily automated product flow where user control is available but optional. However, when onboarding to a new type of product, or in high-stakes situations, errors can be particularly problematic, and can corrode user trust and potentially cause dangerous situations. In such cases, design your system to give users more control over the system.\"","Task":"Define objective","Cite":"People + AI Guidebook"}
{"_id":146,"Practice":"Design for your data labelers. Make sure that data labelers have well designed tools and workflows.","Description":"\"For supervised learning, accurate data labels are a crucial ingredient to achieve relevant ML output. Labels can be added through automated processes or by people known as labelers.\nLabeling tools range from in-product prompts to specialized software. If you’re working with labelers, it’s worth investing time upfront in selecting or designing the tools, workflows, and instructions. The best way to do this is often in collaboration with the labelers themselves.\nWhen labelers understand what you’re asking them to label, and why, and they have the tools to do so effectively, they’re more likely to label the data correctly. And as partners in the process, they can also help you improve your labeling tasks overall.\"","Task":"Define the methodology of labeling","Cite":"People + AI Guidebook"}
{"_id":148,"Practice":"Learn from label disagreements. Understand differences in how labelers interpret and apply labels to prevent problems later on.","Description":"\"When you encounter labels that are “messy,\" unexpected, or hard to reconcile, don’t categorically discard them as “noisy.\" Take time to investigate whether issues with labeler tools, workflows, instructions, or overall data strategy may be leading to such issues with labels.\nFor example, say you’re training a model to flag toxic comments. Your labelers might apply different toxicity labels based on their personal experience, which can lead to discrepancies.\nThese disagreements in labels offer an opportunity to identify deeper data and/or labeling issues that you may need to address to ensure data quality.\"","Task":"Audit the labels","Cite":"People + AI Guidebook"}
{"_id":8,"Practice":"If it is required to augment the number of instances in the negative class, preexisting datasets could be used for including more instances in the dataset.","Description":"If it is required to augment the number of instances in the negative class, pre-existing datasets could be used to include more instances in the dataset. For instance, if a model is classifying whether a UI has buttons or not, then you could augment your dataset with instances from a generic dataset that contains images with mobile interfaces to find samples of a negative class (e.g., views without buttons).","Task":"Adjust the class distribution of a data set e.g., oversampling, undersampling"}
{"_id":10,"Practice":"The minimum size of the object that is going to be detected should be present in the data that is going to be used for training the model.","Description":"The minimum size of the object that is going to be detected should be present in the data that is going to be used for training the model, and the object regions of interest (ROI) should have a similar aspect ratio in all the positive images. These two practices could be exemplified as follows. Imagine that the button classification problem now includes the detection of the exact position of buttons in images of mobile apps.","Task":"Divide the data set for next steps"}
{"_id":14,"Practice":"When analyzing time series/temporal data, possible temporal trends should be identified.","Description":"When analyzing time series/temporal data, possible temporal trends should be identified (e.g., every Monday, every January, every two weeks). For example, when a dataset of fixed vulnerabilities is analyzed, a possible trend would be that a large number of vulnerabilities are fixed on the last Friday of every month. This kind of insight could be useful for the model as a new feature, which is simpler than a time-dependent variable.","Task":"Profiling data"}
{"_id":15,"Practice":"When analyzing time series/temporal data, weird trends should be identified.","Description":"Weird trends (i.e., trends whose identification requires cross-referencing the data against an external source that maps time events) should also be identified. For instance, let us assume the periods where fewer vulnerabilities are fixed are Christmas and New Year's Eve; then this could be a new feature that indicates why, in some periods, the issues take longer to be solved.","Task":"Profiling data"}
{"_id":18,"Practice":"If correlations between existing features and instances of geographical data are needed, then geographical data should be clustered.","Description":"If correlations between existing features and instances of geographical data are needed, then geographical data should be clustered. In other words, when working with geographical data, you could cluster/group similar data points (e.g., points in nearby locations), which facilitates the identification of relations between a targeted class and the geographical features.","Task":"Data transformation."}
{"_id":19,"Practice":"For improving model performance with a dataset that has a categorical feature, the data should be split by each category.","Description":"For improving model performance with a dataset that has a categorical feature, the data should be split by each category. To illustrate this, imagine that a model is needed to predict how many hours are required to fix a bug. The input dataset has a variable that indicates the component in which the developer is working (e.g., back-end and front-end). The dataset should be grouped by component (back-end and front-end), and afterward a separate model should be trained for each group, in order to have a model specialized in predicting the amount of time needed to solve an issue per area.","Task":"Data transformation."}
{"_id":20,"Practice":"If it is needed to optimize hyper-parameters, the dataset should be split into training, validation and test sets in order to prevent overfitting and avoid computing biased metrics.","Description":"If in the model training stage it is needed to optimize hyper-parameters, the dataset should be split into training, validation, and test sets in order to prevent overfitting and avoid computing biased metrics. In this way, the training and validation sets are used to optimize the parameters, while the model with the best parameters is evaluated with the test set. It should be noted that an expert left a comment indicating that it would be better to refer to this practice as “data partitioning.”","Task":"Divide the data set for next steps"}
{"_id":23,"Practice":"If an imputation technique is used, the same technique should be used in the training, validation and testing sets.","Description":"If an imputation (i.e., replacement of missing values) technique is used, then the same technique should be used in the validation and testing sets. When modifying a dataset, e.g., with imputation, it should be done only using the training data to ensure that no information from the validation or test data is leaked into the training data. Then the learned information about how to impute the missing data should be applied to the other sets. \r\nIf the imputation is not applied to the remaining data, the training data would have a different format and distribution than the other data. Note that any kind of imputation could introduce a bias since the distribution is being affected.","Task":"Handle missing data"}
{"_id":27,"Practice":"Transformation of numerical data should be done in all the sets by using the statistics computed on the training data.","Description":"Transformations of numerical data should be done in all the sets (e.g., training, validation, and test) by using the statistics computed on the training data. In other words, after selecting the training data, we should calculate the mean and standard deviation of the variable that needs to be standardized. Then, these statistics should be used to transform all the sets. The reason for calculating the statistics only on the training data, as previously mentioned, is that calculating statistics with the whole dataset would introduce future information (i.e., information about the validation and testing data). Furthermore, those statistics should be used to transform the remaining sets. In this way, it can be evaluated whether the model is able to generalize well to unseen data.","Task":"Data transformation."}
{"_id":30,"Practice":"When transforming numerical data into ratios and the transformation leads to infinite values, smoothing should be used.","Description":"When transforming numerical data into ratios and the transformation leads to infinite values (e.g., because the denominator is zero or close to zero), smoothing should be used. One way of using smoothing is by adding a small constant to the denominator to avoid divisions by zero (in the spirit of Laplace smoothing).","Task":"Data transformation."}
{"_id":35,"Practice":"If the task of a model is detecting objects, then each region of interest (ROI) should be cropped instead of using the complete image.","Description":"If the task of a model is to detect objects, then each ROI (region of interest) should be cropped instead of using the complete image. To illustrate this, consider a button detection example. The region in which the button is present should be used to train a model and not the whole interface in which the button is present. To accomplish that, for each image in the dataset, the ROI (e.g., the region in which the button is present) should be cropped. \r\nWhether this practice applies depends on whether there are nested objects. Some object detection tasks involve objects that are inside each other, a special characteristic that should not be ignored. Therefore, this practice should be used carefully depending on the ML system goal.","Task":"Data transformation."}
{"_id":39,"Practice":"When encoding geographical data, sine and cosine facets could be used for representing the cyclical characteristics of datetime features.","Description":"When encoding geographical data, sine and cosine facets could be used for representing the cyclical characteristics (e.g., after 23:59, it is 00:00) of datetime features. Keep in mind that sine and cosine facets allow modeling the fact that there are cycles in time data, and the closeness between numbers is not the same as on a linear scale (e.g., on a 24-hour clock, 1 is closer to 23 than 20 is).","Task":"Data transformation."}
{"_id":42,"Practice":"If clustering is needed, the dataset contains aggregated structures (e.g., a 3D array), and the order of the aggregated data matters, then a new feature should be created to encode the order.","Description":"If clustering is needed and the dataset contains aggregated structures, like 3D arrays, and the order of the aggregated data matters, a new feature should be created to encode the order. In this way, the order of the aggregated structures is not lost because it is modeled as a feature.","Task":"Data transformation."}
{"_id":49,"Practice":"It is relevant to check existing techniques of dimensionality reduction.","Description":"It is relevant to check existing techniques of dimensionality reduction. With a high-dimensional dataset (i.e., one with a large number of variables/features), training a model could become time- and resource-intensive due to the many variables that need to be analyzed and stored. Therefore, if all the dimensions (e.g., 10) could be expressed with fewer dimensions, then the computational cost (i.e., number of variables to analyze) and the storage are reduced. There are multiple techniques for dimensionality reduction, like Linear Discriminant Analysis (LDA), Non-negative Matrix Factorization (NMF), and Principal Component Analysis (PCA).","Task":"Define the methodology of extracting and selecting features."}
{"_id":51,"Practice":"When using the principal component analysis (PCA) method, its assumptions should be checked.","Description":"When using the PCA method, its assumptions, e.g., the features exhibit linear relationships among themselves, should be checked. The PCA method calculates the covariance matrix to identify correlations between features; next, it computes the eigenvectors (direction) and eigenvalues (magnitude) of the covariance matrix to identify the principal components. Then, these principal components are ordered by their eigenvalues, which helps to find the ones that cover most of the variance, and hence most of the information, in the data.","Task":"Define the methodology of extracting and selecting features."}
{"_id":63,"Practice":"When training a neural network (NN), dropout layer(s) should be used to avoid overfitting.","Description":"When training a neural network, dropout layer(s) should be used to avoid over-fitting. Using dropout layers is a popular technique for the regularization of a NN. With this technique, some units of the NN are ignored during a particular forward and backward pass.","Task":"Train the model"}
{"_id":65,"Practice":"Neural network weights should be initialized randomly to avoid symmetry.","Description":"NN weights should be initialized randomly. Note that this practice should be used carefully for large NNs, for instance by using well-fitting distributions, because not every random initialization works. The random initialization should take into account the network architecture. For example, Xavier initialization should be used with the Tanh activation.","Task":"Select hyper-parameters that should be consider/optimized in the process"}
{"_id":70,"Practice":"If a deep neural network does not converge in a specific number of iterations, then increasing this number should be tried.","Description":"If a deep NN does not converge (i.e., the loss of the NN does not move towards a minimum (local or global) with a decreasing trend) in a specific number of iterations (e.g., 100), then increasing this number (e.g., to 150) should be tried. This practice is suggested because sometimes the number of epochs is not enough for a NN to learn or find a minimum. Therefore, increasing the number of epochs could help the NN to start learning.","Task":"Train the model"}
{"_id":79,"Practice":"For faster convergence of Multi-Layer Perceptrons, Stochastic Gradient Descent (SGD) should be used.","Description":"For faster convergence of Multi-Layer Perceptrons (MLP), Stochastic Gradient Descent (SGD) should be used. In general, an MLP uses gradient descent, in which all the samples of the training data are used to do a single update in a particular iteration; if SGD is used instead, only one data point or a subset of the training data is used for a single update.","Task":"Train the model"}
{"_id":82,"Practice":"When sub-sampling time-series data, it should be checked that unwanted confounding between sampled origins and seasonality in data is not being introduced.","Description":"When sub-sampling time-series data, it should be checked that unwanted confounding between sampled origins and seasonality in data is not being introduced. For example, when daily data is used and sub-sampling occurs once a week (i.e., one sample per week), do not always select the same day (e.g., Monday) because, in that case, you are only evaluating the forecasting for that single day (you are not generalizing to all the possible days).","Task":"Train the model"}
{"_id":87,"Practice":"In order to deal with concept drift when working with data streams, models should be re-trained with a limited number of instances.","Description":"In order to deal with concept drift (i.e., statistical properties of a target variable, which a model is trying to predict, change over time in unforeseen ways) when working with data streams, models should be retrained with a limited number of instances (mini-batches of information). For example, these changes in data could be seen when training models to detect a failure of a system based on its performance metrics. Then, the original model learned some patterns in the performance metrics (e.g., down time of the system, number of non-received packages).","Task":"Update/Retrain the model"}
{"_id":94,"Practice":"It is preferable to use cross-validation instead of a training and test data split for training and testing a model.","Description":"It is preferable to use cross-validation instead of a training and test data split for training and testing a model. In the validation process, an expert indicated that this is a best practice, but it depends on the use case. One aspect to consider when using cross-validation is the size of the dataset. For large datasets, cross-validation is expensive because of the iterative process involved.","Task":"Validate the model with the validation set or cross-validation (Calculate the metrics previously defined on the training and the validating sets)."}
{"_id":95,"Practice":"It is preferable to use nested cross-validation instead of a training and test split for training and testing a model.","Description":"It is preferable to use nested cross-validation instead of a training and test split for training and testing a model. In the nested cross-validation approximation, there are two loops instead of having a single loop like in non-nested cross-validation. The first loop (outer loop) works the same as in the cross-validation approximation (i.e., k−1 folds for training and the remaining one for testing); the second loop (inner loop) is used for hyper-parameter tuning, in which the k − 1 folds of the outer loop are divided into two; k − 2 folds are used for training with a specific combination of hyper-parameters, and the remaining fold is used for testing the selected hyper-parameters.","Task":"Validate the model with the validation set or cross-validation (Calculate the metrics previously defined on the training and the validating sets)."}
{"_id":97,"Practice":"In order to prevent overfitting, cross-validation should be used.","Description":"In order to prevent over-fitting, cross-validation should be used. This practice encourages (i) training models multiple times on different sets of data and (ii) testing the performance of the model multiple times on different sets. When training and testing the model on different sets, the bias induced by a single set of data is reduced. Therefore, the possibility of memorizing the data is reduced.","Task":"Validate the model with the validation set or cross-validation (Calculate the metrics previously defined on the training and the validating sets)."}
{"_id":98,"Practice":"In order to avoid overfitting, nested cross-validation should be used.","Description":"In order to avoid over-fitting, nested cross-validation should be used. This practice encourages (i) training models multiple times on different sets of data and (ii) testing the performance of the model multiple times on different sets. When training and testing the model on different sets, the bias induced by a single set of data is reduced. Therefore, the possibility of memorizing the data is reduced.","Task":"Validate the model with the validation set or cross-validation (Calculate the metrics previously defined on the training and the validating sets)."}
{"_id":99,"Practice":"First tune the hyper-parameters of the probabilistic model and then adjust the decision thresholds.","Description":"First tune the hyper-parameters of the probabilistic model and then adjust the decision thresholds.","Task":"Optimize hyperparameters"}
{"_id":106,"Practice":"After the best hyper-parameters are found, the model with those “optimum” hyper-parameters should be tested in the set built for testing.","Description":"After the best hyper-parameters are found in the training stage, the model with those “optimum” hyper-parameters should be evaluated with the test set. For instance, if you trained three models (e.g., Decision tree A, Decision tree B, and Decision tree C) with different hyper-parameters (e.g., a max depth of 3, 5, and 7), and Decision tree B outperformed the other two models in the validation set, then Decision tree B should be evaluated in the test set in order to understand the generalization error of the model.","Task":"Run the algorithm with the testing part of the data set"}
{"_id":112,"Practice":"When cross-validation is used for testing, then the hold-out set should not be used.","Description":"When cross-validation is used for testing (i.e., cross-validation is not used for tuning hyper-parameters), then the hold-out set should not be used. The hold-out set (i.e., a data subset that is set aside from the beginning in order to test the model at the end of the whole training and tuning process) is not necessary in this case, since the folds are used for training and testing. Otherwise, the folds in CV will be used for training and hyper-parameter tuning, and the hold-out set is necessary to test the model.","Task":"Validate the model with the validation set or cross-validation (Calculate the metrics previously defined on the training and the validating sets)."}
{"_id":113,"Practice":"A deployed model should be the one that has the best hyper-parameters, but it should be re-trained with the entire dataset.","Description":"A deployed model should be the one that has the best hyper-parameters, but it should be retrained with the entire dataset (i.e., training, validation, and test sets) before deployment. In other words, after finding the best model with the best hyper-parameters, the model has to be trained with the whole dataset and not only the training set. However, in some cases, a validation model is needed. Thus, the best model cannot be trained on the whole dataset.","Task":"Deploy the selected model"}
{"_id":117,"Practice":"For the data cleaning procedure stage, routines should be used to avoid implementing everything from scratch.","Description":"For the data cleaning procedures, routines should be used to avoid implementing everything from scratch. For example, if in a previous project, in which a data cleaning process was needed, a code that could be generalized was built, then it could be used for a current project. In this way, building everything from scratch is avoided.","Task":"Implement the model."}
{"_id":119,"Practice":"In order to enable reproducibility in data pre-processing, pipelines should be used and exported.","Description":"In order to enable reproducibility in data pre-processing, pipelines should be used and exported. This practice encourages the usage of predefined procedures in different stages of the ML pipeline. In some cases, the existing tools/code could be previous self-made tools/code that could be modified or adjusted for further implementations. With this, we want to emphasize that code/tools are not necessarily built by third parties, and they could also be self-made.","Task":"Implement the model."}
{"_id":121,"Practice":"The (hyper-)parameters used in the training process should be documented.","Description":"The (hyper) parameters used in the training process should be documented. In other words, it shows the necessity of keeping track of the parameters used when training a model. Following these two practices, it is possible to keep track of some changes in the experiments (e.g., models, (hyper)parameters). However, it should be considered that each saved model could consume important storage, and more specialized tools for keeping track of changes should be used.","Task":"Implement the model."}
{"_id":124,"Practice":"In order to optimize, parallel execution should be used if it is supported by the algorithms.","Description":"In order to optimize, parallel execution should be used if it is supported by the algorithms. In this way, computer/server resources such as CPU could be optimized. If multiple ML models could be trained in parallel, then it could also reduce the time needed for making a choice across multiple algorithms.","Task":"Implement the model."}
{"_id":127,"Practice":"When dealing with large data, resource-aware implementations should be used.","Description":"When dealing with large data, resource-aware implementations should be used. When using this type of implementation, optimal use of the available resources could be achieved, because being aware of what is used and what is not makes it possible to distribute resources as needed without blocking resources that are not needed. In the same way, when being aware of the available resources, only portions of data (i.e., the data that could fit in memory) are loaded into main memory, avoiding an “out of memory” error.","Task":"Implement the model."}
{"_id":129,"Practice":"Set the right expectations. Be transparent with your users about what your AI-powered product can and cannot do.","Description":"\"Because AI systems are probabilistic, your system will probably give an incorrect or unexpected output at some point. This makes it critical that you help users calibrate their expectations about system functionality and output. Do this by being transparent about both its capabilities and limitations. For example, indicating a prediction could be wrong may cause the user to trust that particular prediction less. However, in the long term, users may use or rely on your product more, because they’re less likely to over-trust your system and be disappointed.\"","Task":"Define the 'success criteria'","Cite":"People + AI Guidebook"}
{"_id":130,"Practice":"Explain the benefit, not the technology. Help users understand your product’s capabilities rather than what’s under the hood.","Description":"\"AI as a technology has unlocked a wide array of new, and often exciting, products and services. But no matter how novel your use of AI, when explaining your AI-powered product to your users, focus primarily on conveying how it makes part of the experience better or delivers new value, versus explaining how the underlying technology works.\nRun user studies to answer questions like:\n*What do users need to know about how the system works to understand and use your product?\n*Why is this knowledge useful to the users?\n*The level of technical detail needed in your explanation will vary depending on the product and users.\"","Task":"Define objective","Cite":"People + AI Guidebook"}
{"_id":132,"Practice":"Invest early in good data practices. The better your data planning and collection processes, the higher quality of your end output.","Description":"\"AI-powered products can suffer without the right level of focus and resources on data from early on. We call such downstream effects “data cascades” and they can be hard to diagnose and detect until your product experience is impacted.\nGood planning and scrutiny of your dataset can help you avoid issues downstream. Some actions that you can take include:\n*Collect data in batches\n*Embrace “noisy” data\n*Plan for data maintenance\n*Partner with domain experts\"","Task":"Implement the model.","Cite":"People + AI Guidebook"}
{"_id":133,"Practice":"Make precision and recall tradeoffs carefully. Determine whether to prioritize more results or higher quality results based on your product's goals.","Description":"\"When designing your AI-powered product, evaluate the tradeoffs between having the set of recommendations your system gives to be broad (prioritizing recall) or accurate (prioritizing precision). This decision will have a big impact on the end user experience.\nPrioritize precision: If you are building a product in a high stakes domain (such as healthcare), and the risks stemming from an error are high, you will likely want to be more conservative with your model output, and only give recommendations for the things the system is most confident of. This means prioritizing precision.\nPrioritize recall: if your product is in a lower stakes domain and it’s not a problem for users to sort through a longer list of recommendations (for example, music recommendations), you might prefer to prioritize recall. This means your user will get a larger set of results, offering the chance for surprising or serendipitous discoveries, but some of them may be less relevant.\"","Task":"Define the metrics to use","Cite":"People + AI Guidebook"}
{"_id":136,"Practice":"Anchor on familiarity. As you onboard users to a new AI-driven product or feature, guide them with familiar touchpoints.","Description":"\"With AI-driven products, there can be a temptation to communicate the “newness” or “magic” of the system’s predictions through its UI metaphors. However, unfamiliar UI touchpoints can make it harder for users to learn to use your system, potentially leading to degraded understanding of, or trust in, your product, no matter the quality of your AI output. Instead, anchor new users with familiar UI patterns and features. This will make it easier for them to focus on the key task at hand, which is building comfort with, and calibrating their trust in, your system’s recommendations.\"","Task":"Present the system to the user","Cite":"People + AI Guidebook"}
{"_id":147,"Practice":"Actively maintain your dataset. Maintain the quality of your product experience by proactively maintaining the quality of your data.","Description":"\"Develop a data maintenance plan early on, and monitor and maintain your data over time to detect issues proactively, rather than relying on corrective measures when something goes wrong.\nKeep the following in mind for data maintenance:\n*Similarities and differences between training and live data\n*When the data was gathered\"","Task":"Collect data while in production and compare changes in model inferences.","Cite":"People + AI Guidebook"}
{"_id":150,"Practice":"Get input from domain experts as you build your dataset. Building partnerships with domain experts early can help reduce iterations on your dataset later on.","Description":"\"When creating your own dataset, make time early on to observe a domain expert your product aims to serve — for example, watch an accountant analyze financial data, or a botanist classify plants. This can give you valuable insights about the types of data that they use to solve the problem your product is addressing.\nAim for sustained relationships with domain experts throughout the project lifecycle (rather than one-off consultations), whenever possible.\"","Task":"Validate data coherence (Benchmark)","Cite":"People + AI Guidebook"}
{"_id":7,"Practice":"When measuring a dataset size, it should not be done only by referring to storage space but also in terms of rows and columns.","Description":"When measuring a dataset size, it should not be done only by referring to storage space (MB, GB) but also in terms of rows and columns. Measuring the size of a dataset only in terms of storage space is often misleading, making the dataset seem larger than it is, which could lead to underestimating/miscalculating the needed resources. In addition, in some cases, having the size of the matrix (i.e., rows and columns) that is going to be used by the ML model could help to compute the model’s complexity or estimate the execution time in advance.","Task":"Profiling data"}
{"_id":1,"Practice":"The goal of the model that is going to be trained should be considered when selecting the most appropriate metrics for assessing the model performance.","Description":"The goal of the model that is going to be trained should be considered when selecting the most appropriate metrics for assessing the model performance/accuracy. For instance, if the goal/requirement is to build a precise model rather than one that retrieves all the relevant instances (i.e., all the instances that are important for the task), then a metric that focuses on precision is needed, like precision or f-score with a β coefficient that favors precision. For example, for the task of classifying X-rays as images with malignant tumors, some of the decisions to take into account would be:\r\n(i) would it be better to retrieve with the model all the images with malignant tumors (i.e., relevant images), even if some of the retrieved images are false positives (i.e., images classified as relevant ones that, in reality, do not show malignant tumors)?\r\n(ii) would it be better to retrieve some of the relevant images with high certainty, decreasing the number of false positives but also failing to retrieve some images with malignant tumors (i.e., relevant images)?\r\n(iii) is it desirable to have a model that balances the two aforementioned cases?\r\n\r\nIn the first case, the model would be focused on recall (i.e., retrieving as many relevant images as possible); in the second case, the classification model would be focused on precision; and in the third case, the model would be focused on balancing the two aforementioned metrics (i.e., f-score).","Task":"Define the metrics to use"}
{"_id":12,"Practice":"For each possible attribute/column in a dataset, we should identify its type before applying any pre-processing technique.","Description":"For each possible attribute/column in a dataset, we should identify its type (e.g., numeric, categorical, time data) before applying any pre-processing technique. This practice is relevant since some pre-processing techniques are not valid for all the data types. For example, if a categorical variable is represented with numbers, it does not mean that it could be normalized. In addition, some ML models do not accept all kinds of data as input.","Task":"Profiling data"}
{"_id":16,"Practice":"It is needed to identify missing values and their nature before doing any pre-processing.","Description":"It is needed to identify missing values and their nature before doing any preprocessing. In particular, missing values in a variable/feature could be valuable; thus, removing them before analyzing and understanding them could cause data losses.","Task":"Profiling data"}
{"_id":21,"Practice":"In order to prevent overfitting and avoid biased metrics when optimizing, the whole dataset should be divided into train and test; then, the training data should be divided into folds for cross-validation.","Description":"In order to prevent overfitting and avoid biased metrics when optimizing, the whole dataset should be divided into train and test; then, the training data should be divided into folds for cross-validation. Note that this practice should also be called “data partitioning.” Separating the training data from the other sets makes it possible to:\r\n(i) avoid overfitting, since the model is not going to learn the whole dataset by “memory”\r\n(ii) have another set for testing the model, on which a fair metric can be calculated since the data has not been previously seen.","Task":"Divide the data set for next steps"}
{"_id":33,"Practice":"Data augmentation should be used in order to prevent over-fitting.","Description":"Data augmentation ( i.e., add more data to a dataset via newly created synthetic samples or slightly modified copies of already existing data) should be used in order to prevent overfitting. In the example of classifying buttons from a UI, some existing UI image samples could be taken, and their background color could be changed; the selected images can be converted into images with a greyscale, or the contrast could be changed to augment the number of existing samples. Data augmentation techniques are not only used in images.","Task":"Enhance and augment data"}
{"_id":34,"Practice":"For analyzing handwritten text images, the image should be split into individual characters.","Description":"For analyzing handwritten text images, the image should be split into individual characters. As an illustration, if an image contains the word “Malware” written by hand, the word should be divided into seven pieces. Each of the seven pieces should contain a single character, i.e., “M”, “a”, “l”, “w”, “a”, “r”, “e”. A tool for optical character recognition (OCR) should be used instead of implementing the recognition from scratch.","Task":"Data transformation."}
{"_id":36,"Practice":"When handling geographical data and the latitude and longitude will be transformed, directional statistics should be used to deal with potential interdependence of latitude and longitude, and with scales wrapping around.","Description":"When handling geographical data and the latitude and longitude will be transformed, directional statistics (i.e., the subdiscipline of statistics that works with directions, axes, or rotations) should be used to deal with potential interdependence of latitude and longitude, and with scales wrapping around (e.g., when using degrees, 0° is closer to 359° than to 10°). Note that other types of statistics may not handle these characteristics (i.e., interdependence between the two axes, scales wrapping around), which are intrinsic features of geographical data.","Task":"Data transformation."}
{"_id":44,"Practice":"When data is being anonymized with hashing, it should be done over a unique identifier using salt.","Description":"When data is being anonymized with hashing, it should be done over a unique identifier using salt (i.e., an additional random piece added to the original input, e.g., a password). To illustrate this practice, imagine that there are three inputs, (i) A, (ii) B, and (iii) A, and there is a need to anonymize them with a hashing function. After applying the proposed function, it is possible to note that the hashed values for (i) and (iii) are the same (i.e., (i) A’, (ii) B’, (iii) A’), because (i) and (iii) are the same input value. Therefore, if a hacker, by trial and error, could identify who was (i), then she can also identify (iii). However, this could be fixed if a dynamic salt (i.e., a new salt is generated every time) is added before hashing, e.g., (a) Abdc, (b) Bhgf, (c) Auyt. Therefore, the hashed outputs will be different. On the other hand, if the hashed data was unique, using static or dynamic salt, the process of guessing the possible used hashing function is difficult since the additional piece added (salt) is needed. Note that using dynamic salt implies that each newly generated salt must be stored.","Task":"Data transformation."}
{"_id":46,"Practice":"In the case of object detection, when selecting the region of interest, a tool should be used to fix the aspect ratio for all the labeled objects.","Description":"In the case of object detection, when selecting the region of interest, a tool should be used to fix the aspect ratio for all the labeled objects. For instance, in the example of detecting buttons in a mobile app, a tool that sets a ratio (e.g., 16:9) for identifying the region in which the button is present should be used.","Task":"Labeling"}
{"_id":48,"Practice":"Multivariate feature engineering is preferred over univariate feature selection.","Description":"Multivariate feature engineering is preferred over univariate feature selection. In other words, this practice indicates that it is preferable to use methods for feature selection that take dependencies and relations between variables into account.\r\nExamples of multivariate feature selection methods are the multivariate voxel selection method and minimal-Redundancy Maximal-Relevance (mRMR) selection. Examples of univariate methods are the chi-square test, ANOVA F-test (for classification), or mutual information.","Task":"Define the methodology of extracting and selecting features."}
{"_id":53,"Practice":"When using Logistic Regression for selecting features, select the significant ones.","Description":"When using Logistic Regression for selecting features, select the significant ones (e.g., features with a p-value smaller than 0.05). Selecting only the significant variables means selecting those variables that have a p-value smaller than the threshold defined for the test, e.g., 0.05. When selecting those variables, we are selecting the features for which the null hypothesis of the regression (i.e., the analyzed coefficient is equal to zero) is rejected.","Task":"Select features"}
{"_id":67,"Practice":"If images are being used for training a neural network, a convolutional layer is preferable to a fully connected one.","Description":"If images are being used for NN training, a convolutional layer is preferable to a fully connected one. Three of the experts commented on this practice: two of them indicated that its usage depends on the problem or on the architecture of the NN, and the last comment was directed to the categorization of the practice, indicating that it is more related to “model design”. Regarding the first two expert comments, according to the current state of research, transformer architectures should also be considered, e.g., Vision Transformer (ViT) and MLP-Mixer.","Task":"Define which models are going to be use"}
{"_id":69,"Practice":"If there is not enough data when training a deep neural network from scratch, transfer learning with fine-tuning should be tried.","Description":"If there is not enough data when training a deep NN from scratch, transfer learning (i.e., use previously learned knowledge and apply it to a new but similar problem) with fine-tuning should be considered. As an illustration, consider a scenario in which only a small sample of code documentation in Spanish is available to train a deep NN to understand the main topic of the texts. Under that scenario, an existing pre-trained model for language processing like BERT could be used to deal with the lack of training data. The pre-trained weights from the existing model could be used as an initial step for fine-tuning the new NN.","Task":"Train the model"}
{"_id":71,"Practice":"When having a large dataset that does not fit in memory, multiple models could be first trained on random data samples, and then an ensemble with those models could be created to get a more accurate model than a base learner with fewer data.","Description":"When having a large dataset that does not fit in memory, multiple models could be first trained on randomized subsamples of the data. Then an ensemble with those models could be created to get a more accurate model (i.e., the ensemble) than a base learner with fewer data. In other words, this practice suggests that if the whole training dataset does not fit in memory, training multiple models on data that is randomly selected and joining these models with an ensemble method (e.g., bagging) is a potentially better option than sampling some data and training a single model on this sampled data. Bear in mind that standard bagging draws samples with the same number of instances as the original data, i.e., there would be no size reduction.","Task":"Train the model"}
{"_id":78,"Practice":"When a neural network does not converge, the size of the mini-batches should be increased.","Description":"When a NN does not converge, the size of the mini-batches should be increased. In other words, if a NN does not converge and you are updating a model after the NN sees 10 samples of data (10 being the size of the mini-batch), then the number of samples should be increased (e.g., to 20). For instance, a NN may not learn because the small size of the mini-batches is similar to using a single instance (i.e., SGD), which can lead to unstable gradient estimations.","Task":"Optimize hyperparameters"}
{"_id":85,"Practice":"When training a model with mixed type features, which includes time series features, separate models for each part of the representation should be built, and then the different models should be combined in an ensemble.","Description":"When training a model with mixed type features, which includes time series features, separate models for each part of the representation should be built, and then the different models should be combined in an ensemble. In other words, if a dataset has time-series features and non-time-series features, then the former should be used as an input to train specialized models for that type of data, and the other features should be used to train another model. Afterward, both models could be combined in an ensemble. In this way, specialized models for each type of category (i.e., models built for time series data) are used. Therefore, the strengths of each one are being exploited. ","Task":"Train the model"}
{"_id":88,"Practice":"A model should be retrained in offline mode, because this method could lead to a better global approximation.","Description":"A model should be retrained in offline mode because it could lead to a better global approximation; however, it is impractical for large datasets. Note that when a model needs to be updated as the data changes or as events are generated, offline retraining is not useful. In addition, when having a large dataset, retraining a model with all the possible data points could be extremely costly. Moreover, if the data distribution has changed, older data points could be useless or introduce noise to the model.","Task":"Update/Retrain the model"}
{"_id":89,"Practice":"If online and offline re-training are not an option, then a batch/mini-batch approach should be used.","Description":"If online and offline re-training are not an option, then a batch/mini-batch approach should be considered. Mini-batch training is a middle point between the other two: online training re-trains the model each time a new observation arrives, offline training takes the whole dataset to re-train the model, and mini-batch training takes n samples of the whole dataset to re-train the model.","Task":"Update/Retrain the model"}
{"_id":91,"Practice":"Each model that is combined into an ensemble should be optimized independently.","Description":"Each base model used as part of an ensemble should be optimized independently. For instance, if the ensemble includes three models (i.e., M1, M2, and M3), each model should be first tuned, and then the ensemble should be built. When training and tuning each base model/learner independently, the number of hyper-parameter combinations to evaluate is smaller than when tuning them jointly as an ensemble. For instance, imagine that each of the three models has 100 possible hyper-parameter combinations; then 300 (100 * 3) combinations should be evaluated. On the other hand, if the tuning is done as part of an ensemble, then the possible number of combinations is 100^3 = 100 * 100 * 100, which raises the number of hyper-parameter combinations to 1,000,000.","Task":"Optimize hyperparameters"}
{"_id":92,"Practice":"In order to monitor the learning process's evolution, the model's predefined metrics should be calculated at the end of each epoch.","Description":"In order to monitor the learning process’s evolution, the model’s predefined metrics should be calculated at the end of each epoch. To understand this practice, take into account how a NN is trained. A NN is typically trained in multiple epochs. However, generally, the data in an epoch is too big to be processed at once. Therefore, the training data is divided into pieces called batches, and each batch is used to train (update) the model. On this basis, a metric (e.g., precision) should be computed at the end of each epoch. In this way, the metric reflects how the NN performs after all the training data was used to update the model. Note that this kind of monitoring could be used for regularization purposes like Early Stopping.","Task":"Optimize hyperparameters"}
{"_id":93,"Practice":"When data is insufficient to be split into train, validation, and test sets, cross-validation should be used.","Description":"When data is insufficient to be split into train, validation, and test sets, cross-validation should be used. An example of cross-validation is K-fold cross-validation. This method specifies that the whole dataset is divided into k partitions (each piece is called a fold), and the training process is executed in k iterations. For each iteration, k − 1 folds are used for training, and the remaining fold is used for testing the model. In each iteration, performance metrics are computed; then, after all the k iterations are executed, the metrics are summarized (e.g., by averaging) to compute the model performance. Take into account that with this cross-validation strategy, if a separate test set is not available, then hyper-parameter tuning is not possible.","Task":"Validate the model with the validation set or cross-validation (Calculate the metrics previously defined on the training and the validating sets)."}
{"_id":96,"Practice":"When training a neural network, in order to ensure reproducibility and fair comparison, the seed used for generating random numbers in the training process should be fixed.","Description":"When training neural networks, in order to ensure reproducibility and fair comparison, the seed used for generating random numbers in the training process should be fixed. In this way, a known seed always produces the same sequence of random numbers, and they do not change every time the NN is trained.","Task":"Train the model"}
{"_id":100,"Practice":"In order to monitor the evolution of the learning process and identify issues at an early stage, learning curves should be plotted.","Description":"In order to monitor the evolution of the learning process and identify issues at an early stage, learning curves should be plotted. For example, plotting how a model performs (e.g., accuracy for Naive Bayes) on the training and validation sets while increasing the number of samples used for training the model helps to understand whether the model needs more data to be able to generalize. Note that, in general, plotting and using learning curves are effective ways to monitor the performance of models.","Task":"Plot learning curves"}
{"_id":107,"Practice":"If a superset vocabulary was built for a natural language processing (NLP) task, then the model bias should be checked.","Description":"If a superset vocabulary (i.e., a vocabulary that was built with the whole dataset instead of only the training set) was built for an NLP task, then the model bias should be checked. The bias in these models could be checked as follows: once the model is trained with the superset, all the entries (e.g., words/n-grams) that reference elements that are not present in the training data should be set to zero (i.e., simulating that those entries do not exist). Then, the model should be executed (i.e., used to classify a text with those entries set to zero) to check whether the model’s output changes. If the output changes, then the model is biased by the superset; otherwise, the model is not biased by the superset.","Task":"Analyze the results"}
{"_id":114,"Practice":"When deploying a model, the entire pipeline should be exported instead of exporting only the model.","Description":"When deploying a model, the entire pipeline should be exported instead of exporting only the model. Therefore, the whole procedure will be replicable and will be easily deployed.","Task":"Deploy the selected model"}
{"_id":115,"Practice":"After deployment, the new data that will serve as input for the model should be constantly monitored to detect any deviation from the original data.","Description":"After deployment, the new data that will serve as input for the model should be constantly monitored to detect any deviation from the original data. In other words, some relevant characteristics of the data, such as data distribution, quality, and format, should be monitored and compared to the characteristics of the original data. The rationale behind this practice is that the data could change over time; as a result, the model could degrade, making its predictions/classifications useless.","Task":"Collect data while in production and compare changes in model inferences."}
{"_id":118,"Practice":"Across the ML stages, pipelines should be used to automate processes and save time in complex tasks.","Description":"Across the ML stages, pipelines should be used to automate processes and save time in complex tasks.","Task":"Implement the model."}
{"_id":123,"Practice":"When dealing with large datasets or large files, aspects such as parallel executions, GPU usage, and input/output efficiency should be taken into account.","Description":"When dealing with large datasets or large files, aspects such as parallel executions, GPU usage, and input/output efficiency should be taken into account. This is important because the slowest part of a standard computer is the input/output process, and if a model requires a lot of this, then the performance of the model could be affected.","Task":"Experiment design."}
{"_id":131,"Practice":"Be accountable for errors. Understand the types of errors users might encounter and have a plan for resolving them.","Description":"\"From early in the product development process, plan for the fact that your AI system will make bad predictions at some point. This is an important part of confirming that AI is the right technology for your project.\nThink through the types of errors that your system could make, and their consequences. You should have an informed point of view about what's at stake for your user for a given error and expected impact of false positive and false negative predictions.\nPlan to remediate for such errors using approaches like:\n*Setting users expectations about your system with explanations\n*Providing manual controls when the AI fails\n*Offering high-touch customer support\"","Task":"Support possible errors","Cite":"People + AI Guidebook"}
{"_id":137,"Practice":"Add context from human sources. Help users appraise your recommendations with input from third-party sources.","Description":"\"In high stakes domains or when users are new, it can be useful to provide users with contextual information from other people to help them determine the value of your recommendations. Sources of this kind of contextual information can include:\n* third-party experts, whose information can help users vet the quality of your output\n* relevant community groups, who can help users calibrate their trust through human-to-human connection\"\n","Task":"Present the system to the user","Cite":"People + AI Guidebook"}
{"_id":138,"Practice":"Determine how to show model confidence, if at all. If you decide to show model confidence, make sure it’s done in a way that’s helpful to your users.","Description":"\"In some situations, you can help users gauge how much trust to put in the AI output with model confidence displays that explain how certain the AI is in its prediction, and the alternatives considered.\nHowever, in other contexts, confidence displays can be challenging for users to understand.\nIf you decide to use them, test different types of displays early in the product development process to find what works best for your users.\"","Task":"Present the system to the user","Cite":"People + AI Guidebook"}
{"_id":139,"Practice":"Explain for understanding, not completeness. Focus on giving your users the information they need in the moment, rather than a full run-down of your system.","Description":"\"When explaining recommendations from your AI system, focus on sharing the information that users need to make decisions and move forward. Don’t attempt to explain everything that’s happening in the system. Often, the rationale behind a particular prediction is unknown or too complex to be summarized in a simple phrase or sentence. Users may also not want to be overwhelmed or distracted by superfluous explanations as they use your product.\"","Task":"Present the system to the user","Cite":"People + AI Guidebook"}
{"_id":142,"Practice":"Let users give feedback. Give users the opportunity for real-time teaching, feedback, and error correction.","Description":"\"When your AI-enabled system behaves in a way that a user doesn’t expect or want, make sure that they have an option to share feedback. And, as much as possible, use that feedback to improve your model.\nFeedback in AI systems can take a range of forms, including,\n*Giving a thumbs up or thumbs down on a recommendation\n*Hiding unwanted recommendations\n*Flagging or reporting problematic recommendations\n*More traditional feedback flows, where a user manually reports a problem through a form or other mechanism\nOnce a user gives feedback, acknowledge that you received it. If possible, let them know how the system will respond to the feedback.\"","Task":"Incorporate customers feedback.","Cite":"People + AI Guidebook"}
{"_id":143,"Practice":"Let users supervise automation. Maintaining control over automation helps users build comfort and correct when things go wrong.","Description":"\"Allow your users to supervise automation and take back control when needed, rather than automating across the board. This is helpful to users in a number of ways, including:\n*Building comfort in new or high stakes situations with the support of controls to override the system if needed.\n*Learning how the system works as they exercise their controls. For example, by manually confirming an appointment time recommendation, they become more familiar with the system’s appointment recommendations.\n*Giving them a way to complete their task when the system doesn’t work as intended.\nEven in cases where users may not frequently exercise the option to take back control, it can be helpful to let them know that they have that option, and to help them build confidence in the system.\"","Task":"Incorporate customers feedback.","Cite":"People + AI Guidebook"}
{"_id":144,"Practice":"Automate in phases. Progressively increase automation under user guidance.","Description":"\"As you design your product, think critically about the balance of automation and control that you need to offer your users for them to use your product successfully.\nIn some cases, it may make sense to offer multiple “levels” of automation in your product (for example, partial automation and full automation), to help users with varying levels of comfort or familiarity with your product use it successfully.\nIn such cases, start users with the lowest level of automation, and progressively increase automation. Make sure that users can easily adjust their choice, including dialing up or down their level of automation, and that the steps between levels are small and land well.\nChoosing the right level of automation depends heavily on your user and product needs and context.\"","Task":"Integrate the model into the process.","Cite":"People + AI Guidebook"}
{"_id":2,"Practice":"It is important to identify the needs/requirements of model retraining.","Description":"It is important to identify the needs/requirements of model retraining, e.g., whether it should be online, periodic, or in batch. For instance, if a model needs to be updated each time a user interacts with it or when the data is updated, an online approach is suitable.","Task":"Update/Retrain the model"}
{"_id":4,"Practice":"When using probabilistic forecasting in a decision system, it is necessary to decouple the probabilistic model optimization from the probability threshold selection.","Description":"When using probabilistic forecasting in a decision system, it is necessary to decouple the probabilistic model optimization from the probability threshold selection. In other words, the model hyper-parameters should be tuned first, and then the decision threshold should be adjusted. Adjusting this threshold trades off True Positives (TP) (i.e., number of positive instances correctly classified) against False Negatives (FN) (i.e., number of positive instances incorrectly classified as negative), as well as False Positives (FP) (i.e., number of negative instances incorrectly classified as positive) against True Negatives (TN) (i.e., number of negative instances correctly classified). In other words, adjusting the threshold could cause some of the instances classified as TP to be classified as FN, and the same with FP and TN. To illustrate this, imagine that a logistic regression model is used for identifying vulnerable or not vulnerable code snippets.","Task":"Define the 'success criteria'"}
{"_id":6,"Practice":"If a model is expected to detect \"something\" in an image, then representative examples of that \"something\" should be present in the training and testing data.","Description":"If a model is expected to detect “something” in an image, then representative examples of that “something” should be present in the training and testing data. For example, let us assume that an ML model is expected to recognize a component — of a mobile interface — such as a button or a text area in UI screenshots. Then, according to the practice, the data used for training and testing should have images in which the component is exhibited. It should be clarified that this also implies that the “context” of the detected object is relevant (e.g., background). For instance, if the model now has to identify buttons in dark backgrounds, then the training data should also include those types of images. Although this practice is reported by STE practitioners in the specific domain of images, it can be generalized to other artifacts (e.g., text).","Task":"Divide the data set for next steps"}
{"_id":11,"Practice":"The object region of interest (ROI) should have a similar aspect ratio in all the positive images.","Description":"The object region of interest (ROI) should have a similar aspect ratio in all the positive images. This indicates which aspect ratio the important features (i.e., features that could help to detect an object) are going to have.","Task":"Profiling data"}
{"_id":13,"Practice":"It should be determined which variables are dependent and which ones are independent.","Description":"It should be determined which variables are dependent and which ones are independent. In this way, it could be determined which attributes could be used as features/measured variables and which ones could be a target/response variable.","Task":"Profiling data"}
{"_id":17,"Practice":"Before applying any pre-processing technique, potential errors in data should be identified.","Description":"Before applying any pre-processing technique, potential errors in data should be identified. For example, if at some point in time the format changed (e.g., from MM/DD/YYYY to DD/MM/YYYY), it should be noticed at this stage; otherwise, the month and day could be mixed up without any notice.","Task":"Profiling data"}
{"_id":22,"Practice":"The size of a dataset should only be calculated after the data pre-processing.","Description":"The size of a dataset should only be calculated after the data pre-processing. This is because features and records could be deleted or added during pre-processing, which changes the original size of the raw data.","Task":"Profiling data"}
{"_id":24,"Practice":"When having an unbalanced dataset, balancing the data should be done only in the training dataset.","Description":"When having an unbalanced dataset, balancing the data (i.e., a process in which an unbalanced dataset is treated to have an equal or almost equal number of samples from each class) should only be done in the training dataset.","Task":"Adjust the class distribution of a data set e.g., oversampling, undersampling"}
{"_id":25,"Practice":"Oversampling should be done only after the whole dataset is split into train, validation and test sets.","Description":"Oversampling (i.e., a technique to balance uneven datasets by duplicating samples from the minority class) should be done only after the whole dataset is split into train, validation, and test sets.","Task":"Adjust the class distribution of a data set e.g., oversampling, undersampling"}
{"_id":26,"Practice":"When undersampling data, the samples that are being selected should be randomly sampled.","Description":"When undersampling data (i.e., a technique to balance uneven datasets by decreasing the size of the majority class by selecting some samples and keeping all of the data in the minority class), the samples that are being selected should be randomly sampled. It is to be noted that imbalanced datasets could be a challenge since most of the ML algorithms used for classification were designed with balanced classes in mind (i.e., an equal number of instances per class). This could lead to poorly performing models, especially for the minority class (i.e., the class with fewer examples), because the model has few examples of that class to learn from. In addition, this should be taken into account when calculating performance metrics. Otherwise, the selected metric could give a misleading picture of how a model performed. How the dataset should be balanced depends on multiple factors, such as whether the minority class is the main class of interest and the magnitude of the difference between the classes.","Task":"Adjust the class distribution of a data set e.g., oversampling, undersampling"}
{"_id":29,"Practice":"Features/attributes should be scaled when using a support vector machine (SVM) model.","Description":"Features/attributes should be scaled when using an SVM model. In particular, when building SVM classifiers, a decision boundary between classes is built. This boundary has the maximum distance from the nearest point of each class; therefore, if one feature/variable has very large values, it will dominate over the others when calculating the distance.","Task":"Data transformation."}
{"_id":38,"Practice":"If a variable is dependent on time along with other features, then the time-series variable could be encoded as frequencies.","Description":"If a variable is dependent on time along with other features, then the time-series variable could be encoded as frequencies. For instance, if a data set is composed of categorical variables and a set of time-related features (e.g., Metricday1, Metricday2, ..., Metricday31), then the latter could be modeled in the frequency domain (e.g., by using sine and cosine functions) to enable the use of the frequency characteristics as new features, e.g., the amplitude.","Task":"Data transformation."}
{"_id":45,"Practice":"Manual data labeling could be expensive, so it could be done by more than one person in order to scale the labeling process.","Description":"Manual data labeling could be expensive, so it could be done by more than one person in order to scale the labeling process. One example of scaling this process could be done by using cloud services, in which records are labeled by humans using services such as Amazon’s Mechanical Turk.","Task":"Define the methodology of labeling"}
{"_id":47,"Practice":"Feature engineering should be only executed with the training data and without taking into account the whole dataset.","Description":"Feature engineering should only be executed with the training data and without taking into account the whole dataset. Therefore, this process should be performed only after the training data are selected. This should be done in order to avoid data leakage (i.e., when information from future sets (testing set) is seen/learned by models, giving the model an advantage to make better predictions) between the data used for training and the remaining set(s).","Task":"Select features"}
{"_id":50,"Practice":"When using feature selection algorithms, it is important to evaluate what their assumptions are and whether they hold for a particular case.","Description":"When using feature selection algorithms, it is important to evaluate/check what their assumptions are and whether they hold for a particular case. For instance, if an algorithm is needed for this selection process, the types of inputs and outputs the algorithm needs and the type of data distribution should be checked in order to obtain accurate and useful results. For example, the Chi-Square test has some assumptions, like:\r\n(i) the data (i.e., the table that is going to be used for the analysis) should be frequencies, rather than percentages.\r\n(ii) in the contingency table (i.e., a special type of frequency distribution table, where two variables are shown simultaneously), the expected cell values should be five or more in at least 80% of the cells.\r\n(iii) cells in the contingency table are mutually exclusive.\r\n(iv) all the observations are independent.\r\n\r\nIf those assumptions are violated, then other types of tests should be applied. Otherwise, the test results could be misleading. In particular, they could produce a Type II error (i.e., the mistaken acceptance of the null hypothesis, “no relationship exists between the categorical variables”, as the result of a test procedure) or a Type I error (i.e., the mistaken rejection of a null hypothesis as the result of a test procedure).","Task":"Define the methodology of extracting and selecting features."}
{"_id":52,"Practice":"When selecting features, the existing techniques for this purpose should be checked.","Description":"There is a plethora of feature selection techniques with different assumptions and purposes; therefore, when selecting features, make sure that the most suitable one is selected. Although this practice could seem to be \"too general\", sometimes, there could be cases in which practitioners could be tempted to use previously used techniques and not search for a more suitable one.","Task":"Define the methodology of extracting and selecting features."}
{"_id":56,"Practice":"For feature selection, the regularization techniques that are part of the algorithm should be considered (e.g., Lasso, Elastic Net when high collinearity).","Description":"For feature selection, the regularization techniques that are part of the algorithm should be considered (e.g., Lasso, Elastic Net). For example, when using Logistic Regression or a Multi-layer Perceptron, feature selection with Lasso regularization should be tried, or in the case of datasets with high collinearity, Elastic Net should be considered.","Task":"Define the methodology of extracting and selecting features."}
{"_id":57,"Practice":"The usage of Lasso, Ridge, or Glmnet for feature selection in a multiple regression model should be considered.","Description":"The usage of Lasso, Ridge, or Glmnet for feature selection in a multiple regression model should be considered. The rationale behind this practice is that those algorithms take a fully multivariate view of a predictor/response relationship and avoid making high-variance, binary decisions.","Task":"Define the methodology of extracting and selecting features."}
{"_id":62,"Practice":"As much data as possible should be used when training a model.","Description":"As much data as possible should be used when training a model. However, more data does not always mean better performance. An extension to this practice could be to train a model with the currently available data and with less data in order to understand the behavior of the model regarding the amount of data used. In general, a sensitivity analysis of the amount of data could be conducted to identify if less data (than the available data) provides the same or better results.","Task":"Divide the data set for next steps"}
{"_id":66,"Practice":"When training a NN, Early Stopping should be used (e.g., after 10 epochs the model has not improved by 0.001 in accuracy).","Description":"When training a neural network, Early Stopping should be used. In other words, the model quality should be monitored on a regular basis against a validation set, and training should be stopped when a termination condition is satisfied (e.g., after 10 epochs, the model has not improved by 0.001 in accuracy). Note that Early Stopping is used as a regularization technique to prevent over-fitting.","Task":"Train the model"}
{"_id":68,"Practice":"If a neural network stagnates with non-optimal results, then the learning rate should be decreased.","Description":"If a neural network stagnates (i.e., a NN is not learning) with non-optimal results, then the learning rate should be decreased. For example, if a model aimed at classifying vulnerable functions reaches a point in the training process in which the metric that is being used does not improve, and the learning rate is 0.05, then the learning rate could be decreased to a smaller number to check if the results further improve.","Task":"Optimize hyperparameters"}
{"_id":73,"Practice":"If dependencies between features exist, it is suggested to use a gradient boosting decision tree (GBDT) instead of a Random Forest (RF).","Description":"If dependencies between features exist, it is suggested to use a Gradient Boosting Decision Tree (GBDT) instead of a Random Forest. It is important to clarify that Random Forest samples the variables in order to construct the trees. In other words, a Random Forest uses a subset of the possible variables to build the trees that make up the forest, while GBDTs do not sample dimensions.","Task":"Define which models are going to be use"}
{"_id":74,"Practice":"The number of parameters of a deep neural network should be proportional to the number of instances needed for training a deep neural network.","Description":"The number of parameters of a deep NN should be proportional to the number of instances needed for training a deep NN. A broader example of this practice could be that training a shallow learning model requires fewer instances than for a deep learning model.","Task":"Select hyper-parameters that should be consider/optimized in the process"}
{"_id":77,"Practice":"If a neural network does not converge, then the learning rate should be reduced by half or one-third.","Description":"If a neural network does not converge, then the learning rate should be reduced by half or one-third. In the literature, it is more common to find recommendations to reduce it by half.","Task":"Optimize hyperparameters"}
{"_id":80,"Practice":"If robustness is required in a machine learning model, then adversarial machine learning should be used.","Description":"Adversarial ML should be used when robustness (i.e., a property that characterizes how effective an algorithm is when tested on new data or on data similar to the original but with noise) is required. For instance, when using this type of learning, deceptive inputs are provided in an attempt to trick the trained model; therefore, the model is not only going to learn features of the original data but also properties of modified data, making the model more robust to possible changes in data.","Task":"Train the model"}
{"_id":81,"Practice":"If the time-performance of a model is being affected due to a large time-series dataset, then the data could be sub-sampled.","Description":"If the time-performance of a model is being affected due to a large time-series dataset, then the data could be sub-sampled, because sub-sampling will help to reduce the number of points the algorithm uses for training.","Task":"Train the model"}
{"_id":83,"Practice":"Before combining models, the correlation of the models' predictions should be measured.","Description":"Before combining models, the correlation of the models’ predictions should be measured. For instance, if two models are going to be combined in an ensemble model, first, each model has to predict its outputs, and afterward, a correlation analysis between these two sets of predictions should be done.","Task":"Train the model"}
{"_id":90,"Practice":"The optimization process should be done with a validation set.","Description":"The optimization process should be done with a validation set. In other words, a portion of the whole dataset should be used to find the best hyper-parameters. It is important to know that in a tuning process, the same model is trained multiple times on a training set, and in each training round new hyper-parameters should be tried. Then, each model is evaluated, and the one with the best performance is selected.","Task":"Optimize hyperparameters"}
{"_id":103,"Practice":"If you want to lower the uncertainty of the hyper-parameter tuning process, then nested cross-validation should be used.","Description":"If a lower uncertainty of the hyper-parameter tuning process is desired, then nested cross-validation should be used. The rationale behind this practice is that the nested CV technique implies that the training and the evaluation process for a specific group of hyper-parameters are done multiple times. Therefore, the variance of the performance across folds could be evaluated.","Task":"Optimize hyperparameters"}
{"_id":104,"Practice":"In order to avoid overfitting, the performance on training and testing data should be compared.","Description":"In order to avoid over-fitting, the performance on training and testing data should be compared. For instance, good training performance combined with low testing performance means that the model is not able to generalize when new data is observed.","Task":"Check for overfitting or under fitting (root-mean-square error)"}
{"_id":105,"Practice":"Use adversarial inputs for testing a model in order to ensure robustness.","Description":"Adversarial inputs should be used for testing a model in order to ensure robustness. In other words, if the robustness of a model needs to be checked, adversarial inputs (i.e., inputs purposely manipulated to make a trained model produce an incorrect prediction) should be used for testing the model. For example, in our button detection use case, to test the robustness of the detection model, an adversarial input could be an image that contains a pixelated button. Then, when testing the model with that image, if the model is robust, the button in the manipulated image should be correctly detected.","Task":"Run the algorithm with the testing part of the data set"}
{"_id":108,"Practice":"The comparison between models to identify the best model should be made on the test set.","Description":"The comparison between models to identify the best model should be made on the test set. Please note that in this case the test set is being used as a validation set, since the generalization error cannot be reported on the same set that was used to select a model. This is because the generalization error should be computed on a test set that has not been seen before the error calculation.","Task":"Compare and select the model and hyperparameters that fits better to the problem."}
{"_id":109,"Practice":"Previously annotated data should be used when testing machine learning models with unit tests.","Description":"Previously annotated data should be used when testing ML models with unit tests. It should be noted that this practice only applies to machine learning algorithms that use labeled data, such as supervised or semi-supervised learning. In addition, it should be kept in mind when building unit tests that each model has a generalization error (i.e., no model is perfect, and none can generalize to all possible data). By that, we mean that even if a sample is labeled as a positive instance, it could be classified as a negative, and this does not necessarily mean that the model is not working correctly.","Task":"Test the pipeline, techniques, and models implementation."}
{"_id":111,"Practice":"The membership of the testing data to the training data should be tested.","Description":"The membership of the new/unobserved data (testing data) to the training data should be tested. It is a good idea to check whether new data points are members of the training population (original data); otherwise, the model is obsolete/useless, since that data no longer belongs to the training population. In other words, if unobserved data do not belong to the same distribution as the original data, then the model would not be optimal on this data, since the model has not been updated.","Task":"Validate the model with the validation set or cross-validation (Calculate the metrics previously defined on the training and the validating sets)."}
{"_id":116,"Practice":"When monitoring the data distribution, if it deviates from the original one used for training and tuning models, then the model should be retrained in order to avoid degradation.","Description":"When monitoring the data distribution, if it deviates from the original one used for training and tuning, then the model should be retrained in order to avoid degradation. This practice goes a step further than monitoring and suggests an action to take: retraining the model when the data used as input changes. Retraining is suggested in order to keep the model useful and up to date.","Task":"Collect data distribution and compare it from the baseline data distribution"}
{"_id":122,"Practice":"Once a model is deployed and a request for data deletion is made, the data should be deleted across the entire pipeline.","Description":"Once a model is deployed and a request to delete some data is made, the data should be deleted across the entire pipeline. For example, if a client of an enterprise needs her data to be deleted, then the data should be deleted not only from the cleaned data but also from the raw data, and even the model should unlearn that instance.","Task":"Implement the model."}
{"_id":126,"Practice":"When dealing with a large corpus, in NLP-related tasks, sparse structures should be used to improve the implementation performance.","Description":"When dealing with a large corpus, in NLP-related tasks, sparse structures should be used to improve the implementation performance. Here, it is important to note that NLP tasks often include a vocabulary, usually represented as a sparse matrix (i.e., a matrix in which most of the elements are absent, which reduces space). Therefore, a significant portion of space is wasted when using dense structures (i.e., structures that model all the possible values; for example, a table in which each cell has an assigned value, even if it is empty), because they store values that only indicate the absence of something (e.g., a null value or a 0 to indicate that a specific word is not present in some documents of the corpus).","Task":"Implement the model."}
{"_id":128,"Practice":"Determine if AI adds value. AI is better at some things than others. Make sure that it’s the right technology for the user problem you’re solving.","Description":"\"Before you start building with AI, make sure the product or feature that you have in mind requires AI, or would be enhanced by it. AI is well-suited for applications like:\r\n*Recommending different content to different users, such as movie suggestions\r\n*Predicting future events, such as weather events or flight price changes\r\n*Natural language understanding\r\n*Image recognition\r\nA rule or heuristic-based solution may be better when:\r\n*Maintaining predictability is important\r\n*Users, customers or developers need complete transparency\r\n*People don’t want a task automated\"","Task":"Define the type of problem to solve.","Cite":"People + AI Guidebook"}
{"_id":134,"Practice":"Be transparent about privacy and data settings. From initial onboarding through ongoing use, continue to communicate about settings and permissions.","Description":"\"Many AI systems rely on user data to personalize their recommendations or optimize the system. If you are collecting user data, proactively ask users for permissions early on, and make it easy for them to adjust their settings. Over time, you might also need to remind your users of the settings that they’ve chosen so that they can update to match their current preferences. When prompting users to set or review their permissions, explain what you’re using their data for and why, and make sure your explanations are easily understandable.\"","Task":"Define ethical considerations","Cite":"People + AI Guidebook"}
{"_id":135,"Practice":"Make it safe to explore. Let users test drive the system with easily reversible actions.","Description":"\"AI systems often require access to some data or input from their users to provide them with personalized recommendations. Before asking new users what and how much data they are willing to share, give them the opportunity to test your system. For example, this can be done by offering an initial experimentation experience.  This will help new users who are eager to get started, or who don’t have time to fully consider and configure their preferences before diving in. It will also support users who are wary of sharing this information before they fully understand what the system offers in return. Even after onboarding is complete, continue to make user actions and decisions reversible where possible. User preferences and context are always changing. Giving piecemeal undo options allows them to manipulate preferences recommendations without needing to resort to a nuclear reset option.\"","Task":"Experiment design.","Cite":"People + AI Guidebook"}
{"_id":145,"Practice":"Give control back to the user when automation fails. Give your users a way to move forward even when the system fails or offers poor quality output.","Description":"\"When an AI system fails or gives users a poor prediction, the easiest path forward is often to let the user take over in a non-automated way.\nMake it as easy and intuitive as possible for users to quickly pick up where the system leaves off. Give them all the information they need to take the reins:\n*Awareness of the situation\n*What they need to do next\n*How to take their next action\nIn difficult or high-stakes situations, you may even need to redirect users to a human for extra support.\"","Task":"Support possible errors","Cite":"People + AI Guidebook"}
{"_id":149,"Practice":"Embrace “noisy” data. The real world is messy! Expect the same from the data that you gather.","Description":"\"As you develop your training dataset, don’t strive for something perfectly curated. Instead, allow some “noise” to make the data as similar as possible to the real-world data you expect to get from your users. This can help head off errors and poor quality recommendations once you release your model into the real world.\nTo do this, think about the types of data that you expect to get from your users, and then ensure that data is represented in your training set.\nFor example, for an image recognition system, consider the data you might get from your users. If it’s likely they will not have the time to take high-quality photographs and your model will have to work with blurry smartphone images, include blurry images in your training data.\"","Task":"Handle noise e.g., unwanted outliers","Cite":"People + AI Guidebook"}