A-Z Data Science Glossary

A

Accuracy

Accuracy is a performance metric commonly used in machine learning to evaluate how well a model can correctly predict the outcome of a classification task.
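
As a minimal sketch of the idea, accuracy is simply the fraction of correct predictions; the labels below are made up for illustration and scikit-learn is assumed to be available:

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and model predictions
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

# Accuracy = correct predictions / total predictions = 5 / 6
print(accuracy_score(y_true, y_pred))  # ~0.833
```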

Adaboost

AdaBoost (Adaptive Boosting) is an ensemble learning algorithm used in machine learning for classification and regression problems. The algorithm combines multiple weak learners into a single strong learner, where each successive weak learner is trained on a reweighted version of the data that puts more emphasis on the examples the previous learners misclassified.

Algorithm

A set of instructions that a computer program follows to solve a particular problem or complete a specific task.

Anomaly Detection

Anomaly detection refers to the use of machine learning algorithms or other techniques to identify unusual or anomalous behavior or events within a system or dataset. It involves training a model to recognize patterns in data and then using that model to detect data points that deviate significantly from the norm.

Artificial Intelligence

The branch of computer science that focuses on creating machines that can perform tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and language translation.

Artificial Neural Network

A computational model that is inspired by the structure and function of the human brain and is used in machine learning to recognize patterns and make predictions.

Association Rule Learning

A method used in machine learning to find patterns and relationships between variables in a large dataset.

B

Balanced Accuracy

Balanced Accuracy is a performance metric used in machine learning to evaluate the accuracy of a classification model when the classes in the dataset are imbalanced; it is computed as the average of the recall obtained on each class.

Batch Learning

A training method in which the model is trained on the entire available dataset at once (offline), rather than being updated incrementally as new data arrives, as in online learning.

Bayesian Network

A probabilistic graphical model that represents a set of random variables and their conditional dependencies using a directed acyclic graph.

Bias

A systematic error that occurs when a model consistently overestimates or underestimates the true values of a target variable.

Big Data

Extremely large and complex datasets that cannot be analyzed using traditional data processing techniques.

Black-box effect

The black-box effect refers to the situation where a machine learning model is able to make accurate predictions, but the internal workings of the model are not fully understood by the user or developer. This lack of transparency can make it difficult to interpret the model's behavior and make it challenging to debug and improve the model.

Bucket

A bucket is a term commonly used in cloud computing to refer to a container or storage repository for holding large amounts of data. Buckets are typically used for storing structured and unstructured data such as text, images, audio, and video files. These data buckets can be accessed by applications and services that need to process or analyze the data stored within them.

C

Classification

A supervised learning technique used to predict the class or category of a target variable based on the values of one or more input variables.

Clustering

A technique used in unsupervised learning to group similar objects together in a dataset based on their attributes.

Collaborative Filtering

A technique used in recommender systems to predict user preferences by analyzing patterns in the behavior of similar users.

Confusion Matrix

A confusion matrix is a performance evaluation matrix used in machine learning to evaluate the accuracy of a classification model. It is a matrix that shows the number of true positives, false positives, true negatives, and false negatives for a set of predictions compared to the actual class labels.
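
A small illustrative sketch, assuming scikit-learn and made-up binary labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical binary labels
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# For labels ordered [0, 1] the matrix is:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))  # [[2 1], [1 2]]
```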

Convolutional Neural Network

A type of neural network commonly used in computer vision applications, which applies filters to an input image to identify patterns and features.

Counterfactual Explanations

Counterfactual explanations in machine learning refer to explanations generated by algorithms that describe what changes to an input would result in a different output.

Cross-Validation

A technique used to evaluate the performance of a machine learning model by testing it on multiple subsets of the data.
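
A minimal sketch of the idea using scikit-learn's `cross_val_score` on the bundled iris dataset; the model and dataset are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Fit and evaluate the model on 5 different train/test splits of the data
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())
```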

D

Data

Raw facts and figures that are collected and processed to extract useful information.

Database

A database refers to a structured collection of data that is organized and stored in a way that allows it to be easily accessed, managed, and updated.

Data Drift

Data drift in machine learning refers to the phenomenon of changes in the input data used to train a machine learning model over time, which can cause the model's performance to deteriorate.

Data Mining

The process of discovering patterns and relationships in large datasets using statistical and machine learning techniques.

Data Science

Data science is an interdisciplinary field that involves using statistical, computational, and machine learning methods to extract insights and knowledge from data.

Decision Tree

A type of model that uses a tree-like graph to represent decisions and their possible consequences.

Deep Learning

A subfield of machine learning that uses neural networks with multiple layers to learn hierarchical representations of data.

Dimensionality Reduction

A technique used to reduce the number of variables in a dataset while retaining as much information as possible.

E

Elastic Net

A regularized linear regression model that combines both L1 and L2 regularization.

Embedding

An embedding is a representation of a feature or object as a vector or a set of vectors in a lower-dimensional space. The goal of an embedding is to capture the important characteristics of the feature or object in a way that is useful for machine learning algorithms. In general, embeddings are useful because they allow machine learning algorithms to work with high-dimensional data in a more efficient and effective way.

Ensemble Learning

A technique that combines multiple machine learning models to improve their accuracy and reduce the risk of overfitting.

Expectation-Maximization

A statistical algorithm used to estimate the parameters of a probabilistic model when some of the data is missing or incomplete.

Explainability

Explainability (sometimes referred to as "interpretability" or "transparency") in machine learning refers to the ability to understand and interpret the decisions and predictions made by a machine learning model. It helps mitigate the black-box effect.

F

Factor Analysis

A statistical method used to reduce the number of variables in a dataset by identifying underlying factors that explain the patterns in the data.

Fairness

A concept in machine learning that refers to ensuring that the algorithm does not discriminate against individuals or groups based on their race, gender, age, or other protected attributes.

Feature

A measurable property or characteristic of a dataset that is used as input to a machine learning model.

Feature Selection

A technique used to select the most relevant features or variables in a dataset for a given task.

Federated Learning

A distributed learning technique that enables multiple parties to collaborate on training a machine learning model without sharing their raw data.

Fuzzy Logic

A form of logic that allows for degrees of truth, rather than just binary true/false values.

G

Gaussian Process

A probabilistic model used for regression and classification tasks that defines a distribution over functions, where any finite collection of points follows a multivariate normal distribution.

Generative Model

A type of machine learning model that learns to generate new data that is similar to the training data.

Genetic Algorithm

An optimization algorithm inspired by the process of natural selection and evolution that uses a population of candidate solutions and genetic operators to find the optimal solution.

GPU

Graphics Processing Unit, a specialized processor used to accelerate the training of machine learning models.

Gradient Descent

An optimization algorithm used to minimize the error of a machine learning model by adjusting the parameters in the direction of the steepest descent of the cost function.
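
A toy sketch of the idea, minimizing a simple one-dimensional function whose gradient is known in closed form; the function and learning rate are arbitrary choices for illustration:

```python
# Minimize f(w) = (w - 3)^2, whose gradient is f'(w) = 2 * (w - 3)
w = 0.0
learning_rate = 0.1

for _ in range(100):
    gradient = 2 * (w - 3)
    w -= learning_rate * gradient  # step opposite to the gradient

print(w)  # converges towards the minimum at w = 3
```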

Grid Search

Grid search is a hyperparameter tuning technique in machine learning that involves exhaustively searching over a grid of candidate hyperparameter values to find the best combination for a given model.
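
A minimal sketch, assuming scikit-learn; the model, grid values, and dataset are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Evaluate every combination of the listed values with 5-fold cross-validation
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```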

GRU

"Gated Recurrent Unit", a type of neural network architecture that is especially good at understanding sequences of data, such as sentences or time series data.

H

Hierarchical Clustering

A clustering technique that builds a hierarchy of clusters by recursively merging or dividing existing clusters.

High-dimensional Data

Data that has a large number of features or variables, which can make it difficult to analyze using traditional statistical methods.

Hyperparameter

A parameter of a machine learning model that is set before training and affects the model's performance but is not learned from the data.

Hyperparameter Search

Hyperparameter search is the process of finding the optimal values for the hyperparameters of a model, i.e. the settings that are fixed before training and are not learned from the data.

I

Imbalanced Data

A dataset where the number of examples in each class or category is not equal, which can lead to biased or inaccurate models.

Inference

The process of using a trained machine learning model to make predictions or estimate the values of unknown variables.

Information Theory

A branch of mathematics and computer science that studies the quantification, storage, and communication of information.

Instance-based Learning

A type of machine learning in which the model is based on similarity between the training instances and the new input.

Interpolation

In time-series analysis, interpolation is a technique used to estimate missing or incomplete values in a time series dataset based on the known values in the series. Some common interpolation methods used in time series analysis include linear interpolation, spline interpolation, and polynomial interpolation.

Interpretability

The degree to which the inner workings of a machine learning model can be understood and explained by humans.

J

Jupyter Notebook

An open-source web application used for interactive data analysis and visualization that supports multiple programming languages, including Python and R.

K

K-Means Clustering

A popular clustering technique that partitions a dataset into k clusters by minimizing the sum of squared distances between the data points and their cluster centroids.
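
A small illustrative sketch with made-up 2-D points, assuming scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two visually obvious groups of 2-D points
X = np.array([[1, 1], [1.5, 2], [1, 1.5],
              [8, 8], [8.5, 9], [9, 8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # centroid of each cluster
```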

Kernel Method

A technique used in machine learning to implicitly transform the input data into a higher-dimensional space, where patterns that are not linearly separable in the original space can become linearly separable (the "kernel trick" used by support vector machines).

KNN

K-Nearest Neighbors, a simple classification algorithm that predicts the class of a new data point based on the classes of its k nearest neighbors in the training set.
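
A minimal sketch on the bundled iris dataset, assuming scikit-learn; the choice of k = 5 is arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each test point gets the majority class of its 5 nearest training points
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))
```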

Knowledge Representation

The process of creating a formal system for storing and manipulating knowledge in a computer or AI system.

KFold

K-fold cross-validation is a technique in machine learning for estimating the performance of a model on unseen data.
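
A minimal sketch of the manual k-fold loop, assuming scikit-learn; the model and dataset are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Train on k-1 folds, evaluate on the held-out fold, repeat k times
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    print(model.score(X[test_idx], y[test_idx]))
```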

L

Learning Rate

A hyperparameter in machine learning optimization algorithms that controls the step size of parameter updates during training.

Linear Regression

A statistical model used to predict a continuous target variable based on one or more predictor variables, assuming a linear relationship between them.

Logistic Regression

A statistical model used to predict the probability of a binary outcome based on one or more predictor variables.

Loss Function

A function that measures the difference between the predicted output of a machine learning model and the true output, used to guide the training process towards better performance.

LSTM

Long Short-Term Memory, a type of recurrent neural network architecture that can learn long-term dependencies in sequential data.

M

Multiclass Classification

A classification task where the goal is to predict the class of a data point from three or more possible classes.

Multi-Layer Perceptron

A type of neural network architecture composed of multiple layers of interconnected neurons.

Markov Chain

A mathematical model that describes a sequence of events where the probability of each event depends only on the previous event.

Model Selection

The process of choosing the best machine learning model among a set of candidate models based on their performance on a validation set.

Monte Carlo Simulation

A method of statistical analysis that uses random sampling to estimate the probability distribution of an outcome.

MLOps

MLOps (Machine Learning Operations) is a practice in machine learning that focuses on the deployment, management, and optimization of machine learning models in production environments.

N

Natural Language Processing (NLP)

The field of AI that focuses on the interaction between computers and human language, including tasks such as language translation, sentiment analysis, and text classification.

Neural Network

A type of machine learning model inspired by the structure and function of the human brain, composed of interconnected neurons that process and transmit information.

Noise Reduction

The process of removing unwanted noise or artifacts from data, to improve the quality and reliability of machine learning models.

Non-parametric Model

A type of machine learning model that does not assume a specific functional form for the underlying distribution of the data.

Normalization

The process of scaling and transforming input data to a common range or distribution, for example rescaling features to the [0, 1] range or standardizing them to zero mean and unit variance, to make them more suitable for machine learning algorithms.
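
A small sketch showing both variants, assuming scikit-learn and made-up feature values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two made-up features on very different scales
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

print(MinMaxScaler().fit_transform(X))    # each column rescaled to [0, 1]
print(StandardScaler().fit_transform(X))  # each column to zero mean, unit variance
```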

O

Object Detection

The task of identifying and localizing objects of interest in an image or video, often used in applications such as self-driving cars and security systems.

One-Hot Encoding

A technique used to represent categorical data as binary vectors, where each feature corresponds to a unique category and has a value of 1 or 0.
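
A minimal sketch using pandas' `get_dummies`, with a made-up categorical column:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Each category becomes its own binary column
print(pd.get_dummies(df, columns=["color"]))
```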

Optimization

The process of finding the best set of model parameters that minimize the loss function, often using techniques such as gradient descent and stochastic gradient descent.

Outlier Detection

The process of identifying data points that deviate significantly from the expected pattern or distribution, which can help identify errors or anomalies in the data.

Overfitting

A common problem in machine learning where a model is overly complex and fits the training data too closely, leading to poor generalization and performance on new data.

P

PCA

Principal Component Analysis, a dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional representation by identifying the most important features that explain the variance in the data.
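
A minimal sketch on the bundled iris dataset, assuming scikit-learn; keeping 2 components is an arbitrary choice:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the 4 iris features onto the 2 directions of largest variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```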

Precision

A performance metric used to evaluate the quality of a classification model; precision measures the proportion of true positives among all predicted positives.
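
A small illustrative sketch with made-up labels (the same ones as in the confusion matrix example above), assuming scikit-learn:

```python
from sklearn.metrics import precision_score

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Precision = TP / (TP + FP) = 2 / (2 + 1)
print(precision_score(y_true, y_pred))  # ~0.667
```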

Perceptron

A type of neural network architecture consisting of a single layer of weights that outputs a binary decision based on a weighted sum of its inputs, used for binary classification tasks.

Predictive Modeling

The process of using machine learning algorithms to predict future outcomes or events based on historical data.

Preprocessing

The process of cleaning, transforming, and preparing data before feeding it into a machine learning model.

Probability Distribution

A function that describes the probability of different outcomes or events in a random experiment, such as a coin toss or a dice roll.

PDP

A Partial Dependence Plot (PDP) is a visualization technique used in machine learning to understand the relationship between a target variable and one or more input variables in a model.

Q

Quantization

The process of reducing the number of unique values in a dataset or feature, often used to reduce the storage and computational requirements of machine learning models.

Query Optimization

The process of selecting the most efficient execution plan for a database query, to minimize the response time and resource usage.

R

Random Forest

A type of ensemble learning method that combines multiple decision trees to make more accurate predictions, often used for classification and regression tasks.

Recurrent Neural Network

A type of neural network architecture that can process sequential data by using feedback connections between the neurons, often used for natural language processing and time series analysis.

Regression

Regression is a type of machine learning algorithm used to predict a continuous numerical output variable based on one or more input variables.

Reinforcement Learning

A type of machine learning that focuses on learning through interaction with an environment, where the agent receives rewards or penalties based on its actions and learns to maximize its cumulative reward.

Ridge Regression

A variant of linear regression that adds a regularization term to the loss function, to prevent overfitting and improve the generalization performance.

Root Mean Squared Error

A common metric used to evaluate the performance of regression models, which measures the square root of the average squared difference between the predicted and actual values.
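
A minimal sketch of the formula with made-up values, using NumPy:

```python
import numpy as np

# Made-up actual values and model predictions
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# RMSE = sqrt(mean((y_true - y_pred)^2))
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)
```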

ROC Curve

A ROC (Receiver Operating Characteristic) Curve is a graphical representation of the performance of a binary classification model in machine learning.

ROC AUC Score

ROC AUC (Receiver Operating Characteristic Area Under the Curve) Score is a performance metric used in binary classification tasks in machine learning.
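
A minimal sketch with made-up scores, assuming scikit-learn; note that the metric takes predicted probabilities (or scores) for the positive class, not hard class labels:

```python
from sklearn.metrics import roc_auc_score

# Made-up true labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

print(roc_auc_score(y_true, y_scores))  # 0.75
```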

S

Sentiment Analysis

The task of automatically identifying the sentiment or opinion expressed in a piece of text, often used for social media analysis and customer feedback analysis.

Singular Value Decomposition

A matrix factorization technique that decomposes a matrix into its singular values and corresponding singular vectors, often used for dimensionality reduction and data compression.

Stochastic Gradient Descent (SGD)

A variant of gradient descent that uses a random subset of the training data at each iteration to estimate the gradient of the loss function, which can improve the efficiency and scalability of the algorithm.

Supervised Learning

A type of machine learning where the model learns from labeled examples, where the input features are associated with a known output or target value.

Support Vector Machine (SVM)

A powerful and versatile classification algorithm that finds the hyperplane that maximally separates the classes in the input space, often used for both linear and non-linear classification tasks.

Survival analysis

Survival analysis is a statistical method used in machine learning to analyze and model time-to-event data, where the outcome of interest is the time until a specific event occurs, such as death, failure, or disease diagnosis.

T

Test Data

The data used to evaluate the performance of a machine learning model.

Text Mining

The process of extracting useful information and insights from unstructured text data, often using techniques such as natural language processing and machine learning.

Time-Series Analysis

The process of analyzing and modeling time-dependent data, often used for forecasting and prediction tasks in finance, economics, and engineering.

Training Data

The data used to train a machine learning model.

Transfer Learning

A technique where a pre-trained machine learning model is used as a starting point for a new task, often with fine-tuning and additional training on a small amount of task-specific data.

Transformer

A neural network architecture based on self-attention mechanisms that processes entire sequences in parallel, originally introduced for machine translation and now widely used in natural language processing and beyond (e.g., BERT and GPT).

Tree-Based Model

A type of machine learning model that uses decision trees to make predictions or classifications, often used for both regression and classification tasks.

U

Underfitting

A problem that occurs when a machine learning model is too simple and cannot capture the underlying patterns in the data, resulting in poor performance on both the training and test sets.

Uniform Distribution

A probability distribution where all outcomes or events have an equal chance of occurring, often used in random sampling and simulation studies.

Universal Approximation Theorem

A theorem that states that a neural network with a single hidden layer and a sufficient number of neurons can approximate any continuous function on a compact domain to any desired level of accuracy, under mild assumptions on the activation function.

Unsupervised Learning

A type of machine learning where the model learns from unlabeled data, without any predefined output or target values.

Unsupervised Feature Learning

The process of learning useful features or representations from unlabeled data, often used as a pre-processing step for supervised learning tasks or for unsupervised learning tasks such as clustering and anomaly detection.

V

Validation Set

A subset of the training data that is used to evaluate the performance of a machine learning model during training, to prevent overfitting and select the best hyperparameters.

Vanishing Gradient Problem

A problem that occurs in deep neural networks when the gradients become too small and the weights cannot be updated effectively, often addressed by using techniques such as weight initialization, non-linear activation functions, and batch normalization.

Variance

A statistical measure that describes how much the values in a dataset deviate from their mean, often used to quantify the variability or spread of the data.

Vector Space Model

A mathematical model that represents documents or text data as vectors in a high-dimensional space, often used for information retrieval and text classification tasks.

Viterbi Algorithm

An algorithm that finds the most likely sequence of hidden states in a hidden Markov model, often used for speech recognition and natural language processing.

W

Weighted Average

A method of aggregating values that assigns different weights to each value based on their importance or relevance, often used for feature engineering and ensemble learning.

Whitening Transformation

A linear transformation that decorrelates and normalizes the input data, often used as a pre-processing step to improve the performance of machine learning models.

Wide and Deep Learning

A hybrid neural network architecture that combines a wide linear model with a deep neural network, often used for recommender systems and advertising.

Word Embedding

A technique that maps each word in a vocabulary to a dense vector in a continuous vector space, often used to represent and compare words in natural language processing tasks.

X

XGBoost

An optimized implementation of gradient boosting that uses a variety of advanced techniques to improve the accuracy and efficiency of the algorithm, often used for classification and regression tasks.

Xavier Initialization

A technique for initializing the weights of a neural network, which ensures that the variance of the outputs of each layer is approximately equal to the variance of the inputs, often used to prevent the vanishing or exploding gradient problem.

Z

Zero Bias

A condition where the expected value of the errors in a statistical model is zero, often considered a desirable property of the model.

Zero-Shot Learning

A type of machine learning where the model can recognize and classify objects that it has never seen before, by learning to generalize from known objects and their attributes.

Z-Score

A standardized score that measures the number of standard deviations a data point is from the mean, often used to identify outliers and anomalies in a dataset.
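
A minimal sketch of the formula with made-up values, using NumPy:

```python
import numpy as np

# Made-up values, with one obvious outlier
x = np.array([10.0, 12.0, 11.0, 13.0, 40.0])

# z = (value - mean) / standard deviation
z = (x - x.mean()) / x.std()
print(z)  # the value 40 gets a large positive z-score, flagging it as unusual
```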

Zero-Inflated Model

A type of statistical model that accounts for excess zeros in count data, often used in fields such as ecology, epidemiology, and finance.