Missing Data Imputation in Categorical Variables: MCAR, MAR, MNAR and Methods

Jul 28, 2025

One of the most common problems encountered in data analysis and machine learning projects is missing values in datasets. While many courses and resources focus on the imputation of numerical variables, missing values in categorical variables are often overlooked or glossed over with simple methods. However, missing values, especially in categorical variables, can significantly impact the accuracy of analyses and the performance of models.

While various strategies exist for dealing with missing data, choosing the right method depends on understanding the underlying mechanism of missing data. In this blog post, we will examine missing data mechanisms (MCAR, MAR, MNAR) in detail and discuss the various methods available for imputing missing values in categorical variables. Our goal is to provide data scientists and analysts with a comprehensive guide to making informed decisions when faced with missing categorical data.

Missing Data Mechanisms: MCAR, MAR, and MNAR

Understanding why missing data occur is a critical step in determining which imputation method is most appropriate. Rubin (1976) divided missing data problems into three main categories:

Missing Completely at Random (MCAR)

MCAR refers to the situation where the probability that a value is missing is independent of both the observed and the unobserved data. Simply put, missingness occurs entirely by chance and is not related to any variable in the dataset. For example, a random error in data entry during a survey or the loss of a specific record due to a sensor malfunction can be considered MCAR. Under MCAR, the complete rows are a random subsample of the data, so dropping incomplete rows costs precision but generally does not bias estimates. Simple imputation methods (e.g., filling with the mode) are least harmful in this setting, although they still inflate the share of the imputed category. However, MCAR is quite rare in real-world datasets, and more complex missingness mechanisms are typically encountered.

Missing at Random (MAR)

MAR refers to the situation where the probability of missing data depends on other observed variables in the dataset, but not on the unobserved values of the missing variable itself. That is, the missingness can be explained by other information present in the dataset. For example, in a health survey, the data would be MAR if older respondents were less likely than younger respondents to answer questions about a particular health issue and, within each age group, nonresponse did not depend on the answer itself. In this case, the missingness is related to the variable 'age' (an observed variable). MAR is a more common and realistic scenario than MCAR. Under MAR, more sophisticated imputation methods (e.g., regression imputation, multiple imputation) can be applied using the observed variables that explain the missingness. Most modern imputation techniques are based on the MAR assumption and can produce reliable results under this assumption.
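
A quick way to look for the age pattern described above is to compare the share of missing answers across groups of an observed variable. The sketch below uses a synthetic survey; the column names (`age`, `smoker`, `age_group`) and the missingness rates are assumptions made up for illustration, not taken from any real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Hypothetical survey: 'age' is fully observed, 'smoker' has gaps.
age = rng.integers(18, 90, size=n)
smoker = rng.choice(["yes", "no"], size=n, p=[0.3, 0.7]).astype(object)

# MAR by construction: respondents aged 60+ skip the question more often.
p_missing = np.where(age >= 60, 0.40, 0.05)
smoker[rng.random(n) < p_missing] = np.nan

df = pd.DataFrame({"age": age, "smoker": smoker})
df["age_group"] = pd.cut(
    df["age"], bins=[17, 39, 59, 90], labels=["18-39", "40-59", "60+"]
)

# Share of missing answers within each observed age group.
missing_rate = df["smoker"].isna().groupby(df["age_group"], observed=True).mean()
print(missing_rate)
```

A clearly higher missing rate in one group shows that the data are not MCAR. It does not prove MAR, because a dependence on the unobserved answer itself (MNAR) cannot be checked from the observed data alone.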

Missing Not at Random (MNAR)

MNAR refers to the situation where the probability of a missing data point depends on the unobserved values of the missing variable itself. This is the most complex and difficult mechanism of missingness to handle because the information explaining the missingness is not present in the dataset. For example, low-income individuals are more likely to not report their income, or people with a certain disease tend not to report their symptoms. Here, the missingness is related to the true value of the missing variable "income" or "disease symptom." In the case of MNAR, simply ignoring the missing data or using imputation methods based on observed data can lead to seriously biased results. More advanced strategies for dealing with MNAR are often required, such as collecting additional data that can explain the missingness, using domain knowledge, or conducting sensitivity analyses. This complicates the imputation process and requires a cautious approach.
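
To make the three mechanisms concrete, here is a small simulation sketch. Everything in it is an assumption for illustration: a synthetic `age` variable, a two-level `income` bracket that depends on age, and arbitrary missingness rates. It compares the share of the "low" bracket among the values that stay observed under each mechanism.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 20_000

age = rng.integers(18, 80, size=n)
# Hypothetical link: younger respondents fall in the "low" bracket more often.
income = np.where(rng.random(n) < np.where(age < 35, 0.6, 0.3), "low", "high")

def observed_share_low(missing_mask):
    """Share of 'low' among the values that remain observed."""
    return (income[~missing_mask] == "low").mean()

# MCAR: every value has the same 20% chance of being missing.
mcar = rng.random(n) < 0.20
# MAR: missingness depends only on the observed variable 'age'.
mar = rng.random(n) < np.where(age < 35, 0.40, 0.10)
# MNAR: missingness depends on the unobserved income value itself.
mnar = rng.random(n) < np.where(income == "low", 0.40, 0.10)

shares = pd.Series({
    "true": (income == "low").mean(),
    "MCAR": observed_share_low(mcar),
    "MAR": observed_share_low(mar),
    "MNAR": observed_share_low(mnar),
})
print(shares.round(3))

# Under MAR the bias disappears once we condition on the observed variable.
young = age < 35
mar_within = pd.Series({
    "true, age < 35": (income[young] == "low").mean(),
    "MAR observed, age < 35": (income[young & ~mar] == "low").mean(),
})
print(mar_within.round(3))
```

In this setup the MCAR share stays close to the true one, the MNAR share is clearly too low, and the MAR share is off overall but matches the truth within an age group. This is why methods that condition on observed variables can work under MAR but not under MNAR.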

Imputation Methods for Categorical Variables

Once you understand the missing data mechanism, it's time to choose the methods for filling in missing values (imputation). The main imputation methods available for categorical variables are:

1. Deleting Observations

The simplest approach is to remove all rows (observations) containing missing values from the dataset. This method may be acceptable when the rate of missing data in the dataset is very low and the missingness is MCAR. However, it incurs data loss and can lead to seriously biased results, especially if the rate of missing data is high or the missingness is MAR/MNAR. Therefore, it is generally not recommended.
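
As a minimal sketch with pandas, the toy table below (its columns and values are invented for illustration) shows how to check the missing rate first and how many rows listwise deletion actually removes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "south", np.nan, "north", "south", np.nan],
    "plan": ["basic", np.nan, "pro", "basic", "pro", "basic"],
    "age": [25, 31, 47, 52, 38, 29],
})

# Share of missing values per column, to judge how costly deletion would be.
print(df.isna().mean())

# Listwise deletion: drop every row with at least one missing value.
complete_cases = df.dropna()

# Drop rows only when one specific column is missing.
region_known = df.dropna(subset=["region"])

print(len(df), len(complete_cases), len(region_known))  # 6 3 4
```

Here half of the rows disappear even though no single column is more than one-third missing, which illustrates how quickly listwise deletion shrinks a dataset when gaps are spread across columns.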

2. Filling with the Most Frequently Seen Value (Mode Imputation)

This method involves replacing missing values of a categorical variable with the most frequent category (mode) of that variable. It is easy and quick to implement. However, if the dataset contains a large number of missing values or the variable's category distribution is unbalanced, this method can distort the original distribution of the dataset and negatively impact the model's performance. It can be a simple starting point, especially in the case of MCAR.
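
A minimal pandas sketch of mode imputation follows, using a made-up `color` column. It also prints the category shares before and after filling, so the distortion mentioned above is visible.

```python
import numpy as np
import pandas as pd

s = pd.Series(
    ["red", "blue", "red", np.nan, "green", "red", np.nan, "blue"], name="color"
)

# mode() can return several values when there is a tie; take the first one.
mode_value = s.mode(dropna=True).iloc[0]
filled = s.fillna(mode_value)

# Compare category shares among observed values with shares after imputation.
before = s.value_counts(normalize=True)
after = filled.value_counts(normalize=True)
print(pd.DataFrame({"before": before, "after": after}))
```

The share of the modal category grows after filling, and it grows more the higher the missing rate is. Inside a scikit-learn pipeline, `SimpleImputer(strategy="most_frequent")` performs the same kind of fill.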

3. Filling with 'Unknown' or a New Category

Filling in missing values with a new category not present in the dataset (e.g., 'Unknown', 'Other') is one way to preserve missing information. This method can be useful when the missing values themselves carry meaningful information (e.g., not answering a question might be a choice) or when MNAR is suspected. However, how this new category will be interpreted by the model and its impact on model performance must be carefully considered.
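
The sketch below shows one way to do this with pandas on a made-up `education` column. The extra `education_missing` flag is an optional addition for illustration, not something the method requires.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"education": ["BSc", np.nan, "MSc", "BSc", np.nan, "PhD"]})

# Optional flag that records which rows were originally missing.
df["education_missing"] = df["education"].isna()

# Replace missing values with an explicit new category.
df["education"] = df["education"].fillna("Unknown")

# After one-hot encoding, the new category gets its own column.
encoded = pd.get_dummies(df["education"], prefix="education")
print(encoded.columns.tolist())
```

If the column uses the pandas `category` dtype, the new label has to be registered first (for example with `Series.cat.add_categories`), because filling with a value outside the existing categories raises an error.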

4. Predictive Imputation Methods

These methods use other variables to impute missing values. They are more sophisticated and generally more accurate, and are particularly effective in the case of MAR. Predictive methods that can be used for categorical variables include:

Classification Models: A classification model (e.g., Decision Trees, Random Forest, Logistic Regression) is trained using the variable containing the missing value as the target variable and other variables in the dataset. This model is then used to predict missing values.
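
As a sketch of the classification approach, the example below trains a scikit-learn `RandomForestClassifier` on the rows where the category is known and predicts it for the rest. The dataset is synthetic, and the column names (`age`, `spend`, `segment`) and the rule linking them are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: 'segment' depends on two fully observed numeric features.
age = rng.integers(18, 70, size=n)
spend = rng.normal(100, 30, size=n)
segment = np.where((age > 40) & (spend > 100), "premium", "standard").astype(object)

df = pd.DataFrame({"age": age, "spend": spend, "segment": segment})
true_segment = df["segment"].copy()          # kept only to check the result
df.loc[rng.random(n) < 0.2, "segment"] = np.nan

features = ["age", "spend"]
known = df["segment"].notna()

# Train on rows where the category is observed ...
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(df.loc[known, features], df.loc[known, "segment"])

# ... and predict it for rows where it is missing.
df.loc[~known, "segment"] = model.predict(df.loc[~known, features])

accuracy = (df.loc[~known, "segment"] == true_segment[~known]).mean()
print(f"accuracy on imputed rows: {accuracy:.2f}")
```

The accuracy check is only possible here because the true values were kept aside in the simulation. On real data, a comparable check is to hide some observed values on purpose and see how well the model recovers them.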

K-Nearest Neighbor (KNN) Imputation: KNN fills in missing values by looking at the K observations that are most similar (nearest) to the observation with the missing value. For categorical variables, the most frequent category among those neighbors can be used. However, because scikit-learn's KNNImputer works only with numerical data and imputes an average of the neighbors' values, it is typically applied after generating numerical representations of categorical variables using methods such as one-hot encoding, and the imputed numbers then have to be mapped back to a category.
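
One possible way to combine one-hot encoding with scikit-learn's `KNNImputer` is sketched below. The data are synthetic (three made-up groups in a 2-D space), and the step of taking the largest imputed dummy value as the category is a design choice of this sketch: each imputed dummy is the share of neighbors in that category, so the largest value corresponds to a majority vote.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 600

# Hypothetical data: three well-separated groups in a 2-D numeric space.
labels = rng.choice(["A", "B", "C"], size=n)
centers = {"A": (0.0, 0.0), "B": (5.0, 5.0), "C": (0.0, 5.0)}
xy = np.array([centers[g] for g in labels]) + rng.normal(0, 0.7, size=(n, 2))

df = pd.DataFrame({"x": xy[:, 0], "y": xy[:, 1], "group": labels})
truth = df["group"].copy()                   # kept only to check the result
missing = rng.random(n) < 0.15
df.loc[missing, "group"] = np.nan

# One-hot encode, then blank out the dummy columns where the category is missing.
dummies = pd.get_dummies(df["group"], dtype=float)
dummies.loc[missing, :] = np.nan

# Scale numeric features so that no single feature dominates the distance.
numeric = pd.DataFrame(
    StandardScaler().fit_transform(df[["x", "y"]]), columns=["x", "y"]
)
matrix = pd.concat([numeric, dummies], axis=1)

imputed = KNNImputer(n_neighbors=5).fit_transform(matrix)

# Imputed dummies are neighbor shares per category; argmax picks the majority.
shares = imputed[:, 2:]
categories = dummies.columns.to_numpy()
df.loc[missing, "group"] = categories[shares[missing].argmax(axis=1)]

accuracy = (df.loc[missing, "group"] == truth[missing]).mean()
print(f"accuracy on imputed rows: {accuracy:.2f}")
```

With ties between categories, `argmax` simply takes the first one, so a different tie-breaking rule may be preferable in practice.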

Multiple Imputation by Chained Equations (MICE): MICE is one of the most powerful and flexible methods for dealing with missing data. It imputes each incomplete variable with a model fitted on the other variables (for a categorical variable, typically logistic or multinomial regression), cycling through the incomplete variables in turn. The cycle is repeated for several iterations until the imputed values stabilize. MICE also accounts for imputation uncertainty by generating multiple imputed datasets, whose results are then pooled, allowing for more reliable statistical inferences. It is particularly effective in the case of MAR.
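
The chained-equations idea can be sketched by hand for two categorical variables. This is a simplified illustration under stated assumptions, not a reference MICE implementation: the data are synthetic, the helper names `make_data` and `chained_imputation` are hypothetical, logistic regression is used as the conditional model, and a bootstrap of the observed rows stands in for drawing model parameters. Dedicated packages handle convergence checks and pooling more carefully.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def make_data(n=800, seed=0):
    """Synthetic data: numeric x, categorical a (depends on x) and b (depends on a)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    a = np.where(rng.random(n) < 1 / (1 + np.exp(-2 * x)), "yes", "no")
    b = np.where(
        a == "yes",
        rng.choice(["gold", "silver"], size=n, p=[0.7, 0.3]),
        rng.choice(["gold", "silver"], size=n, p=[0.2, 0.8]),
    )
    df = pd.DataFrame({"x": x, "a": a, "b": b}).astype({"a": object, "b": object})
    # MAR: missingness in a and b depends only on the fully observed x.
    df.loc[rng.random(n) < np.where(x > 0, 0.3, 0.1), "a"] = np.nan
    df.loc[rng.random(n) < np.where(x > 0, 0.1, 0.3), "b"] = np.nan
    return df

def chained_imputation(df, cat_cols, n_iter=5, seed=0):
    """Return one completed copy of df using chained logistic regressions."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    masks = {c: df[c].isna().to_numpy() for c in cat_cols}
    # Start from random draws of the observed categories.
    for c in cat_cols:
        observed = df.loc[~masks[c], c].to_numpy()
        out.loc[masks[c], c] = rng.choice(observed, size=masks[c].sum())
    for _ in range(n_iter):
        for c in cat_cols:
            # Predictors: all other columns, with categoricals one-hot encoded.
            X = pd.get_dummies(out.drop(columns=c), drop_first=True, dtype=float).to_numpy()
            y = out[c].to_numpy()
            obs = np.flatnonzero(~masks[c])
            # Bootstrap the observed rows as a rough stand-in for parameter uncertainty.
            boot = rng.choice(obs, size=obs.size, replace=True)
            model = LogisticRegression(max_iter=1000).fit(X[boot], y[boot])
            proba = model.predict_proba(X[masks[c]])
            # Draw from the predicted distribution instead of taking the top class.
            out.loc[masks[c], c] = [rng.choice(model.classes_, p=p) for p in proba]
    return out

df = make_data()
m = 5
completed = [chained_imputation(df, ["a", "b"], seed=s) for s in range(m)]

# Pool one estimate (share of 'gold' in b) across the m datasets with Rubin's rules.
estimates = np.array([(d["b"] == "gold").mean() for d in completed])
within = np.mean(estimates * (1 - estimates) / len(df))
between = estimates.var(ddof=1)
pooled = estimates.mean()
total_var = within + (1 + 1 / m) * between
print(f"pooled share of gold: {pooled:.3f} (std. error {np.sqrt(total_var):.3f})")
```

Drawing categories from the predicted probabilities, rather than always taking the most likely class, is what lets the imputed datasets differ from one another, and the between-imputation variance then reflects the uncertainty about the missing values.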

Dealing with missing data in categorical variables is a crucial part of the data analysis process. Understanding the underlying mechanism of missing data (MCAR, MAR, MNAR) is a fundamental step in selecting the right imputation method. While MCAR is the simplest case, MAR and MNAR are more frequently encountered in the real world. Under MAR, predictive imputation methods (MICE, classification models) offer effective solutions, while MNAR is the most challenging scenario and often requires additional information or sensitivity analyses.

Each imputation method has its own advantages and disadvantages, so data scientists should choose a method by considering the characteristics of their datasets, the missingness mechanism, and the purpose of their analysis. No imputation method is perfect, and every imputed value carries some uncertainty. It is therefore useful to conduct sensitivity analyses to assess how robust the post-imputation results are.

We hope this blog post has provided you with a comprehensive perspective on dealing with missing data in categorical variables. With the right imputation strategies, you can obtain more reliable and meaningful results from your datasets.

References

[1] Rubin, D. B. (1976). Inference and Missing Data. Biometrika, 63(3), 581–592.

[2] Analytics Vidhya. (2021). Handling Missing Values of Categorical Variables. Retrieved from: https://www.analyticsvidhya.com/blog/2021/04/how-to-handle-missingvalues-of-categorical-variables/

[3] van Buuren, S. (n.d.). 1.2 Concepts of MCAR, MAR and MNAR. Flexible Imputation of Missing Data. Retrieved from: https://stefvanbuuren.name/fimd/sec-MCAR.html