Tuesday, 08 October 2024


What is data balancing and how is it done in Python?


Data balancing is an important step in data preprocessing that aims to even out the number of samples across classes. It is usually needed in classification or pattern-recognition problems where class imbalance hurts the model. Python offers several methods for balancing data, which we will explore in this article.

Reasons for data imbalance

Data imbalance means a non-uniform distribution of samples across classes, and it usually arises when one class has far more samples than the others. Several factors can cause it:

 

Type of data collection process: In many cases the collection process is not random sampling and naturally concentrates on a particular class. For example, in medicine some diseases are much rarer than others, so far fewer samples belong to them.

Social interactions: In some cases, data balance is destroyed by social interactions and human behavior. For example, in financial fraud detection, fraudulent examples are far rarer than legitimate ones, because fraud runs contrary to normal behavior and makes up only a tiny fraction of all transactions.

Costs and time required for data collection: Collecting data is often costly and time-consuming, so there may not be enough resources to gather samples for every class, leaving class sizes uneven.

Labeling errors: Mistakes in labeling can also skew the number of samples per class. For example, when humans assign labels based on their own reading of the data, they may be less accurate on rare classes, distorting the class counts.

Sampling methods: The sampling scheme used during data collection can itself introduce imbalance. With non-random sampling in particular, the number of samples drawn from each class can differ substantially.

Data imbalance has a direct impact on the performance of machine learning models. On unbalanced data, a model may effectively ignore the undersampled class, and its performance on that class will drop. Techniques such as class weighting, covered below, help the model treat unbalanced classes appropriately.

 

1. Oversampling

Oversampling is a data balancing method in which we increase the number of minority-class samples to balance the data distribution. It is especially useful in unbalanced classification problems where one class has significantly fewer samples than the others. There are various ways to generate additional samples, but one of the best-known and most effective is SMOTE (Synthetic Minority Over-sampling Technique). SMOTE creates new synthetic samples by interpolating between each minority sample and its nearest neighbors. The main steps are as follows:

 

1. Select a sample from the minority class as the source sample.

2. Find the source sample's nearest neighbors within the minority class, typically using the k-NN (k-Nearest Neighbors) algorithm.

3. Pick one of these neighbors at random and compute the difference vector between the source sample and the chosen neighbor.

4. Create a new synthetic sample at a random point along the line segment between the two, and add it to the minority class.

5. Repeat these steps until the minority class reaches the desired size.

 

Oversampling with SMOTE or similar methods balances the data distribution and gives the minority class more influence, which can improve the performance of classification models on unbalanced data. The sketch below makes the interpolation step concrete.
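Here is a minimal NumPy sketch of the core SMOTE loop, written only to illustrate the five steps above. It is not the imbalanced-learn implementation; the function name smote_sample and its parameters are invented for this example.

import numpy as np

def smote_sample(X_minority, k=5, n_new=100, seed=None):
    # Illustrative SMOTE: interpolate between minority samples
    # and their nearest minority-class neighbors.
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        # Step 1: pick a random source sample from the minority class
        source = X_minority[rng.integers(len(X_minority))]
        # Step 2: find its k nearest neighbors (index 0 is the sample itself)
        dists = np.linalg.norm(X_minority - source, axis=1)
        neighbors = np.argsort(dists)[1:k + 1]
        # Step 3: pick one neighbor at random
        neighbor = X_minority[rng.choice(neighbors)]
        # Step 4: new sample at a random point between source and neighbor
        synthetic.append(source + rng.random() * (neighbor - source))
    # Step 5: the caller appends these synthetic rows to the minority class
    return np.array(synthetic)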

2. Undersampling

Undersampling is another data balancing method, in which we reduce the number of majority-class samples to balance the data distribution. It is likewise used in unbalanced classification problems, when one class has significantly more samples than the others. Unlike oversampling, which adds samples, undersampling removes majority-class samples until the classes are balanced. The removal can be random or guided by algorithms that select samples based on criteria such as distance or similarity. The main steps are as follows:

 

 Select a sample from the majority class as a candidate for removal.

 Examine how close the candidate lies to minority-class samples, for example with the k-NN algorithm; distance- or similarity-based criteria like this decide which majority samples to drop.

 Remove the selected majority-class samples, keeping all minority-class samples.

 Repeat these steps until the majority class is reduced to the desired size.

Undersampling balances the data distribution and gives the minority class more relative weight. It can improve the performance of classification models on unbalanced data, but it may discard important information from the majority class, so choosing the right balancing method for the problem and the data is essential. A short example follows.
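As a concrete illustration, this sketch uses RandomUnderSampler from imbalanced-learn, which randomly drops majority-class samples until the class counts match; the dataset is synthetic and mirrors the one used at the end of this article.

from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Unbalanced test data: roughly 95% of the samples fall in class 0
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95], random_state=42)

# Randomly remove majority-class samples until the classes match
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)

print("Class counts before:", Counter(y))
print("Class counts after: ", Counter(y_resampled))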

 

3. Hybrid Approaches

Hybrid approaches balance unbalanced datasets by combining oversampling and undersampling. They can work in the following ways:

 

Combined oversampling and undersampling: In this method, both techniques are applied. First, oversampling increases the number of minority-class samples; then undersampling reduces the number of majority-class samples. The aim is to take advantage of both methods and reach a balanced data distribution.

Combining oversampling and undersampling inside ensemble learning algorithms: Methods such as SMOTEBoost, ADASYNBoost, and RUSBoost apply resampling at each step of model training (each boosting iteration) to balance the data the model sees.

 

Hybrid methods exploit the advantages of both oversampling and undersampling to strike a proper balance in the data distribution. To use them well, however, the parameters and the order in which the methods run must be tuned to the specific problem at hand. The sketch below shows a ready-made combination.
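For example, imbalanced-learn ships ready-made hybrid resamplers such as SMOTETomek (SMOTE oversampling followed by Tomek-link cleaning) and SMOTEENN. A minimal sketch with SMOTETomek on synthetic data:

from collections import Counter
from imblearn.combine import SMOTETomek
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95], random_state=42)

# Oversample the minority class with SMOTE, then clean class
# boundaries by removing Tomek links
smt = SMOTETomek(random_state=42)
X_resampled, y_resampled = smt.fit_resample(X, y)

print("Class counts before:", Counter(y))
print("Class counts after: ", Counter(y_resampled))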

 

4. Class Weighting

Class weighting is a method for unbalanced data classification that balances the influence of the data by assigning different weights to different classes. In unbalanced classification, the class with fewer samples may be inadvertently ignored by the model, degrading its performance on that class. Class weighting reduces this effect by letting minority classes influence training in proportion to their importance. Two common weighting schemes are:

 

1. Inverse of the number of samples (Inverse Class Frequency): Each class receives a weight equal to the inverse of its sample count, so classes with fewer samples get larger weights and classes with more samples get smaller ones. This scheme uses the sample count alone as the measure of each class's importance.

 

2. Balanced class weighting (Class Proportional): Each class's weight is set inversely to its share of the total samples; a common form, used by scikit-learn's class_weight='balanced', is weight = n_samples / (n_classes × class count). Classes with more samples therefore receive smaller weights and classes with fewer samples receive larger ones. For a 950/50 split over 1000 samples, this gives weights of about 0.53 and 10. This scheme takes the distribution of data across classes into account.

 

Class weighting balances the influence of each class and amplifies the effect of minority classes during model training. It usually brings a significant improvement in model performance on imbalanced classification problems. A short example follows.
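In scikit-learn, most classifiers expose a class_weight parameter that implements the balanced scheme above; the weights can also be computed explicitly with compute_class_weight. A brief sketch on synthetic data:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95], random_state=42)

# "balanced" weights: n_samples / (n_classes * class count)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), weights)))

# Most scikit-learn classifiers accept the same heuristic directly
clf = LogisticRegression(class_weight="balanced").fit(X, y)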

 

In Python, several packages support data balancing; the most popular are imbalanced-learn (imported as imblearn) and scikit-learn, with SMOTE itself shipping as part of imbalanced-learn. Using these packages, you can implement all of the balancing methods described above. For example, imbalanced-learn provides RandomOverSampler and RandomUnderSampler for random oversampling and undersampling, respectively. Below is a simple example of oversampling with SMOTE:

 

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Create an unbalanced test dataset (~95% of samples in class 0)
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95], random_state=42)

# Oversample the minority class with SMOTE
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)

# Number of samples before and after oversampling
print("Number of samples before oversampling:", len(X))
print("Number of samples after oversampling:", len(X_resampled))

In this example, we first create a test dataset with the make_classification function from scikit-learn. Then, using SMOTE from imbalanced-learn, we increase the number of samples in the minority class. Finally, we print the number of samples before and after resampling. With data balancing methods like these, you can even out the data distribution before training classification models.