Data Science: Data Pre-processing with Orange
Orange is a component-based data mining toolkit. It includes a range of data visualization, exploration, preprocessing, and modeling techniques. It can be used through an intuitive graphical user interface or, for more advanced users, as a module for the Python programming language.
Installing the Orange 3 library using pip
# Install Orange
!pip install orange3
import Orange
Discretization
Discretization is the process of transforming continuous variables, models, or functions into a discrete form. We do this by creating a set of contiguous intervals (or bins) that span the range of the variable. The example here uses equal-frequency discretization with three bins, which chooses the interval boundaries so that each bin holds roughly the same number of instances.
# Load the heart disease dataset
heartDisease = Orange.data.Table('https://raw.githubusercontent.com/biolab/orange3/master/Orange/datasets/heart_disease.tab')
# Discretize continuous features into 3 equal-frequency bins
disc = Orange.preprocess.Discretize()
disc.method = Orange.preprocess.discretize.EqualFreq(n=3)
d_heart_disease = disc(heartDisease)
# Compare the first three rows before and after
print("Original dataset:")
for e in heartDisease[:3]:
    print(e)
print("Discretized dataset:")
for e in d_heart_disease[:3]:
    print(e)
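To make the idea of equal-frequency bins concrete, here is a minimal plain-Python sketch. It is not Orange's implementation; the function names and the sample ages are made up for illustration, and placing each threshold midway between neighbouring values is just one reasonable choice.

```python
import bisect


def equal_freq_cut_points(values, n_bins):
    """Return n_bins - 1 thresholds splitting the values into groups of roughly equal size."""
    ordered = sorted(values)
    size = len(ordered)
    cuts = []
    for k in range(1, n_bins):
        idx = k * size // n_bins
        # Assumption of this sketch: put the threshold midway between neighbours
        cuts.append((ordered[idx - 1] + ordered[idx]) / 2)
    return cuts


def assign_bin(value, cuts):
    """Return the index of the interval that contains the value."""
    return bisect.bisect_right(cuts, value)


# Hypothetical ages, not taken from the heart disease dataset
ages = [29, 34, 41, 45, 52, 54, 58, 61, 67]
cuts = equal_freq_cut_points(ages, 3)
bins = [assign_bin(a, cuts) for a in ages]
print(cuts)  # [43.0, 56.0]
print(bins)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Each of the three bins receives three of the nine values, which is the property equal-frequency discretization aims for.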
Continuization
Continuization is the reverse of discretization: it transforms discrete (categorical) variables into continuous ones. A common approach is to replace each categorical variable with numeric indicator columns, so that models that only accept numeric input can use the data. In Orange, the Continuize preprocessor applies this transformation to a whole data table.
# Load the Titanic dataset, which contains only categorical variables
titanic = Orange.data.Table('https://raw.githubusercontent.com/biolab/orange3/master/Orange/datasets/titanic.tab')
continuizer = Orange.preprocess.Continuize()
titanic1 = continuizer(titanic)
# Compare the domain and one row before and after
print('Before Continuization', titanic.domain)
print('After Continuization', titanic1.domain)
print('Row at index 7 before: ', titanic[7])
print('Row at index 7 after: ', titanic1[7])
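One common way to continuize a categorical variable is one-hot (indicator) encoding. The sketch below shows the idea in plain Python; it is an illustration rather than Orange's code, and the `one_hot` helper, the `column=value` naming scheme, and the sample rows are assumptions made for this example.

```python
def one_hot(rows, column, categories):
    """Replace one categorical column with a 0/1 indicator column per category."""
    encoded = []
    for row in rows:
        # Keep every other column unchanged
        new_row = {key: val for key, val in row.items() if key != column}
        for cat in categories:
            new_row[f"{column}={cat}"] = 1.0 if row[column] == cat else 0.0
        encoded.append(new_row)
    return encoded


# Hypothetical rows shaped like the Titanic data
passengers = [
    {"status": "first", "survived": "yes"},
    {"status": "crew", "survived": "no"},
]
encoded = one_hot(passengers, "status", ["crew", "first", "second", "third"])
for row in encoded:
    print(row)
```

Each row ends up with exactly one indicator set to 1.0, so the categorical information is preserved in purely numeric form.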
Normalization
Normalization scales the values of an attribute so that they fall within a small range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally required when attributes are measured on different scales; otherwise, an attribute with large values can dominate an equally important attribute measured on a smaller scale and dilute its effect. We use the Normalize preprocessor to perform normalization.
from Orange.preprocess import Normalize
# Scale each continuous feature by its span (max - min)
normalizer = Normalize(norm_type=Normalize.NormalizeBySpan)
normalized_data = normalizer(heartDisease)
print("Before normalization: ", heartDisease[2])
print("After normalization: ", normalized_data[2])
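The arithmetic behind span-based normalization is simple enough to write out by hand. The following is a plain-Python sketch of min-max scaling, not Orange's implementation; the function name, its `zero_based` flag, and the sample readings are illustrative assumptions.

```python
def normalize_by_span(values, zero_based=True):
    """Scale values to [0, 1] (zero_based=True) or to [-1, 1] using the span max - min."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Assumption of this sketch: a constant column maps to all zeros
        return [0.0 for _ in values]
    if zero_based:
        return [(v - lo) / span for v in values]
    return [2 * (v - lo) / span - 1 for v in values]


# Hypothetical blood pressure readings
readings = [120, 130, 160, 200]
print(normalize_by_span(readings))                    # [0.0, 0.125, 0.5, 1.0]
print(normalize_by_span(readings, zero_based=False))  # [-1.0, -0.75, 0.0, 1.0]
```

After scaling, an attribute that originally ranged over hundreds of units contributes on the same footing as one that ranged over a few.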
Randomization
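With randomization, given a data table, the preprocessor returns a new table in which the data is shuffled. We use the Randomize preprocessor from the Orange library for this. The Randomize.RandomizeClasses option used here shuffles only the class values and leaves the feature values in place, which is useful for checking whether a model is learning a real relationship between features and class.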
from Orange.preprocess import Randomize
randomizer = Randomize(Randomize.RandomizeClasses)
randomized_data = randomizer(heartDisease)
print("Before Randomization: ", heartDisease[2])
print("After Randomization: ", randomized_data[2])
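As a rough illustration of what shuffling class values means, here is a small standard-library sketch. It is not how Orange implements Randomize; the `shuffle_classes` helper, the `(features, label)` row layout, and the sample data are assumptions for this example.

```python
import random


def shuffle_classes(rows, seed=None):
    """Return new (features, label) rows with labels permuted and features untouched."""
    rng = random.Random(seed)  # a fixed seed makes the shuffle repeatable
    labels = [label for _, label in rows]
    rng.shuffle(labels)
    return [(features, label) for (features, _), label in zip(rows, labels)]


# Hypothetical (features, class) pairs
patients = [((63, 145), "yes"), ((37, 130), "no"), ((41, 130), "no"), ((56, 120), "yes")]
shuffled = shuffle_classes(patients, seed=42)
for original, new in zip(patients, shuffled):
    print(original, "->", new)
```

The set of labels is the same before and after; only their assignment to rows changes, which breaks any real link between features and class.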
Conclusion
The Orange library provides several preprocessing functions for preparing data, including discretization, continuization, normalization, and randomization.