Sitemap

Data Science: Data Pre-processing with Orange

3 min readSep 17, 2021
Press enter or click to view image in full size

Orange is a component-based data mining software. It includes a range of data visualization, exploration, preprocessing, and modeling techniques. It can be used through a nice and intuitive user interface or, for more advanced users, as a module for the Python programming language.

Installing Orange 3 library using pip

# Install Orange
!pip install orange3
import Orange

Discretization

Discretization is the process through which we can transform continuous variables, models, or functions into a discrete form. We do this by creating a set of contiguous intervals (or bins) that go across the range of our desired variable/model/function.

heartDisease = Orange.data.Table('https://raw.githubusercontent.com/biolab/orange3/master/Orange/datasets/heart_disease.tab')
disc = Orange.preprocess.Discretize()
disc.method = Orange.preprocess.discretize.EqualFreq(n=3)
d_heart_disease = disc(heartDisease)
print("Original dataset:")
for e in heartDisease[:3]:
print(e)
print("Discretized dataset:")
for e in d_heart_disease[:3]:
print(e)
Press enter or click to view image in full size

Continuization

A continuation reifies the program control state, i.e. the continuation is a data structure that represents the computational process at a given point in the process execution; the created data structure can be accessed by the programming language, instead of being hidden in the runtime environment.

titanic = Orange.data.Table('https://raw.githubusercontent.com/biolab/orange3/master/Orange/datasets/titanic.tab')
continuizer = Orange.preprocess.Continuize()
titanic1 = continuizer(titanic)
print('Before Continuization',titanic.domain)
print('After Continuization',titanic1.domain)
print('7th row of data before : ',titanic[7])
print('7th row of data after : ',titanic1[7])
Press enter or click to view image in full size

Normalization

Normalization is used to scale the data of an attribute so that it falls in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. Normalization is generally required when we are dealing with attributes on a different scale, otherwise, it may lead to a dilution ineffectiveness of an important equally important attribute(on a lower scale) because of other attributes having values on a larger scale. We use the Normalize function to perform normalization.

from Orange.preprocess import Normalizenormalizer = Normalize(norm_type=Normalize.NormalizeBySpan)
normalized_data = normalizer(heartDisease)
print("Before Normalization: ",heartDisease[2])
print("After noramlization: ",normalized_data[2])
Press enter or click to view image in full size

Randomization

With randomization, given a data table, the preprocessor returns a new table in which the data is shuffled. Randomize function is used from the Orange library to perform randomization.

from Orange.preprocess import Randomizerandomizer = Randomize(Randomize.RandomizeClasses)
randomized_data = randomizer(heartDisease)
print("Before Randomization: ",heartDisease[2])
print("After Randomization: ",randomized_data[2])
Press enter or click to view image in full size

Conclusion

We use several preprocessing functions in the orange library for data preprocessing operations on data, such as randomization, normalization, discretization, and continuity.

--

--

No responses yet