Data Science: Data Pre-processing with Orange

YASH PATEL
Sep 17, 2021


Orange is component-based data mining software. It includes a range of data visualization, exploration, preprocessing, and modeling techniques. It can be used through an intuitive graphical user interface or, for more advanced users, as a module for the Python programming language.

Installing the Orange 3 library using pip

# Install Orange (the leading "!" runs a shell command from a notebook cell)
!pip install orange3
import Orange

Discretization

Discretization transforms a continuous variable into a discrete one. We do this by creating a set of contiguous intervals (or bins) that span the range of the variable and then replacing each value with the bin it falls into. The example below uses equal-frequency binning with three bins, so each bin receives roughly the same number of rows.

# Load the heart disease dataset
heartDisease = Orange.data.Table('https://raw.githubusercontent.com/biolab/orange3/master/Orange/datasets/heart_disease.tab')
# Discretize continuous attributes into 3 equal-frequency bins
disc = Orange.preprocess.Discretize()
disc.method = Orange.preprocess.discretize.EqualFreq(n=3)
d_heart_disease = disc(heartDisease)
print("Original dataset:")
for e in heartDisease[:3]:
    print(e)
print("Discretized dataset:")
for e in d_heart_disease[:3]:
    print(e)
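
To make the idea of equal-frequency bins concrete, here is a minimal pure-Python sketch. It is not Orange's implementation; the helper names (equal_freq_cut_points, assign_bin) and the ages list are made up for illustration, and the cut points are simply taken midway between neighboring sorted values.

```python
def equal_freq_cut_points(values, n_bins):
    """Return n_bins - 1 thresholds splitting the values into roughly equal-sized groups."""
    ordered = sorted(values)
    size = len(ordered)
    cuts = []
    for k in range(1, n_bins):
        idx = k * size // n_bins
        # Place the threshold midway between the two neighboring sorted values
        cuts.append((ordered[idx - 1] + ordered[idx]) / 2)
    return cuts


def assign_bin(value, cuts):
    """Return the index of the bin that the value falls into."""
    return sum(value >= c for c in cuts)


# Hypothetical values, not taken from the heart disease dataset
ages = [29, 34, 41, 45, 52, 57, 60, 63, 70]
cuts = equal_freq_cut_points(ages, 3)
bins = [assign_bin(a, cuts) for a in ages]
print(cuts)  # [43.0, 58.5]
print(bins)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Each of the three bins ends up with three values, which is the property that equal-frequency discretization aims for.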

Continuization

Continuization is the opposite of discretization: it transforms discrete (binary or multinomial) variables into continuous ones. A common approach is to replace a categorical variable with one numeric indicator column per category, holding 1 where the row has that category and 0 otherwise. This is useful for models that accept only numeric input.

titanic = Orange.data.Table('https://raw.githubusercontent.com/biolab/orange3/master/Orange/datasets/titanic.tab')
continuizer = Orange.preprocess.Continuize()
titanic1 = continuizer(titanic)
print('Before Continuization', titanic.domain)
print('After Continuization', titanic1.domain)
# Index 7 is the eighth row, since indexing starts at 0
print('Row at index 7 before: ', titanic[7])
print('Row at index 7 after: ', titanic1[7])
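
The indicator-column idea can be sketched in a few lines of plain Python. This is only an illustration of the technique, not what Orange runs internally; the one_hot helper and the status list are made-up names and values.

```python
def one_hot(values):
    """Replace each categorical value with a row of 0/1 indicators, one per category."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows


# Hypothetical categorical column
status = ["first", "crew", "third", "crew"]
categories, encoded = one_hot(status)
print(categories)  # ['crew', 'first', 'third']
for row in encoded:
    print(row)
```

Every row contains exactly one 1.0, in the column that matches its original category.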

Normalization

Normalization scales the values of an attribute so that they fall within a small range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally required when attributes are measured on different scales; otherwise, an equally important attribute on a smaller scale can be drowned out by attributes with larger values. We use Orange's Normalize preprocessor to perform normalization.

from Orange.preprocess import Normalize
# Scale each continuous attribute by its span (max - min)
normalizer = Normalize(norm_type=Normalize.NormalizeBySpan)
normalized_data = normalizer(heartDisease)
print("Before Normalization: ", heartDisease[2])
print("After Normalization: ", normalized_data[2])
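
Normalizing by span is, in essence, min-max scaling. The sketch below shows the arithmetic in plain Python, mapping values to the range 0.0 to 1.0; the normalize_by_span helper and the cholesterol numbers are made up for illustration and are not part of Orange.

```python
def normalize_by_span(values):
    """Min-max scale the values to the range 0.0 to 1.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant column has no span; map every value to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Hypothetical measurements on a large scale
cholesterol = [200, 250, 300, 400]
scaled = normalize_by_span(cholesterol)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value maps to 0.0 and the largest to 1.0, so attributes that started on very different scales become directly comparable.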

Randomization

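With randomization, the preprocessor takes a data table and returns a new table in which the data is shuffled. We use the Randomize preprocessor from the Orange library for this. With Randomize.RandomizeClasses, as used here, only the class values are shuffled among the rows, while the attribute values stay in place.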

from Orange.preprocess import Randomize
# Shuffle only the class values across rows
randomizer = Randomize(Randomize.RandomizeClasses)
randomized_data = randomizer(heartDisease)
print("Before Randomization: ", heartDisease[2])
print("After Randomization: ", randomized_data[2])
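
As a rough illustration of what shuffling only the class values means, here is a plain-Python sketch. The shuffle_classes helper, the seed parameter, and the sample rows are assumptions made for this example; they do not mirror Orange's internals.

```python
import random


def shuffle_classes(rows, labels, seed=0):
    """Return the rows paired with a shuffled copy of the class labels."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return list(zip(rows, shuffled))


# Hypothetical attribute rows and class labels
rows = [(63, "male"), (67, "male"), (41, "female"), (56, "male")]
labels = [0, 1, 0, 1]
randomized = shuffle_classes(rows, labels)
for attributes, label in randomized:
    print(attributes, label)
```

The attribute rows keep their order and the overall class counts are unchanged; only the pairing between rows and labels is broken, which is handy for checking whether a model is learning real structure or just noise.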

Conclusion

We used several preprocessors from the Orange library to prepare data: discretization, continuization, normalization, and randomization.
