Data Science: Data Pre-processing with Orange

YASH PATEL
3 min read · Sep 17, 2021


Orange is component-based data mining software. It includes a range of data visualization, exploration, preprocessing, and modeling techniques. It can be used through an intuitive graphical user interface or, for more advanced users, as a module for the Python programming language.

Installing Orange 3 library using pip

# Install Orange
!pip install orange3
import Orange

Discretization

Discretization is the process of transforming continuous variables, models, or functions into a discrete form. We do this by creating a set of contiguous intervals (or bins) that span the range of the variable, model, or function. In the example here we use equal-frequency binning with three bins, so each bin receives roughly the same number of instances.

# Load the heart disease dataset
heartDisease = Orange.data.Table('https://raw.githubusercontent.com/biolab/orange3/master/Orange/datasets/heart_disease.tab')
# Discretize every continuous attribute into 3 equal-frequency bins
disc = Orange.preprocess.Discretize()
disc.method = Orange.preprocess.discretize.EqualFreq(n=3)
d_heart_disease = disc(heartDisease)
print("Original dataset:")
for e in heartDisease[:3]:
    print(e)
print("Discretized dataset:")
for e in d_heart_disease[:3]:
    print(e)
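
To make the idea of equal-frequency binning concrete, here is a small plain-Python sketch that does not use Orange. The function names and the age values are made up for illustration, and the way cut points are placed (midway between neighbouring sorted values) is an assumption; Orange's own EqualFreq may choose thresholds and handle ties differently.

```python
def equal_freq_cut_points(values, n_bins):
    """Return n_bins - 1 thresholds that split values into roughly equal-sized bins."""
    ordered = sorted(values)
    cuts = []
    for k in range(1, n_bins):
        idx = k * len(ordered) // n_bins
        # Place the threshold midway between the two neighbouring sorted values
        cuts.append((ordered[idx - 1] + ordered[idx]) / 2)
    return cuts


def discretize(value, cuts):
    """Return the index of the bin that value falls into."""
    return sum(value >= c for c in cuts)


# Hypothetical ages, not taken from the heart disease dataset
ages = [29, 34, 41, 45, 52, 58, 63, 67, 70]
cuts = equal_freq_cut_points(ages, 3)
bins = [discretize(a, cuts) for a in ages]
print("Cut points:", cuts)
print("Bin index per value:", bins)
```

With nine values and three bins, each bin ends up with three values, which is the "equal frequency" property.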

Continuization

Continuization is the opposite of discretization: it converts discrete (categorical) variables into continuous (numeric) ones, typically by replacing each categorical variable with 0/1 indicator columns. Many learning algorithms accept only numeric input, so this step makes categorical data usable by them.

# Load the Titanic dataset, whose attributes are all categorical
titanic = Orange.data.Table('https://raw.githubusercontent.com/biolab/orange3/master/Orange/datasets/titanic.tab')
continuizer = Orange.preprocess.Continuize()
titanic1 = continuizer(titanic)
# Compare the domain (list of variables) before and after
print('Before Continuization', titanic.domain)
print('After Continuization', titanic1.domain)
# Index 7 is the 8th row, since indexing starts at 0
print('Row at index 7 before: ', titanic[7])
print('Row at index 7 after: ', titanic1[7])
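
As a rough illustration of what turning a categorical variable into numeric columns can look like, here is a plain-Python sketch of indicator (one-hot) encoding. The `one_hot` helper and the sample values are hypothetical, and Orange's Continuize offers several treatments whose output may differ from this one.

```python
def one_hot(values):
    """Encode a list of category labels as rows of 0.0/1.0 indicator columns."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows


# Hypothetical passenger classes, loosely modelled on the Titanic data
status = ["first", "crew", "third", "crew"]
categories, encoded = one_hot(status)
print("Columns:", categories)
for label, row in zip(status, encoded):
    print(label, "->", row)
```

Each row has exactly one 1.0, in the column that matches its original category.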

Normalization

Normalization scales the values of an attribute so that they fall within a small range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally needed when attributes are measured on different scales; otherwise, an equally important attribute on a smaller scale can be diluted by attributes with values on a larger scale. We use the Normalize class to perform normalization.

from Orange.preprocess import Normalize

# Scale each continuous attribute by its span (max - min)
normalizer = Normalize(norm_type=Normalize.NormalizeBySpan)
normalized_data = normalizer(heartDisease)
print("Before Normalization: ", heartDisease[2])
print("After Normalization: ", normalized_data[2])
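
Normalizing by span is commonly understood as min-max scaling. The sketch below shows that arithmetic in plain Python for a single attribute; the function name and the cholesterol-like numbers are invented for illustration, and it assumes the target range is 0.0 to 1.0.

```python
def normalize_by_span(values):
    """Rescale values to the range 0.0..1.0 using (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values are identical, so there is nothing to spread out
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Hypothetical measurements on a large scale
chol = [200, 250, 300, 400]
scaled = normalize_by_span(chol)
print(scaled)
```

The smallest value maps to 0.0 and the largest to 1.0, so attributes that started on very different scales become directly comparable.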

Randomization

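With randomization, given a data table, the preprocessor returns a new table in which the data is shuffled. We use the Randomize class from the Orange library to perform randomization. With the RandomizeClasses option used here, only the class values are shuffled, while the attribute values stay where they are.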

from Orange.preprocess import Randomize

# Shuffle only the class (target) values across instances
randomizer = Randomize(Randomize.RandomizeClasses)
randomized_data = randomizer(heartDisease)
print("Before Randomization: ", heartDisease[2])
print("After Randomization: ", randomized_data[2])
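
The effect of shuffling only the class values can be sketched in plain Python as follows. The `shuffle_classes` helper, the seed, and the sample rows are all hypothetical; this shows the general idea rather than how Orange implements it.

```python
import random


def shuffle_classes(rows, seed=None):
    """Return new (features, label) rows with the labels shuffled among them."""
    rng = random.Random(seed)
    labels = [label for _, label in rows]
    rng.shuffle(labels)
    # Features keep their positions; only the labels move
    return [(features, label) for (features, _), label in zip(rows, labels)]


# Hypothetical (features, class) pairs
rows = [((63, "male"), 0), ((67, "male"), 1), ((41, "female"), 0), ((56, "male"), 1)]
shuffled = shuffle_classes(rows, seed=42)
for before, after in zip(rows, shuffled):
    print(before, "->", after)
```

Because the link between features and labels is broken, a model trained on the shuffled table should do no better than chance, which makes this a handy sanity check.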

Conclusion

We used several preprocessing classes from the Orange library to carry out common data preprocessing operations: discretization, continuization, normalization, and randomization.
