Data Science: Data Pre-processing with Orange

Orange is a component-based data mining toolkit. It includes a range of data visualization, exploration, preprocessing, and modeling techniques. It can be used through an intuitive graphical user interface or, for more advanced users, as a module for the Python programming language.

Installing the Orange 3 library using pip

# Install Orange
!pip install orange3
import Orange

Discretization

Discretization is the process of transforming continuous variables, models, or functions into a discrete form. We do this by creating a set of contiguous intervals (or bins) that span the range of the variable, model, or function. In the code below, EqualFreq(n=3) splits each continuous attribute into three bins that hold roughly the same number of instances.

# Load the heart disease dataset from the Orange repository
heartDisease = Orange.data.Table('https://raw.githubusercontent.com/biolab/orange3/master/Orange/datasets/heart_disease.tab')

# Discretize every continuous attribute into 3 equal-frequency bins
disc = Orange.preprocess.Discretize()
disc.method = Orange.preprocess.discretize.EqualFreq(n=3)
d_heart_disease = disc(heartDisease)

print("Original dataset:")
for e in heartDisease[:3]:
    print(e)

print("Discretized dataset:")
for e in d_heart_disease[:3]:
    print(e)
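
To make the idea of equal-frequency binning concrete, here is a minimal pure-Python sketch. It is not Orange's implementation (Orange may place its thresholds differently), the helper names equal_freq_cut_points and assign_bin are hypothetical, and the ages are made-up illustrative values rather than rows from the dataset.

```python
def equal_freq_cut_points(values, n_bins):
    """Return n_bins - 1 thresholds that split `values` into bins of roughly equal size."""
    ordered = sorted(values)
    size = len(ordered)
    # Place one threshold at each k/n_bins position of the sorted data.
    return [ordered[(k * size) // n_bins] for k in range(1, n_bins)]


def assign_bin(value, cut_points):
    """Return the index of the bin that `value` falls into (0 is the lowest bin)."""
    return sum(value >= c for c in cut_points)


# Made-up ages, used only for illustration.
ages = [29, 41, 35, 63, 57, 44, 52, 38, 70]
cuts = equal_freq_cut_points(ages, 3)
bins = [assign_bin(a, cuts) for a in ages]

print("Cut points:", cuts)
print("Bin per value:", bins)
```

With nine values and three bins, each bin ends up holding three values, which is the defining property of equal-frequency discretization.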

Continuization

Continuization is the opposite of discretization: it transforms discrete (categorical) variables into continuous (numeric) ones. A common approach is to replace each categorical attribute with a set of numeric indicator columns, one per category, so that algorithms that expect numeric input can use the data. We use the Continuize preprocessor to perform continuization.

titanic = Orange.data.Table('https://raw.githubusercontent.com/biolab/orange3/master/Orange/datasets/titanic.tab')

# Replace each categorical attribute with numeric columns
continuizer = Orange.preprocess.Continuize()
titanic1 = continuizer(titanic)
print('Before continuization:', titanic.domain)
print('After continuization:', titanic1.domain)
# Index 7 is the eighth row, because indexing starts at 0
print('Row at index 7 before:', titanic[7])
print('Row at index 7 after:', titanic1[7])
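
One common way to turn a categorical attribute into numeric columns is indicator (one-hot) encoding. The sketch below shows that idea in pure Python; it is an illustration under that assumption, not Orange's code, and the one_hot helper and the sample values are hypothetical.

```python
def one_hot(values):
    """Encode each categorical value as a list of 0.0/1.0 indicators, one per category."""
    categories = sorted(set(values))
    encoded = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, encoded


# Made-up categorical values, used only for illustration.
status = ["first", "crew", "third", "crew"]
categories, encoded = one_hot(status)

print("Columns:", categories)
for original, row in zip(status, encoded):
    print(original, "->", row)
```

Each row contains exactly one 1.0, in the column that matches its original category.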

Normalization

Normalization rescales the values of an attribute so that they fall within a small, fixed range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally needed when attributes are measured on different scales; otherwise, an equally important attribute with values on a smaller scale can be drowned out by attributes with values on a larger scale. We use the Normalize preprocessor to perform normalization.

from Orange.preprocess import Normalize

# Rescale each continuous attribute by its span (max - min)
normalizer = Normalize(norm_type=Normalize.NormalizeBySpan)
normalized_data = normalizer(heartDisease)
print("Before normalization: ", heartDisease[2])
print("After normalization: ", normalized_data[2])
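
Normalizing by span is usually understood as min-max scaling: subtract the minimum and divide by the range. The following is a minimal sketch of that formula, assuming a target range of 0.0 to 1.0 by default; the normalize_by_span helper and the sample values are hypothetical and not taken from the dataset.

```python
def normalize_by_span(values, lower=0.0, upper=1.0):
    """Linearly rescale `values` so the minimum maps to `lower` and the maximum to `upper`."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no span; map everything to the lower bound.
        return [lower for _ in values]
    return [lower + (v - lo) / (hi - lo) * (upper - lower) for v in values]


# Made-up measurements, used only for illustration.
cholesterol = [200.0, 250.0, 300.0, 400.0]
scaled = normalize_by_span(cholesterol)

print(scaled)
print(normalize_by_span(cholesterol, lower=-1.0, upper=1.0))
```

After scaling, attributes that were originally in the hundreds and attributes that were originally between 0 and 5 all occupy the same range, so neither dominates purely because of its units.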

Randomization

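With randomization, given a data table, the preprocessor returns a new table in which the data is shuffled. We use the Randomize preprocessor from the Orange library for this. In the code below, Randomize.RandomizeClasses shuffles only the class values, so each row keeps its attribute values but may receive a different class label.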

from Orange.preprocess import Randomize

# Shuffle only the class values; attribute values stay in place
randomizer = Randomize(Randomize.RandomizeClasses)
randomized_data = randomizer(heartDisease)
print("Before randomization: ", heartDisease[2])
print("After randomization: ", randomized_data[2])
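
The effect of shuffling only the class values can be sketched with the standard library. This is an illustration of the idea rather than Orange's implementation; the shuffle_classes helper, the seed parameter, and the toy rows are all hypothetical.

```python
import random


def shuffle_classes(rows, seed=0):
    """Return new (features, label) rows with the labels permuted and the features untouched."""
    rng = random.Random(seed)  # a fixed seed keeps the shuffle reproducible
    labels = [label for _, label in rows]
    rng.shuffle(labels)
    return [(features, label) for (features, _), label in zip(rows, labels)]


# Made-up rows of (features, class label), used only for illustration.
rows = [((63, "male"), 0), ((67, "male"), 1), ((41, "female"), 0), ((56, "male"), 1)]
shuffled = shuffle_classes(rows)

for before, after in zip(rows, shuffled):
    print(before, "->", after)
```

Because the link between features and labels is broken while the label distribution is preserved, a model trained on such data gives a useful baseline: if it scores about as well as on the real data, the original result was probably not meaningful.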

Conclusion

The Orange library provides several preprocessors for preparing data, including discretization, continuization, normalization, and randomization.

YASH PATEL
