Data Science: Data Pre-processing with Orange

Orange is a component-based data mining software. It includes a range of data visualization, exploration, preprocessing, and modeling techniques. It can be used through a nice and intuitive user interface or, for more advanced users, as a module for the Python programming language.

Installing Orange 3 library using pip

# Install Orange
!pip install orange3

Discretization

Discretization is the process through which we can transform continuous variables, models, or functions into a discrete form. We do this by creating a set of contiguous intervals (or bins) that go across the range of our desired variable/model/function.

heartDisease = Orange.data.Table('https://raw.githubusercontent.com/biolab/orange3/master/Orange/datasets/heart_disease.tab')
disc = Orange.preprocess.Discretize()
disc.method = Orange.preprocess.discretize.EqualFreq(n=3)
d_heart_disease = disc(heartDisease)

Continuization

A continuation reifies the program control state, i.e. the continuation is a data structure that represents the computational process at a given point in the process execution; the created data structure can be accessed by the programming language, instead of being hidden in the runtime environment.

titanic = Orange.data.Table('https://raw.githubusercontent.com/biolab/orange3/master/Orange/datasets/titanic.tab')
continuizer = Orange.preprocess.Continuize()
titanic1 = continuizer(titanic)

Normalization

Normalization is used to scale the data of an attribute so that it falls in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. Normalization is generally required when we are dealing with attributes on a different scale, otherwise, it may lead to a dilution ineffectiveness of an important equally important attribute(on a lower scale) because of other attributes having values on a larger scale. We use the Normalize function to perform normalization.

from Orange.preprocess import Normalize

Randomization

With randomization, given a data table, the preprocessor returns a new table in which the data is shuffled. Randomize function is used from the Orange library to perform randomization.

from Orange.preprocess import Randomize

Conclusion

We use several preprocessing functions in the orange library for data preprocessing operations on data, such as randomization, normalization, discretization, and continuity.