Data Science: Visual Programming with Orange Tool
We would normally split the data into three parts:
- Training data for building a model
- Validation data for testing which parameters and which model to use
- Test data for estimating the accuracy of the model
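As a rough illustration of the three-part split described above, here is a minimal Python sketch that works on row indices only. The function name `three_way_split` and the 70/15/15 proportions are assumptions made for this example; the article does not prescribe them.

```python
import random

def three_way_split(n_rows, train_frac=0.7, val_frac=0.15, seed=0):
    """Return (train, validation, test) lists of row indices.

    The 70/15/15 default is an illustrative assumption, not a rule.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # shuffle so each part is a random sample
    n_train = round(n_rows * train_frac)
    n_val = round(n_rows * val_frac)
    train = indices[:n_train]
    validation = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # everything that is left over
    return train, validation, test

train, validation, test = three_way_split(303)
print(len(train), len(validation), len(test))
```

Every row lands in exactly one of the three parts, so no patient used for building or tuning a model is reused when estimating its accuracy.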
Creating the workflow
Here, we load the heart-disease.tab data set using the Browse documentation data sets option in the File widget. It contains 303 patients, each diagnosed either with blood vessel narrowing (1) or as healthy (0).
File and Data Sampler
- Drag the Data Sampler widget to the canvas.
- On the right side of the File widget, there is a semi-circular shape. Press the mouse button on it and drag to the Data Sampler widget.
- Notice that there is now a link between the two widgets, labeled Data.
Now, we will split the data into two parts: 85% of the data for training and 15% for testing. We will send the first 85% onwards to build a model.
In Data Sampler, we selected the Fixed proportion of data option and went with 85%, which is 258 out of 303 patients.
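The arithmetic behind the 258 patients can be checked with a short Python sketch. This is not Orange's implementation. The helper `sample_fixed_proportion` is a hypothetical name, and rounding to the nearest whole row is an assumption chosen because it reproduces the 258 reported above.

```python
import random

def sample_fixed_proportion(rows, proportion, seed=0):
    """Split rows into (sample, remaining); the sample holds `proportion` of them.

    Rounding to the nearest integer is an assumption for this sketch.
    """
    n_sample = round(len(rows) * proportion)
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # random, reproducible order
    return shuffled[:n_sample], shuffled[n_sample:]

patients = list(range(303))  # stand-in row ids for the 303 patients
sample, remaining = sample_fixed_proportion(patients, 0.85)
print(len(sample), len(remaining))
```

The two returned lists play the roles of the Data Sampler outputs: the sample goes on to model building, and the remaining rows are kept aside for testing.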
Sampling and Cross-Validation
Next, send the data sample from Data Sampler to the Test and Score widget.
We will use three learners: Naive Bayes, Logistic Regression, and Tree. Connect each of them to the Test and Score widget. Using cross-validation, we found that Logistic Regression achieves the highest AUC.
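To make the cross-validation step more concrete, the sketch below shows how k-fold cross-validation partitions the training sample. It only generates the index splits. The function name `k_fold_indices` and the choice of k=5 are assumptions for illustration, since the article does not state how many folds were used.

```python
import random

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each row is held out exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of nearly equal size
    for i in range(k):
        test_idx = folds[i]
        # Training rows are all folds except the one currently held out.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

# 258 is the size of the 85% training sample from the text.
for train_idx, test_idx in k_fold_indices(258, k=5):
    print(len(train_idx), len(test_idx))
```

In each round a model would be fitted on `train_idx` and scored on `test_idx`; averaging the scores over the k rounds gives the figure used to compare learners.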
Split data into training and testing
Now it is time to bring in the remaining 15% as test data. Connect Data Sampler to Test and Score once again and set the connection from Remaining Data to Test Data.
Finally, compare the scores of the three algorithms. Double-click the Test and Score widget and select the Test on test data option to see the scores for all three algorithms on the held-out data.
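For readers curious about what such scores measure, here is a small Python sketch of two common ones: AUC (the score mentioned earlier) and classification accuracy. The labels and predicted probabilities are made-up toy values, not results from the heart disease data, and the helper names `auc` and `accuracy` are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs where the positive gets the higher score.
    Ties count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def accuracy(labels, scores, threshold=0.5):
    """Share of rows whose thresholded prediction matches the true label."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)

# Hypothetical predictions for six held-out patients (toy values only).
test_labels = [1, 0, 1, 1, 0, 0]
test_scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]
print(auc(test_labels, test_scores), accuracy(test_labels, test_scores))
```

Because these numbers are computed on rows the models never saw during training, they give a fairer estimate of performance than scores measured on the training sample.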
We have now learned how to split our data into training and testing sets in Orange.