Before training a machine learning or deep learning model, it is important to cleanse, pre-process and analyse the dataset at hand. Tasks like handling missing values and converting text data into numbers are all part of the pre-processing phase, and more often than not they are repetitive and monotonous. Although there are tools that automate this work, they tend to behave like black boxes and give little intuition about how they changed the data. To address this problem, the Python ecosystem offers dabl – the Data Analysis Baseline Library. Dabl can automate many of the tasks that feel repetitive in the early stages of model development. It is a relatively recent project, and its latest version was released earlier this year. The feature set is still small, but development is progressing at a good pace.
In this article, we will use this tool for data pre-processing, visualisation and analysis as well as model development. Let’s get started.
Data pre-processing
To use dabl for data analysis, we first need to install the package. You can install it using pip:
pip install dabl
Once the installation is done, let us go ahead and pick a dataset. I will select a sample dataset from Kaggle. You can click this link to download the data. I have chosen the diabetes dataset. It is a small dataset which will make it easy to understand how dabl works.
After downloading the dataset, let us import the important libraries and look at our dataset.
import numpy as np
import pandas as pd
import dabl

db_data = pd.read_csv('diabetes.csv')
db_data.head()
Usually, after a first look at the dataset, you would move into data cleaning: identifying missing rows, spotting erroneous data and understanding the datatypes of the columns. Dabl makes these steps easy by automating them.
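For intuition, here is roughly what those manual chores look like in plain pandas. This is a generic sketch on a tiny hypothetical frame, not dabl's internal implementation:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a raw dataset (hypothetical values, for illustration)
df = pd.DataFrame({
    'Glucose': [148.0, 85.0, 183.0, np.nan],
    'Outcome': [1, 0, 1, 0],
    'Source': ['csv', 'csv', 'csv', 'csv'],   # constant column, useless for modelling
})

# 1. Inspect datatypes and count missing values
print(df.dtypes)
print(df.isna().sum())

# 2. Drop constant ("useless") columns
df = df.loc[:, df.nunique(dropna=False) > 1]

# 3. Impute missing numeric values with the column median
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
```

With dabl, chores of this kind collapse into a single call: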
db_clean = dabl.clean(db_data, verbose=1)
We have a list of detected feature types for the dataset given. These types indicate the following.
Continuous: Columns containing continuous numeric values, including numeric columns with high cardinality.
Dirty_float: Float variables that sometimes take string values are called dirty_float.
Low_card_int: Columns that contain integers with low cardinality fall under this category.
Categorical: Columns holding categorical values, whether stored as strings, integers or floating-point numbers.
Date: Columns with dates in them. These are currently not handled by dabl.
Free_string: String columns that contain many unique values are labelled free_string.
Useless: Constant columns, or columns that do not match any of the other categories, are labelled useless.
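This kind of type detection can be roughly approximated with pandas alone. The sketch below is a crude heuristic for illustration only — the threshold and rules are made up here and are not dabl's actual logic:

```python
import pandas as pd

df = pd.DataFrame({
    'Glucose': [148.0, 85.0, 183.0, 89.0],
    'Pregnancies': [0, 1, 2, 1],
    'Name': ['anna', 'ben', 'carl', 'dina'],
})

# Crude heuristic: float columns -> continuous; integer columns with few
# distinct values -> low-cardinality int; everything else -> string-like.
# The max_card threshold of 3 is arbitrary, chosen to suit this toy data.
def rough_type(s: pd.Series, max_card: int = 3) -> str:
    if pd.api.types.is_float_dtype(s):
        return 'continuous'
    if pd.api.types.is_integer_dtype(s):
        return 'low_card_int' if s.nunique() <= max_card else 'continuous'
    return 'free_string'

types = {col: rough_type(df[col]) for col in df.columns}
print(types)
```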
For more detail about the feature types dabl has detected, run:
type_info = dabl.detect_types(db_clean)
type_info
Here, we can clearly see which column of the dataset has which data type. We can also change a type to meet our needs. For example, the column Pregnancies is labelled neither continuous nor categorical, and since its values are small integers, we can treat them as categorical.
db_clean = dabl.clean(db_data, type_hints={"Pregnancies": "categorical"})
We have successfully converted the column into a categorical one.
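The same conversion can, of course, be done by hand in pandas; dabl's type_hints simply formalises it. A minimal sketch on a hypothetical slice of the data:

```python
import pandas as pd

# Hypothetical slice of the diabetes data (values for illustration only)
df = pd.DataFrame({'Pregnancies': [0, 1, 2, 1, 0],
                   'Glucose': [148.0, 85.0, 183.0, 89.0, 137.0]})

# Cast the low-cardinality integer column to a pandas categorical dtype --
# the same outcome that dabl's type_hints={"Pregnancies": "categorical"} requests
df['Pregnancies'] = df['Pregnancies'].astype('category')
print(df.dtypes)
```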
Data visualisation
The next step before training the model is to visualise the data. Using visualisation tools like matplotlib or seaborn is effective, but dabl makes the process very simple, displaying a wide range of plots from a single line of code. Because dabl detects feature types and cleans the data automatically, analysing the data becomes extremely fast.
Using the plot() method you can plot all your features against the target. In our dataset, the column Outcome is the target.
dabl.plot(db_clean, 'Outcome')
Dabl first automatically identifies and drops any outliers present in the dataset. It then identifies what type of data is present in the target (whether it is categorical or continuous) and displays the appropriate graph. Since ours is a categorical target, the output is a bar graph containing the count of 0s and 1s. Dabl also calculates and displays linear discriminant analysis scores for the training set.
The next set of graphs shows the distribution of each feature against our target. As you can see below, each feature is plotted as a histogram split by the target, with the counts for class 1 and class 0 shown in orange and blue respectively.
Next come scatter plots of different pairs of features in the dataset. For example, the feature Glucose is plotted against all the other columns, and the distribution is shown below.
In order to increase the efficiency and speed of training, dabl automatically performs PCA on the dataset and also shows the distribution to us. The next graph is the discriminant PCA graph for all the features in the dataset. It also displays the variance and cumulative variance for the dataset.
The final graph is the linear discriminant analysis plot, which projects combinations of the features against the target.
It is clear that with a single line of code we can analyse the data in ways that would usually take multiple steps and redundant code; dabl removes that repetition and makes data visualisation simple and easy to use.
Model development
Dabl intends to speed up the process of model training and provides a low code environment to train models. It takes up very little time and memory to train models using dabl. But, as mentioned earlier, this is a recently developed library and provides basic methods for machine learning training. Here I will be using a simple classifier model to train the diabetes dataset.
classifier = dabl.SimpleClassifier(random_state=0)
x = db_clean.drop('Outcome', axis=1)
y = db_clean.Outcome
classifier.fit(x, y)
SimpleClassifier trains the most commonly used classification models and reports accuracy, precision, recall and roc_auc for the data.
Not only this, it also identifies the model that gives the best results on your dataset and displays it.
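Conceptually, this is a small model-portfolio search. The sketch below illustrates the idea with scikit-learn on synthetic data — a simplified analogue of "fit several candidates, report the winner", not dabl's actual implementation:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Try a few cheap baseline models and keep the best by cross-validated accuracy
candidates = {
    'dummy': DummyClassifier(strategy='most_frequent'),
    'logreg': LogisticRegression(max_iter=1000),
    'tree': DecisionTreeClassifier(max_depth=3, random_state=0),
}
scores = {name: cross_val_score(est, X, y, cv=5, scoring='accuracy').mean()
          for name, est in candidates.items()}
best_name = max(scores, key=scores.get)
print(best_name, round(scores[best_name], 3))
```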
Similar to classification, you can also use dabl's simple regressor for regression-type problems.
Conclusion
Dabl offers ways of automating processes that otherwise take a lot of time and effort. Faster processing of data leads to faster model development and prototyping. Using dabl not only makes data wrangling easier but also more efficient. The documentation of dabl indicates that useful features are still to come, including model explainers and tools for enhanced model building.
The post Let’s Learn Dabl: A Python Tool for Data Analysis and ML Automation appeared first on Analytics India Magazine.