Data science is an interdisciplinary practice that draws on concepts from several fields to analyze data and derive insights from it. Insights derived from data become vital information that can benefit an organization in many ways. In this introduction to data science, we will learn how data-driven approaches are used to understand and solve business problems. Machine learning predictions form the core of data science practice: in contrast to post-event analysis, machine learning focuses on predicting outcomes from historical data.

Why Is Data Science More Prevalent Now Than in the Previous Decade?

Compared with the previous decade, the volume and variety of data have increased vastly, and organizations are more interested in deriving insights from this data and building predictive models. For example, based on past network transactions, data science helps build a robust predictive model that can identify whether a transaction is normal or shows signs of network intrusion, which in turn helps prevent fraudulent transactions.

Let us look at the main concepts in data science by walking through the stages of building and implementing a machine learning model.

Understanding the Data

The data we handle comes in various formats, such as numerical, categorical, ordinal and more. Exploratory Data Analysis (EDA) provides a way to understand the various facets of the data, such as:

  • How is the data distributed? Is it normally distributed or uniformly distributed?
  • What are the central tendency and the dispersion of the data?
  • What are the outliers in the data? Outliers affect the overall predictive performance of the model, so it is important to identify them and to remove or otherwise treat outliers and anomalies.
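The questions above can be explored with a few summary statistics. The following is a minimal sketch using only Python's standard library; the sample values and the helper names `summarize` and `iqr_outliers` are assumptions made for illustration, and the 1.5 × IQR rule is one common convention for flagging outliers, not the only one.

```python
import statistics

# Hypothetical sample: daily transaction counts with one extreme value.
values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]

def summarize(data):
    """Return basic measures of central tendency and dispersion."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),  # sample standard deviation
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,                   # interquartile range
    }

def iqr_outliers(data, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]

summary = summarize(values)
outliers = iqr_outliers(values)
print(summary)
print("Outliers:", outliers)
```

On this sample, the value 95 falls far above the upper fence and is the only point flagged.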

From this exploratory analysis and its visualizations, we can derive information about the data that can be used in data pre-processing and feature engineering.

To measure the central tendency of the data, we can use the mean, median or mode.

  • The mean gives the average value of the data and is a widely used measure of central tendency. However, if outliers are present, the mean is pulled in the direction of the outlier.
  • The median is a robust measure of central tendency: it is the middle element of the sorted data, so it is far less affected by outliers than the mean.
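To make the difference between the two measures concrete, here is a small sketch with made-up salary figures (an assumption for illustration only). It adds a single extreme value and compares how far the mean and the median move.

```python
import statistics

salaries = [40, 42, 45, 47, 50]   # hypothetical values, in thousands
with_outlier = salaries + [400]   # the same data plus one extreme value

mean_before = statistics.mean(salaries)        # 44.8
mean_after = statistics.mean(with_outlier)     # 104
median_before = statistics.median(salaries)    # 45
median_after = statistics.median(with_outlier) # 46.0

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```

The mean more than doubles, while the median shifts by only one unit.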

Data can also be categorized by the type of its distribution, such as uniform, normal, binomial and more. Data science, with its strong links to statistics, helps us identify these intrinsic properties of the data and take corrective action where needed.

Data Pre-Processing

Data pre-processing involves identifying issues in the data and performing corrective actions.


Missing values are one of the most important issues in a dataset, and they usually need to be imputed. There are various ways to impute missing values: statistical methods that use the mean, median or mode; algorithms based on linear trends; and machine learning based algorithms.
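As an illustration of the simplest of these options, the sketch below fills missing entries with the mean, median or mode of the observed values. The `impute` helper, the use of `None` to mark a missing value, and the sample columns are all assumptions for this example; trend-based and machine learning based imputation are not shown.

```python
import statistics

def impute(column, strategy="mean"):
    """Replace None entries with a statistic of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    statistic = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }[strategy]
    fill = statistic(observed)
    return [fill if v is None else v for v in column]

ages = [25, None, 31, 40, None, 24]        # hypothetical numeric column
cities = ["Pune", "Delhi", None, "Pune"]   # hypothetical categorical column

ages_mean = impute(ages, "mean")
ages_median = impute(ages, "median")
cities_mode = impute(cities, "mode")

print(ages_mean)
print(ages_median)
print(cities_mode)
```

The mode is the usual choice for categorical columns, where a mean or median is not defined.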


The presence of outliers can influence model predictions, skewing them in a direction that depends on the outlier values. Therefore, it is necessary to remove the outliers or take other corrective actions.

Feature Engineering

Feature engineering involves selecting or eliminating features based on their relevance to the prediction and their importance to the model. As part of this step, a base machine learning model is set up to identify the features relevant to the predictions. This stage involves the following steps:

  • Recursive Feature Elimination
  • Correlation Identification
  • Multicollinearity check using the Variance Inflation Factor (VIF)

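Recursive feature elimination identifies the important features by repeatedly fitting the base model, ranking the features by importance and dropping the least important ones until the desired number of features remains. Unlike a brute-force search, it does not validate every possible feature combination.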
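The correlation and multicollinearity checks listed above can be sketched in plain Python. This example makes a simplifying assumption: with exactly two predictors, the R² used in VIF = 1 / (1 - R²) equals their squared Pearson correlation, so no regression library is needed. The feature names and values are hypothetical, and recursive feature elimination itself is not shown.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def vif_two_features(x1, x2):
    """VIF = 1 / (1 - R^2); for two predictors, R^2 is their squared correlation."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical features of a housing dataset.
area_sqft = [500, 750, 1000, 1250, 1500, 1750]
area_sqm = [46, 70, 93, 116, 139, 163]   # nearly the same information as area_sqft
age_years = [12, 3, 20, 7, 15, 1]

vif_redundant = vif_two_features(area_sqft, area_sqm)
vif_independent = vif_two_features(area_sqft, age_years)

print(f"corr(area_sqft, area_sqm)  = {pearson(area_sqft, area_sqm):.4f}")
print(f"VIF(area_sqft vs area_sqm)  = {vif_redundant:.1f}")
print(f"VIF(area_sqft vs age_years) = {vif_independent:.2f}")
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity; here the two area columns are clear candidates for dropping one of them.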

Model Selection

As part of model selection, various candidate models are identified and evaluated as base models. The goal of this step is to identify the best prediction algorithms for the project. The base models are trained on the training dataset and used to make predictions on the validation dataset. Based on prediction accuracy, the top-performing models are selected.
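The train-then-validate loop described above might look like the following sketch. It compares two deliberately simple candidates, a constant mean predictor and a one-variable least-squares line, on a held-out validation split. The candidate models, the data and the choice of mean absolute error as the score are assumptions for illustration; a real project would compare richer model families.

```python
def fit_mean_baseline(xs, ys):
    """Candidate 1: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_linear(xs, ys):
    """Candidate 2: simple least-squares line y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def mean_absolute_error(model, xs, ys):
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical training and validation splits.
train_x, train_y = [1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
valid_x, valid_y = [7, 8], [14.1, 15.9]

candidates = {"mean_baseline": fit_mean_baseline, "linear": fit_linear}

# Train each candidate on the training split, score it on the validation split.
scores = {
    name: mean_absolute_error(fit(train_x, train_y), valid_x, valid_y)
    for name, fit in candidates.items()
}
best_model = min(scores, key=scores.get)  # lower error is better

print(scores)
print("Selected:", best_model)
```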

Machine learning models are classified into different categories or families based on their characteristics and how they make predictions.


Supervised learning relies on a dataset that has a label or target to guide the model's training. For example, suppose we are building a deep learning based classification model that classifies an image as containing a dog or a cat. The dataset used to train this model records, for each image, whether it shows a dog or a cat. In this way, the dataset helps the model learn the patterns in the images that it needs for classification. Supervised learning includes many families of algorithms, such as tree-based models, gradient boosting, linear models, Support Vector Machines and more.
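The role of labels can be shown with a much smaller model than the image classifier described above. The sketch below is an assumption-laden stand-in: instead of images and deep learning, it uses two hand-made numeric features and a nearest-centroid classifier, which learns one average feature vector per label and assigns new rows to the closest one.

```python
from collections import defaultdict

def train_nearest_centroid(features, labels):
    """Learn one centroid (mean feature vector) per label."""
    grouped = defaultdict(list)
    for row, label in zip(features, labels):
        grouped[label].append(row)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in grouped.items()
    }

def predict(centroids, row):
    """Return the label whose centroid is closest to `row`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], row))

# Hypothetical labeled data: [weight_kg, ear_length_cm] -> label.
X_train = [[4.0, 6.5], [3.5, 7.0], [5.0, 6.0],
           [20.0, 10.0], [25.0, 12.0], [30.0, 11.0]]
y_train = ["cat", "cat", "cat", "dog", "dog", "dog"]

centroids = train_nearest_centroid(X_train, y_train)
prediction_small = predict(centroids, [4.5, 6.8])
prediction_large = predict(centroids, [22.0, 11.5])

print(prediction_small, prediction_large)
```

Without the `y_train` labels, the training step would have nothing to group the rows by, which is exactly what distinguishes supervised learning.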

When a labeled dataset is not available for training, we may need to opt for unsupervised learning models. These models use the intrinsic properties of the data points to group them into clusters. Within each cluster, the data points are close to each other in terms of their characteristics. There are various clustering algorithms, such as K-Means clustering, hierarchical clustering and more.
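A minimal sketch of K-Means clustering follows, written in plain Python so that the two alternating steps are visible. The points and the choice of starting centroids are assumptions for illustration; library implementations typically choose the starting centroids more carefully and run several restarts.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means (Lloyd's algorithm) starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)),
                key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:     # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, assignments

# Hypothetical unlabeled 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

final_centroids, cluster_ids = kmeans(points, centroids=[points[0], points[3]])
print(cluster_ids)
print(final_centroids)
```

No labels are supplied anywhere; the grouping emerges only from the distances between points.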

Reinforcement learning is an interesting field in which the model is trained through rewards and penalties for an agent's actions. Agents are ML-based bots that perform actions in a predefined environment. Rewards are generated based on their actions and objectives.
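The reward-and-penalty loop can be illustrated with tabular Q-learning on a toy environment. Everything here is an assumption for the sake of the example: a five-cell corridor, a reward of +1 for reaching the last cell, a penalty of -1 for walking into the left wall, and the learning-rate and discount values. To keep the result deterministic, the sketch sweeps over every state-action pair instead of letting the agent explore at random, which a real agent would normally do.

```python
N_STATES = 5                 # corridor cells 0..4; cell 4 is the goal
ACTIONS = ("left", "right")
GAMMA = 0.9                  # discount factor for future rewards
ALPHA = 0.5                  # learning rate

def step(state, action):
    """Hypothetical environment: return (next_state, reward, done)."""
    if action == "right":
        nxt = state + 1
        if nxt == N_STATES - 1:
            return nxt, 1.0, True    # reward for reaching the goal
        return nxt, 0.0, False
    if state == 0:
        return 0, -1.0, False        # penalty for bumping into the wall
    return state - 1, 0.0, False

def train(sweeps=200):
    """Apply the Q-learning update to every (state, action) pair repeatedly."""
    q = {(s, a): 0.0 for s in range(N_STATES - 1) for a in ACTIONS}
    for _ in range(sweeps):
        for (s, a) in list(q):
            nxt, reward, done = step(s, a)
            future = 0.0 if done else max(q[(nxt, b)] for b in ACTIONS)
            target = reward + GAMMA * future
            q[(s, a)] += ALPHA * (target - q[(s, a)])
    return q

q_table = train()
# Greedy policy: in each non-goal cell, pick the action with the highest value.
policy = [max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)]
print(policy)
```

The learned policy moves right in every cell, since that is the only direction that leads to the reward.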

Deep Learning Models

Deep learning models evolved thanks to advances in the computational capacity of processing units and innovations in algorithms. They can handle and analyze huge volumes of data and learn the underlying patterns in it. Deep learning models are widely used in scenarios that involve big data, image data or time series data.

ERP College courses are designed to cover the core concepts of the algorithms and how they can be implemented technically. The courses cover various machine learning and deep learning models, with extensively reviewed topics and use cases.

Model Building, Validation and Fine Tuning

After selecting the best models for a scenario, the next step is to build the model and fine-tune it. This is an iterative step: we first establish the accuracy of the base model and then tune the model's hyperparameters to improve on it. Different machine learning models have different hyperparameters, and we can vary their values within a specified range to fine-tune the models.
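One simple way to vary a hyperparameter over a specified range is a grid search. The sketch below tunes `k`, the number of neighbours in a small k-nearest-neighbours classifier written from scratch, and keeps the value with the best validation accuracy. The model choice, the data (which includes one deliberately mislabeled training point) and the grid of values are assumptions for illustration.

```python
from collections import Counter

def knn_predict(train_X, train_y, row, k):
    """Classify `row` by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], row)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

def validation_accuracy(train_X, train_y, valid_X, valid_y, k):
    hits = sum(
        knn_predict(train_X, train_y, row, k) == label
        for row, label in zip(valid_X, valid_y)
    )
    return hits / len(valid_y)

# Hypothetical 1-D data; the point at 2.5 carries a noisy "b" label.
train_X = [[1.0], [2.0], [2.5], [3.0], [4.0], [6.0], [7.0], [8.0], [9.0]]
train_y = ["a", "a", "b", "a", "a", "b", "b", "b", "b"]
valid_X = [[2.4], [3.5], [6.5], [8.5]]
valid_y = ["a", "a", "b", "b"]

# Grid search: try each candidate value and record its validation score.
grid = [1, 3, 5]
scores = {k: validation_accuracy(train_X, train_y, valid_X, valid_y, k) for k in grid}
best_k = max(scores, key=scores.get)  # first value with the highest score

print(scores)
print("Best k:", best_k)
```

With `k=1` the noisy training point misleads one validation prediction; larger values of `k` outvote it.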

The fine-tuned ML model is trained, and its predictions are validated using evaluation metrics appropriate to the type of problem, such as regression or classification. The validation score is used to check the model's performance, and the model is fine-tuned again if required. This is an iterative process.

Data Science Model Evaluation

The next step is to validate the model using actual test data. This gives us more visibility into how the model performs in a real-world scenario. Each type of machine learning problem has its own evaluation metrics.
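As an illustration of problem-specific metrics, the sketch below computes accuracy, precision, recall and F1 for a binary classification problem, and MAE and RMSE for a regression problem. These are common choices rather than an exhaustive list, and the true and predicted values are made up for the example.

```python
import math

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def regression_metrics(y_true, y_pred):
    """Mean absolute error and root mean squared error."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {"mae": mae, "rmse": rmse}

# Hypothetical test-set labels and predictions.
clf = classification_metrics([1, 0, 1, 1, 0, 0, 1, 0],
                             [1, 0, 0, 1, 0, 1, 1, 0])
reg = regression_metrics([3.0, 5.0, 7.0, 9.0],
                         [2.0, 5.0, 8.0, 11.0])
print(clf)
print(reg)
```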

Model Interpretation and Test Data

After we have built the machine learning and deep learning models, the next steps are to interpret the model predictions and to provide the business context to stakeholders. Data science provides various tools for this step, such as inspecting how the model's weights are configured and which features are most important.
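One model-agnostic way to estimate feature importance is permutation importance: shuffle one feature's values and measure how much the model's accuracy drops. The sketch below assumes a stand-in `model` function in place of a trained classifier, along with made-up rows and labels; it is meant to show the idea rather than any particular library's implementation.

```python
import random

def model(row):
    """Stand-in for a trained classifier: it only looks at feature 0."""
    return 1 if row[0] > 5 else 0

def accuracy(rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)

def permutation_importance(rows, labels, feature, seed=0):
    """Accuracy drop after shuffling one feature column across rows."""
    rng = random.Random(seed)
    column = [row[feature] for row in rows]
    rng.shuffle(column)
    shuffled = [
        row[:feature] + [value] + row[feature + 1:]
        for row, value in zip(rows, column)
    ]
    return accuracy(rows, labels) - accuracy(shuffled, labels)

# Hypothetical test rows [feature_0, feature_1] and their true labels.
rows = [[1, 7], [2, 3], [8, 5], [9, 1], [3, 9], [7, 2]]
labels = [0, 0, 1, 1, 0, 1]

importances = {f: permutation_importance(rows, labels, f) for f in (0, 1)}
print(importances)
```

Because the stand-in model ignores feature 1, shuffling that column never changes a prediction and its importance is exactly zero. In practice the shuffle is repeated several times and the drops are averaged.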

As we can see, data science is an extensive field that brings together various concepts to derive insights and predict outcomes.

ERP College’s Data Science Course

ERP College provides an interactive online data science course. The course is not only an introduction to data science but also a deep dive into the subject. It teaches prospective students how to analyze and visualize data, design machine learning models, and search for answers in collections of enormous databases that are too complex to analyze using conventional methods. Graduates of this program can analyze data written in different programming languages, drawn from various sites, or stored in different formats. Participants in this program could qualify for the following positions: Machine Learning Engineer, Data Engineer, or Data Scientist.