Introduction to Data Science

Today, data underpins almost every area of computer science: whatever the task, some data is needed to carry it out. For example, once a website has been built, data science supplies the data and insights it needs to do its job.

Machine learning and artificial intelligence rely on both historical and current data to perform a task correctly. A machine needs data even to understand how a user behaves.

The terms data scientist and data analytics appeared during the 1980s and 1990s, when academics looking for a name for a data-focused curriculum decided to call it data science.


Who Is A Data Scientist?

Is a data scientist simply someone who wrestles with data 24×7, tries new experiments, and solves complex mathematical problems?

The definition varies across the industry. Put simply, a data scientist is a person who works with data, solves complex data problems, and extracts meaningful information from the data.

A data scientist should have the following skills:

  • Programming languages (Python/R)
  • Essential mathematics related to data science
  • NumPy
  • Pandas
  • Matplotlib/Seaborn
  • Scikit-learn
  • OpenCV
  • Natural Language Toolkit (NLTK)
  • TensorFlow

Prerequisites for Data Science

Non-Technical Prerequisites
  1. Curiosity: To learn data science you have to be curious about data. Curiosity leads you to ask questions, and asking questions is how you come to understand the concepts.
  2. Critical Thinking: Data science requires high-level thinking and the ability to approach a complex problem from several angles.
  3. Communication Skills: Communication is a very useful tool in data science, because you have to communicate with your teammates.

Technical Prerequisites
  1. Mathematics: The underlying mathematical concepts must be clear.
  2. Databases: Solid knowledge of databases and a query language such as SQL is an essential part of data science.
  3. Computer Programming: You should know at least one programming language, such as Python, R, or C.
  4. Statistics: A basic understanding of statistics is required.

Data science can be divided into two parts:


  1. Data Analysis: The practice of inspecting, cleaning, transforming, and modeling data with the goal of extracting useful information for better decision making.
    • Data Acquisition: The process of gathering and filtering data before it is sent to a data warehouse or other storage system.
    • Data Wrangling: The process of cleaning and restructuring data for further analysis. Data arrives from many sources and is often bulky and incomplete. Python is commonly used to clean it.
    • Data Exploration: This is closely related to data analysis. The aim is to understand the dataset and pull out the information that serves the organization's goal. It often involves large volumes of data, and includes checking the completeness and correctness of the data and identifying relationships within it.
    • Data Visualization: A way to understand complex and massive data more easily. With Pandas and Matplotlib together we can draw clear graphical representations of the data.
  2. Machine Learning: Machine learning is a subset of artificial intelligence that allows applications and machines to become more accurate at predicting outcomes without being explicitly programmed to do so. A machine learning algorithm takes historical data as input and predicts future results.
    • Supervised Learning: When you learn under supervision, someone is there to tell you whether you are right or wrong. Similarly, in supervised learning the algorithm is trained on labeled data, so every training example comes with the correct answer.
    • Unsupervised Learning: The machine is trained on data that is neither classified nor labeled, and the algorithm is left to find structure in the data without guidance.
    • Reinforcement Learning: An artificial intelligence agent tries to achieve a specific goal. It takes actions toward that goal, receives rewards for them, and uses those rewards to improve itself.
    • Neural Network: The human brain is made up of billions of neurons that share information with each other and take input from the outside world through the sense organs. An artificial neural network is loosely modeled on this idea: it is built from interconnected artificial neurons that pass information to one another and learn from the input data.
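To make the difference between supervised and unsupervised learning from the list above concrete, here is a minimal sketch using scikit-learn. The choice of the bundled iris dataset, logistic regression, and k-means is our own assumption for illustration, not a recommendation for any particular problem.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small labeled dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Supervised: the labels y are shown to the model during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)
accuracy = classifier.score(X_test, y_test)

# Unsupervised: the same features, but no labels are passed in.
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = clusterer.fit_predict(X)

print(f"supervised test accuracy: {accuracy:.2f}")
print(f"number of clusters found: {len(set(cluster_ids.tolist()))}")
```

Note that the classifier can be scored against known answers, while the clustering only groups similar rows. Deciding what each group means is left to us.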
The following libraries are commonly used in data science:


NumPy is an open source Python library for working with numerical data. It provides many built-in functions and a multidimensional array type.
A Python list can serve as an array, but it is slow for numerical work. NumPy provides much faster array processing. The array type in NumPy is called ndarray.
NumPy's functions make it quick and easy to work with n-dimensional arrays.
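As a small illustration of the list versus ndarray point above, the sketch below doubles a set of values both ways and then builds a two-dimensional array. The variable names and numbers are made up for the example.

```python
import numpy as np

# A plain Python list needs an explicit loop (here a comprehension) per element.
prices = [10.0, 20.0, 30.0, 40.0]
doubled_list = [p * 2 for p in prices]

# An ndarray applies the operation to every element in one step.
price_array = np.array(prices)
doubled_array = price_array * 2

# ndarrays can have any number of dimensions; this one is a 2 x 3 matrix.
matrix = np.arange(6).reshape(2, 3)
column_sums = matrix.sum(axis=0)

print(type(price_array).__name__)
print(doubled_array)
print(matrix.ndim, matrix.shape)
print(column_sums)
```

The speed advantage comes from running the loop inside NumPy's compiled code rather than in the Python interpreter, which matters more and more as arrays grow.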


Pandas is an open source Python library used for analyzing and manipulating data. It offers high-performance merging of datasets and is used in many fields, including finance, statistics, web development, and artificial intelligence, as well as for data preprocessing.
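Here is a rough sketch of how cleaning and merging might look in Pandas. The two tables, their column names, and the decision to fill missing amounts with the median are all hypothetical choices made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data containing a duplicate row and a missing value.
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 2, 3],
        "customer_id": [10, 11, 11, 12],
        "amount": [250.0, np.nan, np.nan, 100.0],
    }
)
customers = pd.DataFrame(
    {"customer_id": [10, 11, 12], "city": ["Pune", "Delhi", "Pune"]}
)

# Cleaning: drop exact duplicate rows, then fill the missing amount with the median.
clean = orders.drop_duplicates().copy()
clean["amount"] = clean["amount"].fillna(clean["amount"].median())

# Merging: attach each customer's city to their orders.
merged = clean.merge(customers, on="customer_id", how="left")

# Exploring: total order amount per city.
totals = merged.groupby("city")["amount"].sum()
print(totals)
```

Whether a missing value should be filled, dropped, or flagged depends on the dataset. The median is used here only to keep the example short.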


Matplotlib is a popular Python library for data visualization. It is cross-platform and open source, is used mainly for plotting 2D graphs, and is built around an object-oriented API.
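The object-oriented API mentioned above can be sketched as follows: we create a Figure and an Axes object and call methods on them. The data points are invented, and the Agg backend is selected only so that the script runs on a machine without a display.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

# Object-oriented style: create a Figure and an Axes, then call methods on them.
fig, ax = plt.subplots(figsize=(4, 3))
ax.plot(x, y, marker="o", label="y = x squared")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A simple 2D line plot")
ax.legend()

# Render to an in-memory PNG; fig.savefig("plot.png") would write a file instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)

print(f"rendered {len(png_bytes)} bytes of PNG data")
```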


Seaborn is a Python visualization library built on top of Matplotlib. It provides a high-level interface for drawing statistical graphics that make complex statistical data easier to understand.

Scikit-learn

Scikit-learn is a powerful machine learning library. It provides tools for regression, classification, and other statistical modeling tasks. It is a Python library and is itself largely written in Python.

It is free to use and is built on top of NumPy and SciPy.

Benefits of Scikit-learn
  1. A consistent interface to machine learning models
  2. Many tuning parameters, all with sensible defaults
  3. Exceptional documentation
  4. A rich set of functionality for companion tasks
  5. An active community for development and support
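The first two benefits can be seen in a short sketch: three quite different models are trained and scored through the same fit and score calls, each left on its default settings. The wine dataset and the particular three models are our own picks for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Default hyperparameters throughout; random_state only makes runs repeatable.
models = {
    "k-nearest neighbors": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}

# Every estimator exposes the same fit / score methods.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.2f}")
```

Because the interface is shared, swapping one model for another is usually a one-line change, which makes it easy to compare candidates before tuning any of them.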
Natural Language Toolkit (NLTK)


NLTK is a Python platform for working with human language data. It provides an easy-to-use interface and is one of the most powerful Python libraries for language processing, including machine learning tools for text.


OpenCV stands for Open Source Computer Vision Library. With it we can read images and video, store and manipulate the image data, and extract the information we need for our goal.


TensorFlow is an open source framework for machine learning and deep learning. It is used to implement machine learning and deep learning models, and with its help we can build advanced artificial intelligence applications. TensorFlow was created by Google.
