How is Data important in Machine Learning

12 Nov

How is Data important in Machine Learning

Introduction to data in machine learning Introduction to data in ml

Suppose we are given to solve a machine learning problem then what is the first task that you would do? You will firstly collect the data without data you cannot start the process of applying machine learning

Data in machine learning consist of many sub parts i.e. Data, information, knowledge

What is data: - they are characteristics or information, usually numerical, that are collected through observation or previous results. In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects.

Information: - Information is a set of data which is processed in a meaningful way according to the given requirement. Information is processed, structured, or presented in a given context to make it meaningful and useful. For example: if you want to know the marks of a particular subject from a list of subjects then you will find the required marks, here “required/particular marks” is information and “list of subject” is data

Knowledge: is a collection of information with reference to its context. The context is in the form of relationships between information sets collected over period.

Split data in Machine Learning
Before we perform any analysis on data we split the data into 2 types

1: training dataset

Test dataset
Validation Dataset

Training dataset (Used to fit the machine learning model.) The training set is the material through which the computer learns how to process information. Machine learning uses algorithms it makes mistakes and learns from the data and predict the result.

Test Dataset 🙁 Used to evaluate the fit machine learning model.)When we predicted the result on training dataset then we test our model on the new dataset that is test dataset then we observe that is our model able to predict to correct result on test dataset.

Common percentage/Ratio:

Train: 80%, Test: 20%
Train: 67%, Test: 33%
Train: 50%, Test: 50%

Code for splitting: sklearn.model_selection.train_test_split (*arrays, **options)

VALIDATION DATA : when we train our model on training dataset then we don’t know when to stop the training ,if we keep on training the model it can lead to over fitting ,to avoid this Validation Data comes into play .

Property of data

Now we learned about the data and its types, but that is not only sufficient, we should ensure that the data that we are taking is good and more readable. There are 5 major properties (also known as 5V) of a data:-

Volume
Variety
Velocity
Value
Veracity

Volume: The volume of the data means the size of the dataset that we need to analyze and processed, this “volume” is greater than terabytes and petabytes. The handling of such data requires different processing techniques.
Variety: In Data Variety refers to the Structured and unstructured data these types of data are generated either by humans or machines. Its main function is to classify the incoming data into various categories
Velocity: It refers to the speed by which the data is generated. The data is generated with a good pace that it required different processing technique Example: Twitter messages.
Value: Value is a property. The data given to us has unique values and you need to understand the meaning of that value/columns in order to make certain judgments.
Veracity: It tell that how accurate is our data .It also ensure that the provider of the data is trustworthy.

EXAMPLE OF 5V:

In healthcare industry, there is a lot of data (i.e.2314 Exabyte) which is collected annually in the form of patient records and test results, this is termed as Volume all this data is generated at a very high speed which is known as velocity of big data. Now the data type is of 3 forms

structured data(Excel records)
semi-Structured data(log files)
unstructured data(X-ray files)

it’s called Variety of big data , ”Accuracy and Trustworthy” of data is termed as Veracity, we draw the conclusion from the data like “Disease detection”, “better treatment”, “Reduced cost” it is termed as Value

BIG DATA

Did you know when a user uses its phone there is a lot of data generated. The generated data can be in the form of mp4, image, pdf, xls, images etc. It was estimated that a user generate a data of 40 Exabyte in one month when this data is collected it is called as big data.

Let’s have a look at the data generated per minute in one minute:-

2.1 million Snaps are shared on snapchat

3.8 million Search queries are made on internet

1.0 million People log in on Facebook

4.5 million Videos are watched on YouTube

188 million emails are sent

That’s a lot of data!!

For classification of any data as big data. This is only possible with the concept of 5V as we discussed before. To handle Big data we use software like Cassandra, Hadoop etc.

Written by: Devansh Srivastava

(https://www.linkedin.com/in/devansh-srivastava-3669ba190/)

Course Curriculum

How is Data important in Machine Learning

How is Data important in Machine Learning