This audio was created by Woord's Text to Speech service by content creators from all around the world.
Welcome to the lesson on understanding data. You will implement this project in four parts: understanding the historical data, exploratory data analysis, model building, and model evaluation. First, let us look at understanding the historical data.

You need historical data to train the machine learning model. First, you need to understand the variables in the data. What are the inputs? What is the output? How is the data distributed? You get answers to these questions in the first part, understanding the historical data.

In Python, you need several modules to perform data analysis, that is, to understand the data. These modules come with methods that help you understand the data. For example, if you want to find the maximum value in the given data, you need a maximum function. Suppose you want to check the maximum GRE score; you can retrieve the maximum value of that variable with this function. You can use the existing methods in these libraries to find insights in the data.

First, you have to import the required libraries. In Python, there are several modules or packages available for machine learning. To understand the data, you need three libraries: NumPy, Pandas, and Matplotlib.

NumPy is useful for performing numerical operations on the data. The historical data contains rows and columns, so it is two-dimensional or multidimensional data, available as numerical arrays or matrices. NumPy is useful for performing operations on this structured, matrix-like data.

After this, you need the Pandas library. Pandas is a data analysis library; we use it to gain insights into the data. In some cases, understanding the data requires data visualization.
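The maximum-value lookup mentioned above can be sketched as follows. The DataFrame here is a small made-up sample standing in for the real 500-row dataset; only the column name "GRE Score" is taken from the lesson.

```python
import pandas as pd

# Tiny made-up sample standing in for the 500-row graduate admission data;
# the column name "GRE Score" matches the dataset described in the lesson.
data = pd.DataFrame({"GRE Score": [337, 324, 316, 322, 314]})

# .max() retrieves the maximum value of a particular variable (column).
max_gre = data["GRE Score"].max()
print("Maximum GRE Score:", max_gre)
```

The same pattern works for any numeric column, and for the other summary methods (`.min()`, `.mean()`) mentioned later in the lesson.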
For data visualization, we use Matplotlib. In some cases, direct numerical analysis may not be helpful; in those cases, data visualizations are very helpful in understanding the data. Hence we require these three libraries for analyzing and visualizing the data.

Now import the Pandas library using import pandas as pd. You have imported Pandas as pd into the current Python environment, so whenever you want to access the methods available in the Pandas library, you can use the name pd instead of pandas. Similarly, import the NumPy library and give it a valid name. Likewise, from Matplotlib, import pyplot and give it the name plt. Now all the methods in these libraries are available in the current environment, and we can use them once imported. The line %matplotlib inline is useful for displaying the plots inside a Jupyter Notebook.

Next, read_csv. The graduate admission data is available in the graduate_admission.csv file. This CSV file contains student profiles, university rankings, and the chance of admission. You need to store this data in some variable. We use the read_csv function in Pandas to read, or load, data from CSV files. This function copies the data into a DataFrame variable; here the DataFrame variable name is data. Now we can perform any operations on this DataFrame using Pandas and NumPy.

The data is available in rows-and-columns format. You can get information about the columns in the DataFrame using data.columns. Each column refers to a variable. Using data.columns, you get the list of columns available in the dataset, that is, in the CSV file. There are nine columns: the first is the serial number, the second is GRE Score, then TOEFL Score, University Rating, statement of purpose (SOP), letter of recommendation (LOR), CGPA, Research, and Chance of Admit. To see the first five rows, use data.head().
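The loading steps above can be sketched like this. Since the real graduate_admission.csv is not reproduced here, the sketch feeds read_csv five illustrative rows in the same nine-column layout through an in-memory buffer; with the real file you would simply call pd.read_csv("graduate_admission.csv").

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from io import StringIO

# Illustrative stand-in for graduate_admission.csv (the real file has 500 rows);
# the nine column names follow the lesson, the row values are made up.
csv_text = """Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.00,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.80
5,314,103,2,2.0,3.0,8.21,0,0.65
"""

# With the real file: data = pd.read_csv("graduate_admission.csv")
data = pd.read_csv(StringIO(csv_text))

print(data.columns.tolist())  # the nine column names
print(data.head())            # the first five rows
```

In a Jupyter Notebook you would also run %matplotlib inline before plotting, as the lesson notes.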
head is a method that displays the first five rows of the data by default. Here the serial number identifies the student; each row is the data of a single student, and the index starts at 0. Serial number one means it is the first student's data. This student achieved a GRE score of 337, a TOEFL score of 118, a university rating of 4, a CGPA of 9.65, and has research experience. For this student, the chance of admission is 0.92. This data was collected after the student was admitted to some university: with this profile, from GRE Score through Research, the chance of getting university admission is 0.92, that is, a 92% chance of getting admission.

Here we see five students' data. Except for the serial number, all the remaining columns are useful for us, so we have to train our model on this data.

We want to see further details of the data. Among the available variables we see GRE Score, TOEFL Score, CGPA, Research, and several other columns. If you exclude the serial number and the output column, there are seven input variables.

We want to see statistical details of these variables. You can get statistical information using the describe function available in Pandas. describe gives the statistical measures of all the variables. For example, for the GRE score, the mean is about 316 and the maximum is 340, so most students score close to the maximum; the distribution is skewed towards the maximum, that is, the right side. The 25%, 50%, and 75% values are the quartiles; quartiles give a hint of whether outliers are present or not. The minimum is 290. Using describe, you get a basic understanding of the data.

Suppose a variable has the same value in all the rows; then it is not useful, and its standard deviation is zero. We have 500 entries; you can see the count is 500.
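The describe step can be sketched on a small sample like this (five illustrative GRE scores rather than the real 500-row data):

```python
import pandas as pd

# Five illustrative GRE scores; the real dataset has 500 rows.
data = pd.DataFrame({"GRE Score": [337, 324, 316, 322, 314]})

stats = data.describe()  # count, mean, std, min, 25%, 50%, 75%, max
print(stats)

# Individual measures can be read off the result by row label:
print("count:", stats.loc["count", "GRE Score"])
print("max:  ", stats.loc["max", "GRE Score"])
```

A column whose standard deviation (the "std" row) is zero has the same value in every row, which is exactly the kind of uninformative variable the lesson says you should drop.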
That means there is data from 500 students. If all the students had the same CGPA, it would be difficult for us to predict the chance of admission, because with the same CGPA for everyone there is no variation. If a variable has little or no variation, it is not useful, and you should remove that column.

The describe function provides first-hand information about the data: the count, mean, standard deviation, minimum, maximum, and quartile values (from which you can also get the range) for every numerical variable.

The next step is checking the data types. In Pandas, you have an attribute called dtypes. data.dtypes gives the data type of each column. The serial number is an integer; here int stands for integer. GRE Score is also an integer, as is TOEFL Score. SOP, LOR, and CGPA are floating-point numbers. All values are numerical, so there are no categorical or text variables in this data, and our work is somewhat simpler. If you have categorical or text variables, you have to convert them into numerical values. Here, we don't require any conversion because everything is already numerical; there is no need for encoding and decoding.
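The dtypes check can be sketched as follows. The two rows are made up, but the column types come out as in the lesson: whole-number columns as int64, the SOP/CGPA-style columns as float64.

```python
import pandas as pd

# Two illustrative rows; integer columns are inferred as int64 and
# decimal columns as float64, matching the lesson's description.
data = pd.DataFrame({
    "GRE Score":   [337, 324],    # integers -> int64
    "TOEFL Score": [118, 107],    # integers -> int64
    "SOP":         [4.5, 4.0],    # decimals -> float64
    "CGPA":        [9.65, 8.87],  # decimals -> float64
})

print(data.dtypes)  # one dtype per column
```

If a column held text categories instead, its dtype would show as object, and that column would need the numerical encoding the lesson mentions.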