Next, you need to do exploratory data analysis (EDA). EDA gives you deeper insight into the data, and those insights help you build a machine learning model that can predict unseen data. We will analyze the data using statistical measures and several different methods. EDA helps in both building and evaluating the model.

First, you need to check for missing values. Use the isnull function available in pandas to detect them. Missing values are null values; you can read them as NaN, which stands for "not a number". The isnull function marks every cell of the data frame: True if the value is missing, False otherwise. Applying the sum function to that output counts the True values, giving the number of missing values in each column. For Serial No. the count is 0, which means that column has no null values; if there were any, the count would be non-zero. In fact, every column shows zero, so there are no missing values in the data and there is nothing to replace.

Next, you need to identify and remove the outliers. Our objective is to predict the chance of admission, which is a continuous output variable, so this is a regression problem. In supervised learning there are two types of algorithms: regression algorithms and classification algorithms. If the output is continuous, we call it regression; if the output is discrete, we call it classification. For regression, especially with linear methods, outliers are a big problem: we are approximating the relation between input and output with a fitted curve, and outliers distort that fit. So you need to remove the outliers, and in this case you use the interquartile range (IQR) method to filter them out.

First you have to identify the outliers, and then you remove them. To identify them, we will start with a graphical method rather than a numerical one, and for this a boxplot is very useful. The pandas library has a boxplot method that takes a column of the data frame. Draw a boxplot for the column Chance of Admit. In the boxplot, the top of the box is the 75th percentile and the bottom is the 25th percentile. In this plot there is a circle below the lower whisker, meaning there is a value less than the minimum of the expected range; any circle outside the whiskers is an outlier. So the Chance of Admit column has outliers. We do not yet know how many, but we know they are there. Next, draw boxplots for the GRE Score and TOEFL Score columns; you can observe there are no circles, so those columns have no outliers. Draw boxplots for University Rating, SOP, LOR, CGPA, and Research as well. You can see that LOR has outliers, and the remaining columns do not.
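The following is a minimal sketch of these two checks, assuming the admissions data has already been read into a pandas data frame called data; the file name Admission_Predict.csv and the exact column labels are assumptions, not taken from the transcript.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Load the admissions data (file name is an assumption)
    data = pd.read_csv("Admission_Predict.csv")

    # Missing-value check: isnull() marks each cell True if it is NaN,
    # and sum() counts the True values column by column
    print(data.isnull().sum())

    # Graphical outlier check: circles drawn outside the whiskers of a
    # boxplot are potential outliers
    data.boxplot(column="Chance of Admit")
    plt.show()

    data.boxplot(column=["GRE Score", "TOEFL Score"])
    plt.show()

    data.boxplot(column=["University Rating", "SOP", "LOR", "CGPA", "Research"])
    plt.show()

With this data, isnull().sum() should print 0 for every column, and only the Chance of Admit and LOR boxplots should show circles outside the whiskers.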
For LOR you can also see a circle outside the whiskers of the boxplot, so LOR has outliers as well. In other words, the output variable Chance of Admit and the input LOR both contain outliers. We do not know how many there are, only that they exist in those two columns.

Next, your job is to remove those outliers, and you can use the interquartile range method to do it. What is the IQR, or interquartile range? The interquartile range is a measure of variability based on dividing a data set into quartiles. Sort the data in the column in ascending order and find the median, the middle value; the median divides the data into two halves. The middle of the first half is Q1, the median itself is Q2, and the middle of the second half is Q3. Q1 is the first quartile, Q2 the second quartile, and Q3 the third quartile.

In the code, data.copy() copies the data into another data frame, data1. Since you do not need the serial number, drop that column from data1; axis=1 means we are dropping a column rather than a row. The quantile(0.25) call gives the first quartile Q1, and quantile(0.75) gives the third quartile Q3. The difference between Q3 and Q1 is the interquartile range, and printing it shows the IQR of each column; GRE Score, for example, has an interquartile range of 17. Having calculated Q1, Q3, and the IQR, the lower limit for outliers is Q1 - 1.5 * IQR and the upper limit is Q3 + 1.5 * IQR. Any value below the lower limit or above the upper limit is called an outlier.

In data1, we select the rows whose values fall below or above these boundaries, and after executing this step all the outliers are saved in the df_out1 variable. Displaying it with the head method shows that there are only three outliers: the students with serial numbers 92, 347, and 376. Student 92 has an outlier in Chance of Admit; in fact there are two outlying Chance of Admit values, both 0.34. The chance of admission is that low even with a decent CGPA of 8.03, so those rows stand out as outliers. For the student with serial number 347, the LOR score is only 1, so that row is an outlier too.

Now, how do you remove the outliers? Use the same OR condition with the two limits, but put a NOT operator in front of it. This keeps all the rows other than the outliers and saves them in the df_out variable, so every student entry except those three is retained. Then copy df_out back into the original data frame data. The data frame data now holds only the rows without outliers, so the outliers are removed. If you check the shape of the data with the shape attribute, the number of rows is now 497; we removed the three rows that contained outliers. So we are left with 497 students and 8 columns. This is how you can remove the outliers from the data.
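Below is a minimal sketch of this IQR filtering, continuing from the data frame data loaded earlier and following the variable names data1, df_out1, and df_out used above; the column label Serial No. is an assumption.

    # Work on a copy of the data without the serial number column
    data1 = data.copy()
    data1 = data1.drop("Serial No.", axis=1)   # axis=1 drops a column

    # Quartiles and interquartile range for every column
    Q1 = data1.quantile(0.25)
    Q3 = data1.quantile(0.75)
    IQR = Q3 - Q1
    print(IQR)

    # Outlier limits: anything below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    # Rows with an outlier in at least one column
    outlier_mask = ((data1 < lower) | (data1 > upper)).any(axis=1)
    df_out1 = data1[outlier_mask]
    print(df_out1.head())          # here, the three outlier rows

    # NOT of the same condition keeps everything except the outliers
    df_out = data1[~outlier_mask]
    data = df_out.copy()
    print(data.shape)              # e.g. (497, 8)

The mask compares every column against its own limits at once, so a row is flagged as soon as any one of its values falls outside the boundaries for that column.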