Download Free Audio of Welcome to the lesson on exploratory data analysis... - Woord

This audio was created by Woord's Text to Speech service by content creators from all around the world.


Welcome to part 2 of the lesson on exploratory data analysis. You have already seen how to detect missing values and remove outliers. The next step in exploratory data analysis is univariate analysis, followed by bivariate analysis.
First, you will do the univariate analysis. What is univariate analysis? It is the simplest form of analyzing data. Uni means one, so you look at only one variable at a time. It does not deal with causes or relationships; its purpose is to describe. It takes the data, summarizes it, and finds patterns in it.
Regression analysis does not strictly require normally distributed data, but it is better if the data is close to a normal (Gaussian) distribution. What is the normal distribution? A normally distributed data set, when plotted, follows a bell-shaped, symmetrical curve centered on the mean.
Draw the histogram using the plot.hist method. You can see that the chance of admit graph is slightly skewed towards the right, but it still has some variation: it varies from 0.4 to 0.9. So you can observe enough variation in the chance of admit. This plot is a histogram. What is a histogram? A histogram is an approximate representation of the distribution of numerical data. The range of values is divided into bins; for example, 0.6 to 0.7 is treated as a single bin. The height of each bar gives the frequency of the values in that bin. If you see 80, there are 80 values in the bin corresponding to 0.7. xlabel means we label the x-axis as the chances, with a font size of 12. So we are plotting the chances.
If the output variable takes only a single value and does not vary, it is not useful for prediction. If the chance of admit is 0.5 in every row, a model has nothing to learn from it. The data should have some variation, and the ideal condition is a normal distribution. In our case, the chance of admit has enough variation.
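
The histogram step described above can be sketched roughly as follows. This is only an illustration: the numbers are made up rather than taken from the lesson's admission data, the column name "Chance of Admit" is assumed, and the bin count of 5 is an arbitrary choice. The non-interactive Agg backend is used only so that the script also runs where no display is available.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; the plot is written to a file
import pandas as pd

# Made-up values standing in for the lesson's chance-of-admit column.
data = pd.DataFrame({
    "Chance of Admit": [0.42, 0.55, 0.61, 0.64, 0.68, 0.71, 0.72,
                        0.73, 0.76, 0.78, 0.81, 0.84, 0.88, 0.92],
})

# Split the observed range into 5 equal-width bins and draw one bar per bin.
ax = data["Chance of Admit"].plot.hist(bins=5, edgecolor="black")
ax.set_xlabel("Chance of Admit", fontsize=12)
ax.set_ylabel("Frequency", fontsize=12)

# The height of each bar is the number of rows that fall in that bin.
bin_counts = [int(bar.get_height()) for bar in ax.patches]
print(bin_counts)

ax.figure.savefig("chance_of_admit_hist.png")
```

If one bar held every row, the variable would have no variation and would not be worth predicting; a spread across several bars is what the lesson is looking for.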
The output varies enough, so you can use this data for building the model. Similarly, the university rating takes 5 values. In the first bin, the university rating shows a count of 30. We have enough values for each rating; for the first one and the second one, we have enough values. If any bin has a low frequency, that value is under-represented and the data becomes unbalanced, which hampers our predictions. For every value, we should have enough rows so that we can model accurately. From this analysis, most students fall between ratings 3 and 3.5.
If you look at the research column, it has a value of 0 or 1. In total, data for 497 students is present in this column. The value_counts() function returns a Series containing the counts of unique values. Here, 280 students have a 1 in the research variable, which means 280 students have research experience, whereas 217 students do not, so their value is 0. Here too, the data is balanced: there are enough values of each kind in the research column.
Univariate analysis is a routine check on whether each variable, discrete or continuous, has enough spread. We should see that it has enough variation so that our model predicts well.
Next, you learn bivariate analysis. In bivariate analysis, we use different plots. One of them is the scatter plot, a type of data visualization that shows the relationship between two variables. We use the data.plot.scatter method to draw it. It takes the data and represents the two variables along the two axes. The bottom axis is the x-axis, where we display the independent variable; the vertical axis is the y-axis, where we display the dependent variable. The plot shows every (x, y) point on the graph. For example, this point is at around 315 and 0.4. On the x-axis we took the GRE score, and on the y-axis we took the chance of admit.
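
The value_counts() check on the research column described above can be sketched like this. The ten rows are invented for illustration and the column name "Research" is an assumption; the lesson's own data has 497 rows.

```python
import pandas as pd

# Made-up rows: 1 means research experience, 0 means none.
data = pd.DataFrame({"Research": [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]})

# Count how many rows carry each unique value.
counts = data["Research"].value_counts()
print(counts)

# The same counts as fractions of the total, a quick check for imbalance.
shares = data["Research"].value_counts(normalize=True)
print(shares)
```

If one of the shares were close to zero, that value would be under-represented, which is the unbalanced situation the lesson warns about.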
In this graph, we are trying to establish a relation between the GRE score and the chance of admit. The graph shows how the chance of admit varies with the GRE score. We can see a linear trend: as the GRE score increases, the chance of admit also increases. They are positively correlated, and we can put a number on that correlation. We use the corr method to find the correlation between two variables. For correlation, you require two variables; here one is the GRE score and the other is the chance of admit. Using the corr method, you find that the correlation value is 0.8, and it is positive. That means the GRE score and the chance of admit are positively correlated.
Similarly, we can look at the correlation between the TOEFL score and the chance of admit. It is also positive, with a correlation coefficient of 0.78, so the TOEFL score and the chance of admit are strongly correlated. That means that when you build a model to predict the chance of admit, you should include the GRE score and the TOEFL score. CGPA is also correlated with the chance of admit, so we need to consider CGPA as well.
Sometimes the independent variables, or input variables, are correlated with each other. For example, we draw the scatter plot of CGPA and the TOEFL score. They are strongly related, with a correlation of 0.8. Similarly, CGPA and the GRE score are correlated, with a correlation value of 0.8. Students with a higher CGPA tend to get good scores in GRE and TOEFL.
You can draw this kind of insight from exploratory data analysis. You did the univariate analysis with a single variable and the bivariate analysis between two variables. EDA helps you make wise decisions while building the model.
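
As a rough sketch of the bivariate steps above, the following draws a scatter plot and computes the correlation for one pair of variables. The eight rows are made up, so the resulting coefficient will not match the 0.8 quoted in the lesson, and the column names "GRE Score" and "Chance of Admit" are assumptions. The Agg backend is used only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; the plot is written to a file
import pandas as pd

# Made-up rows standing in for the lesson's admission data.
data = pd.DataFrame({
    "GRE Score": [300, 305, 310, 315, 320, 325, 330, 335],
    "Chance of Admit": [0.45, 0.52, 0.58, 0.66, 0.71, 0.80, 0.86, 0.92],
})

# Independent variable on the x-axis, dependent variable on the y-axis.
ax = data.plot.scatter(x="GRE Score", y="Chance of Admit")
ax.figure.savefig("gre_vs_chance.png")

# Pearson correlation between the two columns (pandas' default method).
r = data["GRE Score"].corr(data["Chance of Admit"])
print(round(r, 2))

# Pairwise correlations for all numeric columns at once.
corr_matrix = data.corr()
print(corr_matrix)
```

With more columns in the frame (TOEFL score, CGPA and so on), the matrix returned by data.corr() also shows which input variables are correlated with each other.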
What is the use of exploratory data analysis? EDA is useful for seeing what the data can tell us before the modeling task.