This audio was created by Woord's Text to Speech service by content creators from all around the world.
Welcome to the lesson on model training and prediction. The next part is building the model.

What is a machine learning model? A machine learning model is a file that has been trained to recognize certain types of patterns. You train a model over a set of data, providing it with an algorithm that it can use to reason over and learn from those data. To relate the input to the output and predict the output variable, we need to build a machine learning model. The model is an equation, with parameters, that relates input and output. First you assume that input and output are related by some equation that includes parameters; the algorithm then finds the values of those parameters.

There are four steps in building the model. The first step is to preprocess the data. The second step is model selection and training. The third step is prediction from the model. The fourth step is model evaluation.

What is data preprocessing? Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format. Sometimes the data we present to the model has to be preprocessed: it may need normalization, some columns may have to be transformed, and the data has to be separated into training and testing sets. You also need to separate the input data from the output data.

In the data frame, we drop the chance of admit column and store the remaining data in x. To drop a column, we use data.drop. Here x is the input variable, and y is the output variable. The chance of admit is the output column, so we store it in y and take all the remaining columns as the input. x therefore has seven columns, one per input variable, and y has only one column.
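To make the input and output separation concrete, here is a minimal sketch. The lesson's actual data file is not shown, so the table below is synthetic, and the column names (gre, toefl, chance_of_admit and so on) are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the lesson's student table. The column names and
# values are made up for illustration; they are not the lesson's real data.
rng = np.random.default_rng(0)
n_students = 10
data = pd.DataFrame({
    "gre": rng.integers(290, 341, n_students),
    "toefl": rng.integers(92, 121, n_students),
    "university_rating": rng.integers(1, 6, n_students),
    "sop": rng.uniform(1, 5, n_students).round(1),
    "lor": rng.uniform(1, 5, n_students).round(1),
    "cgpa": rng.uniform(6.8, 9.9, n_students).round(2),
    "research": rng.integers(0, 2, n_students),
    "chance_of_admit": rng.uniform(0.3, 1.0, n_students).round(2),
})

# Inputs: every column except the output column.
x = data.drop(columns=["chance_of_admit"])
# Output: the single column we want to predict.
y = data["chance_of_admit"]

print(x.shape)  # (10, 7): ten students, seven input variables
print(y.shape)  # (10,): one output value per student
```

Note that data.drop returns a new data frame and leaves data itself unchanged, which is why the output column can still be read from data afterwards.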
If you look at the shape of x, you can see that the number of rows is 497. x contains seven input variables for 497 students, and the output is a single variable. The input variables in x are called the independent variables, and the output variable y is called the dependent variable, because it depends on the independent variables.

Building the model involves two stages: training the model and testing the model. We have to separate the data accordingly, so it is divided into a training set and a testing set. The model usually looks more accurate on the training set, because that is the data it was fitted to. To get an honest measure of performance, we have to check the model on the testing set, which it has not seen. You train the model on the training data and test it on the testing data, so the student data has to be separated into training data and testing data.

The model_selection module of the scikit-learn library has a train_test_split function, which we use for separating the data. You need to import train_test_split and call it on x and y. Here x is the first argument, and y is the second argument. The function divides x into training and testing data, and it divides y in the same way, based on the proportion given. In this lesson the training data is 80% and the testing data is 20%. Note that if you give no size at all, scikit-learn holds out 25% for testing by default, so an 80/20 split means passing test_size=0.2. The function divides x into train_x and test_x, and y into train_y and test_y. So the training inputs and outputs are stored in train_x and train_y, and the testing inputs and outputs are stored in test_x and test_y.

The data is separated into training and testing data randomly: the function selects the student rows at random. The parameter random_state is the seed value for this random sampling.
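The split can be sketched as follows. The arrays are synthetic placeholders with the same shape as the lesson's data (497 students, seven inputs), and the seed value 42 is an arbitrary choice. The sketch passes test_size=0.2 explicitly, because scikit-learn's own default when no size is given is to hold out 25%.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholders shaped like the lesson's data (not the real data).
rng = np.random.default_rng(0)
x = rng.normal(size=(497, 7))     # 497 students, 7 input variables
y = rng.uniform(0, 1, size=497)   # one output value per student

# 80% training, 20% testing; random_state fixes the random selection.
train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.2, random_state=42
)
print(train_x.shape, test_x.shape)  # (397, 7) (100, 7)

# Running the split again with the same seed selects the same rows.
_, test_x_again, _, _ = train_test_split(x, y, test_size=0.2, random_state=42)
print(np.array_equal(test_x, test_x_again))  # True
```

The function returns the four pieces in the order train inputs, test inputs, train outputs, test outputs, so the order of the names on the left-hand side matters.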
If you don't specify random_state and run the function several times, a different selection is chosen every time. To ensure that the same sample is selected on every execution, set random_state to some integer. Then every time you run it, you get the same results.

Now we will take a model and train it on the 80% training data. After training the model, we will test it on the 20% testing data. We use a multiple linear regression model to solve this problem. This model is available in the linear_model module as the LinearRegression class. You need to import the LinearRegression class; here we import it under the name LR.

You also need to test the performance of the model. For that, we calculate the mean absolute error. What is mean absolute error? The absolute error is the amount of error in a measurement: the absolute difference between the measured value and the true value. The mean absolute error (MAE) is the average of all the absolute errors. Import mean_absolute_error from the metrics module as mae.

We create an instance of the LR class and call it lr. The lr object is our model now, but it is still in the raw, untrained state. You have created a multiple linear regression model lr, and now you need to train it. For training, you use a method called fit. Calling lr.fit(train_x, train_y) trains the linear regression model using train_x as the input and train_y as the output. The algorithm finds the parameters by fitting the relation between train_x and train_y; that is, it finds the best-fit equation between them using optimization techniques. The best fit obtained from the fit method is not one hundred percent accurate. It does not pass through all the points, so there is some error. The algorithm finds the equation that stays closest to the points overall.

After the fit method, we have a trained model, and we have its parameters. The next step is to predict using the trained model. We use the predict method for testing the model.
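As a rough illustration of what fit produces, the sketch below trains a linear regression on synthetic data generated from a known equation, then reads back the parameters it found. The true coefficients, the intercept 0.7, and the noise level are assumptions made up for this example; they are not taken from the lesson's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression as LR

# Synthetic training data from a known linear equation plus a little noise.
# The coefficients and intercept are invented for this illustration.
rng = np.random.default_rng(0)
train_x = rng.normal(size=(200, 3))
true_coef = np.array([0.5, -0.2, 0.1])
train_y = 0.7 + train_x @ true_coef + rng.normal(scale=0.01, size=200)

lr = LR()                  # raw, untrained model
lr.fit(train_x, train_y)   # training: the algorithm finds the parameters

# The learned parameters: one coefficient per input column, plus an intercept.
print(lr.coef_)
print(lr.intercept_)
```

Because the data contains noise, the recovered values come out close to the true ones but not exactly equal, which matches the point that the best fit does not pass through every data point.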
Give the same training set train_x to the lr.predict method. This method predicts the output of the model for the given train_x. We give train_x as the input to the trained model to see whether it produces the correct output values. Store the predicted values in the train_predict variable.

Now we have actual and predicted values. train_y holds the actual values we got from train_test_split. train_predict holds the values predicted by the model. The mean absolute error (MAE) is the average of the absolute value of the actual value minus the predicted value. If you run it here, you can see that the error is 0.039. The chance of admit ranges from 0 to 1, so an error of this size is acceptable, which means our model predicts the chance of admit for a given student with good accuracy.

So far, we have trained the model on the training data train_x and train_y and measured the mean absolute error on that same data. We call this the training error: the error you get when you use the training input for the prediction. As far as training is concerned, our model is performing well.

On the training data the model may predict accurately, but it should also generalize to unseen data. Now we will test the model on data that is not in the training set. For this data too, the error should be small. The error measured on the testing data is the testing error. We kept 20% of the data aside, which is not present in the training data, and we use it to see whether the model predicts accurately. We now give test_x to lr.predict as the input to the trained model lr, and we save the predicted values in a variable test_predict.
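The training-error check described above can be sketched like this. The data is again synthetic (the weights and noise level are assumptions), so the printed error will not match the lesson's figure. The sketch also computes the mean absolute error by hand to show that it is just the average of |actual - predicted|.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error as mae

# Synthetic training data: 300 students, 7 inputs, noisy linear output.
# The weights and noise level are invented for this illustration.
rng = np.random.default_rng(1)
train_x = rng.uniform(0, 1, size=(300, 7))
weights = rng.uniform(0, 0.1, size=7)
train_y = 0.3 + train_x @ weights + rng.normal(scale=0.05, size=300)

lr = LinearRegression().fit(train_x, train_y)

# Predict on the same data the model was trained on.
train_predict = lr.predict(train_x)

# Training error: scikit-learn's MAE and the same quantity computed by hand.
training_error = mae(train_y, train_predict)
manual_error = np.mean(np.abs(train_y - train_predict))
print(training_error, manual_error)
```

The two printed numbers agree, since mean_absolute_error takes the actual values first and the predicted values second and averages their absolute differences.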
Just as in the previous case, the actual output values are in test_y, and the values predicted by the trained model are in test_predict. You can see that the testing error is 0.046. Earlier, for the training data, we got 0.039. The testing error is close to the training error, which means our trained model performs well on the testing data too. Both the training and testing errors are small, so the errors are acceptable and our model predicts with good accuracy.

I hope you now understand how to train and test a model. We created a multiple linear regression model and trained it using the fit method. We have seen the performance of the model on the training data and on the testing data. With the trained model, we can predict the chance of admission for any new student. Now you can deploy the model. Once it is deployed, you can provide a student's details and check that student's chance of admission to a university based on the profile.
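Putting the whole lesson together, a minimal end-to-end sketch might look like the following. It runs on synthetic data with assumed weights and noise, so the errors it prints are not the lesson's 0.039 and 0.046, and new_student is a hypothetical row of seven input values rather than a real profile.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import train_test_split

# Synthetic data shaped like the lesson's: 497 students, 7 inputs.
# The weights, intercept, and noise level are invented for this illustration.
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=(497, 7))
weights = np.linspace(0.02, 0.14, 7)
y = 0.2 + x @ weights + rng.normal(scale=0.05, size=497)

# Step 1: separate training and testing data (80% / 20%).
train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# Step 2: train the model on the training data only.
lr = LinearRegression().fit(train_x, train_y)

# Steps 3 and 4: predict, then compare training error with testing error.
training_error = mae(train_y, lr.predict(train_x))
testing_error = mae(test_y, lr.predict(test_x))
print(f"training error: {training_error:.3f}")
print(f"testing error:  {testing_error:.3f}")

# Using the trained model on a new, unseen (hypothetical) student.
new_student = rng.uniform(0, 1, size=(1, 7))
new_prediction = lr.predict(new_student)
print(new_prediction)
```

A testing error that stays close to the training error, as it should here, is the sign that the model generalizes to unseen students. A testing error far above the training error would suggest that the model has fitted the training data too closely.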