2052 | 2052 | "cell_type": "markdown", |
2053 | 2053 | "metadata": {}, |
2054 | 2054 | "source": [ |
2055 | | - "[Back to Top](#Table-of-Contents)\n", |
| 2055 | + "## Conclusion\n", |
2056 | 2056 | "\n", |
2057 | | - "## Step 4: Modeling\n", |
| 2057 | + "In this case study, we explored the Titanic dataset following the steps of the data mining process:\n", |
2058 | 2058 | "\n", |
2059 | | - "Now we have a relatively clean dataset(Except for the **Cabin** column which has many missing values). We can do a classification on Survived to predict whether a passenger could survive the disaster or a regression on Fare to predict ticket fare. This dataset is not a good dataset for regression. But since we don't talk about classification in this workshop we will construct a linear regression on Fare in this exercise." |
2060 | | - ] |
2061 | | - }, |
2062 | | - { |
2063 | | - "cell_type": "markdown", |
2064 | | - "metadata": {}, |
2065 | | - "source": [ |
2066 | | - "##### Task16: Construct a regression on Fare\n", |
2067 | | - "Construct regression model with statsmodels.\n", |
| 2059 | + "1. We started by understanding the business context and the objectives of the analysis.\n", |
| 2060 | + "2. We then explored and understood the data, identifying important features and their relationships.\n", |
| 2061 | + "3. Finally, we prepared the data by handling missing values and creating new features.\n", |
2068 | 2062 | "\n", |
2069 | | - "Pick Pclass, Embarked, FamilySize as independent variables." |
2070 | | - ] |
2071 | | - }, |
2072 | | - { |
2073 | | - "cell_type": "code", |
2074 | | - "execution_count": 25, |
2075 | | - "metadata": { |
2076 | | - "scrolled": false |
2077 | | - }, |
2078 | | - "outputs": [ |
2079 | | - { |
2080 | | - "data": { |
2081 | | - "text/html": [ |
2082 | | - "<table class=\"simpletable\">\n", |
2083 | | - "<caption>OLS Regression Results</caption>\n", |
2084 | | - "<tr>\n", |
2085 | | - " <th>Dep. Variable:</th> <td>Fare</td> <th> R-squared: </th> <td> 0.427</td> \n", |
2086 | | - "</tr>\n", |
2087 | | - "<tr>\n", |
2088 | | - " <th>Model:</th> <td>OLS</td> <th> Adj. R-squared: </th> <td> 0.424</td> \n", |
2089 | | - "</tr>\n", |
2090 | | - "<tr>\n", |
2091 | | - " <th>Method:</th> <td>Least Squares</td> <th> F-statistic: </th> <td> 131.9</td> \n", |
2092 | | - "</tr>\n", |
2093 | | - "<tr>\n", |
2094 | | - " <th>Date:</th> <td>Wed, 24 Apr 2019</td> <th> Prob (F-statistic):</th> <td>1.92e-104</td>\n", |
2095 | | - "</tr>\n", |
2096 | | - "<tr>\n", |
2097 | | - " <th>Time:</th> <td>12:07:17</td> <th> Log-Likelihood: </th> <td> -4495.8</td> \n", |
2098 | | - "</tr>\n", |
2099 | | - "<tr>\n", |
2100 | | - " <th>No. Observations:</th> <td> 891</td> <th> AIC: </th> <td> 9004.</td> \n", |
2101 | | - "</tr>\n", |
2102 | | - "<tr>\n", |
2103 | | - " <th>Df Residuals:</th> <td> 885</td> <th> BIC: </th> <td> 9032.</td> \n", |
2104 | | - "</tr>\n", |
2105 | | - "<tr>\n", |
2106 | | - " <th>Df Model:</th> <td> 5</td> <th> </th> <td> </td> \n", |
2107 | | - "</tr>\n", |
2108 | | - "<tr>\n", |
2109 | | - " <th>Covariance Type:</th> <td>nonrobust</td> <th> </th> <td> </td> \n", |
2110 | | - "</tr>\n", |
2111 | | - "</table>\n", |
2112 | | - "<table class=\"simpletable\">\n", |
2113 | | - "<tr>\n", |
2114 | | - " <td></td> <th>coef</th> <th>std err</th> <th>t</th> <th>P>|t|</th> <th>[0.025</th> <th>0.975]</th> \n", |
2115 | | - "</tr>\n", |
2116 | | - "<tr>\n", |
2117 | | - " <th>Intercept</th> <td> 79.2989</td> <td> 3.543</td> <td> 22.381</td> <td> 0.000</td> <td> 72.345</td> <td> 86.253</td>\n", |
2118 | | - "</tr>\n", |
2119 | | - "<tr>\n", |
2120 | | - " <th>C(Pclass)[T.2]</th> <td> -59.0955</td> <td> 3.921</td> <td> -15.073</td> <td> 0.000</td> <td> -66.790</td> <td> -51.401</td>\n", |
2121 | | - "</tr>\n", |
2122 | | - "<tr>\n", |
2123 | | - " <th>C(Pclass)[T.3]</th> <td> -68.8790</td> <td> 3.253</td> <td> -21.174</td> <td> 0.000</td> <td> -75.264</td> <td> -62.494</td>\n", |
2124 | | - "</tr>\n", |
2125 | | - "<tr>\n", |
2126 | | - " <th>C(Embarked)[T.Q]</th> <td> -11.8147</td> <td> 5.446</td> <td> -2.169</td> <td> 0.030</td> <td> -22.504</td> <td> -1.126</td>\n", |
2127 | | - "</tr>\n", |
2128 | | - "<tr>\n", |
2129 | | - " <th>C(Embarked)[T.S]</th> <td> -14.9202</td> <td> 3.414</td> <td> -4.371</td> <td> 0.000</td> <td> -21.620</td> <td> -8.220</td>\n", |
2130 | | - "</tr>\n", |
2131 | | - "<tr>\n", |
2132 | | - " <th>FamilySize</th> <td> 7.8256</td> <td> 0.789</td> <td> 9.919</td> <td> 0.000</td> <td> 6.277</td> <td> 9.374</td>\n", |
2133 | | - "</tr>\n", |
2134 | | - "</table>\n", |
2135 | | - "<table class=\"simpletable\">\n", |
2136 | | - "<tr>\n", |
2137 | | - " <th>Omnibus:</th> <td>1043.506</td> <th> Durbin-Watson: </th> <td> 2.040</td> \n", |
2138 | | - "</tr>\n", |
2139 | | - "<tr>\n", |
2140 | | - " <th>Prob(Omnibus):</th> <td> 0.000</td> <th> Jarque-Bera (JB): </th> <td>118621.734</td>\n", |
2141 | | - "</tr>\n", |
2142 | | - "<tr>\n", |
2143 | | - " <th>Skew:</th> <td> 5.718</td> <th> Prob(JB): </th> <td> 0.00</td> \n", |
2144 | | - "</tr>\n", |
2145 | | - "<tr>\n", |
2146 | | - " <th>Kurtosis:</th> <td>58.357</td> <th> Cond. No. </th> <td> 13.4</td> \n", |
2147 | | - "</tr>\n", |
2148 | | - "</table><br/><br/>Warnings:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." |
2149 | | - ], |
2150 | | - "text/plain": [ |
2151 | | - "<class 'statsmodels.iolib.summary.Summary'>\n", |
2152 | | - "\"\"\"\n", |
2153 | | - " OLS Regression Results \n", |
2154 | | - "==============================================================================\n", |
2155 | | - "Dep. Variable: Fare R-squared: 0.427\n", |
2156 | | - "Model: OLS Adj. R-squared: 0.424\n", |
2157 | | - "Method: Least Squares F-statistic: 131.9\n", |
2158 | | - "Date: Wed, 24 Apr 2019 Prob (F-statistic): 1.92e-104\n", |
2159 | | - "Time: 12:07:17 Log-Likelihood: -4495.8\n", |
2160 | | - "No. Observations: 891 AIC: 9004.\n", |
2161 | | - "Df Residuals: 885 BIC: 9032.\n", |
2162 | | - "Df Model: 5 \n", |
2163 | | - "Covariance Type: nonrobust \n", |
2164 | | - "====================================================================================\n", |
2165 | | - " coef std err t P>|t| [0.025 0.975]\n", |
2166 | | - "------------------------------------------------------------------------------------\n", |
2167 | | - "Intercept 79.2989 3.543 22.381 0.000 72.345 86.253\n", |
2168 | | - "C(Pclass)[T.2] -59.0955 3.921 -15.073 0.000 -66.790 -51.401\n", |
2169 | | - "C(Pclass)[T.3] -68.8790 3.253 -21.174 0.000 -75.264 -62.494\n", |
2170 | | - "C(Embarked)[T.Q] -11.8147 5.446 -2.169 0.030 -22.504 -1.126\n", |
2171 | | - "C(Embarked)[T.S] -14.9202 3.414 -4.371 0.000 -21.620 -8.220\n", |
2172 | | - "FamilySize 7.8256 0.789 9.919 0.000 6.277 9.374\n", |
2173 | | - "==============================================================================\n", |
2174 | | - "Omnibus: 1043.506 Durbin-Watson: 2.040\n", |
2175 | | - "Prob(Omnibus): 0.000 Jarque-Bera (JB): 118621.734\n", |
2176 | | - "Skew: 5.718 Prob(JB): 0.00\n", |
2177 | | - "Kurtosis: 58.357 Cond. No. 13.4\n", |
2178 | | - "==============================================================================\n", |
2179 | | - "\n", |
2180 | | - "Warnings:\n", |
2181 | | - "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", |
2182 | | - "\"\"\"" |
2183 | | - ] |
2184 | | - }, |
2185 | | - "execution_count": 25, |
2186 | | - "metadata": {}, |
2187 | | - "output_type": "execute_result" |
2188 | | - } |
2189 | | - ], |
2190 | | - "source": [ |
2191 | | - "# import statsmodels.formula.api as smf\n", |
2192 | | - "# result = smf.ols(\"Fare ~ C(Pclass) + C(Embarked) + FamilySize\", data=df_titanic).fit()\n", |
2193 | | - "# result.summary()" |
| 2063 | + "This analysis allowed us to draw several interesting conclusions about the factors that influenced survival and ticket prices on the Titanic. However, it's important to note that this is just the beginning. For a more in-depth analysis, we could consider:\n", |
| 2064 | + "\n", |
| 2065 | + "- Using classification techniques to predict survival.\n", |
| 2066 | + "- Exploring other features or combinations of features.\n", |
| 2067 | + "- Using more advanced modeling techniques.\n", |
| 2068 | + "\n", |
| 2069 | + "This case study illustrates how data analysis can help us understand historical events and draw lessons that could be applicable in other contexts." |
2194 | 2070 | ] |
2195 | 2071 | } |
2196 | 2072 | ], |
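For reference, the modeling cell removed by this commit fit an ordinary least squares regression via `smf.ols("Fare ~ C(Pclass) + C(Embarked) + FamilySize", data=df_titanic)`. The same fit can be sketched without statsmodels using NumPy's least-squares solver; the data below is a small synthetic stand-in for the Titanic columns, for illustration only:

```python
import numpy as np

# Synthetic stand-ins for the columns used in the removed cell
# (Fare ~ C(Pclass) + C(Embarked) + FamilySize); illustration only.
fare     = np.array([71.3, 8.05, 26.0, 53.1, 13.0, 7.9])
pclass   = np.array([1, 3, 2, 1, 2, 3])
embarked = np.array(["C", "S", "S", "S", "Q", "Q"])
family   = np.array([2.0, 1.0, 1.0, 2.0, 1.0, 1.0])

# Dummy-code the categoricals the way the formula's C() terms do,
# dropping the first level of each as the reference category.
X = np.column_stack([
    np.ones_like(fare),               # Intercept
    (pclass == 2).astype(float),      # C(Pclass)[T.2]
    (pclass == 3).astype(float),      # C(Pclass)[T.3]
    (embarked == "Q").astype(float),  # C(Embarked)[T.Q]
    (embarked == "S").astype(float),  # C(Embarked)[T.S]
    family,                           # FamilySize
])

# Ordinary least squares: minimize ||X @ beta - fare||^2.
beta, *_ = np.linalg.lstsq(X, fare, rcond=None)
print(beta.shape)  # (6,) — one coefficient per design-matrix column
```

This mirrors only the coefficient estimates; the full statsmodels `result.summary()` table shown in the removed output (R-squared, t-statistics, confidence intervals) comes from additional inference machinery that `lstsq` does not provide.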