|
9 | 9 | },
|
10 | 10 | {
|
11 | 11 | "cell_type": "markdown",
|
12 |
| - "metadata": { |
13 |
| - "jp-MarkdownHeadingCollapsed": true |
14 |
| - }, |
| 12 | + "metadata": {}, |
15 | 13 | "source": [
|
16 | 14 | "## **Concepts**"
|
17 | 15 | ]
|
|
82 | 80 | "### **Concept 4 - What is p-value?**\n",
|
83 | 81 | "The p-value, short for \"probability value,\" is a number that helps us understand the strength of evidence against a null hypothesis in hypothesis testing.\n",
|
84 | 82 | "\n",
|
| 83 | + "**OR**\n", |
| 84 | + "\n", |
| 85 | + "The p-value quantifies the strength of evidence: it is the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true. The smaller the p-value, the less likely it is that the result arose by random chance alone, and the stronger the evidence for rejecting the null hypothesis.\n", |
| 86 | + "\n", |
85 | 87 | "To say whether the p-value is significant or not, we need a significance threshold called the **significance level**. This threshold is usually set at 0.05. \n",
|
86 | 88 | "\n",
|
87 | 89 | "It's important because it helps control the rate of Type I errors. By setting a threshold, we define the level of evidence needed to reject the null hypothesis. Lower thresholds (e.g., 0.01) require stronger evidence.\n",
|
88 | 90 | "\n",
|
89 | 91 | "\n",
|
90 |
| - "If the p-value is **BELOW** the threshold (meaning smaller than), then you can infer a **statistically significant relationship** between the input and target variables. \n", |
| 92 | + "If the p-value is **BELOW** the threshold (meaning smaller than), then the result is **statistically significant**, i.e. the outcome is unlikely to be due to random chance alone. \n", |
91 | 93 | "i.e. For p-value < 0.05, you can Reject the Null Hypothesis\n",
|
92 | 94 | "\n",
|
93 |
| - "Otherwise, then you can infer **no statistically significant relationship** between the predictor and outcome variables. \n", |
| 95 | + "Otherwise, the result is **not statistically significant**, i.e. the outcome is consistent with random chance. \n", |
94 | 96 | "i.e. For p-value > 0.05, you Fail to Reject the Null Hypothesis\n",
|
95 | 97 | "\n",
|
96 | 98 | "Here's a simple explanation:\n",
|
|
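The threshold rule in Concept 4 can be illustrated with a small, self-contained sketch. It is not part of the notebook: the coin-flip experiment, the function names `binomial_two_sided_p` and `decide`, and the counts are hypothetical. It computes an exact two-sided p-value for a binomial outcome using only the standard library and then compares it with the significance level.

```python
from math import comb


def binomial_two_sided_p(successes, trials, p_null=0.5):
    """Exact two-sided p-value for a binomial outcome under H0: p = p_null.

    Sums the probability of every outcome that is no more likely
    than the observed one, assuming the null hypothesis is true.
    """
    pmf = [
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    # Small relative tolerance so ties are not lost to floating-point rounding.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))


def decide(p_value, significance_level=0.05):
    """Apply the threshold rule: reject H0 only when p is below the level."""
    if p_value < significance_level:
        return "Reject the Null Hypothesis"
    return "Fail to Reject the Null Hypothesis"


# Hypothetical experiment: 61 heads in 100 flips of a coin assumed fair under H0.
p = binomial_two_sided_p(61, 100)
print(f"p={p:.4f} -> {decide(p)}")
```

Lowering `significance_level` to 0.01 demands stronger evidence, so the same data may no longer lead to rejection.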
149 | 151 | },
|
150 | 152 | {
|
151 | 153 | "cell_type": "markdown",
|
152 |
| - "metadata": { |
153 |
| - "jp-MarkdownHeadingCollapsed": true |
154 |
| - }, |
| 154 | + "metadata": {}, |
155 | 155 | "source": [
|
156 | 156 | "## **Importing all the Required Libraries**"
|
157 | 157 | ]
|
|
173 | 173 | },
|
174 | 174 | {
|
175 | 175 | "cell_type": "markdown",
|
176 |
| - "metadata": { |
177 |
| - "jp-MarkdownHeadingCollapsed": true |
178 |
| - }, |
| 176 | + "metadata": {}, |
179 | 177 | "source": [
|
180 | 178 | "## **Loading the Data**"
|
181 | 179 | ]
|
|
405 | 403 | },
|
406 | 404 | {
|
407 | 405 | "cell_type": "markdown",
|
408 |
| - "metadata": { |
409 |
| - "jp-MarkdownHeadingCollapsed": true |
410 |
| - }, |
| 406 | + "metadata": {}, |
411 | 407 | "source": [
|
412 | 408 | "## **Renaming the Columns**"
|
413 | 409 | ]
|
|
478 | 474 | },
|
479 | 475 | {
|
480 | 476 | "cell_type": "markdown",
|
481 |
| - "metadata": { |
482 |
| - "jp-MarkdownHeadingCollapsed": true |
483 |
| - }, |
| 477 | + "metadata": {}, |
484 | 478 | "source": [
|
485 | 479 | "## **Univariate Analysis - Discrete Data**"
|
486 | 480 | ]
|
|
773 | 767 | "source": [
|
774 | 768 | "### Chi-Square Test for Goodness-of-fit\n",
|
775 | 769 | "\n",
|
776 |
| - "Tests whether the observed frequencies of categorical data match the expected frequencies according to a specified distribution." |
| 770 | + "Tests whether the observed frequencies of categorical data match the expected frequencies according to a specified distribution.\n", |
| 771 | + "\n", |
| 772 | + "**Assumptions**\n", |
| 773 | + "- Observations in each sample are independent and identically distributed (iid).\n", |
| 774 | + "- Observations should be discrete.\n", |
| 775 | + "\n", |
| 776 | + "**Interpretation**\n", |
| 777 | + "- H0: The observed frequencies match the expected frequencies.\n", |
| 778 | + "- H1: The observed frequencies do not match the expected frequencies." |
777 | 779 | ]
|
778 | 780 | },
|
779 | 781 | {
|
|
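As a rough illustration of the goodness-of-fit hypotheses above, the sketch below computes the chi-square statistic by hand. It is an assumption-laden example, not the notebook's code: the function name `chi_square_gof` and the counts are hypothetical, and the closed-form p-value used here holds only when the degrees of freedom are even. For the general case, `scipy.stats.chisquare(f_obs, f_exp)` is the usual choice.

```python
from math import exp, factorial


def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic and p-value (even df only)."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected frequencies must be positive")

    # Sum of squared deviations, each scaled by its expected frequency.
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    df = len(observed) - 1
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles an even number of degrees of freedom")

    # Survival function of the chi-square distribution for even df:
    # exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
    half = stat / 2
    p_value = exp(-half) * sum(half**i / factorial(i) for i in range(df // 2))
    return stat, p_value


# Hypothetical counts for three categories, tested against a uniform split.
observed = [50, 30, 20]
expected = [sum(observed) / 3] * 3
stat, p = chi_square_gof(observed, expected)
print(f"stat={stat:.3f}, p={p:.4f}")
```

A p-value below the significance level would lead to rejecting H0, i.e. concluding that the observed frequencies do not match the expected ones.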
818 | 820 | },
|
819 | 821 | {
|
820 | 822 | "cell_type": "markdown",
|
821 |
| - "metadata": { |
822 |
| - "jp-MarkdownHeadingCollapsed": true |
823 |
| - }, |
| 823 | + "metadata": {}, |
824 | 824 | "source": [
|
825 | 825 | "## **Univariate Analysis - Numerical Data**"
|
826 | 826 | ]
|
|
841 | 841 | {
|
842 | 842 | "cell_type": "code",
|
843 | 843 | "execution_count": 16,
|
844 |
| - "metadata": { |
845 |
| - "scrolled": true |
846 |
| - }, |
| 844 | + "metadata": {}, |
847 | 845 | "outputs": [
|
848 | 846 | {
|
849 | 847 | "name": "stdout",
|
|
1029 | 1027 | "### Kolmogorov-Smirnov Test\n",
|
1030 | 1028 | "\n",
|
1031 | 1029 | "The Kolmogorov-Smirnov (KS) test is used to check if a sample follows a specific distribution, including normal distribution. \n",
|
| 1030 | + "\n", |
1032 | 1031 | "**Assumptions**\n",
|
1033 | 1032 | "- Observations in each sample are independent and identically distributed (iid).\n",
|
1034 | 1033 | "\n",
|
1035 | 1034 | "**Interpretation**\n",
|
1035 | 1034 | "- H0: the sample follows the specified distribution.\n",
|
1037 |
| - "- H1: the sample does not have that distribution.\n" |
| 1036 | + "- H1: the sample does not have that distribution." |
1038 | 1037 | ]
|
1039 | 1038 | },
|
1040 | 1039 | {
|
|
1044 | 1043 | "outputs": [],
|
1045 | 1044 | "source": [
|
1046 | 1045 | "def kolmogorov_smirnov(data, significance_level):\n",
|
| 1046 | + " # 'norm' can be replaced with the callable stats.norm.cdf\n", |
1047 | 1047 | " stat, p = stats.kstest(data, 'norm')\n",
|
1048 | 1048 | " \n",
|
1049 | 1049 | " print('stat=%.3f, p=%.3f' % (stat, p))\n",
|
|
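One point worth keeping in mind with `stats.kstest(data, 'norm')`: with no extra arguments the reference distribution is the standard normal (mean 0, standard deviation 1), so data on another scale will look non-normal. The sketch below, which is not from the notebook, computes the KS statistic by hand with the standard library to show the effect; the helper `ks_statistic` and the simulated sample are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev


def ks_statistic(data, cdf):
    """Largest gap between the empirical CDF of `data` and the reference `cdf`."""
    xs = sorted(data)
    n = len(xs)
    return max(
        # Check the gap just after and just before each jump of the empirical CDF.
        max((i + 1) / n - cdf(x), cdf(x) - i / n)
        for i, x in enumerate(xs)
    )


# Hypothetical sample: normally distributed, but centred at 50 with spread 10.
rng = random.Random(0)
sample = [rng.gauss(50, 10) for _ in range(200)]

std_normal_cdf = NormalDist(0, 1).cdf

# Raw data compared with N(0, 1): the statistic is close to its maximum of 1.
d_raw = ks_statistic(sample, std_normal_cdf)

# Standardized data compared with N(0, 1): the statistic becomes small.
m, s = mean(sample), stdev(sample)
standardized = [(x - m) / s for x in sample]
d_std = ks_statistic(standardized, std_normal_cdf)

print(f"D (raw)={d_raw:.3f}, D (standardized)={d_std:.3f}")
```

Because the mean and standard deviation are estimated from the same sample, the usual KS p-value is only approximate after standardizing; treat the result as a rough check.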
1078 | 1078 | "source": [
|
1079 | 1079 | "### One-Sample t-test\n",
|
1080 | 1080 | "\n",
|
1081 |
| - "Tests whether the mean of a single sample is significantly different from a known or hypothesized population mean." |
| 1081 | + "Tests whether the mean of a single sample is significantly different from a known or hypothesized population mean.\n", |
| 1082 | + "\n", |
| 1083 | + "**Assumptions**\n", |
| 1084 | + "- Observations in each sample are independent and identically distributed (iid).\n", |
| 1085 | + "- Observations are continuous measurements.\n", |
| 1086 | + "- Data is normally distributed.\n", |
| 1087 | + "\n", |
| 1088 | + "**Interpretation**\n", |
| 1089 | + "- H0: $\\mu_{pop}=m_o$.\n", |
| 1090 | + "- H1: $\\mu_{pop}\\ne m_o$." |
1082 | 1091 | ]
|
1083 | 1092 | },
|
1084 | 1093 | {
|
|
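The one-sample t-test hypotheses above rest on a single statistic, which can be sketched with the standard library alone. This is an illustrative example rather than the notebook's implementation: the function name `one_sample_t` and the data are hypothetical. Turning the statistic into a p-value needs the t distribution, for which `scipy.stats.ttest_1samp(data, popmean)` is the usual tool.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(data, m_o):
    """t statistic and degrees of freedom for H0: population mean equals m_o."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    # Standard error of the mean, using the sample standard deviation (n - 1).
    standard_error = stdev(data) / sqrt(n)
    t_stat = (mean(data) - m_o) / standard_error
    return t_stat, n - 1


# Hypothetical measurements tested against a hypothesized mean of 3.
t_stat, df = one_sample_t([2, 4, 6, 8], m_o=3)
print(f"t={t_stat:.3f}, df={df}")
```

The further `t_stat` lies from zero, the stronger the evidence against H0 for the given degrees of freedom.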
1122 | 1131 | },
|
1123 | 1132 | {
|
1124 | 1133 | "cell_type": "markdown",
|
1125 |
| - "metadata": { |
1126 |
| - "jp-MarkdownHeadingCollapsed": true |
1127 |
| - }, |
| 1134 | + "metadata": {}, |
1128 | 1135 | "source": [
|
1129 | 1136 | "## **Bivariate Analysis - Numerical vs Numerical**"
|
1130 | 1137 | ]
|
|
1238 | 1245 | },
|
1239 | 1246 | {
|
1240 | 1247 | "cell_type": "markdown",
|
1241 |
| - "metadata": { |
1242 |
| - "jp-MarkdownHeadingCollapsed": true |
1243 |
| - }, |
| 1248 | + "metadata": {}, |
1244 | 1249 | "source": [
|
1245 | 1250 | "## **Bivariate Analysis - Categorical vs Categorical**"
|
1246 | 1251 | ]
|
|
1477 | 1482 | },
|
1478 | 1483 | {
|
1479 | 1484 | "cell_type": "markdown",
|
1480 |
| - "metadata": { |
1481 |
| - "jp-MarkdownHeadingCollapsed": true |
1482 |
| - }, |
| 1485 | + "metadata": {}, |
1483 | 1486 | "source": [
|
1484 | 1487 | "## **Bivariate Analysis - Numerical vs Categorical**"
|
1485 | 1488 | ]
|
|
1735 | 1738 | },
|
1736 | 1739 | {
|
1737 | 1740 | "cell_type": "markdown",
|
1738 |
| - "metadata": { |
1739 |
| - "jp-MarkdownHeadingCollapsed": true |
1740 |
| - }, |
| 1741 | + "metadata": {}, |
1741 | 1742 | "source": [
|
1742 | 1743 | "## **This is not the end!**\n",
|
1743 | 1744 | "\n",
|
|