8  Fundamentals of Statistical Tests

Population and Sample

A population is the entire group of individuals or observations of interest. A sample is a subset drawn from the population and used to draw conclusions about it.

Parameter and Statistic

A parameter is a numerical summary of a population, such as the population mean \(\mu\) or standard deviation \(\sigma\). A statistic is a numerical summary computed from a sample, such as the sample mean \(\bar{x}\) or standard deviation \(s\), and is used to estimate the corresponding parameter.

8.1 Standard deviation

  • The standard deviation (SD) is a measure of the amount of variation or dispersion in a set of values.
  • It is defined as the square root of the average of the squared deviations from the mean; in essence, it reflects how much individual data points differ from the mean of the dataset.
  • A low standard deviation means that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation means that the values are spread out over a wider range.

Standard deviation of population:

The formula for the standard deviation of a population (\(\sigma\)) is:

\[ \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2} \]

where:

  • \(\sigma\) is the population standard deviation,
  • \(n\) is the size of the population,
  • \(x_i\) is each individual observation,
  • \(\mu\) is the population mean,
  • \(\sum_{i=1}^{n}\) denotes the sum over all observations in the population.

Standard deviation of sample:

The formula for the standard deviation of a sample (\(s\)) is:

\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]

where:

  • \(s\) is the sample standard deviation,
  • \(n\) is the number of observations in the sample,
  • \(x_i\) is each individual observation,
  • \(\bar{x}\) is the sample mean,
  • \(\sum_{i=1}^{n}\) denotes the sum over all observations.

Note that the sample formula divides by \(n-1\) instead of \(n\).

The divisor \(n-1\) accounts for the degrees of freedom lost by estimating the mean from the same data and corrects the downward bias that dividing by \(n\) would introduce when estimating the population variance from a sample. This correction is known as Bessel’s correction. (Strictly speaking, it makes the variance estimate unbiased; the standard deviation itself remains very slightly biased, but much less so.)
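
The effect of Bessel’s correction is easy to see in a quick simulation: draw many small samples from a population with known variance and compare the average of the two variance estimates. A minimal Python sketch (the population parameters and sample size are chosen purely for illustration):

Code
python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 5, 100_000   # small samples make the bias easy to see

# Population: normal with mean 0 and standard deviation 10 (variance 100)
samples = rng.normal(loc=0, scale=10, size=(trials, n))

var_biased = samples.var(axis=1, ddof=0)     # divides by n
var_unbiased = samples.var(axis=1, ddof=1)   # divides by n - 1 (Bessel's correction)

print(var_biased.mean())    # about 80: underestimates the true variance of 100
print(var_unbiased.mean())  # about 100: unbiased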

8.2 Standard deviation calculation

8.2.1 Standard deviation calculation using Excel:

link here

8.2.2 Standard deviation calculation using R:

Sample Standard Deviation

Code
R
# Sample data
scores <- c(66,84,70,72,73,86,75,88,62,81,85,63,79,79,83,90,61,82,69,89)
sample_size <- length(scores)
sample_size
[1] 20
Code
# Calculate sample mean
sample_mean <- mean(scores)
sample_mean
[1] 76.85
Code
R
# Calculate sample standard deviation
sample_sd <- sqrt(sum((scores - sample_mean)^2) / (sample_size - 1))
# another way of calculation is using the direct function sd() as below
# sample_sd <- sd(scores)
sample_sd
[1] 9.371148

Population Standard Deviation

Code
R
# Calculate population standard deviation
population_sd <- sqrt(sum((scores - sample_mean)^2) / (sample_size))
# another way of calculation is rescaling the sample sd as below
# population_sd <- sample_sd * sqrt((sample_size - 1) / sample_size)
population_sd
[1] 9.133866

8.2.3 Standard deviation calculation using python:

Sample Standard Deviation

Code
python
import numpy as np
scores = np.array([66,84,70,72,73,86,75,88,62,81,85,63,79,79,83,90,61,82,69,89])
sample_size = len(scores)
sample_size
20
Code
# Calculate sample mean
sample_mean = np.mean(scores)
sample_mean
76.85
Code
python
# Calculate the standard deviation for the sample
# ddof=1 gives the sample standard deviation (NumPy's default is ddof=0)
sample_sd = np.std(scores, ddof=1)
sample_sd
9.371148331588376

Population Standard Deviation

Code
python
# Calculate the standard deviation for the population
# ddof=0 (NumPy's default) gives the population standard deviation
population_sd = np.std(scores, ddof=0)
population_sd
9.133865556269154

Example dataset:

Download the dataset here

8.2.4 Calculate Standard Deviation of credit score with R

Code
R
library(readxl)
# Load data and view
data_bank <- read_excel("Bank Customer Churn Prediction.xlsx")
data_bank
# A tibble: 10,000 × 12
   customer_id credit_score country gender   age tenure balance products_number
         <dbl>        <dbl> <chr>   <chr>  <dbl>  <dbl>   <dbl>           <dbl>
 1    15634602          619 France  Female    42      2      0                1
 2    15647311          608 Spain   Female    41      1  83808.               1
 3    15619304          502 France  Female    42      8 159661.               3
 4    15701354          699 France  Female    39      1      0                2
 5    15737888          850 Spain   Female    43      2 125511.               1
 6    15574012          645 Spain   Male      44      8 113756.               2
 7    15592531          822 France  Male      50      7      0                2
 8    15656148          376 Germany Female    29      4 115047.               4
 9    15792365          501 France  Male      44      4 142051.               2
10    15592389          684 France  Male      27      2 134604.               1
# ℹ 9,990 more rows
# ℹ 4 more variables: credit_card <dbl>, active_member <dbl>,
#   estimated_salary <dbl>, churn <dbl>
Code
sample_size <- length(data_bank$credit_score)
sample_size
[1] 10000
Code
# Calculate sample mean
sample_mean <- mean(data_bank$credit_score)
sample_mean
[1] 650.5288
Code
# Calculate sample standard deviation
sample_sd <- sd(data_bank$credit_score)
sample_sd
[1] 96.6533
Code
# Calculate population standard deviation
population_sd <- sqrt(sum((data_bank$credit_score - sample_mean)^2) / (sample_size))
population_sd
[1] 96.64847

8.2.5 Calculate Standard Deviation of credit score with Python

To load Excel data into Python, install openpyxl in Jupyter Notebook using the command !pip3 install pandas openpyxl

Code
python
import pandas as pd
import numpy as np
# Load data
data_bank = pd.read_excel("Bank Customer Churn Prediction.xlsx")
data_bank
      customer_id  credit_score  ... estimated_salary churn
0        15634602           619  ...        101348.88     1
1        15647311           608  ...        112542.58     0
2        15619304           502  ...        113931.57     1
3        15701354           699  ...         93826.63     0
4        15737888           850  ...         79084.10     0
...           ...           ...  ...              ...   ...
9995     15606229           771  ...         96270.64     0
9996     15569892           516  ...        101699.77     0
9997     15584532           709  ...         42085.58     1
9998     15682355           772  ...         92888.52     1
9999     15628319           792  ...         38190.78     0

[10000 rows x 12 columns]
Code
# Calculate sample mean of credit score
sample_mean = np.mean(data_bank['credit_score'])
sample_mean
650.5288
Code
# Calculate sample standard deviation of credit score
sample_sd = np.std(data_bank['credit_score'], ddof=1)
sample_sd
96.65329873613035
Code
# Calculate population standard deviation of credit score
population_sd = np.std(data_bank['credit_score'], ddof=0)
population_sd
96.64846595037089

8.3 Hypothesis Testing

Hypothesis testing is a statistical method used to make decisions about a population based on sample data. It’s a core concept in statistics and research, allowing scientists, analysts, and decision-makers to test assumptions, theories, or hypotheses about a parameter (e.g., mean, proportion) of a population.

8.3.1 Fundamental Concepts of Hypothesis testing

Hypotheses: In hypothesis testing, two opposing hypotheses are formulated:

  • Null Hypothesis (\(H_0\)): It assumes no effect or no difference in the population. It’s a statement of “no change” or “status quo.”
  • Alternative Hypothesis (\(H_a\) or \(H_1\)): It represents what the researcher aims to prove. It suggests a new effect, difference, or change from the status quo.

Significance Level (\(\alpha\)): It’s the threshold for rejecting the null hypothesis, typically set at 0.05 (5%). It represents the probability of rejecting the null hypothesis when it’s actually true, known as Type I error.

P-value: The probability of observing the sample data, or something more extreme, if the null hypothesis is true. A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis.

Test Statistic: A value calculated from the sample data, used to evaluate the likelihood of the null hypothesis. The form of the test statistic depends on the test type (e.g., z-test, t-test).
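
For example, the test statistic of a one-sample t-test compares the sample mean \(\bar{x}\) with a hypothesized population mean \(\mu_0\):

\[ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \]

where \(s\) is the sample standard deviation and \(n\) is the sample size. Large absolute values of \(t\) mean the sample mean is far from \(\mu_0\) relative to its standard error, which is evidence against the null hypothesis.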

8.3.2 Steps in Hypothesis Testing

  1. Formulate Hypotheses: Define the null and alternative hypotheses based on the research question.

  2. Choose the Significance Level: Set the \(\alpha\) level (e.g., 0.05).

  3. Select the Appropriate Test: Based on the data type and hypothesis, choose a statistical test (e.g., t-test for comparing means).

  4. Calculate the Test Statistic: Use the sample data to compute the test statistic.

  5. Determine the P-value: Find the probability of observing the test results under the null hypothesis.

  6. Make a Decision: Compare the p-value to \(\alpha\). If the p-value is less than \(\alpha\), reject the null hypothesis; otherwise, fail to reject it.
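
The sketch below (an added illustration, not part of the original examples) applies these six steps to the exam scores from Section 8.2.3, testing \(H_0: \mu = 72\) against \(H_a: \mu \neq 72\); the hypothesized mean of 72 is chosen arbitrarily for demonstration.

Code
python
import numpy as np
from scipy import stats

scores = np.array([66,84,70,72,73,86,75,88,62,81,85,63,79,79,83,90,61,82,69,89])

# Steps 1-2: H0: mu = 72 vs. Ha: mu != 72, tested at alpha = 0.05
alpha = 0.05

# Steps 3-5: the population variance is unknown, so use a one-sample t-test;
# ttest_1samp returns the test statistic and the two-sided p-value
t_stat, p_value = stats.ttest_1samp(scores, popmean=72)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# Step 6: compare the p-value with alpha
if p_value < alpha:
    print("Reject H0: the mean score differs from 72.")
else:
    print("Fail to reject H0.")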

8.3.3 Errors in Hypothesis Testing

  • Type I Error (False Positive): Rejecting the null hypothesis when it is true.
  • Type II Error (False Negative): Failing to reject the null hypothesis when the alternative hypothesis is true.
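
The Type I error rate can be demonstrated directly: when \(H_0\) is in fact true and we test at \(\alpha = 0.05\), roughly 5% of repeated experiments will reject it anyway. A minimal simulation sketch:

Code
python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, trials = 0.05, 10_000

# H0 is true here: both groups are drawn from the same distribution
rejections = 0
for _ in range(trials):
    a = rng.normal(0, 1, 30)
    b = rng.normal(0, 1, 30)
    if stats.ttest_ind(a, b).pvalue < alpha:
        rejections += 1  # a false positive (Type I error)

print(rejections / trials)  # close to alpha = 0.05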

8.3.4 Basic Types of Tests

  • Z-test: Used for hypothesis testing when the population variance is known or the sample size is large enough for the sample mean to be approximately normally distributed.
  • T-test: Applied when the population variance is unknown. It includes one-sample, independent two-sample, and paired t-tests.
  • ANOVA (Analysis of Variance): Used to compare the means of three or more samples.
  • Chi-square Test: Applied to categorical data to assess how likely it is that an observed distribution is due to chance.
  • Regression Analysis: Tests hypotheses about relationships between variables.
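
For orientation, most of these tests have standard implementations in Python’s scipy.stats module (a z-test is not included there but is available as statsmodels.stats.weightstats.ztest). The sketch below runs a few of them on simulated placeholder data:

Code
python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(50, 5, 25)
group_b = rng.normal(52, 5, 25)
group_c = rng.normal(55, 5, 25)

# T-test: independent two-sample comparison of means
print(stats.ttest_ind(group_a, group_b).pvalue)

# ANOVA: compare the means of three or more groups
print(stats.f_oneway(group_a, group_b, group_c).pvalue)

# Chi-square test of independence on a 2x2 table of observed counts
chi2, p, dof, expected = stats.chi2_contingency(np.array([[30, 10], [20, 25]]))
print(p)

# Regression: t-test on the slope of a simple linear regression
print(stats.linregress(group_a, group_b).pvalue)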

8.3.5 Power of the Test

  • The power of a hypothesis test is the probability that it correctly rejects a false null hypothesis (1 - Probability of Type II error). High power is desirable and can be increased by enlarging the sample size, increasing the effect size, or choosing a higher significance level.
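
In practice, a power analysis is often run before data collection to choose a sample size. A sketch using statsmodels (assuming the package is installed; the effect size is expressed in Cohen’s d units):

Code
python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per group needed to detect a medium effect (d = 0.5)
# with alpha = 0.05 and 80% power in a two-sided two-sample t-test
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(n_per_group)  # roughly 64 per group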

8.3.6 Assumptions of the Test

Most statistical tests have underlying assumptions about the data (e.g., normality, independence, homoscedasticity). Violating these assumptions can affect the validity of the test results, so it is important to choose the right test based on these assumptions or to use non-parametric tests that do not rely on them.
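
For example, the normality assumption behind a t-test can be checked before running it. A sketch applying the Shapiro–Wilk test to the exam scores from Section 8.2.3:

Code
python
import numpy as np
from scipy import stats

scores = np.array([66,84,70,72,73,86,75,88,62,81,85,63,79,79,83,90,61,82,69,89])

# Shapiro-Wilk test; H0: the data come from a normal distribution
stat, p = stats.shapiro(scores)
print(p)  # a small p-value (e.g., < 0.05) would cast doubt on normality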

8.4 One-tailed vs. Two-tailed Tests

A hypothesis test can be one-tailed or two-tailed, depending on the nature of the alternative hypothesis. Both approaches ask whether there is enough evidence to reject the null hypothesis; they differ in whether the direction of the expected relationship or difference is specified in advance.

8.4.1 One-tailed Tests

A one-tailed test, also known as a directional test, is used when the research hypothesis specifies the direction of the relationship or difference. It tests for the possibility of an effect in one specific direction and ignores the possibility of an effect in the other direction. This makes a one-tailed test more powerful than a two-tailed test for detecting an effect in that direction, because the entire rejection region (the full \(\alpha\)) is placed in one tail.

When to use:

  • If you have a specific hypothesis that states one variable is greater than or less than the other.
  • If the consequences of missing an effect in one direction are not as critical as in the other direction.

Example: Suppose you are testing a new drug and believe that it will be more effective than the current treatment. You would use a one-tailed test to determine if the new drug is significantly better.

8.4.2 Two-tailed Tests

A two-tailed test, or non-directional test, is used when the research hypothesis does not specify the direction of the expected relationship or difference. It tests for the possibility of the relationship in both directions: it checks whether one variable is either greater than or less than the other, and therefore requires more evidence to reject the null hypothesis than a one-tailed test at the same significance level.

When to use:

  • If you do not have a specific direction in mind, or if you are interested in detecting any significant difference regardless of direction.
  • If the consequences of missing an effect are equally important in both directions.

Example: Suppose you are testing a new teaching method and want to find out if it has a different effect (either better or worse) on students’ test scores compared to the traditional method. A two-tailed test would be appropriate in this case.
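
Both variants are exposed directly in scipy’s t-test functions through the alternative argument (a sketch with simulated data; requires scipy 1.6 or later):

Code
python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
new = rng.normal(78, 8, 30)   # e.g., scores under the new teaching method
old = rng.normal(74, 8, 30)   # e.g., scores under the traditional method

# Two-tailed: is there a difference in either direction?
print(stats.ttest_ind(new, old, alternative='two-sided').pvalue)

# One-tailed: is the new method specifically better?
print(stats.ttest_ind(new, old, alternative='greater').pvalue)

When the observed effect lies in the hypothesized direction, the one-sided p-value is half the two-sided one, which is the precise sense in which a one-tailed test is more powerful in that direction.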


8.4.3 Choosing Between One-tailed and Two-tailed Tests

The choice between a one-tailed and two-tailed test should be determined by the research question or hypothesis. One-tailed tests are more powerful for detecting an effect in one direction but at the cost of potentially missing an effect in the other direction. Two-tailed tests are more conservative and are used when it is important to detect effects in either direction.

Considerations:

  • Research Hypothesis: The directionality of your hypothesis should guide your choice.
  • Potential Biases: Be cautious of choosing a one-tailed test for the mere purpose of achieving statistical significance. This practice can lead to biases in research.
  • Field of Study: Some fields have conventions preferring one type of test over the other, often based on the typical research questions and hypotheses in those fields.