3 Data Preparation
Preparing the data is one of the most time-consuming parts of any data analysis/data mining project.
3.1 DATA SOURCES
- Surveys or polls
- Experiments
- Observational and other studies
- Operational databases (CRM, etc.)
- Data warehouses
- Historical databases
- Purchased data
3.2 DATA UNDERSTANDING
- Data Tables
- Continuous and Discrete Variables
- Scales of Measurement (Nominal/Ordinal/Interval/Ratio)
- Roles in Analysis (Labels/Descriptors/Response)
- Frequency Distribution
3.3 DATA PREPARATION
- Normalization
  - Min-max: $$Value' = \frac{Value - OriginalMin}{OriginalMax - OriginalMin} \times (NewMax - NewMin) + NewMin$$
  - z-score: $$Value' = \frac{Value - \bar{x}}{s}$$
  - Decimal scaling: $$Value' = \frac{Value}{10^n}$$
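The three normalization formulas above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, $s$ is assumed to be the sample standard deviation, and $n$ in decimal scaling is assumed to be the number of digits in the integer part of the largest absolute value.

```python
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max normalization is undefined for constant data")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Express each value as a number of standard deviations from the mean."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # assumption: s is the sample standard deviation
    return [(v - mean) / s for v in values]


def decimal_scaling(values):
    """Divide by 10^n so that every scaled value has absolute value below 1."""
    max_abs = max(abs(v) for v in values)
    # assumption: n is the digit count of the integer part of the largest |value|
    n = len(str(int(max_abs))) if max_abs >= 1 else 0
    return [v / 10 ** n for v in values]
```

For example, `min_max([10, 20, 30, 40, 50])` maps the smallest value to 0 and the largest to 1, while `decimal_scaling` on the same data divides every value by $10^2$.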
4 Tables and Graphs
4.2.2 Contingency Tables
4.2.3 Summary Tables
`summary(data)`
4.3.2 Frequency Polygons and Histograms
4.3.3 Scatterplots
4.3.4 Box Plots
5 Statistics
Population
Sample
Confidence intervals
: A confidence interval allows us to make statements concerning the likely range that a population parameter (such as the mean) lies within.

Hypothesis tests
: A hypothesis test determines whether the data collected supports a specific claim.

Chi-square
: The chi-square test is a statistical test procedure to understand whether a relationship exists between pairs of categorical variables.

One-way analysis of variance
: This test determines whether a relationship exists between three or more group means.

Summary of inferential statistical tests
5.3.2 Confidence Intervals
A single statistic can be used as an estimate for a population parameter (commonly referred to as a point estimate).
Confidence Ranges for Continuous Variables
For continuous variables, the mean is the most common population estimate.
$$
\bar{x} \pm {Z}_{C}\frac{S}{\sqrt{n}}
$$
The normal distribution can be used for large sample sizes where the number of observations is greater than or equal to 30. However, for a sample size of less than 30, an alternative distribution is needed: Student’s t-distribution.
$$
\bar{x} \pm {t}_{C}\frac{S}{\sqrt{n}}
$$
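The two interval formulas above can be sketched in Python with the standard library. This is an illustrative sketch under stated assumptions: the function name and its `t_critical` parameter are hypothetical, $S$ is taken to be the sample standard deviation, and because the standard library has no t-distribution, the critical value ${t}_{C}$ for small samples must be looked up (with $n - 1$ degrees of freedom) and passed in.

```python
import math
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95, t_critical=None):
    """Return (lower, upper) bounds of a confidence interval for the mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)      # assumption: S is the sample standard deviation
    std_error = s / math.sqrt(n)      # S / sqrt(n)

    if t_critical is not None:
        critical = t_critical         # caller-supplied t_C from a t-table
    elif n >= 30:
        # Two-sided critical Z-score, e.g. about 1.96 for 95% confidence
        critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    else:
        raise ValueError("n < 30: pass t_critical from a t-table (n - 1 degrees of freedom)")

    margin = critical * std_error
    return mean - margin, mean + margin
```

With 30 or more observations the function uses the normal distribution; for a sample of five observations at 95% confidence one would call it with `t_critical=2.776`, the tabulated value for 4 degrees of freedom.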
Confidence Ranges for Categorical Variables
When handling categorical variables, the proportion with a given outcome is often used to summarize the variable. This equals the number of observations with that outcome divided by the sample size. The confidence interval for a proportion depends on:
- the standard error of the proportion and
- the confidence level with which we wish to state the range.
$$
p \pm {Z}_{C}\sqrt{\frac{p(1-p)}{n}}
$$
- $p$: the proportion with a given outcome
- $n$: the sample size
- ${Z}_{C}$: the critical Z-score
- $\sqrt{\frac{p(1-p)}{n}}$: the standard error of the proportion
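As a rough illustration of the proportion interval, the sketch below computes $p$, the standard error, and the critical Z-score with the Python standard library. The function name and its `successes` parameter (the count of observations with the given outcome) are hypothetical.

```python
import math
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence=0.95):
    """Return (lower, upper) bounds of a confidence interval for a proportion."""
    p = successes / n                                      # proportion with the outcome
    z_c = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # critical Z-score
    std_error = math.sqrt(p * (1 - p) / n)                 # standard error of the proportion
    return p - z_c * std_error, p + z_c * std_error
```

For instance, 40 observations with the outcome in a sample of 100 give $p = 0.4$ and a 95% interval of roughly 0.30 to 0.50.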
5.3.3 Hypothesis Tests
A hypothesis test determines whether you have enough data to reject the claim (and accept the alternative) or whether you do not have enough data to reject the claim.
- Null hypothesis
- Alternative hypothesis
Hypothesis Assessment
Once the null hypothesis and the alternative hypothesis have been described, it is now possible to assess the hypotheses using the data collected.
First, the statistic of interest from the sample is calculated.
Next, a hypothesis test will look at the difference between the value claimed in the hypothesis statement and the calculated sample statistic.
For large sample sets (greater than or equal to 30 observations), the location of the hypothesis test result on the normal distribution curve of the sampling distribution determines whether the null hypothesis is rejected.
Calculating p-Values
A p-value is the probability of getting the recorded value or a more extreme value, assuming the null hypothesis is true. It measures how likely the result is under the null hypothesis and is used to judge the statistical significance of the claim.
Hypothesis Test: Single Group, Continuous Data
$$
Z = \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}
$$
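A minimal Python sketch of this test follows, assuming a large sample so that the normal distribution applies. It computes $Z$ as the difference between the sample mean $\bar{x}$ and the claimed mean $\mu$, divided by the standard error $s/\sqrt{n}$, and then a p-value. The function name is hypothetical, and the two-sided p-value is an assumption made for illustration; a one-sided alternative hypothesis would use only one tail.

```python
import math
from statistics import NormalDist


def one_sample_z_test(sample_mean, claimed_mean, s, n):
    """Return (Z, two-sided p-value) for a single-group test on continuous data."""
    z = (sample_mean - claimed_mean) / (s / math.sqrt(n))
    # Probability of a value at least this extreme in either tail,
    # given that the null hypothesis is true
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```

With a sample mean of 52, a claimed mean of 50, $s = 6$ and $n = 36$, the standard error is 1, so $Z = 2$ and the two-sided p-value is about 0.046.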
Hypothesis Test: Single Group, Categorical Data
Hypothesis Test: Two Groups, Continuous Data
If the group sizes are less than 30, Student’s t-distribution is used in place of the normal distribution.
Hypothesis Test: Two Groups, Categorical Data
Paired Test
Errors
5.3.4 Chi-Square
The chi-square test is a hypothesis test to use with variables measured on a nominal or ordinal scale. It allows an analysis of whether there is a relationship between two categorical variables. As with other hypothesis tests, it is necessary to state a null and alternative hypothesis.
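The notes do not give the formula, so the sketch below assumes the standard form of the statistic: for each cell of a contingency table, the expected count is (row total × column total) / grand total, and the statistic is the sum of (observed − expected)² / expected over all cells. The function name is hypothetical. The standard library has no chi-square distribution, so the result would still need to be compared against a tabulated critical value for the returned degrees of freedom.

```python
def chi_square_statistic(observed):
    """Return (chi-square statistic, degrees of freedom) for a contingency table.

    `observed` is a list of rows, each a list of observed counts.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if the two variables were unrelated
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (obs - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof
```

A table whose rows are exact multiples of each other, such as `[[10, 20], [20, 40]]`, yields a statistic of 0, which is consistent with the null hypothesis of no relationship.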
5.3.5 One-Way Analysis of Variance
[^1]: Glenn J. Myatt. “Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining”.