May 18 2014

Data Mining

Making sense of data[^1]

3 Data Preparation

Preparing the data is one of the most time-consuming parts of any data analysis/data mining project.

3.1 DATA SOURCES

Surveys or polls
Experiments
Observational and other studies
Operational databases (CRM, etc.)
Data warehouses
Historical databases
Purchased data

3.2 DATA UNDERSTANDING

Data Tables
Continuous and Discrete Variables
Scales of Measurement (Nominal/Ordinal/Interval/Ratio)
Roles in Analysis (Labels/Descriptors/Response)
Frequency Distribution

3.3 DATA PREPARATION

Normalization
- Min-max: $$Value' = \frac{Value - OriginalMin}{OriginalMax - OriginalMin} \times (NewMax - NewMin) + NewMin$$
- z-score: $$Value' = \frac{Value - \bar{x}}{s}$$
- Decimal scaling: $$Value' = \frac{Value}{10^n}$$
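
The three normalizations can be sketched in Python as below. This is a minimal illustration, not the book's code: the function names are hypothetical, the z-score uses the sample standard deviation, and the choice of n for decimal scaling (the smallest n that brings every absolute value below 1) is an assumption, since the notes do not define n.

```python
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly from the observed range to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max normalization needs at least two distinct values")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Subtract the sample mean and divide by the sample standard deviation."""
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)  # n - 1 in the denominator
    return [(v - x_bar) / s for v in values]


def decimal_scaling(values):
    """Divide by 10^n, with n chosen (assumption) so that every |value| < 1."""
    max_abs = max(abs(v) for v in values)
    n = 0
    while max_abs / 10 ** n >= 1:
        n += 1
    return [v / 10 ** n for v in values]


data = [10, 20, 30, 40, 50]
print(min_max(data))
print(z_score(data))
print(decimal_scaling(data))
```

Min-max keeps the relative spacing of the values, while the z-score expresses each value as a number of standard deviations from the mean.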

4 Tables and Graphs

4.2.2 Contingency Tables

table-4.2.2

4.2.3 Summary Tables

summary(data)

4.3.2 Frequency Polygrams and Histograms

4.3.3 Scatterplots

4.3.4 Box Plots

table-4.7

5 Statistics

Population
Sample
Confidence intervals: A confidence interval allows us to make statements concerning the likely range that a population parameter (such as the mean) lies within.
Hypothesis tests: A hypothesis test determines whether the data collected supports a specific claim.
Chi-square: The chi-square test is a statistical test procedure to understand whether a relationship exists between pairs of categorical variables.
One-way analysis of variance: This test determines whether the means of three or more groups differ.

Summary of inferential statistical tests

table-5.3

5.3.2 Confidence Intervals

A single statistic can be used as an estimate of a population parameter (commonly referred to as a point estimate).

Confidence Ranges for Continuous Variables

For continuous variables, the mean is the most common population estimate.
$$
\bar{x} \pm {Z}_{C}\frac{S}{\sqrt{n}}
$$

The normal distribution can be used for large sample sizes where the number of observations is greater than or equal to 30. However, for a sample size of less than 30, an alternative distribution is needed: Student’s t-distribution.

$$
\bar{x} \pm {t}_{C}\frac{S}{\sqrt{n}}
$$
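
As a rough sketch of the large-sample case, the interval for the mean can be computed with Python's standard library. The function name is hypothetical, and the code assumes at least 30 observations so that the normal critical value applies. For smaller samples the critical value would have to come from the t-distribution, which the standard library does not provide.

```python
import math
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Return (lower, upper) for the population mean using the critical Z-score."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation
    # Two-sided critical value, e.g. roughly 1.96 for 95% confidence.
    z_c = statistics.NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_c * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin


sample = list(range(1, 41))  # 40 made-up observations
print(mean_confidence_interval(sample, confidence=0.95))
```

Raising the confidence level widens the interval, and increasing the sample size narrows it.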

Confidence Ranges for Categorical Variables

When handling categorical variables, the proportion with a given outcome is often used to summarize the variable. This equals the number of observations with that outcome divided by the sample size.

The confidence interval for the proportion depends on two quantities:

the standard error of the proportion and
the confidence level with which we wish to state the range.

$$
p \pm {Z}_{C}{\sqrt\frac{p(1-p)}{n}}
$$

p: the proportion with a given outcome
n: the sample size
${Z}_{C}$: the critical Z-score
$\sqrt\frac{p(1-p)}{n}$: the standard error of the proportion
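
The proportion interval can be sketched the same way. The function name and the counts in the example are made up for illustration, and the normal approximation is assumed to be adequate for the sample size.

```python
import math
from statistics import NormalDist


def proportion_confidence_interval(count, n, confidence=0.95):
    """Return (lower, upper) for a population proportion."""
    p = count / n  # proportion with the given outcome
    standard_error = math.sqrt(p * (1 - p) / n)
    z_c = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # critical Z-score
    return p - z_c * standard_error, p + z_c * standard_error


# 60 of 100 observations have the outcome of interest.
print(proportion_confidence_interval(60, 100, confidence=0.95))
```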

5.3.3 Hypothesis Tests

A hypothesis test determines whether you have enough data to reject the claim (and accept the alternative) or whether you do not have enough data to reject the claim.

Null hypothesis
Alternative hypothesis

Hypothesis Assessment

Once the null hypothesis and the alternative hypothesis have been described, it is now possible to assess the hypotheses using the data collected.

First, the statistic of interest from the sample is calculated.

Next, a hypothesis test will look at the difference between the value claimed in the hypothesis statement and the calculated sample statistic.

For large samples (30 or more observations), the location of the hypothesis test result on the normal curve of the sampling distribution determines whether the null hypothesis is rejected.

Calculating `p-Values`

A p-value is the probability, assuming the null hypothesis is true, of getting the recorded value or a more extreme value. It measures the statistical significance of the result: the smaller the p-value, the less consistent the data are with the null hypothesis.
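
For a test statistic that follows the standard normal distribution, the p-value can be read off the cumulative distribution function. The sketch below is one way to do it in Python. The function name and the `alternative` labels are illustrative choices, not taken from the book.

```python
from statistics import NormalDist


def p_value_from_z(z, alternative="two-sided"):
    """Probability of a Z-score at least as extreme as z under the null hypothesis."""
    cdf = NormalDist().cdf  # standard normal cumulative distribution
    if alternative == "two-sided":
        return 2 * (1 - cdf(abs(z)))
    if alternative == "greater":
        return 1 - cdf(z)
    if alternative == "less":
        return cdf(z)
    raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")


print(p_value_from_z(1.96))             # close to 0.05
print(p_value_from_z(1.96, "greater"))  # close to 0.025
```

The null hypothesis is rejected when the p-value falls below the chosen significance level, commonly 0.05.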

Hypothesis Test: Single Group, Continuous Data

$$
Z = \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}
$$
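
A minimal sketch of the single-group test statistic: the difference between the sample mean and the claimed mean, divided by the standard error of the mean (the sample standard deviation over the square root of the sample size). The function name and the data are hypothetical, and a sample of at least 30 observations is assumed.

```python
import math
import statistics


def one_sample_z(sample, mu):
    """Z statistic for the claim that the population mean equals mu."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation
    return (x_bar - mu) / (s / math.sqrt(n))


sample = list(range(1, 41))  # 40 made-up observations with mean 20.5
print(one_sample_z(sample, mu=18))
```

A Z value far from zero in either direction indicates that the sample mean is hard to reconcile with the claimed mean.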

Hypothesis Test: Single Group, Categorical Data

Hypothesis Test: Two Groups, Continuous Data

If the group sizes are less than 30, the t-distribution is used in place of the normal distribution.

Hypothesis Test: Two Groups, Categorical Data

Paired Test

Errors

5.3.4 Chi-Square

The chi-square test is a hypothesis test to use with variables measured on a nominal or ordinal scale. It allows an analysis of whether there is a relationship between two categorical variables. As with other hypothesis tests, it is necessary to state a null and alternative hypothesis.
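
As an illustration, the chi-square statistic for a contingency table compares each observed count with the count expected if the two variables were unrelated (row total times column total, divided by the grand total). The sketch below uses the standard formula, which these notes do not spell out. The function name and the example counts are made up.

```python
def chi_square(table):
    """Return (statistic, degrees_of_freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected under the null hypothesis of no relationship.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom


observed = [[10, 20],
            [20, 10]]
print(chi_square(observed))
```

The statistic is then compared with the critical value of the chi-square distribution for the given degrees of freedom. A statistic of zero means the observed counts match the expected counts exactly.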

5.3.5 One-Way Analysis of Variance

[^1]: Glenn J. Myatt. “Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining”.

The Eye of Data (@Buttonwood)
