Course Notes: Quantitative Data

By Riki Akbar - September 07, 2020

A data distribution (commonly presented as a histogram) can be analyzed in terms of these four components:

SHAPE: Bell-shaped, Left-skewed, Right-skewed, etc.
CENTER: Mean and Median
SPREAD: Minimum and Maximum values, Quartiles, Interquartile Range
OUTLIERS: Check whether the data contains outliers. See reference 2 for how to determine outliers.

For different shapes of data distribution, the mean and the median typically relate as follows:

Bell-shaped: Mean = Median
Left-skewed: Mean < Median
Right-skewed: Mean > Median
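
As a rough illustration of the relationships above, the sketch below compares the mean and the median of three small samples. The samples are hypothetical and were picked only to show each shape; real data will not always follow the pattern this cleanly.

```python
from statistics import mean, median

# Hypothetical samples, chosen only to illustrate each shape.
symmetric = [1, 2, 3, 4, 5, 6, 7]
right_skewed = [1, 2, 2, 3, 3, 4, 20]   # one large value stretches the right tail
left_skewed = [-20, 4, 5, 5, 6, 6, 7]   # one small value stretches the left tail

for name, data in [
    ("symmetric", symmetric),
    ("right-skewed", right_skewed),
    ("left-skewed", left_skewed),
]:
    # The mean is pulled toward the long tail; the median barely moves.
    print(f"{name}: mean = {mean(data):.2f}, median = {median(data)}")
```

The extreme value in each skewed sample drags the mean toward the tail, while the median stays near the bulk of the data.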

The Five Numbers (also called the five-number summary) are five statistical measures that summarize a dataset so that its center and spread can be identified. The Five Numbers are:

Min/Minimum
Q1/First Quartile
Median, also known as Q2/Second Quartile
Q3/Third Quartile
Max/Maximum

Occasionally, the Five Numbers are complemented by the Standard Deviation and the Interquartile Range (Q3 - Q1) to give a more comprehensive picture of the dataset.
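
A minimal sketch of computing the Five Numbers and the interquartile range with Python's standard library follows. The function name `five_number_summary` and the `scores` data are hypothetical. Quartiles can be computed under several conventions, so other tools may return slightly different Q1 and Q3 values for the same data; this sketch assumes the "inclusive" method of `statistics.quantiles`.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a sequence of at least two numbers."""
    # n=4 splits the data into quartiles; "inclusive" treats the data
    # as the whole population, so min and max are the 0% and 100% points.
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)

scores = [2, 4, 4, 5, 7, 8, 9, 10, 12]  # hypothetical data
minimum, q1, med, q3, maximum = five_number_summary(scores)
iqr = q3 - q1  # interquartile range: spread of the middle 50% of the data

print("min:", minimum, "Q1:", q1, "median:", med, "Q3:", q3, "max:", maximum)
print("IQR:", iqr)
```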

Standard Deviation, another measure in quantitative data analysis, describes how far the data values typically lie from the mean of the data. It is computed as the square root of the average squared distance of each value from the mean.
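
To make the computation concrete, here is a small sketch that works out the standard deviation step by step and compares it with `statistics.pstdev`. The data values are hypothetical. It assumes the population form (dividing by n); the sample form, `statistics.stdev`, divides by n - 1 instead and gives a slightly larger value.

```python
from math import isclose, sqrt
from statistics import mean, pstdev

data = [5, 6, 7, 8, 9]  # hypothetical data

mu = mean(data)                                   # center of the data
squared_distances = [(x - mu) ** 2 for x in data]  # squared distance of each value from the mean
variance = sum(squared_distances) / len(data)     # average squared distance (population form)
sd_manual = sqrt(variance)                        # back to the original units

# The library function should agree with the manual computation.
print(sd_manual, pstdev(data))
print(isclose(sd_manual, pstdev(data)))
```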

Empirical Rule (for bell-shaped/normal distributions), also known as the "68-95-99.7" Rule:

Around 68% of the data falls within one standard deviation of the mean. For example, if the mean = 7 and the standard deviation = 1.7, then around 68% of the data falls in the range from 5.3 to 8.7.
Around 95% of the data falls within two standard deviations of the mean. With the same mean and standard deviation, around 95% of the data falls in the range from 3.6 to 10.4.
Around 99.7% of the data falls within three standard deviations of the mean. With the same mean and standard deviation, around 99.7% of the data falls in the range from 1.9 to 12.1.
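
The ranges in the example above can be reproduced with a short sketch. It assumes the data follows an exact normal distribution with mean 7 and standard deviation 1.7, and uses `statistics.NormalDist` (Python 3.8+) to compute the share of data in each range. The function name `empirical_ranges` is hypothetical.

```python
from statistics import NormalDist

def empirical_ranges(mu, sigma):
    """Return (k, low, high, coverage) for k = 1, 2, 3 standard deviations."""
    dist = NormalDist(mu=mu, sigma=sigma)
    rows = []
    for k in (1, 2, 3):
        low, high = mu - k * sigma, mu + k * sigma
        # Share of a normal distribution that lies between low and high.
        coverage = dist.cdf(high) - dist.cdf(low)
        rows.append((k, low, high, coverage))
    return rows

rows = empirical_ranges(7, 1.7)
for k, low, high, coverage in rows:
    print(f"within {k} SD: {low:.1f} to {high:.1f}, about {coverage:.1%} of the data")
```

Real datasets are only approximately bell-shaped, so the observed percentages will usually differ somewhat from 68, 95, and 99.7.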

Standard Score

The standard score defines the deviation of a data value from the data mean, relative to the standard deviation value.
The standard score can be computed as:

(Observed value - Mean) / Standard Deviation

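For example, in a dataset with mean = 7 and standard deviation = 1.7, the standard score of a data element with a value of 10 is:

(10 - 7) / 1.7 = 1.76

The value lies about 1.76 standard deviations above the mean. A common rule of thumb treats values with a standard score beyond about 2 in either direction as 'unusual', so a value of 10 is fairly high but still inside the 95% range (3.6 to 10.4).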
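
The formula translates directly into a few lines of Python. This is only a sketch: the function name `standard_score` is hypothetical, and the mean and standard deviation are passed in rather than estimated from data.

```python
def standard_score(value, mean, sd):
    """Return how many standard deviations `value` lies from `mean`."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / sd

z = standard_score(10, 7, 1.7)
print(round(z, 2))  # 1.76
```

A positive score means the value is above the mean, and a negative score means it is below.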

References:

1. Coursera.com, course: "Understanding and Visualizing Data with Python" offered by the University of Michigan.

2. Thoughtco.com, article: "How Are Outliers Determined in Statistics"

PS:

Corrections to these notes are welcome. Feel free to leave a comment. Cheers.
