Descriptive Statistics


Descriptive statistics are the numerical techniques used to summarize and describe the key features of a dataset. These techniques help us understand the main characteristics of the data, including central tendency, dispersion, and shape.

Key measures of descriptive statistics include:

  1. Measures of Central Tendency: What is the central or typical value?
    • Mean: The average value of the dataset.
    • Median: The middle value when the data is arranged in ascending order.
    • Mode: The most frequently occurring value.
  2. Measures of Dispersion (Spread): How spread out is the data?
    • Range: The difference between the maximum and minimum values.
    • Variance: The average of the squared differences from the mean.
    • Standard Deviation: The square root of the variance, indicating the typical deviation from the mean in the same units as the data.
  3. Measures of Shape:
    • Skewness: A measure of the asymmetry of the distribution.
    • Kurtosis: A measure of how heavy the tails of the distribution are, often loosely described as its peakedness or flatness.
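
As a rough illustration of the dispersion measures listed above, here is a minimal Python sketch using the standard library's `statistics` module. It assumes the population definitions (dividing by n), which match the description of variance as the average of the squared differences from the mean. The sample versions, `statistics.variance` and `statistics.stdev`, divide by n − 1 instead. The `scores` list is only illustrative data.

```python
import statistics

scores = [5, 9, 7, 8, 1]  # illustrative data

# Range: difference between the maximum and minimum values (9 - 1 = 8).
data_range = max(scores) - min(scores)

# Population variance: average of the squared differences from the mean.
variance = statistics.pvariance(scores)

# Population standard deviation: square root of the variance.
std_dev = statistics.pstdev(scores)

print("range:", data_range)
print("variance:", variance)
print("standard deviation:", std_dev)
```

For this data the mean is 6, the squared differences are 1, 9, 1, 4, and 25, so the variance is 40 / 5 = 8 and the standard deviation is the square root of 8, roughly 2.83.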

Measures of Central Tendency: Mean, Median, Mode

Mean: Commonly called the average, the mean is a measure of central tendency that represents the typical value of a dataset. It is calculated by summing all the values in the dataset and dividing the total by the number of observations.

Mathematical Formula: The formula for calculating the mean (μ) of a dataset with n observations (x₁, x₂, …, xₙ) is:

μ = (x₁ + x₂ + … + xₙ) / n

Example: Let’s consider a simple dataset representing the scores of students in a class test:

Scores: 5, 9, 7, 8, 1

To find the mean score: μ = (5+9+7+8+1) / 5 = 30 / 5 = 6

So, the mean score of the class test is 6.
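
The same calculation can be sketched in a few lines of Python. This is just one straightforward way to write it; the variable names are illustrative.

```python
scores = [5, 9, 7, 8, 1]

# Sum of observations divided by the number of observations.
mean = sum(scores) / len(scores)

print(mean)  # 6.0
```

The standard library's `statistics.mean` computes the same quantity.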

Median : The median is another measure of central tendency that represents the middle value of a dataset when arranged in ascending order. If there is an odd number of observations, the median is simply the middle value. If there is an even number of observations, the median is the average of the two middle values.

Mathematical Formula: To calculate the median of a dataset with n observations (x₁, x₂, …, xₙ), follow these steps:

  1. Arrange the data in ascending order.
  2. If n is odd, the median is the value at position (n + 1) / 2.
  3. If n is even, the median is the average of the values at positions n / 2 and (n / 2) + 1.

Example: Consider a dataset representing the ages of individuals in a group:

Ages: 5, 9, 8, 8, 1, 12

Step 1: Arrange the data in ascending order: 1, 5, 8, 8, 9, 12

Step 2: Since there are 6 observations (an even number), the median is the average of the values at positions 3 and 4, which are 8 and 8: Median = (8 + 8) / 2 = 8

So, the median age of the group is 8.
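
The steps above can be sketched as a small Python function. The function name `median` is chosen here for illustration. The positions in the steps are 1-based, while Python lists are 0-based, so position (n + 1) / 2 becomes index n // 2.

```python
def median(values):
    """Return the median by sorting and picking the middle value(s)."""
    if not values:
        raise ValueError("median of an empty dataset is undefined")
    ordered = sorted(values)          # Step 1: ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position (n + 1) / 2 is 0-based index n // 2.
        return ordered[mid]
    # Even n: average the values at 1-based positions n / 2 and n / 2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2


ages = [5, 9, 8, 8, 1, 12]
print(median(ages))  # 8.0
```

In practice, `statistics.median` from the standard library does the same job.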

Mode: The mode is the value that appears most frequently in a dataset. To find it, count how often each value occurs and pick the one with the highest count. It is the simplest measure of central tendency and can be useful for identifying the most common observation or category within the data.

Mathematical Formula: There is no specific mathematical formula for finding the mode, as it is simply the value with the highest frequency in the dataset.

Example: Let’s consider a dataset representing the number of siblings of students in a class:

Number of Siblings: 0, 1, 2, 1, 3, 2, 0, 2, 1, 2

To find the mode:

  • Count the frequency of each value:
    • 0 appears twice
    • 1 appears three times
    • 2 appears four times
    • 3 appears once
  • Since the value 2 has the highest frequency (four times), it is the mode of the dataset.

So, in this example, the mode is 2, indicating that the most common number of siblings is 2.
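
The frequency count above can be reproduced with `collections.Counter` from the Python standard library. This is a minimal sketch with illustrative variable names.

```python
from collections import Counter

siblings = [0, 1, 2, 1, 3, 2, 0, 2, 1, 2]

# Count the frequency of each value.
counts = Counter(siblings)
print(counts[0], counts[1], counts[2], counts[3])  # 2 3 4 1

# most_common(1) returns a list with the single most frequent (value, count) pair.
mode, frequency = counts.most_common(1)[0]
print(mode, frequency)  # 2 4
```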

A dataset can be termed “bimodal” if it has two modes, meaning two values are tied for the highest frequency. On the other hand, if no value appears more frequently than any other, the dataset is considered to have “no mode.”

Example of a Bimodal Dataset: Consider the following dataset representing the scores of students in a class test:

Scores: 85, 90, 75, 88, 92, 75, 80, 82, 88

In this dataset, both 75 and 88 occur twice, making the dataset bimodal.

Example of a Dataset with No Mode: Now, let’s look at a dataset where no value appears more than once:

Numbers: 10, 20, 15, 25, 18, 22

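Example of a Dataset with Two Values That Each Occur Three Times:

Numbers: 10, 20, 15, 10, 20, 25, 18, 22, 10, 20

Here both 10 and 20 occur three times, more often than any other value, so the dataset is bimodal, not a dataset with no mode.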
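
To check datasets like these, here is a hedged sketch of a helper that returns every value tied for the highest frequency. The name `find_modes` is hypothetical, and the function follows this post's convention that a dataset in which no value repeats has no mode. The standard library's `statistics.multimode` would instead return every value in that case.

```python
from collections import Counter


def find_modes(values):
    """Return all values tied for the highest frequency, or [] if no value repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # Convention assumed here: no value repeats, so there is no mode.
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(find_modes([85, 90, 75, 88, 92, 75, 80, 82, 88]))       # [75, 88]
print(find_modes([10, 20, 15, 25, 18, 22]))                   # []
print(find_modes([10, 20, 15, 10, 20, 25, 18, 22, 10, 20]))   # [10, 20]
```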

Differences between Mean, Median, and Mode

  1. The mean is sensitive to outliers: a single very large value can pull it far above the typical value. When the dataset is skewed or contains large outliers, the median is usually a better measure of the center.
  2. The median is less sensitive to outliers.
  3. The mode is used for categorical data, where values fall into categories rather than on a numerical scale.
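
The effect of an outlier on the mean and median can be seen in a short Python sketch. The salary figures below are made up purely for illustration.

```python
import statistics

salaries = [30, 32, 35, 38, 40]      # made-up values, in thousands
with_outlier = salaries + [1000]     # add one very large outlier

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print("without outlier:", mean_before, median_before)
print("with outlier:   ", mean_after, median_after)
```

Without the outlier both the mean and the median are 35. With it, the mean jumps to about 195.8 while the median only moves to 36.5.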

Skewness is a statistical measure of the asymmetry of a dataset’s distribution. Imagine a histogram of the data; if the distribution has a longer tail, or outliers predominantly, on one side, it is said to be skewed.

Positive skewness, also known as right skewness, occurs when the outliers are predominantly on the higher end of the scale. In such cases, the mode is typically less than the median, which is less than the mean, indicating a distribution where the mean is inflated by the presence of high outliers.

Positive skewness: mode < median < mean. The mean is larger than the median.

Conversely, negative skewness, or left skewness, arises when outliers are mainly on the lower end of the scale. Here, the mean is typically less than the median, which is less than the mode, reflecting a distribution where the mean is deflated by the presence of low outliers.

Negative skewness: mean < median < mode. The mean is smaller than the median.

Mathematically, skewness (denoted by γ) is interpreted by its sign:

  • When γ > 0, the distribution is positively skewed.
  • When γ < 0, the distribution is negatively skewed.
  • When γ = 0, the distribution has no skew; a perfectly symmetrical dataset has γ = 0.
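
The post does not give a formula for γ, so the sketch below assumes one common definition: the population moment coefficient of skewness, which is the average cubed deviation from the mean divided by the cube of the population standard deviation. The function name `skewness` and the sample datasets are illustrative only.

```python
def skewness(values):
    """Population moment coefficient of skewness (an assumed definition)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    if variance == 0:
        raise ValueError("skewness is undefined when all values are equal")
    third_moment = sum((x - mean) ** 3 for x in values) / n
    # Dividing by variance ** 1.5 is the same as dividing by the standard deviation cubed.
    return third_moment / variance ** 1.5


right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]    # one high outlier
left_skewed = [-x for x in right_skewed]    # mirror image: one low outlier
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed) > 0)  # True
print(skewness(left_skewed) < 0)   # True
print(skewness(symmetric))         # 0.0
```

Other definitions, such as the bias-adjusted sample skewness, give slightly different values but the same sign for data like this.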
