Getting a Quick Statistical Summary
tags: #python/data_science/eda
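To get a quick and simple description of the data, we can use the describe() method. For numeric columns, the output includes:
- Count (number of non-null values)
- Mean
- Standard deviation (std)
- Min/Max value
- 25th, 50th, and 75th percentiles (the 50% row is the median)
Mode and range are not part of the output. The range can be derived as max minus min, and the mode can be computed separately with mode().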
By default, the summary covers only columns with a numerical datatype (e.g., int, float). It gives a high-level overview of each column's distribution and hints at potential outliers, for example when the max lies far above the 75% value.
df.describe()
       numerical_column_1  numerical_column_2  numerical_column_3
count         1000.000000         1000.000000         1000.000000
mean            50.345000           25.678000            3.456000
std              7.546943            5.123456            1.234567
min             35.000000           18.000000            1.000000
25%             45.000000           22.000000            2.500000
50%             50.000000           25.000000            3.000000
75%             55.000000           29.000000            4.000000
max             70.000000           40.000000            6.000000
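As a minimal sketch of how to read this output, the snippet below builds a small, made-up DataFrame (the column names and values are hypothetical, not from the dataset above) and pulls individual statistics out of the summary. It also shows that the 50% row equals the median, and how mode and range could be computed separately.

```python
import pandas as pd

# Hypothetical toy data; names and values are for illustration only.
df = pd.DataFrame({
    "age": [22, 35, 35, 41, 58],
    "income": [30.0, 42.5, 42.5, 55.0, 120.0],
})

# describe() returns a DataFrame indexed by statistic name.
summary = df.describe()
print(summary)

# Individual statistics can be looked up by row label and column name.
median_from_describe = summary.loc["50%", "age"]

# Mode and range are computed separately from describe().
age_mode = df["age"].mode().iloc[0]            # most frequent value
age_range = df["age"].max() - df["age"].min()  # max minus min
```

Because the result is itself a DataFrame, it can be sliced, transposed with summary.T, or exported like any other table.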
include Parameter
In the pandas describe() method, the include parameter specifies which data types are included in the summary statistics. Setting include='all' makes the summary cover every column, regardless of data type (numeric or object).
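For object columns, you get count, unique (number of distinct values), top (most frequently occurring value), and freq (frequency of the top value). With include='all', statistics that do not apply to a column's type appear as NaN.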
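The following sketch illustrates the include parameter on a small, made-up DataFrame with one object column and one numeric column. The column names and values are hypothetical, and the text column is explicitly created with dtype object so the example behaves the same regardless of how a given pandas version infers string columns.

```python
import pandas as pd

# Hypothetical toy data: one object (text) column and one numeric column.
df = pd.DataFrame({
    "city": pd.Series(["Paris", "Lima", "Paris", "Oslo"], dtype="object"),
    "temp_c": [21.5, 18.0, 23.0, 12.5],
})

# Default: only numeric columns are summarized.
numeric_only = df.describe()

# include="all": every column is summarized; statistics that do not
# apply to a column's type are filled with NaN.
everything = df.describe(include="all")

# A list of dtypes restricts the summary to those types.
objects_only = df.describe(include=["object"])

print(everything)
```

Here the city column reports count, unique, top, and freq, while its mean and percentile rows are NaN. The temp_c column shows the opposite pattern.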