Data is now often described as the world's most valuable resource, ahead of oil. Visualization is an increasingly important tool for making sense of billions of rows of data. By translating data into a graphical representation that is easy to interpret, data visualization supports data storytelling by highlighting relevant information, patterns, and outliers. However, the data and the graphics must work together: good visualization is the art of combining great analysis with great storytelling. In this blog post, we'll introduce Seaborn, one of the most well-known visualization libraries written in Python.
Visualization tools
We use visualization tools to reveal trends, patterns, outliers, and the relationships between variables. It’s a highly in-demand skill, especially for a data science career.
Seaborn
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
Seaborn plotting functions:
This post covers 3 categories of seaborn plots:
- Categorical plots.
- Distributional plots.
- Relational plots.
Categorical plots
We use the categorical plotting functions of seaborn to visualize the tendencies of a categorical variable, or to visualize the relationship between two variables, at least one of which is categorical.
Count plot:
- Shows the number of observations in each category of a categorical variable.
seaborn.catplot(kind = 'count',
data = dataset,
x = 'variable')
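To make the call above concrete, here is a minimal runnable sketch. The toy `dataset` and its `species` column are made up for illustration, and the `Agg` backend is only there so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: no window is opened
import pandas as pd
import seaborn as sns

# Hypothetical toy data with a single categorical column
dataset = pd.DataFrame({"species": ["cat", "dog", "cat", "bird", "dog", "cat"]})

# These are the numbers a count plot draws: observations per category
counts = dataset["species"].value_counts()
print(counts)

# One bar per category, bar height = number of observations
g = sns.catplot(kind="count", data=dataset, x="species")
```

Comparing the printed counts with the bar heights is a quick way to check that the plot shows what you expect.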
Bar plot:
- Represents an estimate of the central tendency of a continuous variable as the height of a rectangle, one for each category of a categorical variable. The plotting function therefore takes two variables as input, one continuous and one categorical. For each category of variable_1, we calculate the tendency of variable_2.
- The tendency can be the mean, the variance, or any custom function you pass as the estimator.
seaborn.catplot(kind = 'bar',
data = dataset,
x = 'variable_1',
y = 'variable_2',
estimator = np.mean)
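The sketch below shows what the estimator does, using made-up `day` and `sales` columns. The `groupby` lines compute by hand the values that the bar heights represent, and the plot call swaps in `np.median` to show that the estimator is interchangeable.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: no window is opened
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data: one categorical and one continuous column
dataset = pd.DataFrame({
    "day": ["Mon", "Mon", "Tue", "Tue", "Tue"],
    "sales": [10.0, 20.0, 30.0, 40.0, 80.0],
})

# The estimator is applied to "sales" separately within each "day"
means = dataset.groupby("day")["sales"].mean()
medians = dataset.groupby("day")["sales"].median()
print(means)
print(medians)

# Same plot as above, but the bar height is the median instead of the mean
g = sns.catplot(kind="bar", data=dataset, x="day", y="sales", estimator=np.median)
```

For skewed data such as the Tuesday values here, the mean and the median differ noticeably, so the choice of estimator changes the picture.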
Strip plot:
- Strip plot is one of the simplest and most straightforward plots in data visualization: we simply draw points that represent the values of a continuous variable. For each category of variable_1, we draw the values of variable_2.
seaborn.catplot(kind = 'strip',
data = dataset,
x = 'variable_1',
y = 'variable_2',
jitter = 0.15)
Swarm plot:
- Swarm plot is very similar to the strip plot and serves the same purpose. The only difference is how it displays the points: in a strip plot, data points may overlap because they are randomly jittered along the categorical axis, whereas a swarm plot adjusts the points along that axis so that they do not overlap.
- The drawback is that it does not scale well: with a lot of data points there may not be enough room to place them all without overlap, and seaborn then warns that some points cannot be placed.
seaborn.catplot(kind = 'swarm',
data = dataset,
x = 'variable_1',
y = 'variable_2')
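To compare the two side by side, the sketch below uses the axes-level functions `seaborn.stripplot` and `seaborn.swarmplot`, which draw onto an existing matplotlib axis. The data and the column names `group` and `value` are invented for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: no window is opened
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data: rounded values so that many points coincide
rng = np.random.default_rng(0)
dataset = pd.DataFrame({
    "group": np.repeat(["a", "b"], 30),
    "value": rng.normal(size=60).round(1),
})

fig, (ax_strip, ax_swarm) = plt.subplots(1, 2, sharey=True, figsize=(8, 4))

# Strip plot: random horizontal jitter, points may still overlap
sns.stripplot(data=dataset, x="group", y="value", jitter=0.15, ax=ax_strip)
ax_strip.set_title("strip")

# Swarm plot: points are shifted sideways so that they do not overlap
sns.swarmplot(data=dataset, x="group", y="value", ax=ax_swarm)
ax_swarm.set_title("swarm")
```

Both panels show exactly the same values; only the horizontal placement of the points differs.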
Box plot:
- Box plot is used to represent the distribution of a continuous variable for each category of a categorical variable. Even though it’s pretty simple, it yields a lot of information:
- The values of the quartiles:
The horizontal line inside the box represents the median. The top edge of the box is the upper quartile, and the bottom edge is the lower quartile.
- The outliers:
The individual points drawn beyond the whiskers represent the outliers.
seaborn.catplot(kind = 'box',
data = dataset,
x = 'variable_1',
y = 'variable_2')
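The numbers behind a box plot can be computed by hand. This sketch assumes the common rule that points farther than 1.5 times the interquartile range from the box are drawn as outliers, which is the default whisker setting in seaborn and matplotlib. The data is invented.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: no window is opened
import pandas as pd
import seaborn as sns

# Hypothetical toy data: one category, one obvious outlier (50)
dataset = pd.DataFrame({
    "group": ["a"] * 10,
    "value": [1, 2, 3, 4, 5, 6, 7, 8, 9, 50],
})

# Box edges and the line inside the box
q1, median, q3 = dataset["value"].quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (dataset["value"] < lower_fence) | (dataset["value"] > upper_fence)
outliers = dataset.loc[is_outlier, "value"].tolist()
print(q1, median, q3, outliers)

g = sns.catplot(kind="box", data=dataset, x="group", y="value")
```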
Violin plot:
Instead of drawing a box, the violin plot draws the estimated density of the continuous variable for each category of the categorical variable, using KDE (Kernel Density Estimation).
seaborn.catplot(kind = 'violin',
data = dataset,
x = 'variable_1',
y = 'variable_2')
Distribution plots:
We use the distribution plotting functions of seaborn to visualize the distribution of continuous variables.
Hist plot:
The hist plot represents the distribution of continuous variables using bins.
seaborn.displot(kind = 'hist',
data = dataset,
x = 'variable',
bins = 20)
KDE plot:
KDE plot represents a smooth estimate of the distribution of the data, using Kernel Density Estimation.
seaborn.displot(kind = 'kde',
data = dataset,
x = 'variable')
It can also be used to represent the bivariate distribution of two continuous variables.
seaborn.displot(kind = 'kde',
data = dataset,
x = 'variable_1',
y = 'variable_2')
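To give an idea of what Kernel Density Estimation computes in the univariate case, here is a minimal NumPy sketch: one Gaussian bump is centered on each observation and the bumps are averaged. The bandwidth uses Scott's rule as an assumption for this sketch, so seaborn's curve for the same data may differ slightly.

```python
import numpy as np

# Hypothetical toy data: 200 draws from a standard normal distribution
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=200)

# Bandwidth from Scott's rule (an assumption for this sketch)
h = x.std(ddof=1) * len(x) ** (-1 / 5)

# Evaluate the density on a grid that extends well past the data
grid = np.linspace(x.min() - 5 * h, x.max() + 5 * h, 1000)

# Average of Gaussian kernels centered on each observation
z = (grid[:, None] - x[None, :]) / h
density = np.exp(-0.5 * z**2).sum(axis=1) / (len(x) * h * np.sqrt(2 * np.pi))

# A density should integrate to approximately 1
dx = grid[1] - grid[0]
area = density.sum() * dx
print(area)
```

A smaller bandwidth makes the curve follow individual points more closely, while a larger one smooths them out.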
ECDF plot:
ECDF plot represents the empirical cumulative distribution of a continuous variable.
seaborn.displot(kind = 'ecdf',
data = dataset,
x = 'variable')
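The empirical cumulative distribution at a point is simply the proportion of observations less than or equal to that point. The helper `ecdf_at` below is a hypothetical name introduced for this sketch, not a seaborn function.

```python
import numpy as np

def ecdf_at(values, x):
    """Proportion of observations in `values` that are <= x."""
    values = np.asarray(values)
    return float(np.mean(values <= x))

# Hypothetical toy data
values = [3.0, 1.0, 2.0, 2.0]

for point in [0.0, 1.0, 2.0, 3.0]:
    print(point, ecdf_at(values, point))
```

The curve drawn by an ECDF plot is this function evaluated over the whole range of the variable: it starts at 0, steps up at each observed value, and ends at 1.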
Relational plots:
We use the relational plotting functions of seaborn to visualize the relationship between continuous variables.
Scatter plot:
- It shows the relationship between two continuous variables, by simply plotting all the data points.
seaborn.relplot(kind = 'scatter',
data = dataset,
x = 'variable_1',
y = 'variable_2')
Line plot:
- Represents the relationship between variables as a continuous function.
seaborn.relplot(kind = 'line',
data = dataset,
x = 'variable_1',
y = 'variable_2')
More functionalities:
You may have noticed that all the plotting functions so far use a maximum of two variables per plot. What if we want to introduce more variables into our visualization? Fortunately, seaborn takes care of that:
Hue:
- Using hue, we can introduce a 3rd, categorical variable into our visualization through color encoding: the data points that belong to the same category of this 3rd variable have the same color.
seaborn.relplot(kind = 'scatter',
data = dataset,
x = 'variable_1',
y = 'variable_2',
hue = 'variable_3')
Size:
- Size is similar to hue, but uses size encoding instead of color encoding: the data points that belong to the same category of the 3rd variable have the same size, and different sizes mean different categories.
seaborn.relplot(kind = 'scatter',
data = dataset,
x = 'variable_1',
y = 'variable_2',
size = 'variable_3',
sizes = [50, 100])
Style:
- Style works in much the same way as hue and size: the data points that belong to the same category of the 3rd variable share the same marker style. A marker can be a dot, star, cross, triangle, …
seaborn.relplot(kind = 'scatter',
data = dataset,
x = 'variable_1',
y = 'variable_2',
style = 'variable_3',
markers = ['X', '*'])
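When `sizes` and `markers` are given as lists, the values are assigned to the categories in order. As a sketch of a more explicit alternative, both parameters also accept a dictionary keyed by category. The toy data and the names `size_map` and `marker_map` are made up for this example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: no window is opened
import pandas as pd
import seaborn as sns

# Hypothetical toy data with a two-category 3rd variable
dataset = pd.DataFrame({
    "variable_1": [1, 2, 3, 4],
    "variable_2": [2, 1, 4, 3],
    "variable_3": ["a", "b", "a", "b"],
})

# Explicit category -> size and category -> marker mappings
size_map = {"a": 50, "b": 100}
marker_map = {"a": "X", "b": "*"}

g = sns.relplot(kind="scatter",
                data=dataset,
                x="variable_1",
                y="variable_2",
                size="variable_3", sizes=size_map,
                style="variable_3", markers=marker_map)
```

Encoding the same variable with both size and style, as done here, is redundant but can make categories easier to tell apart.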
We can also introduce a new categorical variable by using multiple plots, where each plot corresponds to one category of the categorical variable:
Col:
Creates one subplot per category of the 3rd variable, arranged horizontally as columns.
seaborn.relplot(kind = 'scatter',
data = dataset,
x = 'variable_1',
y = 'variable_2',
col = 'variable_3')
Row:
Creates one subplot per category of the 3rd variable, arranged vertically as rows.
seaborn.relplot(kind = 'scatter',
data = dataset,
x = 'variable_1',
y = 'variable_2',
row = 'variable_3')
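Col and Row can be combined to lay the subplots out as a grid. The sketch below uses invented `group` (3 categories) and `condition` (2 categories) columns, and inspects the resulting grid through the `axes` attribute of the object that `relplot` returns.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: no window is opened
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data: every group/condition combination is present
rng = np.random.default_rng(0)
dataset = pd.DataFrame({
    "x": rng.normal(size=60),
    "y": rng.normal(size=60),
    "group": np.tile(["a", "b", "c"], 20),
    "condition": np.repeat(["control", "treated"], 30),
})

g = sns.relplot(kind="scatter",
                data=dataset,
                x="x",
                y="y",
                col="group",
                row="condition",
                height=2)

# One subplot per (row category, column category) pair
print(g.axes.shape)
```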
We can also use Hue and Size in the same plot to represent 4 variables, or Hue, Style and Col together to represent 5 variables. We can use up to 7 variables (variable 1, variable 2, Hue, Size, Style, Col, Row) in the same plot, but the result is usually so cluttered that it is extremely hard to interpret, and sometimes not informative at all.
seaborn.relplot(kind = 'scatter',
data = dataset,
x = 'variable_1',
y = 'variable_2',
hue = 'variable_3',
size = 'variable_4')
Conclusion:
In this post, we learned about seaborn and 3 categories of its plotting functions: Categorical, Distribution and Relational plots. We explained each plotting function in each category, along with the Python code.
Check out our Data Science Bootcamp to learn more about this topic!