In this tutorial I will show you how to create a histogram in R using ggplot2.
It explains the ggplot histogram syntax and walks through step-by-step examples of creating a histogram in ggplot2.
If you need something specific, click on any of the links below.
Contents:
- Introduction to histograms
- Syntax
- Examples
As always, you will learn more if you thoroughly read the blog post from cover to cover.
A brief introduction to histograms
Let's take a quick look at what histograms are and how they are structured.
If you just want to understand the syntax or see examples, you can skip ahead to the syntax section or the examples section.
Histograms show data distributions
Histograms are very important for data visualization, data mining and data analysis.
In fact, the histogram is probably one of the 3 or 4 most important visualization techniques.
They are important because they help us visualize and study the distribution of data.
In particular, a histogram shows us the number of records that fall within each range of a variable.
The structure of the histogram
Their structure is as follows.
Typically, we map a numeric variable to the x-axis. This is the variable we want to visualize so we can see how it is distributed.
This numeric variable is then divided into ranges, often called "bins".
From there, we count the number of records in each bin and plot those counts as bars. Each bin therefore has a bar associated with it, and the length of each bar represents the number of records in that bin.
When we plot all those bars together (again, one for each bin), we get a histogram. Taken together, the bars in a histogram show us the shape of the data. They help us understand how the data are distributed.
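To make the binning idea concrete, here is a minimal sketch in base R of what happens behind the scenes. The sample values and the bin edges are invented purely for illustration; they are not taken from any dataset used in this tutorial.

```r
# Hypothetical sample values, invented purely for illustration
values <- c(2, 3, 5, 7, 8, 12, 13, 14, 18, 19)

# Divide the range 0-20 into four equal-width bins
breaks <- seq(0, 20, by = 5)
bins <- cut(values, breaks = breaks)

# Count the number of records that fall into each bin;
# these counts would become the bar heights of a histogram
bin_counts <- table(bins)
print(bin_counts)
```

Each entry in bin_counts corresponds to one bar. A plotting function simply draws those counts for you.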
But, of course, we don't do it manually. As data scientists, we use a programming language like R to do all the calculations for us and plot the results.
Let's quickly discuss how to create a histogram in R.
How to make a histogram in R
There are actually several ways to create a histogram in R.
You can create "old fashioned" histograms using base R. Specifically, you can create a histogram in R with the hist() function.
This is the old way of doing things, and I do not recommend it at all.
R's old plotting functions are poorly designed. They are difficult to use. They are difficult to modify. And the graphs they create are relatively ugly.
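For comparison only, here is a rough sketch of the base R approach. It assumes ggplot2 is installed, because I load it just to get the txhousing dataset; the title and axis label are my own choices.

```r
library(ggplot2)  # loaded here only for the txhousing dataset

# Base R histogram of the median variable; hist() also returns
# the computed bins invisibly, so we can capture them
h <- hist(txhousing$median,
          main = "Histogram of median",
          xlab = "median")

# Bin edges and per-bin counts computed by hist()
h$breaks
h$counts
```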
To create a histogram in R, use ggplot2
If you want to create histograms in R, I strongly recommend using ggplot2 instead.
ggplot2 is a powerful plotting library that gives you precise control over the appearance and layout of your plots.
The syntax is easier to modify, and the default plots look quite good.
With that in mind, I'll show you how to create a ggplot histogram.
ggplot histogram syntax
Now let's look at the syntax for creating a histogram with ggplot2.
I will try to explain everything in detail, but if you are new to ggplot2, you can check our ggplot2 tutorial for beginners.
Let me quickly explain this syntax.
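As a rough sketch, the general pattern looks something like the code below. The data frame my_data and the column my_variable are hypothetical names that I made up so the template can actually run, and the argument values are arbitrary.

```r
library(ggplot2)

# Hypothetical data frame and column, used only to make the template runnable
my_data <- data.frame(my_variable = c(1.2, 2.5, 2.7, 3.1, 3.3, 3.8, 4.4, 5.0))

# General pattern: ggplot() with a data frame, aes() with an x mapping,
# then geom_histogram() with optional color, fill, and bins arguments
p <- ggplot(data = my_data, aes(x = my_variable)) +
  geom_histogram(color = "black", fill = "grey70", bins = 5)

print(p)
```

Each piece of this pattern is explained in the sections that follow.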
The ggplot function
The ggplot() function simply initiates plotting with the ggplot2 data visualization system.
It is used whenever a visualization is created using ggplot2. However, the exact details of everything else vary from visual to visual.
The data parameter
Inside the ggplot() function, you will find the data parameter.
The data parameter lets you specify the data frame that contains the variables to be plotted.
Note that ggplot2 is designed to visualize data in data frames, so you must provide the name of a data frame as the argument to this parameter.
For example, if you have a data frame named txhousing, you would set data = txhousing.
The aes function
Also inside the ggplot() function, you will find the aes() function.
The aes() function lets you "map" variables to the aesthetic properties of the plot. This may sound complicated, but it really boils down to connecting the variables in the data frame to the axes and other properties of the chart.
If you want to see how the aes() function works, you should read the explanation of aes() in our ggplot2 tutorial.
The x parameter
Inside the aes() function, you will see the x parameter.
The x parameter lets us specify the numeric variable to map to the x-axis. This is the numeric variable that will be plotted as a histogram.
For example, if your data frame has a variable named median, you would set x = median.
The histogram "geom"
Finally, we have geom_histogram().
This tells ggplot2 that we want to plot a histogram.
Remember: when we use ggplot2, we specify the data frame and the variable mappings with the data parameter, the aes() function, and so on.
But to specify the type of plot, such as a histogram, scatter plot, or bar chart, we have to specify a "geom".
The geom ultimately determines what kind of chart we create.
To create a histogram, we use geom_histogram().
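To illustrate how the geom determines the chart type, here is a small sketch that reuses one base plot with two different geoms. I use geom_density() as the second geom only for contrast; it is not covered in this tutorial.

```r
library(ggplot2)

# Shared starting point: same data frame, same x mapping
base_plot <- ggplot(data = txhousing, aes(x = median))

# Adding a different geom to the same base produces a different chart type
hist_plot    <- base_plot + geom_histogram(na.rm = TRUE)  # histogram
density_plot <- base_plot + geom_density(na.rm = TRUE)    # smoothed density curve

print(hist_plot)
print(density_plot)
```

The data and the mapping stay the same in both plots. Only the geom changes.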
Additional parameters
There are also some optional parameters that can be used to control the exact behavior of the histogram.
Let's look at them one by one.
color
The color parameter controls the border color of the histogram bins.
Be careful.
Many people think that it affects the color of the interior, but this is wrong. It controls the border color. (I'll show examples in the examples section.)
Remember: R has many available colors. You can choose simple colors such as red, green, and blue, but there are many more interesting colors, such as navy. Have fun and find something you like!
Also, if you provide a color name as the argument to this parameter, it must be given as a string. For example, you can set color = "red".
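If you want to browse the named colors, a quick sketch like the following may help. It uses the base R colors() function; the search term "turquoise" is just an example I picked.

```r
# colors() is a base R (grDevices) function that lists built-in color names
all_colors <- colors()

length(all_colors)    # how many named colors are available
head(all_colors, 10)  # peek at the first few names

# Search for color names containing "turquoise"
grep("turquoise", all_colors, value = TRUE)
```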
fill
The fill parameter controls the interior color of the histogram bins.
Again, be careful. The fill parameter controls the interior color, and the color parameter controls the border color.
If you provide a color name as the argument to this parameter, it must be given as a string. For example, you can set fill = "red".
Also remember: R has many available colors. You can choose simple colors such as red, green, and blue, but there are many more interesting colors.
bins
The bins parameter controls the number of bins plotted in the histogram.
By default, it is set to bins = 30.
However, you can increase or decrease the number of bins as needed.
Controlling the number of bins is one way to change how the variable is analyzed. In general, reducing the number of bins smooths out variation in the data. Increasing the number of bins reveals more detail.
Which you choose (more detail or more "smoothness") depends on what you're looking for!
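One way to see what the bins parameter actually does is to inspect the values ggplot2 computes for each bin. The sketch below assumes the txhousing data that ships with ggplot2 and uses layer_data() to pull out the computed bins; the values 10 and 60 are arbitrary choices for comparison.

```r
library(ggplot2)

# Build the same histogram with a coarse and a fine binning;
# na.rm = TRUE silently drops missing values
coarse <- ggplot(data = txhousing, aes(x = median)) +
  geom_histogram(bins = 10, na.rm = TRUE)
fine <- ggplot(data = txhousing, aes(x = median)) +
  geom_histogram(bins = 60, na.rm = TRUE)

# layer_data() returns the values ggplot2 computed for a layer,
# here one row per bin with its count
coarse_bins <- layer_data(coarse)
fine_bins   <- layer_data(fine)

nrow(coarse_bins)  # number of bars in the coarse histogram
nrow(fine_bins)    # number of bars in the fine histogram
```

Both versions count the same records. They only differ in how finely those records are grouped.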
Examples: plotting a histogram in R using ggplot2
OK. Now that we understand the syntax, let's look at some examples of creating a histogram in R using ggplot2.
Examples:
- Create a simple ggplot histogram
- Change the border color
- Change the fill color of the bins
- Change the number of histogram bins
Run this code first
Before we start, let's load the tidyverse package. Remember that the tidyverse package includes ggplot2.
We will also inspect txhousing, which is the dataset we will use.
Load tidyverse
You can load the tidyverse package with the following code:
#----------------
# Load package
#----------------
library(tidyverse)
Inspect the data
Next, let's take a quick look at our dataset.
In the examples below we will use txhousing, a dataset containing housing data for various cities and years in Texas.
We can inspect this data frame with the glimpse() function:
txhousing %>% glimpse()
Output:
# Observations: 8602
# Variables: 9
# $ city      "Abilene", "Abilene", "Abilene", "Abilene", "Abilene", "Abilene", "Abilene...
# $ year      2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000,...
# $ month     1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1...
# $ sales     72, 98, 130, 98, 141, 156, 152, 131, 104, 101, 100, 92, 75, 112, 118, 1...
# $ volume    5380000, 6505000, 9285000, 9730000, 10590000, 13910000, 12635000, 10710...
# $ median    71400, 58700, 58100, 68600, 67300, 66900, 73500, 75000, 64500, 59300, 7...
# $ listings  701, 746, 784, 785, 794, 780, 742, 765, 771, 764, 721, 658, 779, 700, 7...
# $ inventory 6.3, 6.6, 6.8, 6.9, 6.8, 6.6, 6.2, 6.4, 6.5, 6.6, 6.2, 5.7, 6.8, 6.0, 6...
# $ date      2000.000, 2000.083, 2000.167, 2000.250, 2000.333, 2000.417, 2000.500, 2...
Example 1: Create a simple ggplot histogram
Let's start with a very simple histogram.
Here we will plot a histogram of the median variable.
ggplot(data = txhousing, aes(x = median)) +
  geom_histogram()
Output:
[Figure: histogram of median with default settings]
Explanation
It's very simple, but you need to understand it because it forms the basis for other examples.
Here we initiate plotting by calling ggplot().
Inside the ggplot() function, we set data = txhousing. This indicates that we will plot data from the txhousing data frame.
Next we have the aes() function. It lets us specify which variables map to which axes and to other "aesthetic" attributes of the plot. Here we set x = median, which means we want to plot the median variable on the x-axis.
Finally, on the second line, we see geom_histogram(). This indicates that we want to plot the variable as a histogram.
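When you run this example, you may see a warning about rows being removed. As far as I can tell, this comes from missing values in the median variable. The sketch below checks for them and uses the na.rm argument of geom_histogram() to drop them silently; whether you want to silence the warning is a judgment call.

```r
library(ggplot2)

# Count the missing values in the variable being plotted
sum(is.na(txhousing$median))

# na.rm = TRUE asks geom_histogram() to drop missing values without a warning
p <- ggplot(data = txhousing, aes(x = median)) +
  geom_histogram(na.rm = TRUE)

print(p)
```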
Example 2: Changing the border color
Now that we've created the simple histogram in Example 1, let's make some changes.
Here we will change the border color of the bins.
ggplot(data = txhousing, aes(x = median)) +
  geom_histogram(color = 'turquoise4')
Output:
[Figure: histogram of median with turquoise bin borders]
Explanation
It's quite simple.
The code is almost the same as in Example 1.
The only difference is that we set color = 'turquoise4' inside geom_histogram(). This changed the border color of the bins to turquoise.
Example 3: Changing the fill color of the bins
Next, we will change the color of the bins themselves, that is, the interior of the bars.
We'll use the fill parameter for that.
Let's take a look:
ggplot(data = txhousing, aes(x = median)) +
  geom_histogram(fill = 'red')
Output:
[Figure: histogram of median with red bins]
Explanation
Everything here is almost identical to our simple ggplot histogram in Example 1.
The only significant difference is that we set fill = 'red'. As you can see, this changed the color of the bins to red.
Note that there is no visible border between the bins. This is probably fine, but you can also change the border color. You can use the color parameter for that, as shown in Example 2.
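As a quick sketch of combining the two parameters, the code below sets both fill and color in the same call. The particular color pairing is just one I picked from the earlier examples.

```r
library(ggplot2)

# fill sets the interior of the bars, color sets their borders
p <- ggplot(data = txhousing, aes(x = median)) +
  geom_histogram(fill = "red", color = "turquoise4", na.rm = TRUE)

print(p)
```

With a border color set, the individual bins become easier to tell apart.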
Example 4: Changing the number of histogram bins
Finally, let's modify the number of histogram bins.
By default, ggplot2 produces a histogram with 30 bins. This is usually fine, but sometimes you want to increase or decrease the number of bins.
For that we can use the bins parameter. Here we reduce the number of bins to 10:
ggplot(data = txhousing, aes(x = median)) +
  geom_histogram(bins = 10)
Output:
[Figure: histogram of median with 10 bins]
Explanation
It's very easy.
Here we created a histogram with 10 bins by setting bins = 10.
As you can see, by reducing the number of bins, we smooth out some of the variation in the data.
You can also try increasing the number of bins if needed. Try setting it to 60 or 70 and see what happens.
Remember that choosing the right number of bins is more of an art than a science. It really depends on what your goals are and what you're looking for in the data.
It's a good reminder that knowing the syntax is not enough. You have to know how to use data visualization well!
Leave other questions in the comments below
Have questions about ggplot histograms? Want to know how to do something else that I haven't explained here?
If so, leave your questions in the comments section below.
Sign up to learn more about data analysis in R
This tutorial should give you a good idea of how to create histograms in R using ggplot2.
But there is still much to learn.
If you want to master data visualization in R, you need to learn a lot more about ggplot2.
If you want to learn more about data analysis, you need to know dplyr, tidyr, forcats, etc.
However, if you really want to master data analysis and data visualization in R, I highly recommend signing up to our mailing list. At Sharp Sight, we regularly publish tutorials that explain how to do data analysis using R and Python.