This is an update to an older version of this post from 2015 on the same topic. It covers exactly the same material, but with the latest R packages and a coding style that uses pipes (%>%) and the tidyverse packages.
It was inspired by the disease incidence rates in the United States featured in the Wall Street Journal. The disease incidence dataset was originally used in an article in the New England Journal of Medicine. In this post, I use the level 1 measles incidence dataset (cases per 100,000 population), obtained as a .csv file from Project Tycho. Download the .csv file here.
In this post we'll look at creating a neat, clean and elegant heatmap in R. No clustering, no dendrograms, no trace lines, no bullshit. We will go through basic data cleanup, reformatting and finally plotting, step by step. To see all the code with minimal explanation, scroll to the bottom of the page.
I'm using R version 3.5.2 64-bit on Ubuntu 18.04 64-bit. The packages used are ggplot2 (3.1.0), dplyr (0.7.8), tidyr (0.8.2), and stringr (1.3.1). For the base plots, I use gplots (3.0.1.1) and plotrix (3.7-4). Install the necessary packages (if not already installed) and load them.
    # install packages
    # install.packages(pkgs = c("ggplot2","dplyr","tidyr","stringr","gplots","plotrix"), dependencies = TRUE)

    # load packages
    library(ggplot2) # ggplot() for plotting
    library(dplyr)   # data reformatting
    library(tidyr)   # data reformatting
    library(stringr) # string manipulation
1. Data preparation
Read in the .csv file and inspect the data. The first two lines, which contain non-tabular data, are skipped.
    # read csv file
    m <- read.csv("measles_lev1.csv", header=TRUE, stringsAsFactors=FALSE, skip=2)

    # inspect data
    head(m)
    str(m)
    table(m$YEAR)
    table(m$WEEK)
The head() function shows the header names and the first 6 rows of the data. The str() function shows that the YEAR and WEEK columns are stored as integers and the incidences as characters. Although the incidences are numbers, they were read in as characters because missing values are coded as "-". The table() function is used to check whether any years or weeks are missing. The data is currently in the so-called 'wide' format, which we will convert to 'long' format; the ggplot2 plotting package prefers the long format. The wide format looks like this.
    > head(m)
      YEAR WEEK ALABAMA ALASKA ARIZONA ARKANSAS CALIFORNIA COLORADO
    1 1928    1    3.67      -    1.90     4.11       1.38     8.38
    2 1928    2    6.25      -    6.40     9.91       1.80     6.02
    3 1928    3    7.95      -    4.50    11.15       1.31     2.86
    4 1928    4   12.58      -    1.90    13.75       1.87    13.71
    5 1928    5    8.03      -    0.47    20.79       2.38     5.13
    6 1928    6    7.27      -    6.40    26.58       2.79     8.09
The YEAR and WEEK variables are kept as they are, and all incidence values are collapsed into a key column (state) and a value column. The column names are changed to lowercase for convenience. The year and week variables are converted to factors, and the value variable is converted to numeric.
    m2 <- m %>%
      # convert data to long format
      gather(key="state", value="value", -YEAR, -WEEK) %>%
      # rename columns
      setNames(c("year", "week", "state", "value")) %>%
      # convert year to factor
      mutate(year=factor(year)) %>%
      # convert week to factor
      mutate(week=factor(week)) %>%
      # convert value to numeric (also converts '-' to NA, gives a warning)
      mutate(value=as.numeric(value))
The long format result is shown below.
    > head(m2)
      year week   state value
    1 1928    1 ALABAMA  3.67
    2 1928    2 ALABAMA  6.25
    3 1928    3 ALABAMA  7.95
    4 1928    4 ALABAMA 12.58
    5 1928    5 ALABAMA  8.03
    6 1928    6 ALABAMA  7.27
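As a side note, in tidyr 1.0.0 and later gather() is superseded by pivot_longer(). The sketch below shows the same wide-to-long step on a tiny made-up table; the data frame `wide` and its values are hypothetical stand-ins for the real file, not the actual dataset.

```r
library(tidyr)

# toy wide table (hypothetical values), mimicking the layout of the .csv file
wide <- data.frame(
  YEAR    = c(1928, 1928),
  WEEK    = c(1, 2),
  ALABAMA = c("3.67", "6.25"),
  ALASKA  = c("-", "-"),
  stringsAsFactors = FALSE
)

# collapse every column except YEAR and WEEK into a state/value pair
long <- pivot_longer(wide, cols = -c(YEAR, WEEK),
                     names_to = "state", values_to = "value")

# "-" cannot be parsed as a number and becomes NA (the warning is silenced here)
long$value <- suppressWarnings(as.numeric(long$value))
print(long)
```

Note that pivot_longer() orders the result row by row of the wide table, whereas gather() stacks it column by column, so the row order differs even though the content is the same.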
The state names are now in uppercase, and multi-word states have a period separator. I prefer them in title case and separated by spaces, since they will be displayed on the plot later. A custom function is used to change the states to title case: multi-word states are split at the period separator, each word is converted to title case with str_to_title(), and the words are then pasted back together.
    # remove . and change states to title case using a custom function
    fn_tc <- function(x) paste(str_to_title(unlist(strsplit(x, "[.]"))), collapse=" ")
    m2$state <- sapply(m2$state, fn_tc)
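The same clean-up can also be written without a per-element helper, since both gsub() and str_to_title() are vectorised. This is only an alternative sketch, and the vector `raw_states` is a made-up sample rather than the real column.

```r
library(stringr)

# hypothetical sample of raw state names as they look after reshaping
raw_states <- c("ALABAMA", "NEW.HAMPSHIRE", "DISTRICT.OF.COLUMBIA")

# replace the period separator with a space, then convert to title case
clean_states <- str_to_title(gsub(".", " ", raw_states, fixed = TRUE))
print(clean_states)
```

Unlike sapply(), this returns a plain unnamed character vector, which is usually what you want in a data frame column.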
Now I want to plot a heatmap with years on the x-axis and states on the y-axis, which means we have to handle the week variable somehow. We sum the incidences over all weeks of each year and drop the week variable. The dplyr-friendly way to do this is group_by() followed by the summarise() function.
The way sum() handles NAs is a bit odd. By default it returns NA if one or more elements of the input vector are NA. If we set the argument na.rm=TRUE, the NAs are removed and the remaining numbers are summed. But if all the elements are NA, sum() returns zero. In this case, that is odd and undesirable. So I use a custom sum function called na_sum() that removes NAs and returns the sum of the remaining values, or returns NA if all elements are NA. We then use this custom function inside summarise() to summarise the data by year and state, disregarding the week.
    # custom sum function returns NA when all values in the set are NA;
    # in a set mixed with NAs, the NAs are removed and the rest are summed
    na_sum <- function(x)
    {
      if(all(is.na(x))) val <- sum(x, na.rm=FALSE)
      if(!all(is.na(x))) val <- sum(x, na.rm=TRUE)
      return(val)
    }

    # sum incidences for all weeks into one year
    m3 <- m2 %>%
      group_by(year, state) %>%
      summarise(count=na_sum(value)) %>%
      as.data.frame()
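A quick check of the NA behaviour described above, using base R only. The two test vectors are made up for illustration, and na_sum() is restated so that the snippet runs on its own.

```r
# same logic as the custom function in the post, restated for a standalone run
na_sum <- function(x) {
  if (all(is.na(x))) val <- sum(x, na.rm = FALSE)
  if (!all(is.na(x))) val <- sum(x, na.rm = TRUE)
  return(val)
}

mixed  <- c(1, 2, NA)          # some values missing
all_na <- c(NA_real_, NA_real_) # every value missing

sum(mixed)                  # NA: a single NA poisons the default sum
sum(mixed, na.rm = TRUE)    # 3: NAs dropped, the rest summed
sum(all_na, na.rm = TRUE)   # 0: nothing left to sum, so zero is returned
na_sum(mixed)               # 3
na_sum(all_na)              # NA: "no data" stays distinguishable from zero cases
```

The last line is the point of the exercise: a state with no reports for a whole year should show up as missing on the heatmap, not as zero incidence.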
Now our data looks like this, without the week variable. The values for each state are summed over each year to create a new variable, count.
    > head(m3)
      year      state  count
    1 1928    Alabama 334.99
    2 1928     Alaska   0.00
    3 1928    Arizona 200.75
    4 1928   Arkansas 481.77
    5 1928 California  69.22
    6 1928   Colorado 206.98
With that, the data preparation is essentially done. The data is in 'long' format, and the x, y, and z variables to be plotted are available and of the appropriate types: factor, factor, and numeric. If your data is already in this format, you can go straight to plotting. Depending on the data you start with, however, preparing and transforming it can be complex and tedious.
2. Plotting
I prefer the ggplot2 plotting package for drawing graphs in R because of its consistent code structure. I will mainly focus on ggplot2 code here. However, for the sake of completeness, I will also include heatmap code that uses base graphics.
2.1 ggplot2
Plotting data in ggplot2 is fairly straightforward once the data is in the correct format. The ggplot2 index page lists the code syntax and arguments.
    # basic ggplot
    p <- ggplot(m3, aes(x=year, y=state, fill=count))+
      geom_tile()

    # save plot to working directory
    ggsave(p, filename="measles-basic.png")
The default output of ggplot2 is pretty good, but a few aspects of the basic plot can be tweaked and improved. We add tile borders, custom x-axis breaks, and custom text sizes. The modified ggplot code is shown below.
    # modified ggplot
    p <- ggplot(m3, aes(x=year, y=state, fill=count))+
      # add white border of line thickness 0.25
      # (ggplot2 >= 3.4.0 prefers linewidth= over size= for lines)
      geom_tile(colour="white", size=0.25)+
      # remove x and y axis labels
      labs(x="", y="")+
      # remove extra space
      scale_y_discrete(expand=c(0, 0))+
      # define new breaks on x-axis
      scale_x_discrete(expand=c(0, 0),
                       breaks=c("1930", "1940", "1950", "1960", "1970", "1980", "1990", "2000"))+
      # set a base size for all fonts
      theme_grey(base_size=8)+
      # theme options
      theme(
        # bold font for legend text
        legend.text=element_text(face="bold"),
        # set thickness of axis ticks
        axis.ticks=element_line(size=0.4),
        # remove plot background
        plot.background=element_blank(),
        # remove plot border
        panel.border=element_blank()
      )

    # save at 200 dpi
    ggsave(p, filename="measles-mod1.png", height=5.5, width=8.8, units="in", dpi=200)
I want the y-axis labels (states) to be arranged in ascending order from top to bottom. This means going back to the 'long' format data and refactoring the state variable with its levels reversed. Here the fill variable (count) is continuous, so ggplot uses a blue gradient by default. In this case, as in many others, it may be more meaningful to break the continuous data into levels and represent each level as a discrete colour. The cut() function in R breaks up and labels continuous variables.
The count variable is cut into 7 levels and stored as a new variable, countfactor. NAs remain NAs. The breaks used to define the levels depend on the type of data, the number of bins that makes sense in context, or simply trial and error. Too many bins is not a good idea either. Inspecting your variable with summary(x) or boxplot(x) can reveal a lot about the data.
    m4 <- m3 %>%
      # convert state to factor and reverse the order of levels
      mutate(state=factor(state, levels=rev(sort(unique(state))))) %>%
      # create a new variable from count
      mutate(countfactor=cut(count, breaks=c(-1, 0, 1, 10, 100, 500, 1000, max(count, na.rm=TRUE)),
                             labels=c("0", "0-1", "1-10", "10-100", "100-500", "500-1000", ">1000"))) %>%
      # change the order of levels
      mutate(countfactor=factor(as.character(countfactor), levels=rev(levels(countfactor))))
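To see how cut() assigns values to the bins, here is a small base R sketch on made-up counts. It uses Inf as the top break instead of the maximum of the data; that is an assumption of this sketch, chosen because breaks must be unique, and the max() version fails if the largest count happens to be 1000 or less.

```r
# hypothetical counts, one per bin, plus a missing value
counts <- c(0, 0.5, 5, 50, 250, 750, 2000, NA)

# intervals are right-closed by default: (-1,0], (0,1], (1,10], ...
bins <- cut(counts,
            breaks = c(-1, 0, 1, 10, 100, 500, 1000, Inf),
            labels = c("0", "0-1", "1-10", "10-100", "100-500", "500-1000", ">1000"))

# one row per input value, showing which bin it landed in
print(data.frame(counts, bins))
```

The lower break of -1 is what lets an exact zero fall into its own "0" bin, and the NA passes through untouched, so it can later be drawn in the missing-value colour.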
We are now ready to plot the final data set.
    # assign text colour
    textcol <- "grey40"

    # further modified ggplot
    p <- ggplot(m4, aes(x=year, y=state, fill=countfactor))+
      geom_tile(colour="white", size=0.2)+
      guides(fill=guide_legend(title="Cases per 100,000 inhabitants"))+
      labs(x="", y="", title="Measles Rate in the United States")+
      scale_y_discrete(expand=c(0, 0))+
      scale_x_discrete(expand=c(0, 0),
                       breaks=c("1930", "1940", "1950", "1960", "1970", "1980", "1990", "2000"))+
      scale_fill_manual(values=c("#d53e4f", "#f46d43", "#fdae61", "#fee08b", "#e6f598", "#abdda4", "#ddf1da"),
                        na.value="grey90")+
      #coord_fixed()+
      theme_grey(base_size=10)+
      theme(legend.position="right", legend.direction="vertical",
            legend.title=element_text(colour=textcol),
            legend.margin=margin(grid::unit(0, "cm")),
            legend.text=element_text(colour=textcol, size=7, face="bold"),
            legend.key.height=grid::unit(0.8, "cm"),
            legend.key.width=grid::unit(0.2, "cm"),
            axis.text.x=element_text(size=10, colour=textcol),
            axis.text.y=element_text(vjust=0.2, colour=textcol),
            axis.ticks=element_line(size=0.4),
            plot.background=element_blank(),
            panel.border=element_blank(),
            plot.margin=margin(0.7, 0.4, 0.1, 0.2, "cm"),
            plot.title=element_text(colour=textcol, hjust=0, size=14, face="bold")
      )

    # export figure
    ggsave(p, filename="measles-mod3.png", height=5.5, width=8.8, units="in", dpi=200)
This is the final version of the plot. All font elements are coloured grey40. Missing values (NAs) are coloured grey90. The year in which vaccination was introduced is marked with a dark vertical bar. If there are too many y-axis labels, or they are less important, they can be reduced in the same way as the x-axis breaks. The custom colour palette is based on the Spectral palette from ColorBrewer. The R package RColorBrewer or the ggplot2 function scale_fill_brewer() gives access to all palettes from the ColorBrewer website. Below is an example:
    library(RColorBrewer)

    # change the scale_fill_manual from the previous code to the one below
    scale_fill_manual(values=rev(brewer.pal(7, "YlGnBu")), na.value="grey90")+
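The scale_fill_brewer() route mentioned above might look like the sketch below. It runs on a tiny made-up data frame (`toy` and its values are hypothetical), and reversing the palette with direction = -1 is one possible way to mirror the rev() call, not necessarily the exact look of the figure in this post.

```r
library(ggplot2)

# toy data; the column names mimic the ones used in the post, the values are made up
toy <- expand.grid(year = factor(2001:2003), state = c("Alabama", "Alaska"))
toy$countfactor <- factor(c("0", "0-1", "1-10", "0-1", NA, "0"),
                          levels = c("1-10", "0-1", "0"))

p_toy <- ggplot(toy, aes(x = year, y = state, fill = countfactor)) +
  geom_tile(colour = "white") +
  # pick a ColorBrewer palette by name; direction = -1 reverses its order
  scale_fill_brewer(palette = "YlGnBu", direction = -1, na.value = "grey90")

print(p_toy)
```

Compared with scale_fill_manual(), this saves listing the hex codes by hand, at the cost of less control over individual colours.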
Most of the plot customisation takes place in the theme() part of the code. See the ggplot index for all theme arguments. Depending on where the figure ends up next, you may need to change the fonts. The extrafont package is very handy for this. Another option is to export to a vector format such as .svg or .pdf, import it into a vector editor like Adobe Illustrator, and add custom text there. This gives clean results but requires some manual work. See, for example, the banner image of this post.
2.2 Base graphics
We'll take a quick look at plotting with base graphics. The base heatmap() function and the enhanced heatmap.2() function from the gplots package take a wide-format data matrix as input. We start with the 'long' data prepared in section 1 and convert it to 'wide' format using the spread() function from the tidyr package. After removing the non-numeric column, the wide data is converted into a matrix. The years are assigned as the row names of the matrix, and the matrix is transposed when plotting so that the states run along the y-axis.
    # load packages
    library(gplots)  # heatmap.2() function
    library(plotrix) # gradient.rect() function

    # convert from long format to wide format
    m5 <- m3 %>% spread(key="state", value=count)
    m6 <- as.matrix(m5[ ,-1])
    rownames(m6) <- m5$year

    # base heatmap
    png(filename="measles-base.png", height=5.5, width=8.8, res=200, units="in")
    heatmap(t(m6), Rowv=NA, Colv=NA, na.rm=TRUE, scale="none", col=terrain.colors(100),
            xlab="", ylab="", main="Measles Rate in the United States")
    dev.off()
That's about as good as it gets. The enhanced heatmap.2() function from the gplots package can do more. Its colour legend is not the best, so we use the gradient.rect() function from the plotrix package to create our own legend. With a ton of arguments and a lot of trial and error, we end up with the following result.
    # gplots heatmap.2
    png(filename="measles-gplot.png", height=6, width=9, res=200, units="in")
    par(mar=c(2, 3, 3, 2))
    gplots::heatmap.2(t(m6), na.rm=TRUE, dendrogram="none", Rowv=NULL, Colv="Rowv", trace="none", scale="none",
                      offsetRow=0.3, offsetCol=0.3,
                      breaks=c(-1, 0, 1, 10, 100, 500, 1000, max(m4$count, na.rm=TRUE)),
                      colsep=which(seq(1928, 2003)%%10==0),
                      margin=c(3, 8),
                      col=rev(c("#d53e4f", "#f46d43", "#fdae61", "#fee08b", "#e6f598", "#abdda4", "#ddf1da")),
                      xlab="", ylab="", key=FALSE, lhei=c(0.1, 0.9), lwid=c(0.2, 0.8))
    gradient.rect(0.125, 0.25, 0.135, 0.75, nslices=7, border=FALSE, gradient="y",
                  col=rev(c("#d53e4f", "#f46d43", "#fdae61", "#fee08b", "#e6f598", "#abdda4", "#ddf1da")))
    text(x=rep(0.118, 7), y=seq(0.28, 0.72, by=0.07), adj=1, cex=0.8,
         labels=c("0", "0-1", "1-10", "10-100", "100-500", "500-1000", ">1000"))
    text(x=0.135, y=0.82, labels="Cases per 100,000 inhabitants", adj=1, cex=0.85)
    title(main="Measles Rate in the United States", line=1, oma=T, adj=0.21)
    dev.off()
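The colsep argument in the call above takes column positions rather than year values, which is why the years are passed through which(). A small base R sketch of what that expression evaluates to:

```r
# the columns of the heatmap are the years 1928 to 2003, in order
years <- seq(1928, 2003)

# positions (not values) of the years that are divisible by 10
decade_cols <- which(years %% 10 == 0)

decade_cols         # 3 13 23 33 43 53 63 73
years[decade_cols]  # 1930 1940 1950 1960 1970 1980 1990 2000
```

The same idea works for rowsep if you want to separate groups of rows, for example by region, as long as the rows are already ordered by group.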
Finally we have something useful. The white partition lines are an interesting element: they can be used to group columns or rows as needed. Here, the vertical lines group the columns by decade. Below is the complete code, uninterrupted.
    # 2019 | Roy Matthew Francis
    # Heatmap R code

    # load packages
    library(ggplot2) # ggplot() for plotting
    library(dplyr)   # data reformatting
    library(tidyr)   # data reformatting
    library(stringr) # string manipulation

    # DATA PREPARATION ---------------------------------------------------------

    # read csv file
    m <- read.csv("measles_lev1.csv", header=TRUE, stringsAsFactors=FALSE, skip=2)

    m2 <- m %>%
      # convert data to long format
      gather(key="state", value="value", -YEAR, -WEEK) %>%
      # rename columns
      setNames(c("year", "week", "state", "value")) %>%
      # convert year to factor
      mutate(year=factor(year)) %>%
      # convert week to factor
      mutate(week=factor(week)) %>%
      # convert value to numeric (also converts '-' to NA, gives a warning)
      mutate(value=as.numeric(value))

    # remove . and change states to title case using a custom function
    fn_tc <- function(x) paste(str_to_title(unlist(strsplit(x, "[.]"))), collapse=" ")
    m2$state <- sapply(m2$state, fn_tc)

    # custom sum function returns NA when all values in the set are NA;
    # in a set mixed with NAs, the NAs are removed and the rest are summed
    na_sum <- function(x)
    {
      if(all(is.na(x))) val <- sum(x, na.rm=FALSE)
      if(!all(is.na(x))) val <- sum(x, na.rm=TRUE)
      return(val)
    }

    # sum incidences for all weeks into one year
    m3 <- m2 %>%
      group_by(year, state) %>%
      summarise(count=na_sum(value)) %>%
      as.data.frame()

    m4 <- m3 %>%
      # convert state to factor and reverse the order of levels
      mutate(state=factor(state, levels=rev(sort(unique(state))))) %>%
      # create a new variable from count
      mutate(countfactor=cut(count, breaks=c(-1, 0, 1, 10, 100, 500, 1000, max(count, na.rm=TRUE)),
                             labels=c("0", "0-1", "1-10", "10-100", "100-500", "500-1000", ">1000"))) %>%
      # change the order of levels
      mutate(countfactor=factor(as.character(countfactor), levels=rev(levels(countfactor))))

    # GGPLOT -------------------------------------------------------------------

    # assign text colour
    textcol <- "grey40"

    # further modified ggplot
    p <- ggplot(m4, aes(x=year, y=state, fill=countfactor))+
      geom_tile(colour="white", size=0.2)+
      guides(fill=guide_legend(title="Cases per 100,000 inhabitants"))+
      labs(x="", y="", title="Measles Rate in the United States")+
      scale_y_discrete(expand=c(0, 0))+
      scale_x_discrete(expand=c(0, 0),
                       breaks=c("1930", "1940", "1950", "1960", "1970", "1980", "1990", "2000"))+
      scale_fill_manual(values=c("#d53e4f", "#f46d43", "#fdae61", "#fee08b", "#e6f598", "#abdda4", "#ddf1da"),
                        na.value="grey90")+
      #coord_fixed()+
      theme_grey(base_size=10)+
      theme(legend.position="right", legend.direction="vertical",
            legend.title=element_text(colour=textcol),
            legend.margin=margin(grid::unit(0, "cm")),
            legend.text=element_text(colour=textcol, size=7, face="bold"),
            legend.key.height=grid::unit(0.8, "cm"),
            legend.key.width=grid::unit(0.2, "cm"),
            axis.text.x=element_text(size=10, colour=textcol),
            axis.text.y=element_text(vjust=0.2, colour=textcol),
            axis.ticks=element_line(size=0.4),
            plot.background=element_blank(),
            panel.border=element_blank(),
            plot.margin=margin(0.7, 0.4, 0.1, 0.2, "cm"),
            plot.title=element_text(colour=textcol, hjust=0, size=14, face="bold")
      )

    # export figure
    ggsave(p, filename="measles-mod3.png", height=5.5, width=8.8, units="in", dpi=200)

    # BASE GRAPHICS ------------------------------------------------------------

    # load packages
    library(gplots)  # heatmap.2() function
    library(plotrix) # gradient.rect() function

    # convert from long format to wide format
    m5 <- m3 %>% spread(key="state", value=count)
    m6 <- as.matrix(m5[ ,-1])
    rownames(m6) <- m5$year

    # base heatmap
    png(filename="measles-base.png", height=5.5, width=8.8, res=200, units="in")
    heatmap(t(m6), Rowv=NA, Colv=NA, na.rm=TRUE, scale="none", col=terrain.colors(100),
            xlab="", ylab="", main="Measles Rate in the United States")
    dev.off()

    # gplots heatmap.2
    png(filename="measles-gplot.png", height=6, width=9, res=200, units="in")
    par(mar=c(2, 3, 3, 2))
    gplots::heatmap.2(t(m6), na.rm=TRUE, dendrogram="none", Rowv=NULL, Colv="Rowv", trace="none", scale="none",
                      offsetRow=0.3, offsetCol=0.3,
                      breaks=c(-1, 0, 1, 10, 100, 500, 1000, max(m4$count, na.rm=TRUE)),
                      colsep=which(seq(1928, 2003)%%10==0),
                      margin=c(3, 8),
                      col=rev(c("#d53e4f", "#f46d43", "#fdae61", "#fee08b", "#e6f598", "#abdda4", "#ddf1da")),
                      xlab="", ylab="", key=FALSE, lhei=c(0.1, 0.9), lwid=c(0.2, 0.8))
    gradient.rect(0.125, 0.25, 0.135, 0.75, nslices=7, border=FALSE, gradient="y",
                  col=rev(c("#d53e4f", "#f46d43", "#fdae61", "#fee08b", "#e6f598", "#abdda4", "#ddf1da")))
    text(x=rep(0.118, 7), y=seq(0.28, 0.72, by=0.07), adj=1, cex=0.8,
         labels=c("0", "0-1", "1-10", "10-100", "100-500", "500-1000", ">1000"))
    text(x=0.135, y=0.82, labels="Cases per 100,000 inhabitants", adj=1, cex=0.85)
    title(main="Measles Rate in the United States", line=1, oma=T, adj=0.21)
    dev.off()

    # END OF SCRIPT ------------------------------------------------------------
3. Conclusion
We covered data preparation and heatmap plotting in R using base graphics and ggplot2. ggplot2 generally has a more consistent code structure, but base graphics can be useful when combining multiple plots in complex ways. Hacking ggplot2 can be more difficult than manipulating base graphics. I have not covered heatmaps with dendrograms because they are useful only in certain situations. Another useful function for heatmaps in R is perhaps pheatmap(), from the package of the same name.