A Guide to Elegant Tiled Heatmaps in R [2019] (2023)

This is an update to the older post from 2015 on the same topic. The purpose is exactly the same, but it uses the latest R packages and a coding style based on pipes (%>%) from the tidyverse packages.

This was inspired by the disease incidence rates in the United States featured in the Wall Street Journal. The disease incidence dataset was originally used in this article in the New England Journal of Medicine. In this post, I use the measles level 1 incidence dataset (cases per 100,000 population) obtained as a .csv file from Project Tycho. Download the .csv file here.

In this post we'll look at creating a neat, clean and elegant heatmap in R. No clustering, no dendrograms, no trace lines, no bullshit. We will go through basic data cleanup, reformatting and final plotting, step by step. To see all the code with minimal explanation, scroll to the bottom of the page.

I'm using R version 3.5.2 64-bit on Ubuntu 18.04 64-bit. The packages used are ggplot2 (3.1.0), dplyr (0.7.8), tidyr (0.8.2) and stringr (1.3.1). For the base graphics plots I use gplots (3.0.1.1) and plotrix (3.7-4). Install the necessary packages (if not already installed) and load them.

# install packages
# install.packages(pkgs = c("ggplot2","dplyr","tidyr","stringr","gplots","plotrix"), dependencies = TRUE)
# load packages
library(ggplot2) # ggplot() for plotting
library(dplyr) # data reformatting
library(tidyr) # data reformatting
library(stringr) # string manipulation

1. Data preparation

Load the .csv file and inspect the data. The first two lines of the file, which contain non-tabular information, are skipped.

# read the csv file
m <- read.csv("measles_lev1.csv", header=TRUE, stringsAsFactors=FALSE, skip=2)
# inspect the data
head(m)
str(m)
table(m$YEAR)
table(m$WEEK)

The head() function shows us the header names and the first 6 rows of data. The str() function shows that the YEAR and WEEK columns are stored as integers and the incidences as characters. Although the incidences are numbers, they are read in as characters because missing values are coded as "-". The table() function is used to check whether any years or weeks are missing. The data is currently in the so-called "wide" format, which we convert to the "long" format because the ggplot2 plotting package prefers long data. The wide format looks like this.
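To see why a single "-" turns a whole column into character, here is a small sketch on a made-up two-row CSV (not the Project Tycho file). It also shows na.strings, a standard read.csv() argument that this post does not use, as one possible way to get numeric columns directly.

```r
# Toy CSV text standing in for the real file (values are illustrative only)
csv_text <- "YEAR,WEEK,ALABAMA,ALASKA\n1928,1,3.67,-\n1928,2,6.25,0.5\n"

# Default read: the "-" forces the whole ALASKA column to character
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(raw$ALASKA)

# Declaring "-" as a missing-value code lets the column be parsed as numeric
clean <- read.csv(text = csv_text, stringsAsFactors = FALSE, na.strings = "-")
class(clean$ALASKA)
is.na(clean$ALASKA)
```

The rest of the post keeps the character columns and converts them with as.numeric() after reshaping, which gives the same NAs along with a coercion warning.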

> head(m)
  YEAR WEEK ALABAMA ALASKA ARIZONA ARKANSAS CALIFORNIA COLORADO
1 1928    1    3.67      -    1.90     4.11       1.38     8.38
2 1928    2    6.25      -    6.40     9.91       1.80     6.02
3 1928    3    7.95      -    4.50    11.15       1.31     2.86
4 1928    4   12.58      -    1.90    13.75       1.87    13.71
5 1928    5    8.03      -    0.47    20.79       2.38     5.13
6 1928    6    7.27      -    6.40    26.58       2.79     8.09

The YEAR and WEEK variables are left as they are, and all incidence values are collapsed into a variable column (state) and a value column. Column names are changed to lowercase for convenience. The year and week variables are converted to factors, and the value variable is converted to numeric.

m2 <- m %>%
  # convert data to long format
  gather(key="state", value="value", -YEAR, -WEEK) %>%
  # rename columns
  setNames(c("year", "week", "state", "value")) %>%
  # convert year to factor
  mutate(year=factor(year)) %>%
  # convert week to factor
  mutate(week=factor(week)) %>%
  # convert value to numeric (also converts '-' to NA, gives a warning)
  mutate(value=as.numeric(value))

The long format result is shown below.

> head(m2)
  year week   state value
1 1928    1 ALABAMA  3.67
2 1928    2 ALABAMA  6.25
3 1928    3 ALABAMA  7.95
4 1928    4 ALABAMA 12.58
5 1928    5 ALABAMA  8.03
6 1928    6 ALABAMA  7.27
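As a side note, tidyr 1.0.0 and later offer pivot_longer() as the successor to gather(). The sketch below is not the code used in this post; it applies pivot_longer() to a made-up two-state data frame to show the same wide-to-long reshaping. The data frame and its values are assumptions for illustration.

```r
library(tidyr)

# Toy wide data: one column per state (values are illustrative only)
wide <- data.frame(
  YEAR = c(1928, 1928),
  WEEK = c(1, 2),
  ALABAMA = c("3.67", "6.25"),
  ALASKA = c("-", "-"),
  stringsAsFactors = FALSE
)

# Collapse every column except YEAR and WEEK into state/value pairs
long <- pivot_longer(wide, cols = -c(YEAR, WEEK),
                     names_to = "state", values_to = "value")

# "-" cannot be parsed as a number, so it becomes NA (warning suppressed here)
long$value <- suppressWarnings(as.numeric(long$value))
long
```

One difference worth knowing: gather() stacks whole columns one after another, while pivot_longer() expands each input row in turn, so the row order of the two results differs even though the content is the same.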

The state variable is now in all caps and multi-word states have dot separators. I prefer to have them in title case and separated by spaces so that they look good on the plot later. A custom function is used to change the state names to title case. Multi-word states are split at the dot separator, each word is converted to title case using the function str_to_title(), and the words are then pasted back together.

# remove . and change states to title case using a custom function
fn_tc <- function(x) paste(str_to_title(unlist(strsplit(x, "[.]"))), collapse=" ")
m2$state <- sapply(m2$state, fn_tc)
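A quick check of the title-casing helper on a few made-up inputs may help. This sketch repeats the fn_tc definition so it runs on its own; vapply() with USE.NAMES=FALSE is my choice here to get a plain unnamed character vector, whereas the post uses sapply().

```r
library(stringr)

# Same helper as in the post: split on ".", title-case each word, re-join with spaces
fn_tc <- function(x) paste(str_to_title(unlist(strsplit(x, "[.]"))), collapse = " ")

# Illustrative inputs in the raw column-name style
raw_states <- c("ALABAMA", "NEW.HAMPSHIRE", "DISTRICT.OF.COLUMBIA")
clean_states <- vapply(raw_states, fn_tc, character(1), USE.NAMES = FALSE)
clean_states
```

Note that every word is capitalised, so small words such as "of" come out as "Of".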

Now I want to draw a heatmap with years on the x-axis and states on the y-axis, which means we have to get rid of the week variable somehow. We sum the incidences over all weeks of each year and drop the week variable. The dplyr-friendly way to do this is to use group_by() followed by summarise().

The way sum() handles NAs is a bit strange. By default it returns NA if one or more elements of the input vector are NA. If we set the argument na.rm=TRUE, the NAs are removed and the remaining numbers are summed. But if all the elements are NA, sum() returns zero, which is strange and undesirable in this case. So I use a custom sum function called na_sum() that removes NAs and returns the sum of the rest, or returns NA if all elements are NA. We then use this custom function inside summarise() to sum the data by year and state across all weeks.
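The three cases described above can be checked directly in base R. The vectors here are made up for illustration.

```r
# A vector with some missing values and a vector with only missing values
x_mixed <- c(1.5, NA, 2.5)
x_all_na <- c(NA_real_, NA_real_)

sum(x_mixed)                 # NA: a single NA poisons the default sum
sum(x_mixed, na.rm = TRUE)   # 4: NAs dropped, the rest summed
sum(x_all_na, na.rm = TRUE)  # 0: nothing left to sum, so the empty sum is returned
```

The last case is the one that matters for a heatmap: a state with no data at all for a year would be drawn as a true zero rather than as missing.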

# custom sum function returns NA when all values in the set are NA;
# in a set mixed with NAs, NAs are removed and the remaining values are summed
na_sum <- function(x) {
  if (all(is.na(x))) {
    val <- sum(x, na.rm=FALSE)
  } else {
    val <- sum(x, na.rm=TRUE)
  }
  return(val)
}
# sum incidences over all weeks of each year
m3 <- m2 %>%
  group_by(year, state) %>%
  summarise(count=na_sum(value)) %>%
  as.data.frame()

Our data now looks like this, without the week variable. For each state, the values are summed over the weeks of each year to create the new variable count.

> head(m3)
  year      state  count
1 1928    Alabama 334.99
2 1928     Alaska   0.00
3 1928    Arizona 200.75
4 1928   Arkansas 481.77
5 1928 California  69.22
6 1928   Colorado 206.98

That more or less concludes the data preparation. The data is in "long" format, and the x, y and z variables to be plotted are available and of the appropriate type: factor, factor and numeric. If your data is already in this format, you can go straight to plotting. However, depending on the type of data you start with, preparing and transforming it can be complex and tedious.

2. Plotting

I prefer the ggplot2 plotting package for drawing plots in R because of its consistent code structure. I will mainly focus on ggplot2 code here. However, for the sake of completeness, I will also include heatmap code that uses base graphics.

2.1 ggplot2

Plotting in ggplot2 is fairly straightforward once the data is in the correct format. The ggplot2 index page lists the code syntax and arguments.

# basic ggplot
p <- ggplot(m3, aes(x=year, y=state, fill=count))+
  geom_tile()
# save plot to working directory
ggsave(p, filename="measles-basic.png")
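For readers who want to try the plotting step without downloading the CSV, here is a minimal sketch on synthetic data. The data frame toy and its values are made up; only the shape (factor, factor, numeric) mirrors m3.

```r
library(ggplot2)

# Synthetic long-format data: 4 years x 3 states (values are made up)
toy <- expand.grid(
  year = factor(1928:1931),
  state = c("Alabama", "Alaska", "Arizona")
)
toy$count <- c(335, 120, NA, 48, 0, 2, 15, 7, 201, 96, 33, 410)

# One tile per year/state pair, filled by count; the NA tile gets the default NA colour
p_toy <- ggplot(toy, aes(x = year, y = state, fill = count)) +
  geom_tile()
```

Calling print(p_toy) in an interactive session displays the plot; ggsave() works on it in the same way as on p.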

The default output from ggplot2 is pretty good, but a few aspects of the basic plot could be tweaked and improved. We add tile borders, custom x-axis breaks and custom text sizes. The modified ggplot code is below.

# modified ggplot
p <- ggplot(m3, aes(x=year, y=state, fill=count))+
  # add a white border with a line thickness of 0.25
  geom_tile(color="white", size=0.25)+
  # remove the x and y axis labels
  labs(x="", y="")+
  # remove extra space
  scale_y_discrete(expand=c(0, 0))+
  # define new breaks on the x-axis
  scale_x_discrete(expand=c(0, 0),
                   breaks=c("1930", "1940", "1950", "1960", "1970", "1980", "1990", "2000"))+
  # set a base size for all fonts
  theme_grey(base_size=8)+
  # theme options
  theme(
    # bold font for legend text
    legend.text=element_text(face="bold"),
    # set thickness of axis ticks
    axis.ticks=element_line(size=0.4),
    # remove plot background
    plot.background=element_blank(),
    # remove plot border
    panel.border=element_blank())
# save at 200 dpi
ggsave(p, filename="measles-mod1.png", height=5.5, width=8.8, units="in", dpi=200)

I want the y-axis labels (states) to be arranged in ascending order from top to bottom. This means going back to the "long" format data and refactoring the state variable with its levels reversed. Here the fill variable (count) is a continuous variable, so ggplot uses a blue gradient by default. In this case, as in many cases, it might make more sense to bin the continuous data into levels and represent each level as a discrete colour. The cut() function in R allows you to break up and label a continuous variable.

The count variable is divided into 7 levels and stored as a new variable, countfactor. The NAs remain as NA. How to define the breaks depends on the type of data, the number of bins that makes sense in the context, or just trial and error. Too many bins is not good either. Inspecting your variable with summary(x) or boxplot(x) can reveal a lot about the data.
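To make the binning concrete, here is a small sketch of cut() on made-up counts, using the same lower breaks and labels as this post. The fixed upper break of 2000 is an assumption for the toy data; the post uses the maximum of the real counts instead.

```r
# Made-up yearly counts, one per bin, plus a missing value
counts <- c(0, 0.5, 5, 50, 250, 750, 2000, NA)

# Intervals are right-closed by default: (-1,0], (0,1], (1,10], ...
bins <- cut(counts,
            breaks = c(-1, 0, 1, 10, 100, 500, 1000, 2000),
            labels = c("0", "0-1", "1-10", "10-100", "100-500", "500-1000", ">1000"))
as.character(bins)
```

The lowest break is -1 so that exact zeros fall into their own "0" bin, and the NA passes through unchanged. cut() requires the breaks to be unique, so taking the top break from max() presumes the data actually exceeds 1000.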

m4 <- m3 %>%
  # convert state to factor and reverse the order of levels
  mutate(state=factor(state, levels=rev(sort(unique(state))))) %>%
  # create a new variable from count
  mutate(countfactor=cut(count, breaks=c(-1, 0, 1, 10, 100, 500, 1000, max(count, na.rm=TRUE)),
                         labels=c("0", "0-1", "1-10", "10-100", "100-500", "500-1000", ">1000"))) %>%
  # change the order of the levels
  mutate(countfactor=factor(as.character(countfactor), levels=rev(levels(countfactor))))
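The level reversal is easy to check on a tiny made-up vector. A discrete y-axis in ggplot2 places the first factor level at the bottom, so reversing the alphabetical levels is what puts "Alabama" at the top.

```r
states <- c("Texas", "Alabama", "Ohio", "Alabama")

# Default factor: levels sorted alphabetically
f_default <- factor(states)
levels(f_default)

# Same data, levels reversed as in the pipeline above
f_reversed <- factor(states, levels = rev(sort(unique(states))))
levels(f_reversed)
```

Only the level order changes; the values stored in the column stay the same.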

We are now ready to plot the final data set.

# assign text colour
textcol <- "grey40"
# further modified ggplot
p <- ggplot(m4, aes(x=year, y=state, fill=countfactor))+
  geom_tile(color="white", size=0.2)+
  guides(fill=guide_legend(title="Cases per 100,000 inhabitants"))+
  labs(x="", y="", title="Measles Rate in the United States")+
  scale_y_discrete(expand=c(0, 0))+
  scale_x_discrete(expand=c(0, 0),
                   breaks=c("1930", "1940", "1950", "1960", "1970", "1980", "1990", "2000"))+
  scale_fill_manual(values=c("#d53e4f", "#f46d43", "#fdae61", "#fee08b", "#e6f598", "#abdda4", "#ddf1da"),
                    na.value="grey90")+
  #coord_fixed()+
  theme_grey(base_size=10)+
  theme(legend.position="right", legend.direction="vertical",
        legend.title=element_text(color=textcol),
        legend.margin=margin(grid::unit(0, "cm")),
        legend.text=element_text(color=textcol, size=7, face="bold"),
        legend.key.height=grid::unit(0.8, "cm"),
        legend.key.width=grid::unit(0.2, "cm"),
        axis.text.x=element_text(size=10, color=textcol),
        axis.text.y=element_text(vjust=0.2, color=textcol),
        axis.ticks=element_line(size=0.4),
        plot.background=element_blank(),
        panel.border=element_blank(),
        plot.margin=margin(0.7, 0.4, 0.1, 0.2, "cm"),
        plot.title=element_text(color=textcol, hjust=0, size=14, face="bold"))
# export figure
ggsave(p, filename="measles-mod3.png", height=5.5, width=8.8, units="in", dpi=200)

This is the final version of the plot. All font elements are coloured grey40. Missing values (NAs) are coloured grey90. The year in which vaccination was introduced is marked with a dark vertical line. If there are too many y-axis labels, some can be dropped in the same way as on the x-axis. The colours are a custom palette from ColorBrewer based on the Spectral palette. Use the R package RColorBrewer or the ggplot2 function scale_fill_brewer() to access all the palettes on the ColorBrewer website. An example is shown below:

library(RColorBrewer)
# change the scale_fill_manual from the previous code to the one below
scale_fill_manual(values=rev(brewer.pal(7, "YlGnBu")), na.value="grey90")+
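The scale_fill_brewer() route mentioned above can be sketched in a similar way. This is an illustration on a made-up data frame, not the code used for the figures in this post; direction=-1 reverses the palette, playing the role of rev() in the snippet above.

```r
library(ggplot2)

# Small made-up data set with a binned fill variable and one missing value
toy <- data.frame(
  year = factor(rep(1928:1930, times = 2)),
  state = rep(c("Alabama", "Alaska"), each = 3),
  countfactor = factor(c("0-1", "1-10", "10-100", "0-1", NA, "1-10"),
                       levels = c("10-100", "1-10", "0-1"))
)

# ColorBrewer palette selected by name; missing tiles drawn in grey90
p_brewer <- ggplot(toy, aes(x = year, y = state, fill = countfactor)) +
  geom_tile(colour = "white") +
  scale_fill_brewer(palette = "YlGnBu", direction = -1, na.value = "grey90")
```

With scale_fill_brewer() the number of colours follows the number of factor levels, so there is no need to pass 7 explicitly.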

Most of the plot customisation takes place in the theme() section of the code. See the ggplot index for all theme arguments. Depending on where the figure finally ends up, you may need to change the fonts. The extrafont package is very handy for this. Another option is to export to a vector format such as .svg or .pdf, import the file into a vector editor like Adobe Illustrator and add your own text. This gives a clean result but requires some manual work. See, for example, the cover image of this post.

2.2 Base graphics

We'll take a quick look at plotting with base graphics. The base function heatmap() and the enhanced heatmap.2() function from the gplots package take a wide-format matrix as input. We start with the "long" data prepared in section 1 and convert it to "wide" format using the spread() function from the tidyr package. After removing the non-numeric column, the wide data is converted to a matrix. The years are added back as row names, and the matrix is transposed when plotting so that the state names become the row labels used as text on the y-axis.
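The long-to-wide step can also be sketched in base R alone. The example below is a swapped-in alternative using tapply() on a made-up data frame, not the spread() approach this post uses; it is only meant to show what shape the matrix ends up in.

```r
# Made-up long data: one row per year/state pair
long <- data.frame(
  year = factor(c(1928, 1928, 1929, 1929)),
  state = c("Alabama", "Alaska", "Alabama", "Alaska"),
  count = c(334.99, NA, 120.5, 3.2)
)

# Cross-tabulate count by year (rows) and state (columns);
# each cell holds exactly one value, so the first element is taken
wide_mat <- tapply(long$count, list(long$year, long$state), function(v) v[1])

# Transposing puts states on the rows, which is what heatmap() draws on the y-axis
state_rows <- t(wide_mat)
dim(state_rows)
rownames(state_rows)
```

Missing counts stay NA in the matrix, which is why the heatmap calls pass na.rm.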

# load packages
library(gplots) # heatmap.2() function
library(plotrix) # gradient.rect() function
# convert from long format to wide format
m5 <- m3 %>% spread(key="state", value=count)
m6 <- as.matrix(m5[, -1])
rownames(m6) <- m5$year
# base heatmap
png(filename="measles-base.png", height=5.5, width=8.8, res=200, units="in")
heatmap(t(m6), Rowv=NA, Colv=NA, na.rm=TRUE, scale="none", col=terrain.colors(100),
        xlab="", ylab="", main="Measles Rate in the United States")
dev.off()

That is about as far as we can go with heatmap(). The enhanced heatmap.2() function from the gplots package can do a bit more. Its colour legend is not the best, so we use the gradient.rect() function from the plotrix package to create our own legend. After lots of arguments and lots of trial and error, we get the result below.

# gplots heatmap.2
png(filename="measles-gplot.png", height=6, width=9, res=200, units="in")
par(mar=c(2, 3, 3, 2))
gplots::heatmap.2(t(m6), na.rm=TRUE, dendrogram="none", Rowv=NULL, Colv="Rowv", trace="none", scale="none",
                  offsetRow=0.3, offsetCol=0.3,
                  breaks=c(-1, 0, 1, 10, 100, 500, 1000, max(m4$count, na.rm=TRUE)),
                  colsep=which(seq(1928, 2003)%%10==0),
                  margins=c(3, 8),
                  col=rev(c("#d53e4f", "#f46d43", "#fdae61", "#fee08b", "#e6f598", "#abdda4", "#ddf1da")),
                  xlab="", ylab="", key=FALSE, lhei=c(0.1, 0.9), lwid=c(0.2, 0.8))
# custom legend: colour strip, bin labels and legend title
gradient.rect(0.125, 0.25, 0.135, 0.75, nslices=7, border=FALSE, gradient="y",
              col=rev(c("#d53e4f", "#f46d43", "#fdae61", "#fee08b", "#e6f598", "#abdda4", "#ddf1da")))
text(x=rep(0.118, 7), y=seq(0.28, 0.72, by=0.07), adj=1, cex=0.8,
     labels=c("0", "0-1", "1-10", "10-100", "100-500", "500-1000", ">1000"))
text(x=0.135, y=0.82, labels="Cases per 100,000 inhabitants", adj=1, cex=0.85)
title(main="Measles Rate in the United States", line=1, adj=0.21)
dev.off()
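The colsep argument in the call above is just a vector of column positions. This short base R sketch shows which positions the expression picks out, assuming (as the call does) one column per year from 1928 to 2003.

```r
# One column per year, in order
years <- seq(1928, 2003)

# Positions of the columns whose year is a multiple of 10
decade_idx <- which(years %% 10 == 0)
decade_idx
years[decade_idx]
```

These are the positions of 1930, 1940 and so on up to 2000, which is where the white separators fall. Changing the modulus (for example to 5) would give a finer grouping.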

Finally we have something useful. The white separator is a nice feature. You can use it to group columns or rows as needed. Here the vertical lines separate the decades. Below is the full code without breaks.

# 2019 | Roy Matthew Francis
# R code heatmap

# load packages
library(ggplot2) # ggplot() for plotting
library(dplyr) # data reformatting
library(tidyr) # data reformatting
library(stringr) # string manipulation

# data preparation -------------------------------------------------------------

# read the csv file
m <- read.csv("measles_lev1.csv", header=TRUE, stringsAsFactors=FALSE, skip=2)

m2 <- m %>%
  # convert data to long format
  gather(key="state", value="value", -YEAR, -WEEK) %>%
  # rename columns
  setNames(c("year", "week", "state", "value")) %>%
  # convert year to factor
  mutate(year=factor(year)) %>%
  # convert week to factor
  mutate(week=factor(week)) %>%
  # convert value to numeric (also converts '-' to NA, gives a warning)
  mutate(value=as.numeric(value))

# remove . and change states to title case using a custom function
fn_tc <- function(x) paste(str_to_title(unlist(strsplit(x, "[.]"))), collapse=" ")
m2$state <- sapply(m2$state, fn_tc)

# custom sum function returns NA when all values in the set are NA;
# in a set mixed with NAs, NAs are removed and the remaining values are summed
na_sum <- function(x) {
  if (all(is.na(x))) {
    val <- sum(x, na.rm=FALSE)
  } else {
    val <- sum(x, na.rm=TRUE)
  }
  return(val)
}

# sum incidences over all weeks of each year
m3 <- m2 %>%
  group_by(year, state) %>%
  summarise(count=na_sum(value)) %>%
  as.data.frame()

m4 <- m3 %>%
  # convert state to factor and reverse the order of levels
  mutate(state=factor(state, levels=rev(sort(unique(state))))) %>%
  # create a new variable from count
  mutate(countfactor=cut(count, breaks=c(-1, 0, 1, 10, 100, 500, 1000, max(count, na.rm=TRUE)),
                         labels=c("0", "0-1", "1-10", "10-100", "100-500", "500-1000", ">1000"))) %>%
  # change the order of the levels
  mutate(countfactor=factor(as.character(countfactor), levels=rev(levels(countfactor))))

# ggplot -----------------------------------------------------------------------

# assign text colour
textcol <- "grey40"

# further modified ggplot
p <- ggplot(m4, aes(x=year, y=state, fill=countfactor))+
  geom_tile(color="white", size=0.2)+
  guides(fill=guide_legend(title="Cases per 100,000 inhabitants"))+
  labs(x="", y="", title="Measles Rate in the United States")+
  scale_y_discrete(expand=c(0, 0))+
  scale_x_discrete(expand=c(0, 0),
                   breaks=c("1930", "1940", "1950", "1960", "1970", "1980", "1990", "2000"))+
  scale_fill_manual(values=c("#d53e4f", "#f46d43", "#fdae61", "#fee08b", "#e6f598", "#abdda4", "#ddf1da"),
                    na.value="grey90")+
  #coord_fixed()+
  theme_grey(base_size=10)+
  theme(legend.position="right", legend.direction="vertical",
        legend.title=element_text(color=textcol),
        legend.margin=margin(grid::unit(0, "cm")),
        legend.text=element_text(color=textcol, size=7, face="bold"),
        legend.key.height=grid::unit(0.8, "cm"),
        legend.key.width=grid::unit(0.2, "cm"),
        axis.text.x=element_text(size=10, color=textcol),
        axis.text.y=element_text(vjust=0.2, color=textcol),
        axis.ticks=element_line(size=0.4),
        plot.background=element_blank(),
        panel.border=element_blank(),
        plot.margin=margin(0.7, 0.4, 0.1, 0.2, "cm"),
        plot.title=element_text(color=textcol, hjust=0, size=14, face="bold"))

# export figure
ggsave(p, filename="measles-mod3.png", height=5.5, width=8.8, units="in", dpi=200)

# base graphics ----------------------------------------------------------------

# load packages
library(gplots) # heatmap.2() function
library(plotrix) # gradient.rect() function

# convert from long format to wide format
m5 <- m3 %>% spread(key="state", value=count)
m6 <- as.matrix(m5[, -1])
rownames(m6) <- m5$year

# base heatmap
png(filename="measles-base.png", height=5.5, width=8.8, res=200, units="in")
heatmap(t(m6), Rowv=NA, Colv=NA, na.rm=TRUE, scale="none", col=terrain.colors(100),
        xlab="", ylab="", main="Measles Rate in the United States")
dev.off()

# gplots heatmap.2
png(filename="measles-gplot.png", height=6, width=9, res=200, units="in")
par(mar=c(2, 3, 3, 2))
gplots::heatmap.2(t(m6), na.rm=TRUE, dendrogram="none", Rowv=NULL, Colv="Rowv", trace="none", scale="none",
                  offsetRow=0.3, offsetCol=0.3,
                  breaks=c(-1, 0, 1, 10, 100, 500, 1000, max(m4$count, na.rm=TRUE)),
                  colsep=which(seq(1928, 2003)%%10==0),
                  margins=c(3, 8),
                  col=rev(c("#d53e4f", "#f46d43", "#fdae61", "#fee08b", "#e6f598", "#abdda4", "#ddf1da")),
                  xlab="", ylab="", key=FALSE, lhei=c(0.1, 0.9), lwid=c(0.2, 0.8))
gradient.rect(0.125, 0.25, 0.135, 0.75, nslices=7, border=FALSE, gradient="y",
              col=rev(c("#d53e4f", "#f46d43", "#fdae61", "#fee08b", "#e6f598", "#abdda4", "#ddf1da")))
text(x=rep(0.118, 7), y=seq(0.28, 0.72, by=0.07), adj=1, cex=0.8,
     labels=c("0", "0-1", "1-10", "10-100", "100-500", "500-1000", ">1000"))
text(x=0.135, y=0.82, labels="Cases per 100,000 inhabitants", adj=1, cex=0.85)
title(main="Measles Rate in the United States", line=1, adj=0.21)
dev.off()

# end of script ----------------------------------------------------------------

3. Conclusion

We have covered data preparation and plotting heatmaps in R using base graphics and ggplot2. ggplot2 is generally more consistent in its code structure, but base graphics can be useful when combining multiple plots in complex ways. Hacking ggplot2 can be harder than manipulating base graphics. I have not covered heatmaps with dendrograms because they are only useful in certain situations. Another heatmap function for R that may be worth a look is provided here.

Article information

Author: Aracelis Kilback

Last Updated: 07/09/2023
