Version info: Code for this page was tested in R under development (unstable) (2012-02-22 r58461) on 2012-03-28 with knitr 0.4.
Like other statistical software packages, R can handle missing values. However, the way R handles them may require a shift in thinking for those used to working with missing values in other packages. On this page, we first cover the basics of how missing values are represented in R. Then, for those coming from SAS, SPSS, and/or Stata, we outline some of the differences between missing values in R and missing values elsewhere. Finally, we introduce some of the tools for working with missing values in R, both in data management and in analysis.
The very basics
Missing data in R appear as NA. NA is not a string or a numeric value, but an indicator of missingness. We can create vectors with missing values.
x1 <- c(1, 4, 3, NA, 7)
x2 <- c("A", "B", NA, "NA")
NA is one of the few non-numbers that we could include in x1 without generating an error (the other exceptions are letters representing numbers or numeric concepts, such as infinity). In x2, the third value is missing, while the fourth value is the character string "NA". To see which values in each of these vectors R recognizes as missing, we can use the is.na function. It returns a TRUE/FALSE vector with as many elements as the vector we provide.
is.na(x1)
## [1] FALSE FALSE FALSE  TRUE FALSE
is.na(x2)
## [1] FALSE FALSE  TRUE FALSE
We can see that R distinguishes between the NA and the "NA" in x2: NA is treated as a missing value, "NA" is not.
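Because is.na returns a logical vector, it combines naturally with other base R functions to count or locate missing values. The short sketch below rebuilds the two vectors so that it runs on its own; the names n_missing and where_missing are illustrative only, and anyNA assumes R 3.1.0 or later.

```r
x1 <- c(1, 4, 3, NA, 7)
x2 <- c("A", "B", NA, "NA")

# TRUE counts as 1, so sum() gives the number of missing values
n_missing <- sum(is.na(x1))

# which() gives the positions of the missing values
where_missing <- which(is.na(x2))

# anyNA() answers "is anything missing?" (assumes R >= 3.1.0)
anyNA(x1)
```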
Differences from other packages
- NA cannot be used in comparisons: In other packages, a "missing" value is assigned an extreme numeric value, either very high or very low. As a result, values coded as missing can 1) be compared to other values and 2) other values can be compared to missing. In the example SAS code below, we compare the values in y to 0 and to the missing symbol and see that both comparisons are valid (and that the missing symbol is valued at less than zero).
data test;
input x y;
datalines;
2 .
3 4
5 1
6 0
;
data test; set test;
lowy = (y < 0);
missy = (y = .);
run;
proc print data = test; run;

Obs    x    y    lowy    missy
 1     2    .      1       1
 2     3    4      0       0
 3     5    1      0       0
 4     6    0      0       0
We can try an equivalent method in R.
x1 < 0
## [1] FALSE FALSE FALSE    NA FALSE
x1 == NA
## [1] NA NA NA NA NA
Our missing value cannot be compared to 0, and none of our values can be compared to NA, because NA is not assigned a value: a value is simply either missing or not.
- NA is used for all kinds of missing data: In other packages, missing strings and missing numbers might be represented differently, with empty quotes for strings and periods for numbers. In R, NA represents all types of missing data. We saw a small example of this in x1 and x2: x1 is a "numeric" object and x2 is a "character" object.
- Non-NA values cannot be interpreted as missing: Other packages allow you to designate values as "system missing" so that they are interpreted as missing in the analysis. In R, you need to explicitly change these values to NA. The is.na function can also be used to make such a change:
is.na(x1) <- which(x1 == 7)
x1
## [1]  1  4  3 NA NA
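The same idea extends to data that arrive with numeric placeholder codes for missing values. The sketch below is only an illustration: the vector score and the codes -99 and -98 are hypothetical, not taken from any data set on this page.

```r
# Hypothetical scores where -99 and -98 were used as "missing" codes
score <- c(12, -99, 7, NA, -98, 15)

# %in% never returns NA, so it is safe even when NAs are already present
score[score %in% c(-99, -98)] <- NA

score
is.na(score)

# Test for missingness with is.na(), not with ==
observed <- score[!is.na(score)]
```

When the data come from a text file, the na.strings argument of read.table and read.csv can perform the same conversion at import time.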
NA options in R
We have introduced is.na as a tool for both finding and creating missing values. It is one of several functions built around NA. Most of the other functions for NA are options for na.action.
Just as functions have default settings, R itself has similar underlying defaults. You can view the current settings with options(). One of them is "na.action", which describes how missing values should be treated. The possible na.action settings in R include:
- na.omit and na.exclude: return the object with observations removed if they contain any missing values; the differences between omitting and excluding NAs can be seen in some prediction and residual functions
- na.pass: returns the object unchanged
- na.fail: returns the object only if it contains no missing values
To see the na.action currently set in the options, use getOption("na.action"). We can create a data frame with missing values and see how it is treated by each of the functions above.
(g <- as.data.frame(matrix(c(1:5, NA), ncol = 2)))
##   V1 V2
## 1  1  4
## 2  2  5
## 3  3 NA
na.omit(g)
##   V1 V2
## 1  1  4
## 2  2  5
na.exclude(g)
##   V1 V2
## 1  1  4
## 2  2  5
na.fail(g)
## Error in na.fail.default(g): missing values in object
na.pass(g)
##   V1 V2
## 1  1  4
## 2  2  5
## 3  3 NA
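One detail not visible in the printed output: na.omit and na.exclude record which rows they dropped in an attribute named "na.action" on the result. The minimal sketch below rebuilds the same data frame so that it runs on its own; the names g_omit, g_excl, and dropped are illustrative.

```r
g <- as.data.frame(matrix(c(1:5, NA), ncol = 2))

g_omit <- na.omit(g)
g_excl <- na.exclude(g)

# Row indices that were removed
dropped <- attr(g_omit, "na.action")
as.integer(dropped)

# The class of this attribute is what later lets residual and
# prediction functions decide whether to pad their output with NA
class(attr(g_omit, "na.action"))
class(attr(g_excl, "na.action"))
```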
Missing values in analysis
In some R functions, one of the arguments the user can provide is na.action. For example, if you look at the help for the lm command, you will see that na.action is one of the listed arguments. By default, lm uses the na.action specified in the R options. If you wish to use a different na.action for the regression, you can specify it in the lm call.
Two common options with lm are the default, na.omit, and na.exclude, which also leaves the missing values out of the fit but keeps their positions in the residuals and fitted values.
## use the well-known Anscombe's quartet data
## and set a few values to missing
anscombe <- within(anscombe, {
    y1[1:3] <- NA
})
anscombe  # view
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8    NA 9.14  7.46  6.58
## 2   8  8  8  8    NA 8.14  6.77  5.76
## 3  13 13 13  8    NA 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89
model.omit <- lm(y2 ~ y1, data = anscombe, na.action = na.omit)
model.exclude <- lm(y2 ~ y1, data = anscombe, na.action = na.exclude)
## compare effects on residuals
resid(model.omit)
##      4      5      6      7      8      9     10     11 
##  0.727  1.575 -0.799 -0.743 -1.553 -0.425  2.190 -0.971 
resid(model.exclude)
##      1      2      3      4      5      6      7      8      9     10 
##     NA     NA     NA  0.727  1.575 -0.799 -0.743 -1.553 -0.425  2.190 
##     11 
## -0.971 
## compare effects on fitted (predicted) values
fitted(model.omit)
##    4    5    6    7    8    9   10   11 
## 8.04 7.69 8.90 6.87 4.65 9.55 5.07 5.71 
fitted(model.exclude)
##    1    2    3    4    5    6    7    8    9   10   11 
##   NA   NA   NA 8.04 7.69 8.90 6.87 4.65 9.55 5.07 5.71 
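A practical consequence of this difference: because na.exclude returns one residual per original row, the result can be stored directly as a new column, whereas the na.omit residuals are shorter than the data frame. The sketch below repeats the setup on a copy named dat so that the built-in data are left alone; dat, m_omit, m_excl, and the column res are illustrative names.

```r
dat <- datasets::anscombe
dat$y1[1:3] <- NA

m_omit <- lm(y2 ~ y1, data = dat, na.action = na.omit)
m_excl <- lm(y2 ~ y1, data = dat, na.action = na.exclude)

length(resid(m_omit))  # one residual per row actually used in the fit
length(resid(m_excl))  # one residual per row of dat

# The lengths match, so this assignment lines up row by row
dat$res <- resid(m_excl)
```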
Using na.exclude pads the residuals and fitted values with NA where values were missing. Other functions do not use na.action, but instead have a different argument (with some default) for handling missing values. For example, the mean command will, by default, return NA if there are any NAs in the passed object.
mean(x1)
## [1] NA
If you wish to calculate the mean of the non-missing values in the passed object, you can indicate this with the na.rm argument (which is set to FALSE by default).
mean(x1, na.rm = TRUE)
## [1] 2.67
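The na.rm argument is shared by many other base R summaries, but not every function uses that name. As a brief sketch with made-up vectors x and y, the block below applies na.rm to a few summaries and then shows cor, which has no na.rm argument and takes a use argument instead.

```r
x <- c(1, 4, 3, NA, NA)
y <- c(2, 8, 5, 1, NA)

sum(x)                   # NA propagates by default
sum(x, na.rm = TRUE)
median(x, na.rm = TRUE)
range(x, na.rm = TRUE)

# cor() has no na.rm argument; "complete.obs" keeps only the
# pairs in which both x and y are observed
r <- cor(x, y, use = "complete.obs")
```

This is one reason to check the help file of each function: the name and the default of the missing-value argument vary.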
Two common commands used in data management and exploration are summary and table. The summary command (when used with numeric vectors) returns the number of NAs in a vector, but the table command ignores NAs by default.
summary(x1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    2.00    3.00    2.67    3.50    4.00       2 
table(x1)
## x1
## 1 3 4 
## 1 1 1 
To see NA in the table output, you can specify "ifany" or "always" in the useNA argument. The first shows NA in the output only if the object contains missing data. The second includes NA in the output regardless.
table(x1, useNA = "ifany")
## x1
##    1    3    4 <NA> 
##    1    1    1    2 
table(1:3, useNA = "always")
## 
##    1    2    3 <NA> 
##    1    1    1    0 
Sorting data containing missing values in R is again different from other packages because NA cannot be compared to other values. By default, sort removes any NA values and can therefore change the length of a vector.
(x1s <- sort(x1))
## [1] 1 3 4
length(x1s)
## [1] 3
The user can specify whether NA should come last or first in the sorted order by setting the na.last argument to TRUE or FALSE.
sort(x1, na.last = TRUE)
## [1]  1  3  4 NA NA
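sort works on a single vector. To reorder the rows of a data frame by a column that contains NA, base R's order is the usual tool, and unlike sort it keeps the NA entries by default (its na.last argument defaults to TRUE). A small sketch with a made-up data frame df:

```r
df <- data.frame(id = c("a", "b", "c", "d", "e"),
                 x  = c(1, 4, 3, NA, NA),
                 stringsAsFactors = FALSE)

# Rows with NA in x are kept and placed last
df_last <- df[order(df$x), ]

# na.last = FALSE moves them to the top; na.last = NA drops them
df_first <- df[order(df$x, na.last = FALSE), ]
df_drop  <- df[order(df$x, na.last = NA), ]
```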
No matter the goal of your R code, it is advisable to first check whether your data contain missing values and, if so, to consult the help files for all the functions you use. You should either be aware of and comfortable with the default treatment of missing values or specify the treatment you want for your analysis.
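As a starting point for that first check, the sketch below counts missing values per column and flags complete rows using only base R. The data frame dat and its columns are invented for illustration.

```r
dat <- data.frame(age    = c(23, NA, 31, 40),
                  income = c(NA, NA, 52000, 61000),
                  city   = c("Oslo", "Lima", NA, "Pune"),
                  stringsAsFactors = FALSE)

# Number of missing values in each column
n_na <- colSums(is.na(dat))
n_na

# TRUE for rows with no missing values in any column
ok <- complete.cases(dat)
ok

# Share of rows that are complete
mean(ok)
```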