To answer your question:
You should use %in%. It gives you back a logical vector that never contains NA.
a %in% ""
# [1] FALSE TRUE FALSE
x[!a %in% ""]
# a
# 1: 1
# 2: NA
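A minimal base R sketch of the difference, assuming the column holds the values c("1", "", NA) (an assumption inferred from the output above, not taken from the question):

```r
# Assumed example values; the real column comes from the OP's table.
a <- c("1", "", NA)

neq <- a != ""        # comparing NA with anything yields NA
nin <- !(a %in% "")   # %in% is based on match() and returns only TRUE/FALSE

print(neq)  # TRUE, FALSE, NA
print(nin)  # TRUE, FALSE, TRUE
```

Because nin has no NA, subsetting with it keeps the NA row instead of leaving the decision to how the NA is handled later.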
To find out why this happens in data.table (as opposed to data.frame):
If you look at the data.table source code, in the file data.table.R under the function "[.data.table", there's a set of if-statements that check the i argument. One of them is:
if (!missing(i)) {
    # Part (1)
    isub = substitute(i)
    # Part (2)
    if (is.call(isub) && isub[[1L]] == as.name("!")) {
        notjoin = TRUE
        if (!missingnomatch) stop("not-join '!' prefix is present on i but nomatch is provided. Please remove nomatch.");
        nomatch = 0L
        isub = isub[[2L]]
    }
    .....
    # "isub" is being evaluated using "eval" to result in a logical vector
    # Part 3
    if (is.logical(i)) {
        # see DT[NA] thread re recycling of NA logical
        if (identical(i,NA)) i = NA_integer_
        # avoids DT[!is.na(ColA) & !is.na(ColB) & ColA==ColB], just DT[ColA==ColB]
        else i[is.na(i)] = FALSE
    }
    ....
}
To explain the discrepancy, I've pasted the important piece of code above and marked it in 3 parts.
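The check in parts 1 and 2 can be tried outside data.table. This is only a sketch: has_not_prefix is a hypothetical helper written for illustration, not a data.table function, and it copies just the condition from part 2.

```r
# Hypothetical helper (not part of data.table): TRUE when the unevaluated
# expression passed as i has "!" as its outermost call.
has_not_prefix <- function(i) {
  isub <- substitute(i)                        # part 1: capture, do not evaluate
  is.call(isub) && isub[[1L]] == as.name("!")  # part 2: the condition
}

# `a` does not need to exist, because i is never evaluated here.
r1 <- has_not_prefix(a != "")      # outermost call is `!=`
r2 <- has_not_prefix(!(a == ""))   # outermost call is `!`
r3 <- has_not_prefix(!a %in% "")   # parses as !(a %in% ""), so `!` again

print(c(r1, r2, r3))  # FALSE, TRUE, TRUE
```

Note that `!=` and `!` are different functions, so only an expression that literally starts with ! takes the not-join branch.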
First, why doesn't dt[a != ""] work as expected (by the OP)?
Part 1 evaluates to an object of class call. The condition of the if statement in part 2 is FALSE, so that block is skipped. Following that, the call is evaluated to give c(TRUE, FALSE, NA). Then part 3 is executed, so NA is replaced with FALSE (the last line of the is.logical block), and the NA row is dropped.
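The same steps in base R, as a simplified sketch: it assumes a is c("1", "", NA) and leaves out the identical(i, NA) branch of part 3.

```r
# Assumed example values, standing in for the column a.
a <- c("1", "", NA)

i <- a != ""              # c(TRUE, FALSE, NA)
if (is.logical(i)) {
  i[is.na(i)] <- FALSE    # the NA replacement from part 3
}
rows <- which(i)          # only row 1 survives

print(i)     # TRUE, FALSE, FALSE
print(rows)  # 1
```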
Why does x[!(a == "")] work as expected (by the OP)?
Part 1 returns a call once again. But the condition in part 2 evaluates to TRUE and therefore sets:
1) `notjoin = TRUE`
2) isub <- isub[[2L]] # which is equal to (a == "") without the ! (exclamation)
That is where the magic happens: the negation has been removed for now. Remember, this is still an object of class call, so it gets evaluated (using eval) to a logical vector again. So (a == "") evaluates to c(FALSE, TRUE, NA).
Now this is checked with is.logical in part 3, so here NA gets replaced with FALSE and the vector becomes c(FALSE, TRUE, FALSE). At some point later, which(c(F,T,F)) is executed, which results in 2 here. Because notjoin = TRUE (from part 2), seq_len(nrow(x))[-2] = c(1,3) is returned. So x[!(a == "")] basically returns x[c(1,3)], which is the desired result. Here's the relevant code snippet:
if (notjoin) {
    if (bywithoutby || !is.integer(irows) || is.na(nomatch)) stop("Internal error: notjoin but bywithoutby or !integer or nomatch==NA")
    irows = irows[irows!=0L]
    # WHERE MAGIC HAPPENS (returns c(1,3))
    i = irows = if (length(irows)) seq_len(nrow(x))[-irows] else NULL  # NULL meaning all rows i.e. seq_len(nrow(x))
    # Doing this once here, helps speed later when repeatedly subsetting each column. R's [irows] would do this for each
    # column when irows contains negatives.
}
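Putting the not-join path together in base R gives the following sketch. It assumes a is c("1", "", NA), uses seq_along(a) in place of seq_len(nrow(x)), and returns all row numbers where the source returns NULL; none of this is data.table's actual code path.

```r
# Assumed example values, standing in for the column a.
a <- c("1", "", NA)

isub <- quote(!(a == ""))                               # what substitute(i) would capture
notjoin <- is.call(isub) && isub[[1L]] == as.name("!")  # TRUE
if (notjoin) isub <- isub[[2L]]                         # strip "!", leaving (a == "")

i <- eval(isub)           # c(FALSE, TRUE, NA)
i[is.na(i)] <- FALSE      # c(FALSE, TRUE, FALSE)
irows <- which(i)         # 2

# Negative indexing does the negation; the length guard matters because
# indexing with -integer(0) would select nothing rather than everything.
keep <- if (length(irows)) seq_along(a)[-irows] else seq_along(a)

print(keep)  # 1, 3
```

The NA row is kept here only because the negation is applied to row numbers after NA has already been turned into FALSE.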
Given that, I think there are some inconsistencies in the syntax. If I manage to find time to formulate the problem, I'll write a post soon.