r - subsetting a data.table using !=<some non-NA> excludes NA too

Question

Welcome To Ask or Share your Answers For Others

r - subsetting a data.table using !=<some non-NA> excludes NA too

asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

r - subsetting a data.table using !=<some non-NA> excludes NA too

I have a data.table with a column that has NAs. I want to drop rows where that column takes a particular value (which happens to be ""). However, my first attempt lead me to lose rows with NAs as well:

> a = c(1,"",NA)
> x <- data.table(a);x
    a
1:  1
2:   
3: NA
> y <- x[a!=""];y
   a
1: 1

After looking at ?`!=`, I found a one liner that works, but it's a pain:

> z <- x[!sapply(a,function(x)identical(x,""))]; z
    a
1:  1
2: NA

I'm wondering if there's a better way to do this? Also, I see no good way of extending this to excluding multiple non-NA values. Here's a bad way:

>     drop_these <- function(these,where){
+         argh <- !sapply(where,
+             function(x)unlist(lapply(as.list(these),function(this)identical(x,this)))
+         )
+         if (is.matrix(argh)){argh <- apply(argh,2,all)}
+         return(argh)
+     }
>     x[drop_these("",a)]
    a
1:  1
2: NA
>     x[drop_these(c(1,""),a)]
    a
1: NA

I looked at ?J and tried things out with a data.frame, which seems to work differently, keeping NAs when subsetting:

> w <- data.frame(a,stringsAsFactors=F); w
     a
1    1
2     
3 <NA>
> d <- w[a!="",,drop=F]; d
      a
1     1
NA <NA>

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-16T23:07:28+0000

To provide a solution to your question:

You should use %in%. It gives you back a logical vector.

a %in% ""
# [1] FALSE  TRUE FALSE

x[!a %in% ""]
#     a
# 1:  1
# 2: NA

To find out why this is happening in `data.table`:

(as opposted to data.frame)

If you look at the data.table source code on the file data.table.R under the function "[.data.table", there's a set of if-statements that check for i argument. One of them is:

if (!missing(i)) {
    # Part (1)
    isub = substitute(i)

    # Part (2)
    if (is.call(isub) && isub[[1L]] == as.name("!")) {
        notjoin = TRUE
        if (!missingnomatch) stop("not-join '!' prefix is present on i but nomatch is provided. Please remove nomatch.");
        nomatch = 0L
        isub = isub[[2L]]
    }

    .....
    # "isub" is being evaluated using "eval" to result in a logical vector

    # Part 3
    if (is.logical(i)) {
        # see DT[NA] thread re recycling of NA logical
        if (identical(i,NA)) i = NA_integer_  
        # avoids DT[!is.na(ColA) & !is.na(ColB) & ColA==ColB], just DT[ColA==ColB]
        else i[is.na(i)] = FALSE  
    }
    ....
}

To explain the discrepancy, I've pasted the important piece of code here. And I've also marked them into 3 parts.

First, why `dt[a != ""]` doesn't work as expected (by the OP)?

First, part 1 evaluates to an object of class call. The second part of the if statement in part 2 returns FALSE. Following that, the call is "evaluated" to give c(TRUE, FALSE, NA) . Then part 3 is executed. So, NA is replaced to FALSE (the last line of the logical loop).

why does `x[!(a== "")]` work as expected (by the OP)?

part 1 returns a call once again. But, part 2 evaluates to TRUE and therefore sets:

1) `notjoin = TRUE`
2) isub <- isub[[2L]] # which is equal to (a == "") without the ! (exclamation)

That is where the magic happened. The negation has been removed for now. And remember, this is still an object of class call. So this gets evaluated (using eval) to logical again. So, (a=="") evaluates to c(FALSE, TRUE, NA).

Now, this is checked for is.logical in part 3. So, here, NA gets replaced to FALSE. It therefore becomes, c(FALSE, TRUE, FALSE). At some point later, a which(c(F,T,F)) is executed, which results in 2 here. Because notjoin = TRUE (from part 2) seq_len(nrow(x))[-2] = c(1,3) is returned. so, x[!(a=="")] basically returns x[c(1,3)] which is the desired result. Here's the relevant code snippet:

if (notjoin) {
    if (bywithoutby || !is.integer(irows) || is.na(nomatch)) stop("Internal error: notjoin but bywithoutby or !integer or nomatch==NA")
    irows = irows[irows!=0L]
    # WHERE MAGIC HAPPENS (returns c(1,3))
    i = irows = if (length(irows)) seq_len(nrow(x))[-irows] else NULL  # NULL meaning all rows i.e. seq_len(nrow(x))
    # Doing this once here, helps speed later when repeatedly subsetting each column. R's [irows] would do this for each
    # column when irows contains negatives.
}

Given that, I think there are some inconsistencies with the syntax.. And if I manage to get time to formulate the problem, then I'll write a post soon.

Categories

r - subsetting a data.table using !=<some non-NA> excludes NA too

r - subsetting a data.table using !=<some non-NA> excludes NA too

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

To provide a solution to your question:

To find out why this is happening in `data.table`:

First, why `dt[a != ""]` doesn't work as expected (by the OP)?

why does `x[!(a== "")]` work as expected (by the OP)?

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags

Categories

r - subsetting a data.table using !=<some non-NA> excludes NA too

r - subsetting a data.table using !=<some non-NA> excludes NA too

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

To provide a solution to your question:

To find out why this is happening in data.table:

First, why dt[a != ""] doesn't work as expected (by the OP)?

why does x[!(a== "")] work as expected (by the OP)?

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags

To find out why this is happening in `data.table`:

First, why `dt[a != ""]` doesn't work as expected (by the OP)?

why does `x[!(a== "")]` work as expected (by the OP)?