r - data.table vs dplyr: can one do something well the other can't or does poorly?

Overview

I'm relatively familiar with data.table, not so much with dplyr.

I've read through some dplyr vignettes and examples that have popped up on SO, and so far my conclusions are that:

  1. data.table and dplyr are comparable in speed, except when there are many (ie >10-100K) groups, and in some other circumstances (see benchmarks below)

  2. dplyr has more accessible syntax

  3. dplyr abstracts (or will) potential DB interactions

  4. There are some minor functionality differences (see "Examples/Usage" below)

In my mind 2. doesn't bear much weight because I am fairly familiar with data.table, though I understand that for users new to both it will be a big factor.

I would like to avoid an argument about which is more intuitive, as that is irrelevant for my specific question, asked from the perspective of someone already familiar with data.table.

I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested in here).

Question

What I want to know is:

  1. Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (ie some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing).

  2. Are there analytical tasks that are performed substantially (ie more than 2x) more efficiently in one package vs. another.

One recent SO question got me thinking about this a bit more, because up until that point I didn't think dplyr would offer much beyond what I can already do in data.table.

Here is the dplyr solution (data at end of Q):

dat %>%
  group_by(name, job) %>%
  filter(job != "Boss" | year == min(year)) %>%
  mutate(cumu_job2 = cumsum(job2))

Which was much better than my hack attempt at a data.table solution.

That said, good data.table solutions are also pretty good (thanks Jean-Robert and Arun; note that here I favored a single statement over the strictly most optimal solution):

setDT(dat)[,
  .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)], 
  by=list(id, job)
]

The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to data.table (ie it doesn't use some of the more esoteric tricks).

Ideally what I'd like to see is some good examples where the dplyr or data.table way is substantially more concise or performs substantially better.

Examples

Usage
  • dplyr does not allow grouped operations that return an arbitrary number of rows (from eddi's question; note: this looks like it will be implemented in dplyr 0.5; also, @beginneR shows a potential work-around using do in the answer to @eddi's question; a sketch follows after this list).

  • data.table supports rolling joins (thanks @dholstius) as well as overlap joins; a rolling-join sketch follows after this list.

  • data.table internally optimises expressions of the form DT[col == value] or DT[col %in% values] for speed through automatic indexing, which uses binary search while keeping the same base R syntax (see the indexing sketch after this list).

    See here for some more details and a tiny benchmark.

  • dplyr offers standard evaluation versions of functions (eg regroup, summarize_each_) that can simplify the programmatic use of dplyr (note that programmatic use of data.table is definitely possible, it just requires some careful thought and substitution/quoting, etc, at least to my knowledge; a sketch follows after this list).

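A sketch of the do() work-around mentioned above (toy data; do() is the escape hatch for per-group expressions that return an arbitrary number of rows):

library(dplyr)

df <- data.frame(g = c(1, 1, 1, 2, 2), x = 1:5)  # hypothetical toy data
df %>%
  group_by(g) %>%
  do(head(., 2))  # the expression inside do() may return any number of rows per group
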
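A minimal rolling-join sketch (the quotes/trades tables are illustrative, not from the linked answers):

library(data.table)

quotes <- data.table(time = c(1L, 5L, 10L), bid = c(99, 100, 101), key = "time")
trades <- data.table(time = c(2L, 7L), key = "time")
quotes[trades, roll = TRUE]  # for each trade, roll the last quote at or before its time forward
#    time bid
# 1:    2  99
# 2:    7 100
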
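A small sketch of automatic indexing (toy data; assumes a data.table version that exports indices()):

library(data.table)

DT <- data.table(x = sample(1e5L, 1e6L, TRUE), y = runif(1e6L))
DT[x == 42L]  # the first such call automatically builds a secondary index on 'x'
DT[x == 99L]  # later calls reuse the index via binary search instead of a vector scan
indices(DT)   # lists the automatically created index on 'x'
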
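For comparison, a sketch of programmatic (string-driven) use of data.table (the column-name variables are hypothetical):

library(data.table)

DT  <- data.table(a = 1:6, grp = rep(c("x", "y"), 3))
col <- "a"                                  # column name supplied at run time
DT[, .(total = sum(get(col))), by = "grp"]  # get() resolves the string to a column
DT[, .SD, .SDcols = col]                    # .SDcols also accepts column names as strings
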
Benchmarks
  • I ran my own benchmarks and found both packages to be comparable in "split apply combine" style analysis, except when there are very large numbers of groups (>100K), at which point data.table becomes substantially faster.

  • @Arun ran some benchmarks on joins, showing that data.table scales better than dplyr as the number of groups increases (updated with recent enhancements in both packages and a recent version of R).

    Also, a benchmark on getting unique values has data.table ~6x faster.

  • (Unverified) data.table is 75% faster on larger versions of a group/apply/sort task, while dplyr was 40% faster on the smaller ones.

1 Answer

We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features.

My intent is to cover each one of these as clearly as possible from a data.table perspective.

Note: unless explicitly mentioned otherwise, by referring to dplyr we refer to dplyr's data.frame interface, whose internals are implemented in C++ using Rcpp.


The data.table syntax is consistent in its form - DT[i, j, by].

Keeping i, j and by together is by design.

By keeping related operations together, data.table can easily optimise operations for speed and, more importantly, memory usage, and can also provide some powerful features, all while maintaining consistency in syntax (a small illustration follows).

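A minimal illustration with toy data of how the three parts compose in one call: i subsets rows (like WHERE), j computes (like SELECT) and by groups (like GROUP BY):

library(data.table)

DT <- data.table(g = c("a", "a", "b", "b"), v = 1:4)
DT[v > 1, .(total = sum(v)), by = g]  # subset, aggregate and group in a single step
#    g total
# 1: a     2
# 2: b     7
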
1. Speed

Quite a few benchmarks (though mostly on grouping operations) have been added to the question already, showing that data.table gets faster than dplyr as the number of groups and/or rows to group by increases, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) with 100 to 10 million groups and varying grouping columns, which also compare pandas.

See also the updated benchmarks, which include Spark and pydatatable.

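To reproduce the flavour of such comparisons locally, here is a minimal grouped-aggregation timing sketch (sizes and setup are illustrative only, not the cited benchmarks; assumes the microbenchmark package is installed):

library(data.table)
library(dplyr)
library(microbenchmark)

N <- 1e6L; K <- 1e5L  # many groups: the regime where data.table tends to pull ahead
DT <- data.table(g = sample(K, N, TRUE), x = runif(N))
DF <- as.data.frame(DT)

microbenchmark(
  data.table = DT[, .(s = sum(x)), by = g],
  dplyr      = DF %>% group_by(g) %>% summarise(s = sum(x)),
  times = 10L
)
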
On benchmarks, it would be great to cover these remaining aspects as well:

  • Grouping operations involving a subset of rows - ie, DT[x > val, sum(y), by = z] type operations.

  • Benchmark other operations, such as update and joins.

  • Also benchmark memory footprint for each operation in addition to runtime.

2. Memory usage

  1. Operations involving filter() or slice() in dplyr can be memory inefficient (on both data.frames and data.tables).

    See this post.

    Note that Hadley's comment talks about speed (that dplyr is plenty fast for him), whereas the major concern here is memory.

  2. The data.table interface at the moment allows one to modify/update columns by reference (note that we don't need to re-assign the result back to a variable).

      # sub-assign by reference, updates 'y' in-place
      DT[x >= 1L, y := NA]

    But dplyr will never update by reference.

    The dplyr equivalent would be (note that the result needs to be re-assigned):

      # copies the entire 'y' column
      ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))

    A concern for this is referential transparency.

    Updating a data.table object by reference, especially within a function, may not always be desirable.

    But this is an incredibly useful feature: see this and this post for interesting cases.

    And we want to keep it.

    Therefore we are working towards exporting a shallow() function in data.table that will provide the user with both possibilities.

    For example, if it is desirable to not modify the input data.table within a function, one can then do:

      foo <- function(DT) {
        DT = shallow(DT)          ## shallow copy DT
        DT[, newcol := 1L]        ## does not affect the original DT
        DT[x > 2L, newcol := 2L]  ## no need to copy (internally), as this column
                                  ## exists only in the shallow-copied DT
        DT[x > 2L, x := 3L]       ## have to copy (like base R / dplyr does always);
                                  ## otherwise the original DT would also get modified
      }

    By not using shallow(), the old functionality is retained:

      bar <- function(DT) {
        DT[, newcol := 1L]   ## old behaviour, original DT gets updated by reference
        DT[x > 2L, x := 3L]  ## old behaviour, update column x in the original DT
      }

    By creating a shallow copy using shallow(), we understand that you don't want to modify the original object.

    We take care of everything internally to ensure that columns you modify are copied only when it is absolutely necessary.

    When implemented, this should settle the referential transparency issue altogether while providing the user with both possibilities.

    Also, once shallow() is exported, dplyr's data.table interface should avoid almost all copies.

    So those who prefer dplyr's syntax can use it with data.tables.

    But it will still lack many features that data.table provides, including (sub)-assignment by reference.

  3. Aggregate while joining:

    Suppose you have two data.tables as follows:

      DT1 = data.table(x = c(1,1,1,1,2,2,2,2), y = c("a", "a", "b", "b"),
                       z = 1:8, key = c("x", "y"))
      #    x y z
      # 1: 1 a 1
      # 2: 1 a 2
      # 3: 1 b 3
      # 4: 1 b 4
      # 5: 2 a 5
      # 6: 2 a 6
      # 7: 2 b 7
      # 8: 2 b 8

      DT2 = data.table(x = 1:2, y = c("a", "b"), mul = 4:3, key = c("x", "y"))
      #    x y mul
      # 1: 1 a   4
      # 2: 2 b   3

    And you would like to get sum(z) * mul for each row in DT2 while joining by columns x,y.

    We can either:

    • 1) aggregate DT1 to get sum(z), 2) perform a join, and 3) multiply (or)

       # data.table way
       DT1[, .(z = sum(z)), keyby = .(x, y)][DT2][, z := z * mul][]

       # dplyr equivalent
       DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
         right_join(DF2) %>% mutate(z = z * mul)
    • 2) do it all in one go (using the by = .EACHI feature):

       DT1[DT2, list(z=sum(z) * mul), by = .EACHI] 
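
     For the toy tables above, both routes return the same result; sum(z) is 3 for the (1,a) group and 15 for the (2,b) group, so the output is:

       #    x y  z
       # 1: 1 a 12
       # 2: 2 b 45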

    What is the advantage?

    • We don't have to allocate memory for the intermediate result.

    • We don't have to group/hash twice (once for the aggregation and again for the join).

    • And more importantly, the operation we want to perform is clear from looking at j in (2).

    Check this post for a detailed explanation of by = .EACHI.

...