Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
70 views
in Technique[技术] by (71.8m points)

python - How to add pointer from one to another numpy array with fast (numpy) functions?

I have two numpy arrays a and b.

I would like to know the position of the entry in the first column of a in the first column of b. If the entry in the first column of a does not exist in b, a NaN should be set for this row. The resulting column should be added at the right side of a.

Remark: the first columns of a and b are numerically ordered.

Example:

import numpy as np

a = np.array([
    [1, 201, ],
    [2, 202, ],
    [3, 203, ],
    [4, 204, ],
    [5, 205, ],
    [6, 206, ],
    [7, 207, ],
    [8, 208, ],
    [14,214,],
    [23,223,],
])

b = np.array([
    [3, 303, ], # 0
    [6, 306, ], # 1
    [7, 307, ], # 2
    [14,314,],  # 3
    [16,316,],  # 4
])

So, I am looking for a fast (if possible: numpy or related packages based) solution so that I get a new version of a that looks like:

array([[  1., 201.,  nan],
       [  2., 202.,  nan],
       [  3., 203.,   0.],
       [  4., 204.,  nan],
       [  5., 205.,  nan],
       [  6., 206.,   1.],
       [  7., 207.,   2.],
       [  8., 208.,  nan],
       [ 14., 214.,   3.],
       [ 23., 223.,  nan]])

sketch showing the relationship between a and b

question from:https://stackoverflow.com/questions/65904952/how-to-add-pointer-from-one-to-another-numpy-array-with-fast-numpy-functions

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Provided your arrays are sorted on [:, 0] as in your example, then you can use np.searchsorted() for a (very) fast solution (note: bold claim here; if there is a faster way I'd love to hear it):

ix = np.searchsorted(b[:, 0], a[:, 0]).clip(max=b.shape[0]-1)
out = np.hstack((a, np.where(b[ix, 0] == a[:, 0], ix, np.nan)[:, None]))

Out:

[[  1. 201.  nan]
 [  2. 202.  nan]
 [  3. 203.   0.]
 [  4. 204.  nan]
 [  5. 205.  nan]
 [  6. 206.   1.]
 [  7. 207.   2.]
 [  8. 208.  nan]
 [ 14. 214.   3.]
 [ 23. 223.  nan]]

Explanation:

The first line does most of the heavy-lifting work:

>>> np.searchsorted(b[:, 0], a[:, 0])
array([0, 0, 0, 1, 1, 1, 2, 3, 3, 5])

Except we clip it so that it doesn't exceed the number of rows in b. At that point, if b[ix, 0] matches a[:, 0], then ix is correct, otherwise select NaN:

ix = np.searchsorted(b[:, 0], a[:, 0]).clip(max=b.shape[0]-1)
b[ix, 0]
# array([ 3,  3,  3,  6,  6,  6,  7, 14, 14, 16])
a[:, 0]
# array([ 1,  2,  3,  4,  5,  6,  7,  8, 14, 23])

Addendum: timing info

Let:

  • v0(a, b) be the version proposed here (using searchsorted)
  • v1(a, b) be @anky's version based on isin and cumsum
  • v2(a, b) be @ShubhamSharma's version based on where(m.any(1),...)

Also, side note: observe that v1 is wrong when there are repeated values either in a or b (or both).

n = 100_000
m = 10_000

a = np.random.randint(0, 4, size=(n, 2)).cumsum(axis=0)
b = np.random.randint(0, 4, size=(m, 2)).cumsum(axis=0)

t0 = %timeit -o v0(a, b)
# 2.17 ms ± 855 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

t1 = %timeit -o v1(a, b)
# 4.34 ms ± 3.85 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

t2 = %timeit -o v2(a, b)
# 1.12 s ± 473 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)


>>> t1.best / t0.best
2.0

>>> t2.best / t0.best
519.1

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...