Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - How can multiple matches be applied in dataframe?

I have a performance issue with the following problem:

There is a dataframe of shape (3_000_000, 6), call it A, and another of shape (72_000, 6), call it B. To keep it simple, suppose both of them have only string columns.

In the B dataframe, some field values contain a ?, that is, a question mark inside the value. For example, the CITY column may hold New ?ork instead of New York. The task is to find the right string in the A dataframe.

So another example:

B Dataframe

CITY      ADDRESS  ADDRESS_TYPE
New ?ork  D?ck     str?et
question from:https://stackoverflow.com/questions/65927165/how-can-multiple-matches-be-applied-in-dataframe


1 Answer


As indicated by @dwhswenson, the only strategy that comes to mind is to reduce the size of the dataframes you test, which is not so much a programming problem as a data management problem. This is going to depend on your dataset and what kind of work you want to do, but one naive strategy would be to store the indices of rows in which column values start with 'a', 'b', etc., and then select a subset of the dataframe to match over based on the query string. So you'd do something like:

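import itertools

alphabet = 'abcdefghijklmnopqrstuvwxyz'
# One (column, letter) key for every column of A and every letter of the alphabet.
keys = list(itertools.product(A.columns, alphabet))
A_indexes = {}
for column, letter in keys:
    # Boolean mask: values in this column that start with the letter, ignoring case.
    starts_with_letter = A[column].str.lower().str.startswith(letter, na=False)
    # Store the row labels of the matching rows under a hashable tuple key.
    A_indexes[column, letter] = A[column][starts_with_letter].index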

Then make a function that takes a column name and a string to match and returns a view of A that contains only the rows whose entries for that column begin with the same letter as the string to match:

def get_view(column, string_to_match):
    # The key must be a (column, letter) tuple; a list is not hashable and cannot be a dict key.
    return A.loc[A_indexes[column, string_to_match[0].lower()]]

You'd have to double the number of indices of course for the case in which the first letter of the string to match is a wildcard, but this is only an example of the kind of thing you could do to slice the dataset before doing a regex on every row.
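
To make that last step concrete, here is a minimal sketch of how a ? pattern from B could be matched against the pre-sliced rows of A. It assumes that each ? stands for exactly one unknown character, which the question does not state explicitly. The tiny dataframe A and the helper names wildcard_to_regex and find_matches are made up for illustration. When the pattern itself starts with ?, the sketch simply falls back to scanning the whole column.

```python
import re
import pandas as pd

# Hypothetical stand-in for the 3_000_000-row dataframe A.
A = pd.DataFrame({
    "CITY": ["New York", "New Cork", "Newark", "Boston"],
    "ADDRESS": ["Dock", "Duck", "Deck", "Main"],
})

# Row labels of A grouped by (column, lowercase first letter).
A_indexes = {}
for column in A.columns:
    first_letters = A[column].str[:1].str.lower()
    for letter, rows in first_letters.groupby(first_letters).groups.items():
        A_indexes[column, letter] = rows


def wildcard_to_regex(pattern):
    # Assumption: each '?' matches exactly one arbitrary character.
    return "".join("." if ch == "?" else re.escape(ch) for ch in pattern)


def find_matches(column, pattern):
    first = pattern[:1].lower()
    if first in ("?", ""):
        # No usable first letter, so every row is a candidate.
        candidates = A[column]
    else:
        rows = A_indexes.get((column, first))
        if rows is None:
            return []
        candidates = A.loc[rows, column]
    # The regex only runs on the pre-sliced candidates.
    mask = candidates.str.fullmatch(wildcard_to_regex(pattern))
    return candidates[mask].tolist()


print(find_matches("CITY", "New ?ork"))
print(find_matches("ADDRESS", "D?ck"))
```

With this toy data, New ?ork matches both New York and New Cork, so a single wildcard field can still be ambiguous; combining the matches across several columns of the same B row would be needed to narrow it down.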


If you want to work with the unique ID above, you could make a more complex index lookup dictionary for views into A:

import itertools

alphabet = 'abcdefghijklmnopqrstuvwxyz'
number_of_fields_in_ID = 6
separator = ' - '

# Every possible combination of first letters, one letter per ID field.
keys = itertools.product(alphabet, repeat=number_of_fields_in_ID)

def check_ID(unique_id, key):
    # Split the ID into its fields and compare each field's first letter with the key.
    fields = unique_id.split(separator)
    for i, first_letter in enumerate(key):
        if fields[i][0].lower() != first_letter:
            return False
    return True

A_indexes = {}
for k in keys:
    # Row labels of A whose unique_ID matches this combination of first letters.
    A_indexes[k] = A.loc[A['unique_ID'].apply(check_ID, args=(k,))].index

This will apply the kludgy function check_ID to your 3e6 element series A['unique_ID'] 26**number_of_fields_in_ID times, indexing the dataframe once per iteration (this is more than 4.5e5 iterations if the ID has 4 fields), so there may be significant up-front cost in compute time depending on your dataset. Whether or not this is worth it depends first on what you want to do afterwards (are you just doing a single set of queries to make a second dataset and you're done, or do you need to do a lot of arbitrary queries in the future?) and second on whether the first letters of each ID field are roughly evenly distributed over the alphabet (or decimal digits if you know a field is numeric, or both if it's alphanumeric). If you just have two or three cities, for instance, you wouldn't make indexes for the whole alphabet in that field. But again, doing the lookup this way is naive - if you go this route you'd come up with a lookup method that's tailored to your data.
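
As one possible way to avoid looping over every combination of letters, the lookup could be built in a single pass: compute each row's key (the first letter of every ID field) once and collect the row labels per key. This is a different technique from the loop above, not a faster version of the same code, and it only creates entries for keys that actually occur in the data. The toy unique_ID values and the helper name first_letter_key are invented for the example; the ' - ' separator is taken from the snippet above.

```python
import pandas as pd

separator = " - "

# Hypothetical stand-in for A; only the unique_ID column matters here.
A = pd.DataFrame({
    "unique_ID": [
        "New York - Dock - street",
        "New Cork - Duck - street",
        "Boston - Main - avenue",
        "Newark - Deck - road",
    ]
})


def first_letter_key(unique_id):
    # Tuple holding the lowercase first letter of every field of the ID.
    return tuple(field[:1].lower() for field in unique_id.split(separator))


# Single pass over the column: append each row label to the list for its key.
A_indexes = {}
for row_label, unique_id in A["unique_ID"].items():
    A_indexes.setdefault(first_letter_key(unique_id), []).append(row_label)

# Rows whose three fields start with 'n', 'd' and 's' respectively.
view = A.loc[A_indexes[("n", "d", "s")]]
print(view)
```

The cost here is one pass over the rows of A, instead of one pass per key, and the dictionary stays small when the first letters are concentrated on a few values.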

