I'm using a BERT tokenizer over a large dataset of sentences (2.3M lines, 6.53bn words):
# creating a BERT tokenizer
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
                                          do_lower_case=True)

# encoding the data using our tokenizer
encoded_dict = tokenizer.batch_encode_plus(
    df[df.data_type == 'train'].comment.values,
    add_special_tokens=True,
    return_attention_mask=True,
    pad_to_max_length=True,
    max_length=256,
    return_tensors='pt'
)
As-is, it runs on the CPU and only on one core. I tried to parallelize, but that only speeds processing up by about 16x on my 16-core CPU, which would still take ages to tokenize the full dataset.
Is there any way to make it run on GPU or to speed this up some other way?
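(To illustrate what I mean by parallelizing, here is a rough multiprocessing sketch along these lines; tokenize_chunk, the chunk count, and the pool size are just placeholders, not my actual code.)

import numpy as np
from multiprocessing import Pool
from transformers import BertTokenizer

def tokenize_chunk(texts):
    # each worker loads its own tokenizer and encodes its chunk
    tok = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
    return tok.batch_encode_plus(
        list(texts),
        add_special_tokens=True,
        return_attention_mask=True,
        pad_to_max_length=True,
        max_length=256,
        return_tensors='pt'
    )

if __name__ == '__main__':
    texts = df[df.data_type == 'train'].comment.values
    chunks = np.array_split(texts, 16)      # one chunk per core
    with Pool(processes=16) as pool:
        encoded_chunks = pool.map(tokenize_chunk, chunks)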
EDIT:
I have also tried using a fast tokenizer:
# creating a BERT fast tokenizer
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased',
                                              do_lower_case=True)
Then I pass the same data to its batch_encode_plus:
# encoding the data using our tokenizer
encoded_dict = tokenizer.batch_encode_plus(
    df[df.data_type == 'train'].comment.values,
    add_special_tokens=True,
    return_attention_mask=True,
    pad_to_max_length=True,
    max_length=256,
    return_tensors='pt'
)
But batch_encode_plus raises the following error:
TypeError: batch_text_or_text_pairs has to be a list (got <class 'numpy.ndarray'>)
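If I read the error right, the fast tokenizer's batch_encode_plus expects a plain Python list rather than the numpy array that .values returns, so something like this (untested beyond a small sample) should at least get past the type check:

# convert the ndarray to a plain Python list before encoding
texts = df[df.data_type == 'train'].comment.values.tolist()

encoded_dict = tokenizer.batch_encode_plus(
    texts,
    add_special_tokens=True,
    return_attention_mask=True,
    pad_to_max_length=True,
    max_length=256,
    return_tensors='pt'
)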