I'm using a BERT tokenizer over a large dataset of sentences (2.3M lines, 6.53bn words):
# creating a BERT tokenizer
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
                                          do_lower_case=True)

# encoding the data using our tokenizer
encoded_dict = tokenizer.batch_encode_plus(
    df[df.data_type == 'train'].comment.values,
    add_special_tokens=True,
    return_attention_mask=True,
    pad_to_max_length=True,
    max_length=256,
    return_tensors='pt'
)
As-is, it runs on the CPU and only on one core. I tried to parallelize, but that only speeds processing up by about 16x on my 16-core CPU, which would still take ages to tokenize the full dataset.
Is there any way to make it run on GPU or to speed this up some other way?
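(To illustrate what I mean by parallelizing, here is a rough multiprocessing sketch along these lines; tokenize_chunk, the chunk count, and the pool size are just placeholders, not my actual code.)

import numpy as np
from multiprocessing import Pool
from transformers import BertTokenizer

def tokenize_chunk(texts):
    # each worker loads its own tokenizer and encodes its chunk
    tok = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
    return tok.batch_encode_plus(
        list(texts),
        add_special_tokens=True,
        return_attention_mask=True,
        pad_to_max_length=True,
        max_length=256,
        return_tensors='pt'
    )

if __name__ == '__main__':
    texts = df[df.data_type == 'train'].comment.values
    chunks = np.array_split(texts, 16)      # one chunk per core
    with Pool(processes=16) as pool:
        encoded_chunks = pool.map(tokenize_chunk, chunks)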
EDIT:
I have also tried using a fast tokenizer:
# creating a BERT fast tokenizer
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased',
                                              do_lower_case=True)
Then I pass the same data to its batch_encode_plus:
# encoding the data using our tokenizer
encoded_dict = tokenizer.batch_encode_plus(
    df[df.data_type == 'train'].comment.values,
    add_special_tokens=True,
    return_attention_mask=True,
    pad_to_max_length=True,
    max_length=256,
    return_tensors='pt'
)
But batch_encode_plus raises the following error:
TypeError: batch_text_or_text_pairs has to be a list (got <class 'numpy.ndarray'>)
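If I read the error right, the fast tokenizer's batch_encode_plus expects a plain Python list rather than the numpy array that .values returns, so something like this (untested beyond a small sample) should at least get past the type check:

# convert the ndarray to a plain Python list before encoding
texts = df[df.data_type == 'train'].comment.values.tolist()

encoded_dict = tokenizer.batch_encode_plus(
    texts,
    add_special_tokens=True,
    return_attention_mask=True,
    pad_to_max_length=True,
    max_length=256,
    return_tensors='pt'
)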