Try caching the stopwords object, as shown below. Constructing this each time you call the function seems to be the bottleneck.
from nltk.corpus import stopwords
cachedStopWords = stopwords.words("english")
def testFuncOld():
text = 'hello bye the the hi'
text = ' '.join([word for word in text.split() if word not in stopwords.words("english")])
def testFuncNew():
text = 'hello bye the the hi'
text = ' '.join([word for word in text.split() if word not in cachedStopWords])
if __name__ == "__main__":
for i in xrange(10000):
testFuncOld()
testFuncNew()
I ran this through the profiler: python -m cProfile -s cumulative test.py. The relevant lines are posted below.
nCalls Cumulative Time
10000 7.723 words.py:7(testFuncOld)
10000 0.140 words.py:11(testFuncNew)
So, caching the stopwords instance gives a ~70x speedup.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…