Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
540 views
in Technique[技术] by (71.8m points)

regex - Creating a custom categorized corpus in NLTK and Python

I'm experiencing a bit of a problem which has to do with regular expressions and CategorizedPlaintextCorpusReader in Python.

I want to create a custom categorized corpus and train a Naive-Bayes classifier on it. My issue is the following: I want to have two categories, "pos" and "neg". The positive files are all in one directory, main_dir/pos/*.txt, and the negative ones are in a separate directory, main_dir/neg/*.txt.

How can I use the CategorizedPlaintextCorpusReader to load and label all the positive files in the pos directory, and do the same for the negative ones?

NB: The setup is absolutely the same as the Movie_reviews corpus (~nltk_datacorporamovie_reviews).

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Here is the answer to my question. Since I was thinking about using two cases I think it's good to cover both in case someone needs the answer in the future. If you have the same setup as the movie_review corpus - several folders labeled in the same way you would like your labels to be called and containing the training data you can use this.

reader = CategorizedPlaintextCorpusReader('~/MainFolder/', r'.*.txt', cat_pattern=r'(w+)/*')

The other approach that I was considering is putting everything in a single folder and naming the files 0_neg.txt, 0_pos.txt, 1_neg.txt etc. The code for your reader should look something like:

reader = CategorizedPlaintextCorpusReader('~/MainFolder/', r'.*.txt', cat_pattern=r'd+_(w+).txt')

I hope that this would help someone in the future.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

2.1m questions

2.1m answers

60 comments

56.8k users

...