Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share

python - Iterate through multiple files and append text from HTML using Beautiful Soup

I have a directory of 46 downloaded HTML files, and I am trying to iterate through each of them, read its contents, strip the HTML, and append only the text to a text file. However, I'm not sure where I'm going wrong, as nothing gets written to my text file.

import os
import glob
from bs4 import BeautifulSoup
path = "/"
for infile in glob.glob(os.path.join(path, "*.html")):
        markup = (path)
        soup = BeautifulSoup(markup)
        with open("example.txt", "a") as myfile:
                myfile.write(soup)
                f.close()

-----update-----

I've updated my code as below, but the text file still doesn't get created.

import os
import glob
from bs4 import BeautifulSoup
path = "/"
for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(markup)
    with open("example.txt", "a") as myfile:
        myfile.write(soup)
        myfile.close()

-----update 2-----

Ah, I caught that I had my directory incorrect, so now I have:

import os
import glob
from bs4 import BeautifulSoup

path = r"c:\users\me\downloads"

for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(markup)

    with open("example.txt", "a") as myfile:
        myfile.write(soup)
        myfile.close()

When this is executed, I get this error:

Traceback (most recent call last):
  File "C:\Users\Me\Downloads\soup.py", line 11, in <module>
    myfile.write(soup)
TypeError: must be str, not BeautifulSoup

I fixed this last error by changing

myfile.write(soup)

to

myfile.write(soup.get_text())
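
To illustrate why that change removes the TypeError, here is a minimal sketch. The inline markup is a made-up stand-in for the contents of one downloaded file, and "html.parser" is simply Python's built-in parser, chosen here so that nothing beyond bs4 is assumed to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the contents of one downloaded file.
html = "<html><body><h1>Title</h1><p>Some text.</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# file.write() requires a str, and the soup object is not one.
soup_is_str = isinstance(soup, str)

# get_text() returns the document's text as a plain str with the tags removed.
text = soup.get_text()

# Optional: put each text fragment on its own line and strip surrounding whitespace.
text_lines = soup.get_text(separator="\n", strip=True)

print(soup_is_str)
print(text)
print(text_lines)
```

Without a separator, adjacent text fragments are joined directly, so passing separator="\n" may give more readable output when the text is appended to a file.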

-----update 3 ----

It's working properly now. Here's the working code:

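import os
import glob
from bs4 import BeautifulSoup

# Raw string, so the backslashes are not treated as escape sequences.
path = r"c:\users\me\downloads"

for infile in glob.glob(os.path.join(path, "*.html")):
    # Parse the file's contents; passing only the file name would parse the name itself.
    with open(infile, "r") as htmlfile:
        soup = BeautifulSoup(htmlfile.read(), "html.parser")
    # The with block closes the file, so no explicit close() is needed.
    with open("example.txt", "a") as myfile:
        myfile.write(soup.get_text())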

1 Answer

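You are not actually reading the HTML file: only its file name is being passed to BeautifulSoup. Open the file and pass its contents instead (here webpage is the path to the HTML file). This should work:

soup = BeautifulSoup(open(webpage, 'r').read(), 'lxml')

Note that the 'lxml' parser requires the lxml package to be installed.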
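
For completeness, here is one way the whole loop could look once the file is actually read. Treat it as a sketch rather than the definitive version: the function name append_html_text and its parameters are hypothetical, UTF-8 is an assumed encoding for both input and output, and "html.parser" is used instead of 'lxml' only so that no extra package is needed.

```python
import glob
import os

from bs4 import BeautifulSoup


def append_html_text(src_dir, out_path):
    """Append the text of every *.html file in src_dir to out_path.

    Hypothetical helper; returns the number of files processed.
    """
    count = 0
    # Open the output once in append mode instead of reopening it per file.
    with open(out_path, "a", encoding="utf-8") as out:
        # sorted() only makes the processing order deterministic.
        for infile in sorted(glob.glob(os.path.join(src_dir, "*.html"))):
            # Pass the open file (not its name) so the contents are parsed.
            # UTF-8 is an assumption; undecodable bytes are replaced.
            with open(infile, "r", encoding="utf-8", errors="replace") as fh:
                soup = BeautifulSoup(fh, "html.parser")
            out.write(soup.get_text())
            out.write("\n")  # keep the text of separate files apart
            count += 1
    return count
```

With the directory from the question, this would be called as append_html_text(r"c:\users\me\downloads", "example.txt"). Because the output is opened in append mode, running it twice adds the text a second time.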
