Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - How to extract element from a webpage with special class name?

I have a txt file filled with multiple URLs. Each URL is an article with text and its corresponding SDGs (example of one article 1)

The text of an article is in the tag 'div.text.-normal.content', inside 'p' elements, and the SDGs are in 'div.tax-section.text.-normal.small', inside 'span' elements.

To extract them I use the following code:

import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

data = []

with open('urls_news.txt', 'r') as inf:
    for row in inf:
        url = row.strip()
        response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})

        if response.ok:
            try:
                soup = BeautifulSoup(response.text, "html.parser")
                # select_one() returns None when nothing matches, so
                # .get_text() raises AttributeError for a wrong selector
                text = soup.select_one('div.text-normal').get_text(strip=True)
                topic = soup.select_one('div.tax-section').get_text(strip=True)

                data.append(
                    {
                        'text': text,
                        'topic': topic,
                    }
                )

                pd.DataFrame(data).to_excel('text_2.xlsx', index=False, header=True)

            except AttributeError:
                print(" ")

        # pause between requests (inside the loop, once per URL)
        time.sleep(3)

But I get no result. I previously used this code to extract the same type of information from another website with clearer class names. I've also tried "div.text.-normal.content" and "div.tax-section.text.-normal.small", with the same result.

I think the classes I'm using in this example are wrong. I would like to know what I've missed in these class names.

question from:https://stackoverflow.com/questions/66047922/how-to-extract-element-from-a-webpage-with-special-class-name

1 Answer

To select the text you can go with:

soup.select_one('div.text.-normal.content').get_text(strip=True)

I think the class names in your selector are wrong: where the class attribute has whitespace between names, chain the names with a . in the selector, one . for each whitespace.

or:

soup.select_one('div.c-single-content').get_text(strip=True)
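
To see why the chained form matches while 'div.text-normal' does not, here is a minimal sketch on an inline snippet. SAMPLE_HTML is hypothetical markup modelled on the structure described in the question (including the nesting inside 'div.c-single-content', which is an assumption), so the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, assumed from the question's description of the page.
SAMPLE_HTML = """
<div class="c-single-content">
  <div class="text -normal content">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class="text -normal content" holds three classes: text, -normal, content.
# Chaining them with dots requires the element to carry all three.
chained = soup.select_one("div.text.-normal.content")

# "div.text-normal" asks for ONE class literally named "text-normal",
# which this element does not have, so select_one() returns None.
fused = soup.select_one("div.text-normal")

# Join the paragraph strings with a space so they do not run together.
article_text = chained.get_text(" ", strip=True)
print(article_text)
print(fused)
```

Because select_one() returns None on a miss, calling .get_text() on the result raises AttributeError, which the except block in the question silently swallows.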

To get the topics as mentioned you can go with:

'^^'.join([topic.get_text(strip=True) for topic in soup.select_one('div.tax-section.text.-normal.small').select('a')])
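
The same join can be wrapped in a small helper that tolerates pages without a topic block. This is only a sketch: extract_topics and SAMPLE_HTML are hypothetical names, and the markup is assumed from the question's description rather than taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, assumed from the question's description of the page.
SAMPLE_HTML = """
<div class="tax-section text -normal small">
  <span><a href="#">No poverty</a></span>
  <span><a href="#">Climate action</a></span>
</div>
"""


def extract_topics(soup, separator="^^"):
    """Join the link texts of the topic block, or return '' if it is absent."""
    section = soup.select_one("div.tax-section.text.-normal.small")
    if section is None:
        # No topic block on this page: avoid AttributeError on None.
        return ""
    return separator.join(a.get_text(strip=True) for a in section.select("a"))


topics = extract_topics(BeautifulSoup(SAMPLE_HTML, "html.parser"))
missing = extract_topics(BeautifulSoup("<p>no topics</p>", "html.parser"))
print(topics)
```

Returning an empty string on a miss keeps the row in the DataFrame, so a page without topics is still recorded with its text.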
