python - How to extract element from a webpage with special class name?

Question

asked Oct 6, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have a txt file filled with multiple URLs; each URL is an article with text and its corresponding SDGs (example of one article).

The text parts of an article are in the tags 'div.text.-normal.content' and then in 'p'. The SDGs are in 'div.tax-section.text.-normal.small' and then in 'span'.

To extract them I use the following lines of code :

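import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

data = []

with open('urls_news.txt', 'r') as inf:
    for row in inf:
        url = row.strip()
        response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})

        if response.ok:
            try:
                soup = BeautifulSoup(response.text, "html.parser")
                # select_one() returns None when nothing matches,
                # and .get_text() on None raises AttributeError
                text = soup.select_one('div.text-normal').get_text(strip=True)
                topic = soup.select_one('div.tax-section').get_text(strip=True)

                data.append(
                    {
                        'text': text,
                        'topic': topic,
                    }
                )

                pd.DataFrame(data).to_excel('text_2.xlsx', index=False, header=True)

            except AttributeError:
                # report the failing URL instead of printing a blank line
                print(f"No matching element for {url}")

        # pause between requests (inside the loop, once per URL)
        time.sleep(3)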

But I get no result. I've previously used this code to extract the same type of information from another website with clearer class names. I've also tried "div.text.-normal.content" and "div.tax-section.text.-normal.small", with the same result.

I think the classes I'm calling in this example are wrong. I would like to know what I've missed in these class names.

question from:https://stackoverflow.com/questions/66047922/how-to-extract-element-from-a-webpage-with-special-class-name

1 Answer

深蓝 · Answer 1 · 2021-10-06T03:16:04+0000

To select the text you can go with:

soup.select_one('div.text.-normal.content').get_text(strip=True)

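I think the problem is with the class names: a class attribute can hold several classes separated by whitespace, so in the selector just chain them with a . for every whitespace between them. A selector like 'div.text-normal' instead looks for a single class literally named text-normal.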

or:

soup.select_one('div.c-single-content').get_text(strip=True)
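
To make the difference between the two selectors concrete, here is a minimal sketch. The HTML string is a hypothetical stand-in for the real page (the actual markup may differ); it only assumes a div whose class attribute is "text -normal content", as the selectors above suggest.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure described in the question.
html = """
<div class="text -normal content"><p>First paragraph.</p><p>Second paragraph.</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Looks for ONE class literally named "text-normal": no such class here.
wrong = soup.select_one("div.text-normal")

# Requires the three classes "text", "-normal" and "content" on the same div.
right = soup.select_one("div.text.-normal.content")

print(wrong)  # None, which is why .get_text() raised AttributeError
article_text = right.get_text(" ", strip=True)
print(article_text)
```

Passing " " as the separator to get_text() keeps a space between paragraphs, which plain get_text(strip=True) would glue together.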

To get the topics as mentioned you can go with:

'^^'.join([topic.get_text(strip=True) for topic in soup.select_one('div.tax-section.text.-normal.small').select('a')])
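
The same idea can be wrapped in a small helper that does not crash when the section is missing. This is only a sketch: extract_topics is a hypothetical name, and the HTML is made-up sample markup assuming the topics are links inside the 'tax-section' div.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: topic links nested inside the multi-class div.
html = """
<div class="tax-section text -normal small">
  <span><a href="#">SDG 3</a></span>
  <span><a href="#">SDG 13</a></span>
</div>
"""


def extract_topics(soup, sep="^^"):
    """Join the text of every link in the topic section, or return None."""
    section = soup.select_one("div.tax-section.text.-normal.small")
    if section is None:
        # Nothing matched: let the caller decide how to handle it.
        return None
    return sep.join(a.get_text(strip=True) for a in section.select("a"))


topics = extract_topics(BeautifulSoup(html, "html.parser"))
missing = extract_topics(BeautifulSoup("<p>no topics</p>", "html.parser"))

print(topics)
print(missing)
```

Returning None instead of raising lets the scraping loop skip or log an article without wrapping every lookup in try/except.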
