I have a txt file filled with multiple URLs. Each URL is an article with text and its corresponding SDGs (example of one article).
The text parts of an article are in 'div.text.-normal.content' tags and then in 'p'.
The SDGs are in 'div.tax-section.text.-normal.small' and then in 'span'.
To extract them I use the following lines of code:
    import time

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    data = []
    with open('urls_news.txt', 'r') as inf:
        for row in inf:
            url = row.strip()
            response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
            if response.ok:
                try:
                    soup = BeautifulSoup(response.text, "html.parser")
                    # select_one() returns None when nothing matches, so
                    # .get_text() then raises AttributeError
                    text = soup.select_one('div.text-normal').get_text(strip=True)
                    topic = soup.select_one('div.tax-section').get_text(strip=True)
                    data.append(
                        {
                            'text': text,
                            'topic': topic,
                        }
                    )
                    pd.DataFrame(data).to_excel('text_2.xlsx', index=False, header=True)
                except AttributeError:
                    print(" ")
                # pause between requests
                time.sleep(3)
But I get no result. I previously used this code to extract the same type of information from another website with clearer class names. I've also tried "div.text.-normal.content" and "div.tax-section.text.-normal.small", with the same result.
I think the class names I'm calling in this example are wrong. I would like to know what I've missed in these class names.
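To check the selectors in isolation, here is a minimal sketch run against a hypothetical HTML sample. The sample markup, the `SAMPLE_HTML` name and the `extract_article` helper are assumptions made for illustration and are not taken from the real page. It assumes the class attributes are literally "text -normal content" and "tax-section text -normal small", that is, several space-separated classes, one of which starts with a hyphen. Under that assumption, 'div.text-normal' looks for a single class named "text-normal" and matches nothing, while the chained form matches:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the structure described above (not the real page).
SAMPLE_HTML = """
<div class="text -normal content">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
<div class="tax-section text -normal small">
  <span>SDG 1</span>
  <span>SDG 13</span>
</div>
"""


def extract_article(html):
    """Return the article text and the list of SDG labels found in html."""
    soup = BeautifulSoup(html, "html.parser")
    # Each ".name" in the selector is one class; "-normal" is its own class.
    paragraphs = [
        p.get_text(strip=True)
        for p in soup.select("div.text.-normal.content p")
    ]
    sdgs = [
        span.get_text(strip=True)
        for span in soup.select("div.tax-section.text.-normal.small span")
    ]
    return {"text": " ".join(paragraphs), "topic": sdgs}


if __name__ == "__main__":
    print(extract_article(SAMPLE_HTML))
    # The single-class selector from the question finds nothing in this sample:
    print(BeautifulSoup(SAMPLE_HTML, "html.parser").select_one("div.text-normal"))
```

If the chained selectors match on a sample like this but not on the downloaded page, it may be worth printing response.text to see whether the HTML returned by requests actually contains those elements, since it can differ from what the browser shows.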
question from:
https://stackoverflow.com/questions/66047922/how-to-extract-element-from-a-webpage-with-special-class-name