Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - How to associate specific rows coming from webscraping in a CSV file?

I'm currently working on a web-scraping process in Python, specifically with the BeautifulSoup package, to extract the text and topics of each article on each page of a website.

For each article, I would like to combine all the extracted text into one single string and associate it with a string containing the topic(s). The goal is to repeat this process for every article and obtain a CSV file with a Text column and a Topic column, where each row represents one article.

Texts = []
Topics = []


with open('urls.txt', 'r') as inf:
    with open('text_file.csv', 'w') as outf:
        outf.write('Text, labels\n')
        for row in inf:
            url = row.strip()
            response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
            if response.ok:
                soup = BeautifulSoup(response.text,'lxml')
                txt = soup.findAll('div', {'class': 'para_content_text'})
                for div in txt:
                    p = div.findAll('p')
                    Texts.append(p)
                for result in Texts:
                    for item in result:
                        full_text = ' '.join([item.text for result in Texts for item in result])
                       

                        
            top = soup.find('div', {'class': 'article_tags_topics'})
            a = top.findAll('a')
            Topics.append(a)
            for res in Topics:
                for it in res :
                    full_topic = ' '.join([it.text for res in Topics])
            
            outf.write(full_text.replace(',','') + ',' + full_topic + '\n')

But after running my code, I get text cells that are repeated several times, because each repetition is associated with a different topic. The topics themselves are also repeated (see the attached screenshot for a better idea).

How can I avoid these repeated lines?

Question from: https://stackoverflow.com/questions/65885926/how-to-associate-specific-rows-coming-from-webscraping-in-a-csv-file


1 Answer


The code has a lot of nested loops and is hard to read. I would recommend creating a dict for each article and appending it to a list.

The advantage is that you can process the data more easily afterwards. I wrote the CSV with pandas because I like to analyse my data before writing it to a file, but you can also write it your own way.
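If you would rather write the file your own way without pandas, a minimal sketch with the standard-library csv module could look like the following. The rows in `data` are made-up placeholders, `write_rows` is a hypothetical helper name, and the in-memory buffer only stands in for a real file.

```python
import csv
import io

# Placeholder records (invented for illustration): one dict per article.
data = [
    {"text": "First article, with a comma.", "topic": "Water", "tags": "Pollution"},
    {"text": "Second article.", "topic": "Ecosystems and biodiversity", "tags": "Biodiversity"},
]


def write_rows(rows, fileobj):
    """Write one CSV row per dict; the csv module quotes fields that contain commas."""
    writer = csv.DictWriter(fileobj, fieldnames=["text", "topic", "tags"])
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the sketch self-contained.
buffer = io.StringIO()
write_rows(data, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

For a real file, something like `open('text.csv', 'w', newline='', encoding='utf-8')` can be passed instead of the buffer. Because the writer quotes fields itself, there is no need to strip commas from the text first.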

Example

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {"user-agent": "Mozilla/5.0"}
urlList = [
    'https://www.unep.org/news-and-stories/story/how-monitoring-sewage-could-prevent-return-coronavirus',
    'https://www.unep.org/news-and-stories/story/caribbean-wrestles-mischievous-invaders-monkeys',
]
data = []  # one dict per article

for url in urlList:
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, "html.parser")

    # select_one returns the first matching element; get_text collects all text inside it
    text = soup.select_one('div.para_content_text').get_text(strip=True)
    topic = soup.select_one('div.article_tags_topics').get_text(strip=True)
    tags = soup.select_one('div.article_tags_tags').get_text(strip=True)

    # Build the record fresh on every iteration so nothing carries over between articles
    data.append({'text': text, 'topic': topic, 'tags': tags})

# Each dict becomes one row of the CSV
pd.DataFrame(data).to_csv('text.csv', index=False, header=True)

Output

text,topic,tags
"In their efforts to stave off a second wave of COVID-19, scientists from around the world have turned to a new ally: sewage.In the United Kingdom, the Netherlands and Spain, researchers are poring over samples of wastewater for signs of the coronavirus, which is believed to be shed in human feces.Given that many people with the virus are asymptomatic and will not be tested for the disease, scientists say sewage could act like a COVID-19 early warning system.For more information, seeUNEP’s factsheet on COVID-19, wastewater and sewage",Water,PollutionCovid-19Health
"As iconic as the islands’ pristine beaches and tropical forests, the 60,000-plus green monkeys of St. Kitts and Nevis are a quintessential part of the Caribbean experience for many visitors.But while these photogenic mischief-makers might charm tourists, they pose serious threats to the twin-island Federation. Likely first brought to the islands from West Africa as exotic pets by European settlers in the 17th century, today the monkeys are putting pressure on native species, decimating crops and consistently evading efforts to scare them off.Tackling the Caribbean’s iconic invaders“Feral animals, particularly monkeys and wild pigs, cause considerable yield loss to food production each year,” says Melvin James, St. Kitts and Nevis’ Director of Agriculture. “In 2018, crude estimates indicated that a total of 90 metric tons of food—one month’s production—was rendered unmarketable due to feral animal invasion of farms on St. Kitts alone.”Located in the Eastern Caribbean, like many tropical islands, St. Kitts and Nevis are rich in biodiversity. But many species are fragile and susceptible to outside threats, including invasive animals.The United Nations Environment Programme and partners are working with the Government of St. Kitts and Nevis to research the impact of green monkeys on biodiversity, agriculture, tourism, and households. Backed bythe Global Environment Facility, the program, formally known as thePreventing COSTS of Invasive Alien Species in Barbados and the OECS Countriesproject, will also develop a sustainable plan to manage the green monkey population.Naitram Ramnanan, Regional Representative for project partnerthe Centre for Agriculture and Biosciences International(CABI), says green monkeys are becoming increasingly problematic in the region.“In Barbados, for example, they are also a significant agricultural pest,” Ramnanan said. 
“While they are present in the wild in other islands, they are not yet a serious pest but they are highly likely to become one.”The sustainable management plan that will be developed in St. Kitts and Nevis will also be replicated in Barbados and other islands.Dr. Kerry Dore is leading efforts to understand the broader impacts of green monkeys on St. Kitts and Nevis’s environment and economy. Photo by Rondell Williams.Eyes on the wildThe leader of the research team is Dr. Kerry M. Dore, a biological anthropologist with expertise in human-primate interactions.Dr. Dore’s team has already monitored crop losses across 65 randomly selected farms on St. Kitts and is currently monitoring losses on 26 farms and 22 backyard gardens, alongside conducting surveys to gauge the economic toll of green monkeys on agriculture.Having already discovered the primates have an appetite for a wide range of native fauna, including West Indian tree ferns, opuntia cacti, bromeliads, heliconias, and philodendron, the researchers are now planning to gauge the toll of the monkeys on the Federation’s bird population. By mimicking the nesting behavior of locally important bird species across a wide range of habitats with quail eggs planted in fake nests the team hopes to measure the scale and pattern of the monkeys’ predation.“Broadly speaking, we know that invasive species are the number one threat to biodiversity on islands,” Dr. Dore says. “Our goal for this portion of the project is to obtain the information the government needs to make informed management decisions that will benefit the environmental health of the Federation.”With data from the ongoing research to be used to assess the economic impact of green monkeys on St. Kitts and Nevis’ agricultural sector and biodiversity, work to evaluate the monkeys’ impact on tourism and households will begin in the fall of 2020.Camera traps are helping scientists capture the impact of monkeys on the island’s biodiversity. 
Photo by Rondell WilliamsA regional approach to monkey managementRegional collaboration with Barbados is ongoing and includes a citizen-science initiative to determine the monkeys’ range and population density on that island. The program’s ultimate goal is to create a scientifically-informed strategy to manage green monkeys and limit their impact on agriculture, biodiversity, tourism, and households around the Caribbean.UNEP biodiversity expert Christopher Cox said the spread of invasive species, along with ecosystem degradation due to human activity, is one of the greatest threats to flora and fauna in the region.“With the ever-increasing trade and movement of people through borders, the risk of introduction of harmful exotic species will remain high,” said Cox. “But by working together on a science-based, humane approach, we hope monkeys, people and native species can coexist—and the amazing biodiversity of this region can continue to thrive.”To learn more about thePreventing COSTS of Invasive Alien Species in Barbados and the OECS Countries projectand UNEP’s work in Biodiversity,contactChristopher Cox",Ecosystems and biodiversity,Global Environment FacilityBiodiversity
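
In the output above, neighbouring tags run together ("PollutionCovid-19Health") and some words are glued across element boundaries ("seeUNEP’s"), because get_text(strip=True) joins the pieces with no separator. One possible adjustment, sketched here on made-up HTML rather than the real pages, is to pass a separator to get_text and to join the link texts explicitly. It assumes, as in the question's code, that each topic is an a element inside div.article_tags_topics; `extract_fields` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the class names used above; not taken from the real pages.
SAMPLE_HTML = """
<div class="para_content_text">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
<div class="article_tags_topics">
  <a href="#">Water</a>
  <a href="#">Health</a>
</div>
"""


def extract_fields(html):
    """Return one record for one article (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    text_div = soup.select_one("div.para_content_text")
    topics_div = soup.select_one("div.article_tags_topics")

    # A separator keeps neighbouring paragraphs from being glued together.
    text = text_div.get_text(" ", strip=True) if text_div else ""
    # Join the link texts explicitly so each topic stays distinguishable.
    topics = (
        "; ".join(a.get_text(strip=True) for a in topics_div.select("a"))
        if topics_div
        else ""
    )
    return {"text": text, "topics": topics}


record = extract_fields(SAMPLE_HTML)
print(record)
```

Falling back to an empty string when a div is missing also avoids an AttributeError on pages that lack one of the blocks.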
