python - Can't make a script (created using multiprocessing) run faster

I've created a script in Python to fetch the links to different actors' pages from imdb.com, then parse the first three movie links from each actor's page, and finally scrape the names of the director and writer of those movies.

There are around 1,000 names in the list.

I've used the first ten names for this example.


However, the problem is that even when I use concurrency via concurrent.futures within the script, I don't see it running any faster.
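For reference, one quick way to confirm that is to time the whole run with time.perf_counter. This is a minimal sketch wrapping the same main loop as the script below (get_actor_list and get_movie_links are the functions defined there):

import time

start = time.perf_counter()
for elem in get_actor_list():        # same loop as in the script below
    get_movie_links(elem[0], elem[1])
print(f"Scraped ten actors in {time.perf_counter() - start:.1f}s")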

Website link: https://www.imdb.com/list/ls058011111/

To recap: the script first collects the actor links from the list page above, then parses the first three movie links from each actor's filmography page:

[screenshot: the filmography section of an actor's page, showing the movie links]

and then parses the names of the director and writer from each movie page:

[screenshot: the director and writer credits on a movie page]

This is what I've tried:

import requests
import concurrent.futures
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://www.imdb.com/list/ls058011111/'
base = 'https://www.imdb.com/'

def get_actor_list():
    # fetch the list page and yield (name, profile link) for the first ten actors
    res = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, "lxml")
    for name_links in soup.select(".mode-detail")[:10]:
        name = name_links.select_one("h3 > a").get_text(strip=True)
        item_link = urljoin(base, name_links.select_one("h3 > a").get("href"))
        yield name, item_link

def get_movie_links(name, link):
    # collect the first three movie links from the actor's filmography page
    itemvault = []
    r = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(r.text, "lxml")
    item_links = [urljoin(base, item.get("href")) for item in soup.select(".filmo-category-section .filmo-row > b > a[href]")[:3]]

    # a new pool is created for every actor, and it only ever holds three tasks
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = {executor.submit(get_content, url): url for url in item_links}
        for future in concurrent.futures.as_completed(future_to_url):
            itemvault.append(future.result())

    print(name, itemvault)

def get_content(url):
    # scrape the first listed director and writer from a movie page
    res = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, "lxml")
    try:
        director = soup.select_one("h4:contains('Director') ~ a").get_text(strip=True)
    except Exception:
        director = ""
    try:
        writer = soup.select_one("h4:contains('Writer') ~ a").get_text(strip=True)
    except Exception:
        writer = ""
    return director, writer

if __name__ == '__main__':
    # the outer loop over actors itself runs serially
    for elem in get_actor_list():
        get_movie_links(elem[0], elem[1])

The script produces results like these (as expected):

Robert De Niro [('', 'Anthony Thorne'), ('Jonathan Jakubowicz', 'Jonathan Jakubowicz'), ('Martin Scorsese', 'David Grann')]
Jack Nicholson [('Rob Reiner', 'Justin Zackham'), ('Casey Affleck', 'Casey Affleck'), ('James L. Brooks', 'James L. Brooks')]
Marlon Brando [('Peter Mitchell Rubin', 'Mario Puzo'), ('Bob Bendetson', 'Bob Bendetson'), ('Paul Hunter', 'Paul Hunter')]

How can I make the above script run faster?

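One plausible cause (an assumption, since no timings are shown) is that the outer loop over actors runs serially: each ThreadPoolExecutor is created for a single actor and only ever holds three tasks, so most of the run happens one actor at a time. The sketch below restructures the same steps around a single top-level thread pool, so up to ten actor pages are processed concurrently; helper names such as get_soup and get_movies are mine, not from the original script:

import requests
import concurrent.futures
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://www.imdb.com/list/ls058011111/'
base = 'https://www.imdb.com/'

def get_soup(link):
    res = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
    return BeautifulSoup(res.text, "lxml")

def get_content(link):
    # scrape the first listed director and writer from a movie page
    soup = get_soup(link)
    director = soup.select_one("h4:contains('Director') ~ a")
    writer = soup.select_one("h4:contains('Writer') ~ a")
    return (director.get_text(strip=True) if director else "",
            writer.get_text(strip=True) if writer else "")

def get_movies(actor):
    # one worker handles a whole actor: the filmography page plus three movie pages
    name, link = actor
    soup = get_soup(link)
    movie_links = [urljoin(base, a.get("href")) for a in
                   soup.select(".filmo-category-section .filmo-row > b > a[href]")[:3]]
    return name, [get_content(m) for m in movie_links]

if __name__ == '__main__':
    soup = get_soup(url)
    anchors = [n.select_one("h3 > a") for n in soup.select(".mode-detail")[:10]]
    actors = [(a.get_text(strip=True), urljoin(base, a.get("href"))) for a in anchors]
    # a single pool across all actors, instead of a new three-task pool per actor
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        for name, items in executor.map(get_movies, actors):
            print(name, items)

Since the job is network-bound, threads are the natural fit here; switching to ProcessPoolExecutor would mostly add process start-up and pickling overhead without making the requests themselves finish sooner.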

Asked by robots.txt, translated from Stack Overflow.

1 Answer

Awaiting a reply from an expert.
