
javascript - Scrapy crawling not working on ASPX website

I'm scraping the Madrid Assembly's website, which is built with ASPX, and I have no idea how to simulate clicks on the links that lead to the politicians of each group. I tried this:

import scrapy

class AsambleaMadrid(scrapy.Spider):

    name = "Asamblea_Madrid"
    start_urls = ['http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx']

    def parse(self, response):

        for id in response.css('div#moduloBusqueda div.sangria div.sangria ul li a::attr(id)'):
            target = id.extract()
            url = "http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx"

            formdata = {'__EVENTTARGET': target,
                     '__VIEWSTATE': '/wEPDwUBMA9kFgJmD2QWAgIBD2QWBAIBD2QWAgIGD2QWAmYPZBYCAgMPZBYCAgMPFgIeE1ByZXZpb3VzQ29udHJvbE1vZGULKYgBTWljcm9zb2Z0LlNoYXJlUG9pbnQuV2ViQ29udHJvbHMuU1BDb250cm9sTW9kZSwgTWljcm9zb2Z0LlNoYXJlUG9pbnQsIFZlcnNpb249MTQuMC4wLjAsIEN1bHR1cmU9bmV1dHJhbCwgUHVibGljS2V5VG9rZW49NzFlOWJjZTExMWU5NDI5YwFkAgMPZBYMAgMPZBYGBSZnXzM2ZWEwMzEwXzg5M2RfNGExOV85ZWQxXzg4YTEzM2QwNjQyMw9kFgJmD2QWAgIBDxYCHgtfIUl0ZW1Db3VudAIEFghmD2QWAgIBDw8WBB4PQ29tbWFuZEFyZ3VtZW50BTRHcnVwbyBQYXJsYW1lbnRhcmlvIFBvcHVsYXIgZGUgbGEgQXNhbWJsZWEgZGUgTWFkcmlkHgRUZXh0BTRHcnVwbyBQYXJsYW1lbnRhcmlvIFBvcHVsYXIgZGUgbGEgQXNhbWJsZWEgZGUgTWFkcmlkZGQCAQ9kFgICAQ8PFgQfAgUeR3J1cG8gUGFybGFtZW50YXJpbyBTb2NpYWxpc3RhHwMFHkdydXBvIFBhcmxhbWVudGFyaW8gU29jaWFsaXN0YWRkAgIPZBYCAgEPDxYEHwIFL0dydXBvIFBhcmxhbWVudGFyaW8gUG9kZW1vcyBDb211bmlkYWQgZGUgTWFkcmlkHwMFL0dydXBvIFBhcmxhbWVudGFyaW8gUG9kZW1vcyBDb211bmlkYWQgZGUgTWFkcmlkZGQCAw9kFgICAQ8PFgQfAgUhR3J1cG8gUGFybGFtZW50YXJpbyBkZSBDaXVkYWRhbm9zHwMFIUdydXBvIFBhcmxhbWVudGFyaW8gZGUgQ2l1ZGFkYW5vc2RkBSZnX2MxNTFkMGIxXzY2YWZfNDhjY185MWM3X2JlOGUxMTZkN2Q1Mg9kFgRmDxYCHgdWaXNpYmxlaGQCAQ8WAh8EaGQFJmdfZTBmYWViMTVfOGI3Nl80MjgyX2ExYjFfNTI3ZDIwNjk1ODY2D2QWBGYPFgIfBGhkAgEPFgIfBGhkAhEPZBYCAgEPZBYEZg9kFgICAQ8WAh8EaBYCZg9kFgQCAg9kFgQCAQ8WAh8EaGQCAw8WCB4TQ2xpZW50T25DbGlja1NjcmlwdAW7AWphdmFTY3JpcHQ6Q29yZUludm9rZSgnVGFrZU9mZmxpbmVUb0NsaWVudFJlYWwnLDEsIDEsICdodHRwOlx1MDAyZlx1MDAyZnd3dy5hc2FtYmxlYW1hZHJpZC5lc1x1MDAyZkVTXHUwMDJmUXVlRXNMYUFzYW1ibGVhXHUwMDJmQ29tcG9zaWNpb25kZWxhQXNhbWJsZWFcdTAwMmZMb3NEaXB1dGFkb3MnLCAtMSwgLTEsICcnLCAnJykeGENsaWVudE9uQ2xpY2tOYXZpZ2F0ZVVybGQeKENsaWVudE9uQ2xpY2tTY3JpcHRDb250YWluaW5nUHJlZml4ZWRVcmxkHgxIaWRkZW5TY3JpcHQFIVRha2VPZmZsaW5lRGlzYWJsZWQoMSwgMSwgLTEsIC0xKWQCAw8PFgoeCUFjY2Vzc0tleQUBLx4PQXJyb3dJbWFnZVdpZHRoAgUeEEFycm93SW1hZ2VIZWlnaHQCAx4RQXJyb3dJbWFnZU9mZnNldFhmHhFBcnJvd0ltYWdlT2Zmc2V0WQLrA2RkAgEPZBYCAgUPZBYCAgEPEBYCHwRoZBQrAQBkAhcPZBYIZg8PFgQfAwUPRW5nbGlzaCBWZXJzaW9uHgtOYXZpZ2F0ZVVybAVfL0VOL1F1ZUVzTGFBc2FtYmxlYS9Db21wb3NpY2lvbmRlbGFBc2FtYmxlYS9Mb3NEaXB1dGFkb3MvUGFnZXMvUmVsYWNpb25BbGZhYmV0aWNhRGlwdXRhZG9zLmFzcHhkZAICDw8WBB8DBQZQcmVuc2EfDgUyL0VTL0JpZW52ZW5pZGFQcmVuc2EvUGFnaW5hcy9CaWVudmVuaWRhUHJlbnNhLmFzcHhkZAIEDw8WBB8DBRpJZGVudGlmaWNhY2nDs24gZGUgVXN1YXJpbx8OBTQvRVMvQXJlYVVzdWFyaW9zL1BhZ2luYXMvSWRlbnRpZmljYWNpb25Vc3Vhcmlvcy5hc3B4ZGQCBg8PFgQfAwUGQ29ycmVvHw4FKGh0dHA6Ly9vdXRsb29rLmNvbS9vd2EvYXNhbWJsZWFtYWRyaWQuZXNkZAIlD2QWAgIDD2QWAgIBDxYCHwALKwQBZAI1D2QWAgIHD2QWAgIBDw8WAh8EaGQWAgIDD2QWAmYPZBYCAgMPZBYCAgUPDxYEHgZIZWlnaHQbAAAAAAAAeUABAAAAHgRfIVNCAoABZBYCAgEPPCsACQEADxYEHg1QYXRoU2VwYXJhdG9yBAgeDU5ldmVyRXhwYW5kZWRnZGQCSQ9kFgICAg9kFgICAQ9kFgICAw8WAh8ACysEAWQYAgVBY3RsMDAkUGxhY2VIb2xkZXJMZWZ0TmF2QmFyJFVJVmVyc2lvbmVkQ29udGVudDMkVjRRdWlja0xhdW5jaE1lbnUPD2QFKUNvbXBvc2ljacOzbiBkZSBsYSBBc2FtYmxlYVxMb3MgRGlwdXRhZG9zZAVHY3RsMDAkUGxhY2VIb2xkZXJUb3BOYXZCYXIkUGxhY2VIb2xkZXJIb3Jpem9udGFsTmF2JFRvcE5hdmlnYXRpb25NZW51VjQPD2QFGkluaWNpb1xRdcOpIGVzIGxhIEFzYW1ibGVhZJ',
                     '__EVENTVALIDATION': '/wEWCALIhqvYAwKh2YVvAuDF1KUDAqCK1bUOAqCKybkPAqCKnbQCAqCKsZEJAvejv84Dtkx5dCFr3QGqQD2wsFQh8nP3iq8',
                     '__VIEWSTATEGENERATOR': 'BAB98CB3',
                     '__REQUESTDIGEST': '0x476239970DCFDABDBBDF638A1F9B026BD43022A10D1D757B05F1071FF3104459B4666F96A47B4845D625BCB2BE0D88C6E150945E8F5D82C189B56A0DA4BC859D'}

            yield scrapy.FormRequest(url=url, formdata=formdata, callback=self.takeEachParty)

    def takeEachParty(self, response):
        print(response.css('ul.listadoVert02 ul li::text').extract())

Going into the source code of the website, I can see what the links look like and how they send the JavaScript query. This is one of the links I need to access:

<a id="ctl00_m_g_36ea0310_893d_4a19_9ed1_88a133d06423_ctl00_Repeater1_ctl00_lnk_Grupo" href="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(&quot;ctl00$m$g_36ea0310_893d_4a19_9ed1_88a133d06423$ctl00$Repeater1$ctl00$lnk_Grupo&quot;, &quot;&quot;, true, &quot;&quot;, &quot;&quot;, false, true))">Grupo Parlamentario Popular de la Asamblea de Madrid</a>

I have been reading many articles about this, but the problem is probably my ignorance of how ASP.NET postbacks work.

Thanks in advance.

EDITED:

SOLUTION: I finally did it! I translated the very helpful code from Padraic Cunningham into the Scrapy way. Since I asked the question specifically about Scrapy, I want to post the result in case someone else runs into the same problem I had.

So here it goes:

import scrapy
import js2xml

class AsambleaMadrid(scrapy.Spider):

    name = "AsambleaMadrid"
    start_urls = ['http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx']

    def parse(self, response):
        source = response
        hrefs = response.xpath("//*[@id='moduloBusqueda']//div[@class='sangria']/ul/li/a/@href").extract()
        form_data = self.validate(source)
        for ref in hrefs:
            # js2xml allows us to parse the JS function and its params, and so to grab the __EVENTTARGET
            js_xml = js2xml.parse(ref)
            _id = js_xml.xpath(
                "//identifier[@name='WebForm_PostBackOptions']/following-sibling::arguments/string[starts-with(.,'ctl')]")[0]
            form_data["__EVENTTARGET"] = _id.text

            url_diputado = 'http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx'
            # the proper way to send a POST in Scrapy is with a FormRequest
            yield scrapy.FormRequest(url=url_diputado, formdata=form_data, callback=self.extract_parties, method='POST')

    def validate(self, source):
        # these fields are the minimum required and cannot be hardcoded
        data = {"__VIEWSTATEGENERATOR": source.xpath("//*[@id='__VIEWSTATEGENERATOR']/@value")[0].extract(),
                "__EVENTVALIDATION": source.xpath("//*[@id='__EVENTVALIDATION']/@value")[0].extract(),
                "__VIEWSTATE": source.xpath("//*[@id='__VIEWSTATE']/@value")[0].extract(),
                "__REQUESTDIGEST": source.xpath("//*[@id='__REQUESTDIGEST']/@value")[0].extract()}
        return data

    def extract_parties(self, response):
        source = response
        name = source.xpath("//ul[@class='listadoVert02']/ul/li/a/text()").extract()
        print(name)
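
As a side note, depending on how the page's main form is laid out, Scrapy's FormRequest.from_response might save the validate() step: it reads the hidden inputs (__VIEWSTATE, __EVENTVALIDATION, etc.) from the response for you, so only __EVENTTARGET needs to be set by hand. I have not tested this variant, so take it as a sketch:

    # untested sketch: same parse() as above, but letting Scrapy collect the hidden fields
    def parse(self, response):
        hrefs = response.xpath("//*[@id='moduloBusqueda']//div[@class='sangria']/ul/li/a/@href").extract()
        for ref in hrefs:
            js_xml = js2xml.parse(ref)
            target = js_xml.xpath(
                "//identifier[@name='WebForm_PostBackOptions']/following-sibling::arguments/string[starts-with(.,'ctl')]")[0].text
            # from_response copies every hidden <input> of the page's form into the request,
            # so only the postback target has to be filled in explicitly
            yield scrapy.FormRequest.from_response(
                response,
                formdata={'__EVENTTARGET': target},
                callback=self.extract_parties,
                dont_click=True)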

I hope this is clear. Thanks again, everybody!

1 Answer


If you look at the data posted to the form in Chrome or Firebug, you can see that many fields are passed in the POST request. A few of them are essential and must be parsed from the original page. Parsing the ids from the div.sangria ul li a tags is not sufficient, because the value that actually gets posted is slightly different: it is the first argument of the JavaScript call WebForm_DoPostBackWithOptions, which lives in the href, not in the id attribute:

href='javascript:WebForm_DoPostBackWithOptions(new 
 WebForm_PostBackOptions("ctl00$m$g_36ea0310_893d_4a19_9ed1_88a133d06423$ctl00$Repeater1$ctl03$lnk_Grupo", "", true, "", "", false, true))'>

Sometimes all the underscores are simply replaced with dollar signs, so a str.replace is enough to build the correct target, but that is not really the case here. We could use a regex to parse the href, but I like the js2xml lib, which can parse a JavaScript function and its arguments into an XML tree.
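
For example, parsing a single href with js2xml gives an XML tree in which the arguments of the new WebForm_PostBackOptions(...) call can be selected with XPath (a small sketch using the href shown above):

import js2xml

href = ('javascript:WebForm_DoPostBackWithOptions(new '
        'WebForm_PostBackOptions("ctl00$m$g_36ea0310_893d_4a19_9ed1_88a133d06423$ctl00$Repeater1$ctl03$lnk_Grupo",'
        ' "", true, "", "", false, true))')

js_xml = js2xml.parse(href)
# the first string argument starting with 'ctl' is the value to post as __EVENTTARGET
target = js_xml.xpath(
    "//identifier[@name='WebForm_PostBackOptions']/following-sibling::arguments/string[starts-with(.,'ctl')]")[0]
print(target.text)
# -> ctl00$m$g_36ea0310_893d_4a19_9ed1_88a133d06423$ctl00$Repeater1$ctl03$lnk_Grupo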

The following code, using requests, shows how you can get the required data from the initial request and then reach all the pages you want:

import requests
from  lxml import html
import js2xml

post = "http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx"


def validate(xml):
    # these fields are the minimum required and cannot be hardcoded
    data = {"__VIEWSTATEGENERATOR": xml.xpath("//*[@id='__VIEWSTATEGENERATOR']/@value")[0],
            "__EVENTVALIDATION": xml.xpath("//*[@id='__EVENTVALIDATION']/@value")[0],
            "__VIEWSTATE": xml.xpath("//*[@id='__VIEWSTATE']/@value")[0],
            " __REQUESTDIGEST": xml.xpath("//*[@id='__REQUESTDIGEST']/@value")[0]}
    return data



with requests.Session() as s:
    # make the initial request to get the links/hrefs and the form fields
    r = s.get(
        "http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx")
    xml = html.fromstring(r.content)
    hrefs = xml.xpath("//*[@id='moduloBusqueda']//div[@class='sangria']/ul/li/a/@href")
    form_data = validate(xml)
    for h in hrefs:
        js_xml = js2xml.parse(h)
        _id = js_xml.xpath(
            "//identifier[@name='WebForm_PostBackOptions']/following-sibling::arguments/string[starts-with(.,'ctl')]")[
            0]
        form_data["__EVENTTARGET"] = _id.text
        r = s.post(post, data=form_data)
        xml = html.fromstring(r.content)
        print(xml.xpath("//ul[@class='listadoVert02']/ul/li/a/text()"))

If we run the code above, we see the text output from all the anchor tags:

In [2]: with requests.Session() as s:
   ...:         r = s.get(
   ...:             "http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx")
   ...:         xml = html.fromstring(r.content)
   ...:         hrefs = xml.xpath("//*[@id='moduloBusqueda']//div[@class='sangria']/ul/li/a/@href")
   ...:         form_data = validate(xml)
   ...:         for h in hrefs:
   ...:                 js_xml = js2xml.parse(h)
   ...:                 _id = js_xml.xpath(
   ...:                     "//identifier[@name='WebForm_PostBackOptions']/following-sibling::arguments/string[starts-with(.,'ctl')]")[
   ...:                     0]
   ...:                 form_data["__EVENTTARGET"] = _id.text
   ...:                 r = s.post(post, data=form_data)
   ...:                 xml = html.fromstring(r.content)
   ...:                 print(xml.xpath("//ul[@class='listadoVert02']/ul/li/a/text()"))
   ...:         
[u'Abo\xedn Abo\xedn, Sonsoles Trinidad', u'Adrados Gautier, M\xaa Paloma', u'Aguado Del Olmo, M\xaa Josefa', u'\xc1lvarez Padilla, M\xaa Nadia', u'Arribas Del Barrio, Jos\xe9 M\xaa', u'Ballar\xedn Valc\xe1rcel, \xc1lvaro C\xe9sar', u'Berrio Fern\xe1ndez-Caballero, M\xaa In\xe9s', u'Berzal Andrade, Jos\xe9 Manuel', u'Cam\xedns Mart\xednez, Ana', u'Carballedo Berlanga, M\xaa Eugenia', 'Cifuentes Cuencas, Cristina', u'D\xedaz Ayuso, Isabel Natividad', u'Escudero D\xedaz-Tejeiro, Marta', u'Fermosel D\xedaz, Jes\xfas', u'Fern\xe1ndez-Quejo Del Pozo, Jos\xe9 Luis', u'Garc\xeda De Vinuesa Gardoqui, Ignacio', u'Garc\xeda Mart\xedn, Mar\xeda Bego\xf1a', u'Garrido Garc\xeda, \xc1ngel', u'G\xf3mez Ruiz, Jes\xfas', u'G\xf3mez-Angulo Rodr\xedguez, Juan Antonio', u'Gonz\xe1lez Gonz\xe1lez, Isabel Gema', u'Gonz\xe1lez Jim\xe9nez, Bartolom\xe9', u'Gonz\xe1lez Taboada, Jaime', u'Gonz\xe1lez-Mo\xf1ux V\xe1zquez, Elena', u'Gonzalo L\xf3pez, Rosal\xeda', 'Izquierdo Torres, Carlos', u'Li\xe9bana Montijano, Pilar', u'Mari\xf1o Ortega, Ana Isabel', u'Moraga Valiente, \xc1lvaro', u'Mu\xf1oz Abrines, Pedro', u'N\xfa\xf1ez Guijarro, Jos\xe9 Enrique', u'Olmo Fl\xf3rez, Luis Del', u'Ongil Cores, M\xaa Gador', 'Ortiz Espejo, Daniel', u'Ossorio Crespo, Enrique Mat\xedas', 'Peral Guerra, Luis', u'P\xe9rez Baos, Ana Isabel', u'P\xe9rez Garc\xeda, David', u'Pla\xf1iol De Lacalle, Regina M\xaa', u'Redondo Alcaide, M\xaa Isabel', u'Roll\xe1n Ojeda, Pedro', u'S\xe1nchez Fern\xe1ndez, Alejandro', 'Sanjuanbenito Bonal, Diego', u'Serrano Guio, Jos\xe9 Tom\xe1s', u'Serrano S\xe1nchez-Capuchino, Alfonso Carlos', 'Soler-Espiauba Gallo, Juan', 'Toledo Moreno, Lucila', 'Van-Halen Acedo, Juan']
[u'Andaluz Andaluz, M\xaa Isabel', u'Ardid Jim\xe9nez, M\xaa Isabel', u'Carazo G\xf3mez, M\xf3nica', u'Casares D\xedaz, M\xaa Luc\xeda Inmaculada', u'Cepeda Garc\xeda De Le\xf3n, Jos\xe9 Carmelo', 'Cruz Torrijos, Diego', u'Delgado G\xf3mez, Carla', u'Franco Pardo, Jos\xe9 Manuel', u'Freire Campo, Jos\xe9 Manuel', u'Gabilondo Pujol, \xc1ngel', 'Gallizo Llamas, Mercedes', u"Garc\xeda D'Atri, Ana", u'Garc\xeda-Rojo Garrido, Pedro Pablo', u'G\xf3mez Montoya, Rafael', u'G\xf3mez-Chamorro Torres, Jos\xe9 \xc1ngel', u'Gonz\xe1lez Gonz\xe1lez, M\xf3nica Silvana', u'Leal Fern\xe1ndez, M\xaa Isaura', u'Llop Cuenca, M\xaa Pilar', 'Lobato Gandarias, Juan', u'L\xf3pez Ruiz, M\xaa Carmen', u'Manguan Valderrama, Eva M\xaa', u'Maroto Illera, M\xaa Reyes', u'Mart\xednez Ten, Carmen', u'Mena Romero, M\xaa Carmen', u'Moreno Navarro, Juan Jos\xe9', u'Moya Nieto, Encarnaci\xf3n', 'Navarro Lanchas, Josefa', 'Nolla Estrada, Modesto', 'Pardo Ortiz, Josefa Dolores', u'Quintana Viar, Jos\xe9', u'Rico Garc\xeda-Hierro, Enrique', u'Rodr\xedguez Garc\xeda, Nicol\xe1s', u'S\xe1nchez Acera, Pilar', u'Sant\xedn Fern\xe1ndez, Pedro', 'Segovia Noriega, Juan', 'Vicente Viondi, Daniel', u'Vinagre Alc\xe1zar, Agust\xedn']
['Abasolo Pozas, Olga', 'Ardanuy Pizarro, Miguel', u'Beirak Ulanosky, Jazm\xedn', u'Camargo Fern\xe1ndez, Ra\xfal', 'Candela Pokorna, Marco', 'Delgado Orgaz, Emilio', u'D\xedaz Rom\xe1n, Laura', u'Espinar Merino, Ram\xf3n', u'Espinosa De La Llave, Mar\xeda', u'Fern\xe1ndez Rubi\xf1o, Eduardo', u'Garc\xeda G\xf3mez, M\xf3nica', 'Gimeno Reinoso, Beatriz', u'Guti\xe9rrez Benito, Eduardo', 'Huerta Bravo, Raquel', u'L\xf3pez Hern\xe1ndez, Isidro', u'L\xf3pez Rodrigo, Jos\xe9 Manuel', u'Mart\xednez Abarca, Hugo', u'Morano Gonz\xe1lez, Jacinto', u'Ongil L\xf3pez, Miguel', 'Padilla Estrada, Pablo', u'Ruiz-Huerta Garc\xeda De Viedma, Lorena', 'Salazar-Alonso Revuelta, Cecilia', u'San Jos\xe9 P\xe9rez, Carmen', u'S\xe1nchez P\xe9rez, Alejandro', u'Serra S\xe1nchez, Isabel', u'Serra S\xe1nchez, Clara', 'Sevillano De Las Heras, Elena']
[u'Aguado Crespo, Ignacio Jes\xfas', u'\xc1lvarez Cabo, Daniel', u'Gonz\xe1lez Pastor, Dolores', u'Iglesia Vicente, M\xaa Teresa De La', 'Lara Casanova, Francisco', u'Marb\xe1n De Frutos, Marta', u'Marcos Arias, Tom\xe1s', u'Meg\xedas Morales, Jes\xfas Ricardo', u'N\xfa\xf1ez S\xe1nchez, Roberto', 'Reyero Zubiri, Alberto', u'Rodr\xedguez Dur\xe1n, Ana', u'Rubio Ruiz, Juan Ram\xf3n', u'Ruiz Fern\xe1ndez, Esther', u'Sol\xeds P\xe9rez, Susana', 'Trinidad Martos, Juan', 'Veloso Lozano, Enrique', u'Zafra Hern\xe1ndez, C\xe9sar']

You can add the exact same logic to your spider; I just used requests to show you a working example. You should also be aware that not every ASP.NET site behaves the same way: you may have to re-validate for every post, as in this related answer.
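
If that turns out to be necessary, the change is small: parse the hidden fields again from each response, so the next POST carries the freshly issued tokens. A sketch on top of the requests code above (reusing post and validate() from that snippet):

with requests.Session() as s:
    r = s.get(post)
    xml = html.fromstring(r.content)
    hrefs = xml.xpath("//*[@id='moduloBusqueda']//div[@class='sangria']/ul/li/a/@href")
    form_data = validate(xml)
    for h in hrefs:
        js_xml = js2xml.parse(h)
        _id = js_xml.xpath(
            "//identifier[@name='WebForm_PostBackOptions']/following-sibling::arguments/string[starts-with(.,'ctl')]")[0]
        form_data["__EVENTTARGET"] = _id.text
        r = s.post(post, data=form_data)
        xml = html.fromstring(r.content)
        print(xml.xpath("//ul[@class='listadoVert02']/ul/li/a/text()"))
        # the only change: refresh the hidden-field tokens from the page we just received
        form_data = validate(xml)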

