Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

How do I extract an element, its sub-elements and the full path from XML in Python?

I would like to extract an element from an XML document, including its sub-elements and the full path of ancestors leading to it.

If this is my XML document:

<world>
    <countries>
        <country>
            <name>a</name>
            <description>a short description</description>
            <population>
                <now>250000</now>
                <2000>100000</2000>
            </population>
        </country>
        <country>
            <name>b</name>
            <description>b short description</description>
            <population>
                <now>350000</now>
                <2000>150000</2000>
            </population>
        </country>
    </countries>
</world>

I would like to end up with the following, based on the XPath expression //country[name="a"]:

<world>
    <countries>
        <country>
            <name>a</name>
            <description>a short description</description>
            <population>
                <now>250000</now>
                <2000>100000</2000>
            </population>
        </country>
    </countries>
</world>
question from:https://stackoverflow.com/questions/65948160/how-do-i-extract-an-element-sub-elements-and-the-full-path-from-xml-in-python


1 Answer

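This can be done with XPath and lxml: select every country element, then remove the ones you do not want, which leaves the matching element inside its original ancestors.

One caveat: one of the tags (<2000>) is not a valid XML name, because an element name must begin with a letter or an underscore, not a digit. If you have no control over the source, you have to rename the offending tag before parsing and then restore the original name after processing.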
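To make the caveat concrete, here is a minimal sketch of the rename-and-restore step using a strict XML parser. The helper names `escape_digit_tags` and `unescape_digit_tags` are made up for illustration, and the regular expressions assume that no real tag in the document already starts with an underscore followed by a digit and that a literal `<` never appears unescaped in text content.

```python
import re
from lxml import etree

def escape_digit_tags(xml_text):
    # Hypothetical helper: prefix tag names that start with a digit
    # with an underscore, e.g. <2000> -> <_2000>, </2000> -> </_2000>.
    return re.sub(r"<(/?)(\d)", r"<\1_\2", xml_text)

def unescape_digit_tags(xml_text):
    # Hypothetical helper: undo escape_digit_tags after serialising.
    return re.sub(r"<(/?)_(\d)", r"<\1\2", xml_text)

snippet = "<population><now>250000</now><2000>100000</2000></population>"

# A strict XML parser rejects the digit-leading tag name.
try:
    etree.fromstring(snippet)
    parsed_raw = True
except etree.XMLSyntaxError:
    parsed_raw = False

# After renaming, the same content parses and can be restored on output.
root = etree.fromstring(escape_digit_tags(snippet))
round_trip = unescape_digit_tags(etree.tostring(root, encoding="unicode"))
print(parsed_raw, round_trip == snippet)
```

Renaming only the tags, rather than every occurrence of the digits, avoids touching values that happen to contain the same characters.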

So, all together:

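import lxml.html as lh

countries = """[your XML above]"""
# <2000> is not a valid tag name, so rename only that tag before parsing
doc = lh.fromstring(countries.replace('<2000>', '<xxx>').replace('</2000>', '</xxx>'))

# Remove every <country> whose <name> is not 'a'; its ancestors stay in place
states = doc.xpath('//country')
for country in states:
    if country.xpath('./name/text()')[0] != 'a':
        country.getparent().remove(country)

# Serialise the pruned tree and restore the original tag name
print(lh.tostring(doc).decode().replace('xxx', '2000'))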

Output:

<world>
    <countries>
        <country>
            <name>a</name>
            <description>a short description</description>
            <population>
                <now>250000</now>
                <2000>100000</2000>
            </population>
        </country>
        </countries>
</world>
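The question asks for a result driven by an XPath expression such as //country[name="a"]. As a complement to the answer's loop, the following is a rough sketch of a more general variant, not the answer author's method: it keeps every element matched by an arbitrary XPath expression together with its ancestors and descendants, and removes everything else. The function names `prune_to_matches` and `extract_with_path` are hypothetical, it uses `lxml.etree` instead of `lxml.html`, and the digit-tag workaround assumes no real tag already starts with an underscore followed by a digit.

```python
import re
from lxml import etree

def prune_to_matches(root, xpath):
    """Remove every element that is not a match, an ancestor of a match,
    or a descendant of a match. Modifies the tree in place."""
    keep = set()
    for match in root.xpath(xpath):
        keep.add(match)
        keep.update(match.iterancestors())
        keep.update(match.iterdescendants())
    # Snapshot the elements first, because the tree changes while we remove.
    for elem in list(root.iter()):
        parent = elem.getparent()
        if elem not in keep and parent is not None:
            parent.remove(elem)
    return root

def extract_with_path(xml_text, xpath):
    # Assumption: tags starting with a digit (such as <2000>) are escaped
    # with a leading underscore so that the strict XML parser accepts them.
    safe = re.sub(r"<(/?)(\d)", r"<\1_\2", xml_text)
    root = prune_to_matches(etree.fromstring(safe), xpath)
    pruned = etree.tostring(root, encoding="unicode")
    # Restore the original tag names in the serialised output.
    return re.sub(r"<(/?)_(\d)", r"<\1\2", pruned)

xml_text = """<world>
    <countries>
        <country>
            <name>a</name>
            <population><now>250000</now><2000>100000</2000></population>
        </country>
        <country>
            <name>b</name>
            <population><now>350000</now><2000>150000</2000></population>
        </country>
    </countries>
</world>"""

result = extract_with_path(xml_text, '//country[name="a"]')
print(result)
```

If the expression matches nothing, this sketch leaves only the empty root element. Note also that removing an element with lxml drops the whitespace that follows it, so the indentation of the closing tags may shift slightly, as in the output above.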
