c# - How to get an HtmlElement value inside Frames/IFrames?

Question

asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

I'm using the Winforms WebBrowser control to collect the links of video clips from the site linked below.

LINK

But when I go through the page element by element, I cannot find the <video> tag.

void webBrowser_DocumentCompleted_2(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    try
    {
        HtmlElementCollection pTags = browser.Document.GetElementsByTagName("video");
        int i = 1;
        // Iterate the collection that was just retrieved
        foreach (HtmlElement link in pTags)
        {
            // Guard against <video> elements that have no children
            if (link.Children.Count > 0 &&
                link.Children[0].GetAttribute("className") == "vjs-poster")
            {
                i++;
            }
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}

Right after this call:

HtmlElementCollection pTags = browser.Document.GetElementsByTagName("video");

the returned collection is already empty (its Count is 0).

Do I need to make an AJAX call?

1 Answer

深蓝 · Answer 1 · 2021-10-16T22:26:35+0000

The Web page you linked contains IFrames.
An IFrame contains its own HtmlDocument. As of now, you're parsing just the main Document container.
Thus, you need to parse the HtmlElements of some other Frame.
The Web page's list of Frames is referenced by the WebBrowser.Document.Window.Frames property, which returns an HtmlWindowCollection.
Each HtmlWindow in the collection contains its own HtmlDocument object.

Instead of parsing only the Document object returned by the WebBrowser, most of the time we need to parse each HtmlWindow.Document in the Frames collection, unless we already know that the required Elements are part of the main Document or of another known Frame.
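Frames can themselves contain Frames, since every HtmlWindow exposes its own Frames collection. The following is a minimal sketch, not part of the original answer, of collecting every reachable HtmlDocument recursively. FrameWalker and CollectDocuments are hypothetical names introduced here for illustration; the sketch assumes that Frames which refuse access throw one of the exceptions caught below and can simply be skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;

// Hypothetical helper, not part of Windows Forms.
public static class FrameWalker
{
    // Returns the document of the given window plus the documents of all
    // nested frames that can be accessed. Inaccessible frames are skipped.
    public static List<HtmlDocument> CollectDocuments(HtmlWindow window)
    {
        var documents = new List<HtmlDocument>();
        Collect(window, documents);
        return documents;
    }

    private static void Collect(HtmlWindow window, List<HtmlDocument> documents)
    {
        if (window == null) return;
        try
        {
            HtmlDocument document = window.Document;
            if (document != null) documents.Add(document);

            HtmlWindowCollection frames = window.Frames;
            if (frames == null) return;

            // Each child frame is an HtmlWindow and may contain further frames
            foreach (HtmlWindow frame in frames)
            {
                Collect(frame, documents);
            }
        }
        catch (UnauthorizedAccessException) { } // frame cannot be accessed: skip it
        catch (InvalidOperationException) { }   // elements cannot be accessed: skip it
    }
}
```

A caller could pass browser.Document.Window to FrameWalker.CollectDocuments and then call GetElementsByTagName("video") on each returned HtmlDocument, instead of querying only the main Document.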

An example (related to the current task):

Subscribe to the DocumentCompleted event of the WebBrowser Control.
Check the WebBrowser.ReadyState property to verify that a Document is loaded completely.

Note:
Keeping in mind that a Web page may be composed of multiple Documents contained in Frames/IFrames, we shouldn't be surprised if the event is raised multiple times with ReadyState = WebBrowserReadyState.Complete.
Each Frame's Document will raise the event when the WebBrowser is done loading it.

Parse the HtmlDocument of each Frame in the Document.Window.Frames collection, using the Frame.Document.Body.GetElementsByTagName() method.
Extract the HtmlElement's Attribute using the HtmlElement.GetAttribute method.

Note:
Since the DocumentCompleted event is raised multiple times, we also need to make sure that an HtmlElement Attribute value is not stored multiple times.
Here, I'm using a custom support class that holds all the collected values along with the hash code of each reference link (relying on the default implementation of GetHashCode()).
Each time a Document is parsed, we check whether a value has already been stored by comparing its hash.

Stop parsing when a duplicate hash is found: the Elements of the Frame's Document have already been extracted.
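Two different strings can produce the same hash code, so comparing hashes alone may occasionally treat a new link as a duplicate. As an alternative to the approach described above (not the method this answer uses), the links themselves can be kept in a HashSet<string>. The following is a minimal sketch; SeenLinks and TryAdd are hypothetical names introduced here for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: remembers which links have already been collected.
public sealed class SeenLinks
{
    // Ordinal comparison: links are compared character by character
    private readonly HashSet<string> links = new HashSet<string>(StringComparer.Ordinal);

    // Returns true the first time a link is offered,
    // false for repeats and for null or empty input.
    public bool TryAdd(string videoLink)
    {
        if (string.IsNullOrEmpty(videoLink)) return false;
        return links.Add(videoLink);
    }

    public int Count
    {
        get { return links.Count; }
    }
}
```

In a DocumentCompleted handler, a false result from TryAdd would play the same role as finding a duplicate hash: the Frame has already been processed and parsing can stop.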

Note:
While parsing the HtmlWindowCollection, some specific Exceptions will inevitably be thrown:

UnauthorizedAccessException: some Frames cannot be accessed.
InvalidOperationException: some Elements/Descendants cannot be accessed.

There's nothing we can do to avoid this: the Elements are not null, they simply throw these exceptions when we try to access any of their properties.
Here, I'm just catching and ignoring these specific Exceptions: we know we will eventually get them, we cannot avoid it, move on.

public class MovieLink
{
    public MovieLink() { }
    public int Hash { get; set; }
    public string VideoLink { get; set; }
    public string ImageLink { get; set; }
}

// Requires: using System; using System.Collections.Generic;
//           using System.Linq; using System.Windows.Forms;
List<MovieLink> moviesLinks = new List<MovieLink>();

private void Browser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    var browser = sender as WebBrowser;
    if (browser.ReadyState != WebBrowserReadyState.Complete) return;

    // Each Frame is an HtmlWindow that holds its own HtmlDocument
    var documentFrames = browser.Document.Window.Frames;
    foreach (HtmlWindow frame in documentFrames) {
        try {
            var body = frame.Document?.Body;
            if (body == null) continue; // Nothing to parse in this Frame

            var videoElement = body
                .GetElementsByTagName("VIDEO").OfType<HtmlElement>().FirstOrDefault();

            // The link is read from the first child of the <video> element
            if (videoElement != null && videoElement.Children.Count > 0) {
                string videoLink = videoElement.Children[0].GetAttribute("src");
                int hash = videoLink.GetHashCode();
                if (moviesLinks.Any(m => m.Hash == hash)) {
                    // Done parsing this URL: remove handler or whatever 
                    // else is planned to move to the next site/page
                    return;
                }

                string sourceImage = videoElement.GetAttribute("poster");
                moviesLinks.Add(new MovieLink() {
                    Hash = hash, VideoLink = videoLink, ImageLink = sourceImage
                });
            }
        }
        catch (UnauthorizedAccessException) { } // Cannot be avoided: ignore
        catch (InvalidOperationException) { }   // Cannot be avoided: ignore
    }
}
