Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

html - VBA: retrieve data from HTML, including spaces

Here is the relevant HTML code.

<tr style="background-color: #f0f0f0">
<td> </td><td> a</td><td>a </td><td>  </td><td>&nbsp;</td>
</tr>

Here is the VBA code.

' Requires a reference to the Microsoft HTML Object Library (MSHTML)
Sub GetHtmlSpace()

    Dim trObj As MSHTML.HTMLGenericElement
    Dim tdObj As MSHTML.HTMLGenericElement
    Dim aRes As Variant
    Dim temp1 As Long, temp2 As Long   ' row and column indexes into aRes
    Dim oDom As Object: Set oDom = CreateObject("htmlFile")
    Dim oRow As MSHTML.IHTMLElementCollection, oCell As MSHTML.IHTMLElementCollection

    temp1 = 0
    temp2 = 0

    ' Download the sheet as HTML and hand it to the HTML parser
    With CreateObject("MSXML2.ServerXMLHttp")
        .Open "GET", "https://docs.google.com/spreadsheets/d/1Yh6WlJTDxbOLPVaVgzn_mk2OAKYVUYgfnT5Wz-8odi4/gviz/tq?tqx=out:html&tq&gid=1", False
        .send
        oDom.body.innerHTML = .responseText
    End With

    ' Copy the text of every cell into a two-dimensional array
    Set oRow = oDom.getElementsByTagName("TR")
    ReDim aRes(0 To oRow.Length - 1, 0 To oRow(0).getElementsByTagName("TD").Length - 1)
    For Each trObj In oRow
        Set oCell = trObj.getElementsByTagName("td")
        For Each tdObj In oCell
            aRes(temp1, temp2) = tdObj.innerText
            temp2 = temp2 + 1
        Next tdObj
        temp2 = 0
        temp1 = temp1 + 1
    Next trObj

End Sub

I would like the aRes array to contain the exact values in the HTML code, i.e.

aRes(1,0) should be a single space " ". My result is empty, i.e. "".

aRes(1,1) should be a space followed by the character a, i.e. " a". My result is only "a".

aRes(1,2) should be "a ". This one is retrieved correctly.

aRes(1,3) should be two spaces "  ". My result is empty, i.e. "".

aRes(1,4) should be empty. My result is a space, i.e. " ".

I know I can use regex to get the task done.

However, I would like to do it in a simple way using the getElementsByTagName method.

I tried innerHTML, outerText, outerHTML and textContent instead of innerText, but no luck.

I also googled keywords such as "innerText with spacing" and "getElementsByTagName properties". Also no luck.

Could someone help, please?

Thank you so much.

Asked by lordy888 (translated from Stack Overflow)

1 Answer

You can't per se.

The HTML parser decides which whitespace is useful and should be retained, and which should be removed.

I will add some references later (if I can find any), but just as in the browser engine, the HTML parser has rules which determine which whitespace characters are useful.

Bear in mind that "whitespace" is a mass noun covering a variety of characters which may be handled differently.

Compare what happens to your responseText after it has gone through the HTML parser:

[image: the raw responseText alongside the parsed innerHTML]

Notice how whitespace deemed not useful has been removed.

You cannot use a method of HTMLFile to get the result you want, because by the time the HTML has been parsed it is too late, and there is no setting with late-bound HTMLFile, or early-bound MSHTML.HTMLDocument, that changes this.

You would have to look to other string manipulations first.

You might, for example, use Replace$ on the .responseText to swap Chr$(32) for the HTML entity &nbsp;.
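
As a rough sketch of the Replace$ idea (not the original answerer's code): replacing every space in the response would also hit the spaces inside tags such as <tr style="...">, so the helper below converts only the spaces that sit between tags. ProtectTextSpaces and RestoreSpaces are hypothetical names. The sketch assumes no attribute value contains ">", and that innerText returns &nbsp; as character 160.

```vba
' Hypothetical helper (an assumption, not part of the original answer):
' replaces ordinary spaces with &nbsp; only in text between tags,
' leaving spaces inside tags (e.g. between attributes) untouched.
Public Function ProtectTextSpaces(ByVal html As String) As String
    Dim i As Long
    Dim ch As String
    Dim insideTag As Boolean
    Dim result As String

    For i = 1 To Len(html)
        ch = Mid$(html, i, 1)
        Select Case ch
            Case "<"
                insideTag = True
                result = result & ch
            Case ">"
                insideTag = False
                result = result & ch
            Case " "
                If insideTag Then
                    result = result & ch
                Else
                    result = result & "&nbsp;"
                End If
            Case Else
                result = result & ch
        End Select
    Next i

    ProtectTextSpaces = result
End Function

' Hypothetical helper: turns non-breaking spaces (character 160)
' back into ordinary spaces (character 32).
Public Function RestoreSpaces(ByVal cellText As String) As String
    RestoreSpaces = Replace$(cellText, ChrW$(160), " ")
End Function
```

In the question's code this could be wired in as oDom.body.innerHTML = ProtectTextSpaces(.responseText) and aRes(temp1, temp2) = RestoreSpaces(tdObj.innerText). Note that a genuine &nbsp; in the source would then also come back as an ordinary space, and that the string concatenation in the loop may be slow on large responses.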

Or, as you mention, use regex to do a more efficient set of replacements.
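
A regex version of the same pre-processing might look like the sketch below. It is an illustration only: ProtectTextSpacesRegex is a hypothetical name, and the pattern assumes well-formed markup in which "<" and ">" appear only as tag delimiters. A space counts as text content when the next angle bracket after it is "<"; a space inside a tag reaches ">" first and is left alone.

```vba
' Hypothetical helper (an assumption, not part of the original answer):
' uses the late-bound VBScript.RegExp object to replace spaces that occur
' in text between tags with &nbsp;.
Public Function ProtectTextSpacesRegex(ByVal html As String) As String
    With CreateObject("VBScript.RegExp")
        .Global = True
        ' A space, then any run of non-bracket characters, then "<".
        ' The lookahead does not consume text, so runs of spaces all match.
        .Pattern = " (?=[^<>]*<)"
        ProtectTextSpacesRegex = .Replace(html, "&nbsp;")
    End With
End Function
```

Text that follows the last tag in the string is not followed by "<" and is therefore left unchanged, which should not matter for table cells.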

You can generate the above image outputs with:

Option Explicit

Public Sub ExamineHtmlWhenParsed()
    Dim oDom As Object: Set oDom = CreateObject("htmlFile")

    With CreateObject("MSXML2.ServerXMLHTTP")
        .Open "GET", "https://docs.google.com/spreadsheets/d/1Yh6WlJTDxbOLPVaVgzn_mk2OAKYVUYgfnT5Wz-8odi4/gviz/tq?tqx=out:html&tq&gid=1", False
        .send
        oDom.body.innerHTML = .responseText
        ' Write the raw response and the parsed HTML to files for comparison
        WriteTxtFile .responseText, "C:\Users\User\Desktop\input.txt"
        WriteTxtFile oDom.body.innerHTML, "C:\Users\User\Desktop\parsed.txt"
    End With

End Sub

Public Sub WriteTxtFile(ByVal aString As String, ByVal filePath As String)
    Dim fso As Object, Fileout As Object
    Set fso = CreateObject("Scripting.FileSystemObject")
    Set Fileout = fso.CreateTextFile(filePath, True, True)
    Fileout.Write aString
    Fileout.Close
End Sub

This gives a worked example of browser whitespace processing.

This discusses it in the context of CSS.

The VBA HTML parsers will be older than the current HTML5 living standard, but the current standard is here.

You can review the answers given to this question and the associated comments, e.g.:

@JasonWoof: HTML5 spec says that browsers should only collapse 5 (ascii) whitespace characters (space, tab, cr, lf, ff).
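
To see which characters actually survived parsing, a small diagnostic along these lines may help. It is only a sketch: IsCollapsibleWhitespace and CharCodes are hypothetical names, and the list of five characters is taken from the comment quoted above, not from anything verified against MSHTML. It illustrates why a cell holding &nbsp; (character 160) keeps its content while ordinary spaces (character 32) are dropped.

```vba
' Hypothetical helper: True when ch is one of the five ASCII whitespace
' characters named in the quoted comment (tab, LF, FF, CR, space).
' Assumes ch is a non-empty string; only its first character is checked.
Public Function IsCollapsibleWhitespace(ByVal ch As String) As Boolean
    Select Case AscW(ch)
        Case 9, 10, 12, 13, 32
            IsCollapsibleWhitespace = True
        Case Else
            IsCollapsibleWhitespace = False
    End Select
End Function

' Hypothetical helper: lists the character codes in a string so that
' invisible characters show up when printed to the Immediate window.
Public Function CharCodes(ByVal s As String) As String
    Dim i As Long
    Dim parts() As String

    If Len(s) = 0 Then Exit Function   ' returns "" for an empty string

    ReDim parts(1 To Len(s))
    For i = 1 To Len(s)
        parts(i) = CStr(AscW(Mid$(s, i, 1)))
    Next i
    CharCodes = Join(parts, " ")
End Function
```

For example, Debug.Print CharCodes(tdObj.innerText) inside the question's inner loop would show the codes stored for each cell.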
