XML DOM and HTML

Johannes Nohl
Dear list,

I played around with the units dom and xmlread. I liked them very
much. Now I thought I could parse websites with them. But websites are
slightly different as far as I know. In XML everything is within a node,
while in HTML there is more than one value in a node. E.g.:

possible XML:

<div>
 asdf1
 <span>qwer1</span>
 <span>qwer2</span>
</div>

HTML:
<div>
 asdf1
 <span>qwer1</span>
 asdf2
 <span>qwer2</span>
 asdf3
</div>

Using the XML DOM I can access the value "asdf1" only. I think the second
example is not valid XML, is it?

Has anybody used XML to parse HTML-files? Is there a unit?

Thanks for your help!
_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/mailman/listinfo/fpc-pascal

Re: XML DOM and HTML

Michael Van Canneyt


On Sat, 7 Jun 2008, Johannes Nohl wrote:

> Dear list,
>
> I played around with the units dom and xmlread. I liked them very
> much. Now I thought I could parse websites with them. But websites are
> slightly different as far as I know. In XML everything is within a node,
> while in HTML there is more than one value in a node. E.g.:

There are multiple problems with HTML parsing: HTML is not a well-formed
XML document, because
- the tags are case insensitive (in XML they are case sensitive)
- Not all tags must be closed.
If the HTML is XHTML, then the DOM unit can be used to parse it.

>
> possible XML:
>
> <div>
>  asdf1
>  <span>qwer1</span>
>  <span>qwer2</span>
> </div>
>
> HTML:
> <div>
>  asdf1
>  <span>qwer1</span>
>  asdf2
>  <span>qwer2</span>
>  asdf3
> </div>
>
> Using the XML DOM I can access the value "asdf1" only. I think the second
> example is not valid XML, is it?

No, it should be valid: text interleaved with child elements (mixed
content) is allowed in XML. If it weren't well-formed XML, the XMLRead
unit would give an error.
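
A quick way to check this is to feed a snippet to XMLRead and see whether it raises an exception. The following is only a minimal sketch: the helper name IsWellFormed and both test strings are made up for illustration, and it assumes that ReadXMLFile accepts a TStream and that parse failures surface as EXMLReadError.

```pascal
program WellFormedCheck;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils, DOM, XMLRead;

// Hypothetical helper: True if Source parses without an XML read error.
function IsWellFormed(const Source: String): Boolean;
var
  Doc: TXMLDocument;
  Stream: TStream;
begin
  Result := True;
  Doc := nil;
  Stream := TStringStream.Create(Source);
  try
    try
      ReadXMLFile(Doc, Stream);
    except
      on E: EXMLReadError do
        Result := False;
    end;
  finally
    Doc.Free;     // safe even if Doc is still nil
    Stream.Free;
  end;
end;

begin
  // Mixed content: text interleaved with child elements.
  WriteLn(IsWellFormed('<div>asdf1<span>qwer1</span>asdf2<span>qwer2</span>asdf3</div>'));
  // HTML-style unclosed tag, which an XML parser should reject.
  WriteLn(IsWellFormed('<div>asdf1<br>asdf2</div>'));
end.
```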

Michael.

Re: XML DOM and HTML

Johannes Nohl
Dear list, dear Michael!

> There are multiple problems with HTML parsing: HTML is not a well-formed
> XML document, because
> - the tags are case insensitive (in XML they are case sensitive)
> - Not all tags must be closed.
> If the HTML is XHTML, then the DOM unit can be used to parse it.

But how do I retrieve more than the first part of the node's value?

If I read in:
 <div>
  asdf1
  <span>qwer1</span>
  asdf2
  <img src="" />
  asdf3
 </div>

FindNode('div').NodeValue returns "asdf1", but not asdf2 and asdf3.
Isn't the example above valid XHTML?

Am I wrong?

Re: XML DOM and HTML

Michael Van Canneyt


On Sun, 8 Jun 2008, Johannes Nohl wrote:

> Dear list, dear Michael!
>
> > There are multiple problems with HTML parsing: HTML is not a well-formed
> > XML document, because
> > - the tags are case insensitive (in XML they are case sensitive)
> > - Not all tags must be closed.
> > If the HTML is XHTML, then the DOM unit can be used to parse it.
>
> But how do I retrieve more than the first part of the node's value?
>
> If I read in:
>  <div>
>   asdf1
>   <span>qwer1</span>
>   asdf2
>   <img src="" />
>   asdf3
>  </div>
>
> FindNode('div').NodeValue returns "asdf1", but not asdf2 and asdf3.
> Isn't the example above valid XHTML?

In the above, the node value is not well defined for the div node: each
run of text between the child elements is a separate text node under the div.
The return value is IMHO correct. You will have to 'glue' the various text
parts together.
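
Gluing the text parts together could look roughly like the sketch below. It walks the direct children of the div and concatenates the values of the text nodes only. The function name DirectText and the compacted sample string are assumptions for illustration, not something from the dom unit.

```pascal
program GlueText;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils, DOM, XMLRead;

// Hypothetical helper: concatenates the text nodes that are direct
// children of Node; text inside child elements is skipped.
function DirectText(Node: TDOMNode): DOMString;
var
  Child: TDOMNode;
begin
  Result := '';
  Child := Node.FirstChild;
  while Assigned(Child) do
  begin
    if Child.NodeType = TEXT_NODE then
      Result := Result + Child.NodeValue;
    Child := Child.NextSibling;
  end;
end;

const
  // The example from this thread, written on one line.
  Source = '<div>asdf1<span>qwer1</span>asdf2<img src="" />asdf3</div>';
var
  Doc: TXMLDocument;
  Stream: TStream;
begin
  Doc := nil;
  Stream := TStringStream.Create(Source);
  try
    ReadXMLFile(Doc, Stream);
    // The root element is the div; print its glued text parts.
    WriteLn(DirectText(Doc.DocumentElement));
  finally
    Doc.Free;
    Stream.Free;
  end;
end.
```

If your version of the dom unit exposes the DOM Level 3 TextContent property, that may be an alternative, but per the DOM specification it also includes the text of descendant elements (here, qwer1), so it is not the same thing.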


Michael.

Re: XML DOM and HTML

Lee Jenkins
In reply to this post by Johannes Nohl
Johannes Nohl wrote:

> Dear list, dear Michael!
>
>> There are multiple problems with HTML parsing: HTML is not a well-formed
>> XML document, because
>> - the tags are case insensitive (in XML they are case sensitive)
>> - Not all tags must be closed.
>> If the HTML is XHTML, then the DOM unit can be used to parse it.
>
> But how do I retrieve more than the first part of the node's value?
>
> If I read in:
>  <div>
>   asdf1
>   <span>qwer1</span>
>   asdf2
>   <img src="" />
>   asdf3
>  </div>
>
> FindNode('div').NodeValue returns "asdf1", but not asdf2 and asdf3.
> Isn't the example above valid XHTML?
>

If I were going to parse web pages, I would probably opt to use RegEx. There
is a regex unit included with FPC, I believe, but I tend to use this one since
it's compatible with both FPC and Delphi:

http://regexpstudio.com/TRegExpr/TRegExpr.html

--

Warm Regards,

Lee


Re: XML DOM and HTML

Sebastian Günther
In reply to this post by Johannes Nohl
Johannes Nohl wrote:

> Dear list,
>
> I played around with the units dom and xmlread. I liked them very
> much. Now I thought I could parse websites with them. But websites are
> slightly different as far as I know. In XML everything is within a node,
> while in HTML there is more than one value in a node. E.g.:
>
> possible XML:
>
> <div>
>  asdf1
>  <span>qwer1</span>
>  <span>qwer2</span>
> </div>
>
> HTML:
> <div>
>  asdf1
>  <span>qwer1</span>
>  asdf2
>  <span>qwer2</span>
>  asdf3
> </div>
>
> Using the XML DOM I can access the value "asdf1" only. I think the second
> example is not valid XML, is it?
>
> Has anybody used XML to parse HTML-files? Is there a unit?


Yes.
HTML is based on SGML, and XML is a subset of SGML, so you cannot simply
parse any HTML file using an XML parser.
Instead of the XML parser, you can try the HTML parser in
packages/fcl-xml/src/sax_html.pp. It relies on more or less correct HTML
code, but it should be able to parse most websites.
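
A minimal sketch of using that HTML parser follows. It assumes the sax_html unit offers a ReadHTMLFile overload that takes a stream and fills a THTMLDocument (declared in dom_html); the sample markup and the DumpNode procedure are made up for illustration.

```pascal
program ParseHtml;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils, DOM, DOM_HTML, SAX_HTML;

// Hypothetical helper: prints the tree, one node per line, indented by depth.
procedure DumpNode(Node: TDOMNode; const Indent: String);
var
  Child: TDOMNode;
begin
  if Node.NodeType = TEXT_NODE then
    WriteLn(Indent, '"', Node.NodeValue, '"')
  else
    WriteLn(Indent, '<', Node.NodeName, '>');
  Child := Node.FirstChild;
  while Assigned(Child) do
  begin
    DumpNode(Child, Indent + '  ');
    Child := Child.NextSibling;
  end;
end;

const
  // Made-up input with an uppercase tag and an unclosed <br>,
  // i.e. HTML that is not well-formed XML.
  Source = '<html><body><DIV>asdf1<br>asdf2<span>qwer1</span>asdf3</DIV></body></html>';
var
  Doc: THTMLDocument;
  Stream: TStream;
begin
  Doc := nil;
  Stream := TStringStream.Create(Source);
  try
    ReadHTMLFile(Doc, Stream);
    if Assigned(Doc) and Assigned(Doc.DocumentElement) then
      DumpNode(Doc.DocumentElement, '');
  finally
    Doc.Free;
    Stream.Free;
  end;
end.
```

The resulting document is an ordinary DOM tree, so the same node-walking code used with XMLRead should apply to it.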


Regards,
Sebastian