Fast HTML Parser

Marcos Douglas B. Santos
Hi,

Does anyone know a fast HTML parser to use in Pascal code?

I need something like this:

HTML:
<select name="sel_x">
<option>1</option>
<option>2</option>
</select>

I need a function/object to give me only the values:
1
2

Something like:
S := GetHTMLValues('sel_x');

Regards,
Marcos Douglas
_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal

Re: Fast HTML Parser

Rainer Stratmann
It's not that difficult to write yourself.
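
For tightly constrained input like the snippet in the original question, a hand-written scan might look like the sketch below. It is only an illustration: GetHTMLValues here is a hypothetical helper (its signature differs from the one-argument call in the question), and it assumes lower-case tags, a double-quoted name attribute written directly after "<select", and plain text inside each <option>. Real-world HTML will break these assumptions quickly.

```pascal
program NaiveSelectScan;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, StrUtils;

{ Hypothetical helper, named after the call in the original question.
  Assumes lower-case tags, name="..." written directly after <select,
  and plain text inside each <option>. }
procedure GetHTMLValues(const AHtml, AName: string; AValues: TStrings);
var
  SelStart, SelEnd, OptStart, TextStart, TextEnd: SizeInt;
begin
  SelStart := Pos('<select name="' + AName + '"', AHtml);
  if SelStart = 0 then
    Exit;                                  // no matching <select> found
  SelEnd := PosEx('</select>', AHtml, SelStart);
  if SelEnd = 0 then
    SelEnd := Length(AHtml) + 1;           // tolerate a missing </select>

  OptStart := PosEx('<option', AHtml, SelStart);
  while (OptStart > 0) and (OptStart < SelEnd) do
  begin
    TextStart := PosEx('>', AHtml, OptStart);
    if TextStart = 0 then
      Break;
    Inc(TextStart);                        // first character after '>'
    TextEnd := PosEx('</option>', AHtml, TextStart);
    if (TextEnd = 0) or (TextEnd > SelEnd) then
      Break;
    AValues.Add(Trim(Copy(AHtml, TextStart, TextEnd - TextStart)));
    OptStart := PosEx('<option', AHtml, TextEnd);
  end;
end;

var
  Values: TStringList;
begin
  Values := TStringList.Create;
  try
    GetHTMLValues(
      '<select name="sel_x"><option>1</option><option>2</option></select>',
      'sel_x', Values);
    Write(Values.Text);                    // one collected value per line
  finally
    Values.Free;
  end;
end.
```

For the snippet from the question this should collect 1 and 2. Upper-case tags, single-quoted attributes, or a different attribute order would all be missed.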



Re: Fast HTML Parser

Marcos Douglas B. Santos
On Wed, Aug 6, 2014 at 2:54 PM, Rainer Stratmann
<[hidden email]> wrote:

> It's not that difficult to write yourself.

You're right. But I'm searching for the fastest HTML parser to use on huge
HTML files... thousands of files.

Best regards,
Marcos Douglas

Re: Fast HTML Parser

Mark Morgan Lloyd-5
Marcos Douglas wrote:

> On Wed, Aug 6, 2014 at 2:54 PM, Rainer Stratmann
> <[hidden email]> wrote:
>> It's not that difficult to write yourself.
>
> You're right. But I'm searching for the fastest HTML parser to use on huge
> HTML files... thousands of files.

I disagree: it's damn difficult if one isn't working with tightly
constrained input, and the original question says HTML without
specifying it's a subset.

There's a couple of places where I parse HTML files that I've created
myself, i.e. I know exactly what's in them, using, basically, a simple
recursive-descent parser with some rather flexible ideas about comments
(e.g. in the above example, name="sel_x" could be lost as a comment).
However if I'm doing a brute-force job over a large number of files I
usually use Lynx as a preprocessor, which allows me to use standard
text-processing utilities to pull named rows out of tabulated reports.

--
Mark Morgan Lloyd
markMLl .AT. telemetry.co .DOT. uk

[Opinions above are the author's, not those of his employers or colleagues]

Re: Fast HTML Parser

Marcos Douglas B. Santos
On Wed, Aug 6, 2014 at 5:46 PM, Mark Morgan Lloyd
<[hidden email]> wrote:

> I disagree: it's damn difficult if one isn't working with tightly
> constrained input, and the original question says HTML without specifying
> it's a subset.
>
> There's a couple of places where I parse HTML files that I've created
> myself, i.e. I know exactly what's in them, using, basically, a simple
> recursive-descent parser with some rather flexible ideas about comments
> (e.g. in the above example, name="sel_x" could be lost as a comment).
> However if I'm doing a brute-force job over a large number of files I
> usually use Lynx as a preprocessor, which allows me to use standard
> text-processing utilities to pull named rows out of tabulated reports.

I know the tokens to search for, but the HTML files could be very different
from each other. I can't use an external tool. It needs to be part of an
application (that already exists).

Thanks,
Marcos Douglas

Re: Fast HTML Parser

Graeme Geldenhuys-6
On 2014-08-06 21:54, Marcos Douglas wrote:
> I know the tokens to search for, but the HTML files could be very different
> from each other. I can't use an external tool. It needs to be part of an
> application (that already exists).

Take a look at POWtils (aka PWU or PSP or Pascal Server Pages) created
by somebody known as Z505. There have been various locations for the
source code, but I think the latest is at:

  https://code.google.com/p/powtils/

It has (or at least had) a very simple-to-use HTML parser that was very
fast. If you don't come right with the above URL, I have some release
archives that I know contain the code. Just let me know and I can make
them available.


Regards,
  - Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/

Re: Fast HTML Parser

Graeme Geldenhuys-6
In reply to this post by Marcos Douglas B. Santos
On 2014-08-06 21:54, Marcos Douglas wrote:
> I know the tokens to search for, but the HTML files could be very different
> from each other. I can't use an external tool. It needs to be part of an
> application (that already exists).

It seems a copy of the Fast HTML Parser unit I spoke of has made its way
into the FPC source code tree.

See <fpc_src>/packages/chm/src/fasthtmlparser.pas

Attached is the original one I got from powtils release. It includes the
parser, a utility unit and a demo program showing the parser in action
with some stats output.


Regards,
  - Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/

Attachment: fasthtmlparser.tar.gz (12K)

Re: Fast HTML Parser

Andrew Haines
In reply to this post by Marcos Douglas B. Santos
On 08/06/14 13:50, Marcos Douglas wrote:

> Does anyone know a fast HTML parser to use in Pascal code?
>
> I need something like this:
>
> HTML:
> <select name="sel_x">
> <option>1</option>
> <option>2</option>
> </select>
>
> I need a function/object to give me only the values:
> 1
> 2
>
> Something like:
> S := GetHTMLValues('sel_x');

There is the unit fasthtmlparser included with fpc in the packages/chm
folder.

It is pretty basic and just has callbacks for tags and text. I don't
think it's smart enough to tell you about the name="sel_x" part of your
tag. Maybe it can be improved.
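
A minimal sketch of how the tag and text callbacks could be combined to pull the option values out of one named select. TOptionCollector and its fields are hypothetical names made up for this example. The sketch assumes the THTMLParser interface as found in the fasthtmlparser unit shipped with FPC at the time (a string constructor, OnFoundTag, OnFoundText and Exec, with the tag passed including its angle brackets and once in upper case); newer versions of the unit may differ. Since the parser does not split attributes, the name="..." check is done by a plain substring search on the upper-cased tag.

```pascal
program SelectValues;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, fasthtmlparser;

type
  { Hypothetical helper: collects the text of every <option> inside
    the <select> whose tag contains name="<wanted name>". }
  TOptionCollector = class
  private
    FWantedName: string;   // upper-cased NAME="..." fragment to look for
    FInSelect: Boolean;    // currently inside the wanted <select>
    FInOption: Boolean;    // currently inside one of its <option> tags
    FValues: TStrings;     // owned by the caller
    procedure FoundTag(NoCaseTag, ActualTag: string);
    procedure FoundText(Text: string);
  public
    constructor Create(const ASelectName: string; AValues: TStrings);
    procedure Run(const AHtml: string);
  end;

constructor TOptionCollector.Create(const ASelectName: string;
  AValues: TStrings);
begin
  FWantedName := 'NAME="' + UpperCase(ASelectName) + '"';
  FValues := AValues;
end;

procedure TOptionCollector.FoundTag(NoCaseTag, ActualTag: string);
begin
  // Assumption: NoCaseTag is the whole tag in upper case, brackets included.
  if Pos('<SELECT', NoCaseTag) = 1 then
    FInSelect := Pos(FWantedName, NoCaseTag) > 0
  else if Pos('</SELECT', NoCaseTag) = 1 then
    FInSelect := False
  else if Pos('<OPTION', NoCaseTag) = 1 then
    FInOption := FInSelect
  else if Pos('</OPTION', NoCaseTag) = 1 then
    FInOption := False;
end;

procedure TOptionCollector.FoundText(Text: string);
begin
  // Keep only non-blank text that sits inside a wanted <option>.
  if FInOption and (Trim(Text) <> '') then
    FValues.Add(Trim(Text));
end;

procedure TOptionCollector.Run(const AHtml: string);
var
  Parser: THTMLParser;
begin
  Parser := THTMLParser.Create(AHtml);
  try
    Parser.OnFoundTag := @FoundTag;
    Parser.OnFoundText := @FoundText;
    Parser.Exec;
  finally
    Parser.Free;
  end;
end;

var
  Values: TStringList;
  Collector: TOptionCollector;
begin
  Values := TStringList.Create;
  Collector := TOptionCollector.Create('sel_x', Values);
  try
    Collector.Run(
      '<select name="sel_x"><option>1</option><option>2</option></select>');
    Write(Values.Text);   // one collected value per line
  finally
    Collector.Free;
    Values.Free;
  end;
end.
```

The substring test only matches a double-quoted name attribute; single quotes or unquoted values would need extra handling.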

Regards,

Andrew Haines


Re: Fast HTML Parser

Marcos Douglas B. Santos
On Wed, Aug 6, 2014 at 9:58 PM, Andrew Haines <[hidden email]> wrote:

> There is the unit fasthtmlparser included with fpc in the packages/chm
> folder.
>
> It is pretty basic and just has callbacks for tags and text. I don't
> think it's smart enough to tell you about the name="sel_x" part of your
> tag.

You're right, but I changed my code to use fasthtmlparser and it worked
(at least for now). Thank you.

> Maybe it can be improved.
I agree. If I change something, I'll send a patch.


Marcos Douglas

Re: Fast HTML Parser

Michael Schnell
In reply to this post by Rainer Stratmann
On 08/06/2014 07:54 PM, Rainer Stratmann wrote:
> It's not that difficult to write yourself.
>
In fact, my son once wrote (using Delphi) a parser that creates a
list of hierarchically linked objects from HTML code and can also write
an HTML file from this structure.

So you can read a file, use straightforward programming to modify the
content, and write it back.

As the HTML format is not very strict and is a moving target, the parser
unit is far from perfect, but it is in daily use and does a rather nice
job.

OTOH, I would not say it's fast, anyway :-( .

-Michael

Re: Fast HTML Parser

Marcos Douglas B. Santos
In reply to this post by Graeme Geldenhuys-6
On Wed, Aug 6, 2014 at 6:51 PM, Graeme Geldenhuys
<[hidden email]> wrote:

> Take a look at POWtils (aka PWU or PSP or Pascal Server Pages) created
> by somebody known as Z505. There have been various locations for the
> source code, but I think the latest is at:
>
>   https://code.google.com/p/powtils/
>
> It has (or at least had) a very simple-to-use HTML parser that was very
> fast. If you don't come right with the above URL, I have some release
> archives that I know contain the code. Just let me know and I can make
> them available.

But fasthtmlparser, your earlier tip, comes from the powtils source,
doesn't it? I have had that code for many years, but I did not know about
fasthtmlparser. It's very simple. I did not find everything I wanted,
but it is a good start.

Best regards,
Marcos Douglas

PS: Like you I use FPC in real applications in production. So I have a
deadline - always short - to fulfill. So finding good code to help in
our projects is very good because it makes us save time. Thanks.

Re: Fast HTML Parser

Marco van de Voort
In our previous episode, Marcos Douglas said:
> But fasthtmlparser, your earlier tip, comes from the powtils source,
> doesn't it? I have had that code for many years, but I did not know about
> fasthtmlparser. It's very simple. I did not find everything I wanted,
> but it is a good start.

Yes it is. The CHM parser is also based on it, but there z505 is not listed
as author but as contributor:

 AUTHOR       : James Azarja
                http://www.jazarsoft.com/

 CONTRIBUTORS : L505
                http://z505.com



Re: Fast HTML Parser

Luiz Americo Pereira Camara


2014-08-07 10:20 GMT-03:00 Marco van de Voort <[hidden email]>:
> Yes it is. The CHM parser is also based on it, but there z505 is not listed
> as author but as contributor:
>
>  AUTHOR       : James Azarja
>                 http://www.jazarsoft.com/
>
>  CONTRIBUTORS : L505
>                 http://z505.com

You can try http://www.benibela.de/sources_en.html#internettools



Re: Fast HTML Parser

Marco van de Voort
In our previous episode, luiz americo pereira camara said:
> You can try http://www.benibela.de/sources_en.html#internettools

That seems more like sax_html from the fcl-xml package.

Re: Fast HTML Parser

Luiz Americo Pereira Camara



2014-08-08 8:28 GMT-03:00 Marco van de Voort <[hidden email]>:
> In our previous episode, luiz americo pereira camara said:
>> You can try http://www.benibela.de/sources_en.html#internettools
>
> That seems more like sax_html from the fcl-xml package.

It's not a simple parser. It can extract parts of an HTML document
through templates. See http://videlibri.sourceforge.net/cgi-bin/xidelcgi

Luiz
 


Re: Fast HTML Parser

Marco van de Voort
In our previous episode, luiz americo pereira camara said:
> It's not a simple parser. It can extract parts of an HTML document
> through templates. See http://videlibri.sourceforge.net/cgi-bin/xidelcgi

Is there XPath support in fcl-xml?

Re: Fast HTML Parser

Marcos Douglas B. Santos
In reply to this post by Luiz Americo Pereira Camara
On Thu, Aug 7, 2014 at 8:53 PM, luiz americo pereira camara
<[hidden email]> wrote:
>
> You can try http://www.benibela.de/sources_en.html#internettools

I will see, thanks.

Regards,
Marcos Douglas

Re: Fast HTML Parser

Daniel Gaspary
In reply to this post by Marco van de Voort
On Fri, Aug 8, 2014 at 9:40 AM, Marco van de Voort <[hidden email]> wrote:
> Is there XPath support in fcl-xml?

Yes. But HTML files are usually very irregular XML. Some files can
raise an error when you try to open them.

Things like "<p>" without a closing element are easy to find.