stripping HTML

stripping HTML

Roland Schäfer
Hello everyone,
is there any existing FPC code (even external libraries with bindings)
to strip HTML tags from files, including adequate removal of scripts,
comments and other multi-line non-text - and which handles faulty HTML
input in a tolerant fashion? I also need to keep track of how many
characters per line were removed. Thanks in advance.
Regards - Roland
_______________________________________________
fpc-pascal maillist  -  [hidden email]
http://lists.freepascal.org/mailman/listinfo/fpc-pascal

Re: stripping HTML

leledumbo
Administrator
http://www.festra.com/eng/snip12.htm
Simple googling gives a lot of results, try: html strip (pascal OR delphi)

Re: stripping HTML

Roland Schäfer
On 4/17/2011 11:00 AM, leledumbo wrote:
> http://www.festra.com/eng/snip12.htm
> Simple googling gives a lot of results, try: html strip (pascal OR delphi)

Thank you for your reply.

I feel I have to justify myself: I always do extensive web and list
archive searches before posting to a list (hence the infrequency of my
posts). I had actually found that snippet over a week ago but
immediately discarded it since it is obviously a toy solution. I have a
much better solution already using the PCRE library on a text stream,
sometimes re-reading portions of the stream by way of backtracking. The
problems with any approach like that (esp. 6-liners like the one linked
in your post, but also more elaborate yet still makeshift regular
expression magic) are:

1. They don't handle faulty HTML well enough.

2. They don't handle any multi-line constructs like comments or scripts.
Depending on how naively you read the input (e.g., using
TStringList.LoadFromFile), they even choke on simple tags with all sorts
of line breaks in between, which are frequently found (and which are, to
my knowledge, not even ill-formed). What do you do with this (for a start)?

'<div class="al#13#10#13ert">'

3. They are potentially not the most efficient solution, which is an
important factor if the stripping alone takes days.
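For illustration, the multi-line problem in point 2 can be sidestepped by a character-level state machine instead of per-line regexes. The toy sketch below (my own, purely illustrative; comments, scripts and entities would each need additional states) keeps the tag state across line breaks and reports, per input line, the text kept and the number of markup characters removed:

```pascal
program StripSketch;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Toy illustration only: a character-level tokenizer that survives tags
// broken across line breaks and counts removed characters per line.
type
  TState = (sText, sTag);
var
  Input, TextOut: string;
  State: TState;
  Removed, i: Integer;
  c: Char;
begin
  // The problematic input from above: a tag split by a CRLF.
  Input := '<div class="al'#13#10'ert">hello</div>';
  State := sText;
  TextOut := '';
  Removed := 0;
  for i := 1 to Length(Input) do
  begin
    c := Input[i];
    if c = #10 then
    begin
      // Line boundary: report and reset the counters, but keep State,
      // so a tag opened on the previous line is still closed correctly.
      WriteLn(Format('text="%s" removed=%d', [TextOut, Removed]));
      TextOut := '';
      Removed := 0;
      Continue;
    end;
    if c = #13 then Continue;
    case State of
      sText:
        if c = '<' then begin State := sTag; Inc(Removed); end
        else TextOut := TextOut + c;
      sTag:
        begin
          Inc(Removed);
          if c = '>' then State := sText;
        end;
    end;
  end;
  WriteLn(Format('text="%s" removed=%d', [TextOut, Removed]));
end.
```

Because the parser state survives the #13#10 inside the attribute, the second line is correctly reported as text "hello" with 11 markup characters removed; a naive per-line regex would leave the stray 'ert">' behind.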

As a clarification: I am mining several ~500GB results of Heritrix
crawls containing all versions of XML, HTML, inline CSS, inline scripts,
etc. They need to be accurately stripped from HTML/XML (accurately means
without losing too much real text). The text/markup ratio has to be
calculated and stored on a per-line basis since I'm applying a machine
learning algorithm afterwards which uses those ratios as one factor to
separate coherent text from boilerplate (menus, navigation, copyright etc.).

I had anticipated a reply along the lines of "read the documents into a
DOM object and extract the text from that". That is also problematic
since it is not fast enough given the size of the input (That is an
assumption; I haven't benchmarked the FPC DOM implementation yet.), and
I don't see how I can calculate the text/markup ratio per line in a
simple fashion when using a DOM implementation.

I am *not* trying to clean or format simple or limited HTML on a string
basis. For stuff like that, I wouldn't have asked. I actually wouldn't
use Pascal for such tasks but rather sed or a Perl script at max.

I would still highly appreciate further input.
Regards
Roland

Re: stripping HTML

Ralf Junker
HTML is not meant to be handled on a line-by-line basis like other
text-based formats; according to the specs, HTML is not line-based.
Browsers should display the following two HTML snippets identically:

  <p#13#10>one#13#10two</p>

and

  <p>one#13#10two</p#13#10>

With HTML tags removed, both reduce to:

  one two

As such, a line-based text/markup ratio does not make much sense IMHO,
especially since browsers do strip line breaks in most text elements
except within <pre> ... </pre>.
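The browser behaviour described here amounts to collapsing any run of whitespace (including CR/LF) to a single space after tag removal. A minimal sketch of that normalization (illustrative only, ignoring the <pre> exception):

```pascal
program CollapseDemo;
{$mode objfpc}{$H+}

// Sketch of HTML-style whitespace normalization outside <pre>: any run
// of spaces, tabs, CRs and LFs collapses to a single space. Applied
// after tag removal, both snippets above yield the same string.
function CollapseWS(const S: string): string;
var
  i: Integer;
  InWS: Boolean;
begin
  Result := '';
  InWS := False;
  for i := 1 to Length(S) do
    if S[i] in [' ', #9, #13, #10] then
      InWS := True
    else
    begin
      // Emit one space for a whitespace run, unless it was leading.
      if InWS and (Result <> '') then
        Result := Result + ' ';
      InWS := False;
      Result := Result + S[i];
    end;
end;

begin
  WriteLn(CollapseWS('one'#13#10'two')); // prints: one two
end.
```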

That said, I believe that DIHtmlParser should care for most of your needs:

  http://yunqa.de/delphi/doku.php/products/htmlparser/index

DIHtmlParser meets most of your requirements:

  * Not DOM based, very fast.

  * Hand-crafted, linear-scan Unicode HTML parser.

  * Handles SCRIPTs and STYLEs well.

  * Simple "Extract Text" demo included, may be modified as needed.

Drawbacks:

  * Like HTML, DIHtmlParser is not line-based. An option is available
    to strip or preserve line breaks and white space.

  * Pre-compiled units available for Delphi only. The source code is
    required to compile with FreePascal.

Ralf


Re: stripping HTML

Roland Schäfer
Thanks a lot for your reply.

On 4/17/2011 3:46 PM, Ralf Junker wrote:
> HTML is not meant to be handled on a line-by-line basis as other
> text-based formats. According to the specs, HTML is not line-based.
> Browsers should display the following two HTML snippets identically:
 [...]
> As such, a line-based text/markup ratio does not make much sense IMHO,
> especially since browsers do strip line breaks in most text elements
> except within <pre> ... </pre>.

This is sort of off-topic, so I'll make it short: Yes, that is a problem
we are aware of. However, experiments with even simple thresholds
("remove lines with less than 50% text") were sort of successful. Simple
machine learning makes it much better. To avoid true paragraph detection
(which would be desirable but costly given the TB-sized input) we are
also experimenting with several line-based and non-line-based windows on
the input and cumulative html/text ratios for those windows. Also, this
is only stage one of the cleanup, and we run some more linguistically
informed and costly steps on the already much smaller amounts of data.
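The simple threshold baseline mentioned above can be sketched in a few lines (names are mine and purely illustrative; the per-line counts are assumed to come from the stripping pass):

```pascal
program ThresholdDemo;
{$mode objfpc}{$H+}

// Illustrative only: the "remove lines with less than 50% text"
// baseline. A line is kept if text makes up at least half of its
// original characters (text plus stripped markup).
function KeepLine(TextLen, MarkupLen: Integer): Boolean;
begin
  if TextLen + MarkupLen = 0 then
    Exit(False); // empty line: nothing worth keeping
  Result := TextLen / (TextLen + MarkupLen) >= 0.5;
end;

begin
  WriteLn(KeepLine(40, 10));  // mostly text -> TRUE
  WriteLn(KeepLine(5, 95));   // markup-heavy boilerplate -> FALSE
end.
```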

Maybe I'll give paragraph detection based on <p>, <div> etc. another
try, but we actually decided against that a while ago because we lost
huge amounts of valuable input due to non-use or very creative use of
such elements in actual web pages.

> That said, I believe that DIHtmlParser should care for most of your needs:

Yes, that looks perfect. I wouldn't even have a problem with the license
or with paying for it, and I even still have D7. However, my program has
to run on our Debian 64-bit servers.

Regards
Roland

Re: stripping HTML

Felipe Monteiro de Carvalho
2011/4/17 Roland Schäfer <[hidden email]>:
> Yes, that looks perfect. I wouldn't even have a problem with the license
> or with paying for it, and I even still have D7. However, my program has
> to run on our Debian 64-bit servers.

You could contact the authors and say that you would like to buy a
license if it works in FPC linux-x86-64

--
Felipe Monteiro de Carvalho

Re: stripping HTML

Ralf Junker
On 17.04.2011 16:30, Felipe Monteiro de Carvalho wrote:

>> Yes, that looks perfect. I wouldn't even have a problem with the license
>> or with paying for it, and I even still have D7. However, my program has
>> to run on our Debian 64-bit servers.
>
> You could contact the authors and say that you would like to buy a
> license if it works in FPC linux-x86-64

I am the author of DIHtmlParser.

I do not know if DIHtmlParser compiles and works in FPC linux-x86-64
because I do not have that environment available for testing.
Unfortunately, low demand for that platform does not justify setting it
up and supporting it on a regular basis.

I can say, however, that the latest version source code is Pascal only
and compiles on FPC Win32 without platform warnings. But I do suspect
that it will need a few IFDEFs to make it Linux compatible. If so, I
would of course be glad to add any to the code so they will be available
in future versions.

However, having read Stefan's more detailed, off-topic requirements
description, I'd rather suggest that they come up with their own HTML
parser and text filter. It sounds too specific to me to be handled by
any standard component already available.

Ralf

Re: stripping HTML

Roland Schäfer
On 4/17/2011 5:13 PM, Ralf Junker wrote:

> I am the author of DIHtmlParser.
>
> I do not know if DIHtmlParser compiles and works in FPC linux-x86-64
> because I do not have that environment available for testing.
> Unfortunately, low demand for that platform does not justify setting it
> up and supporting it on a regular basis.
>
> I can say, however, that the latest version source code is Pascal only
> and compiles on FPC Win32 without platform warnings. But I do suspect
> that it will need a few IFDEFs to make it Linux compatible. If so, I
> would of course be glad to add any to the code so they will be available
> in future versions.

I would volunteer to try to compile it under GNU/Linux 32- and 64-bit if
there is some interest in such work. I could not guarantee success or
continued support, though, and I'm not sure whether that's acceptable for
a commercial product. Feel free to contact me off-list.

> However, having read Stefan's more detailed, off-topic requirements
> description, I'd rather suggest that they come up with their own HTML
> parser and text filter. It sounds too specific to me to be handled by
> any standard component already available.

I was hoping to save time on the simple stripping, but I guess I will
continue my work on a custom parser.

Thanks for the input.

Roland

P.S. How do you know my boss' name is Stefan?

