Smalltalk/X Webserver

Documentation of class 'HTML::TextExtractor':

Class: TextExtractor (in HTML)


Inheritance:

   Object
   |
   +--HTML::Visitor
      |
      +--HTML::TextExtractor
         |
         +--HTML::RichTextExtractor

Package:
stx:goodies/webServer/htmlTree
Category:
Net-Documents-HTML-Utilities
Version:
rev: 1.5 date: 2018/04/26 10:33:35
user: cg
file: HTML__TextExtractor.st directory: goodies/webServer/htmlTree
module: stx stc-classLibrary: htmlTree
Author:
Claus Gittinger

Description:


A tool to extract the raw text from an HTML document
(either a programmatically constructed tree or one returned by a parser).
It can be used, for example, to extract strings for searching
or to convert a document to plain ASCII text.

CAVEAT:
    I am not sure whether this implementation, in its current state,
    is generic enough for other uses
    (to make it really correct, we may have to handle special cases
    such as <PRE>...</PRE> or text within form elements).


Class protocol:

extraction
o  extractTextFromDocument: domTree

o  extractTextFromHtmlString: htmlString

instance creation
o  new
return an initialized instance


Instance protocol:

accessing
o  text

initialization
o  initialize
allow for a subclass to have this already initialized

visiting
o  appendString: aString

o  visitElement: anElement
Default method for all HTML elements.

o  visitString: aString


Examples:


     |b document x|

     b := HTML::TreeBuilder new beginWith:(document := Document new).
     b 
        head;
        headEnd;
        body;
          table;
            tr;
              td; text:'aaa'; tdEnd;
              td; text:'bbb'; tdEnd;
            trEnd;
          tableEnd;
        bodyEnd.

     document htmlString inspect.
     x := HTML::TextExtractor new.
     x visit:document.
     x text inspect.


     |document|

     document := HTML::HTMLParser parseText:'<h1>Hello <b>World</b></h1>'.
     (HTML::TextExtractor extractTextFromDocument:document) inspect.
     (HTML::TextExtractor extractTextFromHtmlString:'<h1>Hello <b>World</b></h1>') inspect.
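

A further sketch, following the description's remark about searching: extract the
text once and then search the result. It assumes that the extracted text is an
ordinary String. indexOfSubCollection: is a general collection method and not part
of this class; the sample markup and the variable name are made up for illustration.

     |text|

     "assumption: the extractor answers a plain String"
     text := HTML::TextExtractor extractTextFromHtmlString:'<p>Hello <b>World</b></p>'.

     "indexOfSubCollection: answers 0 if the substring is not present"
     (text indexOfSubCollection:'World') ~= 0 ifTrue:[
         Transcript showCR:'found'
     ].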


ST/X 7.1.0.0; WebServer 1.663 at exept.de:8081; Mon, 04 Aug 2025 14:16:14 GMT