Smalltalk/X Webserver

Documentation of class 'HTML::TextExtractor':

Class: TextExtractor (in HTML)

Inheritance
Description
Class protocol
- extraction
- instance creation
Instance protocol
Examples

Inheritance:

   Object
   |
   +--HTML::Visitor
      |
      +--HTML::TextExtractor
         |
         +--HTML::RichTextExtractor

Package: stx:goodies/webServer/htmlTree

Category: Net-Documents-HTML-Utilities

Version: rev: 1.5; date: 2018/04/26 10:33:35; user: cg; file: HTML__TextExtractor.st; directory: goodies/webServer/htmlTree; module: stx; stc-classLibrary: htmlTree

Author: Claus Gittinger

Description:

A tool to extract the raw text from HTML
(either from a constructed tree or from a parser).
It can be used, for example, to extract strings for searching
or to convert a document to raw ASCII.
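
As a minimal sketch of the searching use case (assuming the extracted text
is returned as a String that understands #includesString:, and using a
made-up sample document), one could write:

     |text found|

     "extract the plain text; the exact whitespace of the result is not assumed here"
     text := HTML::TextExtractor extractTextFromHtmlString:'<h1>Hello <b>World</b></h1>'.

     "search the extracted text rather than the markup"
     found := text includesString:'World'.
     found inspect.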

CAVEAT:
    I am not sure whether this implementation is generic enough for
    other uses in its current state
    (we may have to handle special cases such as <PRE>...</PRE> or
    text within form elements to make this really correct).

Class protocol:

extraction

extractTextFromDocument: domTree
extractTextFromHtmlString: htmlString

instance creation

new
    return an initialized instance

Instance protocol:

accessing

text

initialization

initialize
    allow for a subclass to have this already initialized

visiting

appendString: aString
visitElement: anElement
    default method for all HTML elements
visitString: aString

Examples:

     |b document x|

     "build a small document tree with the tree builder"
     b := HTML::TreeBuilder new beginWith:(document := Document new).
     b
        head;
        headEnd;
        body;
          table;
            tr;
              td; text:'aaa'; tdEnd;
              td; text:'bbb'; tdEnd;
            trEnd;
          tableEnd;
        bodyEnd.

     "show the generated HTML, then extract and show its plain text"
     document htmlString inspect.
     x := HTML::TextExtractor new.
     x visit:document.
     x text inspect.

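     |document|

     "parse an HTML string into a tree, then extract the text from the tree"
     document := HTML::HTMLParser parseText:'<h1>Hello <b>World</b></h1>'.
     (HTML::TextExtractor extractTextFromDocument:document) inspect.

     "extract the text directly from an HTML string"
     (HTML::TextExtractor extractTextFromHtmlString:'<h1>Hello <b>World</b></h1>') inspect.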

ST/X 7.1.0.0; WebServer 1.663 at exept.de:8081; Sun, 19 Apr 2026 12:38:18 GMT