Class: TextExtractor (in HTML)
Object
|
+--HTML::Visitor
   |
   +--HTML::TextExtractor
      |
      +--HTML::RichTextExtractor
- Package: stx:goodies/webServer/htmlTree
- Category: Net-Documents-HTML-Utilities
- Version: rev: 1.5 date: 2018/04/26 10:33:35
  user: cg
  file: HTML__TextExtractor.st directory: goodies/webServer/htmlTree
  module: stx stc-classLibrary: htmlTree
- Author: Claus Gittinger
A tool to extract the raw text of some HTML
(either from a constructed tree or from a parser).
It can be used, for example, to extract strings for searching or for
conversion to plain ASCII.
CAVEAT:
I am not sure whether this implementation, in its current state, is
generic enough for other uses
(we may have to handle special cases such as <PRE>...</PRE> or
text within form elements to make this really correct).
extraction
- extractTextFromDocument: domTree
- extractTextFromHtmlString: htmlString

instance creation
- new
    return an initialized instance

accessing
- text

initialization
- initialize
    allow for a subclass to have this already initialized

visiting
- appendString: aString
- visitElement: anElement
    Default method for all html elements.
- visitString: aString
|b document x|

b := HTML::TreeBuilder new beginWith:(document := Document new).
b
    head;
    headEnd;
    body;
        table;
            tr;
                td; text:'aaa'; tdEnd;
                td; text:'bbb'; tdEnd;
            trEnd;
        tableEnd;
    bodyEnd.
document htmlString inspect.

x := HTML::TextExtractor new.
x visit:document.
x text inspect.


|document|

document := HTML::HTMLParser parseText:'<h1>Hello <b>World</b></h1>'.
(HTML::TextExtractor extractTextFromDocument:document) inspect.


(HTML::TextExtractor extractTextFromHtmlString:'<h1>Hello <b>World</b></h1>') inspect.
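As a further sketch of the searching use mentioned in the class description, the extracted text can be tested for a substring. This assumes that extractTextFromHtmlString: answers a String and that includesString: is understood by it; neither is stated in this documentation, and the variable names are illustrative only.

```smalltalk
|html text|

html := '<h1>Hello <b>World</b></h1>'.

"extract the raw text using the class-side convenience method listed under 'extraction'"
text := HTML::TextExtractor extractTextFromHtmlString:html.

"assumption: text is a String, so the usual substring test applies"
(text includesString:'World') inspect.
```

Because the caveat above notes that whitespace-sensitive elements may not be handled specially, it is probably safer to search for single words than for phrases that span several elements.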