Smalltalk/X Webserver

Documentation of class 'HTML::HTMLParser':

Class: HTMLParser (in HTML)


Inheritance:

   Object
   |
   +--HTML::HTMLParser

Package:
stx:goodies/webServer/htmlTree
Category:
Net-Documents-HTML-Utilities
Version:
rev: 1.53 date: 2018/04/26 10:33:50
user: cg
file: HTML__HTMLParser.st directory: goodies/webServer/htmlTree
module: stx stc-classLibrary: htmlTree
Author:
Claus Gittinger

Description:


Instances of this class are used to read HTML documents
and build a tree of HTML::Element objects.

Notice:
    this is a newer and better version of the (old) parser found in libhtml.
    Due to the space limitations at the time the old parser was written,
    it used a much simpler HTML model (a simple linked list),
    which is harder to process later.
    Please (try to) use this one for new projects.
    
IMPORTANT: 
    textScannedSoFar is in the characterEncoding of the input data. 
    Conversion takes place when a textBlock is finished!


Related information:

    Element

Class protocol:

initialization
o  ampersandEscapes

o  elementTypes
ElementTypes := nil.
HTMLParser initializeElementTypes

o  initialize
self initializeElementTypes.         -- now done lazily in #elementTypes
self initializeAmpersandEscapes.     -- now done lazily in #ampersandEscapes
self initializeMathAmpersandEscapes. -- now done lazily in #mathAmpersandEscapes
usage example(s):
     AmpersandEscapes := nil.
     HTMLParser initialize

     MathAmpersandEscapes := nil.
     HTMLParser initialize

o  initializeAmpersandEscapes
NOTE: we have some inconsistencies here.
We map ampersand escape chars to ISO-8859-1 codes,
and try to interpret them as some other encoding,
if characterDecoder is set
usage example(s):
     AmpersandEscapes := nil.
     HTMLParser initializeAmpersandEscapes

o  initializeElementTypes
ElementTypes := nil.
HTMLParser initializeElementTypes

o  initializeMathAmpersandEscapes
these are obsolete now, as HTML4 added the missing stuff in the meantime.
usage example(s):
     MathAmpersandEscapes := nil.
     HTMLParser initializeMathAmpersandEscapes

o  mathAmpersandEscapes

parsing
o  parseText: aStringOrStream
parse aStringOrStream; answer the parsed document
usage example(s):
     self parseText:'hello world - this is easy'
     self parseText:'hello < world > - this is easy'
     self parseText:'hello world this is easy'
     self parseText:'hello
world

this is easy'
     self parseText:'hello

  • world
  • foo

this is easy'
     self parseText:'

this is easy'
     self parseText:('../../doc/online/english/TOP.html' asFilename contentsOfEntireFile asString)
     self parseText:('../../doc/online/english/TOP.html' asFilename readStream)
     self parseText:('../../doc/online/english/TOP.html' asFilename contentsOfEntireFile asString)
     self parseText:'Bönnigheim - Startseite Bönnigheim' characterEncoding:#utf8

o  parseText: aStringOrStream characterEncoding: anEncodingString
parse aStringOrStream and answer the parsed document.
The character encoding of the input is specified by anEncodingString
(e.g. #utf8 or 'iso8859-1').
usage example(s):
     self
        parseText:('/tmp/DER-Tour-01.html' 
                        asFilename contentsOfEntireFile asString) characterEncoding:#utf8
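
As a hedged sketch of the two documented ways of naming the encoding (a symbol such as #utf8 or a string such as 'iso8859-1'), the following parses the same in-memory text once from a stream and once from a string. The sample markup and the variable names are illustrative assumptions; the text is plain ASCII so that it is valid in both encodings.

```smalltalk
|source asUtf8 asLatin1|

"illustrative input (an assumption); plain ASCII, so valid in both encodings"
source := '<html><body><p>hello world</p></body></html>'.

"encoding given as a symbol, input given as a stream"
asUtf8 := HTML::HTMLParser parseText:source readStream characterEncoding:#utf8.

"encoding given as a string, input given as a string"
asLatin1 := HTML::HTMLParser parseText:source characterEncoding:'iso8859-1'.

asUtf8 inspect.
asLatin1 inspect
```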


Instance protocol:

accessing
o  characterEncoding: aString
set the character set / encoding for the following text

o  docType

o  validate: aBoolean
turn off validation by passing false
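
A minimal sketch of switching validation off before parsing. It assumes that validate: has to be sent before parseText:; the sample markup (with deliberately unclosed elements) and the variable names are illustrative only.

```smalltalk
|parser document|

parser := HTML::HTMLParser new.

"assumption: the flag is set before parsing starts"
parser validate:false.

"illustrative input with unclosed elements"
document := parser parseText:'<p>unclosed paragraph <b>bold text'.
document inspect
```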

o  validating

error reporting
o  infoMessage: msg
the destination (optional) is whoever asked me to parse.

private
o  addElement: anElement

o  addProcessingInstruction: aProcessingInstruction

o  addText: aString
self error:'Text after end of html ignored' mayProceed:true.

o  classForType: aTypeSymbol
internal interface - return a markup element's class, given a typeSymbol
(such as #b, #pre or #'/pre')

o  elementFor: aString
given a markup string (such as 'b', 'pre' or '/pre'),
return a new markup instance
usage example(s):
HTMLParser basicNew elementFor:'IMG border=0 SRC="internal-gopher-unknown"'

o  endElement: markupText
self assert:(currentElement mustBeClosed not).

o  inPre
return true, if currently in a pre element.
(Do not strip separators of a text block if inside a pre)

public-scanning
o  parseText: aStringOrStream
parse some string, return a tree of markups
usage example(s):
     (HTMLParser new) parseText:'hello world - this is easy'
     (HTMLParser new) parseText:'hello < world > - this is easy'
     (HTMLParser new) parseText:'hello world this is easy'
     (HTMLParser new) parseText:'hello
world

this is easy'
     (HTMLParser new) parseText:'hello

  • world
  • foo

this is easy'
     (HTMLParser new) parseText:'

this is easy'
     (HTMLParser new) parseText:('../../doc/online/english/TOP.html' asFilename contentsOfEntireFile asString)
     (HTMLParser new) parseText:('../../doc/online/english/TOP.html' asFilename readStream)
     (HTMLParser new) parseText:('../../doc/online/english/programming/viewintro.html' asFilename contentsOfEntireFile asString)

o  parseText: aStringOrStream characterEncoding: anEncodingString

o  parseText: aStringOrStream withBindings: metaBindings
parse some string, return a tree of HTMLMarkups.
Ampersand variables (e.g. &url) are expanded as given in the
metaBindings dictionary.
(This seems to be non-standard HTML, but is used in HotJava.)
The destination is only required for scripts, which may want to access
the document very early.
usage example(s):
     (HTMLParser new) parseText:'hello world - this is easy'
     (HTMLParser new) parseText:'hello < world > - this is easy'
     (HTMLParser new) parseText:'hello world this is easy'
     (HTMLParser new) parseText:'hello
world

this is easy'
     (HTMLParser new) parseText:'hello

  • world
  • foo

this is easy'
     (HTMLParser new) parseText:'

this is easy'
     (HTMLParser new) parseText:('../../doc/online/english/TOP.html' asFilename contentsOfEntireFile asString)
     (HTMLParser new) parseText:('../../doc/online/english/programming/viewintro.html' asFilename contentsOfEntireFile asString)
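
None of the examples above passes a bindings dictionary, so here is a hedged sketch of how an ampersand variable might be expanded. It assumes that the dictionary is keyed by the variable name as a string without the leading ampersand; the page does not state the key format, and the markup, the URL value and the variable names are illustrative only.

```smalltalk
|bindings document|

"assumption: keys are variable names without the leading '&'"
bindings := Dictionary new.
bindings at:'url' put:'http://www.exept.de'.

"illustrative input that refers to the &url variable"
document := (HTMLParser new)
                parseText:'<a href="&url">home</a>'
                withBindings:bindings.
document inspect
```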

o  parseText: aStringOrStream withBindings: metaBindings for: aDestination
parse some string, return a tree of HTMLMarkups.
Ampersand variables (e.g. &url) are expanded as given in the
metaBindings dictionary.
(This seems to be non-standard HTML, but is used in HotJava.)
The destination is only required for scripts, which may want to access
the document very early.
usage example(s):
     (HTMLParser new) parseText:'hello world - this is easy'
     (HTMLParser new) parseText:'hello < world > - this is easy'
     (HTMLParser new) parseText:'hello world this is easy'
     (HTMLParser new) parseText:'hello
world

this is easy'
     (HTMLParser new) parseText:'hello

  • world
  • foo

this is easy'
     (HTMLParser new) parseText:'

this is easy'
     (HTMLParser new) parseText:('../../doc/online/english/TOP.html' asFilename contentsOfEntireFile asString)
     (HTMLParser new) parseText:('../../doc/online/english/programming/viewintro.html' asFilename contentsOfEntireFile asString)

scanning
o  ampersandEscape
parse an ampersand escape; the '&' has already been read.

o  ampersandEscape: aString
return a new string, containing the ampersand escape character.
Expects aString to NOT contain the initial ampersand.
usage example(s):
     (HTMLParser new) ampersandEscape:'lt'
     (HTMLParser new) ampersandEscape:'ouml'
     (HTMLParser new) ampersandEscape:'#32'
     (HTMLParser new) ampersandEscape:'apos'

     (HTMLParser new) parseText:'hello α β γ normal'
     (HTMLParser new) parseText:'helloworld

this is easy'

o  ampersandEscapeString
parse an ampersand escape; the '&' has already been read.
Return the escape string.

o  collectParametersFrom: text
sigh; '-' is allowed ...

o  extractMetaInformationFrom: metaElement
<mime-type> ; charset=

o  finishTextBlock
finish a scanned textBlock; add it to the markup list

o  parseMarkup
'<' has been detected; parse and return a markup element

o  startNewTextBlock

scripts

o  parseJavaScriptFrom: scriptStream
HTML

o  parseSmalltalkScriptFrom: scriptStream

o  script: element
a <script> TAG was encountered.
check for the language (which defaults to javaScript) and dispatch
to a script language handler.

o  script_javascript: element
a <script language=javaScript> TAG was encountered.
parse the script, and construct the scriptObject

o  script_smalltalkscript: element
a <script language=smalltalkScript> TAG was encountered.
parse the script, and construct the scriptObject (which has the methods in
its anonymous class)


Examples:


ElementTypes := nil. HTMLParser initializeElementTypes

  |p in document|

  p := HTML::HTMLParser new.
  in := '<head>
<? bla bla bla ?>
<!-- bla bla bla -->
<!-- 
bla bla bla -->
<!-- 
bla bla bla 
-->
</head>
' readStream.
  document := p parseText:in.
  in close.
  document inspect

  |p in document|

  p := HTML::HTMLParser new.
  in := '../../doc/online/english/TOP.html' asFilename readStream.
  document := p parseText:in.
  in close.
  document inspect

  |p in document|

  p := HTML::HTMLParser new. 
  in := '../../../exept/expecco/projects/not_delivered/buggyWebShopDemo/selenium_tests/buggyWebshop_bestellung'
               asFilename readStream.
  document := p parseText:in.
  in close.
  document inspect.

  |p in document|

  p := HTML::HTMLParser new. 
  in := '../../../exept/expecco/projects/not_delivered/buggyWebShopDemo/selenium_tests/buggyWebshop_checkImages'
               asFilename readStream.
  document := p parseText:in.
  in close.
  document inspect.

  |p in document|

  p := HTML::HTMLParser new. 
  in := '../../../exept/expecco/projects/not_delivered/buggyWebShopDemo/selenium_tests/buggyWebshop_checkLinks'
               asFilename readStream.
  document := p parseText:in.
  in close.
  document inspect.

  |p in document|

  p := HTML::HTMLParser new.
  in := '
<?xml version=''1.0'' encoding=''UTF-8''?>
<!DOCTYPE html PUBLIC ''-//W3C//DTD XHTML 1.0 Strict//EN'' ''http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd''>
<html xmlns=''http://www.w3.org/1999/xhtml'' xml:lang=''en'' lang=''en''>
<head profile=''http://selenium-ide.openqa.org/profiles/test-case''>
<meta http-equiv=''Content-Type'' content=''text/html; charset=UTF-8'' />
<link rel=''selenium.base'' href='''' />
<title>New Test</title>
</head>
<body>
<table cellpadding=''1'' cellspacing=''1'' border=''1''>
<thead>
<tr><td rowspan=''1'' colspan=''3''>New Test</td></tr>
</thead><tbody>

</tbody></table>
</body>
</html>
' readStream.
  document := p parseText:in.
  in close.
  document inspect


ST/X 7.1.0.0; WebServer 1.663 at exept.de:8081; Mon, 04 Aug 2025 16:50:38 GMT