Class: HTMLParser (in HTML)
Object
|
+--HTML::HTMLParser
- Package: stx:goodies/webServer/htmlTree
- Category: Net-Documents-HTML-Utilities
- Version: rev: 1.53 date: 2018/04/26 10:33:50
- user: cg
- file: HTML__HTMLParser.st directory: goodies/webServer/htmlTree
- module: stx stc-classLibrary: htmlTree
- Author: Claus Gittinger
Instances of this class are used to read HTML documents
and build a tree of HTML::Element objects.
Notice:
this is a newer and better version of the (old) parser found in libhtml.
Due to the space limitations at the time the old parser was written,
it used a much simpler HTML model (a simple linked list),
which is harder to process later.
Please (try to) use this one for new projects.
IMPORTANT:
textScannedSoFar is in the characterEncoding of the input data.
Conversion takes place when a textBlock is finished!
Element
initialization
-
ampersandEscapes
-
-
elementTypes
-
ElementTypes := nil.
HTMLParser initializeElementTypes
-
initialize
-
self initializeElementTypes. -- now done lazily in #elementTypes
self initializeAmpersandEscapes. -- now done lazily in #ampersandEscapes
self initializeMathAmpersandEscapes. -- now done lazily in #mathAmpersandEscapes
usage example(s):
AmpersandEscapes := nil.
HTMLParser initialize
MathAmpersandEscapes := nil.
HTMLParser initialize
|
-
initializeAmpersandEscapes
-
NOTE: there is an inconsistency here.
Ampersand escape characters are mapped to ISO-8859-1 codes,
but they are then interpreted in some other encoding
if characterDecoder is set.
usage example(s):
AmpersandEscapes := nil.
HTMLParser initializeAmpersandEscapes
|
-
initializeElementTypes
-
ElementTypes := nil.
HTMLParser initializeElementTypes
-
initializeMathAmpersandEscapes
-
these are obsolete now, as HTML4 added the missing stuff in the meantime.
usage example(s):
MathAmpersandEscapes := nil.
HTMLParser initializeMathAmpersandEscapes
|
-
mathAmpersandEscapes
-
parsing
-
parseText: aStringOrStream
-
parse aStringOrStream; answer the parsed document
usage example(s):
self parseText:'hello world - this is easy'
self parseText:'hello < world > - this is easy'
self parseText:'hello world this is easy'
self parseText:'hello world this is easy'
self parseText:'hello this is easy'
self parseText:' this is easy'
self
parseText:('../../doc/online/english/TOP.html'
asFilename contentsOfEntireFile asString)
self
parseText:('../../doc/online/english/TOP.html'
asFilename readStream)
self
parseText:('../../doc/online/english/TOP.html'
asFilename contentsOfEntireFile asString)
self parseText:'Bönnigheim - Startseite Bönnigheim' characterEncoding:#utf8
|
-
parseText: aStringOrStream characterEncoding: anEncodingString
-
parse aStringOrStream and answer the parsed document.
The character encoding of the input is specified by anEncodingString
(e.g. #utf8 or 'iso8859-1').
usage example(s):
self
parseText:('/tmp/DER-Tour-01.html'
asFilename contentsOfEntireFile asString) characterEncoding:#utf8
|
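When the input is a file stream, it can help to make sure the stream is closed even if parsing signals an error. The following is only a sketch: 'page.html' is a placeholder file name (an assumption, not part of the original examples), and ensure: is the general Smalltalk block protocol rather than anything specific to this class.

```smalltalk
|in document|

"'page.html' is a hypothetical file, assumed to be UTF-8 encoded"
in := 'page.html' asFilename readStream.
[
    document := HTML::HTMLParser parseText:in characterEncoding:#utf8.
] ensure:[
    "evaluated whether or not parsing completed normally"
    in close.
].
document inspect
```

Passing the stream directly avoids reading the whole file into a string first, as the contentsOfEntireFile variant does.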
accessing
-
characterEncoding: aString
-
set the character set / encoding for the following text
-
docType
-
-
validate: aBoolean
-
turn off validation by passing false
-
validating
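A possible way to use these accessors is to configure a parser instance before parsing. This is a minimal sketch, assuming that #validating answers the flag previously set with #validate: (its return value is not documented here); the HTML string is only an illustration.

```smalltalk
|parser document|

parser := HTML::HTMLParser new.
parser validate:false.      "turn off validation, as described for #validate:"

"assumption: #validating answers the current setting"
Transcript showCR:(parser validating printString).

document := parser parseText:'<p>hello <b>world</b></p>'.
document inspect
```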
-
error reporting
-
infoMessage: msg
-
destination is the (optional) object that asked me to parse.
private
-
addElement: anElement
-
-
addProcessingInstruction: aProcessingInstruction
-
-
addText: aString
-
self error:'Text after end of html ignored' mayProceed:true.
-
classForType: aTypeSymbol
-
internal interface - return a markup element's class, given a typeSymbol
(such as #b, #pre or #'/pre')
-
elementFor: aString
-
given a markup string (such as 'b', 'pre' or '/pre'),
return a new markup instance
usage example(s):
HTMLParser basicNew elementFor:'IMG border=0 SRC="internal-gopher-unknown"'
|
-
endElement: markupText
-
self assert:(currentElement mustBeClosed not).
-
inPre
-
return true, if currently in a pre element.
(Do not strip separators of a text block if inside a pre)
public-scanning
-
parseText: aStringOrStream
-
parse some string, return a tree of markups
usage example(s):
(HTMLParser new) parseText:'hello world - this is easy'
(HTMLParser new) parseText:'hello < world > - this is easy'
(HTMLParser new) parseText:'hello world this is easy'
(HTMLParser new) parseText:'hello world this is easy'
(HTMLParser new) parseText:'hello this is easy'
(HTMLParser new) parseText:' this is easy'
(HTMLParser new)
parseText:('../../doc/online/english/TOP.html'
asFilename contentsOfEntireFile asString)
(HTMLParser new)
parseText:('../../doc/online/english/TOP.html'
asFilename readStream)
(HTMLParser new)
parseText:('../../doc/online/english/programming/viewintro.html'
asFilename contentsOfEntireFile asString)
|
-
parseText: aStringOrStream characterEncoding: anEncodingString
-
-
parseText: aStringOrStream withBindings: metaBindings
-
parse some string, return a tree of HTMLMarkups.
Ampersand variables (e.g. &url) are expanded as given in the
metaBindings dictionary.
(This seems to be non-standard HTML, but is used in HotJava.)
The destination is only required for scripts, which may want to access
the document very early.
usage example(s):
(HTMLParser new) parseText:'hello world - this is easy'
(HTMLParser new) parseText:'hello < world > - this is easy'
(HTMLParser new) parseText:'hello world this is easy'
(HTMLParser new) parseText:'hello world this is easy'
(HTMLParser new) parseText:'hello this is easy'
(HTMLParser new) parseText:' this is easy'
(HTMLParser new)
parseText:('../../doc/online/english/TOP.html'
asFilename contentsOfEntireFile asString)
(HTMLParser new)
parseText:('../../doc/online/english/programming/viewintro.html'
asFilename contentsOfEntireFile asString)
|
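None of the examples listed for this method pass a bindings dictionary, so here is a hedged sketch of what a call might look like. It assumes that the dictionary keys are the variable names as strings without the leading ampersand; the source does not state the key format, so symbols or another form may be required. The URL value is a placeholder.

```smalltalk
|bindings document|

"assumption: keys are variable names without the leading '&'"
bindings := Dictionary new.
bindings at:'url' put:'http://www.exept.de'.   "placeholder value"

document := (HTML::HTMLParser new)
                parseText:'see &url for details'
                withBindings:bindings.
document inspect
```

Inspecting the resulting tree is a quick way to check whether &url was actually expanded under these assumptions.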
-
parseText: aStringOrStream withBindings: metaBindings for: aDestination
-
parse some string, return a tree of HTMLMarkups.
Ampersand variables (e.g. &url) are expanded as given in the
metaBindings dictionary.
(This seems to be non-standard HTML, but is used in HotJava.)
The destination is only required for scripts, which may want to access
the document very early.
usage example(s):
(HTMLParser new) parseText:'hello world - this is easy'
(HTMLParser new) parseText:'hello < world > - this is easy'
(HTMLParser new) parseText:'hello world this is easy'
(HTMLParser new) parseText:'hello world this is easy'
(HTMLParser new) parseText:'hello this is easy'
(HTMLParser new) parseText:' this is easy'
(HTMLParser new)
parseText:('../../doc/online/english/TOP.html'
asFilename contentsOfEntireFile asString)
(HTMLParser new)
parseText:('../../doc/online/english/programming/viewintro.html'
asFilename contentsOfEntireFile asString)
|
scanning
-
ampersandEscape
-
parse an ampersand escape; the '&' has already been read.
-
ampersandEscape: aString
-
return a new string, containing the ampersand escape character.
Expects aString to NOT contain the initial ampersand.
usage example(s):
(HTMLParser new) ampersandEscape:'lt'
(HTMLParser new) ampersandEscape:'ouml'
(HTMLParser new) ampersandEscape:'#32'
(HTMLParser new) ampersandEscape:'apos'
(HTMLParser new) parseText:'hello α β γ normal'
(HTMLParser new) parseText:'hello
-
ampersandEscapeString
-
parse an ampersand escape; the '&' has already been read.
Return the escape string.
-
collectParametersFrom: text
-
sigh; '-' is allowed ...
-
extractMetaInformationFrom: metaElement
-
<mime-type> ; charset=
-
finishTextBlock
-
finish a scanned textBlock; add it to the markup list
-
parseMarkup
-
'<' has been detected; parse and return a markup element
-
startNewTextBlock
-
scripts
-
parseJavaScriptFrom: scriptStream
-
HTML
-
parseSmalltalkScriptFrom: scriptStream
-
-
script: element
-
a <script> TAG was encountered.
check for the language (which defaults to javaScript) and dispatch
to a script language handler.
-
script_javascript: element
-
a <script language=javaScript> TAG was encountered.
parse the script, and construct the scriptObject
-
script_smalltalkscript: element
-
a <script language=smalltalkScript> TAG was encountered.
parse the script, and construct the scriptObject (which has the methods in
its anonymous class)
ElementTypes := nil.
HTMLParser initializeElementTypes
|p in document|
p := HTML::HTMLParser new.
in := '<head>
<? bla bla bla ?>
<!-- bla bla bla -->
<!--
bla bla bla -->
<!--
bla bla bla
-->
</head>
' readStream.
document := p parseText:in.
in close.
document inspect
|
|p in document|
p := HTML::HTMLParser new.
in := '../../doc/online/english/TOP.html' asFilename readStream.
document := p parseText:in.
in close.
document inspect
|
|p in document|
p := HTML::HTMLParser new.
in := '../../../exept/expecco/projects/not_delivered/buggyWebShopDemo/selenium_tests/buggyWebshop_bestellung'
asFilename readStream.
document := p parseText:in.
in close.
document inspect.
|
|p in document|
p := HTML::HTMLParser new.
in := '../../../exept/expecco/projects/not_delivered/buggyWebShopDemo/selenium_tests/buggyWebshop_checkImages'
asFilename readStream.
document := p parseText:in.
in close.
document inspect.
|
|p in document|
p := HTML::HTMLParser new.
in := '../../../exept/expecco/projects/not_delivered/buggyWebShopDemo/selenium_tests/buggyWebshop_checkLinks'
asFilename readStream.
document := p parseText:in.
in close.
document inspect.
|
|p in document|
p := HTML::HTMLParser new.
in := '
<?xml version=''1.0'' encoding=''UTF-8''?>
<!DOCTYPE html PUBLIC ''-//W3C//DTD XHTML 1.0 Strict//EN'' ''http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd''>
<html xmlns=''http://www.w3.org/1999/xhtml'' xml:lang=''en'' lang=''en''>
<head profile=''http://selenium-ide.openqa.org/profiles/test-case''>
<meta http-equiv=''Content-Type'' content=''text/html; charset=UTF-8'' />
<link rel=''selenium.base'' href='''' />
<title>New Test</title>
</head>
<body>
<table cellpadding=''1'' cellspacing=''1'' border=''1''>
<thead>
<tr><td rowspan=''1'' colspan=''3''>New Test</td></tr>
</thead><tbody>
</tbody></table>
</body>
</html>
' readStream.
document := p parseText:in.
in close.
document inspect
|
ST/X 7.1.0.0; WebServer 1.663 at exept.de:8081; Mon, 04 Aug 2025 11:41:55 GMT