htmlParser (YaCy API: javadoc documentation)

java.lang.Object
- net.yacy.document.AbstractParser
- - net.yacy.document.parser.htmlParser

All Implemented Interfaces: Parser

public class htmlParser
extends AbstractParser
implements Parser

Nested Class Summary
- Nested classes/interfaces inherited from interface net.yacy.document.Parser
  Parser.Failure

Field Summary

Fields
Modifier and Type	Field and Description
`static java.lang.String[]`	`htmlExtensions`
`static java.util.Set<java.lang.String>`	`htmlExtensionsSet`
`private static int`	`maxLinks`
- Fields inherited from class net.yacy.document.AbstractParser
  log, SUPPORTED_EXTENSIONS, SUPPORTED_MIME_TYPES

Constructor Summary

Constructors
Constructor and Description
`htmlParser()`

Method Summary

Methods
Modifier and Type	Method and Description
`static void`	`main(java.lang.String[] args)`
`Document[]`	`parse(DigestURL location, java.lang.String mimeType, java.lang.String documentCharset, VocabularyScraper vocscraper, int timezoneOffset, java.io.InputStream sourceStream)` parse an input stream
`private Document`	`parseAlternativeSnapshot(DigestURL location, java.lang.String mimeType, java.lang.String documentCharset, VocabularyScraper vocscraper, int timezoneOffset)` Implementation of the AJAX crawling scheme: crawls the content of the HTML snapshot page instead of the (empty) original AJAX URL; see https://developers.google.com/webmasters/ajax-crawling/docs/specification. The AJAX crawling scheme is denoted by a URL whose anchor parameter starts with "!" (1) or by a header tag. The check for the AJAX crawling scheme is expected to have happened already, so the snapshot page can be fetched directly.
`static ContentScraper`	`parseToScraper(DigestURL location, java.lang.String documentCharset, VocabularyScraper vocabularyScraper, java.nio.charset.Charset[] detectedcharsetcontainer, int timezoneOffset, java.io.InputStream sourceStream, int maxLinks)`
`static ContentScraper`	`parseToScraper(DigestURL location, java.lang.String documentCharset, VocabularyScraper vocabularyScraper, int timezoneOffset, java.lang.String input, int maxLinks)`
`static java.lang.String`	`patchCharsetEncoding(java.lang.String encoding)` Some HTML authors use wrong encoding names, either because they do not know exactly what they are doing or because they make a typo.
`private static Document`	`transformScraper(DigestURL location, java.lang.String mimeType, java.lang.String charSet, ContentScraper scraper)` the transformScraper method transforms a scraper object into a document object

Methods inherited from class net.yacy.document.AbstractParser
equals, getName, hashCode, singleList, supportedExtensions, supportedMimeTypes

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface net.yacy.document.Parser
equals, getName, hashCode, supportedExtensions, supportedMimeTypes

Field Detail

maxLinks
```
private static final int maxLinks
```
See Also:
Constant Field Values

htmlExtensions

public static final java.lang.String[] htmlExtensions

htmlExtensionsSet

public static final java.util.Set<java.lang.String> htmlExtensionsSet

Constructor Detail
htmlParser
```
public htmlParser()
```

Method Detail

parse
```
public Document[] parse(DigestURL location,
               java.lang.String mimeType,
               java.lang.String documentCharset,
               VocabularyScraper vocscraper,
               int timezoneOffset,
               java.io.InputStream sourceStream)
                 throws Parser.Failure,
                        java.lang.InterruptedException
```
Description copied from interface: Parser

parse an input stream

Specified by:

parse in interface Parser

Parameters:
location - the url of the source
mimeType - the mime type of the source, if known
documentCharset - the charset of the source, if known
vocscraper - an entity scraper to detect facets from text annotation context
sourceStream - an input stream

Returns:
a list of documents that result from parsing the source

Throws:

Parser.Failure

java.lang.InterruptedException

transformScraper

private static Document transformScraper(DigestURL location,
                        java.lang.String mimeType,
                        java.lang.String charSet,
                        ContentScraper scraper)

Transforms a scraper object into a document object.

Parameters:
location -
mimeType -
charSet -
scraper -

Returns:
the document built from the scraper

parseToScraper

public static ContentScraper parseToScraper(DigestURL location,
                            java.lang.String documentCharset,
                            VocabularyScraper vocabularyScraper,
                            int timezoneOffset,
                            java.lang.String input,
                            int maxLinks)
                                     throws java.io.IOException

Throws:
java.io.IOException

parseToScraper

public static ContentScraper parseToScraper(DigestURL location,
                            java.lang.String documentCharset,
                            VocabularyScraper vocabularyScraper,
                            java.nio.charset.Charset[] detectedcharsetcontainer,
                            int timezoneOffset,
                            java.io.InputStream sourceStream,
                            int maxLinks)
                                     throws Parser.Failure,
                                            java.io.IOException

Throws:
Parser.Failure
java.io.IOException

patchCharsetEncoding
```
public static java.lang.String patchCharsetEncoding(java.lang.String encoding)
```
Some HTML authors use wrong encoding names, either because they do not know exactly what they are doing or because they make a typo. Often the upper/lower-case spelling of the name is also inconsistent. This method patches wrong encoding names. The correct names are taken from http://www.iana.org/assignments/character-sets

Parameters:
encoding -

Returns:
patched encoding name
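
The kind of repair described above can be illustrated with a small standalone sketch. It is not YaCy's implementation: the class name `CharsetNameSketch`, the method `patch`, and the specific repair rules are assumptions chosen for illustration, and the real method may cover different misspellings.

```java
import java.nio.charset.Charset;
import java.util.Locale;

/** Hypothetical helper for illustration only; not the YaCy implementation. */
final class CharsetNameSketch {

    /** Returns a cleaned-up charset name, or null if the input is null or blank. */
    static String patch(String encoding) {
        if (encoding == null) return null;
        // Trim whitespace, drop stray quotes, and normalise the case.
        String name = encoding.trim().replace("\"", "").replace("'", "")
                .toUpperCase(Locale.ROOT);
        if (name.isEmpty()) return null;
        // Assumed repair rules for a few common misspellings.
        if (name.equals("UTF8") || name.equals("UTF_8")) return "UTF-8";
        if (name.matches("ISO[-_ ]?8859[-_ ]?\\d{1,2}")) {
            // Keep only the part number and rebuild the registered spelling.
            return "ISO-8859-" + name.replaceAll("^ISO[-_ ]?8859[-_ ]?", "");
        }
        if (name.matches("(WINDOWS|WIN|CP)[-_ ]?125\\d")) {
            // The last four characters are the code page number, e.g. 1252.
            return "windows-" + name.substring(name.length() - 4);
        }
        return name;
    }

    public static void main(String[] args) {
        for (String raw : new String[] {"utf8", " iso8859-1 ", "win-1252"}) {
            String patched = patch(raw);
            // Charset.isSupported reports whether this JVM knows the patched name.
            System.out.println(raw + " -> " + patched
                    + " (supported: " + Charset.isSupported(patched) + ")");
        }
    }
}
```

A caller would typically patch the declared name first and only then pass it to `Charset.forName`, falling back to a default charset when the patched name is still unsupported.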

parseAlternativeSnapshot
```
private Document parseAlternativeSnapshot(DigestURL location,
                                java.lang.String mimeType,
                                java.lang.String documentCharset,
                                VocabularyScraper vocscraper,
                                int timezoneOffset)
```
Implementation of the AJAX crawling scheme: crawls the content of the HTML snapshot page instead of the (empty) original AJAX URL; see https://developers.google.com/webmasters/ajax-crawling/docs/specification. The AJAX crawling scheme is denoted by a URL whose anchor parameter starts with "!" (1) or by a header tag. The check for the AJAX crawling scheme is expected to have happened already, so the snapshot page can be fetched directly.

Parameters:
location - original url (ajax url)
mimeType -
documentCharset -
vocscraper -
timezoneOffset -

Returns:
document as result of parsed snapshot or null if not exist or on any other issue with snapshot
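
For orientation, the referenced AJAX crawling specification maps a `#!` URL to a snapshot URL by moving the fragment into an `_escaped_fragment_` query parameter. The following is a minimal sketch of that mapping under stated assumptions: the class `AjaxSnapshotUrlSketch` and its methods are hypothetical names, the escaping is simplified, and how this parser actually builds the snapshot URL is not documented here.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not the YaCy implementation. */
final class AjaxSnapshotUrlSketch {

    /** Returns the snapshot URL for a "#!" URL, or null if there is no "#!" fragment. */
    static String toSnapshotUrl(String url) {
        int pos = url.indexOf("#!");
        if (pos < 0) return null;
        String base = url.substring(0, pos);
        String fragment = url.substring(pos + 2);
        // Append to an existing query string with '&', otherwise start one with '?'.
        char separator = base.indexOf('?') >= 0 ? '&' : '?';
        return base + separator + "_escaped_fragment_=" + escape(fragment);
    }

    /** Percent-escapes bytes that would break the query string (simplified rule set). */
    private static String escape(String fragment) {
        StringBuilder sb = new StringBuilder();
        for (byte b : fragment.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean special = c <= 0x20 || c >= 0x7F
                    || c == '#' || c == '%' || c == '&' || c == '+';
            if (special) {
                sb.append(String.format("%%%02X", c));
            } else {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSnapshotUrl("http://example.com/#!key=value"));
    }
}
```

Returning null for a URL without a `#!` fragment mirrors the documented contract above, where a missing snapshot yields null rather than an exception.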

main

public static void main(java.lang.String[] args)
