public class htmlParser extends AbstractParser implements Parser
Nested classes/interfaces inherited from interface Parser:
Parser.Failure
Modifier and Type | Field and Description |
---|---|
static java.lang.String[] |
htmlExtensions |
static java.util.Set<java.lang.String> |
htmlExtensionsSet |
private static int |
maxLinks |
Fields inherited from class AbstractParser:
log, SUPPORTED_EXTENSIONS, SUPPORTED_MIME_TYPES
Constructor and Description |
---|
htmlParser() |
Modifier and Type | Method and Description |
---|---|
static void |
main(java.lang.String[] args) |
Document[] |
parse(DigestURL location,
java.lang.String mimeType,
java.lang.String documentCharset,
VocabularyScraper vocscraper,
int timezoneOffset,
java.io.InputStream sourceStream)
parse an input stream
|
private Document |
parseAlternativeSnapshot(DigestURL location,
java.lang.String mimeType,
java.lang.String documentCharset,
VocabularyScraper vocscraper,
int timezoneOffset)
Implementation of the ajax crawling scheme: crawls the content of the html snapshot page
instead of the (empty) original ajax url.
See https://developers.google.com/webmasters/ajax-crawling/docs/specification
The ajax crawling scheme is denoted by a url whose anchor part starts with "!"
or by a header tag (the specification uses a meta tag named "fragment" with content "!").
The check for the ajax crawling scheme is expected to have happened already, so the
snapshot page can be requested directly.
|
static ContentScraper |
parseToScraper(DigestURL location,
java.lang.String documentCharset,
VocabularyScraper vocabularyScraper,
java.nio.charset.Charset[] detectedcharsetcontainer,
int timezoneOffset,
java.io.InputStream sourceStream,
int maxLinks) |
static ContentScraper |
parseToScraper(DigestURL location,
java.lang.String documentCharset,
VocabularyScraper vocabularyScraper,
int timezoneOffset,
java.lang.String input,
int maxLinks) |
static java.lang.String |
patchCharsetEncoding(java.lang.String encoding)
Some html authors use wrong encoding names, either because they don't know exactly what they
are doing or because they made a typo.
|
private static Document |
transformScraper(DigestURL location,
java.lang.String mimeType,
java.lang.String charSet,
ContentScraper scraper)
the transformScraper method transforms a scraper object into a document object
|
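Methods inherited from class AbstractParser:
equals, getName, hashCode, singleList, supportedExtensions, supportedMimeTypes
Methods inherited from class java.lang.Object:
clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Parser:
equals, getName, hashCode, supportedExtensions, supportedMimeTypes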
private static final int maxLinks
public static final java.lang.String[] htmlExtensions
public static final java.util.Set<java.lang.String> htmlExtensionsSet
public Document[] parse(DigestURL location, java.lang.String mimeType, java.lang.String documentCharset, VocabularyScraper vocscraper, int timezoneOffset, java.io.InputStream sourceStream) throws Parser.Failure, java.lang.InterruptedException
Specified by:
parse in interface Parser
Parameters:
location - the url of the source
mimeType - the mime type of the source, if known
documentCharset - the charset of the source, if known
vocscraper - an entity scraper to detect facets from text annotation context
sourceStream - an input stream
Throws:
Parser.Failure
java.lang.InterruptedException
private static Document transformScraper(DigestURL location, java.lang.String mimeType, java.lang.String charSet, ContentScraper scraper)
Parameters:
location
mimeType
charSet
scraper
public static ContentScraper parseToScraper(DigestURL location, java.lang.String documentCharset, VocabularyScraper vocabularyScraper, int timezoneOffset, java.lang.String input, int maxLinks) throws java.io.IOException
Throws:
java.io.IOException
public static ContentScraper parseToScraper(DigestURL location, java.lang.String documentCharset, VocabularyScraper vocabularyScraper, java.nio.charset.Charset[] detectedcharsetcontainer, int timezoneOffset, java.io.InputStream sourceStream, int maxLinks) throws Parser.Failure, java.io.IOException
Throws:
Parser.Failure
java.io.IOException
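The `detectedcharsetcontainer` parameter of the stream variant is a `Charset[]`. In Java a one-element array is commonly used as an out-parameter, and the sketch below illustrates that pattern only. It is an assumption that the parameter is used this way. The class `CharsetOutParamSketch`, the method `decode` and the fallback to UTF-8 are hypothetical and are not part of `htmlParser`. It requires Java 9 or later for `InputStream.readAllBytes`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of htmlParser: shows a Charset[] used as an out-parameter.
final class CharsetOutParamSketch {

    // Decodes the stream with the declared charset (assumed fallback: UTF-8) and
    // reports the charset that was actually used through the first array slot.
    static String decode(String declaredCharset, Charset[] detectedCharsetContainer, InputStream in)
            throws IOException {
        Charset charset;
        try {
            charset = (declaredCharset == null)
                    ? StandardCharsets.UTF_8
                    : Charset.forName(declaredCharset);
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            charset = StandardCharsets.UTF_8;
        }
        if (detectedCharsetContainer != null && detectedCharsetContainer.length > 0) {
            detectedCharsetContainer[0] = charset; // hand the result back to the caller
        }
        return new String(in.readAllBytes(), charset);
    }

    public static void main(String[] args) throws IOException {
        Charset[] container = new Charset[1];
        byte[] bytes = "<html><body>caf\u00e9</body></html>".getBytes(StandardCharsets.ISO_8859_1);
        String text = decode("ISO-8859-1", container, new ByteArrayInputStream(bytes));
        System.out.println(container[0] + ": " + text);
    }
}
```

The caller allocates the array and reads slot 0 after the call, which lets a static method return both its main result and the charset it settled on.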
public static java.lang.String patchCharsetEncoding(java.lang.String encoding)
Parameters:
encoding
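The summary for `patchCharsetEncoding` says that authors sometimes declare wrong encoding names. A minimal sketch of how such a name could be normalized is shown below. The alias table, the class `CharsetNamePatchSketch`, the method `patch` and the choice to return `null` for unknown names are all assumptions made for illustration. The actual mappings and return behaviour of `patchCharsetEncoding` are not documented here. It requires Java 9 or later for `Map.of`.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not the htmlParser implementation.
final class CharsetNamePatchSketch {

    // Assumed examples of misspelled names; the real method may use a different table.
    private static final Map<String, String> ALIASES = Map.of(
            "UFT-8", "UTF-8",
            "LATIN-1", "ISO-8859-1");

    // Returns the canonical charset name, or null if the name cannot be resolved.
    static String patch(String encoding) {
        if (encoding == null) {
            return null;
        }
        String name = encoding.trim().toUpperCase(Locale.ROOT);
        name = ALIASES.getOrDefault(name, name);
        try {
            return Charset.isSupported(name) ? Charset.forName(name).name() : null;
        } catch (IllegalArgumentException e) {
            // Thrown for syntactically illegal names, for example names containing spaces.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(patch(" uft-8 "));
        System.out.println(patch("latin-1"));
        System.out.println(patch("no such charset"));
    }
}
```

Resolving the patched name through `Charset.forName` means the caller always receives a name the running JVM can actually decode.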
private Document parseAlternativeSnapshot(DigestURL location, java.lang.String mimeType, java.lang.String documentCharset, VocabularyScraper vocscraper, int timezoneOffset)
Parameters:
location - original url (ajax url)
mimeType
documentCharset
vocscraper
timezoneOffset
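In the ajax crawling scheme that `parseAlternativeSnapshot` follows, the snapshot page for a url containing `#!` is addressed by moving the fragment into an `_escaped_fragment_` query parameter. The sketch below shows that mapping as described in the linked specification. The class `AjaxSnapshotUrlSketch` and the method `toSnapshotUrl` are hypothetical, the set of escaped bytes reflects one reading of the specification, and this is not the code used by `htmlParser`.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of htmlParser.
final class AjaxSnapshotUrlSketch {

    // Maps "http://host/path#!state" to "http://host/path?_escaped_fragment_=state".
    // Returns null if the url does not contain "#!".
    static String toSnapshotUrl(String url) {
        int pos = url.indexOf("#!");
        if (pos < 0) {
            return null;
        }
        String base = url.substring(0, pos);
        String fragment = url.substring(pos + 2);
        StringBuilder sb = new StringBuilder(base);
        // Append to an existing query string if there is one.
        sb.append(base.indexOf('?') >= 0 ? '&' : '?');
        sb.append("_escaped_fragment_=");
        for (byte b : fragment.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            // Bytes escaped here: 00..20, 23 (#), 25 (%), 26 (&), 2B (+), 7F..FF.
            boolean escape = c <= 0x20 || c == 0x23 || c == 0x25 || c == 0x26 || c == 0x2B || c >= 0x7F;
            if (escape) {
                sb.append('%').append(String.format("%02X", c));
            } else {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSnapshotUrl("http://example.com/page#!key=value"));
    }
}
```

A crawler would request the returned url and parse that response in place of the original, mostly empty, ajax page.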
public static void main(java.lang.String[] args)