YaCy Crawler API

A web crawl is started from either the /CrawlStartSite_p.html or the /CrawlStartExpert_p.html page; both pages call the servlet at /Crawler_p.html. /Crawler_p.html can be called directly to show monitoring information, but it is also the API access point for starting a crawl with a direct call to its URL, as shown below:

http://localhost:8090/Crawler_p.html?crawlingDomMaxPages=10000&range=wide&intention=&sitemapURL=&crawlingQ=on&crawlingMode=url&crawlingURL=http://vip.asus.com/forum/default.aspx%3FSLanguage%3Den-us&crawlingFile=&mustnotmatch=&crawlingFile%24file=&crawlingstart=Neuen%20Crawl%20starten&mustmatch=.*&createBookmark=on&bookmarkFolder=/crawlStart&xsstopw=on&indexMedia=on&crawlingIfOlderUnit=hour&cachePolicy=iffresh&indexText=on&crawlingIfOlderCheck=on&bookmarkTitle=&crawlingDomFilterDepth=1&crawlingDomFilterCheck=on&crawlingIfOlderNumber=1&crawlingDepth=4
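The example URL above can also be assembled programmatically instead of by hand. The following is a minimal Python sketch, not an official YaCy client: the function name build_crawl_url and its arguments are hypothetical, the parameter names and values are copied unchanged from the example URL, and the peer address http://localhost:8090 is assumed from that same URL.

```python
from urllib.parse import urlencode


def build_crawl_url(start_url, depth=4, base="http://localhost:8090"):
    """Return a /Crawler_p.html URL that requests a crawl of start_url.

    Hypothetical helper: parameter names and values are taken from the
    example URL in this document; only the start URL, the crawl depth
    and the peer address are made configurable here.
    """
    params = {
        "crawlingDomMaxPages": "10000",
        "range": "wide",
        "intention": "",
        "sitemapURL": "",
        "crawlingQ": "on",
        "crawlingMode": "url",
        "crawlingURL": start_url,
        "crawlingFile": "",
        "mustnotmatch": "",
        "crawlingFile$file": "",
        "crawlingstart": "Neuen Crawl starten",
        "mustmatch": ".*",
        "createBookmark": "on",
        "bookmarkFolder": "/crawlStart",
        "xsstopw": "on",
        "indexMedia": "on",
        "crawlingIfOlderUnit": "hour",
        "cachePolicy": "iffresh",
        "indexText": "on",
        "crawlingIfOlderCheck": "on",
        "bookmarkTitle": "",
        "crawlingDomFilterDepth": "1",
        "crawlingDomFilterCheck": "on",
        "crawlingIfOlderNumber": "1",
        "crawlingDepth": depth,
    }
    # urlencode escapes reserved characters (for example "?" and "=" inside
    # the start URL) so they are not mistaken for query-string syntax.
    return f"{base}/Crawler_p.html?{urlencode(params)}"


crawl_url = build_crawl_url("http://vip.asus.com/forum/default.aspx?SLanguage=en-us")
print(crawl_url)
```

The sketch only builds the URL. Sending the request is left out because it depends on how the peer is reached and secured.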

The parameters used in the example URL are explained in detail below. Each YaCy crawl job has its own profile, which stores the information needed to handle crawled URLs properly. The profile is created at crawl start, is marked as terminated once the crawl is considered finished, and may be edited or deleted while the crawl is running.

To start a new crawl and create its profile, the following parameters are needed.

Start Point

Crawler Filter

Document Filter

Clean-up

Crawling

Document Cache

Index Attributes