API: Click the API button to see documentation of the POST request parameters for crawl starts.

Expert Crawl Start

Start Crawling Job:  You can define URLs as start points for Web page crawling and start crawling here. "Crawling" means that YaCy will download the given website, extract all links in it and then download the content behind these links. This is repeated as long as specified under "Crawling Depth". A crawl can also be started using wget and the post arguments for this web page.
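As a rough illustration of starting a crawl through POST arguments, the sketch below builds such a request in Python without sending it. The host and port (`localhost:8090`), the target page `Crawler_p.html` and the field names `crawlingstart`, `crawlingMode`, `crawlingURL` and `crawlingDepth` are assumptions made for this example; check them against the API documentation of your peer before use. A real peer may also require authentication.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_crawl_start_request(base_url, start_url, depth):
    """Build (but do not send) a POST request that would start a crawl.

    ASSUMPTION: the page name and all form field names below are
    illustrative and must be verified against the peer's API documentation.
    """
    fields = {
        "crawlingstart": "1",        # assumed name of the submit field
        "crawlingMode": "url",       # assumed: start from a URL list
        "crawlingURL": start_url,    # assumed name of the start URL field
        "crawlingDepth": str(depth), # assumed name of the depth field
    }
    body = urlencode(fields).encode("ascii")
    return Request(base_url + "/Crawler_p.html", data=body, method="POST")


req = build_crawl_start_request("http://localhost:8090", "https://example.org/", 2)
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

The request object could then be passed to `urllib.request.urlopen`, or the same fields could be given to wget as POST data.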

Crawl Job

A Crawl Job consists of one or more start points, crawl limitations and document freshness rules.

Start Point
One Start URL or a list of URLs:
(must start with http:// https:// ftp:// smb:// file://)
Define the start URL(s) here. You can submit more than one URL, one URL per line. Each of these URLs is the root of a crawl start; existing start URLs are always re-loaded. Other already visited URLs are sorted out as "double" unless they are allowed by the re-crawl option.
From Link-List of URL

From Sitemap
From File (enter a path within your local file system)
Crawler Filter

These are limitations on the crawl stacker. The filters will be applied before a web page is loaded.

Crawling Depth
This defines how many levels of links (links of links, and so on) the crawler will follow from the start page. 0 means that only the page you enter under "Start Point" will be added to the index. 2-4 is good for normal indexing. Values over 8 are not useful, since a depth-8 crawl would index approximately 25,600,000,000 pages, which may be the whole WWW.
also all linked non-parsable documents
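The figure of about 25.6 billion pages is consistent with assuming roughly 20 new links per page, since 20^8 = 25,600,000,000. The branching factor of 20 is inferred from that number and is not stated on this page; real sites vary widely. A minimal sketch of the growth under that assumption:

```python
def pages_at_depth(links_per_page, depth):
    """Number of pages reached at exactly `depth` link hops.

    ASSUMPTION: every page links to `links_per_page` pages that were not
    seen before, so this is an upper bound rather than a prediction.
    """
    return links_per_page ** depth


for depth in (0, 2, 4, 8):
    print(depth, pages_at_depth(20, depth))
```

Depth 0 yields a single page (the start URL), depth 4 already yields 160,000 pages, and depth 8 yields the 25,600,000,000 quoted above.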
Unlimited crawl depth for URLs matching with
Maximum Pages per Domain
You can limit the maximum number of pages that are fetched and indexed from a single domain with this option. You can combine this limitation with the 'Auto-Dom-Filter', so that the limit is applied to all the domains within the given depth. Domains outside the given depth are then sorted out anyway.
A question mark is usually a hint for a dynamic page. URLs pointing to dynamic content should usually not be crawled. However, some web pages with static content are accessed with URLs containing question marks. If you are unsure, do not check this option, to avoid crawl loops.
Following frames is NOT done by Gxxg1e, but we do it by default to get richer content.
'nofollow' in robots metadata can be overridden; this does not affect obeying robots.txt, which is never ignored.
Accept URLs with query-part ('?'):
Obey html-robots-noindex:
Obey html-robots-nofollow:
Media Type detection
Media Type checking: Not loading URLs with an unsupported file extension is faster but less accurate, because for some web resources the actual Media Type is not consistent with the URL file extension.
Load Filter on URLs
The filter is a regular expression. Example: to allow only URLs that contain the word 'science', set the must-match filter to '.*science.*'. You can also use an automatic domain restriction to fully crawl a single domain. Note: you can test your regular expressions using the Regular Expression Tester within YaCy.
must-match
Restrict to start domain(s)
Restrict to sub-path(s)
Use filter (must not be empty)
must-not-match
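To make the interplay of must-match and must-not-match concrete, here is a minimal Python sketch. It assumes that a filter has to match the whole URL (which is why the example above is written as '.*science.*' rather than 'science') and that an empty must-not-match filter excludes nothing. The function name is hypothetical, and Python's regular expression dialect may differ in details from the one YaCy uses.

```python
import re


def passes_url_filter(url, must_match=".*", must_not_match=""):
    """Return True if `url` is allowed by both filters.

    ASSUMPTIONS: full-match semantics, and an empty must-not-match
    filter means "exclude nothing".
    """
    if re.fullmatch(must_match, url) is None:
        return False
    if must_not_match and re.fullmatch(must_not_match, url) is not None:
        return False
    return True


print(passes_url_filter("https://example.org/science/news.html", ".*science.*"))
print(passes_url_filter("https://example.org/sports/", ".*science.*"))
print(passes_url_filter("https://example.org/science/paper.pdf",
                        ".*science.*", r".*\.pdf"))
```

The first URL is accepted; the second fails must-match; the third matches must-match but is rejected by must-not-match.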
Load Filter on URL origin of links
The filter is a regular expression. Example: to allow loading only links from pages on the example.org domain, set the must-match filter to '.*example.org.*'. Note: you can test your regular expressions using the Regular Expression Tester within YaCy.
must-match (must not be empty)
must-not-match
Load Filter on IPs
must-match (must not be empty)
must-not-match
Crawls can be restricted to specific countries. This uses the country code that can be computed from the IP of the server that hosts the page. The filter is not a regular expression but a comma-separated list of country codes.
no country code restriction
Use filter  
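Because the country filter is a plain comma-separated list rather than a regular expression, checking it amounts to a set lookup. The sketch below illustrates this; the helper names are hypothetical, the lookup from IP address to country code is left out, and treating an empty list as "no restriction" is an assumption.

```python
def parse_country_codes(text):
    """Split a comma-separated list into a set of upper-case country codes."""
    return {code.strip().upper() for code in text.split(",") if code.strip()}


def country_allowed(country_code, filter_text):
    """Return True if the server's country code passes the filter.

    ASSUMPTION: an empty filter means no country code restriction.
    """
    allowed = parse_country_codes(filter_text)
    return not allowed or country_code.upper() in allowed


print(country_allowed("de", "DE, FR"))  # True
print(country_allowed("CN", "DE, FR"))  # False
```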
Document Filter

These are limitations on the index feeder. The filters will be applied after a web page has been loaded.

Filter on URLs
The filter is a regular expression that must not match the URL for the content of that URL to be indexed. Note: you can test your regular expressions using the Regular Expression Tester within YaCy.
must-match (must not be empty)
must-not-match
No Indexing when Canonical present and Canonical != URL
Filter on Content of Document
(all visible text, including camel-case-tokenized url and title)
must-match (must not be empty)
must-not-match
Filter on Document Media Type (aka MIME type)
Media Type filter: The filter is a regular expression that must match the document Media Type (also known as MIME Type) for the URL to be indexed. Standard Media Types are described at the IANA registry. Note: you can test your regular expressions using the Regular Expression Tester within YaCy.
must-match
must-not-match
Solr query filter on any active indexed field(s)
Solr query filter: Each parsed document is checked against the given Solr query before being added to the index. The query must follow the standard Solr query syntax.
must-match
must-not-match
Content Filter

These are limitations on parts of a document. The filter will be applied after a web page has been loaded. You can choose to:

Evaluate by default
Use all words in document by default until a CSS class as listed below appears; then ignore all
Ignore by default
Ignore all words in document by default until a CSS class as listed below appears, then evaluate all
Filter div or nav class names
Comma-separated list of <div> or <nav> element class names which should be filtered out or in, according to the switch above.
Clean-Up before Crawl Start
Clean up search events cache: Check this option to be sure to get fresh search results including newly crawled documents. Beware that it will also interrupt any refreshing or re-sorting of search results currently requested from the browser side.
After a crawl was done in the past, documents may become stale, and eventually they may also be deleted on the target host. To remove old files from the search index it is not sufficient to just consider them for re-load; it may be necessary to delete them because they simply do not exist any more. Use this in combination with re-crawl; the age set here should be longer than the re-crawl age.
No Deletion
Do not delete any document before the crawl is started.
Delete sub-path
For each host in the start url list, delete all documents (in the given subpath) from that host.
Delete only old
Treat documents that were loaded longer ago than the given time as stale and delete them before the crawl is started.
Double-Check Rules
A web crawl performs a double-check on all links found on the internet against the internal database. If the same URL is found again, it is treated as a double when you check the 'no doubles' option. A URL may be loaded again once it has reached a specific age; to use that check, select the 're-load' option.
No Doubles
Never load any page that is already known. Only the start URL may be loaded again.
Re-load
Treat documents that were loaded longer ago than the given time as stale and load them again. If they are younger, they are ignored.
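The re-load rule boils down to an age comparison. The following sketch shows one way to express it; the function names and the way the last load time is supplied are assumptions for illustration, not the crawler's internal interface.

```python
from datetime import datetime, timedelta


def is_stale(last_loaded, max_age, now):
    """A document is stale if it was loaded longer ago than `max_age`."""
    return now - last_loaded > max_age


def should_load(url_known, last_loaded, max_age, now):
    """Decide whether a URL is loaded under the 're-load' option.

    Unknown URLs are always loaded; known URLs only if they are stale.
    """
    if not url_known:
        return True
    return is_stale(last_loaded, max_age, now)


now = datetime(2024, 1, 15, 12, 0)
week = timedelta(days=7)
print(should_load(True, datetime(2024, 1, 1, 12, 0), week, now))   # True: 14 days old
print(should_load(True, datetime(2024, 1, 14, 12, 0), week, now))  # False: 1 day old
```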
Document Cache
This option is used by default for proxy prefetch, but is not needed for explicit crawling.
The caching policy states when to use the cache during crawling:
no cache: never use the cache; all content comes from a fresh internet source.
if fresh: use the cache if it exists and is fresh according to the proxy-fresh rules.
if exist: use the cache if it exists, without checking freshness; otherwise use the online source.
cache only: never go online; use all content from the cache. If no cache entry exists, treat the content as unavailable.
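The four caching policies can be read as a small decision table. The sketch below encodes them as described above; the policy identifiers (`nocache`, `iffresh`, `ifexist`, `cacheonly`) and the function name are chosen for this example only, and falling back to the online source under "if fresh" when the cache is missing or stale is an assumption.

```python
def choose_source(policy, cached, fresh):
    """Return where content is taken from: 'cache', 'online' or 'unavailable'.

    `cached` says whether a cache entry exists, `fresh` whether it is
    still fresh. Policy names are hypothetical labels for the four options.
    """
    if policy == "nocache":
        return "online"
    if policy == "iffresh":
        # ASSUMPTION: a missing or stale entry falls back to the online source.
        return "cache" if cached and fresh else "online"
    if policy == "ifexist":
        return "cache" if cached else "online"
    if policy == "cacheonly":
        return "cache" if cached else "unavailable"
    raise ValueError("unknown cache policy: " + policy)


for policy in ("nocache", "iffresh", "ifexist", "cacheonly"):
    print(policy, choose_source(policy, cached=True, fresh=False))
```

With a stale cache entry, only "if exist" and "cache only" serve the cached copy.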
Robot Behaviour
Because YaCy can be used as a replacement for commercial search appliances (like the Google Search Appliance, aka GSA), the user must be able to crawl all web pages that are granted to such commercial platforms. Not having this option would be a strong handicap for professional usage of this software. Therefore you can select alternative user agents here, which have different crawl timings, identify themselves with another user agent and obey the corresponding robots rules.
Snapshot Creation
Snapshots are XML metadata and pictures of web pages that can be created during crawling time. The XML data is stored in the same way as a Solr search result with one hit, and the pictures are stored as PDF in subdirectories of HTCACHE/snapshots/. The JPG thumbnails are computed from the PDFs. Snapshot generation can be controlled using a depth parameter: a snapshot is only generated if the crawl depth of a document is smaller than or equal to the number given here. If the number is set to -1, no snapshots are generated.
replace old snapshots with new one
add new versions for each crawl
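The depth rule for snapshots can be written as a single comparison. This is a minimal sketch with hypothetical names, assuming that the start URL has crawl depth 0:

```python
def should_snapshot(doc_depth, max_snapshot_depth):
    """Return True if a snapshot is generated for a document.

    A value of -1 for `max_snapshot_depth` disables snapshots entirely.
    ASSUMPTION: crawl depth counts from 0 at the start URL.
    """
    if max_snapshot_depth == -1:
        return False
    return doc_depth <= max_snapshot_depth


print(should_snapshot(0, 1))   # True: the start page is within depth 1
print(should_snapshot(2, 1))   # False: too deep
print(should_snapshot(0, -1))  # False: snapshots disabled
```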
Index Attributes
Indexing
This enables indexing of the web pages the crawler will download. This should be switched on by default, unless you want to crawl only to fill the Document Cache without indexing.
If checked, the crawler will contact other peers and use them as remote indexers for your crawl. If you need your crawling results locally, you should switch this off. Only senior and principal peers can initiate or receive remote crawls. A YaCyNews message will be created to inform all peers about a global crawl, so they can omit starting a crawl with the same start point.

Remote crawl results won't be added to the local index as the remote crawler is disabled on this peer.

You can activate it in the Remote Crawl Configuration page.

This message will appear in the 'Other Peer Crawl Start' table of other peers.
A crawl result can be tagged with names which are candidates for a collection request. These tags can be selected with the GSA interface using the 'site' operator. To use this option, the 'collection_sxt' field must be switched on in the Solr Schema.
The time zone is required when the parser detects a date in the crawled web page. Content can be searched with the on: modifier, which also requires a time zone when a query is made. To normalize all given dates, the date is stored in the UTC time zone. To get the right offset to UTC for dates without time zones, this offset must be given here. The offset is given in minutes; time zone offsets for locations east of UTC must be negative, and offsets for zones west of UTC must be positive.
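The sign convention is easy to get backwards, so a short worked example may help. With the convention described above, the offset is the number of minutes to add to a local time to obtain UTC. The function name is hypothetical; only the arithmetic is the point.

```python
from datetime import datetime, timedelta


def to_utc(local_time, offset_minutes):
    """Convert a date without time zone to UTC.

    `offset_minutes` is negative east of UTC and positive west of UTC,
    so adding it to the local time yields UTC.
    """
    return local_time + timedelta(minutes=offset_minutes)


noon = datetime(2024, 1, 15, 12, 0)
# A location one hour east of UTC (UTC+1) uses the offset -60.
print(to_utc(noon, -60))  # 2024-01-15 11:00:00
# A location five hours west of UTC (UTC-5) uses the offset +300.
print(to_utc(noon, 300))  # 2024-01-15 17:00:00
```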