Ignore HTML Strings

Thunderstone Search Appliance Manual

Syntax: one or more pairs of strings

All data between each specified begin and end string pair is stripped from the HTML before the text is extracted (i.e., links are unaffected). The strings are simple literal strings, not patterns or REX expressions, and matching ignores case. This is useful for excluding boilerplate or other unwanted portions of HTML documents. String pairs should not nest or overlap within a document; use Ignore Selectors for nested or balanced elements. Documents that contain no begin string are unaffected. If no end string follows the last begin string, the HTML from that begin string to the end of the document is still discarded. Prior to version 25.0.0 this setting was named Ignore Tags.
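The matching rules above can be illustrated with a short Python sketch. This is not the appliance's implementation, only a model of the documented behavior. The function name `strip_ignored` and the sample comment markers are hypothetical, and the sketch assumes the begin and end strings themselves are removed along with the data between them, which the text above does not state explicitly.

```python
import re
from typing import Iterable, Tuple


def strip_ignored(html: str, pairs: Iterable[Tuple[str, str]]) -> str:
    """Model of the documented rules (hypothetical helper, not appliance code)."""
    for begin, end in pairs:
        if not begin or not end:
            continue  # skip incomplete pairs
        # re.escape makes both strings literal: they are not treated as patterns.
        # The lazy ".*?" stops at the first end string after a begin string,
        # or at end of document (\Z) when no end string follows.
        span = re.compile(
            re.escape(begin) + r".*?(?:" + re.escape(end) + r"|\Z)",
            re.IGNORECASE | re.DOTALL,
        )
        # Assumption: the delimiters are dropped together with the enclosed data.
        html = span.sub("", html)
    return html


page = "<p>Keep</p><!--nav-->menu<!--/nav--><p>Body</p><!--NAV-->tail"
cleaned = strip_ignored(page, [("<!--nav-->", "<!--/nav-->")])
# cleaned == "<p>Keep</p><p>Body</p>"
```

In the sample, the first pair is removed normally, while the second begin string (matched despite its different case) has no end string after it, so everything from there to the end of the document is discarded. A document without any begin string passes through unchanged.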