Keep HTML Strings

Thunderstone Search Appliance Manual

Syntax: one or more pairs of strings

All data not between the specified begin and end string pairs is stripped from the HTML before text is extracted; links are unaffected. The strings are simple literal strings, not patterns or REX expressions, and case is ignored. This is useful for extracting the areas of primary interest from HTML pages without the surrounding boilerplate. String pairs should not nest or overlap in documents; use Keep Selectors for nested or balanced elements. Documents that contain no begin string are unaffected. If no end string follows the last begin string, the HTML from that begin string to the end of the document is kept. Prior to version 25.0.0 this setting was named Keep Tags.
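The rules above can be illustrated with a small Python sketch. This is not the appliance's implementation, only a model of the documented behavior. The function name `keep_html_strings` and the sample markers are hypothetical, and the sketch assumes that the begin and end strings themselves are kept along with the content between them, which the text above does not state.

```python
import re


def keep_html_strings(html, pairs):
    """Model of the documented behavior (hypothetical helper, not a product API).

    pairs: list of (begin, end) literal strings, matched case-insensitively.
    Assumption: the begin/end strings are kept with the enclosed content.
    """
    spans = []
    for begin, end in pairs:
        if not begin or not end:
            raise ValueError("begin and end strings must be non-empty")
        # re.escape makes the strings literal: they are not patterns.
        begin_re = re.compile(re.escape(begin), re.IGNORECASE)
        end_re = re.compile(re.escape(end), re.IGNORECASE)
        pos = 0
        while True:
            b = begin_re.search(html, pos)
            if b is None:
                break
            e = end_re.search(html, b.end())
            if e is None:
                # No end string after the last begin string:
                # keep from that begin string to the end of the document.
                spans.append((b.start(), len(html)))
                break
            spans.append((b.start(), e.end()))
            pos = e.end()
    if not spans:
        # No begin string found: the document is unaffected.
        return html
    # Pairs are assumed not to nest or overlap, so the spans can
    # simply be concatenated in document order.
    spans.sort()
    return "".join(html[s:t] for s, t in spans)


if __name__ == "__main__":
    page = "<div>nav</div><!-- START -->Hello<!-- end --><div>footer</div>"
    print(keep_html_strings(page, [("<!-- start -->", "<!-- END -->")]))
```

Because the markers are compiled with `re.escape` and `re.IGNORECASE`, a begin string written as `<!-- start -->` also matches `<!-- START -->` in the page, mirroring the case-insensitive literal matching described above.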