What is a TextParse Query?
A TextParse query is an Autonomy query that allows you to match a single document against many stored queries in a specially configured Autonomy index. Normally, a single query is matched against many stored documents; TextParse queries invert that relationship. That makes it possible to implement document alerting, in which user-defined alerts are generated whenever a matching document is ingested into the system. The most famous example is Google Alerts; however, Autonomy has had the same capability for a lot longer, and TextParse queries make that functionality easily accessible, as described in the sections below.
TextParse Queries in Action
At a high level, alerting is implemented by creating special documents that represent the query criteria for an alert and then matching those alerts against incoming documents. For example, if a news story alerting system allowed users to match against the BYLINE, ABSTRACT, and text of a news story, a user might ask to be alerted on all articles with “Ben Stein” in the BYLINE, “Gold or Metals” in the ABSTRACT, and “Hedging” in the text. Those query criteria would be used to create a document, which would be stored in a specially configured content index along with many other user-defined alerts for later matching against incoming news documents. When a new news document arrives, it is converted into a special query called a TextParse query and run against that content index to find all alert query documents that match the news story. The matching alerts can then be used to email the users who created them, informing them of the arrival of the new story.
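To make the inversion concrete, here is a toy sketch in plain Python. It is not how IDOL implements matching (it is a simple linear scan over in-memory alerts), and the `Alert` class, the field names, and the email addresses are all made up for illustration. It only shows the shape of the idea: the alerts are the stored items, and the incoming document plays the role of the query.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """A stored alert (hypothetical structure): one required phrase per field."""
    owner: str
    criteria: dict  # field name -> phrase that must appear in that field


def matching_alerts(document: dict, alerts: list) -> list:
    """Return the alerts whose every criterion is satisfied by the document."""
    matched = []
    for alert in alerts:
        if all(
            phrase.lower() in document.get(field, "").lower()
            for field, phrase in alert.criteria.items()
        ):
            matched.append(alert)
    return matched


# Stored alerts play the role of the indexed documents.
alerts = [
    Alert("alice@example.com", {"BYLINE": "Ben Stein", "TEXT": "hedging"}),
    Alert("bob@example.com", {"ABSTRACT": "gold"}),
    Alert("carol@example.com", {"BYLINE": "Jane Doe"}),
]

# The incoming news story plays the role of the query.
story = {
    "BYLINE": "Ben Stein",
    "ABSTRACT": "Gold prices rise",
    "TEXT": "Hedging strategies for volatile markets.",
}

owners = [alert.owner for alert in matching_alerts(story, alerts)]
print(owners)
```

A real alerting system cannot afford to scan every alert for every document, which is why the rest of this post relies on a specially configured index instead.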
Architecture
For those of you unfamiliar with the IDOL component architecture: the IDOL server acts as a unifying facade for a number of subcomponents which support the various search, categorization, community, viewing, and other capabilities of IDOL. When configured correctly, those subcomponents can also be run on their own. In a typical IDOL server setup, the IDOL server receives ACI commands, which it then forwards to its subcomponents to do the actual work. That is why you will sometimes hear IDOL referred to as the IDOL proxy. Two of the IDOL subcomponents, content and agentstore, are used for high-speed indexing and retrieval. The content component supports the index and retrieval functionality of IDOL. Agentstore is just a specialized instance of the content component that IDOL uses as its internal database of agent, categorization, and community information.
In order to support TextParse query functionality, you will typically run a separate content component independent of IDOL so that it can be configured specifically for high-speed TextParse queries and can be scaled up through multiple installations of an identically configured component (more on that in a later post).
Configuration
The simplest way to start with configuration is to use a copy of the agentstore component from an existing out-of-the-box installation of IDOL. Remember that agentstore is just a specially configured instance of the content component. That special configuration contains some of the settings that are required for high-speed TextParse querying, namely the following:
· AgentBooleanCacheField – this caches the field used for Boolean matching in the agent documents stored in your agentstore. It is not strictly required, but can increase performance considerably at the expense of additional memory usage.
· FieldTextCacheField – this caches the field used for fieldtext matching in the agent documents stored in your agentstore. It is not strictly required, but can increase performance considerably at the expense of additional memory usage.
· TextParseIndexType field processing – this setting allows you to define the fields that will comprise your query text. This is essential because the first step in matching a TextParse query is to match the query’s “text” parameter against the agent documents’ TextParseIndexType fields.
If you look in the default IDOL configuration file, these configuration settings are not defined. However, they are defined in the default agentstore configuration file (agentstore.cfg) which means we don’t have to start from scratch if we use agentstore as the basis for our setup. The pertinent settings from the default agentstore configuration are shown in the listing below:
|
. . .
AgentBooleanCacheField=*/BOOLEANRESTRICTION
FieldTextCacheField=*/FIELDTEXTRESTRICTION
. . .
[FieldProcessing]
. . .
9=SetTextParseIndexFields
. . .
[SetTextParseIndexFields]
property=TextParseIndexFields
propertyfieldcsvs=*/DRETITLE,*/DRECONTENT,*/TRAINING
. . .
[TextParseIndexFields]
TextParseIndexType=TRUE
. . .
|
The definitions for AgentBooleanCacheField and FieldTextCacheField are fine as they are; we just need to remember to use those fields when defining our agent documents. That leaves the TextParseIndexFields definition. We can customize it as needed by modifying the “propertyfieldcsvs” setting shown in the listing above. For example, you might have an ABSTRACT field that is part of the text of your documents, in which case you would append */ABSTRACT to the end of the “propertyfieldcsvs” setting.
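If you maintain several engine configurations, the edit described above is easy to script. The helper below is a minimal sketch, not part of any Autonomy tooling: `add_textparse_field` is a name invented here, and it only manipulates the value string of “propertyfieldcsvs”, leaving the reading and writing of the configuration file to you.

```python
def add_textparse_field(propertyfieldcsvs: str, field: str) -> str:
    """Append */FIELD to a propertyfieldcsvs value unless it is already listed."""
    entries = [entry.strip() for entry in propertyfieldcsvs.split(",") if entry.strip()]
    new_entry = "*/" + field.upper()
    if new_entry not in entries:
        entries.append(new_entry)
    return ",".join(entries)


# Value taken from the default agentstore listing above.
default_value = "*/DRETITLE,*/DRECONTENT,*/TRAINING"
updated = add_textparse_field(default_value, "ABSTRACT")
print("propertyfieldcsvs=" + updated)
```

Calling the helper a second time with the same field leaves the value unchanged, so it is safe to run repeatedly.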
Seeding
The agentstore engine must be populated with all the correct document tag information before it can correctly process textparse queries. Therefore you will need to index at least one “dummy” document into your agentstore. The dummy document must contain all fields upon which you wish to query. It can be indexed into any database. Assuming you wanted to query on the BYLINE and ABSTRACT you would index the dummy document shown in the listing below. This tells the engine the BYLINE and ABSTRACT are available tags.
|
#DREREFERENCE Dummy
#DRETITLE
Ignore this title
#DREFIELD BYLINE=""
#DREFIELD ABSTRACT=""
#DRECONTENT
#DREENDDOC
|
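If the list of queryable fields changes often, the dummy document can be generated rather than written by hand. The following is a small sketch that reproduces the layout of the listing above; the function name `build_dummy_idx` is my own, and the output still has to be indexed into your agentstore by whatever means you normally use.

```python
def build_dummy_idx(fields, reference="Dummy"):
    """Build a seeding IDX document that declares each field with an empty value."""
    lines = [
        f"#DREREFERENCE {reference}",
        "#DRETITLE",
        "Ignore this title",
    ]
    # One empty #DREFIELD line per field we want the engine to know about.
    lines += [f'#DREFIELD {name}=""' for name in fields]
    lines += ["#DRECONTENT", "#DREENDDOC"]
    return "\n".join(lines) + "\n"


dummy = build_dummy_idx(["BYLINE", "ABSTRACT"])
print(dummy)
```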
Creating Agent Documents
Once you’ve configured your custom agentstore content engine and have indexed at least one dummy document you are ready to create your agent documents. Remember that each agent document represents a user query that we want to match against an incoming document. We can store hundreds of thousands of separate agent documents and have them matched against an incoming document. The agent document can have any fields that you desire, but it should contain the following fields at a minimum:
· DREREFERENCE – this is the unique ID of the agent document.
· DRETITLE – this should be the human readable ID for the agent document.
· BOOLEANRESTRICTION DREFIELD – this is a Boolean text restriction which can contain terms or field restricted terms in order to match against the text of the textparse document.
· FIELDTEXTRESTRICTION DREFIELD – this is the fieldtext restriction which can use fieldtext operators to further refine the match against the textparse document.
· DRECONTENT – this is the text that is used to match against the textparse document. Only after a match is made on the text content are the BOOLEANRESTRICTION or FIELDTEXTRESTRICTION fields considered (see more in the “Gotchas” section below).
It’s also handy to include additional information about the alert, such as the email distribution list to which matching documents should be sent when they are matched or any other descriptive information about the alert.
A sample agent document containing values for these fields is shown in the listing below.
|
#DREREFERENCE 2ad6b67c1b7b4b86876b4651f3b70c48
#DRETITLE MY TEST AGENT
#DREDBNAME Agent
#DREFIELD DISTRIBUTIONLIST="test@test.com"
#DREFIELD THRESHOLD="80"
#DREFIELD DELIVERYMETHOD="REALTIME"
#DREFIELD FIELDTEXTRESTRICTION="MATCH{XYZ}:FLD1 AND MATCH{ABC}:FLD2"
#DREFIELD BOOLEANRESTRICTION="( term1 AND term2 ) OR term3:*/BYLINE"
#DRECONTENT
XXDEFAULTVALUEXX
#DREENDDOC
|
The agent documents are stored in the index using the DREADD command in the same way you would store any other document. You can also store multiple agent documents from a single file, as you would with any other IDX documents. In the example above we use the preconfigured agentstore “Agent” database; however, that is not a strict requirement.
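In practice, agent documents are usually generated from whatever your users enter in an alert form. The sketch below shows one way to produce an IDX document shaped like the sample above. The function and parameter names are invented for this post, the reference is simply a random 32-character hex ID, and the code does not escape double quotes inside field values, so treat it as a starting point only.

```python
import uuid

DEFAULT_TERM = "XXDEFAULTVALUEXX"  # placeholder content, as in the sample listing


def build_agent_idx(title, boolean_restriction="", fieldtext_restriction="",
                    content="", extra_fields=None, database="Agent", reference=None):
    """Build an agent IDX document; empty content falls back to DEFAULT_TERM."""
    lines = [
        f"#DREREFERENCE {reference or uuid.uuid4().hex}",
        f"#DRETITLE {title}",
        f"#DREDBNAME {database}",
    ]
    # Descriptive fields such as the distribution list or delivery method.
    for name, value in (extra_fields or {}).items():
        lines.append(f'#DREFIELD {name}="{value}"')
    if fieldtext_restriction:
        lines.append(f'#DREFIELD FIELDTEXTRESTRICTION="{fieldtext_restriction}"')
    if boolean_restriction:
        lines.append(f'#DREFIELD BOOLEANRESTRICTION="{boolean_restriction}"')
    lines += ["#DRECONTENT", content.strip() or DEFAULT_TERM, "#DREENDDOC"]
    return "\n".join(lines) + "\n"


agent = build_agent_idx(
    "MY TEST AGENT",
    boolean_restriction="( term1 AND term2 ) OR term3:*/BYLINE",
    fieldtext_restriction="MATCH{XYZ}:FLD1 AND MATCH{ABC}:FLD2",
    extra_fields={"DISTRIBUTIONLIST": "test@test.com", "THRESHOLD": "80"},
    reference="2ad6b67c1b7b4b86876b4651f3b70c48",
)
print(agent)
```

Concatenating the output of several calls gives you a single file containing many agent documents, ready for indexing.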
Querying
Once you have a configured agentstore component with at least one dummy document and at least one agent document indexed, you are ready to use textparse queries to find out which stored agents match a given document. You use a modified version of a standard Autonomy ACI query in order to match the agents. The modified query must contain the following query parameters at a minimum:
· TextParse - “TextParse=true” to enable the textparse functionality.
· AgentBooleanField – this is used to identify the agent Boolean field used by the agentstore. If you use the agentstore configuration as described above, then this value would be “AgentBooleanField=BOOLEANRESTRICTION”.
· FieldTextField - this is used to identify the agent fieldtext field used by the agentstore. If you use the agentstore configuration as described above, then this value would be “FieldTextField=FIELDTEXTRESTRICTION”.
· Text - the full content of the IDX document to be matched, passed in the “text” parameter of the ACI query. Because this text can be large, it should be sent using the HTTP POST action rather than the GET action.
I have created the simple HTML form shown below in order to simplify the ACI query creation and execution for testing. This form presents a textbox into which you can enter an IDX document for submission to your agentstore to see if it returns alert document results. Simply change the host and port names to suit your particular installation.
|
<HTML>
<HEAD><TITLE>TextParse Query Form</TITLE></HEAD>
<BODY>
<H1>TextParse Query Form</H1>
<FORM METHOD="post" target="other" ACTION="http://localhost:9050/action=QUERY&textparse=true&databasematch=agent&agentbooleanfield=BOOLEANRESTRICTION&fieldtextfield=FIELDTEXTRESTRICTION&maxresults=100&predict=false&print=all&totalresults=true" >
<textarea name="text" cols="40" rows="10">Paste IDX document here</textarea>
<BR/>
<INPUT TYPE="submit" NAME="Submit" VALUE="Execute Query">
</FORM>
</BODY>
</HTML>
|
The HTML file produces a web page similar to the one shown in the screenshot below:

The results of a TextParse query are no different than those of a regular ACI query. In other words the result contains a list of the matching documents. In this case the returned documents represent agent queries which match the document. You can control the returned fields using “print” and “printfields” as you would with any other query.
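For anything beyond manual testing you will want to issue the query from code. The sketch below builds the same request as the form above using only the Python standard library: the query parameters go in the URL, and the IDX document goes in the POST body as the “text” parameter. The function name is my own, and the host, port, and parameter values are simply copied from the form, so adjust them for your installation. The request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_textparse_request(idx_document, host="localhost", port=9050,
                            database="agent", max_results=100):
    """Build (but do not send) a TextParse query mirroring the HTML form."""
    params = {
        "action": "QUERY",
        "textparse": "true",
        "databasematch": database,
        "agentbooleanfield": "BOOLEANRESTRICTION",
        "fieldtextfield": "FIELDTEXTRESTRICTION",
        "maxresults": str(max_results),
        "predict": "false",
        "print": "all",
        "totalresults": "true",
    }
    url = f"http://{host}:{port}/" + urlencode(params)
    # The document itself is form-encoded into the POST body, like the textarea.
    body = urlencode({"text": idx_document}).encode("utf-8")
    return Request(url, data=body, method="POST")


request = build_textparse_request("#DREREFERENCE 1\n#DRECONTENT\nGold hedging\n#DREENDDOC\n")
print(request.get_method(), request.full_url)
```

To execute the query, pass the request to `urllib.request.urlopen(request)` and read the response, which contains the matching agent documents.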
Optimization
The configuration section above noted that AgentBooleanCacheField and FieldTextCacheField are not strictly required but can increase performance considerably. Those cache settings allow in-memory caching of the Boolean and fieldtext queries that are used for document matching. In practice, they are essential for any real-world alerting system.
The complexity of agent document queries has the most significant impact on performance. Unfortunately, it’s often impossible to control the complexity of alerting expressions when users are the ones creating them. Luckily, users typically create very simple matching expressions, often containing one or two keywords, and they rarely use Boolean or field-restricted expressions. Our best advice in this area is to obtain a representative set of agent documents and test extensively to determine how many agent documents can be indexed before matching time exceeds the required maximum.
The size of the documents being matched has the next highest impact on performance. Massively scalable alert matching is only possible on smaller documents, typically those with 5 KB of text or less. However, it is possible to perform alerting on much larger documents (in the megabytes) as long as document throughput (the number of documents matched per minute) is not a significant concern. Again, the best way to optimize is to obtain a representative set of documents and run them as TextParse queries against your agent documents to see the average query response time under load.
With reasonably small or sectioned documents (under 5 KB in size) and reasonably complex agent document definitions, it is possible to match hundreds of thousands of agents at over 100 documents per second (with adequate hardware and memory).
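The testing recommended above does not need elaborate tooling to get started. Here is a minimal timing harness, written under the assumption that you supply your own `run_query` function that submits one document as a TextParse query; in the demo it is replaced by a stand-in so the code runs without a server. Note that it runs queries one after another, so it measures single-client latency rather than behaviour under concurrent load.

```python
import time


def measure_throughput(run_query, documents):
    """Run each document through run_query and report simple timing statistics."""
    latencies = []
    start = time.perf_counter()
    for document in documents:
        query_start = time.perf_counter()
        run_query(document)
        latencies.append(time.perf_counter() - query_start)
    elapsed = time.perf_counter() - start
    return {
        "documents": len(latencies),
        "avg_seconds": sum(latencies) / len(latencies) if latencies else 0.0,
        "max_seconds": max(latencies, default=0.0),
        "docs_per_second": len(latencies) / elapsed if elapsed > 0 else 0.0,
    }


# Stand-in for a real TextParse query call (hypothetical; replace with your own).
def fake_query(document):
    return len(document)


stats = measure_throughput(fake_query, ["doc one", "doc two", "doc three"])
print(stats)
```

Run it against a representative sample of documents while increasing the number of indexed agents, and watch for the point at which the average time exceeds your required maximum.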
Gotchas
There are a few “gotchas” that you will need to avoid when using TextParse queries. The first is related to the size of the TextParse query itself. Since a TextParse query involves sending an entire IDX document as part of the query, the query can become very large. Therefore you need to make sure that MaxInputString is set to a large value or disabled (by setting it to -1). That setting is only available in the configuration file. You also need to make sure that MaxQueryTerms is set to a value large enough to accommodate the document. MaxQueryTerms is set to 250000 in the default agentstore configuration. That should be a reasonable size for smaller documents; however, you can specify a larger value in the ACI query to override the configured default if necessary.
The second gotcha has to do with the way agent documents match the TextParse document. The engine always attempts to match the agent document’s TextParseIndexType fields against the text of the document first. If there is no match, it proceeds no further. That means that if the agent document defines no text, or the TextParse document has no text, then no match will be made even if the Boolean or fieldtext portions of the agent document match the TextParse document. The way around this is to always append an artificial term, say “XXDEFAULTVALUEXX”, to the end of the DRECONTENT field in your TextParse document. Then, in your agent documents, make sure that you add “XXDEFAULTVALUEXX” to the DRECONTENT if it would otherwise be empty. This prevents empty text in the agent document from disabling the agent completely.
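The workaround is easy to forget, so it is worth applying it in one place in your code. The helper below is a small sketch (the function name is made up) that appends the artificial term to the DRECONTENT text of an incoming document before it is sent as a TextParse query. It operates on the content text only, not on a complete IDX document.

```python
DEFAULT_TERM = "XXDEFAULTVALUEXX"


def ensure_default_term(text, term=DEFAULT_TERM):
    """Append the artificial term to the content text unless it is already there."""
    stripped = text.rstrip()
    if term in stripped.split():
        return stripped
    return (stripped + " " + term).strip()


print(ensure_default_term("Gold hedging news"))
print(ensure_default_term(""))
```

Because the helper checks for the term before appending it, running it more than once on the same text does no harm.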
Conclusion
TextParse queries are the key to massively scalable document alerting with Autonomy; however, they can be tricky to implement if you don’t configure all the pieces correctly. Armed with the information in this document, you should be able to configure your agent alerting system in a snap and be up and running in minutes. In a subsequent post I will show how to scale this solution out to handle millions of user alerts for matching against tens or hundreds of documents per second.