Chuck Tichenor

Capax Global Blogs

February 14
Autonomy TextParse Queries

What is a TextParse Query?

A TextParse query is an Autonomy query that allows you to match a single document against many stored queries in a specially configured Autonomy index. Normally a single query is matched against many stored documents; TextParse queries invert that relationship. That makes it possible to implement document alerting, in which user-defined alerts are generated whenever a matching document is ingested into the system. The most famous example is Google Alerts; however, Autonomy has had the same capability for a lot longer, and TextParse queries make that functionality easily accessible, as described in the sections below.

TextParse Queries in Action

At a high level, alerting is implemented by creating special documents which represent the query criteria for an alert and then matching those alerts against incoming documents. For example, if a news story alerting system allowed users to match against the BYLINE, ABSTRACT, and text of a news story, a user might ask to be alerted on all articles with “Ben Stein” in the BYLINE, “Gold or Metals” in the ABSTRACT, and “Hedging” in the text. Those query criteria would be used to create a document, which would be stored in a specially configured content index along with many other user-defined alerts for later matching against incoming news documents. When a new news document arrives, it is converted into a special query called a TextParse query and run against the special content index in order to find all alert query documents which match the news story. All matching alerts could then be emailed to the users who created them in order to inform them of the arrival of the new story.

Architecture

For those of you unfamiliar with the IDOL component architecture – the IDOL server acts as a unifying facade for a number of subcomponents which support the various search, categorization, community, viewing, and other capabilities of IDOL. When configured correctly, those subcomponents can be run independently. In a typical IDOL server setup, the IDOL server receives ACI commands which it then forwards to its subcomponents to do the actual work. That is why you will sometimes hear IDOL referred to as the IDOL proxy. Two of the IDOL subcomponents, content and agentstore, are used for high-speed indexing and retrieval. The content component supports the index and retrieval functionality of IDOL, while agentstore is a specialized instance of the content component that IDOL uses as an internal database of agent, categorization, and community information.

In order to support TextParse query functionality, you will typically run a separate content component independent of IDOL, so that it can be configured specifically for high-speed TextParse queries and scaled up through multiple installations of an identically configured component (more on that in a later post).

Configuration

The simplest way to start with configuration is to use a copy of the agentstore component from an existing out-of-the-box installation of IDOL. Remember that agentstore is just a specially configured instance of the content component. That special configuration contains some of the settings that are required for high-speed TextParse querying, namely the following:

·         AgentBooleanCacheField – this will cache the field used for Boolean matching in your agent documents stored in your agentstore. It is not strictly required, but can increase performance considerably at the expense of additional memory usage.

·         FieldTextCacheField - this will cache the field used for fieldtext matching in your agent documents stored in your agentstore. It is not strictly required, but can increase performance considerably at the expense of additional memory usage.

·         TextParseIndexType field processing – this setting allows you to define the fields that will comprise your query text. This is essential because the first step in matching a TextParse query is to match the query’s “text” parameter against the agent documents’ TextParseIndexType fields.

If you look in the default IDOL configuration file, these configuration settings are not defined. However, they are defined in the default agentstore configuration file (agentstore.cfg) which means we don’t have to start from scratch if we use agentstore as the basis for our setup. The pertinent settings from the default agentstore configuration are shown in the listing below:

. . .
AgentBooleanCacheField=*/BOOLEANRESTRICTION
FieldTextCacheField=*/FIELDTEXTRESTRICTION
. . .
[FieldProcessing]
. . .
9=SetTextParseIndexFields
. . .
[SetTextParseIndexFields]
property=TextParseIndexFields
propertyfieldcsvs=*/DRETITLE,*/DRECONTENT,*/TRAINING
. . .
[TextParseIndexFields]
TextParseIndexType=TRUE
. . .
 

 

The definitions for AgentBooleanCacheField and FieldTextCacheField are fine as they are; we just need to remember to use those fields when defining our agent documents. That leaves the TextParseIndexFields definition. We can customize it as needed by modifying the “propertyfieldcsvs” setting shown in the listing above. For example, you might have an ABSTRACT field that is part of the text of your documents, in which case you would append ,*/ABSTRACT to the end of the “propertyfieldcsvs” setting.

Seeding

The agentstore engine must be populated with all the correct document tag information before it can correctly process textparse queries. Therefore you will need to index at least one “dummy” document into your agentstore. The dummy document must contain all fields upon which you wish to query. It can be indexed into any database. Assuming you wanted to query on the BYLINE and ABSTRACT you would index the dummy document shown in the listing below. This tells the engine the BYLINE and ABSTRACT are available tags.

 

#DREREFERENCE Dummy
#DRETITLE
Ignore this title
#DREFIELD BYLINE=""
#DREFIELD ABSTRACT=""
#DRECONTENT
#DREENDDOC
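If you query on more than a couple of fields, the dummy document can be generated rather than typed by hand. The Python sketch below is illustrative only: the function name `build_dummy_idx` is an assumption and not part of any Autonomy tooling. It simply reproduces the layout of the dummy document above for an arbitrary list of field names.

```python
def build_dummy_idx(fields, reference="Dummy"):
    """Return an IDX document that declares each field name with an empty value."""
    lines = [
        f"#DREREFERENCE {reference}",
        "#DRETITLE",
        "Ignore this title",
    ]
    # One empty #DREFIELD per tag the engine should know about.
    lines += [f'#DREFIELD {name}=""' for name in fields]
    lines += ["#DRECONTENT", "#DREENDDOC"]
    return "\n".join(lines) + "\n"


print(build_dummy_idx(["BYLINE", "ABSTRACT"]))
```

The output can be saved to a file and indexed like any other IDX document.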

 

Creating Agent Documents

Once you’ve configured your custom agentstore content engine and have indexed at least one dummy document you are ready to create your agent documents. Remember that each agent document represents a user query that we want to match against an incoming document. We can store hundreds of thousands of separate agent documents and have them matched against an incoming document. The agent document can have any fields that you desire, but it should contain the following fields at a minimum:

·         DREREFERENCE – this is the unique ID of the agent document.

·         DRETITLE – this should be the human readable ID for the agent document.

·         BOOLEANRESTRICTION DREFIELD – this is a Boolean text restriction which can contain terms or field restricted terms in order to match against the text of the textparse document.

·         FIELDTEXTRESTRICTION DREFIELD – this is the fieldtext restriction which can use fieldtext operators to further refine the match against the textparse document.

·         DRECONTENT – this is the text that is used to match against the textparse document. Only after a match is made on the text content are the BOOLEANRESTRICTION or FIELDTEXTRESTRICTION fields considered (see more in the “Gotchas” section below).

It’s also handy to include additional information about the alert, such as the email distribution list to which matching documents should be sent when they are matched or any other descriptive information about the alert.

A sample agent document containing values for these fields is shown in the listing below.

 

#DREREFERENCE 2ad6b67c1b7b4b86876b4651f3b70c48
#DRETITLE MY TEST AGENT
#DREDBNAME Agent
#DREFIELD DISTRIBUTIONLIST="test@test.com"
#DREFIELD THRESHOLD="80"
#DREFIELD DELIVERYMETHOD="REALTIME"
#DREFIELD FIELDTEXTRESTRICTION="MATCH{XYZ}:FLD1 AND MATCH{ABC}:FLD2"
#DREFIELD BOOLEANRESTRICTION="( term1 AND term2 ) OR term3:*/BYLINE"
#DRECONTENT
XXDEFAULTVALUEXX
#DREENDDOC

 

 

The agent documents are stored in the index using the DREADD command in the same way you would store any other document. You can also store multiple agent documents from a single file, as you would any other IDX documents. The example above uses the preconfigured agentstore “Agent” database; however, that is not a strict requirement.
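In an alerting application the agent documents are usually generated from user input. The following Python sketch shows one way to do that; `build_agent_idx` and its parameters are hypothetical names chosen for illustration. It assumes field values contain no double quotes, and it falls back to the XXDEFAULTVALUEXX term when no content is supplied, for the reason explained in the “Gotchas” section.

```python
import uuid

DEFAULT_TERM = "XXDEFAULTVALUEXX"  # sentinel term used in the sample agent document


def build_agent_idx(title, boolean_restriction="", fieldtext_restriction="",
                    content="", extra_fields=None, database="Agent"):
    """Return one agent document in IDX form (assumes values contain no quotes)."""
    fields = dict(extra_fields or {})
    if fieldtext_restriction:
        fields["FIELDTEXTRESTRICTION"] = fieldtext_restriction
    if boolean_restriction:
        fields["BOOLEANRESTRICTION"] = boolean_restriction
    lines = [
        f"#DREREFERENCE {uuid.uuid4().hex}",  # 32 hex characters, unique per agent
        f"#DRETITLE {title}",
        f"#DREDBNAME {database}",
    ]
    lines += [f'#DREFIELD {name}="{value}"' for name, value in fields.items()]
    # Never leave DRECONTENT empty; fall back to the sentinel term.
    lines += ["#DRECONTENT", content.strip() or DEFAULT_TERM, "#DREENDDOC"]
    return "\n".join(lines) + "\n"


print(build_agent_idx(
    "MY TEST AGENT",
    boolean_restriction="( term1 AND term2 ) OR term3:*/BYLINE",
    extra_fields={"DISTRIBUTIONLIST": "test@test.com"},
))
```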

Querying

Once you have a configured agentstore component with at least one dummy document and at least one agent document indexed you are ready to use textparse queries to match documents against the stored agents to see which match the documents. You use a modified version of a standard Autonomy ACI query in order to match the agents. The modified query must contain the following query parameters at a minimum:

·         TextParse - “TextParse=true” to enable the textparse functionality.

·         AgentBooleanField – this is used to identify the agent Boolean field used by the agentstore. If you use the agentstore configuration as described above, then this value would be “AgentBooleanField=BOOLEANRESTRICTION”.

·         FieldTextField - this is used to identify the agent fieldtext field used by the agentstore. If you use the agentstore configuration as described above, then this value would be “FieldTextField=FIELDTEXTRESTRICTION”.

·         Text – this must contain the full content of the IDX document being matched. Because that text can be large, the query should be sent using the HTTP POST action rather than the GET action.

I have created the simple HTML form shown below in order to simplify the ACI query creation and execution for testing. This form presents a textbox into which you can enter an IDX document for submission to your agentstore to see if it returns alert document results. Simply change the host and port names to suit your particular installation.

 

<HTML>
<HEAD><TITLE>TextParse Query Form</TITLE></HEAD>
<BODY>
<H1>TextParse Query Form</H1>
<FORM METHOD="post" target="other" ACTION="http://localhost:9050/action=QUERY&amp;textparse=true&amp;databasematch=agent&amp;agentbooleanfield=BOOLEANRESTRICTION&amp;fieldtextfield=FIELDTEXTRESTRICTION&amp;maxresults=100&amp;predict=false&amp;print=all&amp;totalresults=true">
<textarea name="text" cols="40" rows="10">Paste IDX document here</textarea>
<BR/>
<INPUT TYPE="submit" NAME="Submit" VALUE="Execute Query">
</FORM>
</BODY>
</HTML>

 

 

The HTML file produces a web page similar to the one shown in the screenshot below:

[Screenshot: TextParse Query Form]

The results of a TextParse query are no different than those of a regular ACI query. In other words the result contains a list of the matching documents. In this case the returned documents represent agent queries which match the document. You can control the returned fields using “print” and “printfields” as you would with any other query.
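For automated testing, the same request that the HTML form above submits can be built from a script. The Python sketch below mirrors the form's URL parameters and puts the document in the POST body as the “text” parameter. The function name `build_textparse_request` is an assumption, and the host and port defaults are simply the ones used in the form; adjust them to suit your installation.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_textparse_request(idx_document, host="localhost", port=9050):
    """Build (but do not send) a POST request for a TextParse query."""
    params = {
        "textparse": "true",
        "databasematch": "agent",
        "agentbooleanfield": "BOOLEANRESTRICTION",
        "fieldtextfield": "FIELDTEXTRESTRICTION",
        "maxresults": "100",
        "predict": "false",
        "print": "all",
        "totalresults": "true",
    }
    url = f"http://{host}:{port}/action=QUERY&{urlencode(params)}"
    # The document travels in the form-encoded POST body as the "text" parameter.
    body = urlencode({"text": idx_document}).encode("utf-8")
    return Request(url, data=body, method="POST")


request = build_textparse_request("#DREREFERENCE 1\n#DRECONTENT\nHedging\n#DREENDDOC\n")
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` sends the query; the response body is the usual ACI result listing the matching agent documents.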

Optimization

The configuration section above noted that AgentBooleanCacheField and FieldTextCacheField are not strictly required but can increase performance considerably. Those cache settings allow the in-memory caching of the Boolean and field text queries which will be used for document matching. In practice, they are essential for any real-world alerting system.

The complexity of agent document queries impacts performance most significantly. Unfortunately, it’s often impossible to control the complexity of alerting expressions when users are the ones creating them. Luckily, users typically create very simple matching expressions, often containing one or two keywords, and they rarely use Boolean or field-restricted expressions. Our best advice in this area is to obtain a representative set of agent documents and test extensively to determine how many agent documents can be indexed before matching time exceeds the required maximum.

The size of the matching documents has the next highest impact on performance. Massively scalable alert matching is only possible on smaller documents, typically those with 5K of text or less. However, it is possible to perform alerting on much larger documents (in the megabytes) as long as document throughput (number of documents matched per minute) is not a significant issue. Again, the best way to optimize is to obtain a representative set of documents and run them as textparse queries against your agent documents to see the average query response time under load.
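A small timing harness is enough to get first numbers from a representative document set. The Python sketch below is a minimal example under stated assumptions: `run_query` is a hypothetical callable that you supply to submit one document as a TextParse query, and the harness runs the documents one after another, so it measures sequential response time rather than behaviour under concurrent load.

```python
import statistics
import time


def measure_match_times(documents, run_query):
    """Time run_query(document) for each document; return summary stats in seconds."""
    timings = []
    for document in documents:
        start = time.perf_counter()
        run_query(document)
        timings.append(time.perf_counter() - start)
    # statistics.mean raises StatisticsError if documents is empty.
    return {
        "count": len(timings),
        "mean": statistics.mean(timings),
        "max": max(timings),
    }


# Stand-in query function so the sketch runs without a server.
print(measure_match_times(["doc one", "doc two"], lambda document: len(document)))
```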

With reasonably small or sectioned documents (<5K in size) and reasonably complex agent document definitions it is possible to match hundreds of thousands of agents with performance over 100 documents per second (with adequate hardware and memory).

Gotchas

There are a few “gotchas” that you will need to avoid when using TextParse queries. The first is related to the large size of the textparse query itself. Since a textparse query involves sending an entire IDX document as part of the query, the query can become very large. Therefore you need to make sure that you have MaxInputString set to a large value or disabled (by setting it to -1). That setting is only available in the configuration file. You also need to make sure that you have MaxQueryTerms set to a large enough value to accommodate the large document. MaxQueryTerms is set to 250000 in the default agentstore configuration. That should be a reasonable size for smaller documents; however, you can specify a larger value in the ACI query to override the configured default if necessary.

The second gotcha has to do with the way agent documents match the textparse document. The engine always attempts to match the agent document's TextParseIndexType fields against the text of the textparse document first. If there is no match, it proceeds no further. That means that if the agent document defines no text, or the textparse document has no text, then no match will be made even though the Boolean or field text portions of the agent document match the textparse document. The way around this is to always append an artificial term, say “XXDEFAULTVALUEXX”, to the end of the DRECONTENT field in your textparse document. Then in your agent documents you must make sure that you add “XXDEFAULTVALUEXX” to the DRECONTENT if it would otherwise be empty. This will prevent empty text in the agent document from disabling the agent completely.
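Appending the artificial term is easy to automate on the query side. The Python sketch below assumes the input is a single IDX document terminated by #DREENDDOC; the function name `append_default_term` is hypothetical. It inserts the term as the last content line and adds a #DRECONTENT marker if the document has none.

```python
DEFAULT_TERM = "XXDEFAULTVALUEXX"


def append_default_term(idx_document, term=DEFAULT_TERM):
    """Insert the sentinel term as the last content line before #DREENDDOC."""
    lines = idx_document.rstrip("\n").split("\n")
    if "#DREENDDOC" not in lines:
        raise ValueError("expected a single IDX document ending in #DREENDDOC")
    end = lines.index("#DREENDDOC")
    if "#DRECONTENT" not in lines[:end]:
        # No content section at all: create one so the term has a home.
        lines.insert(end, "#DRECONTENT")
        end += 1
    lines.insert(end, term)
    return "\n".join(lines) + "\n"


print(append_default_term("#DREREFERENCE 1\n#DRECONTENT\nGold prices rose\n#DREENDDOC\n"))
```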

Conclusion

TextParse queries are the key to massively scalable document alerting with Autonomy; however, they can be tricky to implement if you don’t configure all the pieces correctly. Armed with the information in this document, you should be able to configure your agent alerting system in a snap and be up and running in minutes. In a subsequent post I will show how to scale this solution out to handle millions of user alerts for matching against tens or hundreds of documents per second.

 

January 30
Tracking Your Content in Autonomy

Most folks are used to relational database systems, which provide atomic content update transactions as a matter of course. New users of full-text indexing systems, such as Autonomy, are therefore often surprised to find that full-text index content update operations are not transactional.

 

With a transactional system, you can be sure that your content update has been safely committed once the transaction has completed. Unfortunately, due to the nature of the technology, full-text indexing systems, such as Autonomy, cannot provide that assurance. This is important because there is no guarantee that the content has been indexed, especially in a larger Autonomy environment where the content may be mirrored or partitioned across multiple systems.

 

So how does one achieve transactional nirvana in a non-transactional system such as Autonomy? One answer is to implement a document tracking system (DTS). Implementing DTS with Autonomy is fairly straightforward once you know the pertinent configuration settings and data formats which I will cover in this article. You should be familiar with Autonomy scalability architecture, and the Distributed Index Handler (DIH) in particular, before reading further. You can find out more about DIH here.

 

What is DTS?

 

DTS verifies that content was indexed by actively checking the indices to confirm that the content is actually stored. It's equivalent to verifying a file copy by reading the file back from the copy destination. If the DTS finds that the document has been indexed, it can record that fact in a database for use by other applications. Alternatively, if the DTS finds that a document has not been indexed, it can log the issue, alert administrative staff, or attempt to re-index the content automatically (or all of the above).

 

DTS can be implemented in a number of ways, depending on how much control your system has over the indexing process. The least intrusive method is to take advantage of the DIH archive folder feature. The DIH can be configured to automatically store every indexing command to the filesystem as an archive. The archive folder then contains a full record of all indexing performed through the DIH, which makes it an ideal monitoring point for a DTS.

 

How is DTS Implemented?

 

Using the DIH archive folder, a DTS would monitor for new index command files in the archive folder. It would allow those files to "age" for a time period before processing them. The aging is required to ensure that the actual indexing operation has completed. Remember that indexing is an asynchronous operation; it could be several minutes (or even hours for large indices) after an index job has been submitted to the DIH before the operation has completed.
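The aging step can be sketched with a simple modification-time check. The Python example below is an illustration, not a finished monitor: the function name `aged_command_files` and its parameters are assumptions, and the default pattern `*.icmd` matches the command-file extension that the DIH uses in its archive folder, as described in this article.

```python
import time
from pathlib import Path


def aged_command_files(archive_dir, min_age_seconds, now=None, pattern="*.icmd"):
    """Return command files last modified at least min_age_seconds ago, oldest first."""
    now = time.time() if now is None else now
    aged = [
        path for path in Path(archive_dir).glob(pattern)
        if now - path.stat().st_mtime >= min_age_seconds
    ]
    # Oldest first, so commands are verified in roughly the order they arrived.
    return sorted(aged, key=lambda path: path.stat().st_mtime)
```

A real DTS would also need to remember which files it has already processed, for example by moving them to another folder or recording them in its database.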

 

Once the DTS is in possession of an "aged" index command it can then verify that command against the target indices. For each command the DIH creates two files in the archive folder: one with an ".icmd" extension and one with a ".data" extension. The icmd file contains the ACI URL command that was issued to the DIH. The data file contains the actual document data (if any) that was associated with the command. If the command file contains the string "DREADD" then it is a content submission for one or more documents. If the command file contains the string "DREDELETEREF" or "DREDELETEDOC" then it is a content deletion for one or more documents.
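The string checks described above translate directly into code. In this Python sketch, `classify_command` is a hypothetical helper name; it takes the text of an icmd file and applies exactly the substring tests from the previous paragraph, ignoring case.

```python
def classify_command(icmd_text):
    """Classify the contents of an .icmd file as 'add', 'delete', or 'other'."""
    text = icmd_text.upper()
    if "DREDELETEREF" in text or "DREDELETEDOC" in text:
        return "delete"
    if "DREADD" in text:
        return "add"
    return "other"
```

Commands classified as "other" are neither submissions nor deletions and can be skipped by the verification step.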

 

When the DTS finds a DREADD command it must verify that all documents in the data file have actually been indexed. Therefore it will need to extract the DREREFERENCE for each document in the data file. For IDX files, this is a matter of parsing out every #DREREFERENCE line. For XML files one would use a SAX or XmlReader parser to extract the XML element designated as the reference. Once all the DREREFERENCEs have been extracted, then an ACI query command can be created in order to query the content in the target index systems. You can use action=getcontent, action=list, or action=query for your query. Make sure to use print=None to minimize the content which is returned. 
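For IDX data, extracting the references is a line-by-line scan. The Python sketch below uses a hypothetical function name, `extract_references`, and assumes each reference sits on the same line as its #DREREFERENCE tag, as in the IDX samples elsewhere on this blog.

```python
def extract_references(idx_text):
    """Return the DREREFERENCE value of every document in an IDX string."""
    references = []
    for line in idx_text.splitlines():
        if line.startswith("#DREREFERENCE"):
            value = line[len("#DREREFERENCE"):].strip()
            if value:
                references.append(value)
    return references


sample = "#DREREFERENCE doc-1\n#DRECONTENT\ntext\n#DREENDDOC\n#DREREFERENCE doc-2\n#DREENDDOC\n"
print(extract_references(sample))
```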

 

Verifying a DREADD in a distributed environment where content mirroring is employed could mean that the DTS has to check multiple Autonomy IDOL systems to verify the content. The DTS can be configured with a list of those systems or it could automatically read those systems from the DIH or IDOL configuration file (under the engines or distributed engines settings). For each one of the IDOL systems, the DTS would issue an ACI query constructed in the steps above in order to verify that the system actually contains those documents. The results of the ACI query would include the DREREFERENCEs of all indexed documents. The DTS would perform a difference between the expected DREREFERENCEs and those actually returned from the ACI query. If any references were missing, then that document would be flagged as a potentially missing document.
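The difference step is a set operation per engine. In the Python sketch below, `find_missing` is a hypothetical name, and the engine labels are whatever identifiers your DTS uses for its configured IDOL systems; the function reports, for each engine, the expected references that its query did not return.

```python
def find_missing(expected_refs, returned_refs_by_engine):
    """Map each engine name to the sorted references it failed to return."""
    expected = set(expected_refs)
    report = {}
    for engine, returned in returned_refs_by_engine.items():
        missing = expected - set(returned)
        if missing:
            report[engine] = sorted(missing)
    return report


print(find_missing(
    ["doc-1", "doc-2", "doc-3"],
    {"engine-a": ["doc-1", "doc-2", "doc-3"], "engine-b": ["doc-1"]},
))
```

Engines that returned every expected reference are omitted from the report, so an empty result means the DREADD was verified everywhere.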

 

Processing of a DREDELETEREF or DREDELETEDOC command is analogous to that of DREADD. In this case there is no document content in the data file, so you must extract the DREREFERENCEs from the icmd file. Once it has the references, the DTS would create the verification ACI query and send it to all IDOL systems as described above. In this case the DTS would expect to receive zero results because all those documents should have been deleted. If any references are returned, they would be flagged as potentially undeleted documents.

 

Don't Forget to Consider the Index Queue

 

In the sections above I referred to "potentially" missing documents or "potentially" undeleted documents. The DTS can't be sure that a document is actually missing or undeleted until it checks all other commands in the pending index queue. Remember, a document can be indexed, deleted, and re-indexed. So it is possible that a document appears to be missing from the perspective of an earlier index command, but it was actually deleted by a later index command. Therefore a full implementation of DTS should always consider all commands in queue before declaring a document missing or undeleted. Note that this is not necessary in archive systems that never delete content, or delete it in managed batch blocks.
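One way to account for the queue is to replay all commands in order and compute the state each reference should finally be in, then compare that against the index. The Python sketch below is a simplified model under stated assumptions: `expected_final_state` is a hypothetical name, and each command has already been reduced to an ("add" or "delete", references) pair.

```python
def expected_final_state(commands):
    """Replay (operation, references) pairs in order; return {reference: present?}."""
    state = {}
    for operation, references in commands:
        for reference in references:
            # A later command always overrides an earlier one for the same reference.
            state[reference] = (operation == "add")
    return state


queue = [
    ("add", ["doc-1", "doc-2"]),
    ("delete", ["doc-1"]),
    ("add", ["doc-1"]),
    ("delete", ["doc-2"]),
]
print(expected_final_state(queue))
```

A reference should only be flagged as missing if its final state is present but the index does not contain it, and as undeleted only if its final state is absent but the index still returns it.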

 

As mentioned above, the DTS can take multiple actions once it has found an inconsistency. It can log the issue, alert operations staff, and/or automatically resubmit the index command. In the latter case, the DTS would use the DIH icmd file and data file to construct a DRE command to re-index or delete the content. The DTS should submit the command directly to the inconsistent IDOL system rather than re-submitting the command through the DIH. That will help avoid the possibility of infinite loops. Once the DTS has resubmitted the index command it should also verify that the resubmit was correctly indexed using the process described above. If the resubmit fails, then the content should be resubmitted again until a maximum configured retry count is exceeded.
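The bounded retry can be expressed as a short loop. In this Python sketch the names are hypothetical: `resubmit` is a callable that sends the command directly to the inconsistent IDOL system, and `verify` is a callable that performs the check described above (including any aging delay) and returns True on success.

```python
def resubmit_until_verified(resubmit, verify, max_retries):
    """Call resubmit() then verify() until verify() is true or retries run out.

    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, max_retries + 1):
        resubmit()
        if verify():
            return attempt
    return None


# Stand-in callables: verification succeeds on the third attempt.
outcomes = iter([False, False, True])
print(resubmit_until_verified(lambda: None, lambda: next(outcomes), max_retries=5))
```

A None result is the point at which the DTS should stop retrying and fall back to logging the issue or alerting operations staff.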

 

Conclusion

 

Using the system described above you can safely submit content to your Autonomy systems and verify with absolute certainty that it was indexed. DTS achieves this independently without having to add any special tracking or transaction processing to your application systems.

 

If you like what you've read above, but don't have the time or resources to develop your own DTS, then consider using the CapaxGlobal DTS. Our cross-platform DTS provides all the capabilities described above, and then some, in an affordable and easily administered service which runs directly on your Autonomy DIH systems.