Oracle® Ultra Search User's Guide 10g Release 1 (10.1) Part Number B10731-02
This chapter contains the following topics:
Ultra Search is built on Oracle Database and Oracle Text technology. It provides uniform search-and-locate capabilities over multiple repositories: Oracle databases, other ODBC-compliant databases, IMAP mail servers, HTML documents served up by a Web server, files on disk, and more.
Ultra Search uses a 'crawler' to collect documents. You can schedule the crawler to suit the Web sites that you want to search. The documents stay in their own repositories, and the crawled information is used to build an index that stays within your firewall in a designated Oracle database. Ultra Search also provides APIs for building content management solutions.
In addition, Ultra Search offers the following:
A complete text query language for text search inside the database
Full integration with the Oracle Database and the SQL query language
Advanced features like concept searching and theme analysis
Attribute mapping to facilitate attribute search across disparate repositories
Indexing of all popular file formats (150+)
Full globalization, including support for Chinese, Japanese and Korean (CJK), and Unicode
Ultra Search is made up of the following components:
The Ultra Search crawler is a Java process activated by your Oracle server according to a set schedule. When activated, the crawler spawns a configurable number of processor threads that fetch documents from various data sources and index them using Oracle Text. This index is used for querying. Data sources can be Web sites, database tables, files, mailing lists, Oracle Application Server Portal page groups, or user-defined data sources.
The crawler maps links and analyzes relationships. The crawler schedule is integrated with and driven from the DBMS_JOB queue mechanism. Whenever the crawler encounters embedded, non-HTML documents during crawling, it uses Oracle Text filters to automatically detect the document type, and then filters and indexes the document.
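The fetch-and-index loop described above can be pictured with a small sketch. This is a conceptual illustration only, not Ultra Search code: the real crawler is a Java process that indexes with Oracle Text, whereas here the data source (`FAKE_SOURCE`), the `fetch` function, and the toy inverted index are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "data source": URL -> document text.
FAKE_SOURCE = {
    "http://example.com/a": "alpha document",
    "http://example.com/b": "beta document",
    "http://example.com/c": "gamma document",
}

def fetch(url):
    """Stand-in for a network or file fetch; returns (url, text)."""
    return url, FAKE_SOURCE[url]

def crawl(urls, num_threads=2):
    """Fetch every URL with a configurable number of worker threads and
    build a toy inverted index (word -> set of URLs)."""
    index = {}
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Fetching runs in the pool; indexing runs in the calling thread,
        # so the index dictionary needs no locking.
        for url, text in pool.map(fetch, urls):
            for word in text.split():
                index.setdefault(word, set()).add(url)
    return index

index = crawl(sorted(FAKE_SOURCE), num_threads=2)
```

Raising `num_threads` only helps while fetching is the bottleneck, which is presumably why the number of crawler threads is configurable.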
The Ultra Search backend consists of an Ultra Search repository and Oracle Text. Oracle Text provides the text indexing and search capabilities required to index and query data retrieved from your data sources. The backend is not visible to users; it indexes information from the crawler and serves up the query results.
See Also: "Installing the Ultra Search Backend"
The administration tool is a J2EE-compliant Web application. You can use it to manage Ultra Search instances, and you can access it from any browser in your intranet. The administration tool is independent from the Ultra Search query application. Therefore, the administration tool and query application can be hosted on different computers to enhance security and scalability.
Ultra Search provides the following APIs:
The query API works with indexed data. The Java API does not impose any HTML rendering elements. The application can completely customize the HTML interface.
The crawler agent API crawls and indexes proprietary document repositories.
The email Java API accesses archived emails and is used by the query application to display emails. It can also be used when building your own custom query application.
The URL rewriter API is used by the crawler to filter and rewrite extracted URL links before they are inserted into the URL queue.
Ultra Search includes highly functional query applications to query and display search results. The query applications are based on JSP and work with any JSP 1.1-compliant engine.
This section explains some features in Ultra Search. It includes the following topics:
You can create a read-only snapshot of a master Ultra Search instance. This is useful for query processing or for a backup. You can also make a snapshot instance updatable. This is useful when the master instance is corrupted and you want to use a snapshot as a new master instance.
See Also: "Instances Page"
Document attributes, or metadata, describe the properties of a document. Each data source has its own set of document attributes. The value is retrieved during the crawling process and then mapped to one of the search attributes and stored and indexed in the database. This lets you query documents based on their attributes. Document attributes in different data sources can be mapped to the same search attribute. Therefore, you can query documents from multiple data sources based on the same search attribute.
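As a rough illustration of attribute mapping, the sketch below maps document attributes from two data sources onto shared search attributes, so that one attribute query covers both. The source names, the mapping table, and the function name are hypothetical; Ultra Search stores and indexes the mapped values in the database rather than in Python dictionaries.

```python
# Hypothetical per-source mappings: document attribute -> search attribute.
ATTRIBUTE_MAP = {
    "web":   {"title": "Title", "author": "Author"},
    "email": {"subject": "Title", "from": "Author"},
}

def to_search_attributes(source, doc_attrs):
    """Keep only mapped attributes, renamed to their search attribute."""
    mapping = ATTRIBUTE_MAP[source]
    return {mapping[name]: value
            for name, value in doc_attrs.items() if name in mapping}

docs = [
    to_search_attributes("web", {"title": "Release notes", "author": "Kim",
                                 "charset": "utf-8"}),
    to_search_attributes("email", {"subject": "Re: release", "from": "Kim"}),
]

# One query on the shared Author attribute spans both data sources.
by_kim = [d for d in docs if d["Author"] == "Kim"]
```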
Ultra Search has the following default search attributes: Title, Author, Description, Subject, Mimetype, Language, Host, and LastModifiedDate. They can be incorporated in search applications for a more detailed search and richer presentation. The list of values (LOV) for a search attribute can help you specify a search query. If an attribute LOV is available, then the crawler registers the LOV definition, which includes the attribute value, the attribute value display name, and its translation.
See Also: "Synchronizing Data Sources"
Ultra Search provides a command-line tool to load metadata into an Ultra Search database. If you have a large amount of data, this is probably faster than using the HTML-based administration tool. The loader tool supports the following types of metadata:
You can define, edit, or delete your own data sources and types in addition to the ones provided. You might implement your own crawler agent to crawl and index a proprietary document repository, such as Lotus Notes or Documentum, which contain their own databases and interfaces. The proprietary repository is called a user-defined data source. The module that enables the crawler to access the data source is called a crawler agent.
Robots exclusion lets you control which parts of your sites can be visited by robots. If robots exclusion is enabled (default), then the Web crawler traverses the pages based on the access policy specified in the Web server robots.txt file. For example, when a robot visits http://www.foobar.com/, it checks for http://www.foobar.com/robots.txt. If it finds it, the crawler analyzes its contents to see if it is allowed to retrieve the document. If you own the Web sites, then you can disable robots exclusion. However, when crawling other Web sites, you should always comply with robots.txt by enabling robots exclusion.
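The robots.txt check can be sketched with Python's standard `urllib.robotparser` module. This only illustrates the protocol, not the Ultra Search crawler: the robots.txt content, the `is_allowed` helper, and its `robots_exclusion` flag are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Example policy a Web server might publish at /robots.txt (made up here).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(url, robots_txt, user_agent="*", robots_exclusion=True):
    """Return True if a crawler may fetch the URL under the given policy."""
    if not robots_exclusion:
        # Exclusion disabled: the crawler ignores robots.txt entirely.
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With this policy, `is_allowed("http://www.foobar.com/private/report.html", ROBOTS_TXT)` is false, while the same call with `robots_exclusion=False` is true.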
See Also: "Web Sources"
For initial planning purposes, you might want the crawler to collect URLs without indexing. After crawling is done, you can examine document URLs and status, remove unwanted documents, and start indexing. You can update the crawling mode to the following:
Automatically accept all URLs for indexing
Examine URLs before indexing
Index only
See Also: "Schedules Page"
The URL rewriter is a user-supplied Java module that implements the Ultra Search UrlRewriter interface. It is used by the crawler to filter or rewrite extracted URL links before they are put into the URL queue. URL filtering removes unwanted links, and URL rewriting transforms the URL link. This transformation is necessary when access URLs are used.
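The filter-or-rewrite step can be sketched as follows. This is a language-neutral illustration in Python, not the Java UrlRewriter interface, whose method signatures are not shown in this chapter. The `rewrite` and `enqueue` functions, the host names, and the convention of returning `None` to drop a link are all assumptions.

```python
from urllib.parse import urlsplit, urlunsplit

def rewrite(url):
    """Return the URL to enqueue, or None to filter the link out."""
    parts = urlsplit(url)
    if parts.path.endswith(".zip"):
        return None  # filtering: drop unwanted links
    if parts.netloc == "internal.example.com":
        # rewriting: replace a hypothetical internal host with the public one
        parts = parts._replace(netloc="www.example.com")
    return urlunsplit(parts)

def enqueue(extracted_links):
    """Apply the rewriter to each extracted link before queuing it."""
    queue = []
    for link in extracted_links:
        rewritten = rewrite(link)
        if rewritten is not None:
            queue.append(rewritten)
    return queue
```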
Ultra Search offers a flexible query API for incorporating search functionality into your sites. The query API includes the following functionality:
Three attribute types: string, number, and date
Multivalued attributes
Display name support for attributes, attribute list of values (LOV), and data groups
Document relevancy boosting
Arbitrary grouping of attribute query operators using logical operators (AND, OR), with control over attribute operator evaluation order
Selection of metadata returned in query result
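To make the grouping of attribute operators concrete, the sketch below evaluates a nested AND/OR expression against a document's attributes, including a multivalued attribute. The tuple-based expression format and the `matches` function are invented for illustration; they are not the query API's classes.

```python
def matches(expr, attrs):
    """Evaluate a nested ("AND"/"OR"/"EQ", ...) expression against attributes."""
    op = expr[0]
    if op == "AND":
        return all(matches(sub, attrs) for sub in expr[1:])
    if op == "OR":
        return any(matches(sub, attrs) for sub in expr[1:])
    if op == "EQ":
        _, name, value = expr
        stored = attrs.get(name, [])
        # A multivalued attribute matches if any one of its values matches.
        values = stored if isinstance(stored, (list, tuple, set)) else [stored]
        return value in values
    raise ValueError(f"unknown operator: {op!r}")

doc = {"Author": ["Kim", "Lee"], "Language": "en"}

# (Author = Lee OR Author = Ng) AND Language = en
query = ("AND",
         ("OR", ("EQ", "Author", "Lee"), ("EQ", "Author", "Ng")),
         ("EQ", "Language", "en"))
```

Here the nesting of the tuples fixes the evaluation order, which plays the role of explicit grouping in a query.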
Ultra Search supports secure searches, which return only documents satisfying the search criteria that the search user is allowed to view. For secure searches, each indexed document should be protected by an access control list (ACL). During searches, the ACL is evaluated. If the user performing the search has permission to read the protected document, then the document is returned by the query API. Otherwise, it is not returned.
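The effect of ACL evaluation on a result list can be sketched as below. This is a simplified model, assuming an ACL is just a set of user and group names with read permission. In Ultra Search the evaluation is performed by the Oracle server, and the `secure_search` function and its data shapes are hypothetical.

```python
def secure_search(hits, acls, user, groups):
    """Return only the documents whose ACL grants read access to the user.

    hits: list of (doc_id, acl_id) pairs that satisfy the search criteria
    acls: dict mapping acl_id -> set of principals allowed to read
    """
    principals = {user} | set(groups)
    return [doc_id for doc_id, acl_id in hits if acls[acl_id] & principals]

acls = {"acl-hr": {"hr-staff"}, "acl-all": {"employees"}}
hits = [("salary.doc", "acl-hr"), ("handbook.doc", "acl-all")]

visible = secure_search(hits, acls, user="alice", groups=["employees"])
```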
There are two ways to secure a data source:
Specify a single ACL for protecting all documents of a data source.
The administrator specifies the permissions of the single ACL in the Ultra Search administration tool. The resulting ACL is used to protect all documents belonging to that data source.
Crawl ACLs from the data source.
The data source is expected to provide the ACL together with the document. This lets each document be protected by its own unique ACL.
Ultra Search performs ACL duplicate detection. This means that if a crawled document's ACL already exists in the Ultra Search system, then the existing ACL is used to protect the document, instead of creating a new ACL within Ultra Search. This policy reduces storage space and increases performance.
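One way to picture ACL duplicate detection is as interning: each distinct ACL is stored once, and documents with an equal ACL share its identifier. The sketch below assumes an ACL can be reduced to a set of principals. The `AclStore` class is illustrative and does not describe how Ultra Search compares ACLs internally.

```python
class AclStore:
    """Stores each distinct ACL once and hands out a shared identifier."""

    def __init__(self):
        self._ids = {}  # canonical ACL (frozenset of principals) -> ACL id

    def intern(self, principals):
        """Return the id of an equal existing ACL, creating one if needed."""
        key = frozenset(principals)  # order and duplicates do not matter
        if key not in self._ids:
            self._ids[key] = len(self._ids) + 1
        return self._ids[key]

    def __len__(self):
        return len(self._ids)

store = AclStore()
first = store.intern(["alice", "hr-staff"])
second = store.intern(["hr-staff", "alice"])   # same ACL, different order
third = store.intern(["employees"])
```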
Ultra Search supports only a single LDAP domain. The LDAP users and groups specified in the ACL must belong to the same LDAP domain.
Caution: If ACLs are crawled from data sources, then it is the responsibility of the administrator to ensure that the data sources being crawled belong to the same LDAP domain. Otherwise, it is possible that search users can inadvertently be granted permissions to access documents that they should not be able to access.
Searches run against a secure-search enabled Ultra Search instance are slower than those run against an instance without secure search enabled. This is because each candidate hit could require an ACL evaluation. ACLs are evaluated natively by the Oracle server for optimum performance. Nevertheless, each evaluation takes a finite amount of time. Therefore, the time taken to return hits in a secure search varies depending on the number of ACL evaluations that must be made.
Ultra Search stores ACLs in the Oracle XML DB repository. Ultra Search also uses Oracle XML DB functionality to evaluate ACLs. (This dependency only exists for those users who are making use of secure searching.)
The ACLs are managed by Ultra Search. ACLs are uniquely referenced by documents from a single Ultra Search instance. ACLs are not shared by multiple Ultra Search instances. For acceptable performance, the ACL cache size must be large enough to contain all ACLs evaluated at run time.
ACLs in the XML DB repository are protected by other ACLs (known as "protector ACLs"). Ultra Search ensures that the protector ACLs grant appropriate privileges in order for Ultra Search to invoke the XML DB ACL evaluation mechanism. The evaluation performance is primarily affected by the total number of ACLs used by all XML DB client applications that also utilize its ACL evaluation mechanism. This set of applications includes Ultra Search.
An Ultra Search data source can be protected by a single administrator specified ACL. This ACL specifies which users and groups are allowed to view the documents belonging to that data source.
Ultra Search uses the Oracle Server's ACL evaluation engine to evaluate permissions when queries are performed by search users. This ACL evaluation engine is a feature of Oracle XML DB. If an Ultra Search query attempts to retrieve a document that is protected by an administrator specified ACL, the ACL is evaluated and subsequently cached.
The length of time an ACL is cached is controlled by an XML DB configuration parameter. (For more information, consult the Oracle XML DB Developer's Guide.) To change this duration, modify the /xdbconfig/sysconfig/acl-max-age parameter. The value is a number of seconds that determines how long ACLs are cached.
Since ACLs are cached, it is important to remember that changes to an administrator specified ACL may not propagate immediately. This only applies to database sessions that existed before the change was made.
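The delayed propagation follows from how any age-limited cache behaves, as the sketch below shows. It is a generic time-to-live cache, not the XML DB implementation. The `AclCache` class, its `load` callback, and the injectable clock are assumptions made so that the example runs without waiting.

```python
import time

class AclCache:
    """Caches loaded ACLs for at most max_age_seconds."""

    def __init__(self, max_age_seconds, load, clock=time.monotonic):
        self._max_age = max_age_seconds
        self._load = load        # function: acl_id -> ACL
        self._clock = clock      # injectable so expiry can be simulated
        self._entries = {}       # acl_id -> (time cached, ACL)

    def get(self, acl_id):
        now = self._clock()
        entry = self._entries.get(acl_id)
        if entry is not None and now - entry[0] < self._max_age:
            return entry[1]      # still fresh: serve the cached copy
        value = self._load(acl_id)
        self._entries[acl_id] = (now, value)
        return value

backing = {"acl-1": {"alice"}}
now = [0.0]  # simulated clock, in seconds
cache = AclCache(60, lambda acl_id: set(backing[acl_id]), clock=lambda: now[0])

first = cache.get("acl-1")      # loaded and cached
backing["acl-1"] = {"bob"}      # administrator changes the ACL
stale = cache.get("acl-1")      # cached copy is still served
now[0] = 61.0                   # the maximum age has passed
fresh = cache.get("acl-1")      # entry expired, so the ACL is reloaded
```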
Ultra Search includes fully functional sample query applications to query and display search results. The sample query applications include a sample search portlet. The sample Ultra Search portlet demonstrates how to write a search portlet for use in Oracle Application Server Portal. This same portlet is installed as a feature of the Oracle Application Server Portal product.
See Also: "Ultra Search Query API"
You can override the search results and influence the order that documents are ranked in the query result list with document relevancy boosting. This can promote important documents to higher scores and make them easier to find.
Relevancy boosting assigns a score to a document for specific queries entered by the search user.
Note: The document still has a score computed by Oracle Text if you enter a query that is not one of the boosted queries.
Relevancy boosting has the following limitations:
Comparison of the user's query against the boosted queries uses exact string match. This means that the comparison is case-sensitive and space-aware. Therefore, a document with a boosted score for "Ultra Search" is not boosted when you enter "ultrasearch".
Relevancy boosting requires that the query application pass in the search term in the Query API getResult method call. The sample applications are designed to pass the basic search terms as the boost term. Advanced search criteria based on search attributes are ignored.
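The exact-match limitation can be demonstrated with a small sketch. The boost table, the document identifier, and the `score` function are hypothetical, and the sketch assumes a boosted score simply replaces the score computed by the text engine.

```python
# Hypothetical boost table: boosted query string -> {document id: score}.
BOOSTS = {"Ultra Search": {"doc-42": 100}}

def score(doc_id, query, text_score):
    """Use the boosted score only when the query string matches exactly."""
    return BOOSTS.get(query, {}).get(doc_id, text_score)

boosted = score("doc-42", "Ultra Search", 37)   # exact match: boosted
no_space = score("doc-42", "ultrasearch", 37)   # differs in case and spacing
lowercase = score("doc-42", "ultra search", 37) # differs in case only
```

Because the lookup is a plain string comparison, a query application that wants case-insensitive boosting would have to normalize the term itself before passing it in.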
See Also: "Queries Page"
Ultra Search translates each user query into a database query. This process is called query syntax expansion. The expansion logic determines the relevancy and recall of the search results. The Ultra Search default expansion boosts the relevancy of documents that match the user's query in their title. The query syntax expansion can be customized with the query API.
See Also: "Customizing the Query Syntax Expansion"
When gathering information from a database-based Web application, Ultra Search lets you specify a URL for displaying the retrieved data in a browser. The URL points to the screen in the Web application that corresponds to the data in the database tables, so the data is rendered by the application itself. This is available for table data sources, file data sources, and user-defined data sources.
See Also: "Using Crawler Agents"
Traditionally, Ultra Search used centralized search to gather data on a regular basis and update one index that cataloged all searchable data. This provided fast searching, but it required the data source to be crawlable before it could be searched. Ultra Search now also provides federated search, which allows a single search to be performed across multiple indexes. Each index can be maintained separately. Because the data source is queried at search time, search results are always the latest results. User credentials can be passed to the data source and authenticated by the data source itself. Queries can be processed efficiently using the data's native format.
To use federated search, you must deploy an Ultra Search search adapter, or searchlet, and create an Oracle source. A searchlet is a Java module deployed in the middle tier (inside OC4J) that searches the data in an enterprise information system on behalf of a user. When a user's query is delegated to the searchlet, the searchlet runs the query on behalf of the user. Every searchlet is a JCA 1.0 compliant resource adapter.
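The idea of one search spanning several separately maintained indexes can be sketched as a fan-out followed by a merge. This is a conceptual illustration, not the searchlet or JCA interface: the two search functions, their canned results, and the assumption that scores from different sources are directly comparable are all made up for the example.

```python
import heapq

def federated_search(query, searchers, limit=10):
    """Send the query to every searcher and merge the hits by score."""
    hits = []
    for search in searchers:
        hits.extend(search(query))  # each source is queried at search time
    return heapq.nlargest(limit, hits, key=lambda hit: hit[0])

# Hypothetical per-source search functions returning (score, document) pairs.
def mail_search(query):
    return [(0.9, "mail:1"), (0.2, "mail:7")] if query == "budget" else []

def wiki_search(query):
    return [(0.5, "wiki:Budget")] if query == "budget" else []

results = federated_search("budget", [mail_search, wiki_search], limit=2)
```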
See Also: "Federated Sources"
The Ultra Search administration tool supports several modes of logging on, depending on the type of user. You can log on as:
A single sign-on (SSO) user managed in the Oracle Internet Directory and authenticated with the SSO server
A local database schema user in the Ultra Search database (non-SSO mode)
A Portal user
An Enterprise Manager user
Note: Single Sign-On (SSO) is available only with the Oracle Identity Management infrastructure.
See Also: "Logging On to Ultra Search"
Oracle Internet Directory is Oracle's native LDAP v3-compliant directory service, built as an application on top of the Oracle Database. Oracle Internet Directory hosts the Oracle common identity. All Oracle Web-based products integrate with the SSO server for single sign-on support.
An Ultra Search administration group contains a set of users. Each user can belong to one or multiple groups. All groups are created using groupOfUniqueNames and orclGroup object classes.
The only way to grant a user administration privileges is to assign them to an administration group. Ultra Search authorizes the user administration privileges based on the administration groups to which the user belongs. The following groups are created for each Ultra Search instance:
Super-users: Users in this group can create or drop Ultra Search instances and can administer Ultra Search instances within the installation. Super-users must obey the rules for document relevancy boosting and ACL defined for each of the documents associated with the Ultra Search instance. For example, if a document ACL does not grant access to the super-user or group, then the super-user cannot search and browse the document.
Instance administrators: Users in this group can administer the Ultra Search instance. Only the instance database schema user and members in the super-users group can drop the instance.
The authorization of the administration user is performed in the following steps:
After the administration user is successfully authenticated by the SSO server or the Ultra Search database, the Ultra Search GUI brings up the first screen for the user to choose an Ultra Search instance.
The Ultra Search GUI looks up the Oracle Internet Directory server or Ultra Search repository to find all Ultra Search instances within the installation that the administration user has privileges to administer.
The administration user chooses the Ultra Search instance from the list.
Although Ultra Search in the Oracle Application Server is the same product as Ultra Search in Oracle Collaboration Suite and Ultra Search in the Oracle Database, there are a couple of differences:
The Oracle Database is not integrated with Oracle Application Server Portal. With Oracle Application Server and Oracle Collaboration Suite, Portal users can add powerful multi-repository search to their Portal pages. Oracle Application Server and Oracle Collaboration Suite can also crawl Portal's own repository and make it searchable.
Oracle Application Server includes a Single Sign-On (SSO) server. SSO users can log on once for all components of the Oracle Application Server product, and the Ultra Search administrative interface allows user management operations on either database users or SSO users. Authenticated SSO users never see the Ultra Search logon screen. Instead, they can immediately choose an instance. If the SSO user does not have permissions to manage Ultra Search (set in the Users Page), then the SSO user receives an error. SSO is available only with the Oracle Identity Management infrastructure.
See Also: http://portalstudio.oracle.com
Ultra Search provides a search portlet that can be embedded in Oracle Application Server Portal pages. It is implemented as a JavaServer Page application.
The Ultra Search search portlet supports most of the functionality provided by the Query API Complete Sample application.
Ultra Search is a client program to the Oracle server at run time. It can be deployed in two configurations: in the backend or in the middle tier.
The Ultra Search query interface and the administration tool can be accessed from any HTML browser client. The administration tool relies on certain Java classes in the middle tier. This logical middle tier can be the same physical computer as the one that runs the database server, or a different one, running Oracle Application Server. The Ultra Search database backend consists of the Ultra Search data dictionary that stores metadata on all the different repositories, as well as the schedules and Java classes needed to drive the crawler. The crawler itself can run either on the database server computer or remotely on another computer.
See Also: Chapter 3, "Installing and Configuring Ultra Search" for more information about the components
Figure 1-1 illustrates the Ultra Search system configuration.