5
Dec

A number of terms get used extensively in the search industry. While these terms do not all specifically pertain to federated search or to deep web searching, they’re worth knowing as they come up frequently in such conversations.

I welcome suggestions for terms to add to this glossary.

  • aggregation. The process of combining, or merging, search results from multiple sources. Aggregation typically involves removal of duplicates.
  • bibliographic database. A searchable database of information that describes documents, typically books and articles. A bibliographic database may include title, author, abstract, journal edition, publication date, and other descriptive fields. See definitions of bibliographic database on the web.
  • Boolean expression. A query containing one or more Boolean operators and optionally characters such as parentheses to force an order of evaluation.
  • Boolean operator. One of AND, OR, NOT. Boolean operators are used to limit or expand search criteria. Boolean operators can be combined in complex ways, e.g. “government AND NOT (American OR British)”.
  • clustering. A technology that organizes search results into groups of related documents. The intent is that documents within a group, or cluster, are more related to one another than to documents not in the cluster. A number of clustering systems use graphics to visually display clusters.

  • crawling. The process by which Google and other surface web search engines collect documents for indexing and searching. Crawling involves identifying and following links from documents to discover new documents. Surface web crawlers use a large list of known web-sites and discover pages within those sites plus other sites and their pages. See Web crawler - Wikipedia.
  • data mining. A broad term for the process of collecting large quantities of data and analyzing it from different perspectives in search of actionable information. See Data Mining: What is Data Mining?
  • data source. Generally refers to the location where searchable documents are stored, for example, a publisher’s database.
  • database. A collection of documents and/or document information organized in such a way as to facilitate search and retrieval of its contents. See What is database? A Word Definition From the Webopedia Computer Dictionary.
  • dedup, dedupe, or deduplication. The process of removing duplicate items from search results. Deduping across documents from multiple sources is difficult.
  • deep web. The set of web-sites and their documents that cannot be accessed via crawling. Deep web content typically lives inside of databases. The content is accessed through search forms. See Deep Web - Wikipedia.
  • enterprise search. The process of making content securely stored within an Intranet searchable. Enterprise search solutions typically search email, documents of multiple types (e.g. spreadsheet, PDF files, word processor files), plus documents stored in multiple databases.
  • federated search. The technology for searching multiple databases simultaneously. See Federated search - Wikipedia.
  • full text. Refers to the entire text of a document, as opposed to its metadata.
  • harvesting. The technology and process of retrieving all documents from a given repository for indexing and searching.
  • heterogeneous data sources. Refers to the fact that content within different data sources differs in structure and in method of access. This causes search engines to require different access methods for different data sources.
  • hit list. See result list.
  • indexable web. See surface web.
  • indexing. Analogous to creating an index of the text of a book. Makes it possible to quickly locate information related to search terms.
  • information retrieval. The science of searching documents and their metadata. See Information retrieval - Wikipedia.
  • invisible web. Refers to the deep web, i.e. to content that is not visible to web crawling technologies.
  • keyword. Refers to a search term intended to locate specific documents, or small sets of documents, rather than large sets of loosely related documents.
  • metadata, or meta data. Information that describes a document. Metadata typically includes author, title, abstract, journal edition, publication date, and other descriptive fields. See Metadata - Wikipedia.
  • metasearch. See federated search.
  • metasearch engine. See federated search. See also Metasearch engine - Wikipedia.
  • ontology. A model of concepts and their relationships within a domain. The intent in information retrieval is to make useful inferences from available information. See Ontology (computer science) - Wikipedia.
  • OPAC. Online Public Access Catalog. A library’s online catalog. See OPAC - Wikipedia.
  • phrase. One or more search terms intended to be searched literally and typically enclosed in double quotes. A search for the phrase “federated search engine” will only match documents with the three quoted words, one right after the other.
  • portal. A single point of access to information from disparate sources, typically organized around a theme. Portals are typically web-sites.
  • proximity searching. The process of locating documents in which search terms are within a certain number of characters or words from one another. The intent is to assign higher relevance rank to documents in which the search terms appear closer to one another. See Proximity search (text) - Wikipedia.
  • query. The set of words, Booleans, and punctuation entered by a user to perform a search. Refers to what the user entered into the search form.
  • query syntax. Refers to the use of special characters used in a search to convey a particular meaning to a search engine. Query syntax includes how to use wildcards, how to create phrases, and how to perform Boolean searches. Different search engines have different query syntaxes.
  • query term. One of potentially numerous words in a query.
  • query translation. The process of converting query terms and query syntax from one search engine to another. In federated search, query terms entered by a user must be converted to the proper query syntax for each underlying source searched.
  • real time. Refers to searches against live and possibly dynamic data as opposed to data that is periodically updated via crawling or harvesting technologies.
  • relevance. Refers to the likelihood that a document returned from a search is useful to the person performing the search. Search results are frequently referred to as highly relevant or not highly relevant.
  • relevance ranking. The process by which a score is assigned to a document based on its relevance. Documents federated from among different sources are assigned relevance rank scores which are used to determine the display order for search results ranked by relevance.
  • result list. Typically a one-page display of summary information about documents returned from a user search. Result lists are sorted by some criterion, commonly date or relevance, and they usually contain links to the full text or metadata of each document.
  • scalability. The ability of a search application to grow to handle a larger user, computation, or network load.
  • screen scraping. The process of extracting document information by reading and parsing the HTML returned by search engines, which is intended to be read by humans rather than machines. Processing structured data in XML or another format is preferable to screen scraping.
  • search engine. An application, frequently web-based, that takes a user query and returns documents relevant to the query. Search engines can access the deep web, the surface web, or a combination of both. See Search engine - Wikipedia.
  • search expression. See query.
  • search form. The set of elements with which a user interacts in performing a search. A search form typically includes one or more text boxes corresponding to search fields (e.g. title, author, date of publication), some mechanism for specifying how multiple search terms are handled in terms of Boolean logic, and a submit button.
  • search precision. Refers to the percent of documents retrieved from a search that are relevant. High precision and high recall are ideal but it is difficult to achieve both simultaneously. See The precision vs. recall tradeoff in Wikipedia.
  • search recall. Refers to the percent of all relevant documents that are retrieved from a search. High precision and high recall are ideal but it is difficult to achieve both simultaneously. See The precision vs. recall tradeoff in Wikipedia.
  • search term. See query term.
  • semantic web. An extension of the web in which software agents can interact with one another to exchange information. See Semantic Web - Wikipedia.
  • spidering. See crawling.
  • stemming. The process of reducing a word to its stem, or root. Search engines perform stemming to avoid omitting relevant results. Users normally expect a search for “federation” to return documents with the word “federate.” See Stemming - Wikipedia.
  • stopword, or stop word. A word that is removed from a user’s query expression prior to or after performing a search. Searching for the word “and”, for example, would return many non-relevant results and would likely overburden most search engines. See Stop words - Wikipedia.
  • surface web. The part of the web that can be accessed by crawling technologies, as opposed to content in the deep web. See Surface Web - Wikipedia.
  • visible web. See surface web.
  • wildcard, or wildcard character. A character that can substitute for one or more characters in a search expression. The asterisk (*) is frequently used to represent zero or more characters.
  • XML. Stands for Extensible Markup Language. In information retrieval, XML is used to structure document information in such a way that search engines can easily query and retrieve results from databases. Only some content owners provide XML search capability.
  • Z39.50. A standard for searching and retrieving documents commonly used in library environments. See Z39.50 - Wikipedia.
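
The precision and recall entries are easier to keep apart with a small worked example. The sketch below is only an illustration, not how any particular search engine measures itself: the function name `precision_recall` and the document IDs are made up, and it assumes someone has already judged which documents are relevant to the query. Precision divides the relevant hits by everything that was returned; recall divides the same hits by everything that should have been returned.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search, as fractions between 0 and 1."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually returned
    # Guard against division by zero when either set is empty.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, for illustration only.
retrieved_ids = ["d1", "d2", "d3", "d4"]        # what the search returned
relevant_ids = ["d2", "d4", "d5", "d6", "d7"]   # what a human judged relevant

precision, recall = precision_recall(retrieved_ids, relevant_ids)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four returned documents are relevant, so precision is 50%, while only two of the five relevant documents were found, so recall is 40%. Returning more documents would tend to raise recall at the cost of precision, which is the tradeoff the two entries mention.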


This entry was posted on Wednesday, December 5th, 2007 at 11:35 pm and is filed under basics.
