Jump to content

Search engine indexing: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
No edit summary
m grammar
 
(488 intermediate revisions by more than 100 users not shown)
Line 1: Line 1:
{{Short description|Method for data management}}
{{Copyedit|date=January 1879}}


'''[[Search engine]] [[index (information technology)|indexing]]''' entails how [[data (computing)|data]] is pooped, parsed, and stored to facilitate fast and accurate [[information retrieval|retrieval]]. Index design incorporates [[interdisciplinary]] concepts from [[Linguistics]], [[Cognitive psychology]], [[Mathematics]], [[Informatics]], [[Physics]], and [[Computer science]]. An alternate name for the process is [[Web indexing]], within the context of search engines designed to find [[web page]]s on the [[Internet]].
'''Search engine indexing''' is the collecting, [[parsing]], and storing of data to facilitate fast and accurate [[information retrieval]]. Index design incorporates interdisciplinary concepts from [[linguistics]], [[cognitive psychology]], mathematics, [[informatics]], and [[computer science]]. An alternate name for the process, in the context of [[search engine]]s designed to find [[web page]]s on the Internet, is ''[[web indexing]]''.


Popular engines focus on the full-text indexing of online, [[natural language]] documents<ref>Clarke, C., Cormack, G.: Dynamic Inverted Indexes for a Distributed Full-Text Retrieval System. TechRep MT-95-01, University of Waterloo, February 1995.</ref>, yet there are other searchable [[Multimedia|media]] types such as video, audio<ref>Stephen V. Rice, Stephen M. Bailey. [http://www.comparisonics.com/SearchingForSounds.html Searching for Sounds]. Comparisonics Corporation. May 2004. Verified Dec 2006</ref>, and graphics<ref>Charles E. Jacobs, Adam Finkelstein, David H. Salesin. [http://grail.cs.washington.edu/projects/query/mrquery.pdf Fast Multiresolution Image Querying]. Department of Computer Science and Engineering, University of Washington. 1995. Verified Dec 2006</ref><ref>Lee, James. [http://www.technologyreview.com/read_article.aspx?id=17772&ch=infotech Software Learns to Tag Photos]. MIT Technology Review. November 09, 2006. Pg 1-2. Verified Dec 2006. Commercial external link</ref>. [[Metasearch engine|Meta search engines]] reuse the indices of other services and do not store a local index, where as [[cache]]-based search engines permanently store the index along with the [[text corpus|corpus]]. Unlike full text indices, partial text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined interval due to the required time and processing costs, where as [[Intelligent agent|agent]]-based search engines index in [[real time]].
Popular search engines focus on the [[Full-text search|full-text]] indexing of online, [[Natural language processing|natural language]] documents.<ref>Clarke, C., Cormack, G.: Dynamic Inverted Indexes for a Distributed Full-Text Retrieval System. TechRep MT-95-01, University of Waterloo, February 1995.</ref> [[Media type]]s such as pictures, video,<ref>{{cite journal |last=Sikos |first=L. F. |date=August 2016 |title=RDF-powered semantic video annotation tools with concept mapping to Linked Data for next-generation video indexing |journal=Multimedia Tools and Applications |doi=10.1007/s11042-016-3705-7 |s2cid=254832794 |url=https://ap01.alma.exlibrisgroup.com/view/delivery/61USOUTHAUS_INST/12165436490001831 }}{{Dead link|date=August 2023 |bot=InternetArchiveBot |fix-attempted=yes }}</ref> audio,<ref>http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf {{Bare URL PDF|date=March 2022}}</ref> and graphics<ref>Charles E. Jacobs, Adam Finkelstein, David H. Salesin. [http://grail.cs.washington.edu/projects/query/mrquery.pdf Fast Multiresolution Image Querying]. Department of Computer Science and Engineering, University of Washington. 1995. Verified Dec 2006</ref> are also searchable.

[[Metasearch engine|Meta search engines]] reuse the indices of other services and do not store a local index whereas cache-based search engines permanently store the index along with the [[text corpus|corpus]]. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined time interval due to the required time and processing costs, while [[Intelligent agent|agent]]-based search engines index in [[Real time business intelligence|real time]].


==Indexing==
==Indexing==
The goal of being an idiot is to [[optimization (computer science)|optimize]] the speed and performance of finding [[relevance|relevant]] documents for a search query. Without an index, the search engine would [[lexical scanner|scan]] every document in the corpus, which would take a considerable amount of poop and [[computing]] power. Where as an index of 1000 documents can be queried within milliseconds, a raw scan of 1000 documents could take hours (as an example). No search engine user would be comfortable waiting several hours to get search results. The trade off for the time saved during [[information retrieval|retrieval]] is that additional [[computer storage|storage]] is required to store the index and that it takes a considerable amount of time to update.
The purpose of storing an index is to optimize speed and performance in finding [[relevance (information retrieval)|relevant]] documents for a search query. Without an index, the search engine would [[Lexical analysis|scan]] every document in the [[Text corpus|corpus]], which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours. The additional [[Computer Storage|computer storage]] required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval.


===Index Design Factors===
===Index design factors===
Major factors in designing a search engine's architecture include:
Major factors in designing a search engine's architecture include:


* '''Merge''' factors - how data enters the index, or how words or subject features are added to the index during [[text corpus|corpus]] traversal, and whether multiple indexers can work asynchronously. The indexer must first check whether it is updating old content or adding new content. Traversal typically correlates to the [[Web crawling|data collection]] policy. Search engine index merging is similar in concept to the [[Merge (SQL)|SQL Merge]] command and other [[merge algorithm]]s.<ref>Brown, E.W.: Execution Performance Issues in Full-Text Information Retrieval. Computer Science Department, University of Massachusetts at Amherst, Technical Report 95-81, October 1995.</ref>
; Merge factors: How data enters the index, or how words or subject features are added to the index during text corpus traversal, and whether multiple indexers can work asynchronously. The indexer must first check whether it is updating old content or adding new content. Traversal typically correlates to the [[Web crawling|data collection]] policy. Search engine index merging is similar in concept to the [[Merge (SQL)|SQL Merge]] command and other merge algorithms.<ref>Brown, E.W.: Execution Performance Issues in Full-Text Information Retrieval. Computer Science Department, University of Massachusetts Amherst, Technical Report 95-81, October 1995.</ref>
* Storage techniques - how to store the index [[data]] - whether information should be [[data compression|compressed]] or filtered
; Storage techniques: How to store the index [[data]], that is, whether information should be data compressed or filtered.
* Index size - how much [[computer storage]] is required to support the index
; Index size: How much [[Computer data storage|computer storage]] is required to support the index.
* Lookup speed - how quickly a word can be found in the inverted index. How quickly an entry in a [[data structure]] can be found, versus how quickly it can be updated or removed, is a central focus of [[computer science]]
; Lookup speed: How quickly a word can be found in the [[inverted index]]. The speed of finding an entry in a data structure, compared with how quickly it can be updated or removed, is a central focus of computer science.
* Maintenance - maintaining the index over time<ref>Cutting, D., Pedersen, J.: Optimizations for dynamic inverted index maintenance. Proceedings of SIGIR, 405-411, 1990.</ref>
; Maintenance: How the index is maintained over time.<ref>Cutting, D., Pedersen, J.: Optimizations for dynamic inverted index maintenance. Proceedings of SIGIR, 405-411, 1990.</ref>
* [[Fault tolerance]] - how important it is for the service to be reliable, how to deal with index corruption, whether bad data can be treated in isolation, dealing with bad [[hardware]], [[partition (database)|partitioning schemes]] such as [[hash function|hash-based]] or composite partitioning<ref>[http://dev.mysql.com/doc/refman/5.1/en/partitioning-linear-hash.html Linear Hash Partitioning]. MySQL 5.1 Reference Manual. Verified Dec 2006</ref>, [[data replication]]
;Fault tolerance: How important it is for the service to be reliable. Issues include dealing with index corruption, determining whether bad data can be treated in isolation, dealing with bad hardware, [[partition (database)|partitioning]], and schemes such as [[hash function|hash-based]] or composite partitioning,<ref>[http://dev.mysql.com/doc/refman/5.1/en/partitioning-linear-hash.html Linear Hash Partitioning]. MySQL 5.1 Reference Manual. Verified Dec 2006</ref> as well as [[Replication (computer science)|replication]].

===Index Data Structures===


===Index data structures===
Search engine architectures vary in how indexing is performed and in index storage to meet the various design factors. Types of indices include:
Search engine architectures vary in the way indexing is performed and in methods of index storage to meet the various design factors.


* [[Suffix tree]]s - figuratively structured like a tree, supports linear time lookup. Built by storing the suffices of words. Used for searching for patterns in [[DNA]] sequences and clustering. A major drawback is that the storage of a word in the tree may require more storage than storing the word itself.<ref name="Gus97">{{cite book
;[[Suffix tree]]: Figuratively structured like a tree, supports linear time lookup. Built by storing the suffixes of words. The suffix tree is a type of [[trie]]. Tries support [[extendible hashing]], which is important for search engine indexing.<ref>[https://xlinux.nist.gov/dads/HTML/trie.html trie], [https://www.nist.gov/dads Dictionary of Algorithms and Data Structures], [http://www.nist.gov U.S. National Institute of Standards and Technology].</ref> Used for searching for patterns in [[DNA]] sequences and clustering. A major drawback is that storing a word in the tree may require space beyond that required to store the word itself.<ref name="Gus97">{{cite book
| last = Gusfield
| last = Gusfield
| first = Dan
| first = Dan
| origyear = 1997
| orig-year = 1997
| year = 1999
| year = 1999
| title = Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology
| title = Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology
| publisher = Cambridge University Press
| publisher = Cambridge University Press
| location = USA
| location = US
| id = ISBN 0-521-58519-8}}.
| isbn = 0-521-58519-8}}.
</ref> An alternate representation is a [[suffix array]], which is considered to require less [[virtual memory|memory]] and supports [[data compression|compression]] like [[BWT]].
</ref> An alternate representation is a [[suffix array]], which is considered to require less virtual memory and supports data compression such as the [[Burrows–Wheeler transform|BWT]] algorithm.
* [[Trie]]s - an [[Ordered tree data structure|ordered tree]] [[data structure]] that is used to store an [[associative array]] where the keys are [[String (computer science)|strings]]. Regarded as faster than a [[hash table]], but are less [[computer storage|space]] efficient. The suffix tree is a type of trie. Tries support [[extendible hashing]], which is important for search engine indexing.<ref>[http://www.nist.gov/dads/HTML/trie.html trie], [http://www.nist.gov/dads Dictionary of Algorithms and Data Structures], [http://www.nist.gov U.S. National Institute of Standards and Technology].</ref>
* [[Inverted index|Inverted indices]] - stores a list of occurrences of each atomic search criterion<ref>Black, Paul E., [http://www.nist.gov/dads/HTML/invertedIndex.html inverted index], [http://www.nist.gov/dads Dictionary of Algorithms and Data Structures], [http://www.nist.gov U.S. National Institute of Standards and Technology] Oct 2006. Verified Dec 2006.</ref>, typically in the form of a [[hash table]] or [[binary tree]]<ref>C. C. Foster, Information retrieval: information storage and retrieval using AVL trees, Proceedings of the 1965 20th national conference, p.192-205, August 24-26, 1965, Cleveland, Ohio, United States </ref><ref>Landauer, W. I.: The balanced tree and its utilization in information retrieval. IEEE Trans. on Electronic Computers, Vol. EC-12, No. 6, December 1963. </ref>.


;[[Inverted index]]: Stores a list of occurrences of each atomic search criterion,<ref>Black, Paul E., [https://xlinux.nist.gov/dads/HTML/invertedIndex.html inverted index], [https://www.nist.gov/dads Dictionary of Algorithms and Data Structures], [http://www.nist.gov U.S. National Institute of Standards and Technology] Oct 2006. Verified Dec 2006.</ref> typically in the form of a [[hash table]] or [[binary tree]].<ref>C. C. Foster, Information retrieval: information storage and retrieval using AVL trees, Proceedings of the 1965 20th national conference, p.192-205, August 24–26, 1965, Cleveland, Ohio, United States</ref><ref>Landauer, W. I.: The balanced tree and its utilization in information retrieval. IEEE Trans. on Electronic Computers, Vol. EC-12, No. 6, December 1963.</ref>
* [[Citation index|Citation indices]] - stores the existence of citations or hyperlinks between documents to support [[citation analysis]], a subject of [[Bibliometrics]].
* [[Ngram|Ngram indices]] - for storing sequences of n length of data to support other types of retrieval or [[text mining]].<ref>[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13 Google Ngram Datasets] for sale at [http://www.ldc.upenn.edu/ LDC] Catalog</ref>
* [[Term-document matrix|Term document matrices]] - used in [[latent semantic analysis]], stores the occurrences of words in documents in a two dimensional [[sparse matrix]].


;[[Citation index]]: Stores citations or hyperlinks between documents to support citation analysis, a subject of [[bibliometrics]].
===Challenges in Parallelism===
;[[N-gram|''n''-gram index]]: Stores sequences of length of data to support other types of retrieval or [[text mining]].<ref>[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13 Google Ngram Datasets] {{Webarchive|url=https://web.archive.org/web/20130929081544/http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13 |date=2013-09-29 }} for sale at [http://www.ldc.upenn.edu/ LDC] Catalog</ref>
A major challenge in the design of search engines is the management of [[parallel computing|parallel processes]]. There are many opportunities for race conditions and coherence faults. For example, a new document is added to the corpus and the index must be updated, but the index simultaneously needs to continue responding to search queries. This is a collision between two competiting tasks. Consider that authors are producers of information, and a [[web crawler|crawler]] is the consumer of this information, grabbing the text and storing it in a [[cache]] (or [[corpus]]). The forward index is the consumer of the information produced by the corpus, and the inverted index is the consumer of information produced by the forward index. This is commonly referred to as a '''producer-consumer model'''. The indexer is the producer of searchable information and users are the consumers that need to search. The challenge is magnified when working with distributed storage and distributed processing. In an effort to scale with larger amounts of indexed information, the search engine's architecture may involve [[distributed computing]], where the search engine consists of several machines operating in unison. This increases the possibilities for incoherency and makes it more difficult to maintain a fully-synchronized, distributed, parallel architecture.<ref>Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Google, Inc. OSDI. 2004.</ref>
;[[Document-term matrix]]: Used in latent semantic analysis, stores the occurrences of words in documents in a two-dimensional [[sparse matrix]].

===Challenges in parallelism===
A major challenge in the design of search engines is the management of serial computing processes. There are many opportunities for [[race conditions]] and coherent faults. For example, a new document is added to the corpus and the index must be updated, but the index simultaneously needs to continue responding to search queries. This is a collision between two competing tasks. Consider that authors are producers of information, and a [[web crawler]] is the consumer of this information, grabbing the text and storing it in a cache (or [[Text corpus|corpus]]). The forward index is the consumer of the information produced by the corpus, and the inverted index is the consumer of information produced by the forward index. This is commonly referred to as a '''producer-consumer model'''. The indexer is the producer of searchable information and users are the consumers that need to search. The challenge is magnified when working with distributed storage and distributed processing. In an effort to scale with larger amounts of indexed information, the search engine's architecture may involve [[distributed computing]], where the search engine consists of several machines operating in unison. This increases the possibilities for incoherency and makes it more difficult to maintain a fully synchronized, distributed, parallel architecture.<ref>Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Google, Inc. OSDI. 2004.</ref>


===Inverted indices===
===Inverted indices===
{{Main|Inverted index}}
Many search engines incorporate an [[inverted index]] when evaluating a [[search query]] to quickly locate the documents which contain the words in a query and rank these documents by relevance. The inverted index stores a list of the documents for each word. The search engine can retrieve the matching documents quickly using direct [[random access|access]] to find the documents for a word. The following is a simplified illustration of the inverted index:


Many search engines incorporate an [[inverted index]] when evaluating a [[Web search query|search query]] to quickly locate documents containing the words in a query and then rank these documents by relevance. Because the inverted index stores a list of the documents containing each word, the search engine can use direct [[random access|access]] to find the documents associated with each word in the query in order to retrieve the matching documents quickly. The following is a simplified illustration of an inverted index:
{| align="center" border="1"
|+ Inverted Index
|-
| Word || Documents
|-
| the || Document 1, Document 3, Document 4, Document 5
|-
| cow || Document 2, Document 3, Document 4
|-
| says || Document 5
|-
| moo || Document 7
|}


{| align="center" class="wikitable"
The above figure is a simplified form of a [[Boolean]] index. Such an index would only serve to determine whether a document matches a query, but would not contribute to ranking matched documents. In some designs the index includes additional information such as the frequency of each word in each document or the positions of the word in each document.<ref>Grossman, Frieder, Goharian. [http://www.eng.auburn.edu/~gilbert/Comp7120/Concept-50/IR-Basics-of-Inverted-Index.pdf IR Basics of Inverted Index]. 2002. Verified Dec 2006.</ref> With position, the search algorithm can identify word proximity to support searching for phrases. Frequency can be used to help in ranking the relevance of documents to the query. Such topics are the central research focus of [[information retrieval]].
|+Inverted index
|-
! Word !! Documents
|-
| the || Document 1, Document 3, Document 4, Document 5, Document 7
|-
| cow || Document 2, Document 3, Document 4
|-
| says || Document 5
|-
| moo || Document 7
|}


This index can only determine whether a word exists within a particular document, since it stores no information regarding the frequency and position of the word; it is therefore considered to be a [[Boolean data type|Boolean]] index. Such an index determines which documents match a query but does not rank matched documents. In some designs the index includes additional information such as the frequency of each word in each document or the positions of a word in each document.<ref>Grossman, Frieder, Goharian. [http://www.cs.clemson.edu/~juan/CPSC862/Concept-50/IR-Basics-of-Inverted-Index.pdf IR Basics of Inverted Index]. 2002. Verified Aug 2011.</ref> Position information enables the search algorithm to identify word proximity to support searching for phrases; frequency can be used to help in ranking the relevance of documents to the query. Such topics are the central research focus of [[information retrieval]].
The inverted index is a [[sparse matrix]] given that words are not present in each document. It is stored differently than a two dimensional [[array]] to reduce [[computer storage|memory]] requirements. The index is similar to the [[document-term matrix|term document matrices]] employed by [[latent semantic analysis]]. The inverted index can be considered a form of a [[hash table]]. In some cases the index is a form of a [[binary tree]], which requires additional storage but may reduce the lookup time. In larger indices the architecture is typically [[distributed hash table|distributed]].<ref>Tang, Hunqiang. Dwarkadas, Sandhya. "Hybrid Global Local Indexing for Efficient
Peer to Peer Information Retrieval". University of Rochester. Pg 1. http://www.cs.rochester.edu/u/sarrmor/publications/eSearch-NSDI.ps</ref>


The inverted index is a [[sparse matrix]], since not all words are present in each document. To reduce [[Computer Storage|computer storage]] memory requirements, it is stored differently from a two dimensional [[Array data structure|array]]. The index is similar to the [[document-term matrix|term document matrices]] employed by [[latent semantic analysis]]. The inverted index can be considered a form of a hash table. In some cases the index is a form of a [[binary tree]], which requires additional storage but may reduce the lookup time. In larger indices the architecture is typically a [[distributed hash table]].<ref>Tang, Hunqiang. [[Sandhya Dwarkadas|Dwarkadas, Sandhya]]. "Hybrid Global Local Indexing for Efficient
Inverted indices can be programmed in several computer programming languages.<ref>[http://www.kimbly.com/code/invidx/haskell/InvIdx.lhs] - inverted index written in [[Haskell]]</ref><ref>[http://www.kimbly.com/code/invidx/lisp/invidx.cl] - inverted index written in [[Lisp]]</ref>
Peer to Peer Information Retrieval". University of Rochester. Pg 1. http://www.cs.rochester.edu/u/sandhya/papers/nsdi04.ps</ref>


==== Implementation of Phrase Search Using an Inverted Index ====
===Index Merging===
For phrase searching, a specialized form of an inverted index called a positional index is used. A positional index not only stores the ID of the document containing the token but also the exact position(s) of the token within the document in the [[postings list]]. The occurrences of the phrase specified in the query are retrieved by navigating these postings list and identifying the indexes at which the desired terms occur in the expected order (the same as the order in the phrase). So if we are searching for occurrence of the phrase "First Witch", we would:
The inverted index is filled via a [[merge]] or rebuild. A rebuild is similar to a merge but first deletes the contents of the inverted index. The architecture may be designed to support incremental indexing<ref>Tomasic, A., et al: Incremental Updates of Inverted Lists for Text Document Retrieval. Short Version of Stanford University Computer Science Technical Note STAN-CS-TN-93-1, December, 1993.</ref>, where a merge involves identifying the document or documents to add into or update in the index and [[parsing]] each document into words. For technical accuracy, a merge involves the unison of newly indexed documents, typically residing in [[virtual memory]], with the index [[cache]] residing on one or more computer hard drives.


# Retrieve the postings list for "first" and "witch"
After parsing, the indexer adds the containing document to the document list for the appropriate words. The process of finding each word in the inverted index in order to denote that it occurred within a document may be too time consuming when designing a larger search engine, and so this process is commonly split up into the development of a [[forward index]] and the process of [[sorting]] the contents of the forward index for entry into the inverted index. The inverted index is named inverted because it is an inversion of the forward index.
# Identify the first time that "witch" occurs after "first"
# Check that this occurrence is immediately after the occurrence of "first".
# If not, continue to the next occurrence of "first".


The postings lists can be navigated using a binary search in order to minimize the time complexity of this procedure.<ref>{{Cite book |last=Büttcher |first=Stefan |title=Information retrieval: implementing and evaluating search engines |last2=Clarke |first2=Charles L. A. |last3=Cormack |first3=Gordon V. |date=2016 |publisher=The MIT Press |isbn=978-0-262-52887-0 |edition=First MIT Press paperback |location=Cambridge, Massachusetts London, England}}</ref>
===The Forward Index===
The [[forward index]] stores a list of words for each document. The following is a simplified form of the forward index:


===Index merging===
{| align="center" border="1"
The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge but first deletes the contents of the inverted index. The architecture may be designed to support incremental indexing,<ref>Tomasic, A., et al.: Incremental Updates of Inverted Lists for Text Document Retrieval. Short Version of Stanford University Computer Science Technical Note STAN-CS-TN-93-1, December, 1993.</ref> where a merge identifies the document or documents to be added or updated and then parses each document into words. For technical accuracy, a merge conflates newly indexed documents, typically residing in virtual memory, with the index cache residing on one or more computer hard drives.
|+ Forward Index
|-
| Document || Words
|-
| Document 1 || the,cow,says,moo
|-
| Document 2 || the,cat,and,the,hat
|-
| Document 3 || the,dish,ran,away,with,the,spoon
|}


After parsing, the indexer adds the referenced document to the document list for the appropriate words. In a larger search engine, the process of finding each word in the inverted index (in order to report that it occurred within a document) may be too time consuming, and so this process is commonly split up into two parts, the development of a forward index and a process which sorts the contents of the forward index into the inverted index. The inverted index is so named because it is an inversion of the forward index.
The rationale behind developing a forward index is that as documents are [[parsing|parsed]], it is better to immediately store the words per document. The delineation enables [[Asynchronous system|asynchronous]] processing, which partially [[circumvention|circumvents]] the inverted index update [[bottleneck]].<ref>Sergey Brin and Lawrence Page. [http://infolab.stanford.edu/~backrub/google.html The Anatomy of a Large-Scale Hypertextual Web Search Engine]. [[Stanford University]]. 1998. Verified Dec 2006.</ref> The forward index is [[Sorting algorithm|sorting]] to transform it to an inverted index. The forward index is essentially a list of pairs consisting of a document and a word, [[collation|collated]] by the document. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted index is a word-sorted forward index.

===The forward index===
The forward index stores a list of words for each document. The following is a simplified form of the forward index:

{| align="center" class="wikitable"
|+ Forward Index
|-
! Document !! Words
|-
| Document 1 || the,cow,says,moo
|-
| Document 2 || the,cat,and,the,hat
|-
| Document 3 || the,dish,ran,away,with,the,spoon
|}

The rationale behind developing a forward index is that as documents are parsed, it is better to intermediately store the words per document. The delineation enables asynchronous system processing, which partially circumvents the inverted index update [[wikt:bottleneck|bottleneck]].<ref>Sergey Brin and Lawrence Page. [http://infolab.stanford.edu/~backrub/google.html The Anatomy of a Large-Scale Hypertextual Web Search Engine]. [[Stanford University]]. 1998. Verified Dec 2006.</ref> The forward index is [[Sorting algorithm|sorted]] to transform it to an inverted index. The forward index is essentially a list of pairs consisting of a document and a word, collated by the document. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted index is a word-sorted forward index.


===Compression===
===Compression===
Generating or maintaining a large-scale search engine index represents a significant storage and processing challenge. Many search engines utilize a form of [[compression]] to reduce the size of the indices on [[computer storage|disk]].<ref>H.S. Heaps. Storage analysis of a compression coding for a document database. 1NFOR, I0(i):47-61, February 1972.</ref> Consider the following scenario for a full text, Internet, search engine.
Generating or maintaining a large-scale search engine index represents a significant storage and processing challenge. Many search engines utilize a form of [[data compression|compression]] to reduce the size of the indices on [[computer storage|disk]].<ref>H.S. Heaps. Storage analysis of a compression coding for a document database. 1NFOR, I0(i):47-61, February 1972.</ref> Consider the following scenario for a full text, Internet search engine.


* It takes 8 bits (or 1 [[byte]]) to store a single character. Some [[character encoding|encodings]] use 2 bytes per character<ref>[https://www.unicode.org/faq/basic_q.html#15 The Unicode Standard - Frequently Asked Questions]. Verified Dec 2006.</ref><ref>[https://web.archive.org/web/20010209140313/http://www.uplink.freeuk.com/data.html Storage estimates]. Verified Dec 2006.</ref>
* An estimated 2,000,000,000 different web pages exist as of the year 2000<ref>Murray, Brian H. [http://www.cyveillance.com/web/downloads/Sizing_the_Internet.pdf Sizing the Internet]. Cyveillance, Inc. Pg 2. July 2000. Verified Dec 2006.</ref>
* The average number of characters in any given word on a page may be estimated at 5 ([[Wikipedia:Size comparisons]])
* A fictitious estimate of 250 words per webpage on average, based on the assumption of being similar to the pages of a novel.<ref>Blair Bancroft. [http://www.blairbancroft.com/word_count.htm Word Count:A Highly Personal-and Probably Controversial-Essay on Counting Words]. Personal Website. Verified Dec 2006.</ref>
* It takes 8 bits (or 1 [[byte]]) to store a single character. Some [[character encoding|encodings]] use 2 bytes per character<ref>[http://www.unicode.org/faq/basic_q.html#15 The Unicode Standard - Frequently Asked Questions]. Verified Dec 2006.</ref><ref>[http://www.uplink.freeuk.com/data.html Storage estimates]. Verified Dec 2006.</ref>
* The average number of characters in any given word on a page can be estimated at 5 ([[Wikipedia:Size comparisons]])
* The average [[personal computer]] comes with about 20 [[gigabyte]]s of usable space<ref>[http://www.pctechguide.com/31HardDisk.htm PC Technology Guide: Guides/Storage/Hard Disks]. PC Technology Guide. 2003. Verified Dec 2006.</ref>


Given these estimates, generating a uncompressed index (assuming a non-[[conflation|conflated]], simple, index) for 2 billion web pages would need to store 5 billion word entries. At 1 byte per character, or 5 bytes per word, this would require 25 gigabytes of storage space alone, more than the average size a personal computer's free disk space. This space is further increased in the case of a distributed storage architecture that is fault-tolerant. Using compression, the index size can be reduced to a portion of its size, depending on which compression techniques are chosen. The trade off is the time and processing power required to perform compression.
Given this scenario, an uncompressed index (assuming a non-[[conflation|conflated]], simple, index) for 2 billion web pages would need to store 500 billion word entries. At 1 byte per character, or 5 bytes per word, this would require 2500 gigabytes of storage space alone.{{cn|date=December 2023}} This space requirement may be even larger for a fault-tolerant distributed storage architecture. Depending on the compression technique chosen, the index can be reduced to a fraction of this size. The tradeoff is the time and processing power required to perform compression and decompression.{{cn|date=December 2023}}


Notably, large scale search engine designs incorporate the cost of storage, and the costs of electricity to power the storage. Compression, in this regard, is a measure of cost as well.
Notably, large scale search engine designs incorporate the cost of storage as well as the costs of electricity to power the storage. Thus compression is a measure of cost.{{cn|date=December 2023}}


==Document Parsing==
==Document parsing==
Document parsing involves breaking apart the components (words) of a document or other form of media for insertion into the forward and inverted indices. For example, if the full contents of a document consisted of the sentence "Hello World", there would typically be two words found, the token "Hello" and the token "World". In the context of search engine indexing and [[natural language processing]], parsing is more commonly referred to as [[tokenization]], and sometimes [[word boundary disambiguation]], [[tagging]], [[Text segmentation]], [[Content analysis]], text analysis, [[Text mining]], [[Concordance]] generation, [[Speech segmentation]], [[Lexer|lexing]], or [[lexical analysis]]. The terms 'indexing', 'parsing', and 'tokenization' are used interchangeably in corporate slang.
Document parsing breaks apart the components (words) of a document or other form of media for insertion into the forward and inverted indices. The words found are called ''tokens'', and so, in the context of search engine indexing and [[natural language processing]], parsing is more commonly referred to as [[Tokenization (lexical analysis)|tokenization]]. It is also sometimes called [[word boundary disambiguation]], [[Part-of-speech tagging|tagging]], [[text segmentation]], [[content analysis]], text analysis, [[text mining]], [[Concordance (publishing)|concordance]] generation, [[speech segmentation]], [[Lexical analysis|lexing]], or [[lexical analysis]]. The terms 'indexing', 'parsing', and 'tokenization' are used interchangeably in corporate slang.


Natural language processing, as of 2006, is the subject of continuous research and technological improvement. There are a host of challenges in tokenization, in extracting the necessary information from documents for indexing to support quality searching. Tokenization for indexing involves multiple technologies, the implementation of which are commonly kept as corporate secrets.
Natural language processing is the subject of continuous research and technological improvement. Tokenization presents many challenges in extracting the necessary information from documents for indexing to support quality searching. Tokenization for indexing involves multiple technologies, the implementation of which are commonly kept as corporate secrets.{{citation needed|date=August 2015}}


=== Challenges in Natural Language Processing ===
===Challenges in natural language processing===
Word Boundary Ambiguity - native [[English language|English]] speakers can at first consider tokenization to be a straightfoward task, but this is not the case with designing a [[multilingual]] indexer. In digital form, the text of other languages such as [[Chinese language|Chinese]], [[Japanese language|Japanese]] or [[Arabic language|Arabic]] represent a greater challenge as words are not clearly delineated by [[whitespace]]. The goal during tokenization is to identify words for which users will search. Language specific logic is employed to properly identify the boundaries of words, which is often the rationale for designing a parser for each language supported (or for groups of languages with similar boundary markers and syntax).
; Word boundary ambiguity: Native [[English language|English]] speakers may at first consider tokenization to be a straightforward task, but this is not the case with designing a [[multilingual]] indexer. In digital form, the texts of other languages such as [[Chinese language|Chinese]] or [[Japanese language|Japanese]] represent a greater challenge, as words are not clearly delineated by [[Whitespace (computer science)|whitespace]]. The goal during tokenization is to identify words for which users will search. Language-specific logic is employed to properly identify the boundaries of words, which is often the rationale for designing a parser for each language supported (or for groups of languages with similar boundary markers and syntax).


Language Ambiguity - to assist with properly ranking matching documents, many search engines collect additional information about words, such as its [[language]] or [[lexical category]] ([[part of speech]]). These techniques are language-dependent as the syntax varies among languages. Documents do not always clearly identify the language of the document or represent it accurately. In tokenizing the document, some search engines attempt to automatically identify the language of the document.
; Language ambiguity: To assist with properly ranking matching documents, many search engines collect additional information about each word, such as its [[language]] or [[lexical category]] ([[part of speech]]). These techniques are language-dependent, as the syntax varies among languages. Documents do not always clearly identify the language of the document or represent it accurately. In tokenizing the document, some search engines attempt to automatically identify the language of the document.


Diverse File Formats - in order to correctly identify what bytes of a document represent characters, the file format must be correctly handled. Search engines which support multiple file formats must be able to correctly open and access the document and be able to tokenize the characters of the document.
; Diverse file formats: In order to correctly identify which bytes of a document represent characters, the file format must be correctly handled. Search engines that support multiple file formats must be able to correctly open and access the document and be able to tokenize the characters of the document.


Faulty Storage - the quality of the natural language data is not always assumed to be perfect. An unspecified amount of documents, particular on the Internet, do not always closely obey proper file protocol. [[Binary]] characters may be mistakenly encoded into various parts of a document. Without recognition of these characters and appropriate handling, the index quality or indexer [[Optimization|performance]] could degrade.
; Faulty storage: The quality of the natural language data may not always be perfect. An unspecified number of documents, particularly on the Internet, do not closely obey proper file protocol. [[Binary data|Binary]] characters may be mistakenly encoded into various parts of a document. Without recognition of these characters and appropriate handling, the index quality or indexer performance could degrade.


=== Tokenization ===
===Tokenization===
Unlike [[literacy|literate]] human adults, computers are not inherently aware of the structure of a natural language document and do not instantly recognize words and sentences. To a computer, a document is only a big sequence of bytes. Computers do not know that a space character between two sequences of characters means that there are two separate words in the document. Instead, a computer program is developed by humans which trains the computer, or instructs the computer, how to identify what constitutes an individual or distinct word, referred to as a token. This program is commonly referred to as a [[tokenizer]] or [[parser]] or [[lexer]]. Many search engines, as well as other natural language processing software, incorporate [[List of compiler-compilers|specialized programs]] for parsing, such as [[YACC]] OR [[Lex programming tool|Lex]].
Unlike [[literacy|literate]] humans, computers do not understand the structure of a natural language document and cannot automatically recognize words and sentences. To a computer, a document is only a sequence of bytes. Computers do not 'know' that a space character separates words in a document. Instead, humans must program the computer to identify what constitutes an individual or distinct word referred to as a token. Such a program is commonly called a [[tokenizer]] or [[parser]] or [[Lexical analysis|lexer]]. Many search engines, as well as other natural language processing software, incorporate [[Comparison of parser generators|specialized programs]] for parsing, such as [[YACC]] or [[Lex programming tool|Lex]].


During tokenization, the parser identifies sequences of characters, which typically represent words. Commonly recognized tokens include punctuation, sequences of numerical characters, alphabetical characters, [[Alphanumeric|alphanumerical]] characters, binary characters (backspace, null, print, and other antiquated print commands), whitespace (space, tab, carriage return, line feed), and [[Entity extraction|entities]] such as [[email]] addresses, phone numbers, and [[URL]]s. When identifying each token, several characteristics may be stored such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, like 'noun' or 'verb'), position, sentence number, sentence position, length, and line number.
During tokenization, the parser identifies sequences of characters that represent words and other elements, such as punctuation, which are represented by numeric codes, some of which are non-printing control characters. The parser can also identify [[Entity extraction|entities]] such as [[email]] addresses, phone numbers, and [[Uniform Resource Locator|URL]]s. When identifying each token, several characteristics may be stored, such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, like 'noun' or 'verb'), position, sentence number, sentence position, length, and line number.


=== Language Recognition ===
===Language recognition===
If the search engine supports multiple languages, a common initial step during tokenization is to identify each document's language, given that many of the later steps are language dependent (such as [[stemming]] and [[part of speech]] tagging). [[Language recognition]] is the process by which a computer program attempts to automatically identify, or categorize, the [[language]] of a document. Other names for language recognition include language classification, language analysis, language identification, and language tagging. Automated language recognition is the subject of ongoing research in [[natural language processing]]. Finding which words the language belongs to may involve the use of a [[language recognition chart]].
If the search engine supports multiple languages, a common initial step during tokenization is to identify each document's language; many of the subsequent steps are language dependent (such as [[stemming]] and [[part of speech]] tagging). [[Language identification|Language recognition]] is the process by which a computer program attempts to automatically identify, or categorize, the [[language]] of a document. Other names for language recognition include language classification, language analysis, language identification, and language tagging. Automated language recognition is the subject of ongoing research in [[natural language processing]]. Finding which language the words belongs to may involve the use of a [[Wikipedia:Language recognition chart|language recognition chart]].


=== Format Analysis ===
===Format analysis===
Depending on whether the search engine supports multiple [[File format|document formats]], documents must be prepared for tokenization. The challenge is that many document formats contain, in addition to textual content, formatting information. For example, [[HTML]] documents contain HTML tags, which specify formatting information, like whether to start a new line, or display a word in '''bold''', or change the [[font]] size or [[Font family|family]]. If the search engine were to ignore the difference between content and '''markup''', the segments would also be included in the index, leading to poor search results. Format analysis involves the identification and handling of formatting content embedded within documents which control how the document is rendered on a computer screen or interpreted by a software program. Format analysis is also referred to as structure analysis, format parsing, tag stripping, format stripping, text normalization, text cleaning, or text preparation. The challenge of format analysis is further complicated by the intricacies of various file formats. Certain file formats are proprietary and very little information is disclosed, while others are well documented. Common, well-documented file formats that many search engines support include:
If the search engine supports multiple [[File format|document formats]], documents must be prepared for tokenization. The challenge is that many document formats contain formatting information in addition to textual content. For example, [[HTML]] documents contain HTML tags, which specify formatting information such as new line starts, bold emphasis, and [[font]] size or [[Font family|style]]. If the search engine were to ignore the difference between content and 'markup', extraneous information would be included in the index, leading to poor search results. Format analysis is the identification and handling of the formatting content embedded within documents which controls the way the document is rendered on a computer screen or interpreted by a software program. Format analysis is also referred to as structure analysis, format parsing, tag stripping, format stripping, text normalization, text cleaning and text preparation. The challenge of format analysis is further complicated by the intricacies of various file formats. Certain file formats are proprietary with very little information disclosed, while others are well documented. Common, well-documented file formats that many search engines support include:


* [[Microsoft Word]]
* [[Microsoft Excel]]
* [[Microsoft Powerpoint]]
* IBM [[Lotus Notes]]
* [[HTML]]
* [[HTML]]
* [[ASCII]] Text files (a text document without any formatting)
* [[ASCII]] text files (a text document without specific computer readable formatting)
* [[Adobe|Adobe's]] Portable Document Format ([[PDF]])
* [[Adobe Systems|Adobe]]'s Portable Document Format ([[PDF]])
* [[PostScript]] (PS)
* [[PostScript]] (PS)
* [[LaTex]]
* [[LaTeX]]
* The [[UseNet]] archive (NNTP) and other deprecated bulletin board formats
* [[UseNet]] netnews server formats
* [[XML]] and derivatives like [[RSS]]
* [[XML]] and derivatives like [[RSS]]
* [[SGML]] (this is more of a general protocol)
* [[SGML]]
* [[Multimedia]] [[meta data]] formats like [[ID3]]
* [[Multimedia]] [[meta data]] formats like [[ID3]]
* [[Microsoft Word]]
Techniques for dealing with various formats include:
* [[Microsoft Excel]]
* [[Microsoft PowerPoint]]
* IBM [[Lotus Notes]]
Options for dealing with various formats include using a publicly available commercial parsing tool that is offered by the organization which developed, maintains, or owns the format, and writing a custom [[parser]].


Some search engines support inspection of files that are stored in a [[Compressor (software)|compressed]] or encrypted file format. When working with a compressed format, the indexer first decompresses the document; this step may result in one or more files, each of which must be indexed separately. Commonly supported [[list of archive formats|compressed file format]]s include:
* Using a publicly available commercial parsing tool that is offered by the organization which developed, maintains, or owns the format
* Writing a custom [[parser]]


* [[ZIP (file format)|ZIP]] - Zip archive file
Some search engines support inspection of files that are stored in a [[compression|compressed]], or encrypted, file format. If working with a compressed format, then the indexer first decompresses the document, which may result in one or more files, each of which must be indexed separately. Commonly supported [[list of archive formats|compressed file format]]s include:
* [[RAR (file format)|RAR]] - Roshal ARchive file

* [[ZIP]] - Zip File
* [[RAR]] - Archive File
* [[Cabinet (file format)|CAB]] - [[Microsoft Windows]] Cabinet File
* [[Cabinet (file format)|CAB]] - [[Microsoft Windows]] Cabinet File
* [[Gzip]] - Gzip file
* [[Gzip]] - File compressed with gzip
* [[Bzip2|BZIP]] - Bzip file
* [[Bzip2|BZIP]] - File compressed using bzip2
* [[Tar (file format)|TAR]], [[gzip|GZ]], and TAR.GZ - [[Unix]] Gzip'ped Archives
* [[tar (computing)|Tape ARchive (TAR)]], [[Unix]] archive file, not (itself) compressed
* TAR.Z, TAR.GZ or TAR.BZ2 - [[Unix]] archive files compressed with Compress, GZIP or BZIP2


Format analysis can involve quality improvement methods to avoid including 'bad information' in the index. Content can manipulate the formatting information to include additional content. Examples of abusing document formatting for [[spamdexing]]:
Format analysis can involve quality improvement methods to avoid including 'bad information' in the index. Content can manipulate the formatting information to include additional content. Examples of abusing document formatting for [[spamdexing]]:


* Including hundreds or thousands of words in a section which is hidden from view on the computer screen, but visible to the indexer, by use of formatting (e.g. hidden [[Span and div|"div" tag]] in [[HTML]], which may incorporate the use of [[CSS]] or [[Javascript]] to do so).
* Including hundreds or thousands of words in a section that is hidden from view on the computer screen, but visible to the indexer, by use of formatting (e.g. hidden [[Span and div|"div" tag]] in [[HTML]], which may incorporate the use of [[CSS]] or [[JavaScript]] to do so).
* Setting the foreground font color of words to the same as the background color, making words hidden on the computer screen to a person viewing the document, but not hidden to the indexer.
* Setting the foreground font color of words to the same as the background color, making words hidden on the computer screen to a person viewing the document, but not hidden to the indexer.


=== Section Recognition ===
===Section recognition===
Some search engines incorporate section recognition, the identification of major parts of a document, prior to tokenization. Not all the documents in a corpus read like a well-written book, divided into organized chapters and pages. Many documents on the [[Internet|web]] contain erroneous content and side-sections which do not contain primary material, that which the document is about, such as newsletters and corporate reports. For example, this article may display a side menu with words inside links to other web pages. Some file formats, like HTML or PDF, allow for content to be displayed in columns. Even though the content is displayed, or rendered, in different areas of the view, the raw markup content may store this information sequentially. Words that appear in the raw source content sequentially are indexed sequentially, even though these sentences and paragraphs are rendered in different parts of the computer screen. If search engines index this content as if it were normal content, a dilemma ensues where the quality of the index is degraded and search quality is degraded due to the mixed content and improper word proximity. Two primary problems are noted:
Some search engines incorporate section recognition, the identification of major parts of a document, prior to tokenization. Not all the documents in a corpus read like a well-written book, divided into organized chapters and pages. Many documents on the [[Internet|web]], such as newsletters and corporate reports, contain erroneous content and side-sections that do not contain primary material (that which the document is about). For example, articles on the Wikipedia website display a side menu with links to other web pages. Some file formats, like HTML or PDF, allow for content to be displayed in columns. Even though the content is displayed, or rendered, in different areas of the view, the raw markup content may store this information sequentially. Words that appear sequentially in the raw source content are indexed sequentially, even though these sentences and paragraphs are rendered in different parts of the computer screen. If search engines index this content as if it were normal content, the quality of the index and search quality may be degraded due to the mixed content and improper word proximity. Two primary problems are noted:


* Content in different sections is treated as related in the index, when in reality it is not
* Content in different sections is treated as related in the index when in reality it is not
* Organizational 'side bar' content is included in the index, but the side bar content does not contribute to the meaning of the document, and the index is filled with a poor representation of its documents, assuming the goal is to go after the meaning of each document, a sub-goal of providing quality search results.
* Organizational ''side bar'' content is included in the index, but the side bar content does not contribute to the meaning of the document, and the index is filled with a poor representation of its documents.


Section analysis may require the search engine to implement the rendering logic of each document, essentially an abstract representation of the actual document, and then index the representation instead. For example, some content on the Internet is rendered via Javascript. Viewers of web pages in web browsers see this content. If the search engine does not render the page and evaluate the javascript within the page, it would not 'see' this content in the same way, and index the document incorrectly. Given that some search engines do not bother with rendering issues, many web page designers avoid displaying content via javascript or use the [[Noscript tag]] to ensure that the web page is indexed properly. At the same time, this fact is also [[spamdexing|exploited]] to cause the search engine indexer to 'see' different content than the viewer.
Section analysis may require the search engine to implement the rendering logic of each document, essentially an abstract representation of the actual document, and then index the representation instead. For example, some content on the Internet is rendered via JavaScript. If the search engine does not render the page and evaluate the JavaScript within the page, it would not 'see' this content in the same way and would index the document incorrectly. Given that some search engines do not bother with rendering issues, many web page designers avoid displaying content via JavaScript or use the [https://worldwidenews.ru/2020/05/27/noscript-tag/ Noscript] tag to ensure that the web page is indexed properly. At the same time, this fact can also be [[spamdexing|exploited]] to cause the search engine indexer to 'see' different content than the viewer.


=== Meta Tag Indexing ===
===HTML priority system===
{{Original research section|date=November 2013}}
Specific documents offer embedded meta information such as the author, keywords, description, and language. For HTML pages, the [[meta tag]] contains keywords which are also included in the index. During earlier growth periods in the Internet and search engine technology (more so, the hardware on which it runs) would only index the keywords in the meta tags for the forward index (and still applying techniques such as [[stemming]] and stop words). The full document would not be parsed. At this time, full-text indexing was not as well established, nor was the [[hardware]] to support such technology. The design of the HTML markup language initially included support for meta tags for this very purpose of being properly and easily indexed, without requiring tokenization.<ref>Berners-Lee, T., "Hypertext Markup Language - 2.0", RFC 1866, Network Working Group, November 1995.</ref>
Indexing often has to recognize the [[HTML]] tags to organize priority. Indexing low priority to high margin to labels like ''strong'' and ''link'' to optimize the order of priority if those labels are at the beginning of the text could not prove to be relevant. Some indexers like [[Google]] and [[Bing (search engine)|Bing]] ensure that the [[search engine]] does not take the large texts as relevant source due to strong type system compatibility.<ref>Google Webmaster Tools, "Hypertext Markup Language 5", Conference for SEO January 2012.</ref>


===Meta tag indexing===
As the Internet grew (the number of users capable of browsing the web and the number of websites increased and the technology for making websites and hosting websites improved), many brick-and-mortar corporations went 'online' in the mid 1990s and established corporate websites. The keywords used to describe webpages (many of which were corporate-oriented webpages similar to product brochures) changed from descriptive keywords to marketing-oriented keywords designed to drive sales by placing the webpage high in the search results for specific search queries. The fact that these keywords were subjectively-specified was leading to [[spamdexing]], which drove many search engines to adopt full-text indexing technologies in the 1990s. Search engine designers and companies could only place so many 'marketing keywords' into the content of a webpage before draining it of all interesting and useful information. Given that conflict of interest with the business goal of designing user-oriented websites which were 'sticky', the customer lifetime value equation was changed to incorporate more useful content into the website in hopes of retaining the visitor. In this sense, full-text indexing was more objective and increased the quality of search engine results, as it was one more step away from subjective control of search engine result placement, which in turn furthered research of full-text indexing technologies.
Meta tag indexing plays an important role in organizing and categorizing web content. Specific documents often contain embedded meta information such as author, keywords, description, and language. For HTML pages, the [[meta tag]] contains keywords which are also included in the index. Earlier Internet [[search engine technology]] would only index the keywords in the meta tags for the forward index; the full document would not be parsed. At that time full-text indexing was not as well established, nor was [[computer hardware]] able to support such technology. The design of the HTML markup language initially included support for meta tags for the very purpose of being properly and easily indexed, without requiring tokenization.<ref>Berners-Lee, T., "Hypertext Markup Language - 2.0", RFC 1866, Network Working Group, November 1995.</ref>


As the Internet grew through the 1990s, many [[brick and mortar business|brick-and-mortar corporations]] went 'online' and established corporate websites. The keywords used to describe webpages (many of which were corporate-oriented webpages similar to product brochures) changed from descriptive to marketing-oriented keywords designed to drive sales by placing the webpage high in the search results for specific search queries. The fact that these keywords were subjectively specified was leading to [[spamdexing]], which drove many search engines to adopt full-text indexing technologies in the 1990s. Search engine designers and companies could only place so many 'marketing keywords' into the content of a webpage before draining it of all interesting and useful information. Given that conflict of interest with the business goal of designing user-oriented websites which were 'sticky', the [[customer lifetime value]] equation was changed to incorporate more useful content into the website in hopes of retaining the visitor. In this sense, full-text indexing was more objective and increased the quality of search engine results, as it was one more step away from subjective control of search engine result placement, which in turn furthered research of full-text indexing technologies.
In the context of [[Desktop search]], many solutions incorporate meta tags to provide a way for authors to further customize how the search engine will index content from various files that is not evident from the file content. Desktop search is under the control of the user and the changes in that context only serve to help, unlike Internet search engines which must focus more on the full text index.


In [[desktop search]], many solutions incorporate meta tags to provide a way for authors to further customize how the search engine will index content from various files that is not evident from the file content. Desktop search is more under the control of the user, while Internet search engines must focus more on the full text index.
== See also ==

* [[Information retrieval]]
==See also==
* [[Document retrieval|Document Retrieval]]
* [[Information extraction]]
* [[Search engine]]
* [[Vertical search]]
* [[Desktop search]]
**[[Windows indexing service]]<ref>Krishna Nareddy. [http://msdn2.microsoft.com/en-us/library/ms951558.aspx Indexing with Microsoft Index Server]. MSDN Library. Microsoft Corporation. January 30, 1998. Verified Dec 2006. Note that this is a commercial, external link.</ref>
* [[List of search engines]]
* [[Text retrieval|Text Retrieval]]
* [[Latent semantic analysis|Latent Semantic Indexing]]
* [[KWIC|Keyword In Context Indexing]]
* [[Concordance]]
* [[Controlled vocabulary]]
* [[Controlled vocabulary]]
* [[Natural language processing]]
* [[Database index]]
* [[Text mining]]
* [[Full-text search]]
* [[Content analysis]]
* [[Information extraction]]
* [[Key Word in Context]]
* [[Selection-based search]]
* [[Site map]]
* [[Text retrieval]]
* [[Information literacy]]

==References==
{{Reflist}}


==Further reading==
==Further reading==
*R. Bayer and E. McCreight. Organization and maintenance of large ordered indices. Acta Informatica, 173-189, 1972.
*R. Bayer and E. McCreight. Organization and maintenance of large ordered indices. Acta Informatica, 173-189, 1972.
*[[Donald E. Knuth]]. The art of computer programming, volume 1 (3rd ed.): fundamental algorithms, Addison Wesley Longman Publishing Co. Redwood City, CA, 1997.
*[[Donald E. Knuth]]. [[The Art of Computer Programming]], volume 1 (3rd ed.): fundamental algorithms, Addison Wesley Longman Publishing Co. Redwood City, CA, 1997.
*[[Donald E. Knuth]]. The art of computer programming, volume 3: (2nd ed.) sorting and searching, Addison Wesley Longman Publishing Co. Redwood City, CA, 1998.
*[[Donald E. Knuth]]. The art of computer programming, volume 3: (2nd ed.) sorting and searching, Addison Wesley Longman Publishing Co. Redwood City, CA, 1998.
*[[Gerald Salton]]. Automatic text processing, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1988.
*[[Gerald Salton]]. Automatic text processing, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1988.
Line 203: Line 209:
*G. K. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.
*G. K. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.
*Adelson-Velskii, G.M., Landis, E. M.: An information organization algorithm. DANSSSR, 146, 263-266 (1962).
*Adelson-Velskii, G.M., Landis, E. M.: An information organization algorithm. DANSSSR, 146, 263-266 (1962).
*Edward H. Sussenguth, Jr., Use of tree structures for processing files, Communications of the ACM, v.6 n.5, p.272-279, May 1963
*[[Edward H. Sussenguth Jr.]], Use of tree structures for processing files, Communications of the ACM, v.6 n.5, p.&nbsp;272-279, May 1963
*Harman, D.K., et al: Inverted files. In Information Retrieval: Data Structures and Algorithms, Prentice-Hall, pp 28-43, 1992.
*Harman, D.K., et al.: Inverted files. In Information Retrieval: Data Structures and Algorithms, Prentice-Hall, pp 28–43, 1992.
*Lim, L., et al: Characterizing Web Document Change, LNCS 2118, 133–146, 2001.
*Lim, L., et al.: Characterizing Web Document Change, LNCS 2118, 133–146, 2001.
*Lim, L., et al: Dynamic Maintenance of Web Indexes Using Landmarks. Proc. of the 12th W3 Conference, 2003.
*Lim, L., et al.: Dynamic Maintenance of Web Indexes Using Landmarks. Proc. of the 12th W3 Conference, 2003.
*Moffat, A., Zobel, J.: Self-Indexing Inverted Files for Fast Text Retrival. ACM TIS, 349–379, October 1996, Volume 14, Number 4.
*Moffat, A., Zobel, J.: Self-Indexing Inverted Files for Fast Text Retrieval. ACM TIS, 349–379, October 1996, Volume 14, Number 4.
*Mehlhorn, K.: Data Structures and Efficient Algorithms, Springer Verlag, EATCS Monographs, 1984.
*[[Kurt Mehlhorn|Mehlhorn, K.]]: Data Structures and Efficient Algorithms, Springer Verlag, EATCS Monographs, 1984.
*Mehlhorn, K., Overmars, M.H.: Optimal Dynamization of Decomposable Searching Problems. IPL 12, 93–98, 1981.
*[[Kurt Mehlhorn|Mehlhorn, K.]], [[Mark Overmars|Overmars, M.H.]]: Optimal Dynamization of Decomposable Searching Problems. IPL 12, 93–98, 1981.
*Mehlhorn, K.: Lower Bounds on the Efficiency of Transforming Static Data Structures into Dynamic Data Structures. Math. Systems Theory 15, 1–16, 1981.
*[[Kurt Mehlhorn|Mehlhorn, K.]]: Lower Bounds on the Efficiency of Transforming Static Data Structures into Dynamic Data Structures. Math. Systems Theory 15, 1–16, 1981.
*Koster, M.: ALIWEB: Archie-Like indexing in the Web. Computer Networks and ISDN Systems, Vol. 27, No. 2 (1994) 175-182 (also see Proc. First Int'l World Wide Web Conf., Elsevier Science, Amsterdam, 1994, pp. 175-182)
*Koster, M.: ALIWEB: Archie-Like indexing in the Web. Computer Networks and ISDN Systems, Vol. 27, No. 2 (1994) 175-182 (also see Proc. First Int'l World Wide Web Conf., Elsevier Science, Amsterdam, 1994, pp.&nbsp;175–182)
*Serge Abiteboul and Victor Vianu. [http://dbpubs.stanford.edu:8090/pub/showDoc.Fulltext?lang=en&doc=1996-20&format=text&compression=&name=1996-20.text Queries and Computation on the Web]. Proceedings of the International Conference on Database Theory. Delphi, Greece 1997.
*[[Serge Abiteboul]] and [[Victor Vianu]]. [http://dbpubs.stanford.edu:8090/pub/showDoc.Fulltext?lang=en&doc=1996-20&format=text&compression=&name=1996-20.text Queries and Computation on the Web]. Proceedings of the International Conference on Database Theory. Delphi, Greece 1997.
*Ian H Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. New York: Van Nostrand Reinhold, 1994.
*Ian H Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. New York: Van Nostrand Reinhold, 1994.
*A. Emtage and P. Deutsch, "Archie--An Electronic Directory Service for the Internet." Proc. Usenix Winter 1992 Tech. Conf., Usenix Assoc., Berkeley, Calif., 1992, pp. 93-110.
*A. Emtage and P. Deutsch, "Archie--An Electronic Directory Service for the Internet." Proc. Usenix Winter 1992 Tech. Conf., Usenix Assoc., Berkeley, Calif., 1992, pp.&nbsp;93–110.
*M. Gray, [http://www.mit.edu/people/mkgray/net/ World Wide Web Wanderer].
*M. Gray, [https://www.mit.edu/people/mkgray/net/ World Wide Web Wanderer].
*D. Cutting and J. Pedersen. "Optimizations for Dynamic Inverted Index Maintenance." Proceedings of the 13th International Conference on Research and Development in Information Retrieval, pp. 405-411, September 1990.
*D. Cutting and J. Pedersen. "Optimizations for Dynamic Inverted Index Maintenance." Proceedings of the 13th International Conference on Research and Development in Information Retrieval, pp.&nbsp;405–411, September 1990.
*Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack. [http://www.ir.uwaterloo.ca/book/ Information Retrieval: Implementing and Evaluating Search Engines] {{Webarchive|url=https://web.archive.org/web/20201005195805/http://www.ir.uwaterloo.ca/book/ |date=2020-10-05 }}. MIT Press, Cambridge, Mass., 2010.

== References ==
<references/>


{{Internet search}}
[[Category:Information retrieval]]
[[Category:Searching]]
[[Category:Indexing]]


{{DEFAULTSORT:Search Engine Indexing}}
[[fr:Indexation]]
[[Category:Index (publishing)]]
[[it:Indicizzazione (motore di ricerca)]]
[[Category:Internet search algorithms]]

Latest revision as of 14:43, 22 September 2024

Search engine indexing is the collecting, parsing, and storing of data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is web indexing.

Popular search engines focus on the full-text indexing of online, natural language documents.[1] Media types such as pictures, video,[2] audio,[3] and graphics[4] are also searchable.

Meta search engines reuse the indices of other services and do not store a local index whereas cache-based search engines permanently store the index along with the corpus. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined time interval due to the required time and processing costs, while agent-based search engines index in real time.

Indexing

[edit]

The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours. The additional computer storage required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval.

Index design factors

[edit]

Major factors in designing a search engine's architecture include:

Merge factors
How data enters the index, or how words or subject features are added to the index during text corpus traversal, and whether multiple indexers can work asynchronously. The indexer must first check whether it is updating old content or adding new content. Traversal typically correlates to the data collection policy. Search engine index merging is similar in concept to the SQL Merge command and other merge algorithms.[5]
Storage techniques
How to store the index data, that is, whether information should be data compressed or filtered.
Index size
How much computer storage is required to support the index.
Lookup speed
How quickly a word can be found in the inverted index. The speed of finding an entry in a data structure, compared with how quickly it can be updated or removed, is a central focus of computer science.
Maintenance
How the index is maintained over time.[6]
Fault tolerance
How important it is for the service to be reliable. Issues include dealing with index corruption, determining whether bad data can be treated in isolation, dealing with bad hardware, partitioning, and schemes such as hash-based or composite partitioning,[7] as well as replication.

Index data structures

[edit]

Search engine architectures vary in the way indexing is performed and in methods of index storage to meet the various design factors.

Suffix tree
Figuratively structured like a tree, supports linear time lookup. Built by storing the suffixes of words. The suffix tree is a type of trie. Tries support extendible hashing, which is important for search engine indexing.[8] Used for searching for patterns in DNA sequences and clustering. A major drawback is that storing a word in the tree may require space beyond that required to store the word itself.[9] An alternate representation is a suffix array, which is considered to require less virtual memory and supports data compression such as the BWT algorithm.
Inverted index
Stores a list of occurrences of each atomic search criterion,[10] typically in the form of a hash table or binary tree.[11][12]
Citation index
Stores citations or hyperlinks between documents to support citation analysis, a subject of bibliometrics.
n-gram index
Stores sequences of length of data to support other types of retrieval or text mining.[13]
Document-term matrix
Used in latent semantic analysis, stores the occurrences of words in documents in a two-dimensional sparse matrix.

Challenges in parallelism

[edit]

A major challenge in the design of search engines is the management of serial computing processes. There are many opportunities for race conditions and coherent faults. For example, a new document is added to the corpus and the index must be updated, but the index simultaneously needs to continue responding to search queries. This is a collision between two competing tasks. Consider that authors are producers of information, and a web crawler is the consumer of this information, grabbing the text and storing it in a cache (or corpus). The forward index is the consumer of the information produced by the corpus, and the inverted index is the consumer of information produced by the forward index. This is commonly referred to as a producer-consumer model. The indexer is the producer of searchable information and users are the consumers that need to search. The challenge is magnified when working with distributed storage and distributed processing. In an effort to scale with larger amounts of indexed information, the search engine's architecture may involve distributed computing, where the search engine consists of several machines operating in unison. This increases the possibilities for incoherency and makes it more difficult to maintain a fully synchronized, distributed, parallel architecture.[14]

Inverted indices

[edit]

Many search engines incorporate an inverted index when evaluating a search query to quickly locate documents containing the words in a query and then rank these documents by relevance. Because the inverted index stores a list of the documents containing each word, the search engine can use direct access to find the documents associated with each word in the query in order to retrieve the matching documents quickly. The following is a simplified illustration of an inverted index:

Inverted index
Word Documents
the Document 1, Document 3, Document 4, Document 5, Document 7
cow Document 2, Document 3, Document 4
says Document 5
moo Document 7

This index can only determine whether a word exists within a particular document, since it stores no information regarding the frequency and position of the word; it is therefore considered to be a Boolean index. Such an index determines which documents match a query but does not rank matched documents. In some designs the index includes additional information such as the frequency of each word in each document or the positions of a word in each document.[15] Position information enables the search algorithm to identify word proximity to support searching for phrases; frequency can be used to help in ranking the relevance of documents to the query. Such topics are the central research focus of information retrieval.

The inverted index is a sparse matrix, since not all words are present in each document. To reduce computer storage memory requirements, it is stored differently from a two dimensional array. The index is similar to the term document matrices employed by latent semantic analysis. The inverted index can be considered a form of a hash table. In some cases the index is a form of a binary tree, which requires additional storage but may reduce the lookup time. In larger indices the architecture is typically a distributed hash table.[16]

Implementation of Phrase Search Using an Inverted Index

[edit]

For phrase searching, a specialized form of an inverted index called a positional index is used. A positional index not only stores the ID of the document containing the token but also the exact position(s) of the token within the document in the postings list. The occurrences of the phrase specified in the query are retrieved by navigating these postings list and identifying the indexes at which the desired terms occur in the expected order (the same as the order in the phrase). So if we are searching for occurrence of the phrase "First Witch", we would:

  1. Retrieve the postings list for "first" and "witch"
  2. Identify the first time that "witch" occurs after "first"
  3. Check that this occurrence is immediately after the occurrence of "first".
  4. If not, continue to the next occurrence of "first".

The postings lists can be navigated using a binary search in order to minimize the time complexity of this procedure.[17]

Index merging

[edit]

The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge but first deletes the contents of the inverted index. The architecture may be designed to support incremental indexing,[18] where a merge identifies the document or documents to be added or updated and then parses each document into words. For technical accuracy, a merge conflates newly indexed documents, typically residing in virtual memory, with the index cache residing on one or more computer hard drives.

After parsing, the indexer adds the referenced document to the document list for the appropriate words. In a larger search engine, the process of finding each word in the inverted index (in order to report that it occurred within a document) may be too time consuming, and so this process is commonly split up into two parts, the development of a forward index and a process which sorts the contents of the forward index into the inverted index. The inverted index is so named because it is an inversion of the forward index.

The forward index

[edit]

The forward index stores a list of words for each document. The following is a simplified form of the forward index:

Forward Index
Document Words
Document 1 the,cow,says,moo
Document 2 the,cat,and,the,hat
Document 3 the,dish,ran,away,with,the,spoon

The rationale behind developing a forward index is that as documents are parsed, it is better to intermediately store the words per document. The delineation enables asynchronous system processing, which partially circumvents the inverted index update bottleneck.[19] The forward index is sorted to transform it to an inverted index. The forward index is essentially a list of pairs consisting of a document and a word, collated by the document. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted index is a word-sorted forward index.

Compression

[edit]

Generating or maintaining a large-scale search engine index represents a significant storage and processing challenge. Many search engines utilize a form of compression to reduce the size of the indices on disk.[20] Consider the following scenario for a full text, Internet search engine.

Given this scenario, an uncompressed index (assuming a non-conflated, simple, index) for 2 billion web pages would need to store 500 billion word entries. At 1 byte per character, or 5 bytes per word, this would require 2500 gigabytes of storage space alone.[citation needed] This space requirement may be even larger for a fault-tolerant distributed storage architecture. Depending on the compression technique chosen, the index can be reduced to a fraction of this size. The tradeoff is the time and processing power required to perform compression and decompression.[citation needed]

Notably, large scale search engine designs incorporate the cost of storage as well as the costs of electricity to power the storage. Thus compression is a measure of cost.[citation needed]

Document parsing

[edit]

Document parsing breaks apart the components (words) of a document or other form of media for insertion into the forward and inverted indices. The words found are called tokens, and so, in the context of search engine indexing and natural language processing, parsing is more commonly referred to as tokenization. It is also sometimes called word boundary disambiguation, tagging, text segmentation, content analysis, text analysis, text mining, concordance generation, speech segmentation, lexing, or lexical analysis. The terms 'indexing', 'parsing', and 'tokenization' are used interchangeably in corporate slang.

Natural language processing is the subject of continuous research and technological improvement. Tokenization presents many challenges in extracting the necessary information from documents for indexing to support quality searching. Tokenization for indexing involves multiple technologies, the implementation of which are commonly kept as corporate secrets.[citation needed]

Challenges in natural language processing

[edit]
Word boundary ambiguity
Native English speakers may at first consider tokenization to be a straightforward task, but this is not the case with designing a multilingual indexer. In digital form, the texts of other languages such as Chinese or Japanese represent a greater challenge, as words are not clearly delineated by whitespace. The goal during tokenization is to identify words for which users will search. Language-specific logic is employed to properly identify the boundaries of words, which is often the rationale for designing a parser for each language supported (or for groups of languages with similar boundary markers and syntax).
Language ambiguity
To assist with properly ranking matching documents, many search engines collect additional information about each word, such as its language or lexical category (part of speech). These techniques are language-dependent, as the syntax varies among languages. Documents do not always clearly identify the language of the document or represent it accurately. In tokenizing the document, some search engines attempt to automatically identify the language of the document.
Diverse file formats
In order to correctly identify which bytes of a document represent characters, the file format must be correctly handled. Search engines that support multiple file formats must be able to correctly open and access the document and be able to tokenize the characters of the document.
Faulty storage
The quality of the natural language data may not always be perfect. An unspecified number of documents, particularly on the Internet, do not closely obey proper file protocol. Binary characters may be mistakenly encoded into various parts of a document. Without recognition of these characters and appropriate handling, the index quality or indexer performance could degrade.

Tokenization

[edit]

Unlike literate humans, computers do not understand the structure of a natural language document and cannot automatically recognize words and sentences. To a computer, a document is only a sequence of bytes. Computers do not 'know' that a space character separates words in a document. Instead, humans must program the computer to identify what constitutes an individual or distinct word referred to as a token. Such a program is commonly called a tokenizer or parser or lexer. Many search engines, as well as other natural language processing software, incorporate specialized programs for parsing, such as YACC or Lex.

During tokenization, the parser identifies sequences of characters that represent words and other elements, such as punctuation, which are represented by numeric codes, some of which are non-printing control characters. The parser can also identify entities such as email addresses, phone numbers, and URLs. When identifying each token, several characteristics may be stored, such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, like 'noun' or 'verb'), position, sentence number, sentence position, length, and line number.

Language recognition

[edit]

If the search engine supports multiple languages, a common initial step during tokenization is to identify each document's language; many of the subsequent steps are language dependent (such as stemming and part of speech tagging). Language recognition is the process by which a computer program attempts to automatically identify, or categorize, the language of a document. Other names for language recognition include language classification, language analysis, language identification, and language tagging. Automated language recognition is the subject of ongoing research in natural language processing. Finding which language the words belongs to may involve the use of a language recognition chart.

Format analysis

[edit]

If the search engine supports multiple document formats, documents must be prepared for tokenization. The challenge is that many document formats contain formatting information in addition to textual content. For example, HTML documents contain HTML tags, which specify formatting information such as new line starts, bold emphasis, and font size or style. If the search engine were to ignore the difference between content and 'markup', extraneous information would be included in the index, leading to poor search results. Format analysis is the identification and handling of the formatting content embedded within documents which controls the way the document is rendered on a computer screen or interpreted by a software program. Format analysis is also referred to as structure analysis, format parsing, tag stripping, format stripping, text normalization, text cleaning and text preparation. The challenge of format analysis is further complicated by the intricacies of various file formats. Certain file formats are proprietary with very little information disclosed, while others are well documented. Common, well-documented file formats that many search engines support include:

Options for dealing with various formats include using a publicly available commercial parsing tool that is offered by the organization which developed, maintains, or owns the format, and writing a custom parser.

Some search engines support inspection of files that are stored in a compressed or encrypted file format. When working with a compressed format, the indexer first decompresses the document; this step may result in one or more files, each of which must be indexed separately. Commonly supported compressed file formats include:

  • ZIP - Zip archive file
  • RAR - Roshal ARchive file
  • CAB - Microsoft Windows Cabinet File
  • Gzip - File compressed with gzip
  • BZIP - File compressed using bzip2
  • Tape ARchive (TAR), Unix archive file, not (itself) compressed
  • TAR.Z, TAR.GZ or TAR.BZ2 - Unix archive files compressed with Compress, GZIP or BZIP2

Format analysis can involve quality improvement methods to avoid including 'bad information' in the index. Content can manipulate the formatting information to include additional content. Examples of abusing document formatting for spamdexing:

  • Including hundreds or thousands of words in a section that is hidden from view on the computer screen, but visible to the indexer, by use of formatting (e.g. hidden "div" tag in HTML, which may incorporate the use of CSS or JavaScript to do so).
  • Setting the foreground font color of words to the same as the background color, making words hidden on the computer screen to a person viewing the document, but not hidden to the indexer.

Section recognition

[edit]

Some search engines incorporate section recognition, the identification of major parts of a document, prior to tokenization. Not all the documents in a corpus read like a well-written book, divided into organized chapters and pages. Many documents on the web, such as newsletters and corporate reports, contain erroneous content and side-sections that do not contain primary material (that which the document is about). For example, articles on the Wikipedia website display a side menu with links to other web pages. Some file formats, like HTML or PDF, allow for content to be displayed in columns. Even though the content is displayed, or rendered, in different areas of the view, the raw markup content may store this information sequentially. Words that appear sequentially in the raw source content are indexed sequentially, even though these sentences and paragraphs are rendered in different parts of the computer screen. If search engines index this content as if it were normal content, the quality of the index and search quality may be degraded due to the mixed content and improper word proximity. Two primary problems are noted:

  • Content in different sections is treated as related in the index when in reality it is not
  • Organizational side bar content is included in the index, but the side bar content does not contribute to the meaning of the document, and the index is filled with a poor representation of its documents.

Section analysis may require the search engine to implement the rendering logic of each document, essentially an abstract representation of the actual document, and then index the representation instead. For example, some content on the Internet is rendered via JavaScript. If the search engine does not render the page and evaluate the JavaScript within the page, it would not 'see' this content in the same way and would index the document incorrectly. Given that some search engines do not bother with rendering issues, many web page designers avoid displaying content via JavaScript or use the Noscript tag to ensure that the web page is indexed properly. At the same time, this fact can also be exploited to cause the search engine indexer to 'see' different content than the viewer.

HTML priority system

[edit]

Indexing often has to recognize the HTML tags to organize priority. Indexing low priority to high margin to labels like strong and link to optimize the order of priority if those labels are at the beginning of the text could not prove to be relevant. Some indexers like Google and Bing ensure that the search engine does not take the large texts as relevant source due to strong type system compatibility.[23]

Meta tag indexing

[edit]

Meta tag indexing plays an important role in organizing and categorizing web content. Specific documents often contain embedded meta information such as author, keywords, description, and language. For HTML pages, the meta tag contains keywords which are also included in the index. Earlier Internet search engine technology would only index the keywords in the meta tags for the forward index; the full document would not be parsed. At that time full-text indexing was not as well established, nor was computer hardware able to support such technology. The design of the HTML markup language initially included support for meta tags for the very purpose of being properly and easily indexed, without requiring tokenization.[24]

As the Internet grew through the 1990s, many brick-and-mortar corporations went 'online' and established corporate websites. The keywords used to describe webpages (many of which were corporate-oriented webpages similar to product brochures) changed from descriptive to marketing-oriented keywords designed to drive sales by placing the webpage high in the search results for specific search queries. The fact that these keywords were subjectively specified was leading to spamdexing, which drove many search engines to adopt full-text indexing technologies in the 1990s. Search engine designers and companies could only place so many 'marketing keywords' into the content of a webpage before draining it of all interesting and useful information. Given that conflict of interest with the business goal of designing user-oriented websites which were 'sticky', the customer lifetime value equation was changed to incorporate more useful content into the website in hopes of retaining the visitor. In this sense, full-text indexing was more objective and increased the quality of search engine results, as it was one more step away from subjective control of search engine result placement, which in turn furthered research of full-text indexing technologies.

In desktop search, many solutions incorporate meta tags to provide a way for authors to further customize how the search engine will index content from various files that is not evident from the file content. Desktop search is more under the control of the user, while Internet search engines must focus more on the full text index.

See also

[edit]

References

[edit]
  1. ^ Clarke, C., Cormack, G.: Dynamic Inverted Indexes for a Distributed Full-Text Retrieval System. TechRep MT-95-01, University of Waterloo, February 1995.
  2. ^ Sikos, L. F. (August 2016). "RDF-powered semantic video annotation tools with concept mapping to Linked Data for next-generation video indexing". Multimedia Tools and Applications. doi:10.1007/s11042-016-3705-7. S2CID 254832794.[permanent dead link]
  3. ^ http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf [bare URL PDF]
  4. ^ Charles E. Jacobs, Adam Finkelstein, David H. Salesin. Fast Multiresolution Image Querying. Department of Computer Science and Engineering, University of Washington. 1995. Verified Dec 2006
  5. ^ Brown, E.W.: Execution Performance Issues in Full-Text Information Retrieval. Computer Science Department, University of Massachusetts Amherst, Technical Report 95-81, October 1995.
  6. ^ Cutting, D., Pedersen, J.: Optimizations for dynamic inverted index maintenance. Proceedings of SIGIR, 405-411, 1990.
  7. ^ Linear Hash Partitioning. MySQL 5.1 Reference Manual. Verified Dec 2006
  8. ^ trie, Dictionary of Algorithms and Data Structures, U.S. National Institute of Standards and Technology.
  9. ^ Gusfield, Dan (1999) [1997]. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. US: Cambridge University Press. ISBN 0-521-58519-8..
  10. ^ Black, Paul E., inverted index, Dictionary of Algorithms and Data Structures, U.S. National Institute of Standards and Technology Oct 2006. Verified Dec 2006.
  11. ^ C. C. Foster, Information retrieval: information storage and retrieval using AVL trees, Proceedings of the 1965 20th national conference, p.192-205, August 24–26, 1965, Cleveland, Ohio, United States
  12. ^ Landauer, W. I.: The balanced tree and its utilization in information retrieval. IEEE Trans. on Electronic Computers, Vol. EC-12, No. 6, December 1963.
  13. ^ Google Ngram Datasets Archived 2013-09-29 at the Wayback Machine for sale at LDC Catalog
  14. ^ Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Google, Inc. OSDI. 2004.
  15. ^ Grossman, Frieder, Goharian. IR Basics of Inverted Index. 2002. Verified Aug 2011.
  16. ^ Tang, Hunqiang. Dwarkadas, Sandhya. "Hybrid Global Local Indexing for Efficient Peer to Peer Information Retrieval". University of Rochester. Pg 1. http://www.cs.rochester.edu/u/sandhya/papers/nsdi04.ps
  17. ^ Büttcher, Stefan; Clarke, Charles L. A.; Cormack, Gordon V. (2016). Information retrieval: implementing and evaluating search engines (First MIT Press paperback ed.). Cambridge, Massachusetts London, England: The MIT Press. ISBN 978-0-262-52887-0.
  18. ^ Tomasic, A., et al.: Incremental Updates of Inverted Lists for Text Document Retrieval. Short Version of Stanford University Computer Science Technical Note STAN-CS-TN-93-1, December, 1993.
  19. ^ Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Stanford University. 1998. Verified Dec 2006.
  20. ^ H.S. Heaps. Storage analysis of a compression coding for a document database. 1NFOR, I0(i):47-61, February 1972.
  21. ^ The Unicode Standard - Frequently Asked Questions. Verified Dec 2006.
  22. ^ Storage estimates. Verified Dec 2006.
  23. ^ Google Webmaster Tools, "Hypertext Markup Language 5", Conference for SEO January 2012.
  24. ^ Berners-Lee, T., "Hypertext Markup Language - 2.0", RFC 1866, Network Working Group, November 1995.

Further reading

[edit]
  • R. Bayer and E. McCreight. Organization and maintenance of large ordered indices. Acta Informatica, 173-189, 1972.
  • Donald E. Knuth. The Art of Computer Programming, volume 1 (3rd ed.): fundamental algorithms, Addison Wesley Longman Publishing Co. Redwood City, CA, 1997.
  • Donald E. Knuth. The art of computer programming, volume 3: (2nd ed.) sorting and searching, Addison Wesley Longman Publishing Co. Redwood City, CA, 1998.
  • Gerald Salton. Automatic text processing, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1988.
  • Gerard Salton. Michael J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, Inc., New York, NY, 1986.
  • Gerard Salton. Lesk, M.E.: Computer evaluation of indexing and text processing. Journal of the ACM. January 1968.
  • Gerard Salton. The SMART Retrieval System - Experiments in Automatic Document Processing. Prentice Hall Inc., Englewood Cliffs, 1971.
  • Gerard Salton. The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, Reading, Mass., 1989.
  • Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Chapter 8. ACM Press 1999.
  • G. K. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.
  • Adelson-Velskii, G.M., Landis, E. M.: An information organization algorithm. DANSSSR, 146, 263-266 (1962).
  • Edward H. Sussenguth Jr., Use of tree structures for processing files, Communications of the ACM, v.6 n.5, p. 272-279, May 1963
  • Harman, D.K., et al.: Inverted files. In Information Retrieval: Data Structures and Algorithms, Prentice-Hall, pp 28–43, 1992.
  • Lim, L., et al.: Characterizing Web Document Change, LNCS 2118, 133–146, 2001.
  • Lim, L., et al.: Dynamic Maintenance of Web Indexes Using Landmarks. Proc. of the 12th W3 Conference, 2003.
  • Moffat, A., Zobel, J.: Self-Indexing Inverted Files for Fast Text Retrieval. ACM TIS, 349–379, October 1996, Volume 14, Number 4.
  • Mehlhorn, K.: Data Structures and Efficient Algorithms, Springer Verlag, EATCS Monographs, 1984.
  • Mehlhorn, K., Overmars, M.H.: Optimal Dynamization of Decomposable Searching Problems. IPL 12, 93–98, 1981.
  • Mehlhorn, K.: Lower Bounds on the Efficiency of Transforming Static Data Structures into Dynamic Data Structures. Math. Systems Theory 15, 1–16, 1981.
  • Koster, M.: ALIWEB: Archie-Like indexing in the Web. Computer Networks and ISDN Systems, Vol. 27, No. 2 (1994) 175-182 (also see Proc. First Int'l World Wide Web Conf., Elsevier Science, Amsterdam, 1994, pp. 175–182)
  • Serge Abiteboul and Victor Vianu. Queries and Computation on the Web. Proceedings of the International Conference on Database Theory. Delphi, Greece 1997.
  • Ian H Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. New York: Van Nostrand Reinhold, 1994.
  • A. Emtage and P. Deutsch, "Archie--An Electronic Directory Service for the Internet." Proc. Usenix Winter 1992 Tech. Conf., Usenix Assoc., Berkeley, Calif., 1992, pp. 93–110.
  • M. Gray, World Wide Web Wanderer.
  • D. Cutting and J. Pedersen. "Optimizations for Dynamic Inverted Index Maintenance." Proceedings of the 13th International Conference on Research and Development in Information Retrieval, pp. 405–411, September 1990.
  • Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack. Information Retrieval: Implementing and Evaluating Search Engines Archived 2020-10-05 at the Wayback Machine. MIT Press, Cambridge, Mass., 2010.