THE VECTOR PROCESSING MODEL
Meaning of vector:
A vector is a physical quantity that has both magnitude and direction.
Meaning of model:
A model is a schematic description of a system, theory or phenomenon that accounts for its known or inferred properties.
Various mathematical models have been proposed to represent information retrieval systems and procedures. One of these is the vector processing model, which represents documents and queries by term sets and compares global similarities between queries and documents.
The vector processing model assumes that a single available term set is used for both the stored records and the information requests, so that each record and each request can be represented as a term vector.
Consider a collection of documents in which each document is characterized by one or more index terms. Thus, the documents are the objects in the collection each of which is represented by a number of index terms. The similarity between two objects is normally computed as a function of the number of properties that are assigned to both objects. Substantially similar methods can be used for determining collection structure and for retrieving information by comparing the query vectors with the vectors representing the stored items and retrieving items that are found to be similar to the queries.
Consider two documents, DOCi and DOCj. Let TERMik represent the weight of term (property) k assigned to document i. The value of TERMik may be restricted to zero or one (a binary system), or the weight may vary from zero to some maximum value (say four or six). The two document vectors may then be represented as
DOCi = (TERMi1, TERMi2, TERMi3, ..., TERMit)
DOCj = (TERMj1, TERMj2, TERMj3, ..., TERMjt)
where t terms (i.e. properties) have been assigned to characterize each document (i.e. object).
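As a rough illustration of this representation, the following Python sketch builds a weighted and a binary term vector over one shared term set. The term set, the weights and the helper name `to_vector` are hypothetical and were chosen only for this example.

```python
# Hypothetical term set shared by all documents (t = 5 here).
TERMS = ["retrieval", "vector", "query", "index", "weight"]


def to_vector(term_weights, binary=False):
    """Return the term vector (TERMi1, ..., TERMit) for one document.

    term_weights maps an index term to its weight; terms that are not
    assigned to the document receive the weight 0.
    """
    vector = [term_weights.get(term, 0) for term in TERMS]
    if binary:
        # Binary system: 1 if the term is assigned, 0 otherwise.
        vector = [1 if weight > 0 else 0 for weight in vector]
    return vector


assigned = {"retrieval": 3, "vector": 2, "weight": 1}
print(to_vector(assigned))               # [3, 2, 0, 0, 1]
print(to_vector(assigned, binary=True))  # [1, 1, 0, 0, 1]
```

Every vector has the same length t and the same term order, which is what allows two vectors to be compared component by component.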
The following vector functions are to be considered to compute the similarity between the two given vectors:
(1) $\sum_{k=1}^{t} \mathrm{TERM}_{ik}$
which denotes the sum of the weights of all the properties included in a given vector;
(2) $\sum_{k=1}^{t} \mathrm{TERM}_{ik} \cdot \mathrm{TERM}_{jk}$
which denotes the component-by-component vector product, consisting of the sum of the products of the corresponding term weights for the two vectors;
(3) $\sum_{k=1}^{t} \min(\mathrm{TERM}_{ik}, \mathrm{TERM}_{jk})$
which denotes the sum of the minimum component weights of the two vectors; and
(4) $\sqrt{\sum_{k=1}^{t} (\mathrm{TERM}_{ik})^{2}}$
which denotes the length of the property vector (here, for the document DOCi) when the property vectors are considered as ordinary vectors.
These functions can be illustrated with the following example. Suppose the two document vectors are represented as
DOCi = (3, 2, 1, 0, 0, 0, 1, 1)
DOCj = (1, 1, 1, 0, 0, 1, 0, 0)
Where each document is assigned eight index terms. The four vector functions will then be:
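(1) $\sum_{k=1}^{t} \mathrm{TERM}_{ik} = 3+2+1+0+0+0+1+1 = 8$
(2) $\sum_{k=1}^{t} \mathrm{TERM}_{ik} \cdot \mathrm{TERM}_{jk} = (3 \cdot 1)+(2 \cdot 1)+(1 \cdot 1)+(0 \cdot 0)+(0 \cdot 0)+(0 \cdot 1)+(1 \cdot 0)+(1 \cdot 0) = 3+2+1+0+0+0+0+0 = 6$
(3) $\sum_{k=1}^{t} \min(\mathrm{TERM}_{ik}, \mathrm{TERM}_{jk}) = \min(3,1)+\min(2,1)+\min(1,1)+\min(0,0)+\min(0,0)+\min(0,1)+\min(1,0)+\min(1,0) = 1+1+1+0+0+0+0+0 = 3$
(4) $\sqrt{\sum_{k=1}^{t} (\mathrm{TERM}_{ik})^{2}} = \sqrt{(3 \cdot 3)+(2 \cdot 2)+(1 \cdot 1)+(0 \cdot 0)+(0 \cdot 0)+(0 \cdot 0)+(1 \cdot 1)+(1 \cdot 1)} = \sqrt{16} = 4$
Applying functions (1) and (4) to DOCj in the same way gives a weight sum of 1+1+1+0+0+1+0+0 = 4 and a length of √4 = 2.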
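The four vector functions can be checked with a short Python sketch using the two example vectors. The function names (`weight_sum`, `inner_product`, `min_sum`, `length`) are illustrative choices and do not come from the source.

```python
import math


def weight_sum(doc):
    """Function (1): sum of all term weights in one vector."""
    return sum(doc)


def inner_product(a, b):
    """Function (2): sum of the products of corresponding weights."""
    return sum(x * y for x, y in zip(a, b))


def min_sum(a, b):
    """Function (3): sum of the smaller weight in each component."""
    return sum(min(x, y) for x, y in zip(a, b))


def length(doc):
    """Function (4): length of the vector, the root of the sum of squares."""
    return math.sqrt(sum(x * x for x in doc))


DOC_I = [3, 2, 1, 0, 0, 0, 1, 1]
DOC_J = [1, 1, 1, 0, 0, 1, 0, 0]

print(weight_sum(DOC_I))            # 8
print(inner_product(DOC_I, DOC_J))  # 6
print(min_sum(DOC_I, DOC_J))        # 3
print(length(DOC_I))                # 4.0
```

The same functions applied to DOC_J give a weight sum of 4 and a length of 2.0.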
Several coefficients can be used as similarity measures; Salton and McGill [7] give five such coefficients, which are shown below together with their values for the example vectors DOCi and DOCj.
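1. The Dice coefficient:
$\mathrm{SIM}(\mathrm{DOC}_i, \mathrm{DOC}_j) = \dfrac{2\sum_{k=1}^{t} (\mathrm{TERM}_{ik} \cdot \mathrm{TERM}_{jk})}{\sum_{k=1}^{t} \mathrm{TERM}_{ik} + \sum_{k=1}^{t} \mathrm{TERM}_{jk}} = \dfrac{2(6)}{8+4} = 1$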
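2. The Jaccard coefficient:
$\mathrm{SIM}(\mathrm{DOC}_i, \mathrm{DOC}_j) = \dfrac{\sum_{k=1}^{t} (\mathrm{TERM}_{ik} \cdot \mathrm{TERM}_{jk})}{\sum_{k=1}^{t} \mathrm{TERM}_{ik} + \sum_{k=1}^{t} \mathrm{TERM}_{jk} - \sum_{k=1}^{t} (\mathrm{TERM}_{ik} \cdot \mathrm{TERM}_{jk})} = \dfrac{6}{8+4-6} = 1$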
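3. The cosine coefficient, which measures the cosine of the angle between two object vectors in a space of t dimensions:
$\mathrm{SIM}(\mathrm{DOC}_i, \mathrm{DOC}_j) = \dfrac{\sum_{k=1}^{t} (\mathrm{TERM}_{ik} \cdot \mathrm{TERM}_{jk})}{\sqrt{\sum_{k=1}^{t} (\mathrm{TERM}_{ik})^{2} \cdot \sum_{k=1}^{t} (\mathrm{TERM}_{jk})^{2}}} = \dfrac{6}{\sqrt{16 \cdot 4}} = \dfrac{6}{8} = 0.75$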
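4. The overlap coefficient:
$\mathrm{SIM}(\mathrm{DOC}_i, \mathrm{DOC}_j) = \dfrac{\sum_{k=1}^{t} \min(\mathrm{TERM}_{ik}, \mathrm{TERM}_{jk})}{\min\left(\sum_{k=1}^{t} \mathrm{TERM}_{ik}, \sum_{k=1}^{t} \mathrm{TERM}_{jk}\right)} = \dfrac{3}{\min(8, 4)} = \dfrac{3}{4} = 0.75$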
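5. The asymmetric coefficient:
$\mathrm{SIM}(\mathrm{DOC}_i, \mathrm{DOC}_j) = \dfrac{\sum_{k=1}^{t} \min(\mathrm{TERM}_{ik}, \mathrm{TERM}_{jk})}{\sum_{k=1}^{t} \mathrm{TERM}_{ik}} = \dfrac{3}{8} = 0.375$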
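The five coefficients can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are invented here, the vectors are assumed to be non-zero (otherwise a denominator would be zero), and the overlap and asymmetric coefficients are taken with the sum of minimum weights (function 3) as the numerator.

```python
import math


def inner_product(a, b):
    """Sum of the products of corresponding term weights."""
    return sum(x * y for x, y in zip(a, b))


def min_sum(a, b):
    """Sum of the smaller weight in each component."""
    return sum(min(x, y) for x, y in zip(a, b))


def dice(a, b):
    return 2 * inner_product(a, b) / (sum(a) + sum(b))


def jaccard(a, b):
    shared = inner_product(a, b)
    return shared / (sum(a) + sum(b) - shared)


def cosine(a, b):
    return inner_product(a, b) / math.sqrt(inner_product(a, a) * inner_product(b, b))


def overlap(a, b):
    # Assumption: numerator is the sum of minimum weights.
    return min_sum(a, b) / min(sum(a), sum(b))


def asymmetric(a, b):
    # Assumption: numerator is the sum of minimum weights; only the
    # first vector appears in the denominator, so the order matters.
    return min_sum(a, b) / sum(a)


DOC_I = [3, 2, 1, 0, 0, 0, 1, 1]
DOC_J = [1, 1, 1, 0, 0, 1, 0, 0]

for measure in (dice, jaccard, cosine, overlap, asymmetric):
    print(measure.__name__, measure(DOC_I, DOC_J))
```

Under those assumptions the loop prints 1.0 for Dice and Jaccard, 0.75 for cosine and overlap, and 0.375 for the asymmetric coefficient. Swapping the arguments of `asymmetric` gives 3/4 = 0.75 instead, which shows why the coefficient is called asymmetric.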
Advantages of vector processing model:
1. Term weighting improves retrieval quality.
2. It allows approximate (partial) matching.
3. It ranks retrieved items by similarity (for example, with the cosine formula).
4. It is simple and fast.
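Partial matching and ranking by similarity can be illustrated with a minimal Python sketch. The collection, the query vector and the function names below are hypothetical; the cosine coefficient is used for scoring, and a zero vector is given a score of 0.0 as a convention of this sketch.

```python
import math


def cosine(a, b):
    """Cosine coefficient; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query, documents):
    """Return (name, score) pairs sorted by decreasing similarity."""
    scored = [(name, cosine(query, vector)) for name, vector in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical collection over eight index terms.
DOCUMENTS = {
    "DOC1": [3, 2, 1, 0, 0, 0, 1, 1],
    "DOC2": [1, 1, 1, 0, 0, 1, 0, 0],
    "DOC3": [0, 0, 0, 2, 1, 0, 0, 0],
}
QUERY = [1, 1, 0, 0, 0, 0, 0, 0]

for name, score in rank(QUERY, DOCUMENTS):
    print(name, round(score, 3))
# DOC1 0.884
# DOC2 0.707
# DOC3 0.0
```

DOC1 and DOC2 both contain terms that the query does not mention, yet they are still retrieved and ordered by score, whereas DOC3 shares no term with the query and falls to the bottom of the list.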
Disadvantages of vector processing model:
1. It assumes that index terms are independent of one another.
2. It does not support logical (Boolean) expressions in queries.
