Basic concepts.
Agent:
An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through effectors.
The field of Information Retrieval deals with the representation, storage, organization of, and access to information items.
Percepts: what the agent perceives from its environment
User’s query
Documents
Users’ feedback
Users’ actions
Actions: what the agent can do in its environment
Follow links
Retrieve documents
Query search engine
Expand query
Goals: what should the agent try to achieve
Find the exact information, organise the information
Environment: what the agent acts and perceives within
Different tasks, interfaces, formats are hard to combine
Domain knowledge does not scale well
Search a large collection of documents to find the ones that satisfy an information need
Indexing (Scrape & Store)
Document representation
Comparison with query
Evaluation/feedback
Manual indexing: libraries
Automatic indexing: an indexing program assigns keywords, phrases, and other features
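As an illustration of automatic indexing, here is a minimal sketch of a program that assigns keywords to a document by frequency. The stopword list and the `extract_keywords` name are hypothetical, chosen only for this example; real indexers use much richer processing.

```python
import re
from collections import Counter

# Hypothetical minimal stopword list; a real indexer would use a larger one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def extract_keywords(text, top_k=5):
    """Assign keywords to a document: tokenise, drop stopwords, keep the most frequent terms."""
    tokens = re.findall(r"[a-z]+", text.lower())   # simple tokenisation
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [term for term, _ in counts.most_common(top_k)]

print(extract_keywords("The cat eats the mouse, and the mouse eats chocolate"))
# keywords in descending frequency order
```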
Exact Match: Boolean (Yes or No); a minimal sketch of Boolean matching follows this list
Best Match:
Vector Space (Cosine Similarity)
Citation analysis models
Probabilistic models
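A minimal sketch of exact-match Boolean retrieval over an inverted index, using the small cat/mouse corpus from the TF-IDF example later in these notes. A document is returned only if it satisfies the Boolean condition; there is no ranking.

```python
# Inverted index: term -> set of document ids that contain it.
docs = {
    1: "cat eat mouse mouse eat chocolate",
    2: "cat eat mouse",
    3: "mouse eat chocolate mouse",
}

index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def boolean_and(*terms):
    """Exact match: a document matches only if it contains ALL query terms."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(boolean_and("cat", "mouse"))       # {1, 2}
print(boolean_and("cat", "chocolate"))   # {1}
```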
Any text object can be represented by a term vector.
Example:
D1: 0.3, 0.1, 0.4
D2: 0.8, 0.5, 0.6
Query: 0.0, 0.2, 0.0
Cosine Similarity: similarity is determined by the angle between vectors in the vector space (a smaller angle means higher similarity).
$\text{Cosine Similarity} = \frac{A \cdot B}{\parallel A \parallel \, \parallel B \parallel}$
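A minimal sketch of the cosine similarity computation, applied to the example vectors D1, D2 and the query above (plain Python, no external libraries; the function name is my own):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

d1 = [0.3, 0.1, 0.4]
d2 = [0.8, 0.5, 0.6]
query = [0.0, 0.2, 0.0]

print(cosine_similarity(d1, query))   # ~0.196
print(cosine_similarity(d2, query))   # ~0.447
```

On these example numbers, D2 is ranked above D1 because its direction is closer to the query's.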
Term weights reflect the estimated importance of each term. The more often a word occurs in a document, the better that term is in describing what the document is about. On the other hand, terms that appear in many documents in the collection are not very useful for distinguishing documents.
Term weight $W_{ij} = TF \times IDF $
$TF$: Term Frequency
$TF = \frac{\text{termCount}}{\text{documentLength}}$
$IDF$: Inverse Document Frequency
$IDF = \log \left( \frac{N}{\mid \{ d \in D : t \in d \} \mid} \right)$
Where
$N:$ total number of documents in the corpus
$\mid \{ d \in D : t \in d \} \mid$: the number of documents in which term $t$ appears. If the term is not in the corpus, this leads to a division by zero, so it is common to adjust the denominator to $1 + \mid \{ d \in D : t \in d \} \mid$.
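A minimal sketch of the $TF \times IDF$ weighting defined above, including the $1 + \dots$ adjustment to the denominator (function and variable names are my own; documents are passed as token lists):

```python
import math

def tf(term, doc_tokens):
    """Term frequency: term count divided by document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    """Inverse document frequency with the 1 + ... denominator adjustment."""
    n_docs = len(corpus_tokens)
    df = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(n_docs / (1 + df))

def tf_idf(term, doc_tokens, corpus_tokens):
    """Term weight W = TF x IDF."""
    return tf(term, doc_tokens) * idf(term, corpus_tokens)
```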
Example:
D1: cat eat mouse, mouse eat chocolate
D2: cat eat mouse
D3: mouse eat chocolate mouse
Query: cat
Term Frequency of D1 (terms ordered cat, eat, mouse, chocolate):
$TF: \frac{1}{6}, \frac{2}{6}, \frac{2}{6}, \frac{1}{6}$
$IDF$ (same term order, computed over the $N = 3$ documents):
$IDF: \log(\frac{3}{2}), \log(\frac{3}{3}), \log(\frac{3}{3}), \log(\frac{3}{2})$
$Weights = IDF \times TF$
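Multiplying the $TF$ and $IDF$ vectors component-wise (keeping the term order cat, eat, mouse, chocolate, $N = 3$, and no $1+$ adjustment) gives the weight vector for D1:
$W_{D1} = \left( \frac{1}{6}\log\frac{3}{2},\ \frac{2}{6}\log\frac{3}{3},\ \frac{2}{6}\log\frac{3}{3},\ \frac{1}{6}\log\frac{3}{2} \right) = \left( \frac{\log(3/2)}{6},\ 0,\ 0,\ \frac{\log(3/2)}{6} \right)$
Since $\log(3/3) = 0$, terms that occur in every document (eat, mouse) receive zero weight, while the rarer terms (cat, chocolate) carry all the weight; only the first component matters for the query "cat".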
$Recall = \frac{\mid \{\text{relevant documents}\} \cap \{\text{retrieved documents}\} \mid}{\mid \{\text{relevant documents}\} \mid}$
$Precision = \frac{\mid \{\text{relevant documents}\} \cap \{\text{retrieved documents}\} \mid}{\mid \{\text{retrieved documents}\} \mid}$
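A minimal sketch of the set-based precision and recall computations (the document ids and relevance judgements below are hypothetical):

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical case: documents 1, 3, 5 are relevant; the system returned 1, 2, 3, 4.
print(precision_recall([1, 2, 3, 4], [1, 3, 5]))   # (0.5, 0.666...)
```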
Traditionally, one uses a "test collection" composed of three entities: a collection of documents, a set of queries (information needs), and relevance judgements indicating which documents are relevant to each query.
Based on this test collection, one can compare the results of a query using two different IR systems.
System effectiveness: evaluation measures
Set-based measures: documents in the ranking are treated as an unordered set of unique items; the ordering of results is ignored.
Precision
Recall
Precision and recall typically hold an approximate inverse relationship, although this is not always the case.
Compared with other measures, precision is simple to compute because one only considers the set of retrieved documents. Computing recall, however, requires comparing the set of retrieved documents with the entire collection, which is impossible in many cases (e.g. web search).
In web search, the focus is typically on obtaining high precision by finding as many relevant documents as possible in the top $n$ results. In certain domains, however, the focus is on finding all relevant documents through an exhaustive search; there, alternative recall-oriented measures can be employed: the $E$ measure and $F$ measure.
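For reference, the $F$ measure is usually defined as the weighted harmonic mean of precision $P$ and recall $R$, with the $E$ measure as its complement (the notation below is the common one; $\beta$ weights recall against precision):
$F_\beta = \frac{(1+\beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}, \qquad E = 1 - F_\beta$
With $\beta = 1$ this reduces to the balanced $F_1 = \frac{2PR}{P + R}$; $\beta > 1$ emphasises recall, which suits the exhaustive-search setting above.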
Rank-based measures: based on evaluating ranked retrieval results, where not only the number of relevant documents matters, but also how high the relevant documents appear in the ranked list. A common way to evaluate ranked documents is to compute precision at various levels of recall.
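A minimal sketch of rank-based evaluation: given a ranked result list and the set of relevant documents (both hypothetical here), compute precision and recall after each rank position.

```python
def precision_recall_at_ranks(ranked, relevant):
    """Precision and recall after each position in a ranked result list."""
    relevant = set(relevant)
    results, hits = [], 0
    for k, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
        results.append((k, hits / k, hits / len(relevant)))
    return results  # list of (rank, precision@k, recall@k)

# Hypothetical ranking; the relevant documents are 1, 3 and 5.
for rank, p, r in precision_recall_at_ranks([3, 2, 1, 4, 5], [1, 3, 5]):
    print(f"rank {rank}: precision={p:.2f} recall={r:.2f}")
```

Reading the precision values off at increasing recall levels gives the precision-at-recall figures mentioned above.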