# Entity-oriented Search

Year: 2018
Publisher: Springer
Language: English
Pages: 354
ISBN 13: 9783319939353
Krisztian Balog

University of Stavanger
Stavanger, Norway
ISSN 1387-5264
The Information Retrieval Series
ISBN 978-3-319-93933-9
ISBN 978-3-319-93935-3 (eBook)
https://doi.org/10.1007/978-3-319-93935-3
Library of Congress Control Number: 2018946540
© The Editor(s) (if applicable) and the Author(s) 2018

Preface

I have not yet reached my goal. . . But I forget what is behind,
and I struggle for what is ahead. I run toward the goal, so I can
win the prize of being called to heaven. This is the prize God
offers because of what Christ Jesus has done.
(Philippians 3:12–14, CEV)

The idea of writing this book stemmed from a series of tutorials that I gave with
colleagues on “entity linking and retrieval for semantic search.” There was no single
text on this topic that would cover all the material that I wished to introduce to
someone who is new to this field. With this book, I set out to fill that gap. I hope that
by making the book open access, many will be able to use it and benefit from it.
For me, writing this book, in many ways, was like running a marathon. No one
forced me to do it, yet I thought that—for some reason—it’d be a good idea to
challenge myself to do it. Then, along the way, there comes inevitably a point where
one asks: Why am I doing this to myself? But then, in the end, crossing the finish
line certainly feels like an accomplishment. In time, this experience might even be
remembered as if it was a walk in the park.1 In any case, it was a good run.
I wish to express my gratitude to a number of people who played a role in
making this book happen. First of all, I would like to thank Ralf Gerstner, executive
editor for Computer Science at Springer, for seeing me through to the successful
completion of this book and for always being a gentleman when it came to my
deadline extension requests. I also want to thank the Information Retrieval Series
editors Maarten de Rijke and ChengXiang Zhai for the comments on my book
proposal.
A very special thanks to Jamie Callan and to anonymous Reviewer #2 for reviewing the book and for making numerous valuable suggestions for improvements.
The following colleagues provided feedback on drafts of specific chapters at
various stages of completion, and I would like to thank them for their insightful
comments: Marek Ciglan, Arjen de Vries, Kalervo Järvelin, Miguel Martinez, Edgar
Meij, Kjetil Nørvåg, Doug Oard, Heri Ramampiaro, Ralf Schenkel, Alberto Tonon,
and Chenyan Xiong.

1 Note to self: No, it wasn’t.

I want to thank Edgar Meij and Daan Odijk for the collaboration on the entity
linking and retrieval tutorials, which planted the idea of this book. Working with
you was always easy, enjoyable, and fun. My gratitude goes to all my co-authors for
the joint work that contributed to the material that is presented in this book.
I am especially grateful to the Department of Electrical Engineering and
Computer Science at the University of Stavanger for providing a pleasant work
environment, where I could devote a substantial amount of time to writing this book.
I would like to thank my PhD students for giving me their honest opinion and
offering constructive criticism on drafts of the book. They are, in gender-first-then-alphabetical order: Faegheh Hasibi, Jan Benetka, Heng Ding, Darío Garigliotti,
Trond Linjordet, and Shuo Zhang. Special thanks, in addition, to Faegheh for the
thorough checking of technical details and for suggestions on the organization of
the material; to Darío for tidying up my references; to Jan for prettifying the figures
and illustrations; to Trond for injecting entropy and for the careful proofreading
and numerous suggestions for language improvements; to Shuo and Heng for the
oriental perspective and for telling me that I use too many words.
Last but not least, I want to thank my friends and family for their outstanding
support throughout the years. You know who you are.
Stavanger, Norway
April 2018

Krisztian Balog

Website

http://eos-book.org
This book is accompanied by the above website. The website provides a variety of
supplementary material, corrections of mistakes, and related resources.

Contents

1 Introduction
  1.1 What Is an Entity?
    1.1.1 Named Entities vs. Concepts
    1.1.2 Properties of Entities
    1.1.3 Representing Properties of Entities
  1.2 A Brief Historical Outlook
    1.2.1 Information Retrieval
    1.2.2 Databases
    1.2.3 Natural Language Processing
    1.2.4 Semantic Web
  1.3 Entity-Oriented Search
    1.3.1 A Bird’s-Eye View
    1.3.2 Tasks and Challenges
    1.3.3 Entity-Oriented vs. Semantic Search
    1.3.4 Application Areas
  1.4 About the Book
    1.4.1 Focus
    1.4.2 Audience and Prerequisites
    1.4.3 Organization
    1.4.4 Terminology and Notation
  References

2 Meet the Data
  2.1 The Web
    2.1.1 Datasets and Resources
  2.2 Wikipedia
    2.2.1 The Anatomy of a Wikipedia Article
    2.2.2 Links
    2.2.3 Special-Purpose Pages
    2.2.4 Categories, Lists, and Navigation Templates
    2.2.5 Resources
  2.3 Knowledge Bases
    2.3.1 A Knowledge Base Primer
    2.3.2 DBpedia
    2.3.3 YAGO
    2.3.4 Freebase
    2.3.5 Wikidata
    2.3.6 The Web of Data
    2.3.7 Standards and Resources
  2.4 Summary
  References

Part I Entity Ranking

3 Term-Based Models for Entity Ranking
  3.1 The Ad Hoc Entity Retrieval Task
  3.2 Constructing Term-Based Entity Representations
    3.2.1 Representations from Unstructured Document Corpora
    3.2.2 Representations from Semi-structured Documents
    3.2.3 Representations from Structured Knowledge Bases
  3.3 Ranking Term-Based Entity Representations
    3.3.1 Unstructured Retrieval Models
    3.3.2 Fielded Retrieval Models
    3.3.3 Learning-to-Rank
  3.4 Ranking Entities Without Direct Representations
  3.5 Evaluation
    3.5.1 Evaluation Measures
    3.5.2 Test Collections
  3.6 Summary
  3.7 Further Reading
  References

4 Semantically Enriched Models for Entity Ranking
  4.1 Semantics Means Structure
  4.2 Preserving Structure
    4.2.1 Multi-Valued Predicates
    4.2.2 References to Entities
  4.3 Entity Types
    4.3.1 Type Taxonomies and Challenges
    4.3.2 Type-Aware Entity Ranking
    4.3.3 Estimating Type-Based Similarity
  4.4 Entity Relationships
    4.4.1 Ad Hoc Entity Retrieval
    4.4.2 List Search
    4.4.3 Related Entity Finding
  4.5 Similar Entity Search
    4.5.1 Pairwise Entity Similarity
    4.5.2 Collective Entity Similarity
  4.6 Query-Independent Ranking
    4.6.1 Popularity
    4.6.2 Centrality
    4.6.3 Other Methods
  4.7 Summary
  4.8 Further Reading
  References

Part II Bridging Text and Structure

5 Entity Linking
  5.1 From Named Entity Recognition Toward Entity Linking
    5.1.1 Named Entity Recognition
    5.1.2 Named Entity Disambiguation
    5.1.3 Entity Coreference Resolution
  5.2 The Entity Linking Task
  5.3 The Anatomy of an Entity Linking System
  5.4 Mention Detection
    5.4.1 Surface Form Dictionary Construction
    5.4.2 Filtering Mentions
    5.4.3 Overlapping Mentions
  5.5 Candidate Selection
  5.6 Disambiguation
    5.6.1 Features
    5.6.2 Approaches
    5.6.3 Pruning
  5.7 Entity Linking Systems
  5.8 Evaluation
    5.8.1 Evaluation Measures
    5.8.2 Test Collections
    5.8.3 Component-Based Evaluation
  5.9 Resources
    5.9.1 A Cross-Lingual Dictionary for English Wikipedia Concepts
    5.9.2 Freebase Annotations of the ClueWeb Corpora
  5.10 Summary
  5.11 Further Reading
  References

6 Populating Knowledge Bases
  6.1 Harvesting Knowledge from Text
    6.1.1 Class-Instance Acquisition
    6.1.2 Class-Attribute Acquisition
    6.1.3 Relation Extraction
  6.2 Entity-Centric Document Filtering
    6.2.1 Overview
    6.2.2 Mention Detection
    6.2.3 Document Scoring
    6.2.4 Features
    6.2.5 Evaluation
  6.3 Slot Filling
    6.3.1 Approaches
    6.3.2 Evaluation
  6.4 Summary
  6.5 Further Reading
  References

Part III Semantic Search

7 Understanding Information Needs
  7.1 Semantic Query Analysis
    7.1.1 Query Classification
    7.1.2 Query Annotation
    7.1.3 Query Interpretation
  7.2 Identifying Target Entity Types
    7.2.1 Problem Definition
    7.2.2 Unsupervised Approaches
    7.2.3 Supervised Approach
    7.2.4 Evaluation
  7.3 Entity Linking in Queries
    7.3.1 Entity Annotation Tasks
    7.3.2 Pipeline Architecture for Interpretation Finding
    7.3.3 Candidate Entity Ranking
    7.3.4 Producing Interpretations
  7.4 Query Templates
    7.4.1 Concepts and Definitions
    7.4.2 Template Discovery Methods
  7.5 Summary
  7.6 Further Reading
  References

8 Leveraging Entities in Document Retrieval
  8.1 Mapping Queries to Entities
  8.2 Leveraging Entities for Query Expansion
    8.2.1 Document-Based Query Expansion
    8.2.2 Entity-Centric Query Expansion
    8.2.3 Unsupervised Term Selection
    8.2.4 Supervised Term Selection
  8.3 Projection-Based Methods
    8.3.1 Explicit Semantic Analysis
    8.3.2 Latent Entity Space Model
    8.3.3 EsdRank
  8.4 Entity-Based Representations
    8.4.1 Entity-Based Document Language Models
    8.4.2 Bag-of-Entities Representation
  8.5 Practical Considerations
  8.6 Resources and Test Collections
  8.7 Summary
  8.8 Further Reading
  References

9 Utilizing Entities for an Enhanced Search Experience
  9.1 Query Assistance
    9.1.1 Query Auto-completion
    9.1.2 Query Recommendations
    9.1.3 Query Building Interfaces
  9.2 Entity Cards
    9.2.1 The Anatomy of an Entity Card
    9.2.2 Factual Entity Summaries
  9.3 Entity Recommendations
    9.3.1 Recommendations Given an Entity
    9.3.2 Personalized Recommendations
    9.3.3 Contextual Recommendations
    9.3.4 Explaining Recommendations
  9.4 Summary
  9.5 Further Reading
  References

10 Conclusions and Future Directions
  10.1 Summary of Progress
    10.1.1 Data
    10.1.2 Retrieval Methods
    10.1.3 Understanding and Interacting with Users
  10.2 A Peek into the Future
  10.3 Future Research Directions
    10.3.1 Understanding and Interacting with Users
    10.3.2 Complex Information Needs and Task Completion
    10.3.3 Data and Knowledge
  10.4 Concluding Remarks
  References

Index

Acronyms

| Acronym | Meaning |
|---|---|
| EF | Entity frequency |
| EL | Entity linking |
| ELQ | Entity linking in queries |
| ER | Entity retrieval |
| IEF | Inverse entity frequency |
| INEX | Initiative for the Evaluation of XML Retrieval |
| IR | Information retrieval |
| KB | Knowledge base |
| KG | Knowledge graph |
| KR | Knowledge repository |
| LM | Language models |
| LTR | Learning-to-rank |
| NLP | Natural language processing |
| SDM | Sequential dependence model |
| SERP | Search engine result page |
| SPO | Subject-predicate-object (triple) |
| TREC | Text Retrieval Conference |

Notation

Throughout this book, unless stated otherwise, the notation used is as follows:
| Symbol | Meaning |
|---|---|
| $c(x)$ | Total count of $x$ |
| $c(x;y)$ | Count of $x$ in the context of $y$ |
| $c(x,y;z)$ | Number of times $x$ and $y$ co-occur in the context of $z$ |
| $d$ | Document ($d \in D$) |
| $D$ | Document collection |
| $D_q(k)$ | Top-$k$ ranked documents for query $q$ |
| $e$ | Entity ($e \in E$) |
| $E$ | Entity catalog (set of all entities) |
| $E_q(k)$ | Top-$k$ ranked entities for query $q$ |
| $K$ | Knowledge base (set of SPO triples) |
| $L_e$ | Set of links of an entity $e$ |
| $l_x$ | Representation length of $x$ ($l_x = \sum_{t \in V} c(t;x)$) |
| $q$ | Query |
| $t$ | Term (string token, $t \in V$) |
| $T_e$ | Types of entity $e$ ($T_e \subset T$) |
| $T$ | Type taxonomy |
| $V$ | Vocabulary of terms |
| $\lvert X \rvert$ | Cardinality of set $X$ |
| $Z$ | Normalization factor |
| $\mathbb{1}(x)$ | Binary indicator function (returns 1 if $x$ is true, otherwise 0) |
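
To make the counting notation concrete, the following is a minimal Python sketch of c(t;x), the representation length l_x, and the indicator function 1(x). The whitespace tokenizer, the lowercasing, and all function names are assumptions made for illustration; they are not prescribed by the book.

```python
from collections import Counter


def term_counts(text: str) -> Counter:
    """c(t; x): how often each term t occurs in the text x.

    Lowercasing and whitespace tokenization are simplifying assumptions.
    """
    return Counter(text.lower().split())


def representation_length(counts: Counter) -> int:
    """l_x: the sum of c(t; x) over all terms t in the vocabulary."""
    return sum(counts.values())


def indicator(condition: bool) -> int:
    """1(x): returns 1 if the condition holds, otherwise 0."""
    return 1 if condition else 0


# Hypothetical example text, used only to exercise the functions.
counts = term_counts("Bond girls and Bond villains")
bond_count = counts["bond"]              # c("bond"; x)
length = representation_length(counts)   # l_x
has_bond = indicator("bond" in counts)   # 1("bond" occurs in x)
```

Here `counts` plays the role of c(·;x) for a single text x; counts over a whole collection (the plain c(x) in the table) could be obtained by summing such counters across documents.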

Chapter 1

Introduction

Search engines have become part of our daily lives. We use Google (Bing, Yandex,
Baidu, etc.) as the main gateway to find information on the Web. With a certain type
of content in mind, we may search directly on a particular site or service, e.g., on
Facebook or LinkedIn for people, organizations, and events; on Amazon or eBay
for products; or on YouTube or Spotify for music. Even on our smartphones, we are
increasingly reliant on search functionality to find contacts, email, notes, calendar
entries, apps, etc. We have grown accustomed to expecting a search box somewhere
near the top of the screen, and we have also increased our expectations of the quality
and speed of the responses to our searches.
On the highest level of abstraction, the field of information retrieval (IR)
is concerned with developing technology for matching information needs with
information objects. What we put in the search box, i.e., the query, is an expression
of our information need. It may range from a few simple keywords (e.g., “Bond
girls”) to a proper natural language question (e.g., “What are good digital cameras
under \$300?”). The search engine then responds with a ranked list of items, i.e.,
information objects. Traditionally, these items were documents. In fact, IR has been
seen as synonymous with document retrieval by many. The past decade, however,
has seen an enormous development in search technology. As regular users, we have
witnessed first-hand the transitioning of search engines into “answering engines.”
Contemporary web search engines return rich search result pages, which
include direct displays of entities, facts, and other structured results instead of
merely a list of documents (“ten blue links”), as illustrated in Fig. 1.1. A primary
enabling component behind these advanced search services is the availability
of large-scale structured knowledge repositories (called knowledge bases), which
organize information around specific things or objects (which we will be referring
to as entities). The objective of this book is to give a detailed account of the
developments of a decade of IR research that have enabled us to search for “things,
not strings.”

Fig. 1.1 An example of a rich search result page from the Google search engine. The panel on the
right-hand side of the page is an example of an entity card

1.1 What Is an Entity?
Informally, an entity is a “thing” or “object” that can be referred to. Common
types of entities include, e.g., people, organizations, products, locations, and events.
Producing a precise definition, as we shall see, turns out to be quite challenging. A
commonly accepted definition of an entity is as follows:
An entity is an object or concept in the real world that can be distinctly identified.

However, this definition is not without complications. Let us take the entity
“Superman” as an example. Does it refer to the fictional comic book superhero,
to the comic book itself, or to the actor who is playing the character in the
movie adaptation? Entity identity is a hard question to tackle. Part of the issue
is related to defining “the” (real) world. Any attempt to resolve this is likely to
lead to a long philosophical debate about “existence.” Therefore, we will resort to
a more pragmatic and data-oriented approach. For that, we go all the way back
to database management systems of the 1970s, where the importance of entities,
as meaningful units for organizing information, was recognized. The entity-relationship (ER) model proposed by Chen [11] in 1976 is a high-level conceptual
data model that “incorporates some of the important semantic information about
the real world” [11]. The ER model revolves around real-world entities and the
associations among them. Both entities and relationships are described by means of
their properties (attribute-value pairs). Further, an entity is an instance of a given
entity type (i.e., a semantic class). We capture these key facets of entities in the
following definition:
Definition 1.1 An entity is a uniquely identifiable object or thing, characterized by its name(s), type(s), attributes, and relationships to other entities.
We circumvent the “existential” questions by restricting our universe to some
particular registry of entities, which we will refer to as the entity catalog. Thus,
we consider that an entity “exists” if and only if it is an entry in the given entity
catalog.
Definition 1.2 An entity catalog is a collection of entries, where each entry is
identified by a unique ID and contains the name(s) of the corresponding entity.
The entity catalog defines the universe of entities by providing entities with unique
identifiers. While this alone can turn out to be surprisingly useful, we typically have
more knowledge about entities (regarding their types, attributes, and relationships).
We will shortly come back to the question of how to represent this knowledge, in
Sect. 1.1.3.
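
To make Definition 1.2 more tangible, the following is a minimal Python sketch of an entity catalog as a mapping from unique IDs to names, together with a reverse index from names to IDs. The class name `EntityCatalog`, its methods, and the IDs `e1` to `e3` are illustrative assumptions and do not correspond to any particular system.

```python
from collections import defaultdict


class EntityCatalog:
    """A toy entity catalog: unique ID -> name(s). Hypothetical design."""

    def __init__(self):
        self._names_by_id = {}
        self._ids_by_name = defaultdict(set)

    def add(self, entity_id, names):
        # IDs must be unique within the catalog.
        if entity_id in self._names_by_id:
            raise ValueError(f"duplicate entity ID: {entity_id}")
        if not names:
            raise ValueError("an entry needs at least one name")
        self._names_by_id[entity_id] = list(names)
        for name in names:
            # Names are matched case-insensitively in this sketch.
            self._ids_by_name[name.lower()].add(entity_id)

    def exists(self, entity_id):
        # An entity "exists" if and only if it is an entry in the catalog.
        return entity_id in self._names_by_id

    def names(self, entity_id):
        return list(self._names_by_id[entity_id])

    def lookup(self, name):
        # A name does not identify an entity uniquely, so a list is returned.
        return sorted(self._ids_by_name.get(name.lower(), set()))


catalog = EntityCatalog()
# The IDs below are made up for illustration.
catalog.add("e1", ["Michael Jordan", "Michael Jeffrey Jordan"])
catalog.add("e2", ["Michael Jordan", "Michael I. Jordan"])
catalog.add("e3", ["Barack Obama", "President Obama"])

print(catalog.exists("e3"))
print(catalog.lookup("michael jordan"))
```

Even this bare-bones structure shows why IDs, and not names, define the universe of entities: a lookup by name may return several entries, whereas a lookup by ID returns at most one.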

1.1.1 Named Entities vs. Concepts
Entities are most commonly thought of as real-world objects represented by a proper
noun. There are, in fact, two main classes of entities that may be distinguished:
• Named entities are real-world objects that can be denoted by a proper noun.
Examples include specific persons, locations, organizations, products, events,
etc.
• Concepts are abstract objects, including, but not limited to, mathematical and
philosophical concepts (e.g., “distance,” “axiom,” “quantity”), physical concepts
and natural phenomena (e.g., “gravity,” “force,” “wind”), psychological concepts
(e.g., “emotion,” “thought,” “identity”), and social concepts (e.g., “authority,”
“human rights,” “peace”).

These two classes generally correspond to the dichotomy between concrete and
abstract objects in philosophy. It is worth noting that the distinction between
concrete/abstract objects has a curious status in contemporary philosophy, with
many plausible ways of drawing the line between the two [34].
As far as our work is concerned, this distinction is mostly of a philosophical
nature. From a technical perspective, the exact same methods may be used for
named entities and/or concepts. Thus, unless stated otherwise, whenever we write
entity in this book, we mean both of them. Nevertheless, the focus of practical
application scenarios is, more commonly than not, restricted to named entities.

1.1.2 Properties of Entities
We shall collectively refer to all information associated with an entity (e.g., the
unique identifier, names, types, attributes, and relationships) as entity properties.
Let us now explore each of these properties in a bit more detail.
Unique identifier: Entities need to be uniquely identifiable. There must be a one-to-one correspondence between each entity identifier (ID) and the (real-world
or fictional) object it represents (i.e., within a given entity catalog; the same
entity may exist under different identifiers in other catalogs). Examples of entity
identifiers from past IR benchmarking campaigns include email addresses for
people and uniform resource identifiers (URIs, within Linked Data repositories).
Name(s): Entities are known and referred to by their name—usually, a proper
noun. Unlike IDs, names do not uniquely identify entities; multiple entities may
share the same name (e.g., “Michael Jordan”). Also, the same entity may be
known by more than a single name (e.g., “Barack Obama,” “President Obama,”
“Barack Hussein Obama II”). These alternative names are called surface forms
or aliases. Humans can easily resolve the ambiguity of entity references from
the context of the mention most of the time. For machines, automatically
disambiguating entity references presents many challenges.
Type(s): Entities may be categorized into multiple entity types (or types for
short). Types can also be thought of as containers (semantic categories) that
group together entities with similar properties. An analogy can be made to object-oriented programming, whereby an entity of a type is like an instance of a class.
The set of possible entity types is often organized in a hierarchical structure,
i.e., a type taxonomy. For example, the entity Albert Einstein is an instance of the
type “scientist,” which is a subtype of “person.”
Attributes: The characteristics or features of an entity are described by a set of
attributes. Different types of entities are typically characterized by different sets
of attributes. For example, the attributes of a person include the date and place of
birth, weight, height, parents, spouses, etc. The attributes of a populated place
include latitude, longitude, population, postal code(s), country, continent, etc.
Notice that some of the items in these lists are entities themselves, e.g., locations
or persons. We do not treat those as attributes but consider them separately, as
relationships. Attributes always have literal values; optionally, they may also
be accompanied by data type information (such as number, date, geographic
coordinate, etc.).
Relationships: In the words of Booch [9]: “an object by itself is intensely
uninteresting.” Relationships describe how two entities are associated with each
other. From a linguistic perspective, entities may be thought of as proper
nouns and relationships between them as verbs. For example, “Homer wrote
the Odyssey” or “The General Theory of Relativity was discovered by Albert
Einstein.” Relationships may also be seen as “typed links” between entities.
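
The properties discussed above can be gathered in a single record. The sketch below is one possible way of doing so in Python; the class `Entity`, the helper `all_types`, the `ex:` identifiers, and the one-entry type taxonomy are all hypothetical and serve only to illustrate the distinction between attributes (literal values) and relationships (links to other entities).

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    entity_id: str                                       # unique within a catalog
    names: list                                          # surface forms / aliases
    types: list = field(default_factory=list)            # assigned entity types
    attributes: dict = field(default_factory=dict)       # attribute -> literal value
    relationships: list = field(default_factory=list)    # (predicate, target entity ID)

    def related(self, predicate):
        # Follow the "typed links" with the given label.
        return [target for pred, target in self.relationships if pred == predicate]


# Toy type taxonomy (assumption): subtype -> supertype.
TYPE_PARENT = {"scientist": "person"}


def all_types(entity):
    """Expand the assigned types with their ancestors in the toy taxonomy."""
    seen = []
    for t in entity.types:
        while t is not None and t not in seen:
            seen.append(t)
            t = TYPE_PARENT.get(t)
    return seen


einstein = Entity(
    entity_id="ex:Albert_Einstein",
    names=["Albert Einstein"],
    types=["scientist"],
    attributes={"birth_date": "1879-03-14"},     # literal value: an attribute
    relationships=[("birth_place", "ex:Ulm")],   # another entity: a relationship
)

print(all_types(einstein))
print(einstein.related("birth_place"))
```

Note that the date of birth is stored as an attribute, while the place of birth, being an entity itself, is stored as a relationship.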

1.1.3 Representing Properties of Entities
Information about entities can be represented and stored in semi-structured or in
structured form.
Definition 1.3 A knowledge repository (KR) is a catalog of entities that
contains entity type information, and (optionally) descriptions or properties
of entities, in a semi-structured or structured format.
Wikipedia is a classic example of a knowledge repository. Each article in Wikipedia
is an entry that describes a particular entity. Articles are also assigned to categories
(which can be seen as entity types) and contain hyperlinks to other articles (thereby
indicating the presence of a relationship between two entities, albeit not the type of
the relationship). Wikipedia articles also contain information about attributes and
relationships of entities, but not in a structured form.
To organize and store information about entities in a structured form, one
needs a knowledge representation model. The Resource Description Framework
(RDF), which we will discuss in detail in Sect. 2.3.1.2, is the prevalent standard for
describing entities (and, more generally, resources). An entity can be represented as
a set of RDF statements. These statements may be seen as facts or assertions about
that entity. A knowledge base is a structured knowledge repository for storing such facts.

Definition 1.4 A knowledge base (KB) is a structured knowledge repository
that contains a set of facts (assertions) about entities.

According to our definition, all knowledge bases are also knowledge repositories,
but the reverse is not true.

Fig. 1.2 Illustration of the relationship between entity catalog, knowledge repository, and knowledge base, each complementing and extending the previous concept. The entity properties marked
with * are mandatory
<dbr:Kimi_Raikkonen>
    <foaf:name>        "Kimi Räikkönen"
    <dbo:birthPlace>   <dbr:Espoo>
    <dbo:nationality>  <dbr:Finland>
    <dct:description>  "Finnish race driver"
    <dbo:birthDate>    "1979-10-17"
    <rdf:type>         <dbo:RacingDriver>
    <dct:subject>      <dbc:Finnish_racing_drivers>
    <dct:subject>      <dbc:Ferrari_Formula_One_drivers>
    <rdfs:comment>     "Kimi-Matias Räikkönen [...] nicknamed "The Ice Man",
                       is a Finnish racing driver currently driving for
                       Ferrari in Formula One. [...]"

Listing 1.1 Excerpt from the DBpedia knowledge base entry of KIMI RÄIKKÖNEN

Conceptually, entities in a knowledge base may be seen as nodes of a graph, with
the relationships between them as (labeled) edges. Thus, especially when this graph
nature is emphasized, a knowledge base may also be referred to as a knowledge
graph (KG). Figure 1.2 shows the relationship between these concepts.
To give an idea of what a knowledge base entry of an entity looks like, we refer
to Listing 1.1. This particular example is from the DBpedia knowledge base, showing
an excerpt from the entry of the entity KIMI RÄIKKÖNEN who is displayed on the
entity card in Fig. 1.1. We are going to cover knowledge bases and the RDF model
in greater detail in Chap. 2.
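
As a rough illustration of how such an entry can be processed, the Python sketch below stores a few of the statements from Listing 1.1 as (subject, predicate, object) tuples and separates them into types, attributes, and relationships. The angle-bracket convention for telling entities from literals and the function names are simplifying assumptions made for this example; they are not how RDF libraries represent terms.

```python
# Each fact is a (subject, predicate, object) triple. In this sketch, a term
# wrapped in angle brackets denotes a resource; anything else is a literal.
TRIPLES = [
    ("<dbr:Kimi_Raikkonen>", "<foaf:name>", "Kimi Räikkönen"),
    ("<dbr:Kimi_Raikkonen>", "<dbo:birthPlace>", "<dbr:Espoo>"),
    ("<dbr:Kimi_Raikkonen>", "<dbo:nationality>", "<dbr:Finland>"),
    ("<dbr:Kimi_Raikkonen>", "<dbo:birthDate>", "1979-10-17"),
    ("<dbr:Kimi_Raikkonen>", "<rdf:type>", "<dbo:RacingDriver>"),
]


def is_entity(term):
    return term.startswith("<") and term.endswith(">")


def describe(triples, subject):
    """Split the statements about a subject into types, attributes, relationships."""
    types, attributes, relationships = [], [], []
    for s, p, o in triples:
        if s != subject:
            continue
        if p == "<rdf:type>":
            types.append(o)
        elif is_entity(o):
            # Object is another entity: a labeled edge in the knowledge graph.
            relationships.append((p, o))
        else:
            # Object is a literal value: an attribute.
            attributes.append((p, o))
    return types, attributes, relationships


types, attributes, relationships = describe(TRIPLES, "<dbr:Kimi_Raikkonen>")
print(types)
print(attributes)
print(relationships)
```

The `relationships` list corresponds to the outgoing labeled edges of the node in the knowledge graph view, while `attributes` holds the literal-valued properties.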

1.2 A Brief Historical Outlook
Before delving into the topic of entity-oriented search, it is important to put things
in historical context. Therefore, in this section, we present a broad perspective on
developments within multiple fields of computer science, in particular information
retrieval (IR), databases (DB), natural language processing (NLP), and the Semantic
Web (SW). Even though they have developed largely independently of each other,
concentrated on separate problems, and operated on different types of data, they
seem to converge on a common theme: entities as units for capturing, storing,
organizing, and accessing information.

1.2.1 Information Retrieval
According to an early definition by Salton [35] from 1968, “Information retrieval
is a field concerned with the structure, analysis, organization, storage, searching,
and retrieval of information.” From its inception, IR has always kept a strong focus
on evaluating the effectiveness of systems: “determining the relevance of items,
retrieved by a search engine, relative to a user’s information need” [36]. The launch
of the Text REtrieval Conference (TREC) series in 1992, co-sponsored by the US
National Institute of Standards and Technology (NIST) and the US Department of
Defense, has had a profound impact on the field, by standardizing retrieval evaluation through the creation of large test collections. TREC was followed by Asian and
European sister events, the NII Test Collection for IR Systems (NTCIR) in 1999, and
the Conference and Labs of the Evaluation Forum (CLEF, formerly Cross-Language
Evaluation Forum) in 2000. These benchmarking campaigns follow an annual cycle.
Each edition features a number of specific tasks, which are thematically organized
into different “tracks.” By looking at the development of these tracks, one can get a
good overview of how the focus of research in IR has shifted over the years.
Up to the mid-1990s, the field primarily focused on documents as the unit
of retrieval. Driven by the motto “users want answers, not documents,” a new front
of IR research emerged with the arrival of the TREC Question Answering track
in 1999. Question answering systems respond with a short, focused answer to a
question formulated in natural language, e.g., “Who invented the paper clip?” or
“How many calories are there in a Big Mac?” The expert finding task at the TREC
Enterprise track (2005–2008) concentrated on answering a more specific type of
question: “Who are the experts on topic X?” Here, the input is a keyword query,
specifying the area of expertise (e.g., “XML schema”), and the system answers this
by returning a ranked list of people. The INEX Entity Ranking (2007–2009) and the
TREC Entity (2009–2011) tracks broadened the task
to arbitrary entity types, laying the groundwork for the area of entity retrieval.
With the transition from documents to entities as the units of retrieval also
came an increased reliance on structured data sources, known as knowledge bases.
The TREC Knowledge Base Acceleration track (2012–2014) aimed at developing
technology that can aid humans in maintaining and expanding information stored
in knowledge bases. Industry (especially major web search engines, like Google) has also played a prominent
role in shaping the field. Search has become a commodity, and users have grown
accustomed to expressing their information needs using short keyword queries, and
getting, most of the time, relevant results almost instantly. At the same time, the
massive volumes of usage data collected from users allow for improved methods,
by harnessing the “wisdom of the crowds.” As Liu [24] explains, “given the amount
of potential training data available, it has become possible to leverage machine
learning technologies to build effective ranking models.” Such models exploit a
large number of features by means of discriminative learning, known as “learning-to-rank” [24].

Table 1.1 Comparison of database systems and information retrieval, based on [40]

                     Database systems             Information retrieval
Data type            Numbers, short strings       Text
Foundation           Algebraic/logic based        Probabilistic/statistics based
                     Boolean retrieval            Ranked retrieval
Queries              Structured query languages   Free text queries
Evaluation criteria  Efficiency                   Effectiveness (user satisfaction)
User                 Programmer                   Nontechnical person

1.2.2 Databases
“A database management system is a software system that enables the creation,
maintenance, and use of large amounts of data” [1]. This definition suggests that
database systems and information retrieval have a lot in common. This is indeed the
case, yet DB and IR emphasize very different aspects of information management.
Databases contain highly structured data, which is queried by expert users (i.e.,
programmers) using formal query languages, like SQL. The focus is on precise
query processing and efficiency. IR systems, on the other hand, “understand queries
as approximate, best-effort formulations of the user’s information needs” [40].
Search is an interactive process, which often involves multiple query reformulations
upon the inspection of results. Table 1.1 summarizes the traditional differences
between DB and IR systems. Given the complementary foci and techniques in DB
and IR, the two fields can benefit from each other’s developments. For instance, IR
can profit from efficient indexing structures, whereas DB can make use of natural
language search interfaces and probabilistic ranking mechanisms from IR. While the
traditional boundaries between these two fields still exist, they are getting blurred.
Entity retrieval is a cross-over application area between IR and DB that requires
flexible ranking on text, categorical, and numerical attributes. Additionally, the
search also needs to be able to cope with “no answers” and “too many answers.”
Searching online product catalogs is a good illustrative example, where users issue
keyword queries but also use various filters (e.g., via faceting) to narrow down the
scope of results. Many of these queries could be answered more or less exactly, but
many others will require probabilistic scoring and ranking.
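
The product catalog scenario can be sketched in a few lines of Python, combining exact, database-style filtering on categorical and numerical attributes with a best-effort, IR-style ranking on text. Everything in this example is an assumption made for illustration: the products, the brand names, the function `search`, and in particular the scoring function (the fraction of query terms found in the title), which stands in for a proper probabilistic ranking model.

```python
# A made-up product catalog with text, categorical, and numerical attributes.
PRODUCTS = [
    {"id": "p1", "title": "wireless noise cancelling headphones", "brand": "Acme", "price": 199.0},
    {"id": "p2", "title": "wired studio headphones", "brand": "Acme", "price": 89.0},
    {"id": "p3", "title": "wireless earbuds", "brand": "Globex", "price": 59.0},
    {"id": "p4", "title": "wireless headphones with microphone", "brand": "Globex", "price": 129.0},
]


def search(products, keywords, brand=None, max_price=None):
    query_terms = keywords.lower().split()
    if not query_terms:
        return []
    results = []
    for product in products:
        # Exact (Boolean) filtering, as a facet selection would do.
        if brand is not None and product["brand"] != brand:
            continue
        if max_price is not None and product["price"] > max_price:
            continue
        # Best-effort scoring: fraction of query terms that occur in the title.
        title_terms = set(product["title"].split())
        score = sum(term in title_terms for term in query_terms) / len(query_terms)
        if score > 0:
            results.append((score, product["id"]))
    # Highest score first; ties are broken by ID to keep the order deterministic.
    return [pid for score, pid in sorted(results, key=lambda r: (-r[0], r[1]))]


print(search(PRODUCTS, "wireless headphones"))
print(search(PRODUCTS, "wireless headphones", max_price=150))
print(search(PRODUCTS, "turntable"))
```

The filters either admit or reject a product, whereas the keyword part yields a graded score, so partially matching products are still returned, only lower in the ranking. The last call illustrates the “no answers” case.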

As we have already discussed in Sect. 1.1, it has been realized very early on in
the database field that entities offer a disciplined way of handling data. The entity-relationship approach of Chen [11] was originally proposed as a semantic data
model, to provide a better representation of real-world entities. Entity-relationship
diagrams, which are built up of entities, relationships, and attributes, are now
normally used as a conceptual modeling technique [7]. The field of databases
recognized the need for an entity-centric view of web content about the same
time as IR did [13, 40]. The recent focus in databases—within our interest area—
has primarily been on developing indexing schemes that facilitate efficient query
processing [10, 12], and on interpreting queries with the help of structured data, i.e.,
translating keyword queries to structured queries [18, 31, 38, 41].
Additionally, the field of databases also deals with a range of data integration
and data quality problems, such as record linkage (a.k.a. entity resolution) [14, 16]
or schema mapping [33]. We consider these to be outside the scope of this book.

1.2.3 Natural Language Processing
Most research in natural language processing (or computational linguistics) aims
to capture the meaning of text. One might divide NLP problems into (1) low-level
parsing and segmentation tasks, (2) linguistic annotations, and (3) end-user applications. Common text parsing and segmentation tasks include sentence breaking, word
segmentation, stemming, and lemmatization. Linguistic annotation tasks include
part-of-speech tagging, word sense disambiguation, named entity recognition and
disambiguation, coreference resolution, temporal tagging, semantic role labeling,
and dependency parsing. These annotations are meant to yield deeper representations that are closer to meaning and may be exploited in real-world applications.
End-user applications include, among others, information extraction, machine
translation, text summarization, sentiment analysis, and question-answering. For
us, the most relevant of these is information extraction (IE), which “refers to
the automatic extraction of structured information such as entities, relationships
between entities, and attributes describing entities from unstructured sources” [37].
There are two main modes in which an IE system may be deployed: one is to
annotate text with the identified mentions of structured information, another is to
populate a knowledge base with the extracted information. Information extraction is
narrower in scope than full text understanding—which is still beyond our capabilities today. Nevertheless, identifying entities and relationships makes it possible to
capture, to a large extent, what a given piece of text is about. Furthermore, entities
can serve as a pivot for connecting unstructured text and structured knowledge
bases. While rooted in NLP, the problem area of extracting structured information
from unstructured sources now engages the IR, DB, machine learning, and Web
communities as well. Over time, the scope of IE systems was expanded to include
the extraction of not only atomic elements (entities and relations) but of higher-order
structures as well, such as tables and lists [15, 25, 29].

Up until the late 1980s, most NLP systems employed rule-based approaches,
which relied heavily on linguistic theory. Then came the “statistical revolution,”
introducing machine learning algorithms for language processing that could learn
from manually annotated corpora [22]. The current state of the art “draws far more
heavily on statistics and machine learning than it does on linguistic theory” [22].
Today, a broad range of robust, efficient, and scalable techniques for shallow NLP
processing (as opposed to deep linguistic analysis) are available [30].

1.2.4 Semantic Web
The Semantic Web is a relatively young field, especially compared to the other
three (IR, DB, NLP). The term was coined by Tim Berners-Lee, referring to an
envisioned extension of the original Web. While the original Web is a medium of
documents for people (i.e., the Web of Documents), the Semantic Web is meant to
be a Web of “actionable information,” i.e., an environment that enables intelligent
agents to carry out sophisticated tasks for users. The Semantic Web is “a Web
of relations between resources denoting real world objects, i.e., objects such as
people, places and events” [19]. The challenge of the Semantic Web, as explained
in the 2001 Scientific American by Berners-Lee et al. [6], is “to provide a language
that expresses both data and rules for reasoning about the data.” Thus, from the
late 1990s and throughout the 2000s, a great deal of effort was expended toward
establishing standards for knowledge representation. Several important technologies
were introduced:
• The Uniform Resource Identifier (URI), to be able to uniquely identify “things”
(i.e., entities, which are called resources);
• The eXtensible Markup Language (XML), to add structure to web pages;
• The Resource Description Framework (RDF), to encode meaning in a form of
(sets of) triples;
• Various serializations for storing and transmitting RDF data, e.g., Notation-3,
Turtle, N-Triples, RDFa, and RDF/JSON;
• The SPARQL query language, to retrieve and manipulate RDF data;
• A large palette of techniques to describe and define vocabularies, including the
RDF Schema (RDFS), the Simple Knowledge Organization System (SKOS), and
the Web Ontology Language (OWL).
These technologies together form a layered architecture, referred to as the Semantic
Web Stack.
In terms of large-scale, agent-based mediation with heterogeneous data, the
Semantic Web is a dream that has not (yet) come true. The Semantic Web movement, nevertheless, has resulted in structured data on an unprecedented
scale. As a terminological distinction, Semantic Web is often used to refer to the
various standards and technologies, while the data that is being published using
Semantic Web standards is called Linked Data or the Web of Data. Linked Data may
be exposed as semantic mark-up embedded within HTML pages or as entire datasets
(i.e., knowledge bases) published as RDF (e.g., DBpedia or Wikidata). A key idea
is that resources that refer to the same real-world entity may be interlinked across
different sources.
Ontologies, for automated inference or for integrating heterogeneous data, have
seen little adoption in the search industry. Recent efforts are geared toward speaking
the same language using a shared vocabulary. Schema.org is a collaborative activity
by major search providers (including Google, Microsoft, Yahoo, and Yandex) in
order to define a standard for semantic markup. At the time of writing, over 10
million sites use Schema.org to mark up their web pages and email messages.
Regarding information access, it was realized that formal, structured query
languages, like SPARQL, are unsuitable for ordinary users, who would prefer simple
keyword search. Thus, the Semantic Web community has adopted IR-style ranking
models for retrieving specific entities [8, 17, 27].

1.3 Entity-Oriented Search
We use the term entity-oriented search to refer to a broad range of information
to documents.

Definition 1.5 Entity-oriented search is the search paradigm of organizing
and accessing information centered around entities, and their attributes and
relationships.

The significance of this information access paradigm is twofold:
• From a user perspective, entities are natural units for organizing information. We
care about and mostly think in terms of real-world things and their connections.
Allowing users to interact with specific entities offers a richer and more effective
user experience than what is provided by conventional document-based retrieval
systems.
• From a machine perspective, entities allow for a better understanding of search
queries, of document content, and even of users (e.g., their context and preferences). Entities enable search engines to be more intelligent.

1.3.1 A Bird’s-Eye View
Fig. 1.3 Architecture of an entity-oriented search system

Figure 1.3 shows a high-level overview of an entity-oriented search system. At first
glance, one might say that this looks a lot like any conventional (i.e., document-oriented)
retrieval system. While that observation is indeed valid from this distance,
there is a single, yet important difference on the data end. The document collection
is complemented with a knowledge repository. The knowledge repository contains,
at the bare minimum, an entity catalog: a dictionary of entity names and unique
identifiers. Typically, the knowledge repository also contains the descriptions and
properties of entities in semi-structured (e.g., Wikipedia) or structured format
(e.g., Wikidata, DBpedia). Commonly, the knowledge repository also contains
ontological resources (e.g., a type taxonomy).
Next, we briefly look at the three main components depicted in Fig. 1.3, moving
from left to right.

1.3.1.1 Users and Information Needs
Users may articulate their information needs in many different ways. Typically, keyword, structured,
and natural language queries are distinguished [4]. We complement this list with keyword++ and zero-query search.
Keyword queries: Thanks to major web search engines, keyword queries have
become the “dominating lingua franca of information access” [2]. Keyword
queries are also known as free text queries: “a query in which the terms of
the query are typed freeform into the search interface, without any connecting
search operators (such as Boolean operators)” [26]. Keyword queries are easy to
formulate but, by their very nature, are imprecise.
Structured queries: Structured data sources (databases and knowledge bases) are
traditionally queried using formal query languages (such as SQL or SPARQL).
These queries are very precise. However, formulating them requires a “knowledge of the underlying schema as well as that of the query language” [3].
Structured queries are primarily intended for expert users and well-defined,
precise information needs.
Keyword++ queries: We use the term keyword++ query (coined in [3]) to refer
to keyword queries that are complemented with additional structural elements.
For example, when users supply target categories or various filters via faceted
search interfaces, those extra pieces of input constitute the ++ part. With well-designed user interfaces, supplying these does not induce a cognitive load on the
user. Keyword++ queries may be seen as “fielded” keyword queries.
Natural language queries: Information needs can be formulated using natural
language, the same way as one human would express it to another in an everyday
conversation. Often, natural language queries take a question form. Also, such
queries are increasingly spoken aloud with voice search, instead of being
typed [28].
Zero-query: The traditional way of information access is reactive: the search
system responds to a user-initiated query. Proactive systems, on the other hand,
“anticipate and address the user’s information need, without requiring the user
to issue (type or speak) a query” [5]. The zero-query search paradigm can be
expressed with the slogan “the query is the user.” In practice, the context of the
user is used to infer information needs.
Sawant and Chakrabarti [39] refer to queries typically sent to search engines
as “telegraphic queries.” These are not well-formed grammatical sentences or
questions. Keyword queries could also be described as “shallow” natural language queries.
For example, most users would simply issue “birth date neil armstrong.” With voice
search becoming increasingly prevalent, especially on mobile devices, the user could alternatively ask the question: “When was Neil Armstrong born?” Bast
et al. [4] point out that “keyword search and natural language search are less clearly
delineated than it may seem.” The distinction often depends on the processing
technique used rather than the query text itself. In this book, we will concentrate on
keyword (and keyword++) queries. We note that the same techniques may be applied
for natural language queries as well (but will likely yield suboptimal results).
1.3.1.2 Search Engine
At this high-level perspective, the search engine consists of two main parts: the
user interface and the retrieval system. The former takes care of the interaction
with the user, from the formulation of the information need to the presentation
of search results. The “single search box” paradigm became extremely popular
thanks to major web search engines. Recently, natural language interfaces have also
been receiving increased attention. These allow users to pose a (possibly complex)
question in natural language (instead of merely a list of keywords). The retrieval
system interprets the search request and compiles a response. Modern web search
engine result pages are composed of a ranked list of documents (web pages),
entity cards, direct answers, and other knowledge panels, along with further entity
recommendations and suggestions for query reformulations. In vertical search, the
result list comprises a ranked list of entities, possibly grouped by entity type. Our
main focus in this book will be on how to generate entity-oriented responses.

1.3.1.3 Data
We distinguish between three main types of data.
Unstructured data can be found in vast quantities in a variety of forms: web
pages, spreadsheets, emails, blogs, tweets, medical records, etc. Without making
any assumptions about the format, all these may be treated as textual documents,
i.e., a sequence of words.
Semi-structured data is characterized by the lack of rigid, formal structure. Typically, it contains tags or other types of markup to separate textual content from
semantic elements. Semi-structured data is “self-describing,” i.e., “the schema is
contained within the data and is evolving together with the content” [3].
Structured data adheres to a predefined (fixed) schema and is typically organized in a tabular format—think of relational databases. The schema serves as
a blueprint of how the data is organized, describes how real-world entities are
modeled, and imposes constraints to ensure the consistency of the data.
In Fig. 1.3, the document collection is an unstructured or semi-structured data
source. The knowledge repository may be either in semi-structured (e.g., RDF) or
in structured format (e.g., a relational database). One of the challenges in entity-oriented search is that information about a given entity has to be collected and
aggregated across noisy, heterogeneous, and potentially conflicting data sources,
both unstructured and structured.

1.3.2 Tasks and Challenges
Next, we identify a number of specific tasks, and related challenges, that we will
be concerned with in this book. These can be organized around three main thematic
areas. In fact, these themes largely correspond to the three parts of the book.

1.3.2.1 Entities as the Unit of Retrieval
According to various studies, 40–70% of queries in web search mention or target
specific entities [20, 23, 32]. These queries commonly seek a particular entity,
albeit often an ambiguous one (e.g., “harry potter”) or a list of entities (e.g.,
“doctors in barcelona”). Such queries are better answered by returning a ranked
list of entities, as opposed to a list of documents. We refer to this as the task
of entity retrieval. There are three main challenges involved here: (1) how to
represent information needs, (2) how to represent entities (using both unstructured
and structured datasets), and (3) how to match those representations. One of the
most exciting opportunities in entity retrieval is how to leverage the additional
structure associated with entities in the knowledge repository—attributes, types, and
relationships—to improve retrieval effectiveness.

1.3.2.2 Entities for Knowledge Representation
Entities help to bridge the gap between the worlds of unstructured and structured
data: they can be used to semantically enrich unstructured text, while textual sources
may be utilized to populate structured knowledge bases.
Recognizing mentions of entities in text and associating these mentions with the
corresponding entries in a knowledge base is known as the task of entity linking.
Entities allow for a better understanding of the meaning of text, both for humans and
for machines. While humans can relatively easily resolve the ambiguity of entities,
based on the context in which they are mentioned, for machines this presents many
difficulties and challenges.
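
To give a flavor of the two steps involved (recognizing mentions and choosing the corresponding entry), the following Python sketch implements a deliberately naive entity linker. It is not the method developed later in the book: mentions are found by longest-match lookup in a surface form dictionary, and ambiguity is resolved by counting how many context words of each candidate occur in the text. The dictionaries, the `ex:` identifiers, and the function names are all hypothetical.

```python
import re

# Hypothetical surface form dictionary: lowercased name -> candidate entity IDs.
SURFACE_FORMS = {
    "michael jordan": ["ex:Michael_Jordan_(basketball)", "ex:Michael_I._Jordan"],
    "chicago bulls": ["ex:Chicago_Bulls"],
}

# Hypothetical context words associated with each candidate entity.
CONTEXT_WORDS = {
    "ex:Michael_Jordan_(basketball)": {"basketball", "nba", "bulls", "chicago"},
    "ex:Michael_I._Jordan": {"machine", "learning", "berkeley", "statistics"},
    "ex:Chicago_Bulls": {"basketball", "nba", "team"},
}


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def link_entities(text, max_len=3):
    tokens = tokenize(text)
    context = set(tokens)
    links = []
    i = 0
    while i < len(tokens):
        match = None
        # Mention detection: prefer the longest surface form starting at i.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            mention = " ".join(tokens[i:i + n])
            if mention in SURFACE_FORMS:
                match = (mention, n)
                break
        if match is None:
            i += 1
            continue
        mention, n = match
        # Disambiguation: pick the candidate sharing the most words with the text.
        best = max(
            SURFACE_FORMS[mention],
            key=lambda e: len(CONTEXT_WORDS.get(e, set()) & context),
        )
        links.append((mention, best))
        i += n
    return links


print(link_entities("Michael Jordan won six titles with the Chicago Bulls."))
print(link_entities("Michael Jordan teaches machine learning at Berkeley."))
```

The same surface form is linked to different entries depending on the surrounding words, which is the kind of contextual evidence that humans use effortlessly.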
The knowledge base entry of an entity summarizes what we know about that
entity. As the world is constantly changing, new facts keep surfacing. Keeping up
with these changes requires a continuous effort from editors and content managers.
This is a demanding task at scale. By analyzing the contents of documents in
which entities are mentioned, this process—of finding new facts or facts that need
updating—may be supported, or even fully automated. We refer to this as the
problem of knowledge base population.

1.3.2.3 Entities for an Enhanced User Experience
Besides being meaningful retrieval and information organization units, entities can
improve the user experience throughout the entire search process. This starts with
query assistance services that can aid users in articulating their information needs.
Next, entities may be utilized for improved content understanding, by connecting
entities and facts to queries and documents. For example, they make it possible to
automatically direct requests to specific services or verticals (sites dedicated to a
specific segment of online content). When presenting retrieval results, knowledge
about entities may be used to complement the traditional document-oriented search
results (i.e., the “ten blue links”) with various information boxes and knowledge
panels (as shown in Fig. 1.1). Finally, entities may be harnessed for providing
contextual recommendations. See, e.g., the “People also search for” section on
Fig. 1.1.

1.3.3 Entity-Oriented vs. Semantic Search
Entity-oriented and semantic search are often mentioned in the same context, and
even treated as casual synonyms by many. The question inevitably arises: What is
the difference between the two (if any)?
There is no agreed definition of semantic search; in fact, the term itself is highly
contested. One of the first published references to the term appeared in a 2003 paper
by Guha et al. [19]: “Semantic Search attempts to augment and improve traditional
search results (based on Information Retrieval technology) by using data from the
Semantic Web.” Since the Semantic Web is primarily organized around real-world
objects and their relationships, according to this definition, entity-oriented search
could indeed be seen as synonymous with semantic search. According to a more
recent definition attributed to John [21], “Semantic Search is defined as search for
information based on the intent of the searcher and contextual meaning of the search
terms, instead of depending on the dictionary meaning of the individual words in the
search query.”
We prefer to take a broader view on semantic search, which is as follows.

Definition 1.6 Semantic search encompasses a variety of methods and
approaches aimed at aiding users in their information access and consumption
activities, by understanding their context and intent.

This definition emphasizes the overall high-level objective, an improved user
experience, without restricting the techniques to explicit semantics. This definition
includes, among others, implicit semantics, such as term dependencies, topic
models, or latent space models. Furthermore, we do not limit semantic search to the
also fall under the umbrella of semantic search. Simply put, semantic search is
broader than entity-oriented search. Entities, nonetheless, play a leading role in it.
Throughout this book, our notion of semantics will be the following: references
to meaningful, i.e., machine understandable (ontological or linguistic) structures.

1.3.4 Application Areas
Where can entity-oriented search technology be applied? Obviously, web search is
the most prominent application area, but it is certainly not the only one. Entities play
a major role in a wide range of information access scenarios, including enterprise
search, domain-specific and vertical search (e.g., e-commerce, automotive industry,
medical search, legal information, scholarly literature, job search, and travel), social
networking, and intelligence services. Unlike web search, most of these focus on a
single or at most a handful of entity types in a given domain. Furthermore, entities
have an important function in question answering systems and in personal digital
assistants.

The book aims to cover all facets of entity-oriented search—where “search” can
be interpreted in the broadest sense of information access—from a unified point of
view, and provide a coherent and comprehensive overview of the state of the art.
This work is the first synthesis of research in this broad and rapidly developing
area. Selected topics are discussed in depth, with the intention of establishing
foundational techniques and methods for future research and development. A range
of other topics are treated at a survey level, with numerous pointers to relevant
literature for those interested. We also identify open issues and challenges along
the way, and conclude with a roadmap for future research.

1.4.1 Focus
The book is firmly rooted in information retrieval, and it thus bears the characteristics of the field. Developments are motivated and driven by specific use-cases, with
theory, evaluation, and application all being interconnected. A strong focus on data
is maintained throughout the book—after all, it is the data that dictates to a large
extent what can be done.
We deliberately refrain from reporting evaluation results from specific studies;
the absolute values of those evaluation scores may be largely influenced by, among
others, the various data (pre-)processing techniques, choice of tools, and parameter
settings. A direct comparison of results from different studies (performed by
different groups/individuals) may thus be misleading. Nevertheless, we indicate
overall performance ranges on standard benchmark suites. A great deal of attention
is given to evaluation methodology and to available resources, such as datasets,
software tools, and frameworks.
To remain focused, we shall follow a language agnostic approach and use
English as our working language (as, indeed, most test collections are in English).
Languages with markedly different syntax, morphology, or compositional semantics
may need additional processing techniques. The discussion of those is outside the
scope of this book.

1.4.2 Audience and Prerequisites
The primary target audience of this book are researchers and graduate students. It is
our hope that readers with a theoretical inclination will find it as useful as will those
with a practical orientation.
An understanding of basic probability and statistics concepts is required for most
models and algorithms that are discussed in the book. A general background in
information retrieval (i.e., familiarity with the main components of a search engine
and traditional document retrieval models, such as BM25 and language models, and
with basics of retrieval evaluation) is sufficient to follow the material. Further, a
basic understanding of machine learning concepts and algorithms for supervised
learning is assumed. It was our intention to make the book as self-contained as
possible. Therefore, standard retrieval models, learning-to-rank methods, and IR
evaluation measures will be briefly explained when we come across them for the
first time, in Chap. 3.

1.4.3 Organization
The book is divided into three main parts, sandwiched by introductory and
concluding chapters.
• The first two chapters, Introduction and Meet the Data, introduce the basic
concepts, provide an overview of entity-oriented search tasks, and present the
various types and sources of data that will be used throughout the book.
• Part I deals with the core task of entity ranking: given a textual query, possibly
enriched with additional elements or structural hints, return a ranked list of
entities. This core task is examined in a number of different flavors, using both
structured and unstructured data collections, and various query formulations. In
all these cases, the output is a ranked list of entities. The main questions guiding
this part are:
– How to represent entities and information needs, and how to match those
representations?
– How to exploit unique properties of entities, namely, types and relationships,
to improve retrieval performance?
Specifically, Chap. 3 introduces models purely for the text-based ranking of
entities. Chapter 4 presents advanced models capable of leveraging structured
information associated with entities, such as entity types and relationships. As
Chap. 4 builds on Chap. 3, these two chapters are best read sequentially.
• Part II is devoted to the role of entities in bridging unstructured and structured
data. The following two questions are addressed:
– How to recognize and disambiguate entity mentions in text and link them to
structured knowledge repositories?
– How to leverage massive volumes of unstructured (and semi-structured) data
to populate knowledge bases with new information about entities?
Chapters 5 and 6 may be read largely independently of each other and of other
chapters of the book.
• Part III explores how entities can enable search engines to understand the
concepts, meaning, and intent behind the query that the user enters into the
search box, and provide rich and focused responses (as opposed to merely a
list of documents)—a process known as semantic search. As we have discussed
earlier, semantic search is not a single method or approach, but rather a collection
of techniques. We present those techniques by dividing them into three broad
categories: understanding information needs (Chap. 7), leveraging entities in
document retrieval (Chap. 8), and utilizing entities for an enhanced search
experience (Chap. 9). Chapters 7–9 are relatively autonomous and can be read
independently of each other, but they build on concepts and tools from Parts I
and II.
• The final chapter, Conclusions and Future Directions, concludes the book by
discussing limitations of current approaches and suggests directions for future
research.

1.4.4 Terminology and Notation
This section provides a detailed description of the terminological and notational
conventions that will be used throughout the book.
Terminology Great care has been taken to use the following “reserved keywords”
only in their explicitly defined senses.
• Entity description: Textual (term-based) entity representation created with the
purpose of retrieval.
• Entity mention: Text span that is referring to a specific entity.
• Knowledge repository: A semi-structured or structured data collection that
contains a catalog of entities with unique identifiers, along with other information
about those entities. Examples include Wikipedia, DBpedia, Freebase, etc.
• Knowledge base: A structured knowledge repository that contains facts (assertions) about entities (including specific attributes and relationships). In this book,
these facts are represented as a set of subject-predicate-object (SPO) triples,
according to the RDF data model. For example, DBpedia is a knowledge base,
but Wikipedia is not.
• Knowledge graph: When viewed as a graph, we refer to a knowledge base as a
knowledge graph. This name is reserved for the contexts where the graph nature
of the data is utilized.
• Term: Atomic unit of text tokenization and indexing (i.e., a “word”).
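
The distinction between a knowledge base and a knowledge graph can be sketched in a few lines. The example below stores facts as subject-predicate-object triples and then builds an adjacency-list view of the same facts. The identifiers and predicate names are hypothetical and simplified; they are not actual DBpedia URIs or RDF vocabulary terms.

```python
from collections import defaultdict

# Hypothetical facts as (subject, predicate, object) triples.
# Identifiers are illustrative, not real knowledge base URIs.
TRIPLES = [
    ("MICHAEL_JORDAN", "type", "BasketballPlayer"),
    ("MICHAEL_JORDAN", "team", "CHICAGO_BULLS"),
    ("CHICAGO_BULLS", "type", "BasketballTeam"),
    ("CHICAGO_BULLS", "city", "CHICAGO"),
]


def to_graph(triples):
    """Build an adjacency list: subject -> list of (predicate, object)."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def neighbors(graph, node):
    """Return the set of nodes directly reachable from the given node."""
    return {obj for _predicate, obj in graph.get(node, [])}


graph = to_graph(TRIPLES)
print(neighbors(graph, "MICHAEL_JORDAN"))
```

The triple list corresponds to the knowledge base view (a set of assertions), whereas `to_graph` and `neighbors` exploit the graph nature of the same data.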
Typography We adhere to certain typographical conventions.
• Whenever referring to a particular entity, the name of that entity is typeset in
small capitals, e.g., JOHN SMITH.
• We typeset queries in italics, e.g., “example search query.” We include these
queries verbatim, as they appear in the given dataset, i.e., without correcting
grammar or capitalization.
• When quoting data from a knowledge repository, it is typeset in typewriter
font.

Selected definitions, key concepts, and ideas are highlighted in gray boxes
throughout the book.

Mathematical Notation We adopt the following notational conventions.
• Sequences of elements of the same type (such as vectors, lists, etc.) are denoted
as $x_1, \ldots, x_n$.
• Tuples, i.e., ordered collections of elements of different types, are denoted as
$(x_1, \ldots, x_n)$.
• Set-like variables are denoted by capital calligraphic letters, e.g., $\mathcal{D}$ for
documents, $\mathcal{E}$ for entities, $\mathcal{T}$ for the taxonomy of types, $\mathcal{V}$ for the vocabulary of terms,
etc. Graphs represent an exception with vertices and edges denoted as $V$ and $E$,
respectively (as the calligraphic versions of those letters are already taken).
• Matrices are denoted by bold capital roman letters (e.g., $\mathbf{A}$) and vectors are
denoted by bold small roman letters (e.g., $\mathbf{w}$).
• We occasionally use the semicolon to group the input variables of a function,
to show which are specific to the given target (before semicolon) and which are
more contextual (after semicolon). For example, c(t,e;d) denotes the number of
times the term t and entity e co-occur in a particular document d. The semicolon
is no more than a reading aid, and there is no mathematical difference between
the comma and the semicolon.
• Some functions, like weight (w()), score (score()), or similarity (sim()), are
formulated differently in the various works that this book draws upon. However,
these functions are named similarly (though their arguments may vary) because
they play similar roles in their respective contexts.
• Performance measures are typeset in roman font, e.g., F1 or NDCG.
• The symbol × denotes multiplication, while · is reserved for the dot product.

References
1. Abiteboul, S., Hull, R., Vianu, V. (eds.): Foundations of Databases: The Logical Level. 1st edn.
2. Agarwal, G., Kabra, G., Chang, K.C.C.: Towards rich query interpretation: walking back and
forth for mining query templates. In: Proceedings of the 19th international conference on
World wide web, WWW ’10, pp. 1–10. ACM (2010). doi: 10.1145/1772690.1772692

3. Balog, K.: Semistructured data search. In: Ferro, N. (ed.) Bridging Between Information
Retrieval and Databases, Lecture Notes in Computer Science, vol. 8173, pp. 74–96. Springer
(2014). doi: 10.1007/978-3-642-54798-0_4
4. Bast, H., Buchhold, B., Haussmann, E.: Semantic search on text and knowledge bases. Found.
Trends Inf. Retr. 10(2-3), 119–271 (2016). doi: 10.1561/1500000032
5. Benetka, J.R., Balog, K., Nørvåg, K.: Anticipating information needs based on check-in
activity. In: Proceedings of the 10th ACM International Conference on Web Search and Data
Mining, WSDM ’17, pp. 41–50. ACM (2017). doi: 10.1145/3018661.3018679
6. Berners-Lee, T., Hendler, J., Lassila, O.: The semantic web. Scientific American 284(5), 34–43
(2001)
7. Beynon-Davies, P.: Database Systems. 3rd edn. Palgrave, Basingstoke, UK (2004)
8. Blanco, R., Mika, P., Vigna, S.: Effective and efficient entity search in RDF data. In:
Proceedings of the 10th International Conference on The Semantic Web, ISWC ’11, pp. 83–97.
Springer (2011). doi: 10.1007/978-3-642-25073-6_6
9. Booch, G.: Object Oriented Design with Applications. Benjamin-Cummings Publishing Co.,
Inc. (1991)
10. Chakrabarti, S., Kasturi, S., Balakrishnan, B., Ramakrishnan, G., Saraf, R.: Compressed data
structures for annotated web search. In: Proceedings of the 21st International Conference on
World Wide Web, WWW ’12, pp. 121–130. ACM (2012). doi: 10.1145/2187836.2187854
11. Chen, P.P.S.: The entity-relationship model–toward a unified view of data. ACM Trans.
Database Syst. 1(1), 9–36 (1976). doi: 10.1145/320434.320440
12. Cheng, T., Chang, K.C.C.: Beyond pages: Supporting efficient, scalable entity search with
dual-inversion index. In: Proceedings of the 13th International Conference on Extending Database
Technology, EDBT ’10, pp. 15–26. ACM (2010). doi: 10.1145/1739041.1739047
13. Cheng, T., Yan, X., Chang, K.C.C.: EntityRank: Searching entities directly and holistically. In:
Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB ’07, pp.
387–398 (2007)
14. Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication.
IEEE Trans. on Knowl. and Data Eng. 24(9), 1537–1555 (2012). doi: 10.1109/TKDE.2011.127
15. Cohen, W.W., Hurst, M., Jensen, L.S.: A flexible learning system for wrapping tables and lists
in HTML documents. In: Proceedings of the 11th International Conference on World Wide
Web, WWW ’02, pp. 232–241. ACM (2002). doi: 10.1145/511446.511477
16. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: A survey. IEEE
Trans. on Knowl. and Data Eng. 19(1), 1–16 (2007). doi: 10.1109/TKDE.2007.9
17. Fetahu, B., Gadiraju, U., Dietze, S.: Improving entity retrieval on structured data. In:
Proceedings of the 14th International Semantic Web Conference. Springer (2015). doi:
10.1007/978-3-319-25007-6_28
18. Ganti, V., He, Y., Xin, D.: Keyword++: A framework to improve keyword search over entity
databases. Proc. VLDB Endow. 3(1-2), 711–722 (2010). doi: 10.14778/1920841.1920932
19. Guha, R., McCool, R., Miller, E.: Semantic search. In: Proceedings of the 12th International Conference on World Wide Web, WWW ’03, pp. 700–709. ACM (2003). doi:
10.1145/775152.775250
20. Guo, J., Xu, G., Cheng, X., Li, H.: Named entity recognition in query. In: Proceedings of
the 32nd international ACM SIGIR conference on Research and development in information
retrieval, SIGIR ’09, pp. 267–274. ACM (2009)
21. John, T.: What is semantic search and how it works with Google search (2012)

22. Johnson, M.: How the statistical revolution changes (computational) linguistics. In: Proceedings of the EACL 2009 Workshop on the Interaction Between Linguistics and Computational
Linguistics: Virtuous, Vicious or Vacuous?, ILCL ’09, pp. 3–11. Association for Computational
Linguistics (2009)
23. Lin, T., Pantel, P., Gamon, M., Kannan, A., Fuxman, A.: Active objects. In: Proceedings of
the 21st international conference on World Wide Web, WWW ’12, pp. 589–598. ACM (2012).
doi: 10.1145/2187836.2187916
24. Liu, T.Y.: Learning to Rank for Information Retrieval. Springer (2011)
25. Liu, Y., Bai, K., Mitra, P., Giles, C.L.: TableSeer: Automatic table metadata extraction and
searching in digital libraries. In: Proceedings of the 7th ACM/IEEE-CS Joint Conference on
Digital Libraries, JCDL ’07, pp. 91–100. ACM (2007). doi: 10.1145/1255175.1255193
26. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge
University Press (2008)
27. Pérez-Agüera, J.R., Arroyo, J., Greenberg, J., Iglesias, J.P., Fresno, V.: Using BM25F for
semantic search. In: Proceedings of the 3rd International Semantic Search Workshop,
SEMSEARCH ’10. ACM (2010). doi: 10.1145/1863879.1863881
28. Pichai, S.: Google I/O 2016 keynote (2016)
29. Pinto, D., McCallum, A., Wei, X., Croft, W.B.: Table extraction using conditional random
fields. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval, SIGIR ’03, pp. 235–242. ACM (2003). doi:
10.1145/860435.860479
30. Piskorski, J., Yangarber, R.: Information extraction: Past, present and future. In: Multi-source,
Multilingual Information Extraction and Summarization, pp. 23–49. Springer (2013). doi:
10.1007/978-3-642-28569-1_2
31. Pound, J., Hudek, A.K., Ilyas, I.F., Weddell, G.: Interpreting keyword queries over web
knowledge bases. In: Proceedings of the 21st ACM International Conference on Information
and Knowledge Management, CIKM ’12, pp. 305–314. ACM (2012). doi: 10.1145/2396761.2396803
32. Pound, J., Mika, P., Zaragoza, H.: Ad-hoc object retrieval in the web of data. In: Proceedings of
the 19th international conference on World wide web, WWW ’10, pp. 771–780. ACM (2010).
doi: 10.1145/1772690.1772769
33. Qian, L., Cafarella, M.J., Jagadish, H.V.: Sample-driven schema mapping. In: Proceedings of
the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD ’12,
pp. 73–84. ACM (2012). doi: 10.1145/2213836.2213846
34. Rosen, G.: Abstract objects. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy
(Spring 2017 Edition) (2017)
35. Salton, G.: Automatic Information Organization and Retrieval. McGraw Hill Text (1968)
36. Sanderson, M.: Test collection based evaluation of information retrieval systems. Found.
Trends Inf. Retr. 4(4), 247–375 (2010). doi: 10.1561/1500000009
37. Sarawagi, S.: Information extraction. Found. Trends databases 1(3), 261–377 (2008). doi:
10.1561/1900000003
38. Sarkas, N., Paparizos, S., Tsaparas, P.: Structured annotations of web queries. In: Proceedings
of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10,
pp. 771–782. ACM (2010). doi: 10.1145/1807167.1807251
39. Sawant, U., Chakrabarti, S.: Learning joint query interpretation and response ranking. In:
Proceedings of the 22nd International Conference on World Wide Web, WWW ’13, pp. 1099–
1109. ACM (2013). doi: 10.1145/2488388.2488484

40. Weikum, G.: DB & IR: both sides now. In: Proceedings of the 2007 ACM SIGMOD
International Conference on Management of Data, SIGMOD ’07, pp. 25–30. ACM (2007).
doi: 10.1145/1247480.1247484
41. Yu, J.X., Qin, L., Chang, L.: Keyword search in relational databases: A survey. IEEE Data
Eng. Bull. 33(1), 67–78 (2010)

Chapter 2

Meet the Data

This chapter introduces the basic types of data sources, as well as specific datasets
and resources, that we will be working with in later chapters of the book. These
may be placed on a spectrum of varying degrees of structure, from unstructured to
structured data, as shown in Fig. 2.1.

Fig. 2.1 The data spectrum

On the unstructured end of the spectrum we have plain text. Typically, these are
documents written in natural language.1 As a matter of fact, almost any type of data
can be converted into plain text, including web pages, emails, spreadsheets, and
database records. Of course, such a conversion would result in an undesired loss
of internal document structure and semantics. It is nevertheless always an option to
treat data as unstructured, by not making any assumptions about the particular data
format. Search in unstructured text is often referred to as full-text search.
On the opposite end of the spectrum there is structured data, which is typically
stored in relational databases; it is highly organized, tabular, and governed by a strict
schema. Search in this type of data is performed using formal query languages, like
SQL. These languages allow for a very precise formulation of information needs,
but require expert knowledge of the query language and of the underlying database
schema. This generally renders them unsuitable for ordinary users.
The data we will mostly be dealing with is neither of the two extremes and
falls somewhere “in the middle.” Therefore, it is termed semi-structured. It is
characterized by the lack of a fixed, rigid schema. Also, there is no clear separation
between the data and the schema; instead, it uses a self-describing structure (tags or
other markers). Semi-structured data can be most easily viewed as a combination of
unstructured and structured elements.

1 Written in natural language does not imply that the text has to be grammatical (or even sensible).

Table 2.1 Comparison of unstructured, semi-structured, and structured data search

                   Unstructured   Semi-structured   Structured
Unit of retrieval  Documents      Objects           Tuples
Schema             No             Self-describing   Fixed
Queries            Keyword        Keyword++         Formal languages

Let us point out that text is rarely completely
without structure. Even simple documents typically have a title (or a filename, which
is often meaningful). In HTML documents, markup tags specify elements such
as headings, paragraphs, and tables. Emails have sender, recipient, subject, and
body fields. What is important to notice here is that these document elements or
fields may or may not be present. This differs from structured data, where every
field specified by the schema ahead of time must be given some permitted value.
Therefore, documents with optional, self-describing elements naturally belong to
the category of semi-structured data. Furthermore, relational database records may
also be viewed as semi-structured data, by converting them to a set of hierarchically
nested elements. Performing such conversions can in fact simplify data processing
for entity-oriented applications. Using a semi-structured entity representation, all
data related to a given entity is available in a single entry for that entity. Therefore,
no aggregation via foreign-key relationships is needed. Table 2.1 summarizes data
search over the unstructured-structured spectrum.
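
The conversion of relational records into nested, per-entity entries mentioned above can be sketched as follows. The two tables, their column names, and the function name are hypothetical; the point is only that the foreign-key join is performed once, at conversion time, after which each entity is available as a single self-describing entry with optional fields.

```python
# Hypothetical relational tables, represented as lists of rows.
PERSONS = [
    {"person_id": 1, "name": "Ada Lovelace"},
    {"person_id": 2, "name": "Alan Turing"},
    {"person_id": 3, "name": "Grace Hopper"},
]
EMAILS = [  # person_id is a foreign key referencing PERSONS
    {"person_id": 1, "email": "ada@example.org"},
    {"person_id": 2, "email": "alan@example.org"},
    {"person_id": 2, "email": "turing@example.org"},
]


def to_entity_entries(persons, emails):
    """Fold the rows of both tables into one nested entry per person."""
    entries = {p["person_id"]: {"name": p["name"]} for p in persons}
    for row in emails:
        # The "emails" element is optional: it is only created when needed.
        entries[row["person_id"]].setdefault("emails", []).append(row["email"])
    return entries


entries = to_entity_entries(PERSONS, EMAILS)
print(entries[2])
```

Note that the entry for the third person has no `emails` element at all, which mirrors the optional nature of fields in semi-structured data.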
The remainder of this chapter is organized according to the main types of data
sources we will be working with: the Web (Sect. 2.1), Wikipedia (Sect. 2.2), and
knowledge bases (Sect. 2.3).

2.1 The Web
The World Wide Web (WWW), commonly known simply as “the Web,” is probably
the most widely used information resource and service today. The idea of the Web
(what today would be considered Web 1.0) was introduced by Tim Berners-Lee
in 1989. Beginning in 2002, a new version, dubbed “Web 2.0,” started to gain
traction, facilitating more active participation by users, who changed
from mere consumers into creators and publishers of content as well. The early
years of the Web 2.0 era were landmarked by the launch of some of today’s biggest
Finally, the Semantic Web (or Web 3.0) was proposed as an extension of the current
Web [3, 20]. It represents the next major evolution of the Web that enables data to be
understood by computers, which could then perform tasks intelligently on behalf of
users. The term Semantic Web refers both to this as-of-yet-unrealized future vision
and to a collection of standards and technologies for knowledge representation (cf.
Sect. 1.2.4).

Table 2.2 Publicly available web crawls

Name                        Time period          Size      #Documents
ClueWeb09 full (a)          Jan 2009–Feb 2009    5 TB      1B
ClueWeb09 (Category B)      Jan 2009–Feb 2009    230 GB    50M
ClueWeb12 (b)               Feb 2012–May 2012    5.6 TB    733M
ClueWeb12 (Category B)      Feb 2012–May 2012    400 GB    52M
Common Crawl (c)            May 2017             58 TB     2.96B
KBA stream corpus 2014 (d)  Oct 2011–Apr 2013    10.9 TB   1.2B

Size refers to compressed data
a https://lemurproject.org/clueweb09/
b https://lemurproject.org/clueweb12/
c http://commoncrawl.org/2017/06/may-2017-crawl-archive-now-available/
d http://trec-kba.org/kba-stream-corpus-2014.shtml

Web pages are more than just plain text; one of their distinctive characteristics
is their hypertext structure, defined by the HTML markup. HTML tags describe
the internal document structure, such as headings, paragraphs, lists, tables, and so
on. In addition, HTML documents contain hyperlinks (or simply “links”) to other
pages (or resources) on the Web. Links are utilized in at least two major ways.
First, the networked nature of the Web may be leveraged to identify important or
authoritative pages or sites. Second, many of the links also have a textual label,
referred to as anchor text. Anchor text is “incredibly useful for search engines
because it provides some extra description of the page being pointed to” [23].
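
As a minimal sketch of how anchor text might be collected from a page, the following example uses the `html.parser` module of the Python standard library to pair each link target with its textual label. The class and function names are our own, and real-world crawled HTML would need more robust handling (e.g., of nested or unclosed anchors).

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect (href, anchor text) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # target of the anchor currently being read
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_anchors(html):
    """Return all (href, anchor text) pairs found in an HTML string."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.links


page = '<p>See the <a href="https://example.org/">example <b>site</b></a>.</p>'
print(extract_anchors(page))
```

Aggregating such pairs by link target over a whole crawl would yield, for each page, the labels that other pages use to describe it.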

2.1.1 Datasets and Resources
We introduce a number of publicly available web crawls that have been used in the
context of entity-oriented search. Table 2.2 presents a summary.
ClueWeb09/12 The ClueWeb09 dataset consists of about one billion web pages
in 10 languages,2 collected in January and February 2009. The crawl aims to be
a representative sample of what is out there on the Web (which includes SPAM
and pornography). ClueWeb09 was used by several tracks of the TREC conference.
The data is distributed in gzipped files that are in WARC format. About half of
the collection is in English; this is referred to as the “Category A” subset. Further,
the first segment of Category A, comprising about 50 million pages, is referred to
as the “Category B” subset.3 The Category B subset also includes the full English
Wikipedia. These two subsets may be obtained separately if one does not need the
full collection.

2 English, Chinese, Spanish, Japanese, French, German, Portuguese, Arabic, Italian, and Korean.

ClueWeb12 is the successor to the ClueWeb09 web dataset, collected between
February and May 2012. The crawl was initially seeded with URLs from
ClueWeb09 (with the highest PageRank values, and then removing likely SPAM
pages) and with some of the most popular sites in English-speaking countries (as
reported by Alexa4 ). Additionally, domains of tweeted URLs were also injected
into the crawl on a regular basis. A blacklist was used to avoid sites that promote
pornography, malware, and the like. The full dataset contains about 733 million
pages. Similarly to ClueWeb09, a “Category B” subset of about 50 million English
pages is also available.
Common Crawl Common Crawl5 is a nonprofit organization that regularly crawls
the Web and makes the data publicly available. The datasets are hosted on Amazon
S3 as part of the Amazon Public Datasets program.6 As of May 2017, the crawl
contains 2.96 billion web pages and over 250 TB of uncompressed content (in
WARC format). The Web Data Commons project7 extracts structured data from
the Common Crawl and makes those publicly available (e.g., the Hyperlink Graph
Dataset and the Web Table Corpus).
KBA Stream Corpus The KBA Stream Corpus 2014 is a focused crawl, which
concentrates on news and social media (blogs and tweets). The 2014 version
contains 1.2 billion documents over a period of 19 months (and subsumes the 2012
and 2013 KBA Stream Corpora). See Sect. 6.2.5.1 for a more detailed description.

3 The Category B subset was mainly intended for research groups that were not yet ready at that
time to scale up to one billion documents, but it is still widely used.
4 http://www.alexa.com/.
5 http://commoncrawl.org/.
6 https://aws.amazon.com/public-datasets/.
7 http://webdatacommons.org/.

2.2 Wikipedia
Wikipedia is one of the most popular web sites in the world and a trusted source
of information for many people. Wikipedia defines itself as “a multilingual,
web-based, free-content encyclopedia project supported by the Wikimedia Foundation
and based on a model of openly editable content.”8 Content is created through the
collaborative effort of a community of users, facilitated by a wiki platform. There
are various mechanisms in place to maintain high-quality content, including the
verifiability policy (i.e., readers should be able to check that the information comes
from a reliable source) and a clear set of editorial guidelines. The collaborative
editing model makes it possible to distribute the effort required to create and
maintain up-to-date content across a multitude of users. At the time of writing,
Wikipedia is available in nearly 300 languages, although English is by far the most
popular, with over five million articles. As stated by Mesgari et al. [15], “Wikipedia
may be the best-developed attempt thus far to gather all human knowledge in one
place.”
What makes Wikipedia highly relevant for entity-oriented search is that most
of its entries can be considered as (semi-structured) representations of entities. At
its core, Wikipedia is a collection of pages (or articles, i.e., encyclopedic entries)
that are well interconnected by hyperlinks. On top of that, Wikipedia offers several
(complementary) ways to group articles, including categories, lists, and navigation
templates. In the remainder of this section, we first look at the anatomy of a regular
Wikipedia article and then review (some of the) other, special-purpose page types.
We note that it is not our aim to provide a comprehensive treatment of all the types of
pages in Wikipedia. For instance, in addition to the encyclopedic content, there are
also pages devoted to the administration of Wikipedia (discussion and user pages,
policy pages and guidelines, etc.); although hugely important, these are outside our
present scope of interest.

2.2.1 The Anatomy of a Wikipedia Article
A typical Wikipedia article focuses on a particular entity (e.g., a well-known
person, as shown in Fig. 2.2) or concept (e.g., “democracy”).9 Such articles typically
contain, among others, the following elements (the letters in parentheses refer to
Fig. 2.2):
• Title (I.)
• Lead section (II.)
– Infobox (II.b)
– Introductory text (II.c)
• Body content (IV.)
• Appendices and bottom matter (V.)
– References and notes (V.a)
– Categories (V.c)

9 We refer back to Sect. 1.1.1 for a discussion on the difference between entities and concepts.

The lead section of a Wikipedia article is the part between the title heading and the
table of contents. The lead section may contain several (optional) elements, including
an infobox and introductory text. We will further elaborate on the title, infobox, and
introductory text elements below.
The main body of the article may be divided into sections, each with a section
heading. The sections may be nested in a hierarchy. When there are at least four
section headings, a table of contents is generated automatically.
The body of the article may be followed by optional appendix and footer sections,
including references and notes (that cite sources), further reading (links to relevant
publications that have not been used as sources), internal links organized into
navigational boxes, and categories.

2.2.1.1 Title
Each Wikipedia article is uniquely identified by its page title. The title of the page is
typically the most common name for the entity (or concept) described in the article.
When the name is ambiguous, the pages of the other namesakes are disambiguated
by adding further qualifiers to their title within parentheses. For instance, MICHAEL
JORDAN refers to the American (former) professional basketball player, and the
page about the English footballer with the same name has the title MICHAEL
JORDAN (FOOTBALLER). Note that the page title is case-sensitive (except the first
character). For special pages, the page title may be prefixed with a namespace,
separated with a colon, e.g., “Category:German racing drivers.” We will look at
some of the Wikipedia namespaces later in this section.
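
The title conventions above can be illustrated with a minimal Python sketch that splits a page title into an optional namespace, a name, and an optional parenthesized qualifier. The function names, the `PageTitle` type, and the short namespace list are assumptions made for this illustration; they are not part of any Wikipedia tooling. Treating every trailing parenthetical as a disambiguation qualifier is a heuristic, since some names legitimately end in parentheses.

```python
import re
from typing import NamedTuple, Optional


class PageTitle(NamedTuple):
    namespace: Optional[str]  # e.g., "Category"; None for regular articles
    name: str                 # title without namespace prefix and qualifier
    qualifier: Optional[str]  # e.g., "footballer"; None if absent


# Assumption: a small, hand-picked subset of namespaces, for illustration only.
KNOWN_NAMESPACES = {"Category", "Template", "File", "Help", "Portal"}

# Heuristic: a trailing "(...)" is treated as a disambiguation qualifier.
_QUALIFIER_RE = re.compile(r"^(?P<name>.+?)\s*\((?P<qualifier>[^()]+)\)$")


def parse_title(title: str) -> PageTitle:
    """Split a page title into namespace, name, and qualifier."""
    namespace = None
    prefix, sep, rest = title.partition(":")
    if sep and prefix in KNOWN_NAMESPACES:
        namespace, title = prefix, rest.strip()
    match = _QUALIFIER_RE.match(title)
    if match:
        return PageTitle(namespace, match.group("name"), match.group("qualifier"))
    return PageTitle(namespace, title, None)


def normalize_title(title: str) -> str:
    """Uppercase the first character only; the rest stays case-sensitive."""
    return title[:1].upper() + title[1:]


if __name__ == "__main__":
    print(parse_title("Michael Jordan"))
    print(parse_title("Michael Jordan (footballer)"))
    print(parse_title("Category:German racing drivers"))
    print(normalize_title("michael Jordan"))
```

Keeping the qualifier separate from the name makes it easy to group namesakes under a shared surface form while still addressing each page by its full, unique title.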

2.2.1.2 Infobox
The infobox is a panel that summarizes information related to the subject of the
article. In desktop view, it appears at the top right of the page, next to the lead
section; in mobile view it is displayed at the very top of the page. In the case
of entity pages, the infobox summarizes key facts about the entity in the form
of property-value pairs. Therefore, infoboxes represent an important source for
extracting structured information about entities (cf. Sect. 2.3.2). A large number
of infobox templates exist, which are created and maintained collaboratively, with
the aim to standardize information across articles that belong to the same category.
Infoboxes, however, are “free form,” meaning that what ultimately gets included
in the infobox of a given article is determined through discussion and consensus
among the editors.
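
To make the notion of property-value pairs concrete, the following Python sketch parses a simplified infobox written in wiki markup into a dictionary. The sample markup, its field names and values, and the function `parse_infobox` are hypothetical and chosen only for illustration. The parser assumes one `| key = value` pair per line and does not handle nested templates or multi-line values, both of which occur in real articles.

```python
from typing import Dict, Tuple


def parse_infobox(markup: str) -> Tuple[str, Dict[str, str]]:
    """Return (template name, property-value pairs) for a simplified infobox.

    Assumption: each property sits on its own line as "| key = value".
    """
    lines = [line.strip() for line in markup.strip().splitlines()]
    if not lines or not lines[0].startswith("{{"):
        raise ValueError("expected markup starting with '{{'")
    template = lines[0][2:].strip()
    properties: Dict[str, str] = {}
    for line in lines[1:]:
        if line == "}}":  # end of the template
            break
        if line.startswith("|") and "=" in line:
            key, _, value = line[1:].partition("=")
            properties[key.strip()] = value.strip()
    return template, properties


# Hypothetical sample; the fields and values are made up for this sketch.
SAMPLE = """
{{Infobox person
| name        = Jane Doe
| nationality = German
| occupation  = Racing driver
}}
"""

if __name__ == "__main__":
    template, properties = parse_infobox(SAMPLE)
    print(template)
    for key, value in properties.items():
        print(f"{key}: {value}")
```

Because infoboxes are free form, a consumer of such pairs should expect missing, renamed, or inconsistently formatted properties across articles that use the same template.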

Schumacher holds many of Formula One’s
[[List of Formula One driver records|driver records]], including most
championships, race victories, fastest laps, pole positions and most races
won in a single season - 13 in [[2004 Formula One season|2004]] (the last of
these records was equalled by fellow German [[Sebastian Vettel]] 9 years
later). In [[2002 Formula One season|2002]], he became the only driver in
Formula One history to finish in the top three in every race of a season and
then also broke the record for most consecutive podium finishes. According to
the official Formula One website, he is "statistically the greatest driver
the sport has ever seen".

2.2.1.3 Introductory Text
Most Wikipedia articles include an introductory text, the “lead,” which is a
brief summary of the article—normally, no more than four paragraphs long.
It should be written in a way that creates interest in the article. The
first sentence and the opening paragraph bear special importance. The first
sentence “can be thought of as the definition of the entity described in the
article” [11]. The first paragraph offers a more elaborate definition, but still
without being too detailed. DBpedia, e.g., treats the first paragraph as the “short
abstract” and the full introductory text as the “long abstract” of the entity (cf.
Sect. 2.3.2).
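
The distinction between first sentence, first paragraph, and full introductory text can be sketched in Python as follows. This is only an illustration of the idea, not DBpedia's actual extraction procedure: the function `split_abstracts`, the dictionary keys, and the naive punctuation-based sentence splitter are all assumptions, and the splitter will break on abbreviations such as "e.g."

```python
import re
from typing import Dict


def split_abstracts(intro_text: str) -> Dict[str, str]:
    """Derive a definition, a short abstract, and a long abstract.

    Assumptions: paragraphs are separated by blank lines, and a sentence
    ends at the first '.', '!' or '?' followed by whitespace.
    """
    paragraphs = [
        " ".join(p.split())  # collapse line breaks and repeated spaces
        for p in re.split(r"\n\s*\n", intro_text.strip())
        if p.strip()
    ]
    if not paragraphs:
        return {"definition": "", "short_abstract": "", "long_abstract": ""}
    first_paragraph = paragraphs[0]
    first_sentence = re.split(r"(?<=[.!?])\s+", first_paragraph, maxsplit=1)[0]
    return {
        "definition": first_sentence,        # first sentence
        "short_abstract": first_paragraph,   # first paragraph
        "long_abstract": "\n\n".join(paragraphs),  # full introductory text
    }


if __name__ == "__main__":
    # Hypothetical introductory text, made up for this sketch.
    intro = (
        "Ada is an example entity. It is used for illustration.\n\n"
        "A second paragraph adds more detail."
    )
    for key, value in split_abstracts(intro).items():
        print(f"{key}: {value!r}")
```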

2.2.2 Links
Internal links are an important feature of Wikipedia as they allow “readers to deepen
their understanding of a topic by conveniently accessing other articles.”10 Listing 2.1
shows the original wiki markup for the second paragraph of the introductory text
of an article. A link is created by enclosing the title of a target page in double
square brackets ([[...]]). Optionally, an alternative label, i.e., anchor text, may be
provided after the vertical bar (|).
Linking is governed by a detailed set of guidelines. A key rule given to editors
is to link only the first occurrence of an entity or concept in the text of the
article.
Links capture relationships between articles. In addition, anchor texts are a rich source of entity
name variants. Wikipedia links may be used, among others, to help identify and
disambiguate entity mentions in text (cf. Chap. 5).
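
As a rough illustration of how internal links and anchor texts might be harvested from wiki markup, the Python sketch below extracts (target, anchor) pairs with a regular expression and groups the anchor texts by target page. The names `LINK_RE`, `extract_links`, and `name_variants` are assumptions for this example. The pattern covers only the basic `[[target]]` and `[[target|anchor]]` forms and ignores section links, file embeds, and nested markup. The sample string is an excerpt of the markup in Listing 2.1.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Matches [[target]] and [[target|anchor]]; other link forms are not handled.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def extract_links(markup: str) -> List[Tuple[str, str]]:
    """Return (target page title, anchor text) pairs in order of appearance."""
    links = []
    for match in LINK_RE.finditer(markup):
        target = match.group(1).strip()
        # Without an explicit label, the target title doubles as anchor text.
        anchor = (match.group(2) or target).strip()
        links.append((target, anchor))
    return links


def name_variants(links: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group anchor texts by target page as candidate entity name variants."""
    variants: Dict[str, Set[str]] = defaultdict(set)
    for target, anchor in links:
        variants[target].add(anchor)
    return dict(variants)


if __name__ == "__main__":
    markup = (
        "Schumacher holds many of Formula One's "
        "[[List of Formula One driver records|driver records]], including "
        "most races won in a single season - 13 in "
        "[[2004 Formula One season|2004]] (the last of these records was "
        "equalled by fellow German [[Sebastian Vettel]] 9 years later)."
    )
    links = extract_links(markup)
    for target, anchor in links:
        print(f"{anchor!r} -> {target!r}")
    print(name_variants(links))
```

Aggregating such pairs over many articles yields, for each target page, a set of surface forms together with how often each is used, which is the kind of evidence that entity linking methods can build on.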

2.2.3 Special-Purpose Pages
Not all Wikipedia articles are entity pages. In this subsection and the next, we
discuss two specific kinds of special-purpose pages.

2.2.3.1 Redirect Pages
Each entity in Wikipedia has a dedicated article and is uniquely identified by the
page title of that article. The page title is the most common (canonical) name of
the entity. Entities, however, may be referred to by multiple names (aliases). The
purpose of redirect pages i