
Vocabulary – We download a
list of English verbs and adjectives from an online dictionary, Your
Dictionary, and harvest a collection of attributes, concepts, and instances from
a well-known knowledgebase, Probase. Altogether, they constitute our vocabulary.
To cope with the noise contained in short texts, we further extend the
vocabulary to incorporate abbreviations and nicknames of instances.
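A minimal sketch of what such a vocabulary lookup might look like, with an alias table folding abbreviations and nicknames into canonical instances. All entries and names here are illustrative assumptions; the real vocabulary is harvested from Your Dictionary and Probase.

```python
# Illustrative vocabulary: surface form -> type (verb/adjective/attribute/concept/instance).
# Entries are toy examples, not the harvested data.
vocabulary = {
    "population": "attribute",
    "country": "concept",
    "china": "instance",
    "new york city": "instance",
}

# Alias table: abbreviations and nicknames resolve to canonical instance names.
aliases = {
    "nyc": "new york city",
    "big apple": "new york city",
}

def lookup(term):
    """Return (canonical_form, type) for a term, resolving aliases first."""
    canonical = aliases.get(term, term)
    entry_type = vocabulary.get(canonical)
    return (canonical, entry_type) if entry_type else None
```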

Knowledgebase – A
knowledgebase stores mappings between instances and concepts. Some existing
knowledgebases also associate each concept with attributes. In this work, we use
Probase as our knowledgebase. Probase is a huge semantic network of concepts
(e.g., country), instances (e.g., china), and attributes (e.g., population). It
mainly focuses on two types of relationships, namely the isA relationship
between instances and concepts (e.g., china isA country) and the isAttributeOf
relationship between attributes and concepts (e.g., population isAttributeOf
country).
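The two relationship types can be pictured as simple mappings. The following is a toy illustration, not the real Probase data: isA maps instances to their concepts, and isAttributeOf maps attributes to the concepts they describe.

```python
# Toy isA mapping: instance -> set of concepts it belongs to.
is_a = {
    "china": {"country"},
    "april": {"month", "song"},  # an ambiguous instance
}

# Toy isAttributeOf mapping: attribute -> set of concepts it describes.
is_attribute_of = {
    "population": {"country"},
}

def concepts_of(instance):
    """Concepts an instance isA member of."""
    return is_a.get(instance, set())

def attributes_of(concept):
    """Attributes that are isAttributeOf the given concept."""
    return {a for a, cs in is_attribute_of.items() if concept in cs}
```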


Given a short text s written in a
natural language, we generate a semantic interpretation of s, represented as a
sequence of typed-terms in which the semantics of each instance is labeled with
its top-1 concept cluster.

As shown in Fig. 1, to generate the semantic interpretation of a short text, we
divide the task of short text understanding into three subtasks:

1) Text segmentation—given a short text,
find the most semantically coherent segmentation;

2) Type detection—for each term, detect its best type;

3) Concept labeling—for each ambiguous instance, rerank its concept clusters
according to the context.

Fig. 2 illustrates our framework for short text
understanding. First, we construct an index on the entire vocabulary and acquire
knowledge from web corpora and existing knowledgebases. Then, we pre-calculate
the semantic coherence between terms, which is used for short text
understanding. Finally, we perform text segmentation, type detection, and concept
labeling, and generate a semantically coherent interpretation for a given short
text.
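The three subtasks above form a pipeline. The following is a schematic sketch of that control flow; each stage is a stub standing in for the methods described later in the text, and all function names are ours, not the paper's.

```python
# Schematic pipeline for short text understanding: segmentation -> type
# detection -> concept labeling. Each stage is a placeholder stub.
def segment(short_text):
    # Placeholder: should return the most coherent segmentation;
    # here we fall back to whitespace splitting.
    return short_text.split()

def detect_type(term):
    # Placeholder: should pick the best type among verb, adjective,
    # attribute, concept, and instance.
    return "instance"

def label_concept(term, context):
    # Placeholder: should rerank the instance's concept clusters by
    # context and return the top-1 cluster.
    return None

def understand(short_text):
    """Produce a list of (term, type, top-1 concept cluster) triples."""
    terms = segment(short_text)
    result = []
    for term in terms:
        term_type = detect_type(term)
        cluster = label_concept(term, terms) if term_type == "instance" else None
        result.append((term, term_type, cluster))
    return result
```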

Segmentation – We can recognize all possible terms in a short text using the
trie-based framework [28]. But the real question is how to obtain a coherent
segmentation from the set of candidate terms. Let us consider two examples, "april in
paris lyrics" and "vacation april in paris", to illustrate our approach to text
segmentation. Obviously, {april in paris, lyrics} is a better segmentation of "april
in paris lyrics" than {april, paris, lyrics}, since "lyrics" is more semantically
related to songs than to months or cities. Similarly, {vacation, april, paris}
is a better segmentation of "vacation april in paris", due to the higher coherence
among "vacation", "april", and "paris" than that between "vacation" and "april in
paris". We segment a short text into a sequence of terms, using the following
heuristics to determine a valid segmentation:

- Except for stop words, each word belongs to one and only one term;
- Terms are coherent (i.e., terms mutually reinforce each other).

We use a graph to represent candidate terms and their coherence.
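The first step, recognizing all candidate terms, can be sketched as follows. The paper uses a trie-based framework [28]; here a plain set lookup over word n-grams serves as a simplified stand-in, and the vocabulary entries are illustrative.

```python
# Simplified candidate-term recognition: scan every word n-gram of the
# short text and keep those found in the vocabulary. A trie would avoid
# re-scanning shared prefixes; a set lookup is enough for illustration.
VOCAB = {"april", "paris", "april in paris", "lyrics", "vacation"}

def candidate_terms(short_text):
    """Return all (start, end, term) word spans that form a vocabulary term."""
    words = short_text.lower().split()
    spans = []
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            phrase = " ".join(words[i:j])
            if phrase in VOCAB:
                spans.append((i, j, phrase))
    return spans
```

Note that candidate terms may overlap (e.g., "april" and "april in paris"); choosing among the overlapping candidates is exactly the segmentation problem.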

A valid segmentation is thus a sequence of terms such that terms
cannot overlap and every non-stop word in the short text is covered by a
term.
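Under these two constraints (no overlapping terms, every non-stop word covered), the valid segmentations can be enumerated by a simple left-to-right search. This is an illustrative sketch over toy data, not the paper's algorithm; spans are (start, end, term) over word positions.

```python
# Enumerate valid segmentations: chosen terms must not overlap, and every
# non-stop word must be covered by some term. Stop-word set is illustrative.
STOPWORDS = {"in"}

def valid_segmentations(words, spans):
    """Return every list of terms that tiles the word sequence."""
    results = []

    def extend(pos, chosen):
        if pos == len(words):
            results.append([term for _, _, term in chosen])
            return
        if words[pos] in STOPWORDS:
            # Stop words need no covering term; skip over them.
            extend(pos + 1, chosen)
        for (i, j, term) in spans:
            if i == pos:  # the next chosen term must start exactly here
                extend(j, chosen + [(i, j, term)])

    extend(0, [])
    return results
```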

Affinity Score (AS) is defined to measure
the semantic coherence between typed-terms. In this work, we consider two types of
coherence: similarity and relatedness (co-occurrence). We believe that two
typed-terms are coherent if they are semantically similar or if they often
co-occur on the web. Therefore, the Affinity Score between typed-terms x and y
can be calculated as follows:
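The formula itself is not shown in this excerpt. Since the text says two typed-terms are coherent if they are similar *or* often co-occur, one illustrative way to combine the two signals is to take their maximum; the score tables and the use of max() below are our assumptions for the sketch, not the paper's exact formula.

```python
# Hedged sketch of an Affinity Score combining similarity and
# co-occurrence. All scores are made-up illustrative values.
SIMILARITY = {("april", "paris"): 0.1}
COOCCURRENCE = {("april in paris", "lyrics"): 0.7}

def _sym(table, x, y):
    """Look a pair up symmetrically, defaulting to 0.0."""
    return max(table.get((x, y), 0.0), table.get((y, x), 0.0))

def affinity_score(x, y):
    """Coherence of x and y: similar OR frequently co-occurring suffices."""
    return max(_sym(SIMILARITY, x, y), _sym(COOCCURRENCE, x, y))
```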