On the Role of Conceptualization in
Commonsense Knowledge Graph Construction

Mutian He¹, Yangqiu Song¹, Kun Xu², Dong Yu²
¹Hong Kong University of Science and Technology
²Tencent
{mhear, yqsong}@cse.ust.hk
{kxkunxu,dyu}@tencent.com
Abstract

Commonsense knowledge graphs (CKGs) like ATOMIC and ASER are substantially different from conventional KGs, as they consist of a much larger number of nodes formed by loosely-structured text. This enables them to handle highly diverse queries in natural language related to commonsense, but leads to unique challenges for automatic KG construction methods. Besides identifying relations absent from the KG between nodes, such methods are also expected to explore absent nodes represented by text, in which different real-world things, or entities, may appear. To deal with the innumerable entities involved with commonsense in the real world, we introduce conceptualization to CKG construction, i.e., viewing entities mentioned in text as instances of specific concepts, or vice versa. We build synthetic triples by conceptualization, and further formulate the task as triple classification, handled by a discriminative model with knowledge transferred from pretrained language models and fine-tuned by negative sampling. Experiments demonstrate that our methods can effectively identify plausible triples and expand the KG with triples of both new nodes and edges of high diversity and novelty.
1 Introduction
Commonsense knowledge, such as knowing that a
trophy could not fit in a suitcase because the tro-
phy is too big, is implicitly acknowledged among
human beings through real-life experience rather
than systematic learning. As a result, artificial in-
telligence meets difficulties in capturing such com-
monsense. To deal with the issue, commonsense
knowledge graphs (CKGs) like ConceptNet (Speer
et al., 2017), ATOMIC (Sap et al., 2019), ASER
(Zhang et al., 2019a), etc. have been proposed.
Such graphs are aimed at collecting and solidifying the implicit commonsense in the form of triples ⟨h, r, t⟩, with the head node h and the tail node t connected by a relation (i.e., edge) r. However,
a key difference between traditional knowledge
graphs (like WordNet, Freebase, etc.) and CKGs is
that commonsense is often difficult to represent as
two strictly formed nodes compared by a specific
relation. Instead, recent approaches represent a node with loosely-structured text, either annotated by humans or retrieved from text corpora. Nodes are then linked by one of several predefined types of relations, as in the triple ⟨h: I'm hungry, r: Result, t: I have lunch⟩ in ASER.¹
                          ASER      ATOMIC
#Nodes                    194.0M    309.5K
#Triples                  64.4M     877.1K
#Relation Types           15        9
Average Degree            0.66      5.67
Entity Coverage           52.33%    6.98%
Average Distinct Entity   0.026     0.082

Table 1: Statistics for recently proposed commonsense knowledge graphs. Entity Coverage is calculated as the proportion of the top 1% most frequent entities in Probase that are mentioned by nodes in each CKG. Average Distinct Entity is the average number of distinct Probase entities per node in each CKG. The core version of ASER is used for these two results.
Such CKGs, storing an exceptionally large number of triples, as shown in Table 1, are capable of representing a much broader range of knowledge and handling flexible queries related to commonsense. However, the complexity of real-world commonsense is still immense. In particular, with innumerable eventualities involved with commonsense in the real world, it is costly for a CKG to cover all of them as nodes; and even when covered, acquiring the corresponding relation between each pair of nodes is of quadratic difficulty. This situation is particularly demonstrated by the sparsity of edges in current CKGs in Table 1: even automatic extraction methods, as in ASER, fail to capture edges between most nodes. Therefore, alternative KG construction methods are needed.

¹ Words in ASER are lemmatized, but in this paper we always show the original text for easier understanding.

arXiv:2003.03239v2 [cs.CL] 7 Apr 2020

Figure 1: A sample of conceptualization in a CKG. Panel (a) shows a commonsense reasoning problem: "The trophy would not fit in the brown suitcase because it was too big. What was too big?" Panel (b) shows a triple existing in the CKG: ⟨h: item not fit in container, r: Reason, t: item is big⟩. Panel (c) shows the conceptualization of the extracted triple ⟨h: trophy not fit in brown suitcase, r: Reason, t: trophy is big⟩, where trophy IsA item and brown suitcase IsA container. Given a commonsense reasoning problem like (a), even though the corresponding triple (c) is not in the KG, a triple (b) which is present in the KG can be used through abstraction. This is done by identifying the real-world entities trophy and brown suitcase in the text of (c), and then substituting them using IsA relations. Following the same idea, (c) can be produced from (b) and included in the CKG via instantiation.
As nodes are represented by text, utilizing se-
mantic information becomes critical for CKG con-
struction. Such semantic information can be lever-
aged by large pretrained language models like
BERT (Devlin et al., 2018) and GPT-2 (Radford
et al., 2019): These models capture rich semantic
knowledge from unsupervised pretraining, which
can be transferred to a wide range of downstream
tasks to reach impressive results. Therefore, efforts
have been made to extract knowledge from such
models for KG construction. COMeT (Bosselut
et al., 2019) is a notable attempt which fine-tuned
GPT on CKG triples to predict tails from heads and
relations. However, it is often observed that such
generative neural models suffer from the diversity
and novelty issue: They tend to overproduce high-
probability samples similar to those in the training
set, while failing to cover a broader range of pos-
sible triples absent in a CKG with diverse entities
involved. In contrast, discriminatory methods in-
corporate language models to KG completion tasks
such as link prediction and triple classification to
evaluate whether a triple is valid, and could be ex-
tended to arbitrary text for triples on various KGs
(Malaviya et al., 2019; Davison et al., 2019; Yao
et al., 2019). However it would be computation-
ally infeasible to identify new plausible triples on
recent large CKGs if we aimlessly explore new
nodes and evaluate each possible triple, without
leveraging existent nodes and triples as in genera-
tive methods. Therefore, all methods above have
certain shortcomings.
In particular, we observe that current methods miss the variation of real-world entities, which is a critical factor for the diversity of nodes. For example, a CKG may cover the node I eat apple and its relevant edges, but the node I eat banana might be missing or its edges incomplete. As shown in Table 1, a large portion of the most common real-world entities in Probase are never mentioned in recent CKGs like ASER, let alone edges related to those entities. Moreover, as demonstrated by the low average number of distinct entities per node, directly expanding the scale of CKGs would not be a cost-effective way to cover diverse entities.
To relieve this issue, we posit the importance of a specific element of human commonsense, conceptualization, which, though found useful for certain natural language understanding tasks (Song et al., 2011; Wang et al., 2015; Hua et al., 2015), has not been investigated in depth in this area. As observed by psychologists, "concepts are the glue that holds our mental world together" (Murphy, 2004). Human beings are able to make reasonable inferences by utilizing the IsA relationship between real-world concepts and instances. For example, without knowing what a floppy disk is, given that it is a memory device, people may infer that it may store data, be readable by a computer, etc. From this viewpoint, instead of directly building triples with countless entities, a CKG can be broadly expanded to handle various queries, as shown in Figure 1, by substituting instances mentioned in text with the corresponding concepts (i.e., abstraction), or vice versa (i.e., instantiation), given an extra CKG of IsA relations.
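As a minimal sketch, the abstraction and instantiation operations can be illustrated with a toy IsA graph; the entries below are hypothetical stand-ins for a real conceptualization KG, chosen to match the Figure 1 example:

```python
# Toy IsA knowledge graph (instance -> concepts); entries are illustrative,
# matching the Figure 1 example rather than a real conceptualization KG.
ISA = {
    "trophy": ["item", "award"],
    "brown suitcase": ["container", "luggage"],
}
# Reverse index for instantiation: concept -> instances.
INSTANCES = {}
for inst, concepts in ISA.items():
    for c in concepts:
        INSTANCES.setdefault(c, []).append(inst)

def abstract(text, entity):
    """Substitute a mentioned instance with each of its concepts."""
    return [text.replace(entity, c) for c in ISA.get(entity, [])]

def instantiate(text, concept):
    """Substitute a mentioned concept with each of its instances."""
    return [text.replace(concept, i) for i in INSTANCES.get(concept, [])]

print(abstract("trophy not fit in brown suitcase", "trophy"))
# includes "item not fit in brown suitcase", as in Figure 1 (b)
```

Applied in reverse, `instantiate("item is big", "item")` produces the concrete tail of Figure 1 (c).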
However, such conceptualization is never strict
induction or deduction that is guaranteed to be true.
Figure 2: A sample from ATOMIC for discriminating conceptualizations. Given the head PersonX eats cookies and the relation xWant (then, PersonX wants), the seed tail is to get some milk. Some conceptualizations, like replacing milk with beverage in the tail node, are valid, while replacing it with dairy product is invalid in this context: with commonsense, one would often want something to drink after eating cookies, while the general concept of dairy is not relevant in such a scenario, and dairy products are not all drinkable.
As shown in Figure 2, determining whether a triple built by conceptualization is reasonable remains a challenging task, requiring both the context within the triple and a broader range of commonsense. Such a discriminative problem can be viewed as a particular case of the well-studied task of KG completion. Therefore, we propose to formulate our problem as a triple classification task, one of the standard tasks for KG completion (Socher et al., 2013). The difference is that, instead of considering arbitrary substitutions of the head or tail with existing nodes, we apply conceptualization as described in Section 2.1, and train our model by negative sampling as discussed in Section 2.2. We leverage the rich semantic information in large pretrained language models by fine-tuning them as discriminators. In this way, the models are expected to take triples with arbitrary nodes as inputs and evaluate whether each triple is reasonable.
To conclude, our contributions are three-fold:

1. We introduce conceptualization to CKG construction to explore a broader range of triples.

2. We formulate conceptualization-based CKG construction as a triple classification task performed on synthetic triples.

3. We propose a method for the task by fine-tuning pretrained language models with negative sampling.

Our code and pipeline are available at github.com/mutiann/ccc.
2 Methodologies
Our methodology of CKG construction is based on the idea that, given a set of ground-truth triples as seeds, new triples can be built from them by abstraction or instantiation of entities mentioned in the head or tail node, i.e., substituting a mentioned entity with a more general or more specific one, using the particular commonsense of IsA relations. Therefore, we need a CKG K, viewed as seeds, and a conceptualization KG C, both denoting a set of triples ⟨h, r, t⟩, while in C, r is always IsA.
2.1 Conceptualization
Since there are diverse ways to conceptualize an entity from commonsense, C must sufficiently cover various real-world entities connected by IsA relations. Therefore, we choose Probase (Wu et al., 2012), a large-scale KG consisting of 17.9M nodes and 87.6M edges extracted from various corpora, which has been shown to be suitable for conceptualization (Song et al., 2011).
A single entity may be abstracted or instantiated in various ways, with different typicality. For example, either Linux or BeOS is an operating system, and a pen is either a writing tool or an enclosure for animals, though in both examples the two choices are not equally common. Since Probase is extracted from real corpora, the frequency of text showing the triple ⟨h, IsA, t⟩ in the source corpora demonstrates how common the relation is. Such frequencies f are given by Probase along with each triple, forming 4-tuples ⟨h, IsA, t, f⟩. The frequency information in Probase allows us to balance between IsA relations of different typicality and to filter out the noise of rare relations in the graph.
Figure 3: A sample of identifying entities, for the ATOMIC triple ⟨h: Person X is on the basketball team, r: xWant, t: to be a professional player⟩. All noun phrases are identified as possible entities on which substitutions are proposed, each linked to candidate concepts (e.g., sport, game, extracurricular activity, sport team, group, club, person, elite player, role model, digital entertainment). The phrases include basketball, basketball team, team, player, and professional player, but not professional, which is tagged as an adjective here.
With C prepared, conceptualization can then be performed on any real-world entity mentioned in the head or tail node of each triple, which could be a single noun or a noun phrase, serving as a subject, an object, or even a modifier, as shown in Figure 3. What adds complexity is that, although nodes in C are all real-world entities in some context, a word or phrase could be used in different manners within a triple of interest. Therefore, for raw text as in ATOMIC, we choose to perform dependency parsing on each node with spaCy (Honnibal and Montani, 2017). We then take all nouns and noun phrases that are also present in C as possible candidates.
The method to identify entities is given in Algorithm 1: we iterate through each noun w as the root of an entity, and choose all continuous sequences of words within the range of the subtree corresponding to w in the dependency tree, to ensure that all entities rooted at w (possibly with different modifiers) are collected. We then query the possible abstractions and instantiations of each candidate in Probase, and, for each result, add the substituted text and the corresponding frequency to the list of results. If the text in the CKG is given in its original form (i.e., not lemmatized, unlike ASER), we further use a set of rules to inflect the returned substitution s, and to modify the determiner (if any) in the returned text, so as to avoid false statistical clues of grammatical mistakes introduced by the substitution.
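The inflection and determiner rules are not specified in detail in the paper; as a minimal sketch, the determiner adjustment might look like the following, assuming a simple vowel-initial a/an heuristic:

```python
def fix_determiner(words):
    """Adjust 'a'/'an' determiners after a substitution, using a simple
    vowel-initial heuristic (a rough approximation, not the paper's
    actual rule set)."""
    fixed = []
    for i, w in enumerate(words):
        nxt = words[i + 1] if i + 1 < len(words) else ""
        if w in ("a", "an") and nxt:
            w = "an" if nxt[0].lower() in "aeiou" else "a"
        fixed.append(w)
    return fixed

print(" ".join(fix_determiner("I want to eat a apple".split())))
# -> I want to eat an apple
```

A real heuristic would also need exceptions (e.g., "a university", "an hour"), which is why a rule set rather than a single check is used.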
2.2 Discriminator

We aim to build a model capable of evaluating whether a triple is valid, possibly with its head or tail conceptualized. However, as in the well-studied field of KG completion, we face the difficult situation that the evaluation must be undertaken using only the positive ground truth present in K, except that in our case not only unseen edges but also unseen nodes need to be considered. For such KG completion tasks, it is commonly assumed that the validity of triples not present in K is unknown, and the method of negative sampling is applied. In this way, synthetic triples are built by substitution (a.k.a. corruption) of the head or tail of present triples, often using random nodes in the KG. These triples are viewed as more likely to be invalid than the original ones, and labelled as members of a negative set D⁻. In combination with the triples in K as the positive set D⁺, the model can be trained on such pseudo-labelled data in a self-supervised manner, and evaluated by the classification accuracy on triples in K held out from D⁺ and the corresponding negative samples generated in the same way (Bordes et al., 2013; Socher et al., 2013; Nickel et al., 2011).

Algorithm 1 IDENTIFYCONCEPTUALIZATION
Input:
    W = [w_1, w_2, ..., w_n]: a node represented by a sequence of words
    P = [p_1, p_2, ..., p_n]: POS tags of the words in W
    D: dependency tree for W
    C = {⟨x, IsA, y, f⟩}: Probase of IsA relations
Result:
    S = [W′_1, W′_2, ...]: list of substituted word sequences
    F = [f_1, f_2, ...]: list of frequencies for each substitution

S ← []
F ← []
for k ∈ [1, n] do
    if p_k ∈ {noun, propn} then
        T ← subtree of w_k in D
        L ← min_{x∈T} {index of x in W}
        R ← max_{x∈T} {index of x in W}
        foreach (l, r) with L ≤ l ≤ k ≤ r ≤ R do
            E ← [w_l, ..., w_r]
            A ← {(s, f) | ⟨E, IsA, s, f⟩ ∈ C}
            I ← {(s, f) | ⟨s, IsA, E, f⟩ ∈ C}
            foreach (s, f) ∈ A ∪ I do
                S.add([w_1, ..., w_{l−1}] + s + [w_{r+1}, ..., w_n])
                F.add(f)
            end
        end
    end
end
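Algorithm 1 can be rendered in Python roughly as follows, assuming POS tags and dependency subtree ranges are precomputed (e.g., by spaCy); the toy `isa` table, the subtree spans, and the frequencies are hypothetical stand-ins for Probase:

```python
def identify_conceptualization(words, pos, subtree, isa):
    """Enumerate candidate substitutions of noun-rooted spans (Algorithm 1).

    words:   token list of the node
    pos:     POS tag per token
    subtree: (L, R) index range of each token's dependency subtree
    isa:     phrase -> list of (substitution, frequency); a real system
             would query both abstractions and instantiations from Probase
    """
    subs, freqs = [], []
    n = len(words)
    for k in range(n):
        if pos[k] not in ("NOUN", "PROPN"):
            continue
        L, R = subtree[k]
        # All contiguous spans inside the subtree that contain the root noun.
        for l in range(L, k + 1):
            for r in range(k, R + 1):
                phrase = " ".join(words[l:r + 1])
                for s, f in isa.get(phrase, []):
                    subs.append(words[:l] + s.split() + words[r + 1:])
                    freqs.append(f)
    return subs, freqs

# The head node of Figure 3; subtree spans and frequencies are made up.
words = ["Person", "X", "is", "on", "the", "basketball", "team"]
pos = ["PROPN", "PROPN", "AUX", "ADP", "DET", "NOUN", "NOUN"]
subtree = [(0, 1), (0, 1), (2, 6), (3, 6), (4, 4), (5, 5), (4, 6)]
isa = {"basketball": [("sport", 50)], "basketball team": [("sport team", 30)]}
subs, freqs = identify_conceptualization(words, pos, subtree, isa)
print(len(subs), "candidate substitutions")  # prints: 2 candidate substitutions
```

Both candidates here rewrite the node as "Person X is on the sport team", one via basketball → sport and one via basketball team → sport team, carrying their respective frequencies.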
Although there may be false negatives, it has been demonstrated that models trained in this way can successfully identify valid triples (though possibly labelled negative) missing from the KG. Moreover, it has been found that, instead of uniform sampling, using negative samples similar to valid ones leads to better performance (Cai and Wang, 2018; Wang et al., 2018; Zhang et al., 2019b). Therefore, we further propose to sample the substitution of a node from its conceptualized versions, which might be missing from the original KG. This fits into previous KG completion methods if we view the conceptualized new nodes as isolated nodes. We expect that in this way the model can better evaluate conceptualizations of triples.
To generate negative samples, two different settings are applied.

1. Node Substitution (NS): The common corruption method, as in Bordes et al. (2013), that substitutes the head or tail (each with 0.5 probability) with a random node from the KG. For a CKG like ATOMIC, in which head nodes and tail nodes, as well as tail nodes from triples with different relations, can often be easily distinguished from each other², we follow Socher et al. (2013) to pick random heads only from other heads, and random tails only from other tails appearing in triples with the same relation.

2. Entity Conceptualization (EC): To enable the model to identify false triples with inappropriate conceptualization, we randomly choose the head or tail (each with 0.5 probability) and corrupt the node as in Section 2.1 by substituting an entity in the node with its abstraction or instantiation. This method ensures that the substituted nodes are often plausible. We then use the triples with the head or tail substituted as negative samples. In particular, we use the frequencies returned by Algorithm 1 as weights (or unnormalized probabilities) from which we sample among the possible conceptualized nodes, as shown in Algorithm 2. In this way, we strike a balance between the diversity and the typicality of the IsA relations used.
Algorithm 2 BUILDSAMPLEEC
Input:
    N: a node, represented by a sequence of words
    P: POS tags of the words in N
    D: dependency tree for N
    C: Probase
Result: N′: corrupted node

S, F ← IDENTIFYCONCEPTUALIZATION(N, P, D, C)
W ← F / ΣF
k ∼ Categorical(W)
N′ ← S_k
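Algorithm 2 amounts to a frequency-weighted random choice over the candidates from Algorithm 1, e.g. via `random.choices`; the candidate strings and weights below are illustrative only:

```python
import random

def build_sample_ec(candidates, frequencies, rng=random):
    """Sample one corrupted node from the conceptualized candidates,
    with probability proportional to its Probase frequency (Algorithm 2)."""
    return rng.choices(candidates, weights=frequencies, k=1)[0]

# Hypothetical candidates for the tail "to get some milk" from Figure 2.
cands = ["to get some beverage", "to get some dairy product"]
freqs = [90, 10]
print(build_sample_ec(cands, freqs))  # usually the more typical candidate
```

`random.choices` normalizes the weights internally, matching the W ← F / ΣF step.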
Building negative samples under both settings, we expect the model to be capable of discriminating whether a triple is valid when the triple may be corrupted in either way. To reduce noise in training and evaluation, negative samples are filtered to ensure that they differ from all positive samples, which matches the filtered setting of Bordes et al. (2013).

² For instance, the head node in ATOMIC always starts with Person X, and tail nodes for the relation type xAttr (attributes of Person X) are often adjectives.
To make the best use of the semantic information in the textual descriptions of nodes, we apply a widely used transformer-based pretrained language model, BERT, as the discriminator. In particular, the structure of our task matches next sentence prediction (NSP) in Devlin et al. (2018). As a result, we follow a similar setting that takes as input pairs of sentences separated by a special [SEP] token and marked by different token type IDs, with h as the first sentence and the concatenation of r and t as the second. Binary classification is then performed by a fully-connected layer taking the final representation corresponding to the [CLS] token, as shown in Figure 4. All parameters in the model, except those in the final fully-connected layer for binary classification, can be initialized from the pretrained model. The model is then fine-tuned on the positive and negative samples mentioned above, mixed at a 1:1 ratio during training, using the binary cross-entropy loss below, based on the output s, a scalar after a logistic sigmoid activation indicating the confidence that the input is valid:
L = − Σ_{(x,y) ∈ D⁺ ∪ D⁻} (y log s + (1 − y) log(1 − s)).   (1)
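Equation (1) is the standard binary cross-entropy summed over the positive and negative sets; as a sanity check, with illustrative scores rather than actual model outputs:

```python
import math

def bce_loss(scores, labels):
    """Binary cross-entropy summed over D+ ∪ D−, as in Equation (1).
    scores: sigmoid outputs s in (0, 1); labels: 1 for positive samples,
    0 for negative samples."""
    return -sum(y * math.log(s) + (1 - y) * math.log(1 - s)
                for s, y in zip(scores, labels))

# A confident, correct discriminator incurs low loss:
print(bce_loss([0.9, 0.1], [1, 0]))  # ≈ 0.211
```

An uncertain discriminator (s = 0.5 everywhere) incurs log 2 per sample, and a confidently wrong one incurs a loss growing without bound as s approaches the wrong extreme.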
Figure 4: Architecture of the BERT-based discriminator model: a pretrained BERT encoder (stacked Transformer blocks) followed by a dense layer. Raw text, e.g. [CLS] I am hungry [SEP] Result I eat lunch [SEP], is fed into the model to predict the binary label y. All layers except the last fully-connected layer are pretrained but not frozen.
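The NSP-style input described above can be assembled as follows before tokenization; the helper names are ours, and a real tokenizer (e.g., from the transformers library) would add the special tokens and token type IDs itself:

```python
def triple_to_pair(h, r, t):
    """Format a triple as a BERT sentence pair: h as the first segment,
    and the concatenation of r and t as the second segment."""
    return h, f"{r} {t}"

def to_bert_input(h, r, t):
    """Render the full input string with special tokens, as in Figure 4.
    (Shown for illustration; a tokenizer would insert [CLS]/[SEP] and
    assign token type IDs 0/1 to the two segments.)"""
    text_a, text_b = triple_to_pair(h, r, t)
    return f"[CLS] {text_a} [SEP] {text_b} [SEP]"

print(to_bert_input("I am hungry", "Result", "I eat lunch"))
# -> [CLS] I am hungry [SEP] Result I eat lunch [SEP]
```

The same formatting applies to both positive triples and the NS/EC negative samples, so the discriminator sees a uniform input scheme.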
3 Experiments
3.1 Datasets
Two different datasets, ATOMIC and ASER, which
are typical CKGs using open-form text as nodes,
are used in our experiments. Earlier CKGs such as
ConceptNet (Speer et al., 2017) are not discussed,
as in ConceptNet, unlike the more recent CKGs,
only simple text, mostly noun phrases, is used as
nodes, and previous work can already reach close-
to-human results (Saito et al., 2018).
3.1.1 ATOMIC
ATOMIC is a CKG of 877K triples on cause-and-
effect relations between everyday activities (Sap
et al., 2019). Within ATOMIC, head nodes, or base
events, are extracted from text corpora, with the
form of some person’s action or state (e.g., Person
X is on the basketball team). The dataset further
categorizes the cause-and-effect relations into nine
types, covering the intention, prerequisites and im-
pacts with regard to the agent Person X and other
people. The tail entities are then written in open-
form text by crowd-sourced workers under these
categories. In this way, a broad range of eventuali-
ties and their relations are covered by the dataset.
We follow the original data split of ATOMIC in our
experiments.
3.1.2 ASER
ASER (Zhang et al., 2019a) is a large-scale CKG of 194M nodes representing verb-centric eventualities matching certain patterns (e.g., s-v-o for I love dogs and s-v-v-o for I want to eat an apple) extracted from various corpora. Relations of 15 types between nodes are then extracted as well, identified by matching the text with language patterns such as "E1, because E2" for Reason and "E1, E2 instead" for ChosenAlternative. In total, 194.0M nodes and 64.4M triples are extracted in ASER. In our experiments, we use the core release of ASER with 27.6M nodes and 10.4M edges. Triples of the CoOccurance type and isolated nodes are further removed to create a smaller and cleaner KG with 1.4M nodes and 1.1M edges. Triples are then randomly split into train, dev, and test sets at an 8:1:1 ratio.
3.2 Settings

To build our CKG Construction by Conceptualization (CCC) discriminator, we follow the scheme for fine-tuning BERT on downstream tasks (Devlin et al., 2018), and use the pretrained 12-layer BERT-base model on GTX1080 GPUs with 8GB memory. To evaluate the impact of the two different ways of producing negative samples given in Section 2.2, and to trade off between the model's capability of discriminating triples in general cases and of specifically identifying inappropriate conceptualizations, we perform experiments with different percentages of negative samples built by conceptualization, i.e., the EC setting. Specifically, models with 50%, 75%, and 87.5% of negative samples created by EC (and the rest by NS) are trained and reported. For evaluation, negative samples are generated for the dev and test samples by both methods as well, forming the EC and NS dev and test sets with 1:1 positive and negative samples. Under EC, triples whose nodes contain no entities to be conceptualized (e.g., I am fine, for which Algorithm 1 returns empty results) are ignored. Nevertheless, 79.65% and 83.56% of the triples in the dev and test sets are collected in the EC set for ATOMIC and ASER respectively, showing that a majority of triples can be conceptualized. Test results of the models at the best EC dev accuracy are reported.
3.3 Baselines

We train COMeT³ and KG-BERT⁴ on the two datasets as our baselines. In particular, our model degenerates into KG-BERT with 0% EC samples, as the NS setting is what KG-BERT is trained under. Since COMeT itself is not a discriminative model, we use its perplexity per token⁵ as the score given to each triple, and use the dev set to find the classification threshold with the best accuracy.
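Turning COMeT's per-token perplexity into a classifier amounts to a one-dimensional threshold search on the dev set; a brute-force sketch, with made-up dev scores:

```python
def best_threshold(scores, labels):
    """Find the score threshold maximizing dev accuracy, treating scores
    at or below the threshold (low perplexity) as positive predictions."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        acc = sum((s <= t) == bool(y) for s, y in zip(scores, labels)) / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Hypothetical dev perplexities; label 1 = valid triple.
ppl = [2.1, 2.8, 5.5, 7.0]
lab = [1, 1, 0, 0]
print(best_threshold(ppl, lab))  # -> (2.8, 1.0)
```

Only thresholds at observed score values need to be checked, since accuracy can only change at those points.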
3.4 Results
3.4.1 Triple Classification
Test accuracies of triple classification on both methods are given in Table 2. As shown by the results, COMeT lacks discriminative power, which is consistent with the results in Malaviya et al. (2019). KG-BERT, which has been successfully applied to traditional KGs, produces satisfactory results on CKGs as well, while our methods perform better than both baselines by a large margin on the EC tests. This demonstrates that introducing conceptualization during training is effective for creating a model capable of identifying false conceptualizations. In particular, the percentage of EC samples in training is critical for the trade-off between the EC and NS tasks: an increased EC percentage leads to better EC results, but the NS results drop. The ATOMIC CCC models reach better results on NS than KG-BERT, which is possibly due to the fact that ATOMIC nodes are mostly about everyday activities, in contrast to ASER, which covers a broader range of topics. Therefore, through EC training, a more diverse set of nodes is seen by the model during training, which can help the model generalize in the NS test.

³ Available at github.com/atcbosselut/comet-commonsense
⁴ Available at github.com/yao8839836/kg-bert
⁵ Unlike the case in Malaviya et al. (2019), conceptualization would not significantly change the length of a triple, so we only use the NORMALIZED setting.

                  ASER              ATOMIC
                  EC      NS        EC      NS
COMeT             0.6388  0.5869    0.6927  0.5730
KG-BERT           0.7091  0.8018    0.7669  0.6575
CCC-50            0.8716  0.7775    0.9016  0.7840
CCC-75            0.8995  0.7250    0.9221  0.7446
CCC-87.5          0.9156  0.6635    0.9355  0.6980
CCC-75-scratch    0.8284  0.5587    0.8579  0.5003
CCC-75-RoBERTa    0.8999  0.6938    0.9305  0.7350

Table 2: Accuracy on the EC and NS test sets for the baselines and our models on the two datasets. CCC denotes our model, with the attached number representing the percentage of EC training samples.
3.4.2 Ablation Studies
We perform ablation studies to examine the im-
portance of pretraining and model selection. With
the model trained from scratch on our task with-
out using pretrained parameters, the performance
significantly drops, as shown in the CCC-75-
scratch results in Table 2. We also attempted to
use RoBERTa, an alternative pretrained language
model that makes improvements on BERT training
and has demonstrated better performance on down-
stream tasks (Liu et al., 2019). However, the results
using the pretrained RoBERTa-base model (CCC-
75-RoBERTa) are generally on par with our model
using BERT. This could be possibly explained by
that BERT is sufficient in our current settings, that
RoBERTa uses a larger batch size while on our
GPU the batch size is more limited, and that the
NSP pretraining task is used in BERT but is absent
in RoBERTa, as NSP exactly matches the input
scheme of our task.
3.4.3 Generations
We generate triples using the test set from ASER
and ATOMIC as seeds by both COMeT with 10-
beam search and the CCC-75 model by conceptu-
alization, and apply various metrics on the results,
as shown in Table 3. Both methods may produce a large number of triples, as given by the number of generations per seed, N/Seed. For diversity, we report Dist-1 (number of distinct words per node), Dist-2 (number of distinct bigrams per node), and Dist-N (number of distinct nodes per node). Due to the different numbers of generated triples, the results are all normalized by the number of nodes.⁶ Novelty is measured by N/T N, the proportion of nodes among all produced nodes that are novel, i.e., not present in the training set, and N/U N, the proportion of novel distinct nodes among all distinct nodes. Moreover, since generative methods may produce nodes of essentially the same meaning with slight changes in form, we also normalize the produced nodes by removing structural words like determiners, auxiliary verbs, pronouns, etc. We then report the results for the metrics above applied to generations after such normalization, denoted as Dist-N-Norm, N/T N-Norm, and N/U N-Norm respectively. Furthermore, samples of generations by both models are shown in Table 4.
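The diversity and novelty metrics can be computed directly from the generated node strings; a sketch with toy generations (the strings below are illustrative, echoing the Table 4 sample):

```python
def dist_n(nodes):
    """Dist-N: number of distinct nodes per generated node."""
    return len(set(nodes)) / len(nodes)

def dist_k(nodes, k):
    """Distinct k-grams per node (Dist-1 for words, Dist-2 for bigrams)."""
    grams = set()
    for node in nodes:
        ws = node.split()
        grams.update(tuple(ws[i:i + k]) for i in range(len(ws) - k + 1))
    return len(grams) / len(nodes)

def novelty(nodes, train_nodes):
    """N/T N: proportion of produced nodes absent from the training set."""
    return sum(n not in train_nodes for n in nodes) / len(nodes)

gens = ["he never gets it", "he does not get it", "he never gets it"]
train = {"he never gets it"}
print(round(dist_n(gens), 2), round(novelty(gens, train), 2))  # -> 0.67 0.33
```

The -Norm variants apply the same functions after stripping structural words from each node, so paraphrases differing only in determiners or auxiliaries collapse into one node.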
It is clearly demonstrated by the diversity results that a majority of generations by COMeT are similar to each other given a certain head node and relation, as the numbers of distinct nodes, words, and bigrams are all relatively low. The novelty results further show that generated nodes are often similar to those seen in the training set as well. In particular, though the original diversity and novelty metrics appear acceptable, which is consistent with Bosselut et al. (2019), the results drop sharply when the generations are normalized. This indicates that COMeT may produce slightly different nodes paraphrasing each other, as shown in Table 4, where four of the five generated tails are similar to each other (saying he does not get it). This is not the case for CCC: its generated nodes mostly discuss different entities, and the results are often diverse and novel.

⁶ Results from Bosselut et al. (2019) on test perplexity are reproduced in our experiments. Differences on diversity metrics are due to the fact that we use 10-beam search instead for a fair comparison.

               ASER                 ATOMIC
               COMeT    CCC-75      COMeT    CCC-75
N/Seed         10       8.28        10       5.02
Dist-N         24.68%   96.57%      6.49%    51.26%
Dist-1         1.62%    6.30%       0.63%    8.34%
Dist-2         10.56%   15.45%      2.87%    49.76%
N/T N          88.48%   98.60%      10.30%   94.65%
N/U N          93.38%   99.18%      69.17%   96.66%
Dist-N-Norm    10.74%   84.16%      4.35%    46.36%
N/T N-Norm     9.37%    86.01%      5.12%    87.95%
N/U N-Norm     65.72%   95.02%      58.71%   93.01%

Table 3: Diversity and novelty results on generations; higher is better. All rows except N/Seed are given in percentages.

         head                                  tail
Seed     another promises him a scholarship    his parents own a successful business
COMeT                                          he never gets it
                                               he does not get it
                                               he never gets one
                                               he could not pay it
                                               he does not receive it
CCC-75   another promises him a grant          his parents own a successful shop
         another promises him an award         his parents own a successful bank
                                               his parents own a successful hotel

Table 4: Samples of ASER generations given the seed. In this sample, the head and tail are connected by the relation Concession, i.e., although.
4 Related Work
Automatic construction of structured KGs is a well-
studied task, and a number of learning-based meth-
ods have been proposed, including KG embedding
methods based on translational distances (Bordes
et al., 2013; Lin et al., 2015; Ji et al., 2015; Wang
et al., 2014; Shang et al., 2019) and semantic match-
ing (Nickel et al., 2011; Socher et al., 2013; Yang
et al., 2014; Trouillon et al., 2016), typically trained
by negative sampling techniques and applied on
tasks like link prediction and triple classification.
Furthermore, graph neural networks can be used to
better capture structural information (Schlichtkrull
et al., 2018), GANs are applied to improve negative
sampling (Cai and Wang, 2018; Wang et al., 2018;
Zhang et al., 2019b) by mining more difficult examples, and textual information from the nodes can be leveraged (Wang and Li, 2016; Xie et al., 2016; Xiao et al., 2017; An et al., 2018).
Textual information is even more critical for CKGs,
whose nodes carry complicated eventualities, of-
ten as open-form text. Therefore, Li et al. (2016)
proposed to score ConceptNet triples with neural
sequence models that take text inputs so as to dis-
cover new triples, while Saito et al. (2018) and
Sap et al. (2019) further proposed to generate tail
nodes with a sequence-to-sequence LSTM model
given the head and relation as inputs. Recently,
powerful pretrained language models such as BERT
and GPT-2 have been proposed (Devlin et al., 2018;
Radford et al., 2019), from which, as observed
by Trinh and Le (2018) and Radford et al. (2019),
rich knowledge including commonsense can be
extracted. Accordingly, various KG construction
approaches have been built on such models as
downstream tasks: KG-BERT fine-tuned BERT
for KG completion tasks like link prediction
and triple classification (Yao et al., 2019); COMeT
used GPT-based models to generate tails (Bosse-
lut et al., 2019); LAMA directly predicted masked
words in triples on various KGs with BERT (Petroni
et al., 2019); Davison et al. (2019) considered both
generating new tails and scoring given triples; and
Malaviya et al. (2019) utilized both structural and
semantic information for CKG construction on link
prediction tasks.
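KG-BERT-style approaches reduce triple classification to sequence classification over a serialized triple: the three components are packed into one text input and a classifier over the pooled representation scores plausibility. A minimal sketch of such serialization follows; the exact template and the example triples are illustrative assumptions, not KG-BERT's verbatim format.

```python
def serialize_triple(head, relation, tail):
    """Pack a (head, relation, tail) triple into a single text sequence:
    segments joined by [SEP] and prefixed by [CLS]. A classifier over the
    [CLS] position would then score the triple's plausibility."""
    return f"[CLS] {head} [SEP] {relation} [SEP] {tail} [SEP]"

# A plausible triple and a corrupted (negative-sampled) one, as text inputs:
pos_text = serialize_triple("PersonX puts the trophy away", "xWant",
                            "to tidy the room")
neg_text = serialize_triple("PersonX puts the trophy away", "xWant",
                            "to eat the suitcase")
print(pos_text)
```

Because both nodes and relations are rendered as plain text, the same input format covers open-form CKG nodes and conventional KG entities alike.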
5 Conclusion
We introduce conceptualization to commonsense
knowledge graph construction and propose a novel
method for the task: generating new triples through
conceptualization and examining them with a dis-
criminator transferred from pretrained language
models. Future studies will focus on strategies for
conceptualization and on its role in natural language
and commonsense under deep learning approaches.
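As a minimal illustration of the conceptualization step described above, an entity mention in a node can be replaced by its concept to synthesize a candidate node; the toy isA mapping below is a hypothetical stand-in for a taxonomy such as Probase (Wu et al., 2012).

```python
# Hypothetical isA mapping, standing in for a large taxonomy like Probase.
ISA = {"trophy": "award", "suitcase": "container"}

def conceptualize(node_text):
    """Yield variants of a node with each known entity mention replaced
    by its concept, producing candidate synthetic nodes."""
    tokens = node_text.split()
    for i, tok in enumerate(tokens):
        if tok in ISA:
            yield " ".join(tokens[:i] + [ISA[tok]] + tokens[i + 1:])

candidates = list(conceptualize("PersonX packs the trophy"))
print(candidates)  # → ['PersonX packs the award']
```

Each synthetic node would then be paired with the original head or tail to form a candidate triple for the discriminator to score.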
References
Bo An, Bo Chen, Xianpei Han, and Le Sun. 2018.
Accurate text-enhanced knowledge graph represen-
tation learning. In Proceedings of the 2018 Confer-
ence of the North American Chapter of the Associ-
ation for Computational Linguistics: Human Lan-
guage Technologies, Volume 1 (Long Papers), pages
745–755, New Orleans, Louisiana. Association for
Computational Linguistics.
Antoine Bordes, Nicolas Usunier, Alberto Garcia-
Durán, Jason Weston, and Oksana Yakhnenko.
2013. Translating embeddings for modeling multi-
relational data. In Proceedings of the 26th Interna-
tional Conference on Neural Information Processing
Systems, pages 2787–2795.
Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chai-
tanya Malaviya, Asli Celikyilmaz, and Yejin Choi.
2019. COMET: Commonsense transformers for au-
tomatic knowledge graph construction. In Proceed-
ings of the 57th Annual Meeting of the Association
for Computational Linguistics, pages 4762–4779,
Florence, Italy. Association for Computational Lin-
guistics.
Liwei Cai and William Yang Wang. 2018. KBGAN:
Adversarial learning for knowledge graph embed-
dings. In Proceedings of the 2018 Conference of the
North American Chapter of the Association for Com-
putational Linguistics: Human Language Technolo-
gies, Volume 1 (Long Papers), pages 1470–1480,
New Orleans, Louisiana. Association for Computa-
tional Linguistics.
Joe Davison, Joshua Feldman, and Alexander Rush.
2019. Commonsense knowledge mining from pre-
trained models. In Proceedings of the 2019 Con-
ference on Empirical Methods in Natural Language
Processing and the 9th International Joint Confer-
ence on Natural Language Processing (EMNLP-
IJCNLP), pages 1173–1178, Hong Kong, China. As-
sociation for Computational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2018. BERT: Pre-training
of deep bidirectional transformers for language
understanding. Computing Research Repository,
arXiv:1810.04805.
Matthew Honnibal and Ines Montani. 2017. spaCy 2:
Natural language understanding with Bloom embed-
dings, convolutional neural networks and incremen-
tal parsing.
Wen Hua, Zhongyuan Wang, Haixun Wang, Kai Zheng,
and Xiaofang Zhou. 2015. Short text understand-
ing through lexical-semantic analysis. In 2015 IEEE
31st International Conference on Data Engineering,
pages 495–506.
Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and
Jun Zhao. 2015. Knowledge graph embedding via
dynamic mapping matrix. In Proceedings of the
53rd Annual Meeting of the Association for Compu-
tational Linguistics and the 7th International Joint
Conference on Natural Language Processing (Vol-
ume 1: Long Papers), pages 687–696, Beijing,
China. Association for Computational Linguistics.
Xiang Li, Aynaz Taheri, Lifu Tu, and Kevin Gimpel.
2016. Commonsense knowledge base completion.
In Proceedings of the 54th Annual Meeting of the As-
sociation for Computational Linguistics (Volume 1:
Long Papers), pages 1445–1455, Berlin, Germany.
Association for Computational Linguistics.
Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and
Xuan Zhu. 2015. Learning entity and relation em-
beddings for knowledge graph completion. In Pro-
ceedings of the Twenty-Ninth AAAI Conference on
Artificial Intelligence, pages 2181–2187.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du,
Mandar Joshi, Danqi Chen, Omer Levy, Mike
Lewis, Luke Zettlemoyer, and Veselin Stoyanov.
2019. RoBERTa: A robustly optimized BERT pre-
training approach. Computing Research Repository,
arXiv:1907.11692.
Chaitanya Malaviya, Chandra Bhagavatula, Antoine
Bosselut, and Yejin Choi. 2019. Commonsense
knowledge base completion with structural and se-
mantic context. Computing Research Repository,
arXiv:1910.02915. Version 2.
Gregory Murphy. 2004. The Big Book of Concepts.
MIT Press, Cambridge, MA.
Maximilian Nickel, Volker Tresp, and Hans-Peter
Kriegel. 2011. A three-way model for collective
learning on multi-relational data. In Proceedings
of the 28th International Conference on Machine
Learning, pages 809–816.
Fabio Petroni, Tim Rocktäschel, Sebastian Riedel,
Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and
Alexander Miller. 2019. Language models as knowl-
edge bases? In Proceedings of the 2019 Confer-
ence on Empirical Methods in Natural Language
Processing and the 9th International Joint Confer-
ence on Natural Language Processing (EMNLP-
IJCNLP), pages 2463–2473, Hong Kong, China. As-
sociation for Computational Linguistics.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan,
Dario Amodei, and Ilya Sutskever. 2019. Language
models are unsupervised multitask learners.
Itsumi Saito, Kyosuke Nishida, Hisako Asano, and
Junji Tomita. 2018. Commonsense knowledge base
completion and generation. In Proceedings of the
22nd Conference on Computational Natural Lan-
guage Learning, pages 141–150, Brussels, Belgium.
Association for Computational Linguistics.
Maarten Sap, Ronan Le Bras, Emily Allaway, Chan-
dra Bhagavatula, Nicholas Lourie, Hannah Rashkin,
Brendan Roof, Noah A Smith, and Yejin Choi. 2019.
ATOMIC: An atlas of machine commonsense for if-
then reasoning. In Proceedings of the Thirty-Third
AAAI Conference on Artificial Intelligence, pages
3027–3035.
Michael Schlichtkrull, Thomas N Kipf, Peter Bloem,
Rianne Van Den Berg, Ivan Titov, and Max Welling.
2018. Modeling relational data with graph convolu-
tional networks. In The Semantic Web: 15th Inter-
national Conference, ESWC 2018, Heraklion, Crete,
Greece, June 3–7, 2018, Proceedings, pages 593–
607.
Chao Shang, Yun Tang, Jing Huang, Jinbo Bi, Xi-
aodong He, and Bowen Zhou. 2019. End-to-end
structure-aware convolutional networks for knowl-
edge base completion. In Proceedings of the Thirty-
Third AAAI Conference on Artificial Intelligence,
pages 3060–3067.
Richard Socher, Danqi Chen, Christopher D Manning,
and Andrew Ng. 2013. Reasoning with neural ten-
sor networks for knowledge base completion. In
Proceedings of the 26th International Conference on
Neural Information Processing Systems, pages 926–
934.
Yangqiu Song, Haixun Wang, Zhongyuan Wang, Hong-
song Li, and Weizhu Chen. 2011. Short text con-
ceptualization using a probabilistic knowledgebase.
In Proceedings of the Twenty-Second International
Joint Conference on Artificial Intelligence, pages
2330–2336.
Robyn Speer, Joshua Chin, and Catherine Havasi. 2017.
ConceptNet 5.5: An open multilingual graph of gen-
eral knowledge. In Proceedings of the Thirty-First
AAAI Conference on Artificial Intelligence, pages
4444–4451.
Trieu H Trinh and Quoc V Le. 2018. A simple method
for commonsense reasoning. Computing Research
Repository, arXiv:1806.02847.
Théo Trouillon, Johannes Welbl, Sebastian Riedel,
Éric Gaussier, and Guillaume Bouchard. 2016. Complex
embeddings for simple link prediction. In Proceed-
ings of the 33rd International Conference on Ma-
chine Learning, pages 2071–2080.
Peifeng Wang, Shuangyin Li, and Rong Pan. 2018. In-
corporating GAN for negative sampling in knowl-
edge representation learning. In Proceedings of the
Thirty-Second AAAI Conference on Artificial Intelli-
gence, pages 2005–2012.
Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng
Chen. 2014. Knowledge graph embedding by trans-
lating on hyperplanes. In Proceedings of the Twenty-
Eighth AAAI Conference on Artificial Intelligence,
pages 1112–1119.
Zhigang Wang and Juanzi Li. 2016. Text-enhanced rep-
resentation learning for knowledge graph. In Pro-
ceedings of the Twenty-Fifth International Joint Con-
ference on Artificial Intelligence, pages 1293–1299.
Zhongyuan Wang, Kejun Zhao, Haixun Wang, Xi-
aofeng Meng, and Ji-Rong Wen. 2015. Query un-
derstanding through knowledge-based conceptual-
ization. In Proceedings of the Twenty-Fourth Inter-
national Joint Conference on Artificial Intelligence,
pages 3264–3270.
Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q
Zhu. 2012. Probase: A probabilistic taxonomy for
text understanding. In Proceedings of the 2012 ACM
SIGMOD International Conference on Management
of Data, pages 481–492.
Han Xiao, Minlie Huang, Lian Meng, and Xiaoyan
Zhu. 2017. SSP: semantic space projection for
knowledge graph embedding with text descriptions.
In Proceedings of the Thirty-First AAAI Conference
on Artificial Intelligence, pages 3104–3110.
Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, and
Maosong Sun. 2016. Representation learning of
knowledge graphs with entity descriptions. In Pro-
ceedings of the Thirtieth AAAI Conference on Artifi-
cial Intelligence, pages 2659–2665.
Bishan Yang, Wen-tau Yih, Xiaodong He, Jian-
feng Gao, and Li Deng. 2014. Embedding en-
tities and relations for learning and inference in
knowledge bases. Computing Research Repository,
arXiv:1412.6575.
Liang Yao, Chengsheng Mao, and Yuan Luo. 2019.
KG-BERT: BERT for knowledge graph completion.
Computing Research Repository, arXiv:1909.03193.
Hongming Zhang, Xin Liu, Haojie Pan, Yangqiu Song,
and Cane Wing-Ki Leung. 2019a. ASER: A large-scale
eventuality knowledge graph. Computing Research
Repository, arXiv:1905.00270.
Yongqi Zhang, Quanming Yao, Yingxia Shao, and Lei
Chen. 2019b. NSCaching: Simple and efficient neg-
ative sampling for knowledge graph embedding. In
2019 IEEE 35th International Conference on Data
Engineering, pages 614–625.