On the Role of Conceptualization in
Commonsense Knowledge Graph Construction

Mutian He¹, Yangqiu Song¹, Kun Xu², Dong Yu²
¹Hong Kong University of Science and Technology
²Tencent
{mhear, yqsong}@cse.ust.hk
{kxkunxu,dyu}@tencent.com
Abstract

Commonsense knowledge graphs (CKGs) like ATOMIC and ASER are substantially different from conventional KGs, as they consist of a much larger number of nodes formed by loosely-structured text. This enables them to handle highly diverse queries in natural language related to commonsense, but leads to unique challenges for automatic KG construction methods. Besides identifying relations absent from the KG between nodes, such methods are also expected to explore absent nodes represented by text, in which different real-world things, or entities, may appear. To deal with the innumerable entities involved with commonsense in the real world, we introduce conceptualization to CKG construction, i.e., viewing entities mentioned in text as instances of specific concepts, or vice versa. We build synthetic triples by conceptualization, and further formulate the task as triple classification, handled by a discriminative model with knowledge transferred from pretrained language models and fine-tuned by negative sampling. Experiments demonstrate that our methods can effectively identify plausible triples and expand the KG with triples of both new nodes and edges of high diversity and novelty.
1 Introduction
Commonsense knowledge, such as knowing that a
trophy could not fit in a suitcase because the tro-
phy is too big, is implicitly acknowledged among
human beings through real-life experience rather
than systematic learning. As a result, artificial in-
telligence meets difficulties in capturing such com-
monsense. To deal with the issue, commonsense
knowledge graphs (CKGs) like ConceptNet (Speer
et al., 2017), ATOMIC (Sap et al., 2019), ASER
(Zhang et al., 2019a), etc. have been proposed.
Such graphs are aimed at collecting and solidifying the implicit commonsense in the form of triples ⟨h, r, t⟩, with the head node h and the tail node t connected by a relation (i.e., edge) r. However,
a key difference between traditional knowledge
graphs (like WordNet, Freebase, etc.) and CKGs is
that commonsense is often difficult to represent as
two strictly formed nodes compared by a specific
relation. Instead, recent approaches represent a node with loosely-structured text, either annotated by humans or retrieved from text corpora. Nodes are then linked by one of several predefined types of relations, as in the triple ⟨h: I'm hungry, r: Result, t: I have lunch⟩ in ASER.¹
                          ASER      ATOMIC
#Nodes                    194.0M    309.5K
#Triples                  64.4M     877.1K
#Relation Types           15        9
Average Degree            0.66      5.67
Entity Coverage           52.33%    6.98%
Average Distinct Entity   0.026     0.082

Table 1: Statistics for recently proposed commonsense knowledge graphs. Entity Coverage is calculated as the proportion of the top 1% most frequent entities in Probase that are mentioned by nodes in each CKG. Average Distinct Entity is the average number of distinct Probase entities per node in each CKG. The core version of ASER is used for these two results.
Such CKGs, storing an exceptionally large number of triples, as shown in Table 1, are capable of representing a much broader range of knowledge and handling flexible queries related to commonsense. However, the complexity of real-world commonsense is still immense. In particular, with innumerable eventualities involved with commonsense in the real world, it is costly for a CKG to cover all of them as nodes; and even when covered, acquiring the corresponding relation between each pair of nodes is of quadratic difficulty. This situation is particularly demonstrated by the sparsity of edges in current CKGs in Table 1: even automatic extraction methods, as in ASER, fail to capture edges between most nodes. Therefore, alternative KG construction methods are needed.

¹ Words in ASER are lemmatized, but in this paper we always show the original text for easier understanding.

arXiv:2003.03239v2 [cs.CL] 7 Apr 2020

Figure 1: A sample of conceptualization in a CKG. Panel (a) shows a commonsense reasoning problem: "The trophy would not fit in the brown suitcase because it was too big. What was too big?" Panel (b) shows a triple existing in the CKG: ⟨h: item not fit in container, r: Reason, t: item is big⟩. Panel (c) shows the conceptualization of the extracted triple ⟨h: trophy not fit in brown suitcase, r: Reason, t: trophy is big⟩, where trophy IsA item and brown suitcase IsA container. Given a commonsense reasoning problem like (a), even though the corresponding triple (c) is not in the KG, a triple (b) which is present in the KG can be used through abstraction. This is done by identifying the real-world entities trophy and brown suitcase in the text of (c), and then substituting them using IsA relations. Following the same idea, (c) can be produced from (b) and included in the CKG via instantiation.
As nodes are represented by text, utilizing se-
mantic information becomes critical for CKG con-
struction. Such semantic information can be lever-
aged by large pretrained language models like
BERT (Devlin et al., 2018) and GPT-2 (Radford
et al., 2019): These models capture rich semantic
knowledge from unsupervised pretraining, which
can be transferred to a wide range of downstream
tasks to reach impressive results. Therefore, efforts
have been made to extract knowledge from such
models for KG construction. COMeT (Bosselut
et al., 2019) is a notable attempt which fine-tuned
GPT on CKG triples to predict tails from heads and
relations. However, it is often observed that such
generative neural models suffer from the diversity
and novelty issue: They tend to overproduce high-
probability samples similar to those in the training
set, while failing to cover a broader range of pos-
sible triples absent in a CKG with diverse entities
involved. In contrast, discriminatory methods in-
corporate language models to KG completion tasks
such as link prediction and triple classification to
evaluate whether a triple is valid, and could be ex-
tended to arbitrary text for triples on various KGs
(Malaviya et al., 2019; Davison et al., 2019; Yao
et al., 2019). However it would be computation-
ally infeasible to identify new plausible triples on
recent large CKGs if we aimlessly explore new
nodes and evaluate each possible triple, without
leveraging existent nodes and triples as in genera-
tive methods. Therefore, all methods above have
certain shortcomings.
In particular, we observe that current methods miss the variation of real-world entities, which is a critical factor for the diversity of nodes. For example, a CKG may cover the node I eat apple and its relevant edges, but the node I eat banana might be missing or its edges incomplete. As shown in Table 1, a large portion of the most common real-world entities in Probase are never mentioned in recent CKGs like ASER, let alone edges related to those entities. Moreover, as demonstrated by the low average number of distinct entities per node, directly expanding the scale of CKGs would not be a cost-effective way to cover diverse entities.
To relieve this issue, we posit the importance of a specific element of human commonsense, conceptualization, which, though found useful for certain natural language understanding tasks (Song et al., 2011; Wang et al., 2015; Hua et al., 2015), has not been investigated in depth in this area. As observed by psychologists, "concepts are the glue that holds our mental world together" (Murphy, 2004). Human beings are able to make reasonable inferences by utilizing the IsA relationship between real-world concepts and instances. For example, without knowing what a floppy disk is, given that it is a memory device, people may infer that it may store data, be readable by a computer, etc. From this viewpoint, instead of directly building triples with countless entities, a CKG can be broadly expanded to handle various queries, as shown in Figure 1, by substituting instances mentioned in text with the corresponding concepts (i.e., abstraction), or vice versa (i.e., instantiation), given an extra CKG of IsA relations.
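As a minimal sketch, the abstraction and instantiation operations can be illustrated with a toy IsA graph; the entries below are hypothetical stand-ins for a real conceptualization KG, chosen to match the Figure 1 example:

```python
# Toy IsA knowledge graph (instance -> concepts); entries are illustrative,
# matching the Figure 1 example rather than a real conceptualization KG.
ISA = {
    "trophy": ["item", "award"],
    "brown suitcase": ["container", "luggage"],
}
# Reverse index for instantiation: concept -> instances.
INSTANCES = {}
for inst, concepts in ISA.items():
    for c in concepts:
        INSTANCES.setdefault(c, []).append(inst)

def abstract(text, entity):
    """Substitute a mentioned instance with each of its concepts."""
    return [text.replace(entity, c) for c in ISA.get(entity, [])]

def instantiate(text, concept):
    """Substitute a mentioned concept with each of its instances."""
    return [text.replace(concept, i) for i in INSTANCES.get(concept, [])]

print(abstract("trophy not fit in brown suitcase", "trophy"))
# includes "item not fit in brown suitcase", as in Figure 1 (b)
```

Applied in reverse, `instantiate("item is big", "item")` produces the concrete tail of Figure 1 (c).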
However, such conceptualization is never strict
induction or deduction that is guaranteed to be true.
Figure 2: A sample from ATOMIC for discriminating conceptualizations. Given the head PersonX eats cookies and the relation xWant (then, PersonX wants), the seed tail is to get some milk. Some conceptualizations, like replacing milk with beverage in the tail node, are valid, while replacing it with dairy product is invalid in this context: with commonsense, one would often want something to drink after eating cookies, while the general concept of dairy is not relevant in such a scenario, and dairy products are not all drinkable.
As shown in Figure 2, determining whether a triple built by conceptualization is reasonable remains a challenging task, requiring both the context within the triple and a broader range of commonsense. Such a discriminative problem can be viewed as a particular case of the well-studied task of KG completion. Therefore, we propose to formulate our problem as a triple classification task, one of the standard tasks for KG completion (Socher et al., 2013). The difference is that, instead of considering arbitrary substitutions of the head or tail with existing nodes, we apply conceptualization as described in Section 2.1, and train our model by negative sampling as discussed in Section 2.2. We leverage the rich semantic information in large pretrained language models by fine-tuning them as discriminators. In this way, the models are expected to take triples with arbitrary nodes as inputs and evaluate whether each triple is reasonable.
To conclude, our contributions are three-fold:

1. We introduce conceptualization to CKG construction to explore a broader range of triples.

2. We formulate conceptualization-based CKG construction as a triple classification task performed on synthetic triples.

3. We propose a method for the task by fine-tuning pretrained language models with negative sampling.

Our code and pipeline are available at github.com/mutiann/ccc.
2 Methodologies
Our methodology of CKG construction is based on the idea that, given a set of ground-truth triples as seeds, new triples can be built from them by abstraction or instantiation of entities mentioned in the head or tail node, i.e., substituting a mentioned entity with a more general or more specific one, using the particular commonsense of IsA relations. Therefore, we need a CKG K, viewed as seeds, and a conceptualization KG C, both denoting a set of triples ⟨h, r, t⟩, while in C, r is always IsA.
2.1 Conceptualization
Since there are diverse ways to conceptualize an entity from commonsense, C must sufficiently cover various real-world entities connected by IsA relations. Therefore, we choose Probase (Wu et al., 2012), a large-scale KG consisting of 17.9M nodes and 87.6M edges extracted from various corpora, which has been shown to be suitable for conceptualization (Song et al., 2011).
A single entity may be abstracted or instantiated in various ways, with different typicality. For example, either Linux or BeOS is an operating system, and a pen is either a writing tool or an enclosure for animals, though in both examples the two choices are not equally common. Since Probase is extracted from real corpora, the frequency of text showing the triple ⟨h, IsA, t⟩ in the source corpora demonstrates how common the relation is. Such frequencies f are given by Probase along with each triple, forming 4-tuples ⟨h, IsA, t, f⟩. The frequency information in Probase allows us to balance between IsA relations of different typicality and to filter out the noise of rare relations in the graph.
Figure 3: A sample of identifying entities, for the ATOMIC triple ⟨h: Person X is on the basketball team, r: xWant, t: to be a professional player⟩. All noun phrases are identified as possible entities on which substitutions are proposed, each linked to candidate concepts (e.g., sport, game, extracurricular activity, sport team, group, club, person, elite player, role model, digital entertainment). The phrases include basketball, basketball team, team, player, and professional player, but not professional, which is tagged as an adjective here.
With C prepared, conceptualization can then be performed on any real-world entity mentioned in the head or tail node of each triple, which could be a single noun or a noun phrase, serving as a subject, an object, or even a modifier, as shown in Figure 3. What adds complexity is that, although nodes in C are all real-world entities in some context, a word or phrase could be used in different manners within a triple of interest. Therefore, for raw text as in ATOMIC, we choose to perform dependency parsing on each node with spaCy (Honnibal and Montani, 2017). We then take all nouns and noun phrases that are also present in C as possible candidates.
The method to identify entities is given in Algorithm 1: we iterate through each noun w as the root of an entity, and choose all continuous sequences of words within the range of the subtree corresponding to w in the dependency tree, to ensure that all entities rooted at w (possibly with different modifiers) are collected. We then query the possible abstractions and instantiations of each candidate in Probase, and, for each result, add the substituted text and the corresponding frequency to the list of results. If the text in the CKG is given in its original form (i.e., not lemmatized, unlike ASER), we further use a set of rules to inflect the returned substitution s, and to modify the determiner (if any) in the returned text, so as to avoid false statistical clues of grammatical mistakes introduced by the substitution.
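The inflection and determiner rules are not specified in detail in the paper; as a minimal sketch, the determiner adjustment might look like the following, assuming a simple vowel-initial a/an heuristic:

```python
def fix_determiner(words):
    """Adjust 'a'/'an' determiners after a substitution, using a simple
    vowel-initial heuristic (a rough approximation, not the paper's
    actual rule set)."""
    fixed = []
    for i, w in enumerate(words):
        nxt = words[i + 1] if i + 1 < len(words) else ""
        if w in ("a", "an") and nxt:
            w = "an" if nxt[0].lower() in "aeiou" else "a"
        fixed.append(w)
    return fixed

print(" ".join(fix_determiner("I want to eat a apple".split())))
# -> I want to eat an apple
```

A real heuristic would also need exceptions (e.g., "a university", "an hour"), which is why a rule set rather than a single check is used.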
2.2 Discriminator

We aim to build a model capable of evaluating whether a triple is valid, possibly with its head or tail conceptualized. However, as in the well-studied field of KG completion, we face the difficult situation that the evaluation must be undertaken using only the positive ground truth present in K, except that in our case not only unseen edges but also unseen nodes need to be considered. For such KG completion tasks, it is commonly assumed that the validity of triples not present in K is unknown, and the method of negative sampling is applied. In this way, synthetic triples are built by substitution (a.k.a. corruption) of the head or tail of present triples, often using random nodes in the KG. These triples are viewed as more likely to be invalid than the original ones, and labelled as members of a negative set D⁻. In combination with the triples in K as the positive set D⁺, the model can be trained on such pseudo-labelled data in a self-supervised manner, and evaluated by the classification accuracy on triples in K held out from D⁺ and the corresponding negative samples generated in the same way (Bordes et al., 2013; Socher et al., 2013; Nickel et al., 2011).

Algorithm 1 IDENTIFYCONCEPTUALIZATION
Input:
    W = [w_1, w_2, ..., w_n]: a node represented by a sequence of words
    P = [p_1, p_2, ..., p_n]: POS tags of the words in W
    D: dependency tree for W
    C = {⟨x, IsA, y, f⟩}: Probase of IsA relations
Result:
    S = [W′_1, W′_2, ...]: list of substituted word sequences
    F = [f_1, f_2, ...]: list of frequencies for each substitution

S ← []
F ← []
for k ∈ [1, n] do
    if p_k ∈ {noun, propn} then
        T ← subtree of w_k in D
        L ← min_{x∈T} {index of x in W}
        R ← max_{x∈T} {index of x in W}
        foreach (l, r) with L ≤ l ≤ k ≤ r ≤ R do
            E ← [w_l, ..., w_r]
            A ← {(s, f) | ⟨E, IsA, s, f⟩ ∈ C}
            I ← {(s, f) | ⟨s, IsA, E, f⟩ ∈ C}
            foreach (s, f) ∈ A ∪ I do
                S.add([w_1, ..., w_{l−1}] + s + [w_{r+1}, ..., w_n])
                F.add(f)
            end
        end
    end
end
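Algorithm 1 can be rendered in Python roughly as follows, assuming POS tags and dependency subtree ranges are precomputed (e.g., by spaCy); the toy `isa` table, the subtree spans, and the frequencies are hypothetical stand-ins for Probase:

```python
def identify_conceptualization(words, pos, subtree, isa):
    """Enumerate candidate substitutions of noun-rooted spans (Algorithm 1).

    words:   token list of the node
    pos:     POS tag per token
    subtree: (L, R) index range of each token's dependency subtree
    isa:     phrase -> list of (substitution, frequency); a real system
             would query both abstractions and instantiations from Probase
    """
    subs, freqs = [], []
    n = len(words)
    for k in range(n):
        if pos[k] not in ("NOUN", "PROPN"):
            continue
        L, R = subtree[k]
        # All contiguous spans inside the subtree that contain the root noun.
        for l in range(L, k + 1):
            for r in range(k, R + 1):
                phrase = " ".join(words[l:r + 1])
                for s, f in isa.get(phrase, []):
                    subs.append(words[:l] + s.split() + words[r + 1:])
                    freqs.append(f)
    return subs, freqs

# The head node of Figure 3; subtree spans and frequencies are made up.
words = ["Person", "X", "is", "on", "the", "basketball", "team"]
pos = ["PROPN", "PROPN", "AUX", "ADP", "DET", "NOUN", "NOUN"]
subtree = [(0, 1), (0, 1), (2, 6), (3, 6), (4, 4), (5, 5), (4, 6)]
isa = {"basketball": [("sport", 50)], "basketball team": [("sport team", 30)]}
subs, freqs = identify_conceptualization(words, pos, subtree, isa)
print(len(subs), "candidate substitutions")  # prints: 2 candidate substitutions
```

Both candidates here rewrite the node as "Person X is on the sport team", one via basketball → sport and one via basketball team → sport team, carrying their respective frequencies.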
Although there may be false negatives, it has been demonstrated that models trained in this way can successfully identify valid triples (though possibly labelled negative) missing from the KG. Moreover, it has been found that, instead of uniform sampling, using negative samples similar to valid ones leads to better performance (Cai and Wang, 2018; Wang et al., 2018; Zhang et al., 2019b). Therefore, we further propose to sample the substitution of a node from its conceptualized versions, which might be missing from the original KG. This fits into previous KG completion methods if we view the conceptualized new nodes as isolated nodes. We expect that in this way the model can better evaluate conceptualizations of triples.
To generate negative samples, two different settings are applied.

1. Node Substitution (NS): The common corruption method, as in Bordes et al. (2013), that substitutes the head or tail (each with 0.5 probability) with a random node from the KG. For a CKG like ATOMIC, in which head nodes and tail nodes, as well as tail nodes from triples with different relations, can often be easily distinguished from each other², we follow Socher et al. (2013) to pick random heads only from other heads, and random tails only from other tails appearing in triples with the same relation.

2. Entity Conceptualization (EC): To enable the model to identify false triples with inappropriate conceptualization, we randomly choose the head or tail (each with 0.5 probability) and corrupt the node as in Section 2.1 by substituting an entity in the node with its abstraction or instantiation. This method ensures that the substituted nodes are often plausible. We then use the triples with the head or tail substituted as negative samples. In particular, we use the frequencies returned by Algorithm 1 as weights (or unnormalized probabilities) from which we sample among the possible conceptualized nodes, as shown in Algorithm 2. In this way, we strike a balance between the diversity and the typicality of the IsA relations used.
Algorithm 2 BUILDSAMPLEEC
Input:
    N: a node, represented by a sequence of words
    P: POS tags of the words in N
    D: dependency tree for N
    C: Probase
Result: N′: corrupted node

S, F ← IDENTIFYCONCEPTUALIZATION(N, P, D, C)
W ← F / ΣF
k ∼ Categorical(W)
N′ ← S_k
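Algorithm 2 amounts to a frequency-weighted random choice over the candidates from Algorithm 1, e.g. via `random.choices`; the candidate strings and weights below are illustrative only:

```python
import random

def build_sample_ec(candidates, frequencies, rng=random):
    """Sample one corrupted node from the conceptualized candidates,
    with probability proportional to its Probase frequency (Algorithm 2)."""
    return rng.choices(candidates, weights=frequencies, k=1)[0]

# Hypothetical candidates for the tail "to get some milk" from Figure 2.
cands = ["to get some beverage", "to get some dairy product"]
freqs = [90, 10]
print(build_sample_ec(cands, freqs))  # usually the more typical candidate
```

`random.choices` normalizes the weights internally, matching the W ← F / ΣF step.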
Building negative samples under both settings, we expect the model to be capable of discriminating whether a triple is valid when the triple may be corrupted in either way. To reduce noise in training and evaluation, negative samples are filtered to ensure that they differ from all positive samples, which matches the filtered setting of Bordes et al. (2013).

² For instance, the head node in ATOMIC always starts with Person X, and tail nodes for the relation type xAttr (attributes of Person X) are often adjectives.
To make the best use of the semantic information in the textual descriptions of nodes, we apply a widely used transformer-based pretrained language model, BERT, as the discriminator. In particular, the structure of our task matches next sentence prediction (NSP) in Devlin et al. (2018). As a result, we follow a similar setting that takes as input pairs of sentences separated by a special [SEP] token and marked by different token type IDs, with h as the first sentence and the concatenation of r and t as the second. Binary classification is then performed by a fully-connected layer taking the final representation corresponding to the [CLS] token, as shown in Figure 4. All parameters in the model, except those in the final fully-connected layer for binary classification, can be initialized from the pretrained model. The model is then fine-tuned on the positive and negative samples mentioned above, mixed at a 1:1 ratio during training, using the binary cross-entropy loss below, based on the output s, a scalar after a logistic sigmoid activation indicating the confidence that the input is valid:
L = − Σ_{(x,y) ∈ D⁺ ∪ D⁻} (y log s + (1 − y) log(1 − s)).   (1)
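Equation (1) is the standard binary cross-entropy summed over the positive and negative sets; as a sanity check, with illustrative scores rather than actual model outputs:

```python
import math

def bce_loss(scores, labels):
    """Binary cross-entropy summed over D+ ∪ D−, as in Equation (1).
    scores: sigmoid outputs s in (0, 1); labels: 1 for positive samples,
    0 for negative samples."""
    return -sum(y * math.log(s) + (1 - y) * math.log(1 - s)
                for s, y in zip(scores, labels))

# A confident, correct discriminator incurs low loss:
print(bce_loss([0.9, 0.1], [1, 0]))  # ≈ 0.211
```

An uncertain discriminator (s = 0.5 everywhere) incurs log 2 per sample, and a confidently wrong one incurs a loss growing without bound as s approaches the wrong extreme.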
Figure 4: Architecture of the BERT-based discriminator model: a pretrained BERT encoder (stacked Transformer blocks) followed by a dense layer. Raw text, e.g. [CLS] I am hungry [SEP] Result I eat lunch [SEP], is fed into the model to predict the binary label y. All layers except the last fully-connected layer are pretrained but not frozen.
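The NSP-style input described above can be assembled as follows before tokenization; the helper names are ours, and a real tokenizer (e.g., from the transformers library) would add the special tokens and token type IDs itself:

```python
def triple_to_pair(h, r, t):
    """Format a triple as a BERT sentence pair: h as the first segment,
    and the concatenation of r and t as the second segment."""
    return h, f"{r} {t}"

def to_bert_input(h, r, t):
    """Render the full input string with special tokens, as in Figure 4.
    (Shown for illustration; a tokenizer would insert [CLS]/[SEP] and
    assign token type IDs 0/1 to the two segments.)"""
    text_a, text_b = triple_to_pair(h, r, t)
    return f"[CLS] {text_a} [SEP] {text_b} [SEP]"

print(to_bert_input("I am hungry", "Result", "I eat lunch"))
# -> [CLS] I am hungry [SEP] Result I eat lunch [SEP]
```

The same formatting applies to both positive triples and the NS/EC negative samples, so the discriminator sees a uniform input scheme.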
3 Experiments
3.1 Datasets
Two different datasets, ATOMIC and ASER, which
are typical CKGs using open-form text as nodes,
are used in our experiments. Earlier CKGs such as
ConceptNet (Speer et al., 2017) are not discussed,
as in ConceptNet, unlike the more recent CKGs,
only simple text, mostly noun phrases, is used as
nodes, and previous work can already reach close-
to-human results (Saito et al., 2018).
3.1.1 ATOMIC
ATOMIC is a CKG of 877K triples on cause-and-
effect relations between everyday activities (Sap
et al., 2019). Within ATOMIC, head nodes, or base
events, are extracted from text corpora, with the
form of some person’s action or state (e.g., Person
X is on the basketball team). The dataset further
categorizes the cause-and-effect relations into nine
types, covering the intention, prerequisites and im-
pacts with regard to the agent Person X and other
people. The tail entities are then written in open-
form text by crowd-sourced workers under these
categories. In this way, a broad range of eventuali-
ties and their relations are covered by the dataset.
We follow the original data split of ATOMIC in our
experiments.
3.1.2 ASER
ASER (Zhang et al., 2019a) is a large-scale CKG of 194M nodes representing verb-centric eventualities matching certain patterns (e.g., s-v-o for I love dogs and s-v-v-o for I want to eat an apple) extracted from various corpora. Relations of 15 types between nodes are then extracted as well, identified by matching the text with language patterns such as "E1, because E2" for Reason and "E1, E2 instead" for ChosenAlternative. In total, 194.0M nodes and 64.4M triples are extracted in ASER. In our experiments, we use the core release of ASER with 27.6M nodes and 10.4M edges. Triples of the CoOccurance type and isolated nodes are further removed to create a smaller and cleaner KG with 1.4M nodes and 1.1M edges. Triples are then randomly split into train, dev, and test sets at an 8:1:1 ratio.
3.2 Settings

To build our CKG Construction by Conceptualization (CCC) discriminator, we follow the scheme for fine-tuning BERT on downstream tasks (Devlin et al., 2018), and use the pretrained 12-layer BERT-base model on GTX1080 GPUs with 8GB memory. To evaluate the impact of the two different ways of producing negative samples given in Section 2.2, and to trade off between the model's capability of discriminating triples in general cases and of specifically identifying inappropriate conceptualizations, we perform experiments with different percentages of negative samples built by conceptualization, i.e., the EC setting. Specifically, models with 50%, 75%, and 87.5% of negative samples created by EC (and the rest by NS) are trained and reported. For evaluation, negative samples are generated for the dev and test samples by both methods as well, forming the EC and NS dev and test sets with 1:1 positive and negative samples. Under EC, triples whose nodes contain no entities to be conceptualized (e.g., I am fine, for which Algorithm 1 returns empty results) are ignored. Nevertheless, 79.65% and 83.56% of the triples in the dev and test sets are collected in the EC set for ATOMIC and ASER respectively, showing that a majority of triples can be conceptualized. Test results of the models at the best EC dev accuracy are reported.
3.3 Baselines

We train COMeT³ and KG-BERT⁴ on the two datasets as our baselines. In particular, our model degenerates into KG-BERT with 0% EC samples, as the NS setting is what KG-BERT is trained under. Since COMeT itself is not a discriminative model, we use its perplexity per token⁵ as the score given to each triple, and use the dev set to find the classification threshold with the best accuracy.
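Turning COMeT's per-token perplexity into a classifier amounts to a one-dimensional threshold search on the dev set; a brute-force sketch, with made-up dev scores:

```python
def best_threshold(scores, labels):
    """Find the score threshold maximizing dev accuracy, treating scores
    at or below the threshold (low perplexity) as positive predictions."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        acc = sum((s <= t) == bool(y) for s, y in zip(scores, labels)) / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Hypothetical dev perplexities; label 1 = valid triple.
ppl = [2.1, 2.8, 5.5, 7.0]
lab = [1, 1, 0, 0]
print(best_threshold(ppl, lab))  # -> (2.8, 1.0)
```

Only thresholds at observed score values need to be checked, since accuracy can only change at those points.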
3.4 Results
3.4.1 Triple Classification
Test accuracies of triple classification on both methods are given in Table 2. As shown by the results, COMeT lacks discriminative power, which is consistent with the results in Malaviya et al. (2019). KG-BERT, which has been successfully applied to traditional KGs, produces satisfactory results on CKGs as well, while our methods perform better than both baselines by a large margin on the EC tests. This demonstrates that introducing conceptualization during training is effective for creating a model capable of identifying false conceptualizations. In particular, the percentage of EC samples in training is critical for the trade-off between the EC and NS tasks: an increased EC percentage leads to better EC results, but the NS results drop. The ATOMIC CCC models reach better results on NS than KG-BERT, which is possibly due to the fact that ATOMIC nodes are mostly about everyday activities, in contrast to ASER, which covers a broader range of topics. Therefore, through EC training, a more diverse set of nodes is seen by the model during training, which can help the model generalize in the NS test.

³ Available at github.com/atcbosselut/comet-commonsense
⁴ Available at github.com/yao8839836/kg-bert
⁵ Unlike the case in Malaviya et al. (2019), conceptualization would not significantly change the length of a triple, so we only use the NORMALIZED setting.

                  ASER              ATOMIC
                  EC      NS        EC      NS
COMeT             0.6388  0.5869    0.6927  0.5730
KG-BERT           0.7091  0.8018    0.7669  0.6575
CCC-50            0.8716  0.7775    0.9016  0.7840
CCC-75            0.8995  0.7250    0.9221  0.7446
CCC-87.5          0.9156  0.6635    0.9355  0.6980
CCC-75-scratch    0.8284  0.5587    0.8579  0.5003
CCC-75-RoBERTa    0.8999  0.6938    0.9305  0.7350

Table 2: Accuracy on the EC and NS test sets for the baselines and our models on the two datasets. CCC denotes our model, with the attached number representing the percentage of EC training samples.
3.4.2 Ablation Studies
We perform ablation studies to examine the im-
portance of pretraining and model selection. With
the model trained from scratch on our task with-
out using pretrained parameters, the performance
significantly drops, as shown in the CCC-75-
scratch results in Table 2. We also attempted to
use RoBERTa, an alternative pretrained language
model that makes improvements on BERT training
and has demonstrated better performance on down-
stream tasks (Liu et al., 2019). However, the results
using the pretrained RoBERTa-base model (CCC-
75-RoBERTa) are generally on par with our model
using BERT. This could be possibly explained by
that BERT is sufficient in our current settings, that
RoBERTa uses a larger batch size while on our
GPU the batch size is more limited, and that the
NSP pretraining task is used in BERT but is absent
in RoBERTa, as NSP exactly matches the input
scheme of our task.
3.4.3 Generations
We generate triples using the test set from ASER
and ATOMIC as seeds by both COMeT with 10-
beam search and the CCC-75 model by conceptu-
alization, and apply various metrics on the results,
as shown in Table 3. Both methods may produce a large number of triples, as given by the number of generations per seed, N/Seed. For diversity, we report Dist-1 (number of distinct words per node), Dist-2 (number of distinct bigrams per node), and Dist-N (number of distinct nodes per node). Due to the different numbers of generated triples, the results are all normalized by the number of nodes.⁶ Novelty is measured by N/T N, the proportion of nodes among all produced nodes that are novel, i.e., not present in the training set, and N/U N, the proportion of novel distinct nodes among all distinct nodes. Moreover, since generative methods may produce nodes of essentially the same meaning with slight changes in form, we also normalize the produced nodes by removing structural words like determiners, auxiliary verbs, pronouns, etc. We then report the results for the metrics above applied to generations after such normalization, denoted as Dist-N-Norm, N/T N-Norm, and N/U N-Norm respectively. Furthermore, samples of generations by both models are shown in Table 4.
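The diversity and novelty metrics can be computed directly from the generated node strings; a sketch with toy generations (the strings below are illustrative, echoing the Table 4 sample):

```python
def dist_n(nodes):
    """Dist-N: number of distinct nodes per generated node."""
    return len(set(nodes)) / len(nodes)

def dist_k(nodes, k):
    """Distinct k-grams per node (Dist-1 for words, Dist-2 for bigrams)."""
    grams = set()
    for node in nodes:
        ws = node.split()
        grams.update(tuple(ws[i:i + k]) for i in range(len(ws) - k + 1))
    return len(grams) / len(nodes)

def novelty(nodes, train_nodes):
    """N/T N: proportion of produced nodes absent from the training set."""
    return sum(n not in train_nodes for n in nodes) / len(nodes)

gens = ["he never gets it", "he does not get it", "he never gets it"]
train = {"he never gets it"}
print(round(dist_n(gens), 2), round(novelty(gens, train), 2))  # -> 0.67 0.33
```

The -Norm variants apply the same functions after stripping structural words from each node, so paraphrases differing only in determiners or auxiliaries collapse into one node.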
It is clearly demonstrated by the diversity results that a majority of generations by COMeT are similar to each other given a certain head node and relation, as the numbers of distinct nodes, words, and bigrams are all relatively low. The novelty results further show that generated nodes are often similar to those seen in the training set as well. In particular, though the original diversity and novelty metrics appear acceptable, which is consistent with Bosselut et al. (2019), the results drop sharply when the generations are normalized. This indicates that COMeT may produce slightly different nodes paraphrasing each other, as shown in Table 4, where four of the five generated tails are similar to each other (saying he does not get it). This is not the case for CCC: its generated nodes mostly discuss different entities, and the results are often diverse and novel.

⁶ Results from Bosselut et al. (2019) on test perplexity are reproduced in our experiments. Differences on diversity metrics are due to the fact that we use 10-beam search instead for a fair comparison.

               ASER                 ATOMIC
               COMeT    CCC-75      COMeT    CCC-75
N/Seed         10       8.28        10       5.02
Dist-N         24.68%   96.57%      6.49%    51.26%
Dist-1         1.62%    6.30%       0.63%    8.34%
Dist-2         10.56%   15.45%      2.87%    49.76%
N/T N          88.48%   98.60%      10.30%   94.65%
N/U N          93.38%   99.18%      69.17%   96.66%
Dist-N-Norm    10.74%   84.16%      4.35%    46.36%
N/T N-Norm     9.37%    86.01%      5.12%    87.95%
N/U N-Norm     65.72%   95.02%      58.71%   93.01%

Table 3: Diversity and novelty results on generations; higher is better. All rows except N/Seed are given in percentages.

         head                                  tail
Seed     another promises him a scholarship    his parents own a successful business
COMeT                                          he never gets it
                                               he does not get it
                                               he never gets one
                                               he could not pay it
                                               he does not receive it
CCC-75   another promises him a grant          his parents own a successful shop
         another promises him an award         his parents own a successful bank
                                               his parents own a successful hotel

Table 4: Samples of ASER generations given the seed. In this sample, the head and tail are connected by the relation Concession, i.e., although.
4 Related Work
Automatic construction of structured KGs is a well-
studied task, and a number of learning-based meth-
ods have been proposed, including KG embedding
methods based on translational distances (Bordes
et al., 2013; Lin et al., 2015; Ji et al., 2015; Wang
et al., 2014; Shang et al., 2019) and semantic match-
ing (Nickel et al., 2011; Socher et al., 2013; Yang
et al., 2014; Trouillon et al., 2016), typically trained
by negative sampling techniques and applied on
tasks like link prediction and triple classification.
Furthermore, graph neural networks can be used to
better capture structural information (Schlichtkrull
et al., 2018), GANs are applied to improve negative
sampling (Cai and Wang, 2018; Wang et al., 2018;
Zhang et al., 2019b) by mining more difficult examples, and textual information from the nodes can be leveraged (Wang and Li, 2016; Xie et al., 2016; Xiao et al., 2017; An et al., 2018).
Textual information is even more critical for CKGs,
whose nodes carry complicated eventualities, of-
ten as open-form text. Therefore, Li et al. (2016)
proposed to score ConceptNet triples with neural
sequence models that take text inputs so as to dis-
cover new triples, while Saito et al. (2018) and
Sap et al. (2019) further proposed to generate tail
nodes with a sequence-to-sequence LSTM model
given the head and relation as inputs. Recently,
powerful pretrained language models such as BERT
and GPT-2 have been proposed (Devlin et al., 2018;
Radford et al., 2019), from which, as observed
by Trinh and Le (2018) and Radford et al. (2019),
rich knowledge including commonsense can be
extracted. Accordingly, various KG construction
approaches have been built on such models as
downstream tasks: KG-BERT fine-tuned BERT
for KG completion tasks like link prediction
and triple classification (Yao et al., 2019); COMeT
used GPT-based models to generate tails (Bosse-
lut et al., 2019); LAMA directly predicted masked
words in triples on various KGs with BERT (Petroni
et al., 2019); Davison et al. (2019) considered both
generating new tails and scoring given triples; and
Malaviya et al. (2019) utilized both structural and
semantic information for CKG construction on link
prediction tasks.
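KG-BERT-style approaches reduce triple classification to sequence classification over a serialized triple: the three components are packed into one text input and a classifier over the pooled representation scores plausibility. A minimal sketch of such serialization follows; the exact template and the example triples are illustrative assumptions, not KG-BERT's verbatim format.

```python
def serialize_triple(head, relation, tail):
    """Pack a (head, relation, tail) triple into a single text sequence:
    segments joined by [SEP] and prefixed by [CLS]. A classifier over the
    [CLS] position would then score the triple's plausibility."""
    return f"[CLS] {head} [SEP] {relation} [SEP] {tail} [SEP]"

# A plausible triple and a corrupted (negative-sampled) one, as text inputs:
pos_text = serialize_triple("PersonX puts the trophy away", "xWant",
                            "to tidy the room")
neg_text = serialize_triple("PersonX puts the trophy away", "xWant",
                            "to eat the suitcase")
print(pos_text)
```

Because both nodes and relations are rendered as plain text, the same input format covers open-form CKG nodes and conventional KG entities alike.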
5 Conclusion
We introduce conceptualization to commonsense
knowledge graph construction and propose a novel
method for the task: generating new triples through
conceptualization and examining them with a dis-
criminator transferred from pretrained language
models. Future studies will focus on strategies for
conceptualization and on its role in natural language
and commonsense under deep learning approaches.
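As a minimal illustration of the conceptualization step described above, an entity mention in a node can be replaced by its concept to synthesize a candidate node; the toy isA mapping below is a hypothetical stand-in for a taxonomy such as Probase (Wu et al., 2012).

```python
# Hypothetical isA mapping, standing in for a large taxonomy like Probase.
ISA = {"trophy": "award", "suitcase": "container"}

def conceptualize(node_text):
    """Yield variants of a node with each known entity mention replaced
    by its concept, producing candidate synthetic nodes."""
    tokens = node_text.split()
    for i, tok in enumerate(tokens):
        if tok in ISA:
            yield " ".join(tokens[:i] + [ISA[tok]] + tokens[i + 1:])

candidates = list(conceptualize("PersonX packs the trophy"))
print(candidates)  # → ['PersonX packs the award']
```

Each synthetic node would then be paired with the original head or tail to form a candidate triple for the discriminator to score.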
References
Bo An, Bo Chen, Xianpei Han, and Le Sun. 2018.
Accurate text-enhanced knowledge graph represen-
tation learning. In Proceedings of the 2018 Confer-
ence of the North American Chapter of the Associ-
ation for Computational Linguistics: Human Lan-
guage Technologies, Volume 1 (Long Papers), pages
745–755, New Orleans, Louisiana. Association for
Computational Linguistics.
Antoine Bordes, Nicolas Usunier, Alberto Garcia-
Durán, Jason Weston, and Oksana Yakhnenko.
2013. Translating embeddings for modeling multi-
relational data. In Proceedings of the 26th Interna-
tional Conference on Neural Information Processing
Systems, pages 2787–2795.
Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chai-
tanya Malaviya, Asli Celikyilmaz, and Yejin Choi.
2019. COMET: Commonsense transformers for au-
tomatic knowledge graph construction. In Proceed-
ings of the 57th Annual Meeting of the Association
for Computational Linguistics, pages 4762–4779,
Florence, Italy. Association for Computational Lin-
guistics.
Liwei Cai and William Yang Wang. 2018. KBGAN:
Adversarial learning for knowledge graph embed-
dings. In Proceedings of the 2018 Conference of the
North American Chapter of the Association for Com-
putational Linguistics: Human Language Technolo-
gies, Volume 1 (Long Papers), pages 1470–1480,
New Orleans, Louisiana. Association for Computa-
tional Linguistics.
Joe Davison, Joshua Feldman, and Alexander Rush.
2019. Commonsense knowledge mining from pre-
trained models. In Proceedings of the 2019 Con-
ference on Empirical Methods in Natural Language
Processing and the 9th International Joint Confer-
ence on Natural Language Processing (EMNLP-
IJCNLP), pages 1173–1178, Hong Kong, China. As-
sociation for Computational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2018. BERT: Pre-training
of deep bidirectional transformers for language
understanding. Computing Research Repository,
arXiv:1810.04805.
Matthew Honnibal and Ines Montani. 2017. spaCy 2:
Natural language understanding with Bloom embed-
dings, convolutional neural networks and incremen-
tal parsing.
Wen Hua, Zhongyuan Wang, Haixun Wang, Kai Zheng,
and Xiaofang Zhou. 2015. Short text understand-
ing through lexical-semantic analysis. In 2015 IEEE
31st International Conference on Data Engineering,
pages 495–506.
Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and
Jun Zhao. 2015. Knowledge graph embedding via
dynamic mapping matrix. In Proceedings of the
53rd Annual Meeting of the Association for Compu-
tational Linguistics and the 7th International Joint
Conference on Natural Language Processing (Vol-
ume 1: Long Papers), pages 687–696, Beijing,
China. Association for Computational Linguistics.
Xiang Li, Aynaz Taheri, Lifu Tu, and Kevin Gimpel.
2016. Commonsense knowledge base completion.
In Proceedings of the 54th Annual Meeting of the As-
sociation for Computational Linguistics (Volume 1:
Long Papers), pages 1445–1455, Berlin, Germany.
Association for Computational Linguistics.
Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and
Xuan Zhu. 2015. Learning entity and relation em-
beddings for knowledge graph completion. In Pro-
ceedings of the Twenty-Ninth AAAI Conference on
Artificial Intelligence, pages 2181–2187.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du,
Mandar Joshi, Danqi Chen, Omer Levy, Mike
Lewis, Luke Zettlemoyer, and Veselin Stoyanov.
2019. RoBERTa: A robustly optimized BERT pre-
training approach. Computing Research Repository,
arXiv:1907.11692.
Chaitanya Malaviya, Chandra Bhagavatula, Antoine
Bosselut, and Yejin Choi. 2019. Commonsense
knowledge base completion with structural and se-
mantic context. Computing Research Repository,
arXiv:1910.02915. Version 2.
Gregory Murphy. 2004. The Big Book of Concepts.
MIT Press, Cambridge, MA.
Maximilian Nickel, Volker Tresp, and Hans-Peter
Kriegel. 2011. A three-way model for collective
learning on multi-relational data. In Proceedings
of the 28th International Conference on Machine
Learning, pages 809–816.
Fabio Petroni, Tim Rocktäschel, Sebastian Riedel,
Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and
Alexander Miller. 2019. Language models as knowl-
edge bases? In Proceedings of the 2019 Confer-
ence on Empirical Methods in Natural Language
Processing and the 9th International Joint Confer-
ence on Natural Language Processing (EMNLP-
IJCNLP), pages 2463–2473, Hong Kong, China. As-
sociation for Computational Linguistics.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan,
Dario Amodei, and Ilya Sutskever. 2019. Language
models are unsupervised multitask learners.
Itsumi Saito, Kyosuke Nishida, Hisako Asano, and
Junji Tomita. 2018. Commonsense knowledge base
completion and generation. In Proceedings of the
22nd Conference on Computational Natural Lan-
guage Learning, pages 141–150, Brussels, Belgium.
Association for Computational Linguistics.
Maarten Sap, Ronan Le Bras, Emily Allaway, Chan-
dra Bhagavatula, Nicholas Lourie, Hannah Rashkin,
Brendan Roof, Noah A Smith, and Yejin Choi. 2019.
ATOMIC: An atlas of machine commonsense for if-
then reasoning. In Proceedings of the Thirty-Third
AAAI Conference on Artificial Intelligence, pages
3027–3035.
Michael Schlichtkrull, Thomas N Kipf, Peter Bloem,
Rianne Van Den Berg, Ivan Titov, and Max Welling.
2018. Modeling relational data with graph convolu-
tional networks. In The Semantic Web: 15th Inter-
national Conference, ESWC 2018, Heraklion, Crete,
Greece, June 3–7, 2018, Proceedings, pages 593–
607.
Chao Shang, Yun Tang, Jing Huang, Jinbo Bi, Xi-
aodong He, and Bowen Zhou. 2019. End-to-end
structure-aware convolutional networks for knowl-
edge base completion. In Proceedings of the Thirty-
Third AAAI Conference on Artificial Intelligence,
pages 3060–3067.
Richard Socher, Danqi Chen, Christopher D Manning,
and Andrew Ng. 2013. Reasoning with neural ten-
sor networks for knowledge base completion. In
Proceedings of the 26th International Conference on
Neural Information Processing Systems, pages 926–
934.
Yangqiu Song, Haixun Wang, Zhongyuan Wang, Hong-
song Li, and Weizhu Chen. 2011. Short text con-
ceptualization using a probabilistic knowledgebase.
In Proceedings of the Twenty-Second International
Joint Conference on Artificial Intelligence, pages
2330–2336.
Robyn Speer, Joshua Chin, and Catherine Havasi. 2017.
ConceptNet 5.5: An open multilingual graph of gen-
eral knowledge. In Proceedings of the Thirty-First
AAAI Conference on Artificial Intelligence, pages
4444–4451.
Trieu H Trinh and Quoc V Le. 2018. A simple method
for commonsense reasoning. Computing Research
Repository, arXiv:1806.02847.
Théo Trouillon, Johannes Welbl, Sebastian Riedel,
Éric Gaussier, and Guillaume Bouchard. 2016. Complex
embeddings for simple link prediction. In Proceed-
ings of the 33rd International Conference on Ma-
chine Learning, pages 2071–2080.
Peifeng Wang, Shuangyin Li, and Rong Pan. 2018. In-
corporating GAN for negative sampling in knowl-
edge representation learning. In Proceedings of the
Thirty-Second AAAI Conference on Artificial Intelli-
gence, pages 2005–2012.
Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng
Chen. 2014. Knowledge graph embedding by trans-
lating on hyperplanes. In Proceedings of the Twenty-
Eighth AAAI Conference on Artificial Intelligence,
pages 1112–1119.
Zhigang Wang and Juanzi Li. 2016. Text-enhanced rep-
resentation learning for knowledge graph. In Pro-
ceedings of the Twenty-Fifth International Joint Con-
ference on Artificial Intelligence, pages 1293–1299.
Zhongyuan Wang, Kejun Zhao, Haixun Wang, Xi-
aofeng Meng, and Ji-Rong Wen. 2015. Query un-
derstanding through knowledge-based conceptual-
ization. In Proceedings of the Twenty-Fourth Inter-
national Joint Conference on Artificial Intelligence,
pages 3264–3270.
Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q
Zhu. 2012. Probase: A probabilistic taxonomy for
text understanding. In Proceedings of the 2012 ACM
SIGMOD International Conference on Management
of Data, pages 481–492.
Han Xiao, Minlie Huang, Lian Meng, and Xiaoyan
Zhu. 2017. SSP: semantic space projection for
knowledge graph embedding with text descriptions.
In Proceedings of the Thirty-First AAAI Conference
on Artificial Intelligence, pages 3104–3110.
Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, and
Maosong Sun. 2016. Representation learning of
knowledge graphs with entity descriptions. In Pro-
ceedings of the Thirtieth AAAI Conference on Artifi-
cial Intelligence, pages 2659–2665.
Bishan Yang, Wen-tau Yih, Xiaodong He, Jian-
feng Gao, and Li Deng. 2014. Embedding en-
tities and relations for learning and inference in
knowledge bases. Computing Research Repository,
arXiv:1412.6575.
Liang Yao, Chengsheng Mao, and Yuan Luo. 2019.
KG-BERT: BERT for knowledge graph completion.
Computing Research Repository, arXiv:1909.03193.
Hongming Zhang, Xin Liu, Haojie Pan, Yangqiu Song,
and Cane Wing-Ki Leung. 2019a. ASER: A large-scale
eventuality knowledge graph. Computing Research
Repository, arXiv:1905.00270.
Yongqi Zhang, Quanming Yao, Yingxia Shao, and Lei
Chen. 2019b. NSCaching: Simple and efficient neg-
ative sampling for knowledge graph embedding. In
2019 IEEE 35th International Conference on Data
Engineering, pages 614–625.