Do Language Models Perform Generalizable Commonsense Inference?
Peifeng Wang 1,2, Filip Ilievski 2, Muhao Chen 1,2, Xiang Ren 1,2
1 Department of Computer Science, University of Southern California
2 Information Sciences Institute, University of Southern California
{peifengw,muhaoche,xiangren}@usc.edu, [email protected]
Abstract
Inspired by evidence that pretrained language
models (LMs) encode commonsense knowl-
edge, recent work has applied LMs to auto-
matically populate commonsense knowledge
graphs (CKGs). However, there is a lack of
understanding on their generalization to mul-
tiple CKGs, unseen relations, and novel enti-
ties. This paper analyzes the ability of LMs
to perform generalizable commonsense infer-
ence, in terms of knowledge capacity, transfer-
ability, and induction. Our experiments with
these three aspects show that: (1) LMs can
adapt to different schemas defined by multiple
CKGs but fail to reuse the knowledge to gen-
eralize to new relations. (2) Adapted LMs gen-
eralize well to unseen subjects, but less so on
novel objects. Future work should investigate
how to improve the transferability and induc-
tion of commonsense mining from LMs.
1 Introduction
Large-scale commonsense knowledge graphs
(CKGs), like ConceptNet (Speer et al., 2017) and
ATOMIC (Sap et al., 2019), store structured knowl-
edge that can benefit various knowledge-driven
applications. Given the usefulness of CKGs, but
also their inability to flexibly provide information,
(Paulheim, 2018), recent work has paid much at-
tention to populating CKGs with commonsense
knowledge mined from pretrained language models
(LMs) (Wang et al., 2020c; Bosselut et al., 2019).
Enhancing the knowledge of CKGs is essential
to support reasoning on downstream tasks (Talmor
et al., 2019; Wang et al., 2020b; Young et al., 2018).
The task of completing CKGs has typically been
posed as commonsense knowledge inference, where
the goal is to predict the object of a fact triplet,
given its subject and a relation (predicate) (Petroni
¹ The code is available at https://github.com/wangpf3/LM-for-CommonsenseInference.
[Figure 1; panels: 0. Single CKG, 1. Multi-task, 2. Transfer Learning, 3. Low-resource]
Figure 1: Unlike previous studies that adapt an LM to one single CKG (0), we investigate three aspects of LM generalizability: (1) knowledge capacity by multi-task learning, (2) transferability by transfer learning, and (3) induction by controlled low-resource learning.
et al., 2019; Bosselut et al., 2019). Commonsense
inference techniques, such as COMET (Bosse-
lut et al., 2019), typically fine-tune an LM, like
GPT (Radford et al., 2018), over the training set
from a single CKG. While such methods are able
to dynamically enhance the completeness of CKGs,
their application so far has been limited to the re-
lation set of the source (training) CKG (Da et al.,
2021). In addition, the generated object concepts
are found to be largely biased towards the ones in
the training set (Wang et al., 2020a). It remains
unclear to which extent LMs can generalize to mul-
tiple CKGs, new relations, and novel objects. To
this end, we pose the question: do language models
perform generalizable commonsense inference?
To answer this question, we study three aspects
of the LM generalizability for commonsense infer-
ence, namely: knowledge capacity, transferability,
and induction. To measure the knowledge capacity of LMs, we examine whether LMs can
be adapted to multiple CKGs simultaneously, and
tested on each of the CKGs. We test their transfer-
ability by assessing whether an initial adaptation
of an LM on multiple source CKGs can reduce the
effort of further adapting it to a new CKG. The
inductive power of LMs is measured by varying
the overlap between the objects in the training and
test splits of a CKG. The overview of our analysis
is depicted in Figure 1. Our results show that LMs
are able to infer knowledge for multiple CKGs si-
multaneously without loss of performance on the
target inference task, though the transferability of
knowledge across tasks is limited. In addition, we
observe that the inductive power of LMs for com-
monsense inference relies heavily on whether an
object is observed during training.
2 Analysis Setup
To shed light on the LM's generalizability for
commonsense inference, we investigate: whether
LMs have the capability to adapt to multiple CKGs
(Q1: capacity), whether LMs can reuse the knowl-
edge learned from source CKGs to efficiently adapt
to a target CKG (Q2: transferability), and whether
LMs can predict unseen objects or mainly repeat
the observed ones (Q3: induction). In this section, we define the task, the CKGs we consider, our
experimental settings, and relate to prior studies.
2.1 Task Formulation
Following Hwang et al. (2020); Da et al. (2021), we formalize commonsense inference as a task of predicting the object of a triplet, given a pair of (subject, relation) as input. The subject s and the object o are both expressed as free-form phrases, while the relation r is a predefined relation type from the CKG. A training example from ConceptNet could have (go to a concert, MotivatedByGoal) as input and listen to music as output. Assuming that a CKG is given, the goal is to leverage the commonsense triplets in the CKG as training examples to adapt the LM for commonsense inference.
2.2 CKG Datasets
We consider three large and popular CKGs, with different foci: (1) ConceptNet's broad set of commonsense knowledge includes taxonomic (e.g., IsA), utility (e.g., UsedFor), and temporal knowledge (e.g., HasPrerequisite). It combines crowdsourced knowledge with that from existing sources, such as WordNet. We use its ConceptNet-100K subset, collected by Li et al. (2016). (2) TupleKB (Dalvi Mishra et al., 2017) focuses on scientific commonsense knowledge like (salt, dissolve in, water). It is constructed through an information extraction pipeline. (3) ATOMIC (Sap et al., 2019) has social commonsense knowledge about causes and effects of everyday events, and the mental states (e.g., xIntent) of their participants. It is created by crowdsourcing.
As indicated by Jastrzebski et al. (2018), a
large proportion of the subjects in the test set
of ConceptNet-100K overlap with its training set,
while TupleKB does not provide an official split.
Thus, we (re-)split these two datasets to ensure that
the subjects of testing triplets do not appear in the
training set. This criterion is also consistent with
how the ATOMIC dataset is constructed.
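For concreteness, such a subject-disjoint re-split can be sketched as follows (a minimal Python illustration; the 10% test ratio, the seed, and the function name are our assumptions, not the exact procedure behind the released splits):

```python
import random
from collections import defaultdict

def subject_disjoint_split(triplets, test_ratio=0.1, seed=0):
    """Group (subject, relation, object) triplets by subject, then assign whole
    subject groups to either side so no test subject occurs in training."""
    by_subject = defaultdict(list)
    for s, r, o in triplets:
        by_subject[s].append((s, r, o))
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)
    n_test = int(len(subjects) * test_ratio)
    test_subjects, train_subjects = subjects[:n_test], subjects[n_test:]
    train = [t for s in train_subjects for t in by_subject[s]]
    test = [t for s in test_subjects for t in by_subject[s]]
    return train, test
```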
2.3 Experimental Settings
Multi-task Learning
To answer Q1, we adapt an
LM with balanced training data from ConceptNet,
TupleKB, and ATOMIC. We sample 8 triplets from
each dataset to form one training batch.
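A minimal sketch of this balanced batching is given below; the details beyond "8 triplets per CKG, batch size 24" (sampling with a seeded RNG, number of steps) are assumptions for illustration:

```python
import random

def balanced_batches(conceptnet, tuplekb, atomic, per_ckg=8, seed=0):
    """Yield batches mixing an equal number of triplets from each CKG
    (8 each, i.e., batch size 24); items from smaller CKGs recur across steps."""
    rng = random.Random(seed)
    datasets = [conceptnet, tuplekb, atomic]
    steps = max(len(d) for d in datasets) // per_ckg
    for _ in range(steps):
        batch = []
        for d in datasets:
            batch.extend(rng.sample(d, per_ckg))  # 8 triplets from this CKG
        rng.shuffle(batch)
        yield batch
```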
Transfer Learning
To provide insight into Q2, we
adopt transfer learning under a leave-one-out strat-
egy. In this setting, we adapt an LM on two of the
three CKGs, and then we further adapt it on the
third target CKG. Moreover, we study the data effi-
ciency of this transfer learning by down-sampling
each training set to x = {1, 20, 50}%, in order to see whether the LM can adapt to the target CKG with less training effort. Fine-tuning on as little as 1% of the training set may suffer from instability,
and results may change dramatically given a new
split of training data (Gao et al., 2020). To control
the randomness, we re-sample the 1% training data
5 times with a fixed set of random seeds and report
the average performance instead.
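The repeated 1% subsets can be drawn as sketched below (illustrative only; the seed values are assumptions):

```python
import random

def low_resource_subsets(train, fraction=0.01, seeds=(0, 1, 2, 3, 4)):
    """Draw one down-sampled training subset per fixed seed; the model is
    adapted on each subset and the reported score is the average over runs."""
    n = max(1, int(len(train) * fraction))
    return [random.Random(s).sample(train, n) for s in seeds]
```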
Controlled Low-resource Learning
To answer
Q3, we design a controlled experiment, where we
first split the training set into two disjoint subsets
depending on whether the triplets in the original
training set contain objects that exist in the test set
or not. We denote the subset whose triplets have objects that appear in the test data as the seen-object subset. We sample x = {0, 25, 50, 100}% of the training triplets in this subset for adapting the LM. During the evaluation,
we also separate the test set into two disjoint sub-
sets, according to whether the objects are seen in
the original full training set. The results on these
two split test sets are reported separately for each
adapted LM.
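This split can be sketched as follows (a minimal illustration; the x% down-sampling described above is then applied to the seen-object subset):

```python
def split_training_by_object_overlap(train, test):
    """Partition training triplets into those whose object also occurs as an
    object in the test set ("seen") and those whose object does not ("unseen")."""
    test_objects = {o for _, _, o in test}
    seen = [(s, r, o) for s, r, o in train if o in test_objects]
    unseen = [(s, r, o) for s, r, o in train if o not in test_objects]
    return seen, unseen
```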
Evaluation Protocol
For each (subject, relation)
pair in the test set, we treat all their objects as
ground truth references for evaluating the model
inference. We report scores for commonly used
automatic evaluation metrics for text generation:
BLEU (Papineni et al., 2002), ROUGE (Lin, 2004),
and METEOR (Banerjee and Lavie, 2005), which
are shown to be consistent with human judge-
ments (Hwang et al., 2020).
Adaptation method | Input | Learnable params
Zero-shot (ZS) | (s, r) | N/A
ZS+demo | (s′, r, o′, s, r) | N/A
Fine-tuning (FT) | (s, r) | Transformer (LM)
FT+demo | (s′, r, o′, s, r) | Transformer (LM)
Adapter tuning (AT) | (s, r) | Adapter

Table 1: Methods for using LMs to conduct commonsense inference. "+demo" means prepending a demonstration triplet (s′, r, o′) before the input tuple.
During experiments, we observe a high correlation among these different metrics and choose to report METEOR in the main text and other metrics in the appendix.
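A sketch of the multi-reference scoring loop is shown below, assuming nltk's METEOR implementation (recent nltk versions expect pre-tokenized inputs and require the WordNet data to be downloaded); this is an illustration rather than the exact evaluation code:

```python
from nltk.translate.meteor_score import meteor_score  # needs nltk 'wordnet' corpus

def average_meteor(references, predictions):
    """references: {(subject, relation): [gold object phrase, ...]}
    predictions: {(subject, relation): generated object phrase}
    nltk's meteor_score takes the maximum score over the reference set."""
    scores = []
    for key, refs in references.items():
        hypothesis = predictions[key].split()
        tokenized_refs = [ref.split() for ref in refs]
        scores.append(meteor_score(tokenized_refs, hypothesis))
    return sum(scores) / len(scores)
```

BLEU and ROUGE would be computed analogously against the same grouped references.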
2.4 Connections to Prior Studies
Earlier works (Li et al., 2016; Jastrzebski et al.,
2018; Davison et al., 2019) pose the CKG com-
pletion task as triplet classification, where the
goal is to score the plausibility of a complete
triplet. COMET (Bosselut et al., 2019) is the first
to cast this task as commonsense inference with
LMs. Follow-up contributions utilize COMET as
a commonsense provider in various downstream
tasks (Bosselut and Choi, 2021; Ammanabrolu
et al., 2021; Chakrabarty et al., 2020), thus provid-
ing evidence for LM’s generalization to previously
unseen scenarios. Further efforts include Hwang
et al. (2020), which show that the quality of the
training triplets is a key factor of adapting LMs,
and Da et al. (2021), which investigate how to
learn COMET in a few-shot learning setting. Mean-
while, the study by Wang et al. (2020a) indicates
the limited generalization of COMET. Ma et al.
(2021) also adapt LMs simultaneously on multiple
CKGs, although their goal is to improve downstream
performance rather than CKG inference. In this pa-
per, we aim to provide a more comprehensive study
of an LM's generalizability for CKG inference.
3 Method
While many pretrained LMs exist, we adopt
a widely used generative model, GPT2 (Radford
et al., 2019), as our baseline LM. The investigation
of other generative LMs is orthogonal to our analy-
sis. We experiment with its largest version, GPT2-
XL, which contains 48 transformer layers (Vaswani
et al., 2017), ensuring sufficient capacity for stor-
ing knowledge acquired during its pretraining. We
introduce our experimental method as follows.
Commonsense Inference with LMs
Given a training triplet (s, r, o), we represent s and o as sequences of tokens, x_s and x_o, which is trivial given that they are already expressed as phrases. As for the relation r, we convert it by using a template taken from the literature (Davison et al., 2019) into a natural-language phrase x_r, e.g., IsA is converted to "is a". This has been shown to facilitate efficient adaptation of LMs (Da et al., 2021). Note that we do not explicitly provide the LMs with the information about the source CKG of the triplet as input (e.g., prepending a related special token to the triplet).
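A minimal sketch of this verbalization step follows; the template strings shown here are illustrative, while the actual templates follow Davison et al. (2019):

```python
# Illustrative relation templates; the paper takes its templates from the literature.
RELATION_TEMPLATES = {
    "IsA": "is a",
    "UsedFor": "is used for",
    "MotivatedByGoal": "is motivated by the goal",
}

def verbalize(subject, relation, obj=None):
    """Turn a triplet into the LM prompt (and, if obj is given, the full target text)."""
    x_r = RELATION_TEMPLATES.get(relation, relation)
    prompt = f"{subject} {x_r}"
    return prompt if obj is None else f"{prompt} {obj}"
```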
Adapting LMs with Commonsense Knowledge
The training objective for adapting LMs is to maximize the probability of generating the object phrase x_o given the tuple (x_s, x_r). During inference, we adopt greedy decoding to obtain the predicted object from the adapted LM.
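A sketch of this objective and of greedy decoding with the Hugging Face transformers library is given below (assuming GPT2LMHeadModel; masking the prompt positions with -100 restricts the loss to the object tokens, which is our reading of the setup rather than the authors' exact code):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # the paper uses gpt2-xl
model = GPT2LMHeadModel.from_pretrained("gpt2")

def loss_for_triplet(prompt, obj):
    """Cross-entropy on the object tokens only, given the verbalized (s, r) prompt."""
    prompt_ids = tokenizer.encode(prompt)
    obj_ids = tokenizer.encode(" " + obj) + [tokenizer.eos_token_id]
    input_ids = torch.tensor([prompt_ids + obj_ids])
    labels = torch.tensor([[-100] * len(prompt_ids) + obj_ids])  # ignore prompt positions
    return model(input_ids=input_ids, labels=labels).loss

@torch.no_grad()
def predict_object(prompt, max_new_tokens=10):
    """Greedy decoding of the object phrase from the (possibly adapted) LM."""
    input_ids = torch.tensor([tokenizer.encode(prompt)])
    out = model.generate(input_ids, max_new_tokens=max_new_tokens,
                         do_sample=False, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True)
```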
There have been various techniques devel-
oped for adapting pretrained LMs to downstream
tasks (Howard and Ruder, 2018; Chen et al., 2020).
Moreover, previously only vanilla Fine-tuning, i.e., updating the whole LM architecture during training, has been employed to adapt LMs for commonsense inference (Bosselut et al., 2019; Hwang et al., 2020; Da et al., 2021). To obtain comprehensive results that are not specific to one particular way of fine-tuning, we investigate two more alternatives, each of which has its own advantages in different contexts.
Fine-tuning with Demonstration (FT+demo)
Combining the ideas of fine-tuning and in-context
learning (Brown et al., 2020), this technique (Gao
et al., 2020) adds a demonstration to each input
as additional context and fine-tunes the whole LM
as usual. Incorporating demonstrations is shown
to boost performance when the amount of training
data is extremely limited. In our case, a demonstration is the top-1 training triplet (s′, r, o′), ranked according to the cosine similarity between the embedding of the input tuple (s, r) and the embeddings of the training tuples with the same relation type r. The tuple embeddings are given by a pretrained Sentence-BERT (Reimers and Gurevych, 2019). For instance, the demonstration (go to restaurant, UsedFor, eat out) would be added before the input (go to pub, UsedFor). With the demonstration triplets, the LM could learn to understand the schema of the CKG instead of simply learning the knowledge from the training data.
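A sketch of the demonstration retrieval is given below, assuming the sentence-transformers library; the checkpoint name is illustrative and may differ from the Sentence-BERT model used in the paper:

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative checkpoint

def retrieve_demonstration(query_subject, relation, train_triplets):
    """Return the training triplet (s', r, o') whose (subject, relation) tuple is
    most similar to the query tuple, restricted to the same relation type."""
    candidates = [t for t in train_triplets if t[1] == relation]
    cand_texts = [f"{s} {r}" for s, r, _ in candidates]
    query_emb = encoder.encode(f"{query_subject} {relation}", convert_to_tensor=True)
    cand_embs = encoder.encode(cand_texts, convert_to_tensor=True)
    best = util.cos_sim(query_emb, cand_embs).argmax().item()
    return candidates[best]
```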
Adapter Tuning (AT)
Unlike fine-tuning, adapter tuning (Houlsby et al., 2019) fixes the entire LM and adds one trainable adapter right before the skip connection in each transformer layer of the LM, which is more parameter-efficient.
Figure 2: Results (METEOR) for knowledge capacity of LMs. "FT+d" refers to FT+demo. We find no notable performance drop for any method trained in the multi-task setting.
Figure 3: Results (METEOR) for LM transferability. "FT+d" refers to FT+demo. Across datasets, we do not observe that adapting to the source CKGs enables the LMs to adapt to the target CKG better or more easily.
Figure 4: Results (METEOR) for LM induction. "FT+d" refers to FT+demo. All methods perform better on predicting facts that contain seen objects, while performance degrades when fewer objects are seen during training.
Each adapter is a two-layer bottleneck network with an internal skip connection. Following Houlsby et al.
(2019), the parameters of the bottleneck network
are initialized close to zero so that the adapter ap-
proximates an identity function from the beginning.
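A minimal PyTorch sketch of such a bottleneck adapter is shown below (the hidden size matches GPT2-XL, but the bottleneck width, activation, and initialization scale are assumptions for illustration):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Two-layer bottleneck with an internal skip connection, initialized near
    zero so it approximates an identity function at the start of training."""
    def __init__(self, hidden_size=1600, bottleneck_size=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.act = nn.GELU()
        for layer in (self.down, self.up):
            nn.init.normal_(layer.weight, std=1e-3)  # near-zero initialization
            nn.init.zeros_(layer.bias)

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```

Only the adapter parameters are updated during adaptation; the LM weights stay frozen.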
We compare to two additional baselines, both
using GPT2-XL in a zero-shot setting:
Zero-shot (ZS) is fed with the same input as Fine-tuning, while zero-shot with demonstrations (ZS+demo) combines the input with a demonstration, as in the
FT+demo method. By investigating all these meth-
ods, we aim to understand the influence of different
adaptation techniques on the models’ performance.
Table 1 summarizes the set of methods which we
consider in this paper.
4 Results and Discussion
Knowledge Capacity (Q1)
The results that quan-
tify the knowledge capacity of LMs for common-
sense inference over multiple CKGs with ME-
TEOR scores are shown in Figure 2. The com-
plete results including other metrics can be found
in the appendix. All adaptation methods perform
considerably better than the zero-shot baselines,
indicating the benefit of adaptation. There is no
clear distinction between the adaptation methods,
though FT+demo performs slightly better than the
others across CKGs. Most importantly, we find no
notable performance drop for any method in the
multi-task training setup despite the challenge that
there is limited overlap between these CKGs: only 10.0% of the facts from ATOMIC can be found in ConceptNet (Hwang et al., 2020), while 8.4% of the facts from ConceptNet can be found in TupleKB (Dalvi Mishra et al., 2017).² This indicates
the prominent capacity of LMs to simultaneously
adapt to different CKGs. Nevertheless, the results
reveal that CKGs learned jointly do not interfere with each other, either positively (via knowledge sharing) or negatively (due to overfitting).
Transferability (Q2)
Figure 3 shows the obtained
results regarding the transferability of LMs. Across
different CKGs and for any training data size, we
observe no indications that adapting to the source
CKGs enhances the performance on the target
CKG. On the contrary, adapting from source CKGs
² We also try to break down the results by relation type and do not observe a correlation between the relation-wise performance and the extent of overlap.
even hurts the performance of the Adapter-tuning
method, revealing that this method overfits to the
source CKGs. Overall, we conclude that LMs can-
not reuse the knowledge learned from the source
CKGs to improve the performance on the target
CKG or achieve the same performance with less
training data. Thus, we call for future study on
developing more effective adaptation methods.
Induction (Q3)
The results in Figure 4 show that without down-sampling (x = 100%), all methods perform much better on predicting facts that contain seen objects, and their performance degrades further when fewer object entities are seen during training.
Meanwhile, the performance on facts with unseen
objects stays roughly unaffected. This indicates a
key limitation of the LMs: they adapt notably better
on seen objects. Since the training set and test set
do not share subjects, we conclude that the general-
izability of the LM is largely dependent on finding
the relationship between unseen subjects and ob-
served objects. We thus posit that a novel strategy
for adapting LMs while retaining the knowledge
acquired during pre-training is necessary for bet-
ter generalizability. Promising directions here are
prefix tuning (Li and Liang, 2021) or including an
additional objective during adaptation which would
encourage the generation of novel objects.
5 Conclusion
This work conducted a focused study of three as-
pects of the generalizability of LMs for common-
sense inference: knowledge capacity, transferabil-
ity, and induction. We experimented with five meth-
ods of using a generative LM and three represen-
tative CKGs. Despite their capability to accommo-
date multiple CKGs, we have observed that LMs
have limited ability to transfer knowledge across
CKGs. Moreover, their adaptation relies heavily
on whether the objects to predict are seen during
training. These findings help our understanding
of LMs’ adaptation behavior on commonsense in-
ference, and highlight the need for future work to
improve their transferability and induction.
Acknowledgments
We thank the anonymous reviewers for their in-
sightful comments. This material is based upon
work sponsored by the DARPA MCS program un-
der Contract No. N660011924033 with the United
States Office Of Naval Research.
References
Prithviraj Ammanabrolu, Wesley Cheung, William
Broniec, and Mark O Riedl. 2021. Automated sto-
rytelling via causal, commonsense plot ordering. In
Proceedings of the 35th AAAI Conference on Artifi-
cial Intelligence (AAAI).
Satanjeev Banerjee and Alon Lavie. 2005. METEOR:
An automatic metric for MT evaluation with im-
proved correlation with human judgments. In Pro-
ceedings of the ACL Workshop on Intrinsic and Ex-
trinsic Evaluation Measures for Machine Transla-
tion and/or Summarization, pages 65–72, Ann Ar-
bor, Michigan. Association for Computational Lin-
guistics.
Antoine Bosselut and Yejin Choi. 2021. Dynamic
knowledge graph construction for zero-shot com-
monsense question answering. In Proceedings of
the 35th AAAI Conference on Artificial Intelligence
(AAAI).
Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chai-
tanya Malaviya, Asli Celikyilmaz, and Yejin Choi.
2019. COMET: Commonsense transformers for au-
tomatic knowledge graph construction. In Proceed-
ings of the 57th Annual Meeting of the Association
for Computational Linguistics, pages 4762–4779,
Florence, Italy. Association for Computational Lin-
guistics.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie
Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda
Askell, Sandhini Agarwal, Ariel Herbert-Voss,
Gretchen Krueger, Tom Henighan, Rewon Child,
Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu,
Clemens Winter, Christopher Hesse, Mark Chen,
Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam Mc-
Candlish, Alec Radford, Ilya Sutskever, and Dario
Amodei. 2020. Language models are few-shot learn-
ers. In Advances in Neural Information Processing
Systems 33: Annual Conference on Neural Informa-
tion Processing Systems 2020, NeurIPS 2020, De-
cember 6-12, 2020, virtual.
Tuhin Chakrabarty, Smaranda Muresan, and Nanyun
Peng. 2020. Generating similes effortlessly like a
pro: A style transfer approach for simile generation.
In Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP),
pages 6455–6469, Online. Association for Computa-
tional Linguistics.
Sanyuan Chen, Yutai Hou, Yiming Cui, Wanxiang Che,
Ting Liu, and Xiangzhan Yu. 2020. Recall and learn:
Fine-tuning deep pretrained language models with
less forgetting. In Proceedings of the 2020 Con-
ference on Empirical Methods in Natural Language
Processing (EMNLP), pages 7870–7881, Online. As-
sociation for Computational Linguistics.
Jeff Da, Ronan Le Bras, Ximing Lu, Yejin Choi, and
Antoine Bosselut. 2021. Understanding few-shot
commonsense knowledge models. arXiv preprint
arXiv:2101.00297.
Bhavana Dalvi Mishra, Niket Tandon, and Peter Clark.
2017. Domain-targeted, high precision knowledge
extraction. Transactions of the Association for Com-
putational Linguistics, 5:233–246.
Joe Davison, Joshua Feldman, and Alexander Rush.
2019. Commonsense knowledge mining from pre-
trained models. In Proceedings of the 2019 Con-
ference on Empirical Methods in Natural Language
Processing and the 9th International Joint Confer-
ence on Natural Language Processing (EMNLP-
IJCNLP), pages 1173–1178, Hong Kong, China. As-
sociation for Computational Linguistics.
Tianyu Gao, Adam Fisch, and Danqi Chen. 2020.
Making pre-trained language models better few-shot
learners. arXiv preprint arXiv:2012.15723.
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski,
Bruna Morrone, Quentin de Laroussilhe, Andrea
Gesmundo, Mona Attariyan, and Sylvain Gelly.
2019. Parameter-efficient transfer learning for NLP.
In Proceedings of the 36th International Confer-
ence on Machine Learning, ICML 2019, 9-15 June
2019, Long Beach, California, USA, volume 97 of
Proceedings of Machine Learning Research, pages
2790–2799. PMLR.
Jeremy Howard and Sebastian Ruder. 2018. Universal
language model fine-tuning for text classification. In
Proceedings of the 56th Annual Meeting of the As-
sociation for Computational Linguistics (Volume 1:
Long Papers), pages 328–339, Melbourne, Australia.
Association for Computational Linguistics.
Jena D Hwang, Chandra Bhagavatula, Ronan Le Bras,
Jeff Da, Keisuke Sakaguchi, Antoine Bosselut, and
Yejin Choi. 2020. Comet-atomic 2020: On symbolic
and neural commonsense knowledge graphs. arXiv
preprint arXiv:2010.05953.
Stanislaw Jastrzebski, Dzmitry Bahdanau, Seyedarian
Hosseini, Michael Noukhovitch, Yoshua Bengio,
and Jackie Cheung. 2018. Commonsense mining as
knowledge base completion? a study on the impact
of novelty. In Proceedings of the Workshop on Gen-
eralization in the Age of Deep Learning, pages 8–16,
New Orleans, Louisiana. Association for Computa-
tional Linguistics.
Xiang Li, Aynaz Taheri, Lifu Tu, and Kevin Gimpel.
2016. Commonsense knowledge base completion.
In Proceedings of the 54th Annual Meeting of the As-
sociation for Computational Linguistics (Volume 1:
Long Papers), pages 1445–1455, Berlin, Germany.
Association for Computational Linguistics.
Xiang Lisa Li and Percy Liang. 2021. Prefix-
tuning: Optimizing continuous prompts for genera-
tion. arXiv preprint arXiv:2101.00190.
Chin-Yew Lin. 2004. ROUGE: A package for auto-
matic evaluation of summaries. In Text Summariza-
tion Branches Out, pages 74–81, Barcelona, Spain.
Association for Computational Linguistics.
Kaixin Ma, Filip Ilievski, Jonathan Francis, Yonatan
Bisk, Eric Nyberg, and Alessandro Oltramari. 2021.
Knowledge-driven Data Construction for Zero-shot
Evaluation in Commonsense Question Answering.
In 35th AAAI Conference on Artificial Intelligence.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Jing Zhu. 2002. Bleu: a method for automatic eval-
uation of machine translation. In Proceedings of
the 40th Annual Meeting of the Association for Com-
putational Linguistics, pages 311–318, Philadelphia,
Pennsylvania, USA. Association for Computational
Linguistics.
Heiko Paulheim. 2018. How much is a triple? esti-
mating the cost of knowledge graph creation. In
Proceedings of the 17th International Semantic Web
Conference.
Fabio Petroni, Tim Rocktäschel, Sebastian Riedel,
Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and
Alexander Miller. 2019. Language models as knowl-
edge bases? In Proceedings of the 2019 Confer-
ence on Empirical Methods in Natural Language
Processing and the 9th International Joint Confer-
ence on Natural Language Processing (EMNLP-
IJCNLP), pages 2463–2473, Hong Kong, China. As-
sociation for Computational Linguistics.
Alec Radford, Karthik Narasimhan, Tim Salimans, and
Ilya Sutskever. 2018. Improving language under-
standing by generative pre-training.
Alec Radford, Jeff Wu, Rewon Child, David Luan,
Dario Amodei, and Ilya Sutskever. 2019. Language
models are unsupervised multitask learners.
Nils Reimers and Iryna Gurevych. 2019. Sentence-
BERT: Sentence embeddings using Siamese BERT-
networks. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP), pages
3982–3992, Hong Kong, China. Association for
Computational Linguistics.
Maarten Sap, Ronan Le Bras, Emily Allaway, Chan-
dra Bhagavatula, Nicholas Lourie, Hannah Rashkin,
Brendan Roof, Noah A. Smith, and Yejin Choi. 2019.
ATOMIC: an atlas of machine commonsense for
if-then reasoning. In The Thirty-Third AAAI Con-
ference on Artificial Intelligence, AAAI 2019, The
Thirty-First Innovative Applications of Artificial In-
telligence Conference, IAAI 2019, The Ninth AAAI
Symposium on Educational Advances in Artificial
Intelligence, EAAI 2019, Honolulu, Hawaii, USA,
January 27 - February 1, 2019, pages 3027–3035.
AAAI Press.
Robyn Speer, Joshua Chin, and Catherine Havasi. 2017.
Conceptnet 5.5: An open multilingual graph of gen-
eral knowledge. In Proceedings of the Thirty-First
AAAI Conference on Artificial Intelligence, Febru-
ary 4-9, 2017, San Francisco, California, USA,
pages 4444–4451. AAAI Press.
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and
Jonathan Berant. 2019. CommonsenseQA: A ques-
tion answering challenge targeting commonsense
knowledge. In Proceedings of the 2019 Conference
of the North American Chapter of the Association
for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers),
pages 4149–4158, Minneapolis, Minnesota. Associ-
ation for Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. In Advances in Neural Information Pro-
cessing Systems 30: Annual Conference on Neural
Information Processing Systems 2017, December 4-
9, 2017, Long Beach, CA, USA, pages 5998–6008.
Cunxiang Wang, Jinhang Wu, Luxin Liu, and Yue
Zhang. 2020a. Commonsense knowledge graph rea-
soning by selection or generation? why? arXiv
preprint arXiv:2008.05925.
Haoyu Wang, Muhao Chen, Hongming Zhang, and
Dan Roth. 2020b. Joint constrained learning for
event-event relation extraction. In Proceedings of
the 2020 Conference on Empirical Methods in Natu-
ral Language Processing (EMNLP), pages 696–706,
Online. Association for Computational Linguistics.
Peifeng Wang, Nanyun Peng, Filip Ilievski, Pedro
Szekely, and Xiang Ren. 2020c. Connecting the
dots: A knowledgeable path generator for common-
sense question answering. In Findings of the Associ-
ation for Computational Linguistics: EMNLP 2020,
pages 4129–4140, Online. Association for Computa-
tional Linguistics.
Tom Young, Erik Cambria, Iti Chaturvedi, Hao Zhou,
Subham Biswas, and Minlie Huang. 2018. Aug-
menting end-to-end dialogue systems with common-
sense knowledge. In Proceedings of the Thirty-
Second AAAI Conference on Artificial Intelligence,
(AAAI-18), the 30th innovative Applications of Arti-
ficial Intelligence (IAAI-18), and the 8th AAAI Sym-
posium on Educational Advances in Artificial Intel-
ligence (EAAI-18), New Orleans, Louisiana, USA,
February 2-7, 2018, pages 4970–4977. AAAI Press.
ConceptNet
Method | BLEU-2 (single / multi) | ROUGE-L (single / multi) | METEOR (single / multi)
Zero-shot | 0.0069 / NA | 0.1009 / NA | 0.0506 / NA
ZS+demo | 0.0284 / NA | 0.1281 / NA | 0.0787 / NA
Adapter-tuning | 0.1289 / 0.1279 | 0.2598 / 0.2560 | 0.1739 / 0.1706
Fine-tuning | 0.1325 / 0.1286 | 0.2629 / 0.2575 | 0.1775 / 0.1749
FT+demo | 0.1333 / 0.1398 | 0.2678 / 0.2738 | 0.1795 / 0.1851

TupleKB
Method | BLEU-2 (single / multi) | ROUGE-L (single / multi) | METEOR (single / multi)
Zero-shot | 0.0017 / NA | 0.0999 / NA | 0.0263 / NA
ZS+demo | 0.0099 / NA | 0.2748 / NA | 0.0869 / NA
Adapter-tuning | 0.1383 / 0.1323 | 0.3785 / 0.3627 | 0.2094 / 0.2010
Fine-tuning | 0.1371 / 0.1388 | 0.3985 / 0.3812 | 0.2151 / 0.2122
FT+demo | 0.1699 / 0.1698 | 0.4902 / 0.4714 | 0.2622 / 0.2580

ATOMIC
Method | BLEU-2 (single / multi) | ROUGE-L (single / multi) | METEOR (single / multi)
Zero-shot | 0.0436 / NA | 0.2523 / NA | 0.1419 / NA
ZS+demo | 0.0808 / NA | 0.2233 / NA | 0.1572 / NA
Adapter-tuning | 0.2161 / 0.2035 | 0.4008 / 0.3890 | 0.2913 / 0.2832
Fine-tuning | 0.2125 / 0.2057 | 0.3982 / 0.3908 | 0.2913 / 0.2843
FT+demo | 0.2111 / 0.2070 | 0.3915 / 0.3868 | 0.2887 / 0.2800

Table 2: Results of all the evaluation metrics for the knowledge capacity experiments.
A Appendix
A.1 Dataset Statistics
Dataset | Train | Dev | Test
ConceptNet100k | 79,770 | 10,203 | 10,027
TupleKB | 98,674 | 12,357 | 12,427
ATOMIC | 578,002 | 64,902 | 71,127

Table 3: CKG dataset statistics.
A.2 Implementation Details
The GPT2-XL language model we adopted in this
work has 1558M parameters in total. We train
all the models on a V100 GPU. As for hyper-
parameters, we adopt the commonly-used learning
rate (1e-5) and batch size (16) for adapting GPT2,
except that in the multi-task learning setting, the
batch size is 24 (8 samples from each CKG).
A.3 Additional Results
See Table 2 for the full results of all the evaluation
metrics considered in this paper.