Dataset for Canonicalization of Open Knowledge Bases (ICDE 2019)
We use two major datasets in our research on the canonicalization of open knowledge bases. Both datasets and their side information are described below:
- ReVerb45K: An Open KB canonicalization dataset introduced with CESI and published by its authors. ReVerb45K is constructed from the ReVerb Open KB, the ClueWeb09 corpus, and Freebase entity linking information.
- NYTimes2018: We collected this dataset from nytimes.com in 2018. It contains news articles from five domains: sports, arts, business, science, and health. We collected 500 articles and applied the Stanford Open IE tool to them to produce Open IE triples.
- Side Information: For both datasets, we obtain the side information for each source text in two steps. First, we apply NLTK to recognize the named entity mentions in the source text (with types such as PERSON, ORGANIZATION, and LOCATION). We then use Wikidata Integrator to link each named entity mention to a list of candidate entities in Wikidata.
- [Detailed Noun Info] [Data Description for *_Wikidata_entity_description] [Data Description for *_Wikidata_noun_records]
- [Domain Keywords]
- [Patty Dataset] (A dataset published by the authors of "PATTY: A Taxonomy of Relational Patterns with Semantic Types" (EMNLP 2012). The original dataset can also be downloaded from mpi-inf.)
- Note 1: This dataset accompanies the publication below. Please cite it if you use the data above: Xueling Lin and Lei Chen. "Canonicalization of Open Knowledge Bases with Side Information from Source Text." In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 950-961. More details about the datasets listed above are available in this paper.
- Note 2: All the datasets may be used for research purposes only.
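The two-step side-information procedure described above (named entity recognition, then candidate linking) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' code: the `SideInfo` record, `build_side_info`, and the two toy functions are hypothetical names, and the candidate IDs are placeholders rather than real Wikidata identifiers. In practice, `recognize` would wrap an NLTK named entity chunker and `lookup` would wrap a Wikidata search.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A mention is (surface text, entity type), e.g. ("Lei Chen", "PERSON").
Mention = Tuple[str, str]


@dataclass
class SideInfo:
    """Hypothetical record: one source text plus its linked mentions."""
    source_text: str
    # Entity type of each mention, keyed by surface text.
    types: Dict[str, str] = field(default_factory=dict)
    # Candidate Wikidata entity IDs of each mention, keyed by surface text.
    candidates: Dict[str, List[str]] = field(default_factory=dict)


def build_side_info(
    text: str,
    recognize: Callable[[str], List[Mention]],
    lookup: Callable[[str], List[str]],
) -> SideInfo:
    """Step 1: find named entity mentions. Step 2: link each to candidates."""
    info = SideInfo(source_text=text)
    for surface, entity_type in recognize(text):
        if surface in info.candidates:
            continue  # link each distinct mention only once
        info.types[surface] = entity_type
        info.candidates[surface] = lookup(surface)
    return info


def toy_recognize(text: str) -> List[Mention]:
    """Stand-in for an NER tool: matches a tiny hard-coded gazetteer."""
    gazetteer = {"Lei Chen": "PERSON", "New York Times": "ORGANIZATION"}
    return [(name, etype) for name, etype in gazetteer.items() if name in text]


def toy_lookup(surface: str) -> List[str]:
    """Stand-in for an entity linker: returns placeholder candidate IDs."""
    table = {"Lei Chen": ["Q_PLACEHOLDER_1", "Q_PLACEHOLDER_2"]}
    return table.get(surface, [])


info = build_side_info(
    "Lei Chen was quoted in the New York Times.", toy_recognize, toy_lookup
)
```

Passing the recognizer and linker in as functions keeps the sketch independent of any particular tool version. A mention with no candidates is kept with an empty list, so downstream code can tell "unlinked" apart from "never seen".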