Xueling Lin | The Hong Kong University of Science and Technology | CSE

Xueling LIN (Sherry, 林雪玲)

AI Framework and Data Technology Lab,
Hong Kong Research Center, Huawei

Email: xlinai [-at-] connect.ust.hk

Since September 2021, I have been working as a Researcher in the AI Framework and Data Technology Lab at the Hong Kong Research Center, Huawei.

In July 2021, I received my Ph.D. degree from the Department of Computer Science and Engineering (CSE) at the Hong Kong University of Science and Technology (HKUST), where I worked on knowledge base refinement, data fusion and truth discovery in our Knowledge Base Group, advised by Prof. Lei Chen.

Before my Ph.D. journey, I obtained my M.Phil. degree in Computer Science and Engineering from HKUST, and received my Bachelor's degree in Software Engineering from Sun Yat-sen University.

Please refer to my GitHub repository for more details about my research topics.

Publications

Xueling Lin, Lei Chen, Chaorui Zhang, “TENET: Joint Entity and Relation Linking with Coherence Relaxation”, SIGMOD 2021: 1142-1155. [Paper] [Code] [README]
Hao Xin, Xueling Lin, Lei Chen, “CaSIE: Canonicalize and Informative Selection of the OpenIE system”, ICDE 2021: 2009-2014. [Paper]
Haoyang Li, Xueling Lin, Lei Chen, “Fine-grained Entity Typing via Label Noise Reduction and Data Augmentation”, DASFAA (1) 2021: 356-374. [Paper]
Xueling Lin, Haoyang Li, Hao Xin, Zijian Li, Lei Chen, "KBPearl: a Knowledge Base Population System Supported by Joint Entity and Relation Linking.", Proceedings of the VLDB Endowment 13.7 (2020), 1035-1049. [Paper] [Code] [README]
Zijian Li, Wenhao Zheng, Xueling Lin, Ziyuan Zhao, Zhe Wang, Yue Wang, Xun Jian, Lei Chen, Qiang Yan, Tiezheng Mao, "TransN: Heterogeneous Network Representation Learning by Translating Node Embeddings.", ICDE 2020: 589-600. [Paper]
Xueling Lin and Lei Chen, "Canonicalization of Open Knowledge Bases with Side Information from Source Text.", ICDE 2019: 950-961. [Paper] [Code] [Dataset Description]
Xueling Lin and Lei Chen, "Domain-Aware Multi-Truth Discovery from Conflicting Sources." Proceedings of the VLDB Endowment 11.5 (2018): 635-647. [Paper] [Dataset Download] [Dataset Description]
Xueling Lin, Jingjie Jiang, Calvin Hong Yi Li, Bo Li, Baochun Li. "Circa: collaborative code offloading among multiple mobile devices." Wireless Networks (2018): 1-19. [Paper]
Xueling Lin, Jingjie Jiang, Bo Li, Baochun Li, "Circa: Offloading Collaboratively in the Same Vicinity with iBeacons." Communications (ICC), 2015 IEEE International Conference on. IEEE, 2015. [Paper]

Awards

Huawei PhD Fellowship
Postgraduate Studentship, HKUST

Teaching Assistant

COMP3311 Database Management Systems (Fall 2017)
COMP4332 Big Data Mining (Spring 2017)
COMP3511 Operating System (Spring 2015)
COMP3021 Java Programming (Fall 2015)

Datasets for Truth Discovery (VLDB 2018)

I have collected two datasets for my research in truth discovery. You can download both datasets and the ground truths via this link. The details of both datasets are listed as follows:

We collected the book dataset from AbeBooks.com in April 2017. It contains 54,591 different sources registered as booksellers and provides 2,338,559 listings (i.e., bookstores selling books) for 210,206 books. Each source provides between about 0.0005% (1 book) and 28.7% (60,317 books) of the whole collection. This book dataset covers 18 major categories of books: crime fiction, children’s books, science fiction, horror stories, literature, arts, romance fiction, biographies, business, cookbooks, craft books, history, reference, religion, science, self-help, social science and travel books. We have split the data into 18 txt files, one per major category. For each listing, there are 20 attributes, including ISBN_10, ISBN_13, title, authors, big_cate, small_cate, seller, location, seller_link, seller_date, used, publisher, publish_date, binding, book_condition, illustrator, dust_jacket_condition, signed, edition, price and link. [Sample data]
We collected the movie dataset in July 2017. This dataset contains movie data from 15 different websites: imdb, allmovie, amazon, instantwatcher, moviefone, metacritic, movieinsider, 1moviesonline, goodfilms, dewanontons, letterboxd, filmcrave, ifcfilms, top250tv and agoodmovietowatch. The dataset provides 1,134,432 listings (i.e., a source providing a movie) for 468,607 movies. The genres of these movies are action, adventure, animation, biography, children, comedy, crime, documentary, drama, faith, family, fantasy, history, horror, music, romance, science-fiction, sports, thriller, war and western (21 in total). The release years range from 1900 to 2017. For each listing, there are 8 attributes: title, genre, source, year, director, vote, gross and hyperlink. [Sample data]

Note 1: More details about the datasets listed above are available in this paper: Xueling Lin and Lei Chen, "Domain-Aware Multi-Truth Discovery from Conflicting Sources." Proceedings of the VLDB Endowment 11.5 (2018): 635-647.
Note 2: Please contact me through email if you are interested in these datasets and want to play with the complete ones.
Note 3: Both datasets can only be used for research purposes.
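
For readers who want to explore the book listings, here is a minimal Python sketch of how conflicting claims could be grouped per book. It is only an illustration: the tab delimiter, the column order, and the helper names read_listings, claims_by_book and make_row are assumptions of this sketch, not the documented file format, and the sample rows are made up. Only the attribute names come from the description above.

```python
import csv
import io
from collections import defaultdict

# Attribute names as listed in the dataset description. The column order
# and the tab delimiter are assumptions, not the documented file format.
BOOK_FIELDS = [
    "ISBN_10", "ISBN_13", "title", "authors", "big_cate", "small_cate",
    "seller", "location", "seller_link", "seller_date", "used", "publisher",
    "publish_date", "binding", "book_condition", "illustrator",
    "dust_jacket_condition", "signed", "edition", "price", "link",
]


def read_listings(handle, delimiter="\t"):
    """Read listings from a text stream into a list of dicts keyed by BOOK_FIELDS."""
    reader = csv.DictReader(handle, fieldnames=BOOK_FIELDS, delimiter=delimiter)
    return list(reader)


def claims_by_book(listings, attribute="authors"):
    """Map each ISBN_13 to {claimed value: set of sellers making that claim}."""
    claims = defaultdict(lambda: defaultdict(set))
    for row in listings:
        claims[row["ISBN_13"]][row[attribute]].add(row["seller"])
    return claims


def make_row(**values):
    """Build one made-up, tab-separated sample line (hypothetical helper)."""
    return "\t".join(values.get(name, "") for name in BOOK_FIELDS)


# Three invented listings for the same book, two of which agree on the authors.
sample = "\n".join([
    make_row(ISBN_13="9780000000001", authors="A. Author", seller="seller_one"),
    make_row(ISBN_13="9780000000001", authors="A. Author; B. Writer", seller="seller_two"),
    make_row(ISBN_13="9780000000001", authors="A. Author", seller="seller_three"),
])

listings = read_listings(io.StringIO(sample))
conflicts = claims_by_book(listings)
```

With real files, io.StringIO(sample) would be replaced by an opened txt file for one category. The resulting mapping shows which sellers support each conflicting value, which is the kind of input a truth discovery method works from.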

Datasets for Canonicalization of Open Knowledge Bases (ICDE 2019)

I use two major datasets for my research in canonicalization of open knowledge bases. The details of both datasets and the side information are listed as follows:

ReVerb45K: This is an Open KB canonicalization dataset proposed in CESI and published by its authors. ReVerb45K is constructed from the ReVerb Open KB, the ClueWeb09 corpus, and Freebase entity linking information.
NYTimes2018: We collected this dataset from nytimes.com in 2018. This dataset contains news articles from 5 different domains: sports, arts, business, science and health. We collected 500 articles and applied the Stanford Open IE tool to these articles to produce Open IE triples.
Side Information: For both datasets, we obtain the side information for each source text as follows. First, we apply NLTK to recognize the named entity mentions (with PERSON, ORGANIZATION, LOCATION... as the types) in the source text. We then use Wikidata Integrator to link each named entity mention to a list of candidate entities in Wikidata.
More details can be found here.

My Personal Life

My paintings. :)
