My research interests include text mining, Scholarly Big Data, digital libraries, search engines, natural language processing, applied machine learning, and deep learning.
Text Mining
The number of research papers has increased dramatically in recent decades and will keep increasing, making it extremely stressful for scientists, researchers, and students to read and digest them, even in niche domains. Recent advances in machine learning (ML) and natural language processing (NLP) enable machines to read a research paper, extract its content, and link it to related papers. One important task is to extract semantic entities and relations from free text.
Publications: JCDL 2017
Scholarly Big Data
As an important instance of Big Data, Scholarly Big Data (SBD) has received growing attention since the term was coined. CiteSeerX is a pioneering project that extracts and indexes SBD and provides search and download services through web and API interfaces. After 20 years, the CiteSeerX teams at Penn State and ODU are working together to make it sustainable for the next 10 years. The goal is to crawl and index all open access academic documents on the Web. Sub-project topics include data cleansing, focused web crawling, and large-scale information extraction and indexing.
Publications: K-CAP 2015, K-CAP 2017
Funding support: NSF (Award #1823288)
Applied Machine Learning
The rise of artificial intelligence (AI) has increased the accuracy of text extraction and classification. In these AI-based projects, academic papers are automatically separated from other PDF documents with high precision and recall, keyphrases are extracted from the body text, and research subject areas are automatically identified.