Student Presentation Reading Materials

1. Digital libraries, repositories, and search engines

Major Readings:

* "Digital libraries and autonomous citation indexing" by Lawrence, Giles, and Bollacker, 1999

* "Big Scholarly Data: A Survey" by Xia et al. 2017 

* "Search Engine Technology and Digital Libraries" by Summann & Lossau, 2010

Additional Readings:

* Digital Library: https://en.wikipedia.org/wiki/Digital_library

* List of digital library projects: https://en.wikipedia.org/wiki/List_of_digital_library_projects

* ACM Digital Library: https://en.wikipedia.org/wiki/Association_for_Computing_Machinery#Portal_and_Digital_Library

* Institutional repository: https://en.wikipedia.org/wiki/Institutional_repository

* DSpace: https://duraspace.org/dspace/

* HathiTrust Digital Library: http://nrs.harvard.edu/urn-3:hul.eresource:hathitrust

* ROAR content search: Registry of Open Access Repositories (ROAR) Content Search

* OpenDOAR project: http://www.opendoar.org/search.php

* OAISTER project: http://oaister.worldcat.org/advancedsearch

* The Digital Public Library of America: http://dp.la/

* Google Scholar: https://en.wikipedia.org/wiki/Google_Scholar

2. Architectures of digital library search engines

Major Readings:

* "CiteSeerX: AI In A Digital Library Search Engine" by Wu et al. AI-Magazine 2015

* "ArnetMiner: extraction and mining of academic social networks" by Tang et al. KDD 2008 

Additional Readings:

* "Towards Building a Scholarly Big Data Platform: Challenges, Lessons and Opportunities" by Wu et al. 2014

* ETL model: https://en.wikipedia.org/wiki/Extract,_transform,_load 

* RESTful API: https://en.wikipedia.org/wiki/Representational_state_transfer 

* LAMP architecture model: https://en.wikipedia.org/wiki/LAMP_(software_bundle) 

* Apache Solr: http://lucene.apache.org/solr/ 

* MySQL: https://www.mysql.com/ 

* Apache Tomcat: http://tomcat.apache.org/ 

* Apache UIMA: https://uima.apache.org/ 

3. Textual metadata extraction: headers and citations 

Major Readings:

* "CERMINE: automatic extraction of structured metadata from scientific literature" by Tkaczyk et al. 2015

* "Neural ParsCit: a deep learning-based reference string parser" by Prasad, Kaur, and Kan 2018 

Additional Readings:

* "Evaluation of Header Metadata Extraction Approaches and Tools for Scientific PDF Documents" by Lipinski et al. 2014 

* "GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications" by Lopez 2009

* Dublin core: https://en.wikipedia.org/wiki/Dublin_Core 

* Metadata Object Description Schema (MODS): https://www.loc.gov/standards/mods/v3/mods-userguide-3-0.html

* "Machine Learning vs. Rules and Out-of-the-Box vs. Retrained: An Evaluation of Open-Source Bibliographic Reference and Citation Parsers" by Tkaczyk et al. 2018 

4. Non-textual metadata extraction: figures and tables 

Major Readings: 

* "Table Header Detection and Classification" by Fang et al. 2018

* "PDFFigures 2.0: Mining Figures from Research Papers" by Clark & Divvala 2016 

Additional Readings:

* "Extracting Scientific Figures with Distantly Supervised Neural Networks" by Siegel et al. 2018

* "Curve Separation for Line Graphs in Scholarly Documents" by Choudhury et al. 2016 

* Raster graphics: https://en.wikipedia.org/wiki/Raster_graphics 

* Vector graphics: https://en.wikipedia.org/wiki/Vector_graphics 

* pdffigures: http://pdffigures.allenai.org/

5. Semantic information extraction: entities and relations

Major Readings: 

* "Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions" by Shen, Wang, and Han (2015)

* "Construction of the Literature Graph in Semantic Scholar" by Ammar et al. (2018) 

Additional Readings:

* Wikipedia: https://en.wikipedia.org/wiki/Wikipedia 

* DBpedia: https://en.wikipedia.org/wiki/DBpedia

* Freebase: https://en.wikipedia.org/wiki/Freebase 

* YAGO: https://en.wikipedia.org/wiki/YAGO_(database) 

* Wikidata: https://en.wikipedia.org/wiki/Wikidata 

* "SemEval 2017 Task 10: ScienceIE - Extracting Keyphrases and Relations from Scientific Publications" by Augenstein et al. (2017)

* Word2Vec:  https://en.wikipedia.org/wiki/Word2vec 

6. Document classification

Major Readings: 

* "Classifying Document Types to Enhance Search and Recommendations in Digital Libraries" by Charalampous & Knoth (2017) 

* "Document Type Classification in Online Digital Libraries" by Caragea et al. (2016)

Additional Readings:

* n-grams: https://en.wikipedia.org/wiki/N-gram 

* A Gentle Introduction to the Bag-of-Words Model: https://machinelearningmastery.com/gentle-introduction-bag-words-model/ 

* "Web page classification: Features and Algorithms" by Qi and Davison (2009)

7. Near-duplicate and plagiarism detection 

Major Readings:

* "Candidate document retrieval for cross-lingual plagiarism detection using two-level proximity information" by Ehsan & Shakery (2016) 

* "Near Duplicate Detection in an Academic Digital Library" by Williams & Giles (2013)

Additional Readings:

* "An Introduction to Duplicate Detection" by Naumann & Herschel (2010)

* Plagiarism Detection: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3856277/ 

* HelioBLAST: https://helioblast.heliotext.com/ 

* Simhash on a blog: http://matpalm.com/resemblance/simhash/ 

* Near duplicates and shingling in the IR textbook: https://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html

8. Ranking and recommendation of research papers 

Major Readings: 

* "Research-paper recommender systems: a literature survey" by Beel et al. (2016) 

* "Explicit Semantic Ranking for Academic Search via Knowledge Graph Embedding" by Xiong, Power, & Callan (2017) 

Additional Readings: 

* Recommender System: https://en.wikipedia.org/wiki/Recommender_system 

* Collaborative Filtering: https://en.wikipedia.org/wiki/Collaborative_filtering 

* PubMed: https://en.wikipedia.org/wiki/PubMed 

* Docear: http://www.docear.org/ 

* CiteULike: http://www.citeulike.org/ 

* Mendeley on Wikipedia: https://en.wikipedia.org/wiki/Mendeley 

* Click through rate: https://en.wikipedia.org/wiki/Click-through_rate 

* What is the difference between content based filtering and collaborative filtering? : https://www.quora.com/What-is-the-difference-between-content-based-filtering-and-collaborative-filtering

9. Question answering systems based on scholarly big data (SBD)

Major Readings:

* "Novel knowledge-based system with relation detection and textual evidence for question answering research" by Zheng et al. (2018)

* "Open Domain Question Answering via Semantic Enrichment" by Sun et al. (2015)

Additional Readings:

* "Search needs a shake-up" by Oren Etzioni (2011)

* "Building Watson: An Overview of the DeepQA Project" by Ferrucci et al. (2010)