Student Presentation Reading Materials
1. Digital libraries, repositories, and search engines
Major Readings:
* "Digital libraries and autonomous citation indexing" by Lawrence, Giles, and Bollacker, 1999
* "Big Scholarly Data: A Survey" by Xia et al. 2017
* "Search Engine Technology and Digital Libraries" by Summann & Lossau, 2010
Additional Readings:
* Digital Library: https://en.wikipedia.org/wiki/Digital_library
* List of digital library projects: https://en.wikipedia.org/wiki/List_of_digital_library_projects
* ACM Digital Library: https://en.wikipedia.org/wiki/Association_for_Computing_Machinery#Portal_and_Digital_Library
* Institutional repository: https://en.wikipedia.org/wiki/Institutional_repository
* DSpace: https://duraspace.org/dspace/
* HathiTrust Digital Library: http://nrs.harvard.edu/urn-3:hul.eresource:hathitrust
* ROAR content search: Registry of Open Access Repositories (ROAR) Content Search
* OpenDOAR project: http://www.opendoar.org/search.php
* OAIster project: http://oaister.worldcat.org/advancedsearch
* The Digital Public Library of America: http://dp.la/
* Google Scholar: https://en.wikipedia.org/wiki/Google_Scholar
2. Architectures of digital library search engines
Major Readings:
* "CiteSeerX: AI in a Digital Library Search Engine" by Wu et al. AI Magazine 2015
* "ArnetMiner: extraction and mining of academic social networks" by Tang et al. KDD 2008
Additional Readings:
* "Towards Building a Scholarly Big Data Platform: Challenges, Lessons and Opportunities" by Wu et al. 2014
* ETL model: https://en.wikipedia.org/wiki/Extract,_transform,_load
* RESTful API: https://en.wikipedia.org/wiki/Representational_state_transfer
* LAMP architecture model: https://en.wikipedia.org/wiki/LAMP_(software_bundle)
* Apache Solr: http://lucene.apache.org/solr/
* MySQL: https://www.mysql.com/
* Apache Tomcat: http://tomcat.apache.org/
* Apache UIMA: https://uima.apache.org/
3. Textual metadata extraction: headers and citations
Major Readings:
* "CERMINE: automatic extraction of structured metadata from scientific literature" by Tkaczyk et al. 2015
* "Neural ParsCit: a deep learning-based reference string parser" by Prasad, Kaur, and Kan 2018
Additional Readings:
* "Evaluation of Header Metadata Extraction Approaches and Tools for Scientific PDF Documents" by Lipinski et al. 2014
* "GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications" by Lopez 2009
* Dublin Core: https://en.wikipedia.org/wiki/Dublin_Core
* Metadata Object Description Schema (MODS): https://www.loc.gov/standards/mods/v3/mods-userguide-3-0.html
* "Machine Learning vs. Rules and Out-of-the-Box vs. Retrained: An Evaluation of Open-Source Bibliographic Reference and Citation Parsers" by Tkaczyk et al. 2018
4. Non-textual metadata extraction: figures and tables
Major Readings:
* "Table Header Detection and Classification" by Fang et al. 2018
* "PDFFigures 2.0: Mining Figures from Research Papers" by Clark & Divvala 2016
Additional Readings:
* "Extracting Scientific Figures with Distantly Supervised Neural Networks" by Siegel et al. 2018
* "Curve Separation for Line Graphs in Scholarly Documents" by Choudhury et al. 2016
* Raster graphics: https://en.wikipedia.org/wiki/Raster_graphics
* Vector graphics: https://en.wikipedia.org/wiki/Vector_graphics
* pdffigures: http://pdffigures.allenai.org/
5. Semantic information extraction: entities and relations
Major Readings:
* "Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions" by Shen, Wang, and Han (2015)
* "Construction of the Literature Graph in Semantic Scholar" by Ammar et al. (2018)
Additional Readings:
* Wikipedia: https://en.wikipedia.org/wiki/Wikipedia
* DBpedia: https://en.wikipedia.org/wiki/DBpedia
* Freebase: https://en.wikipedia.org/wiki/Freebase
* YAGO: https://en.wikipedia.org/wiki/YAGO_(database)
* Wikidata: https://en.wikipedia.org/wiki/Wikidata
* "SemEval 2017 Task 10: ScienceIE - Extracting Keyphrases and Relations from Scientific Publications" by Augenstein et al. (2017)
* Word2Vec: https://en.wikipedia.org/wiki/Word2vec
6. Document classification
Major Readings:
* "Classifying Document Types to Enhance Search and Recommendations in Digital Libraries" by Charalampous & Knoth (2017)
* "Document Type Classification in Online Digital Libraries" by Caragea et al. (2016)
Additional Readings:
* n-grams: https://en.wikipedia.org/wiki/N-gram
* A Gentle Introduction to the Bag-of-Words Model: https://machinelearningmastery.com/gentle-introduction-bag-words-model/
* "Web page classification: Features and Algorithms" by Qi and Davison (2009)
7. Near-duplicate and plagiarism detection
Major Readings:
* "Candidate document retrieval for cross-lingual plagiarism detection using two-level proximity information" by Ehsan & Shakery (2016)
* "Near Duplicate Detection in an Academic Digital Library" by Williams & Giles (2013)
Additional Readings:
* "An Introduction to Duplicate Detection" by Naumann & Herschel (2010)
* Plagiarism Detection: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3856277/
* HelioBLAST: https://helioblast.heliotext.com/
* Simhash on a blog: http://matpalm.com/resemblance/simhash/
* Near duplicates and Shingles on the IR textbook: https://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html
8. Ranking and recommendation of research papers
Major Readings:
* "Research-paper recommender systems: a literature survey" by Beel et al. (2016)
* "Explicit Semantic Ranking for Academic Search via Knowledge Graph Embedding" by Xiong, Power, & Callan (2017)
Additional Readings:
* Recommender System: https://en.wikipedia.org/wiki/Recommender_system
* Collaborative Filtering: https://en.wikipedia.org/wiki/Collaborative_filtering
* PubMed: https://en.wikipedia.org/wiki/PubMed
* Docear: http://www.docear.org/
* CiteULike: http://www.citeulike.org/
* Mendeley on Wikipedia: https://en.wikipedia.org/wiki/Mendeley
* Click through rate: https://en.wikipedia.org/wiki/Click-through_rate
* What is the difference between content based filtering and collaborative filtering?: https://www.quora.com/What-is-the-difference-between-content-based-filtering-and-collaborative-filtering
9. Question answering systems based on scholarly big data (SBD)
Major Readings:
* "Novel knowledge-based system with relation detection and textual evidence for question answering research" by Zheng et al. (2018)
* "Open Domain Question Answering via Semantic Enrichment" by Sun et al. (2015)
Additional Readings:
* "Search needs a shake-up" by Oren Etzioni (2011)
* "Building Watson: An Overview of the DeepQA Project" by Ferrucci et al. (2010)