Dr. Wu is recruiting motivated PhD students who are interested in Applied Machine Learning and Natural Language Processing Systems. For details, please see this blog post.
Biography
Dr. Jian Wu is an assistant professor in the Computer Science Department at Old Dominion University, Norfolk, VA. Before joining ODU, Dr. Jian Wu was an assistant teaching professor in the College of Information Sciences and Technology (IST) at the Pennsylvania State University.
Dr. Jian Wu received his bachelor's degree in Physics and Astronomy from the University of Science and Technology of China (USTC) in 2004. He graduated from the Department of Astronomy and Astrophysics at the Pennsylvania State University in August 2011. After that, he joined the CiteSeerX team led by Dr. C. Lee Giles. Jian Wu is the technical lead of the CiteSeerX project. He led a small team that scaled the CiteSeerX collection from 3 million to 10 million academic documents.
Dr. Jian Wu has published nearly 30 20 peer-reviewed papers in ACM, IEEE, AAAI conferences, receiving one best application paper award and two best paper nominations. Dr. Jian Wu also processed and analyzed astronomical big data earlier in his career and published 7 journal papers in the Astrophysical Journal (ApJ), the Astronomical Journal (AJ), and Monthly Notices of the Royal Astronomical Society (MNRAS). Dr. Jian Wu was the Co-PI of NASA and NSF proposals.
Dr. Jian Wu has mentored at least 20 students on their bachelor's or master's theses and teaches two undergraduate-level courses.
Selected Publications in Information Sciences and Technology (refer to my CV for a full list)
-
Athar Sefid, Jian Wu, Jing Zhao, Lu Liu, Allen C. Ge, Cornelia Caragea, Prasenjit Mitra, C. Lee Giles. "Cleaning Noisy and Heterogeneous Metadata for Record Linking Across Scholarly Big Datasets." In: Proceedings of the 31st Innovative Applications of Artificial Intelligence Conference (IAAI 2019), January 29-31, 2019, Honolulu, Hawaii, USA. [pdf]
-
Jian Wu, Bharath Kandimalla, Shaurya Rohatgi, Athar Sefid, Jianyu Mao, C. Lee Giles. "CiteSeerX-2018: A Cleansed Multidisciplinary Scholarly Big Dataset." In: Proceedings of the 2018 IEEE International Conference on Big Data (BigData 2018), December 10-13, 2018, Seattle, WA, USA. [pdf]
-
Jian Wu, Athar Sefid, Allen C. Ge, C. Lee Giles. "A Supervised Learning Approach To Entity Matching Between Scholarly Big Datasets." In: Proceedings of the 9th International Conference on Knowledge Capture (K-CAP 2017), December 4-6, 2017, Austin, Texas, USA. [pdf] [bibtex]
-
Jian Wu, Sagnik Ray Choudhury, Agnese Chiatti, Chen Liang, and C. Lee Giles. "HESDK: A Hybrid Approach to Extracting Scientific Domain Knowledge Entities." In: Proceedings of ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2017), Toronto, Canada. [pdf] [bibtex]
-
Jian Wu, Chen Liang, Huaiyu Yang, and C. Lee Giles. "CiteSeerX data: semanticizing scholarly papers." In: Proceedings of the International Workshop on Semantic Big Data (SIGMOD-SBD 2016), San Francisco, CA, USA. [pdf] [bibtex]
-
Cornelia Caragea, Jian Wu, Sujatha Das G., and C. Lee Giles. "Document Type Classification in Online Digital Libraries." In: Proceedings of the 26th Innovative Applications of Artificial Intelligence Conference (IAAI 2016). [pdf] [bibtex]
-
Jian Wu, Kyle Williams, Hung-Hsuan Chen, Madian Khabsa, Cornelia Caragea, Suppawong Tuarob, Alexander Ororbia, Douglas Jordan, Prasenjit Mitra, and C. Lee Giles. "CiteSeerX: AI in a Digital Library Search Engine." Artificial Intelligence Magazine (AI Magazine), 2015. [pdf] [bibtex]
-
Jian Wu, Jason Killian, Huaiyu Yang, Kyle Williams, Sagnik Ray Choudhury, Suppawong Tuarob, Cornelia Caragea, and C. Lee Giles. "PDFMEF: A Multi-Entity Knowledge Extraction Framework for Scholarly Documents and Semantic Search." In: Proceedings of The 8th International Conference on Knowledge Capture (K-CAP 2015), Palisades, NY, USA. [Best Paper Nomination] [pdf] [bibtex]
-
Alexander Ororbia, David Reitter, Jian Wu, and C. Lee Giles. "Online Learning of Deep Hybrid Architectures for Semi-Supervised Categorization." In: Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2015), Porto, Portugal. [pdf] [bibtex]
-
Alexander G. Ororbia II, Jian Wu, and C. Lee Giles. "Big Scholarly Data in CiteSeerX: Information Extraction from the Web." In: The 2nd WWW Workshop on Big Scholarly Data: Towards the Web of Scholars (WWW-BigScholar 2015), Florence, Italy. [pdf] [bibtex]
-
Kyle Williams, Jian Wu, and C. Lee Giles. "SimSeerX: A Similar Document Search Engine." In: The 14th ACM Symposium on Document Engineering (DocEng 2014), Fort Collins, CO, USA. [pdf] [bibtex]
-
Jian Wu, Kyle Williams, Hung-Hsuan Chen, Madian Khabsa, Cornelia Caragea, Alexander Ororbia, Douglas Jordan and C. Lee Giles. "CiteSeerX: AI in a Digital Library Search Engine". In: The 26th Annual Conference on Innovative Applications of Artificial Intelligence (IAAI 2014), Quebec City, Quebec, Canada. [Best Application Paper]. [pdf] [bibtex]
-
Zhaohui Wu, Jian Wu, Madian Khabsa, Kyle Williams, Hung-Hsuan Chen, Wenyi Huang, Suppawong Tuarob, Sagnik Ray Choudhury, Alexander Ororbia, Prasenjit Mitra, and C. Lee Giles. "Towards Building a Scholarly Big Data Platform: Challenges, Lessons, and Opportunities". In: Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries (DL 2014), London, UK. [pdf] [bibtex]
-
Jian Wu, Pradeep Teregowda, Kyle Williams, Madian Khabsa, Douglas Jordan, Eric Treece, Zhaohui Wu and C. Lee Giles. "Migrating A Digital Library to A Private Cloud". In: Proceedings of the IEEE International Conference on Cloud Engineering 2014 (IC2E 2014), Boston, MA, USA. [Best Paper Nomination] [pdf] [bibtex]
-
Cornelia Caragea, Jian Wu, Kyle Williams, Sujatha Das G., Madian Khabsa, Pradeep Teregowda and C. Lee Giles. "Automatic Identification of Research Articles from Crawled Documents." In: WSDM 2014 Workshop on Web-scale Classification: Classifying Big Data from the Web (WSCBD 2014), New York City, NY, USA, 2014. [pdf] [bibtex]
-
Cornelia Caragea, Jian Wu, Alina Ciobanu, Kyle Williams, Juan Fernandez-Ramírez, Hung-Hsuan Chen, Zhaohui Wu and C. Lee Giles. "CiteSeerX: A Scholarly Big Dataset". In: Proceedings of the 36th European Conference on Information Retrieval (ECIR 2014), Amsterdam, Netherlands. [pdf] [bibtex]
-
Jian Wu, Pradeep Teregowda, Juan Pablo Fernandez Ramírez, Prasenjit Mitra, Shuyi Zheng, and C. Lee Giles. "The evolution of a crawling strategy for an academic document search engine: whitelists and blacklists." In: Proceedings of the 3rd Annual ACM Web Science Conference (WebSci 2012), Evanston, IL, USA. [pdf] [bibtex]
Publications in Astronomy and Astrophysics
-
Jonathan Gelbord, Caryl Gronwall, Dirk Grupe, Dan Vanden Berk, and Jian Wu. 2015, arXiv astro-ph 1505.05248. Exploring Multiwavelength AGN Variability with Swift Archival Data
-
My Ph.D. Thesis (public since September 2013). Chapter 2 includes my work on the evolution of the Baldwin Effect using composite spectra generated from the SDSS. This work has appeared only in my thesis because we have not yet had time to finish the full paper, but the result is interesting and inspiring.
-
Jian Wu and Daniel E. Vanden Berk. 2013, arXiv astro-ph 1312.7356. Fitting the Continuum Component of A Composite SDSS Quasar Spectrum Using CMA-ES
-
Luo, B., Brandt, W. N., Eracleous, M., Wu, J., Hall, P. B., Rafiee, A., Schneider, D. P., Wu Jianfeng 2013, MNRAS, 429, 1479. X-ray and multiwavelength insights into the inner structure of high-luminosity disc-like emitters
-
Wu, J., Vanden Berk, D. E., Grupe, D., Schneider, D. P., et al. 2012, ApJS, 201:10. A Quasar Catalog with Simultaneous UV, Optical and X-ray Observations by Swift
-
Wu, J., Charlton, J. C., Misawa, T., Eracleous, M., & Ganguly, R., 2010, ApJ, 722:997. The Physical Conditions of the Intrinsic NV Narrow Absorption Line Systems of Three Quasars
-
Wu, J., Vanden Berk, D. E., Brandt, W. N., Schneider, D. P., Gibson, R. R., & Wu, Jianfeng 2009, ApJ, 702:767. Probing Origins of the C iv λ1549 and Fe Kα Baldwin Effect
-
Donald P. Schneider, Patrick B. Hall, Gordon T. Richards, Michael A. Strauss, Daniel E. Vanden Berk, Scott F. Anderson, W. N. Brandt, Xiaohui Fan, Sebastian Jester, Jim Gray, James E. Gunn, Mark U. SubbaRao, Anirudda R. Thakar, Chris Stoughton, Alexander S. Szalay, Brian Yanny, Donald G. York, Neta A. Bahcall, J. Barentine, Michael R. Blanton, Howard Brewington, J. Brinkmann, Robert J. Brunner, Francisco J. Castander, Istvan Csabai, Joshua A. Frieman, Masataka Fukugita, Michael Harvanek, David W. Hogg, Zeljko Ivezic, Stephen M. Kent, S. J. Kleinman, G. R. Knapp, Richard G. Kron, Jurek Krzesinski, Daniel C. Long, Robert H. Lupton, Atsuko Nitta, Jeffrey R. Pier, David H. Saxe, Yue Shen, Stephanie A. Snedden, David H. Weinberg, and Jian Wu 2007, AJ, 134:102. The Sloan Digital Sky Survey Quasar Catalog. IV. Fifth Data Release
-
Lu, Y., Wang, T., Zhou, H., & Wu, J. 2007, AJ, 133:1615. On the Selection Effect of Radio Quasars in the Sloan Digital Sky Survey
Bookmarks
Digital Libraries and Search Engines
-
CiteSeerX (http://citeseerx.ist.psu.edu/)
-
Google Scholar (http://scholar.google.com/)
-
Microsoft Academic Search (http://academic.research.microsoft.com/)
-
ACM Digital Library (http://dl.acm.org/)
-
IEEE Xplore (http://ieeexplore.ieee.org/)
-
Computing Research Repository in arXiv (http://arxiv.org/corr/home)
-
Semantic Scholar by AllenAI (http://s2.allenai.org)
The Seer Family
-
CiteSeerX, RefSeer, CollabSeer, ChemXSeer, SimSeerX, AckSeer, EthnicSeer, CSSeer, BBookX, ArchSeer
-
Phrase mining: Segphrase (weakly supervised), TopMine (unsupervised)
-
IlliMine (UIUC data mining research package dissemination portal)
-
Lookup Tables (Regular Expression Syntax, Penn Treebank Part-of-Speech Tags)
Digital Library
Knowledge Bases
-
WordNet (Home, Search, Rion Snow's Homepage)
Glossary and Dictionaries
-
List of programming and computer science terms (LabAutoPedia.org)
-
GeoNames geographical database (gazetteer; http://www.geonames.org/)
-
United States Census Bureau name database (http://www.census.gov)
Solr
Java
-
MaxClients in Apache and its effect on Tomcat during Full GC by Sangmin Lee
-
The Principles of Java Application Performance Tuning by Sangmin Lee
-
How Statement Pooling in JDBC affects the Garbage Collection by Sangmin Lee
-
Oracle: Using JConsole
-
Oracle: Monitoring and Management using JMX Technology
Ruby on Rails
-
Ruby on Rails (tutorial point, book, rubyonrails.org)
-
Ruby tutorial (tutorial point)
Web Crawling
PDF Text and Metadata Extractors
-
CiteSeerX Extractor (http://citeseerextractor.ist.psu.edu:8080/static/index.html)
-
CERMINE (http://cermine.ceon.pl/index.html)
-
PDFLib TET (http://www.pdflib.com/products/tet/)
-
ParsCit (http://aye.comp.nus.edu.sg/parsCit/)
-
Dr. Inventor Extraction Lib (http://backingdata.org/dri/library/1.3/)
Tomcat
Networking
Hardware
Linux
-
18 commands to monitor network bandwidth on Linux server by Binary Tides
Python
-
Python csv – Comma-separated value files
-
Neural network packages (PyBrain, scikit-learn, Brain, Neurolab)
MySQL
-
MySQL Server Tuning (posted by Aftab Khan at blogspot)
-
Knowledge base: MySQL 5.1 vs. 5.5 vs. 5.6 Performance Comparison
Tutorials
-
Deep Learning (deeplearning.net)
-
MapReduce Tutorial (www.mapr.com)
Course Notes
-
CMPSC 497B/ 597B Big Data Analytics by Dan Kifer
-
Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville
Data Repositories and Data Sets
Blogs
-
Comparison between NoSQL database management systems - Kristof Kovacs
-
Upgrade Solr from version 1.3 to 4.9 for CiteSeerX - Po-Yu Chuang
Software Packages
Homepages
Presentations
Cloud Resources
Teaching Courses
-
IST 210 (Dr. Dashun Wang, Fall 2015; Dr. Zhenghui Li (login required))
-
IST 431W (Yu-San Lin)
Job Finder
Conferences and Journals
Others
-
USCIS - Check Status
-
Expressing Dublin Core metadata using the Resource Description Framework (RDF)
-
NSF Proposal Latex Template (MIT Mathematics)
-
ACM Student Research Competition
Online Tools