Document Analysis (COMP4650/COMP6490)
Undergraduate/Postgraduate level, Australian National University, 2022
Textbooks:
- Introduction to Information Retrieval, C.D. Manning, P. Raghavan and H. Schütze, Cambridge University Press, 2008.
- Speech and Language Processing (3rd ed. draft), Dan Jurafsky and James H. Martin
Overview
This course treats the “document”, in its various genres, as a fundamental object for business, government and community. It considers how to process semi-structured documents, such as web pages, RSS feeds and their accompanying news items, and PDF brochures, with the aim of interpreting their content. The course covers four broad areas: (A) information retrieval, (B) natural language processing, (C) machine learning for documents, and (D) relevant tools for the Web. Basic tasks covered include content collection and extraction, formal and informal natural language processing, information extraction, information retrieval, classification and analysis. Fundamental probabilistic techniques for performing these tasks and some common software systems are introduced, though no area is covered in depth.
Learning Outcomes:
- Differentiate between the basic probabilistic theories of language and document structure, information retrieval, and classification, clustering and document feature engineering.
- Identify the basic algorithms and software available for probabilistic theories of language and be proficient at using common libraries for natural language processing to perform basic analysis tasks.
- Index a document collection for use in an information retrieval system.
- Demonstrate advanced knowledge of basic theories and algorithms to determine large scale named-entity matching and standardization of names within a collection.
- Perform automated classification using probabilistic theories.
Schedule:
Week | Topics |
---|---|
1 | Introduction, IR – Boolean Retrieval |
2 | IR – Ranked Retrieval, IR – Evaluation |
3 | IR – Web Search, ML – Intro & Regression |
4 | ML – Representation, ML – Deep Neural Networks |
5 | ML – DNN in Practice, ML – DNN for Structured Data |
6 | ML – Attention, ML – Transformers |
7 | ML – Pre-training and Neural Language Models, ML – Clustering |
8 | NLP – Semantics, NLP – Syntax Parsing |
9 | NLP – Language Models |
10 | NLP – Dependency Parsing, NLP in Practice |
11 | NLP in Practice |
12 | NLP in Practice |