Title: Machine Translation
Instructor: David Chiang (University of Notre Dame)
Machine translation, or automatic translation of human languages, is one of the oldest problems in computer science, dating back to the 1950s. In our day, the explosion of text data in multiple languages presents both a challenge and an opportunity for machine translation: the challenge is the volume of data to be processed, and the opportunity is the wealth of knowledge about language to be mined.
Broadly, two approaches to machine translation have been taken: one which relies on knowledge of linguistic structure and meaning, and the other which relies on statistics from large amounts of data. For years, these two approaches seemed at odds with each other, but recent developments have made great progress towards building translation systems according to the maxim, "Linguistics tells us what to count, and statistics tells us how to count it" (Joshi). I will give an overview of the progression of statistical translation models based on increasingly complex formalisms, from word-based to phrase-based to syntax-based translation systems.
I will also discuss recent proposals for semantics-based statistical translation, focusing on the graph-rewriting formalisms upon which such systems might be based. Finally, in the last few years, neural networks have shown serious promise as models of translation. I'll give a survey of some of these recent efforts, and discuss how this line of research might interact with research on syntax-based and semantics-based translation.
David Chiang (蔣偉) is an Associate Professor in the Department of Computer Science and Engineering at the University of Notre Dame. He obtained his PhD from the University of Pennsylvania in 2004. His research interests include language translation, syntactic parsing, and other areas in natural language processing. He has published about 40 papers at leading conferences and journals including ACL, EMNLP, NAACL, COLING, EACL, CL, TACL, Machine Translation, and Machine Learning. His work on applying formal grammars and machine learning to translation has been recognized with two best paper awards (at ACL 2005 and NAACL HLT 2009). He has received research grants from DARPA, NSF, and Google, and has served on the executive board of NAACL and the editorial boards of Computational Linguistics and JAIR.
Title: Automatic Summarization
Instructor: Yang Liu (The University of Texas at Dallas)
In the past decade, the amount of digital data, such as news, scientific articles, conversations, and social media posts, has grown at an exponential pace. The need to address "information overload" by developing automatic summarization systems has never been more pressing. This tutorial will give a systematic overview of traditional and more recent approaches for automatic summarization (focusing on extractive summarization).
A core problem in summarization research is devising methods to estimate the importance of a unit, be it a word, clause, sentence or utterance, in the input. We will introduce several unsupervised methods, including topic-based and graph-based models, semantically rich approaches based on latent semantic analysis and lexical resources, and Bayesian models for summarization. For supervised machine learning approaches, we will discuss the suite of traditional features used in summarization, as well as issues with data annotation and acquisition. Ultimately, the summary will be a collection of important units. The summary can be selected in a greedy manner, choosing the most informative sentences one by one; or the units can be selected jointly and optimized for informativeness. We will explain both approaches, with emphasis on recent optimization work. Then we will discuss the standard manual and automatic metrics for evaluation, as well as very recent work on fully automatic evaluation.
This tutorial will end with a review of more recent advances in summarization. First, we will discuss domain-specific summarization, including summarization of speech data and social media posts (e.g., tweets). Second, we will introduce recent summarization methods, in particular variations of the optimization framework for extractive and abstractive summarization, as well as work exploring deep language understanding methods for summarization. Last, we will briefly touch on some recent summarization tasks, such as generating timelines and hierarchical summaries.
Yang Liu is an Associate Professor in the Department of Computer Science at The University of Texas at Dallas. She obtained her PhD from Purdue University in 2004. Her research interests include speech and natural language processing, social media language analysis, automatic summarization, emotion and affect modeling, and speech and language disorders. She has published about 75 papers at leading conferences and journals including ACL, EMNLP, IJCNLP, COLING, NAACL, EACL, CL, and SLP. She received the NSF CAREER award in 2009 and the Air Force Young Investigator Program award in 2010. She served as the Area Chair for ACL 2012, EMNLP 2013, and ACL 2014, the Panel Chair for SLT 2014, and the Tutorial co-chair for NAACL 2015.
Title: Coreference Resolution
Instructor: Vincent Ng (The University of Texas at Dallas)
Coreference resolution, the task of determining which mentions in a text or dialogue refer to the same real-world entity or event, has been at the core of natural language understanding since the 1960s. Despite the large amount of work on coreference resolution, the task is far from being solved. The difficulty of the task stems in part from its reliance on sophisticated background knowledge and inference mechanisms.
This tutorial will provide an overview of machine learning approaches to coreference resolution. The first part of the tutorial focuses on entity coreference resolution, which is the most extensively investigated coreference task. We will examine both traditional machine learning approaches, which recast coreference as a classification task, and recent approaches, which recast coreference as a structured prediction task. We will conclude with a discussion of the Winograd Schema Challenge, a pronoun resolution task that has recently received a lot of attention in the artificial intelligence community owing to its relevance to the Turing Test.
The second part of the tutorial focuses on coreference research "beyond" entity coreference resolution. Specifically, we will examine zero anaphora resolution and event coreference resolution. Zero and event anaphors are not only less studied but arguably more difficult to resolve, owing to the lack of grammatical attributes in zero anaphors and an event coreference resolver's heavy reliance on the noisy output produced by its upstream components in the standard information extraction pipeline. To enable the applicability of coreference technologies to the vast majority of the world's natural languages for which coreference-annotated corpora are not readily available, we will examine semi-supervised and unsupervised models for the resolution of zero and event anaphora.
Vincent Ng is an Associate Professor in the Department of Computer Science at The University of Texas at Dallas. He obtained his PhD from Cornell University in 2004. His research interests include statistical natural language processing, information extraction, text data mining, machine learning, knowledge management, and artificial intelligence. He has published about 60 papers at leading conferences and journals including ACL, EMNLP, AAAI, NAACL, IJCNLP, IJCAI, CoNLL, ICTAI, COLING, EACL, CL, and JAIR. He has served as a member of program committees and conference review panels for ACL, NAACL, EMNLP, CoNLL, COLING, and IJCAI. He has also served as a journal referee for CL, LRE, and JAIR.
Title: Information Extraction
Instructor: William Wang (Carnegie Mellon University)
Information Extraction (IE) is a core area in natural language processing that distills knowledge from unstructured data. In the era of information overload, extracting the key insights from big data is critical to almost all subareas of data science in both academia and industry. In this short course, I will cover various aspects of the theory and practice of modern information extraction techniques. First, I will provide a brief overview of IE, and describe simple classification and sequential models for named entity recognition. Second, I will introduce recent advances in IE techniques, including distant supervision and latent factor models. We will also look at some case studies of modern IE systems, including UW’s OpenIE and CMU’s NELL. Third, I will introduce the joint view of IE and reasoning, with a focus on the scalability issue. Finally, we will have a hands-on lab session to put the IE theory into practice.
The students are encouraged to bring a laptop with Linux/MacOS/Cygwin and Java installed. More information regarding the lab session will be updated at the course
William Wang (王威廉) is a PhD student at the Language Technologies Institute (LTI) of the School of Computer Science, Carnegie Mellon University. He works with William Cohen on designing scalable learning and inference algorithms for statistical relational learning, knowledge reasoning, and information extraction. He has published about 30 papers at leading conferences and journals including ACL, EMNLP, NAACL, IJCAI, CIKM, COLING, SIGDIAL, IJCNLP, INTERSPEECH, ICASSP, ASRU, SLT, Machine Learning, and Computer Speech & Language. He is a reviewer for many journals including Artificial Intelligence, IEEE/ACM TASLP, IEEE TAC, Bioinformatics, and JASIST, and he has organized or served as a PC member for many conferences and workshops, including IJCAI 2015, NAACL 2015, CIKM 2015, ICASSP 2015, and Interspeech 2015. Most recently, he served as the session chair for the data mining and machine learning session at CIKM 2013 and two text data mining sessions at CIKM 2014. He is the recipient of a Best Student Paper Award at ASRU 2013, Best Paper Honorable Mention Awards at CIKM 2013 and FLAIRS 2011, a Best Reviewer Award at NAACL 2015, the Richard King Mellon Presidential Fellowship in 2011, and Facebook Fellowship Finalist Awards for 2014-2015 and 2015-2016. He is also an alumnus of Columbia University, a research scientist intern at Yahoo!, and a former intern at Microsoft Research Redmond.
Title: Parsing deeper and wider
Instructor: Nianwen Xue (Brandeis University)
As research on syntactic parsing has reached a plateau, the field of NLP is searching for new problems to solve. One direction is going “deeper” and parsing sentences into meaning representations that abstract away from surface syntactic structures. One example is the recent effort to build large-scale linguistic resources annotated with Abstract Meaning Representations (AMRs). Another direction is going “wider” and parsing text units that go beyond sentence boundaries. This line of research involves building discourse treebanks and using them to train discourse parsers. In this talk I will present some recent work on developing AMR parsing algorithms and models as well as efforts in developing discourse parsers. I will discuss the many challenges in these two lines of work and the research opportunities they have opened up.
Nianwen Xue is an Associate Professor in the Computer Science Department and the Language and Linguistics Program at Brandeis University. Dr. Xue directs the Chinese Language Processing Group in the Computer Science Department. Before joining Brandeis, Dr. Xue was a Research Assistant Professor in the Department of Linguistics and the Center for Computational Language and Education Research (CLEAR) at the University of Colorado at Boulder. Prior to that, he was a postdoctoral fellow in the Institute for Research in Cognitive Science and the Department of Computer and Information Science at the University of Pennsylvania. He received his PhD in Linguistics from the University of Delaware. His research interests include syntactic, semantic, temporal and discourse annotation, semantic role labeling, and machine translation. In addition to building syntactically and semantically annotated corpora, he has also published work on Chinese word segmentation and semantic parsing using statistical machine-learning techniques.
|14:00-19:00||Registration (Science Building No. 1, Department of Computer Science and Technology, 1st floor, next to the elevator)|
|08:30-08:40||Welcome to CIPS Summer School|
|08:40-12:00||Title: Parsing deeper and wider
Instructor: Nianwen Xue (Brandeis University)
Chair: Houfeng Wang (Peking University)
|12:00-13:30||Lunch Break (Within Peking University)|
Instructor: Vincent Ng (The University of Texas at Dallas)
Chair: Zhifang Sui (Peking University)
Instructor: Yang Liu (The University of Texas at Dallas)
Chair: Sujian Li (Peking University)
|12:00-13:30||Lunch Break (Within Peking University)|
Instructor: William Wang (Carnegie Mellon University)
Chair: Xianpei Han (Institute of Software, Chinese Academy of Sciences)
|17:00-18:00||Dinner Break (Within Peking University)|
Instructor: David Chiang (University of Notre Dame)
Chair: Chenqing Zong (Institute of Automation, Chinese Academy of Sciences)