NLP from Zero to the Grave (NLP 从入门到入土)
What is NLP
Making wheels (reinventing the wheel)
In the wheel-making frenzy of the big front-end era, every company built its own wheels, and a great deal of code was written over and over.
The ML field is better, but not by much. There are plenty of datasets and libraries for us to use, so there is no need to hand-write deep networks in TensorFlow anymore. The arrival of AutoKeras even makes tuning a network nearly foolproof.
The knowledge you need:
Basic Python knowledge.
A preliminary understanding of how neural networks work, such as gradient descent.
Recommended courses:
CS224n: Natural Language Processing with Deep Learning. Stanford University's famous NLP course. Because of Covid-19 it moved online, which even changed the professor's never-changing lecture style 😄. It is a detailed and systematic course, running from ML basics to the mathematical formulas, with very thorough notes. However, the formulas and algorithms are heavily mathematical and can be hard to follow for students without the background.
Deep Learning for Human Language Processing (2020, Spring). Taught by Hung-Yi Lee (Li Hong Yi), the famous "stand-up comedian" lecturer at National Taiwan University. His lectures are humorous and easy to follow, and the course has a high view count on YouTube.
Machine Learning (2021, Spring). Also taught by Hung-Yi Lee, this one is an introduction to ML. The two courses overlap in places, and the overlapping parts can be skipped as appropriate.
Representing words
Representing images
For images, we know that a grayscale image is simply a matrix.
An RGB image is a three-channel matrix.
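As a minimal sketch of this idea, assuming plain Python lists stand in for image arrays (the pixel values and the `box_blur` helper are made up for illustration), a grayscale image is a 2-D matrix, an RGB image carries three channels on the same grid, and a simple blur filter is just arithmetic over neighbouring entries:

```python
# A 3x3 grayscale image: one brightness value (0-255) per pixel.
gray = [
    [0, 0, 0],
    [0, 90, 0],
    [0, 0, 0],
]

# An RGB image of the same size: one (R, G, B) triple per pixel,
# i.e. three channels stacked on the same grid.
rgb = [[(v, v, v) for v in row] for row in gray]


def box_blur(image):
    """Replace each pixel by the mean of its 3x3 neighbourhood (edges clipped)."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            neighbours = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(height, r + 2))
                for cc in range(max(0, c - 1), min(width, c + 2))
            ]
            out[r][c] = sum(neighbours) / len(neighbours)
    return out


blurred = box_blur(gray)
print(blurred[1][1])  # centre pixel: 90 / 9 = 10.0
```

In practice an array library would replace the nested lists, but the point stays the same: every image operation is a numeric operation on the matrix.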

All kinds of image processing simply stack operations on top of this matrix: convolutions, filters, Gaussian blurs, Fourier transforms, and so on. So how can a machine be made to understand the vocabulary of human language?
How do we have usable meaning in a computer?
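From a machine-learning point of view, we have a set of samples (x, y), where x is a word and y is its part of speech, and we want to build a mapping f(x) -> y. But the mathematical model f (for example a neural network or an SVM) only accepts numerical input.
Words in NLP are human abstractions in symbolic form (Chinese, English, Latin, and so on), so they have to be converted into numerical form, or in other words embedded into a mathematical space. This kind of embedding is called word embedding, and Word2vec is one form of word embedding.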
- WordNet: a thesaurus containing lists of synonym sets and hypernyms ("is a" relationships).
Representing words as discrete symbols: one-hot vectors
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
Vector dimension = number of words in the vocabulary (e.g., 500,000), which is far too large.
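A minimal sketch of one-hot encoding, assuming a five-word toy vocabulary (the vocabulary and the `one_hot` helper are illustrative, not from any library):

```python
def one_hot(word, vocabulary):
    """Return a vector of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector


# Toy vocabulary (an assumption; a real one may have ~500,000 entries).
vocabulary = ["conference", "hotel", "motel", "room", "walk"]

motel = one_hot("motel", vocabulary)
hotel = one_hot("hotel", vocabulary)
print(motel)  # [0, 0, 1, 0, 0]
print(hotel)  # [0, 1, 0, 0, 0]

# The dot product of two different one-hot vectors is always 0,
# so the encoding carries no notion of similarity between words.
similarity = sum(m * h for m, h in zip(motel, hotel))
print(similarity)  # 0
```

Besides the size problem, this shows a second weakness: "motel" and "hotel" look exactly as unrelated as any other pair of words.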
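- Word vectors: also called word embeddings or (neural) word representations. They are a distributed representation.
We count how often words co-occur within a window of a pre-specified size, and use the co-occurrence counts of a word's neighbours as that word's vector. Concretely, we define the word representation by building a co-occurrence matrix from a large text corpus.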
For example
I like deep learning. I like NLP. I enjoy flying.
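The co-occurrence matrix for this small corpus can be sketched as follows. This is a rough illustration under a few assumptions: the window size is 1, punctuation is dropped, pairs are not counted across sentence boundaries, and `cooccurrence_counts` is a hypothetical helper name.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=1):
    """Count how often each ordered word pair appears within `window` tokens."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
counts = cooccurrence_counts(corpus, window=1)

vocabulary = sorted({w for s in corpus for w in s.split()})
# Each row of the matrix is the count-based vector of one word.
matrix = [[counts[(row, col)] for col in vocabulary] for row in vocabulary]

print(vocabulary)
print(matrix[vocabulary.index("I")])
```

The row for "I" has a 2 in the "like" column and a 1 in the "enjoy" column, so words that appear in similar contexts end up with similar rows.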

NLP Tasks

| | One Sequence | Multiple Sequences |
| --- | --- | --- |
| One Class | Sentiment Classification, Stance Detection, Veracity Prediction, Intent Classification, Dialogue Policy | NLI, Search Engine, Relation Extraction |
| Class for each Token | POS Tagging, Word Segmentation, Extractive Summarization, Slot Filling, NER | |
| Copy from Input | | Extractive QA |
| General Sequence | Abstractive Summarization, Translation, Grammar Correction, NLG | General QA, Task-Oriented Dialogue, Chatbot |
| Other? | Parsing, Coreference Resolution | |
Part-of-Speech (POS) Tagging

Word Segmentation
This is the problem of deciding where to split a sentence into words. It matters especially for Chinese, for example in clauses with several stacked attributives.

Extractive summarization
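The simplest solution: treat it as a binary classification problem that decides, for each sentence, whether it is added to the summary. It is like writing summaries when we were kids: the teacher asked us to summarize a text, and we just copied out a couple of sentences.
If we bring in DL, the whole text has to be taken into account: feed all the sentences in together, use an LSTM or Transformer as a binary classifier, and output for each sentence whether it goes into the summary.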
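The "one binary decision per sentence" framing can be sketched as follows. Note that this sketch swaps in a simple word-frequency score for the LSTM or Transformer, purely to keep the example self-contained; the scoring rule, the threshold and the helper names are all assumptions.

```python
from collections import Counter


def sentence_scores(sentences):
    """Score each sentence by the average document-wide frequency of its words."""
    words = [s.lower().split() for s in sentences]
    frequency = Counter(w for sentence in words for w in sentence)
    return [sum(frequency[w] for w in sentence) / len(sentence) for sentence in words]


def extractive_summary(sentences, threshold):
    """Binary decision per sentence: 1 = keep in summary, 0 = drop."""
    scores = sentence_scores(sentences)
    labels = [1 if score >= threshold else 0 for score in scores]
    summary = [s for s, keep in zip(sentences, labels) if keep]
    return labels, summary


document = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "rain fell overnight",
]
labels, summary = extractive_summary(document, threshold=2.0)
print(labels)
print(summary)
```

The scores depend on the whole document, which mirrors the point above: each sentence is judged in the context of the full text, not in isolation. A trained model would replace `sentence_scores`.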
Abstractive summarization
The machine needs to write the summary in its own words rather than copy directly from the original text. Solution: a Seq2Seq problem, long sequence -> short sequence.
Machine Translation
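There are about 7,000 languages, each with tens of thousands of words. Translating between every pair of languages needs on the order of 7000² (about 49 million) translation directions.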
Unsupervised learning!
Grammar Error Correction: Seq2seq. We can simply feed it data and train it by brute force.
More advanced: token in, token out, predicting only the difference. For example, three options per token: C for copy, R for replace, A for append.
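A rough sketch of how such per-token edit tags could be applied. The exact semantics are an assumption here: C keeps the token, R swaps it for a given word, and A keeps the token and adds a word after it. The `apply_edits` helper and the example sentence are made up for illustration.

```python
def apply_edits(tokens, edits):
    """Apply one (tag, word) edit per input token and return the corrected tokens."""
    if len(tokens) != len(edits):
        raise ValueError("need exactly one edit per token")
    output = []
    for token, (tag, word) in zip(tokens, edits):
        if tag == "C":      # copy the token unchanged
            output.append(token)
        elif tag == "R":    # replace the token with `word`
            output.append(word)
        elif tag == "A":    # keep the token, then append `word` after it
            output.extend([token, word])
        else:
            raise ValueError(f"unknown tag: {tag}")
    return output


source = "I are going school".split()
edits = [("C", None), ("R", "am"), ("A", "to"), ("C", None)]
print(" ".join(apply_edits(source, edits)))  # I am going to school
```

The model then only has to predict one tag per token, which is a much smaller output space than generating the whole corrected sentence.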
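Sentiment Classification: judging sentiment. It is used for ad targeting, gauging a movie's word of mouth, telling bullish from bearish stock news, tracking hype in crypto circles, and so on.
Stance Detection: detecting a speaker's stance. It is used for new-style polling, profiling voters who are unwilling to state a position, and targeting election ads. Bilibili's Avalon system is one example.
Source: "Trump is a good president." Reply: "He is just a capitalist." What is this user's stance? -> Denying. Many systems use support, denying, querying and commenting (SDQC, 4 classes) to classify replies.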
Natural Language Inference (NLI)
Premise: a green triangle. Hypothesis: the sum of any two sides is greater than the third side. Does the premise imply the hypothesis? Input: premise + hypothesis -> output: entailment.
Search engine: with BERT, this can be simplified to
2 inputs: search query + article content -> model -> relevance
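A minimal sketch of how two inputs are commonly packed into one sequence for a BERT-style model, using the `[CLS]` and `[SEP]` marker tokens and segment ids. Whitespace tokenization is an assumption made for brevity (real BERT uses subword tokenization), and `pack_pair` is a hypothetical helper, not a library function.

```python
def pack_pair(first, second):
    """Join two token lists into one sequence with segment ids, BERT-style."""
    tokens = ["[CLS]"] + first + ["[SEP]"] + second + ["[SEP]"]
    # Segment 0 covers [CLS], the first text and its [SEP]; segment 1 the rest.
    segments = [0] * (len(first) + 2) + [1] * (len(second) + 1)
    return tokens, segments


query = "cheap hotel".split()
article = "a motel is a roadside hotel".split()
tokens, segments = pack_pair(query, article)
print(tokens)
print(segments)
```

The same packing works for any task with two input sequences and one output class, such as premise + hypothesis in NLI.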
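Question Answering (QA) system
The traditional approach is a very large architecture that includes simple models such as SVMs. Input: question & knowledge source -> QA model -> answer. Reading comprehension (Extractive QA): general QA is still too hard to achieve. Current networks are only at the level of reading comprehension, outputting the position of the answer in the original text, e.g. (1_7-11: paragraph 1, words 7 to 11). If general QA were achieved, it would be the birth of an oracle.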
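To make the span output concrete, here is a small sketch that turns a position such as (1, 7, 11) back into answer text. It assumes 1-based, inclusive indices and whitespace-separated words; the `extract_answer` helper and the sample paragraph are made up for illustration.

```python
def extract_answer(paragraphs, paragraph_no, start, end):
    """Return words start..end (1-based, inclusive) of the given 1-based paragraph."""
    words = paragraphs[paragraph_no - 1].split()
    return " ".join(words[start - 1:end])


paragraphs = [
    "Stanford offers CS224n which is a course on natural language processing with deep learning",
]
# The span (1, 7, 11) reads as: paragraph 1, words 7 to 11.
print(extract_answer(paragraphs, 1, 7, 11))
```

The model never writes new text in this setting; it only predicts the start and end positions, and the answer is copied from the source.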
Chatting: awkward small talk. Just... awkward small talk.
Natural Language Generation (NLG)
Policy & State Tracker
Natural Language Understanding (NLU)
BERT is a character from Sesame Street; everyone keeps picking model names whose initials spell out Sesame Street characters. BERT and RNN will be covered in the next note.

LSTM: will be introduced in the next sharing session.
Quote from "Statistical approach to speech" by Prof. Keiichi Tokuda at Interspeech 2019:
"Every time I fire a linguist, the performance of the speech recognizer goes up."
