Probability Notation

EC2 (Amazon Elastic Compute Cloud): scalable computing
- instance: a virtual computing environment
AMI (Amazon Machine Image): a template for launching an instance, provided preconfigured with the OS, software, and other components the server needs

[AWS_1] Basic Terms

2019-2-28 Thu 10:57

AWS

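1. Terms (misc.)

metadata: information attached to data according to set rules to make search efficient (data about the data)
ETL: Extract, Transform, Load of data
Computer cluster: a group of several computers used as if they were a single machine
Scale up/down: increase/decrease the CPU and memory of a server (vertical scaling)
Scale out/in: increase/decrease the number of servers (horizontal scaling)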

[Stat] Coefficient of Determination (R-squared)

2019-2-23 Sat 09:22

Math

Coefficient of Determination ($R^2$)

[Linux] Command Cheat Sheet(on going)

2019-2-17 Sun 15:17

Linux

1) Show only directories within the current directory: ls -d */

[Linux] netstat to check network status

2019-2-17 Sun 15:06

Linux

netstat is used to check port status.

[Math] Covariance and Correlation

2019-2-16 Sat 13:30

Math

Covariance & correlation coefficient

[Time Series] ARIMA Model Parameter Estimation: General

2019-2-14 Thu 21:56

Math

Estimating stochastic process models

[Linux] Restart ubuntu wifi without reboot

2019-2-14 Thu 19:40

Linux

Wi-Fi on Ubuntu often fails to find a signal.
Here is how to restart Wi-Fi without rebooting.

$ sudo lshw -C network 2>&1 | grep wireless | grep driver

# the driver name (driver=...) in the output is needed for the next step
>> configuration: broadcast=yes driver=iwlwifi driverversion=4.15.0-45-generic firmware=18.168.6.1 latency=0 link=no multicast=yes wireless=IEEE 802.11

$ sudo modprobe -r iwlwifi && sudo modprobe iwlwifi

Clear.

[ML] Curse of Dimensionality

2019-2-13 Wed 19:26

ML

Curse of Dimensionality

[Math] Eigenvalue Decomposition

2019-2-13 Wed 19:25

Math

Eigenvalue Decomposition

[Linux] Make alias

2019-2-11 Mon 12:02

Linux

1) Execute application on Linux

[Math] Singular Value Decomposition, SVD

2019-2-11 Mon 12:02

Math

Singular Value Decomposition (SVD)

[Stat] Cumulative Distribution Function (CDF) & Probability Density Function (PDF)

2019-2-8 Fri 20:04

Math

1. Probability model

[Stat] p-value

2019-2-8 Fri 19:55

Math

1. p-value (significance probability)

[Time Series] General Linear Process: MA (3)

2019-2-4 Mon 15:58

Math

General linear process model

[Time Series] Nonstationary Stochastic Processes (2)

2019-2-4 Mon 15:58

Math

Nonstationary stochastic processes

[Time Series] General Linear Process: AR (4)

2019-2-4 Mon 15:58

Math

General linear process model

[Time Series] Stochastic Processes and Time Series Theory (1)

2019-2-4 Mon 15:58

Math

Stochastic process (random process):

[Time Series] General Linear Process: ARMA / ARIMA (5)

2019-2-4 Mon 15:58

Math

General linear process model

[Linux] "Brackets"(Snippets) Install(Trouble shooting)

2019-1-31 Thu 06:57

Linux

Install

[Math] Parameter Estimation

2019-1-30 Wed 22:14

Math

Tests and parameter estimation

[Hexo] Customize hexo-cactus theme using .ejs

2019-1-29 Tue 15:31

Github

Example Picture

[Linux] Install virtual python env

2019-1-28 Mon 15:11

Linux

1.1 Installing pyenv

[ML] WordCloud with NLTK

2019-1-28 Mon 15:11

ML►NLP

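Stemming: simply strips word endings, so the result is not necessarily a true stem.
Lemmatizing: unifies different forms of the same word into its dictionary form.
- More accurate when the part of speech is specified.
Part-of-speech (POS) tagging
POS: a classification of words by grammatical function, form, and meaning.
NLTK uses the Penn Treebank tagset:
- NNP: proper noun, singular
- VB: verb, base form
- VBP: verb, present tense (non-3rd person singular)
- TO: "to" (preposition or infinitive marker)
- NN: noun, singular or mass
- DT: determiner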

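cf. POS tagging: text pre-processing practice

In scikit-learn text analysis, the same token with a different part of speech should count as a different token.
How to handle it:
- Convert each token to the form "token/POS".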

4. text class

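plot: graphs the usage frequency of word tokens
dispersion_plot: visualizes the positions where words are used
- e.g., where characters appear in a novel
concordance: displays lines containing the word, up to the number given by lines
similar: words used in contexts similar to the given word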

5. FreqDist

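FreqDist: a class that holds the usage frequency of the words in a document
return: {'word': frequency}
N(): total number of words
freq("word"): relative frequency (probability)
most_common: the most frequent words
5.1 Usage 1)
Extract with the vocab method of the Text class
5.2 Usage 2)
Create a FreqDist object from words selected out of the corpus
- e.g., extract only people's names (NNP, proper nouns) from the Emma corpus and apply stop words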

6. wordcloud

Uses FreqDist
Visualization by word frequency

Content

1. Corpus

import nltk
nltk.download('book', quiet=True)

True

from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

raw = nltk.corpus.gutenberg.raw('bryant-stories.txt')
print(raw[:300])

[Stories to Tell to Children by Sara Cone Bryant 1918]


TWO LITTLE RIDDLES IN RHYME


     There's a garden that I ken,
     Full of little gentlemen;
     Little caps of blue they wear,
     And green ribbons, very fair.
           (Flax.)

     From house to house he goes,
     A me

2. Tokenizing

sentence unit

sent_tokenize: returns a list of sentences

from nltk.tokenize import sent_tokenize
sent_tokenize(raw[:300])

["[Stories to Tell to Children by Sara Cone Bryant 1918] \r\n\r\n\r\nTWO LITTLE RIDDLES IN RHYME\r\n\r\n\r\n     There's a garden that I ken,\r\n     Full of little gentlemen;\r\n     Little caps of blue they wear,\r\n     And green ribbons, very fair.",
 '(Flax.)',
 'From house to house he goes,\r\n     A me']

word unit

word_tokenize: gives the same result as TreebankWordTokenizer on this example

from nltk.tokenize import word_tokenize
word_tokenize("this's, a, test! ha.")

['this', "'s", ',', 'a', ',', 'test', '!', 'ha', '.']

from nltk.tokenize import TreebankWordTokenizer
tree = TreebankWordTokenizer()
tree.tokenize("this's, a, test! ha.")

['this', "'s", ',', 'a', ',', 'test', '!', 'ha', '.']

WordPunctTokenizer

from nltk.tokenize import WordPunctTokenizer
punct = WordPunctTokenizer()
punct.tokenize("this's, a, test! ha.")

['this', "'", 's', ',', 'a', ',', 'test', '!', 'ha', '.']

RegexpTokenizer

from nltk.tokenize import RegexpTokenizer
# raw string avoids an invalid-escape warning for \w
pattern = r"[\w]+"
retokenize = RegexpTokenizer(pattern)
retokenize.tokenize(raw[50:100])

['918', 'TWO', 'LITTLE', 'RIDDLES', 'IN', 'RHYME', 'T']
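
As a rough illustration of what the pattern above does, the sketch below applies the same regular expression with the standard library only. It assumes, as I understand it, that RegexpTokenizer in its default mode simply collects every match of the pattern; the helper name regex_tokenize is made up for this example.

```python
import re

# Same pattern as in the RegexpTokenizer example: runs of word characters.
WORD_PATTERN = re.compile(r"[\w]+")

def regex_tokenize(text):
    """Return every maximal run of word characters; punctuation is dropped."""
    return WORD_PATTERN.findall(text)

sample = "this's, a, test! ha."
tokens = regex_tokenize(sample)
print(tokens)
```

Unlike word_tokenize and WordPunctTokenizer, this keeps no punctuation tokens at all, which is why it is convenient for frequency counts later on.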

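3. Morphological analysis

Stemming: simply strips word endings, so the result is not necessarily a true stem.
Lemmatizing: unifies different forms of the same word into its dictionary form.
- More accurate when the part of speech is specified.
Part-of-speech (POS) tagging

words = retokenize.tokenize(raw[1300:2000])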

stemming

from nltk.stem import PorterStemmer
st = PorterStemmer()
[(w, st.stem(w)) for w in words][:15]

[('said', 'said'),
 ('a', 'a'),
 ('little', 'littl'),
 ('soft', 'soft'),
 ('cheery', 'cheeri'),
 ('voice', 'voic'),
 ('and', 'and'),
 ('I', 'I'),
 ('want', 'want'),
 ('to', 'to'),
 ('come', 'come'),
 ('in', 'in'),
 ('N', 'N'),
 ('no', 'no'),
 ('said', 'said')]

from nltk.stem import LancasterStemmer
st = LancasterStemmer()
[(w, st.stem(w)) for w in words][:15]

[('said', 'said'),
 ('a', 'a'),
 ('little', 'littl'),
 ('soft', 'soft'),
 ('cheery', 'cheery'),
 ('voice', 'voic'),
 ('and', 'and'),
 ('I', 'i'),
 ('want', 'want'),
 ('to', 'to'),
 ('come', 'com'),
 ('in', 'in'),
 ('N', 'n'),
 ('no', 'no'),
 ('said', 'said')]
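
To see why stemmed output such as 'littl' or 'voic' is not a real word, here is a toy suffix stripper. It is not the Porter or Lancaster algorithm; the suffix list and the minimum-length rule are assumptions chosen only to show the "strip the ending" idea.

```python
# Hypothetical suffix list for illustration; real stemmers use ordered rule sets.
SUFFIXES = ("ing", "ed", "es", "s", "y", "e")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["little", "voice", "wanted", "coming", "said"]:
    print(w, "->", toy_stem(w))
```

Because the rules only look at the spelling of the ending, the result is a truncated string rather than a dictionary form.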

lemmatizing

from nltk.stem import WordNetLemmatizer
lm = WordNetLemmatizer()
[(w, lm.lemmatize(w)) for w in words][:15]

[('said', 'said'),
 ('a', 'a'),
 ('little', 'little'),
 ('soft', 'soft'),
 ('cheery', 'cheery'),
 ('voice', 'voice'),
 ('and', 'and'),
 ('I', 'I'),
 ('want', 'want'),
 ('to', 'to'),
 ('come', 'come'),
 ('in', 'in'),
 ('N', 'N'),
 ('no', 'no'),
 ('said', 'said')]
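
In the output above 'said' comes back unchanged, presumably because lemmatize treats a word as a noun unless a pos argument is given. The toy lookup below sketches why the part of speech matters; the mini-dictionary and the function toy_lemmatize are hypothetical and stand in for WordNet.

```python
# Hypothetical mini-dictionary keyed by (surface form, part of speech).
LEMMAS = {
    ("said", "v"): "say",
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
    ("ribbons", "n"): "ribbon",
}

def toy_lemmatize(word, pos="n"):
    """Look up the dictionary form; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize("said"))       # looked up as a noun: no entry, unchanged
print(toy_lemmatize("said", "v"))  # looked up as a verb
print(toy_lemmatize("saw", "n"), toy_lemmatize("saw", "v"))
```

The same surface form maps to different dictionary forms depending on its part of speech, which is the reason POS tagging is usually done before lemmatizing.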

pos tagging

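POS: a classification of words by grammatical function, form, and meaning.
NLTK uses the Penn Treebank tagset:
- NNP: proper noun, singular
- VB: verb, base form
- VBP: verb, present tense (non-3rd person singular)
- TO: "to" (preposition or infinitive marker)
- NN: noun, singular or mass
- DT: determiner

from nltk.tag import pos_tag
sentence = sent_tokenize(raw[203:400])[0]
sentence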

'And green ribbons, very fair.'

word = word_tokenize(sentence)
word

['And', 'green', 'ribbons', ',', 'very', 'fair', '.']

pos_tag

tagged_list = pos_tag(word)
tagged_list

[('And', 'CC'),
 ('green', 'JJ'),
 ('ribbons', 'NNS'),
 (',', ','),
 ('very', 'RB'),
 ('fair', 'JJ'),
 ('.', '.')]

nltk.help.upenn_tagset('JJ')

JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...

filtering

cc_list = [t[0] for t in tagged_list if t[1] == "CC"]
cc_list

['And']

untag: returns only the words

from nltk.tag import untag
untag(tagged_list)

['And', 'green', 'ribbons', ',', 'very', 'fair', '.']

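POS tagging: text pre-processing practice

In scikit-learn text analysis, the same token with a different part of speech should count as a different token.
How to handle it:
- Convert each token to the form "token/POS".

def tokenizer(doc):
    # tag the tokens of the given document rather than the global tagged_list
    return ["/".join(p) for p in pos_tag(word_tokenize(doc))]

tokenizer(sentence)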

['And/CC', 'green/JJ', 'ribbons/NNS', ',/,', 'very/RB', 'fair/JJ', './.']
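
The effect of the "token/POS" form can be sketched with a plain counter. The tagged pairs below are hand-written, hypothetical data (not pos_tag output); the point is only that joining the tag onto the token keeps the two uses of a word apart when counting.

```python
from collections import Counter

# Hand-tagged pairs (hypothetical): "fair" appears as adjective and as noun.
tagged = [("fair", "JJ"), ("fair", "NN"), ("fair", "JJ"), ("green", "JJ")]

# Counting bare tokens merges both uses of "fair".
plain_counts = Counter(word for word, _ in tagged)

# Counting "token/POS" strings keeps them separate.
joined_counts = Counter("/".join(pair) for pair in tagged)

print(plain_counts["fair"])
print(joined_counts["fair/JJ"], joined_counts["fair/NN"])
```

A vectorizer that receives such joined strings from its tokenizer would then build separate features for each part of speech.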

4. text class

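plot: graphs the usage frequency of word tokens
dispersion_plot: visualizes the positions where words are used
- e.g., where characters appear in a novel
concordance: displays lines containing the word, up to the number given by lines
similar: words used in contexts similar to the given word

import matplotlib.pyplot as plt
from nltk import Text
text = Text(retokenize.tokenize(raw))

plot: graphs the usage frequency of word tokens

text.plot(30)
plt.show()

[Figure: frequency plot of the 30 most common tokens]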

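dispersion_plot: visualizes the positions where words are used
- e.g., where characters appear in a novel

raw = nltk.corpus.gutenberg.raw('austen-emma.txt')
text = Text(retokenize.tokenize(raw))

# 'Knightley' is the spelling used in the novel
text.dispersion_plot(['Emma', 'Knightley', 'Frank', 'Jane', 'Robert'])
plt.show()

[Figure: lexical dispersion plot of the five names]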
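
The data behind a dispersion plot is just the list of token offsets at which each word occurs. A minimal sketch, using a made-up sentence instead of the Emma corpus and a hypothetical helper word_offsets:

```python
def word_offsets(tokens, target):
    """Return the token positions at which target occurs."""
    return [i for i, tok in enumerate(tokens) if tok == target]

# Made-up token list standing in for a tokenized novel.
tokens = "Emma met Frank and then Emma left".split()

emma_offsets = word_offsets(tokens, "Emma")
frank_offsets = word_offsets(tokens, "Frank")
print(emma_offsets, frank_offsets)
```

Plotting each offset list as marks along a horizontal axis, one row per word, gives the kind of picture dispersion_plot draws.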

concordance: displays lines containing the word, up to the number given by lines

text.concordance('Emma', lines=5)

Displaying 5 of 865 matches:
 Emma by Jane Austen 1816 VOLUME I CHAPTER
 Jane Austen 1816 VOLUME I CHAPTER I Emma Woodhouse handsome clever and rich w
f both daughters but particularly of Emma Between _them_ it was more the intim
nd friend very mutually attached and Emma doing just what she liked highly est
 by her own The real evils indeed of Emma s situation were the power of having

similar: words used in contexts similar to the given word

text.similar('Emma', 10)

she it he i harriet you her jane him that

5. FreqDist

FreqDist: a class that holds the usage frequency of the words in a document
return: {'word': frequency}

Usage 1)

Extract with the vocab method of the Text class

fd = text.vocab()
type(fd)

nltk.probability.FreqDist

Usage 2)

Create a FreqDist object from words selected out of the corpus
- e.g., extract only people's names (NNP, proper nouns) from the Emma corpus and apply stop words

nltk.help.upenn_tagset('NNP')

NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...

emma_tokens = pos_tag(retokenize.tokenize(raw))
len(emma_tokens), emma_tokens[0]

(161983, ('Emma', 'NN'))

from nltk import FreqDist

stopwords = ['Mr.', 'Mrs.', 'Miss', 'Mr', 'Mrs', 'Dear']
names_list = [t[0] for t in emma_tokens if t[1] == "NNP" and t[0] not in stopwords]
fd_names = FreqDist(names_list)
fd_names

FreqDist({'Emma': 830, 'Harriet': 491, 'Weston': 439, 'Knightley': 389, 'Elton': 385, 'Woodhouse': 304, 'Jane': 299, 'Fairfax': 241, 'Churchill': 223, 'Frank': 208, ...})

N(): total number of words
freq("word"): relative frequency (probability)

fd_names.N(), fd_names['Emma'], fd_names.freq('Emma')

(7863, 830, 0.10555767518758744)

most_common: the most frequent words

fd_names.most_common(5)

[('Emma', 830),
 ('Harriet', 491),
 ('Weston', 439),
 ('Knightley', 389),
 ('Elton', 385)]
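
FreqDist builds on Python's Counter, so the three quantities above can be reproduced with the standard library. This is a sketch on a made-up list of names, not the Emma data; only the last line reuses the counts reported above (830 out of 7863).

```python
from collections import Counter

# Made-up list of names standing in for names_list.
names = ["Emma", "Harriet", "Emma", "Weston", "Emma", "Harriet"]

counts = Counter(names)
total = sum(counts.values())        # plays the role of N()
rel_freq = counts["Emma"] / total   # plays the role of freq("Emma")
top_two = counts.most_common(2)     # same method name as on FreqDist

print(total, counts["Emma"], rel_freq)
print(top_two)
print(830 / 7863)  # count / N for 'Emma' in the output above
```

In other words, freq is simply the count divided by the total number of samples.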

6. wordcloud

Uses FreqDist
Visualization by word frequency

from wordcloud import WordCloud
wc = WordCloud(width=1000, height=600, background_color='white', random_state=0)
plt.imshow(wc.generate_from_frequencies(fd_names))
plt.axis('off')
plt.show()

[Linux] Create and post jupyter.md

2019-1-28 Mon 15:11

Linux

See the example below:

[Hexo] Create hexo github page

2019-1-26 Sat 20:25

Github

install Node.js

0. Architecture

Encoder part

Relationships between variables

Overview

sample code

Summary of gradient descent, loss, and weight updates

LSTM

1. Dimensionality reduction

Softmax function derivative

Okapi BM25

1. Sigmoid with binary cross-entropy backpropagation

Word2vec implementation

Using the Python logging module

1. Word representation: statistical vs. inference-based

hexo basic cmd

Data science CheatSheet

Reference site: Data Science School https://datascienceschool.net/

For personal reference notes

Tutorial: Building a Hadoop cluster with GCP instances

Kafka

AWS kinesis?

Points to note at the collection stage (events, logs, etc.)

Cluster resource optimization

Basic statistical tests with SciPy

EMR overview

Workflow: DAG(Directed Acyclic Graph), Stages and Task

Spark_SQL basic

How to install Pyspark

Hadoop HDFS Architecture

Spark Intro

Statistical tests: comparing groups

Study notes

Assumptions of probabilistic linear regression & residual analysis

Large-scale data

AWS RDS

1. create EC2 instance
