Data Ingestion, Gobblin

by Chan_찬 2016. 9. 27.

Gobblin

Download

Gobblin release download : Release Page

Extract the archive:

    tar -zxvf gobblin-distribution-…tar.gz
    cd gobblin-dist

Environment setup

Gobblin job config directory:

Directory for storing job configuration files. Environment variable: GOBBLIN_JOB_CONFIG_DIR. Also check that the JAVA_HOME environment variable is set correctly.

Gobblin working directory:

Stores Gobblin's job output, locks, state-store, and similar data. Environment variable: GOBBLIN_WORK_DIR

    export GOBBLIN_JOB_CONFIG_DIR=/var/javaApps/gobblin-dist/job-conf
    export GOBBLIN_WORK_DIR=/var/javaApps/gobblin-dist/work
    export JAVA_HOME=
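Before starting Gobblin it can help to confirm that all three variables are actually set. The following is a minimal sketch, not part of Gobblin: `check_gobblin_env` is a hypothetical helper name, and creating the two directories when they are missing is an assumption about what is convenient here.

```shell
# Hypothetical helper (not shipped with Gobblin): verify that the three
# variables from this post are set, then create the two Gobblin directories
# if they do not exist yet.
check_gobblin_env() {
    for name in GOBBLIN_JOB_CONFIG_DIR GOBBLIN_WORK_DIR JAVA_HOME; do
        # Indirect lookup of the variable whose name is in $name (POSIX sh).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set" >&2
            return 1
        fi
    done
    mkdir -p "$GOBBLIN_JOB_CONFIG_DIR" "$GOBBLIN_WORK_DIR"
}
```

Run `check_gobblin_env` after the `export` lines; a non-zero exit status means one of the variables is empty or missing.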

Job configuration file

    vi $GOBBLIN_JOB_CONFIG_DIR/wikipedia.pull
    job.name=PullFromWikipedia
    job.group=Wikipedia
    job.description=A getting started example for Gobblin
    source.class=gobblin.example.wikipedia.WikipediaSource
    source.page.titles=LinkedIn,Wikipedia:Sandbox
    source.revisions.cnt=5
    wikipedia.api.rooturl=https://en.wikipedia.org/w/api.php
    wikipedia.avro.schema={"namespace": "example.wikipedia.avro","type": "record","name": "WikipediaArticle","fields": [{"name": "revid", "type": ["double", "null"]},{"name": "pageid", "type": ["double", "null"]},{"name": "title", "type": ["string", "null"]},{"name": "user", "type": ["string", "null"]},{"name": "anon", "type": ["string", "null"]},{"name": "userid",  "type": ["double", "null"]},{"name": "timestamp", "type": ["string", "null"]},{"name": "size",  "type": ["double", "null"]},{"name": "contentformat",  "type": ["string", "null"]},{"name": "contentmodel",  "type": ["string", "null"]},{"name": "content", "type": ["string", "null"]}]}
    gobblin.wikipediaSource.maxRevisionsPerPage=10
    converter.classes=gobblin.example.wikipedia.WikipediaConverter
    extract.namespace=gobblin.example.wikipedia
    writer.destination.type=HDFS
    writer.output.format=AVRO
    writer.partitioner.class=gobblin.example.wikipedia.WikipediaPartitioner
    data.publisher.type=gobblin.publisher.BaseDataPublisher

Starting Gobblin

    bin/gobblin-standalone.sh start
    bin/gobblin-standalone.sh stop

The results are saved to files.

    $GOBBLIN_WORK_DIR/job-output/gobblin/part.task_[]_<timestamp>.avro
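The exact subdirectory and file name depend on the job, so searching the whole output directory is often easier than guessing the path. A minimal sketch, assuming only that the output lives somewhere under `$GOBBLIN_WORK_DIR/job-output`; `list_gobblin_output` is a hypothetical helper name, not a Gobblin command.

```shell
# Hypothetical helper: list every Avro file under the job-output directory.
# Aborts with a message if GOBBLIN_WORK_DIR is unset or empty.
list_gobblin_output() {
    find "${GOBBLIN_WORK_DIR:?GOBBLIN_WORK_DIR is not set}/job-output" \
        -type f -name '*.avro'
}
```

Calling `list_gobblin_output` prints one path per line, which can then be passed to the conversion step.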

avro tools

download

    curl -O http://central.maven.org/maven2/org/apache/avro/avro-tools/1.8.1/avro-tools-1.8.1.jar    

avro -> json

    java -jar avro-tools-1.8.1.jar tojson --pretty [job_output].avro > output.json   
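When a job produces several output files, the same `tojson` command can be run in a loop. This is a rough sketch under stated assumptions: `avro_dir_to_json` and the `AVRO_TOOLS_JAR` variable are hypothetical names introduced here, and the jar is assumed to sit in the current directory unless that variable says otherwise.

```shell
# Hypothetical helper: convert every .avro file under a directory to JSON,
# writing <name>.json next to each input file.
# AVRO_TOOLS_JAR is an assumed override; the default is the jar downloaded above.
avro_dir_to_json() {
    dir=$1
    jar=${AVRO_TOOLS_JAR:-avro-tools-1.8.1.jar}
    find "$dir" -type f -name '*.avro' | while IFS= read -r f; do
        # </dev/null keeps java from consuming the file list on stdin.
        java -jar "$jar" tojson --pretty "$f" > "${f%.avro}.json" < /dev/null
    done
}
```

For example, `avro_dir_to_json "$GOBBLIN_WORK_DIR/job-output"` would leave a `.json` file beside each `.avro` file.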