Apache Tajo Test (Windows)

디클 2015. 11. 13. 14:49

# Apache Tajo

- Apache Tajo™: A big data warehouse system on Hadoop

- http://tajo.apache.org/

# Installing Apache Tajo

- Download : http://tajo.apache.org/downloads.html

- Download the latest binary (Latest Release 0.11.0) and extract it

- Set HADOOP_HOME and JAVA_HOME in conf/tajo-env.cmd

@rem Hadoop home. Required

set HADOOP_HOME=%HADOOP_HOME%

@rem The java implementation to use. Required.

set JAVA_HOME=%JAVA_HOME%
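The two `set` lines above only pass through whatever is already defined in the system environment, so Tajo will not start properly if either variable is missing. A minimal Python sketch for checking this beforehand follows; `check_home_vars` is a hypothetical helper name used for illustration and is not part of Tajo.

```python
import os

def check_home_vars(names=("HADOOP_HOME", "JAVA_HOME"), env=None):
    """Return a dict mapping each problematic variable name to a description.

    An empty dict means every variable is set and points to an existing directory.
    """
    env = os.environ if env is None else env
    problems = {}
    for name in names:
        value = env.get(name)
        if not value:
            problems[name] = "not set"
        elif not os.path.isdir(value):
            problems[name] = "not a directory: " + value
    return problems

if __name__ == "__main__":
    # Print one line per problem; no output means both variables look usable.
    for name, problem in check_home_vars().items():
        print(name + ": " + problem)
```

Running the script in the same console that will launch Tajo shows whether the values seen by conf/tajo-env.cmd are usable.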

# Running Apache Tajo

bin\start-tajo.cmd

# Running tsql and testing

- Sample data: MovieLens movie ratings, http://grouplens.org/datasets/movielens/

- http://files.grouplens.org/datasets/movielens/ml-20m.zip (the MovieLens 20M Dataset is used)

> hadoop fs -ls /user/cdecl/data

Found 6 items

-rw-r--r-- 1 cdecl supergroup 8652 2015-11-13 13:03 /user/cdecl/data/README.txt

-rw-r--r-- 1 cdecl supergroup 569517 2015-11-13 13:03 /user/cdecl/data/links.csv

-rw-r--r-- 1 cdecl supergroup 1397542 2015-11-13 13:03 /user/cdecl/data/movies.csv

-rw-r--r-- 1 cdecl supergroup 258 2015-11-13 13:03 /user/cdecl/data/movies.csv.dsn

-rw-r--r-- 1 cdecl supergroup 533444411 2015-11-13 13:03 /user/cdecl/data/ratings.csv

-rw-r--r-- 1 cdecl supergroup 16603996 2015-11-13 13:03 /user/cdecl/data/tags.csv

- ratings.csv

- Movie rating records, about 500 MB, 20,000,264 rows

Ratings Data File Structure (ratings.csv)

-----------------------------------------

All ratings are contained in the file `ratings.csv`.

userId,movieId,rating,timestamp

138493,60816,4.5,1259865163

138493,61160,4.0,1258390537

138493,65682,4.5,1255816373

138493,66762,4.5,1255805408

138493,68319,4.5,1260209720
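The column types matter when the file is later mapped to a table: `rating` is fractional (4.5, 4.0) and `timestamp` is seconds since the Unix epoch in UTC. A minimal Python sketch that parses a few of the sample rows above makes this concrete; `parse_ratings` is a hypothetical helper name used only for illustration.

```python
import csv
import io
from datetime import datetime, timezone

# A few rows copied from the ratings.csv excerpt above.
SAMPLE = """userId,movieId,rating,timestamp
138493,60816,4.5,1259865163
138493,61160,4.0,1258390537
138493,65682,4.5,1255816373
"""

def parse_ratings(text):
    """Yield (user_id, movie_id, rating, rated_at) tuples from ratings.csv text."""
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        yield (
            int(row["userId"]),
            int(row["movieId"]),
            float(row["rating"]),  # fractional, so an integer type would not fit
            datetime.fromtimestamp(int(row["timestamp"]), tz=timezone.utc),
        )

rows = list(parse_ratings(SAMPLE))
print(rows[0][:3])
```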

- movies.csv

- Movie information, about 1 MB, 27,279 rows

Movies Data File Structure (movies.csv)

---------------------------------------

Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:

movieId,title,genres

131241,Ants in the Pants (2000),Comedy|Romance

131243,Werner - Gekotzt wird später (2003),Animation|Comedy

131248,Brother Bear 2 (2006),Adventure|Animation|Children|Comedy|Fantasy

131250,No More School (2000),Comedy

131252,Forklift Driver Klaus: The First Day on the Job (2001),Comedy|Horror
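The `genres` field is a pipe-separated list, and the title carries the release year in parentheses. A minimal Python sketch for splitting these apart is shown below, with `parse_movies` as a hypothetical helper name. The third sample row is invented to show a quoted title containing a comma. Python's `csv` module handles such quoting, whereas a plain split on `,` would see an extra column; whether Tajo's TEXT format handles quoted fields was not verified here.

```python
import csv
import io
import re

SAMPLE = '''movieId,title,genres
131241,Ants in the Pants (2000),Comedy|Romance
131248,Brother Bear 2 (2006),Adventure|Animation|Children|Comedy|Fantasy
999999,"Hypothetical Title, The (1999)",Drama
'''
# The 999999 row is a made-up example, not taken from the dataset.

YEAR_AT_END = re.compile(r"\((\d{4})\)\s*$")

def parse_movies(text):
    """Yield (movie_id, title, year_or_None, genres_list) from movies.csv text."""
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        match = YEAR_AT_END.search(row["title"])
        year = int(match.group(1)) if match else None
        yield (int(row["movieId"]), row["title"], year, row["genres"].split("|"))

movies = list(parse_movies(SAMPLE))
for movie in movies:
    print(movie)
```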

- Running tsql

D:\hadoop\tajo-0.11.0> bin\tsql

starting cli, logging to D:\hadoop\tajo-0.11.0\logs\tajo.log

Try \? for help.

default>

CREATE EXTERNAL table movies ( mid int, title text, genres text )

USING TEXT WITH ('text.delimiter'=',', 'text.skip.headerlines'='1')

LOCATION 'hdfs://localhost:9000/user/cdecl/data/movies.csv';

CREATE EXTERNAL table ratings ( userid int, mid int, rate float8, timest text )

USING TEXT WITH ('text.delimiter'=',', 'text.skip.headerlines'='1')

LOCATION 'hdfs://localhost:9000/user/cdecl/data/ratings.csv';

SELECT a.mid, max(b.title), avg(a.rate)

FROM ratings a join movies b on a.mid = b.mid

GROUP BY a.mid

ORDER BY avg(a.rate) DESC

LIMIT 10;
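To see what this query computes without a running cluster, the same join, grouping, and ordering can be tried on a tiny in-memory table. The following is a minimal sketch using Python's built-in `sqlite3` as a stand-in for Tajo (SQLite is not Tajo, and the rows are hypothetical values made up for illustration). The `rate` column is declared as `REAL` here so that fractional ratings are preserved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies  (mid INTEGER, title TEXT, genres TEXT);
CREATE TABLE ratings (userid INTEGER, mid INTEGER, rate REAL, timest TEXT);
""")

# Hypothetical rows for illustration only; they are not taken from MovieLens.
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)", [
    (1, "Movie A", "Comedy"),
    (2, "Movie B", "Drama"),
    (3, "Movie C", "Horror"),
])
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?, ?)", [
    (10, 1, 4.5, "0"), (11, 1, 3.5, "0"),
    (10, 2, 5.0, "0"),
    (11, 3, 2.0, "0"), (12, 3, 3.0, "0"),
])

# Same shape as the tsql query: average rating per movie, best first.
TOP_RATED = """
SELECT a.mid, max(b.title), avg(a.rate)
FROM ratings a JOIN movies b ON a.mid = b.mid
GROUP BY a.mid
ORDER BY avg(a.rate) DESC
LIMIT 10
"""
top = conn.execute(TOP_RATED).fetchall()
for mid, title, avg_rate in top:
    print(mid, title, avg_rate)
```

In this toy data, "Movie B" ranks first on the strength of a single 5.0 rating. On the real dataset the same effect is likely, so a minimum-count filter such as `HAVING count(*) >= 100` (the threshold is an arbitrary assumption) may give a more meaningful top 10.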

- To get the same result, Spark (Python) took about 3 minutes while Tajo took about 1 minute, so for a simple single-node run Tajo appears to be the faster of the two

- However, both Spark and Tajo are designed to maximize performance on a cluster of many nodes rather than on a single node, so this local run should be treated as a simple test only

- Spark(Python) Test : http://cdecl.tistory.com/306

License: Attribution, NonCommercial, ShareAlike (CC BY-NC-SA)
