# Apache Tajo
- Apache Tajo™: A big data warehouse system on Hadoop
# Installing Apache Tajo
- Download : http://tajo.apache.org/downloads.html
- Download the latest binary (Latest Release 0.11.0) and extract it
- Set HADOOP_HOME and JAVA_HOME in conf/tajo-env.cmd
@rem Hadoop home. Required
set HADOOP_HOME=%HADOOP_HOME%
@rem The java implementation to use. Required.
set JAVA_HOME=%JAVA_HOME%
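Because the lines above simply pass `%HADOOP_HOME%` and `%JAVA_HOME%` through from the environment, both variables need to be defined before Tajo is started. A minimal sketch of such a check follows; the helper name `missing_vars` is made up for illustration and is not part of Tajo.

```python
import os

# The two variables that conf/tajo-env.cmd marks as required.
REQUIRED_VARS = ("HADOOP_HOME", "JAVA_HOME")


def missing_vars(env, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars(os.environ)
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("HADOOP_HOME and JAVA_HOME are set")
```

Running it in the same console that will launch Tajo shows whether the values would reach `tajo-env.cmd`.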
# Running Apache Tajo
bin\start-tajo.cmd
# Running and testing tsql
- Uses the movie ratings sample data - http://grouplens.org/datasets/movielens/
- http://files.grouplens.org/datasets/movielens/ml-20m.zip (using the MovieLens 20M Dataset)
> hadoop fs -ls /user/cdecl/data
Found 6 items
-rw-r--r-- 1 cdecl supergroup 8652 2015-11-13 13:03 /user/cdecl/data/README.txt
-rw-r--r-- 1 cdecl supergroup 569517 2015-11-13 13:03 /user/cdecl/data/links.csv
-rw-r--r-- 1 cdecl supergroup 1397542 2015-11-13 13:03 /user/cdecl/data/movies.csv
-rw-r--r-- 1 cdecl supergroup 258 2015-11-13 13:03 /user/cdecl/data/movies.csv.dsn
-rw-r--r-- 1 cdecl supergroup 533444411 2015-11-13 13:03 /user/cdecl/data/ratings.csv
-rw-r--r-- 1 cdecl supergroup 16603996 2015-11-13 13:03 /user/cdecl/data/tags.csv
- ratings.csv
- Movie rating data, about 500 MB, 20,000,264 rows
Ratings Data File Structure (ratings.csv)
-----------------------------------------
All ratings are contained in the file `ratings.csv`.
userId,movieId,rating,timestamp
userId,movieId,rating,timestamp
138493,60816,4.5,1259865163
138493,61160,4.0,1258390537
138493,65682,4.5,1255816373
138493,66762,4.5,1255805408
138493,68319,4.5,1260209720
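The sample rows show that `rating` holds half-step values such as 4.5, so the column needs a floating-point type when it is loaded; an integer column cannot keep the fraction. A small sketch that parses a few of the rows above with Python's standard `csv` module (the function name `parse_ratings` is illustrative only):

```python
import csv
import io

# A few rows copied from the ratings.csv sample above.
SAMPLE = """userId,movieId,rating,timestamp
138493,60816,4.5,1259865163
138493,61160,4.0,1258390537
138493,65682,4.5,1255816373
"""


def parse_ratings(text):
    """Parse ratings CSV text into (userId, movieId, rating, timestamp) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        (int(r["userId"]), int(r["movieId"]), float(r["rating"]), int(r["timestamp"]))
        for r in reader
    ]


rows = parse_ratings(SAMPLE)
for row in rows:
    print(row)
```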
- movies.csv
- Movie information, about 1 MB, 27,279 rows
Movies Data File Structure (movies.csv)
---------------------------------------
Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:
movieId,title,genres
movieId,title,genres
131241,Ants in the Pants (2000),Comedy|Romance
131243,Werner - Gekotzt wird später (2003),Animation|Comedy
131248,Brother Bear 2 (2006),Adventure|Animation|Children|Comedy|Fantasy
131250,No More School (2000),Comedy
131252,Forklift Driver Klaus: The First Day on the Job (2001),Comedy|Horror
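The rows above split cleanly on commas, but the dataset README notes that columns containing a comma are escaped with double quotes, so a title such as "American President, The (1995)" would not. The sketch below compares a plain split with Python's `csv` reader; the input line is written by hand in that quoted style as an assumption, not copied from the file.

```python
import csv
import io

# Assumed example of a quoted row; not taken verbatim from movies.csv.
LINE = '11,"American President, The (1995)",Comedy|Drama|Romance'

# Splitting on every comma breaks the quoted title into two fields.
naive = LINE.split(",")

# A CSV-aware reader honours the double quotes and keeps the title whole.
parsed = next(csv.reader(io.StringIO(LINE)))

# The genres column is itself a pipe-separated list.
genres = parsed[2].split("|")

print(len(naive), naive)
print(len(parsed), parsed)
print(genres)
```

A loader that only splits on a delimiter character would see four columns for this kind of row, which is worth keeping in mind when such rows are queried by title or genre.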
- Run tsql
D:\hadoop\tajo-0.11.0
> bin\tsql
starting cli, logging to D:\hadoop\tajo-0.11.0\logs\tajo.log
Try \? for help.
default>
CREATE EXTERNAL TABLE movies ( mid int, title text, genres text )
USING TEXT WITH ('text.delimiter'=',', 'text.skip.headerlines'='1')
LOCATION 'hdfs://localhost:9000/user/cdecl/data/movies.csv';
CREATE EXTERNAL TABLE ratings ( userid int, mid int, rate float8, timest text )
USING TEXT WITH ('text.delimiter'=',', 'text.skip.headerlines'='1')
LOCATION 'hdfs://localhost:9000/user/cdecl/data/ratings.csv';
SELECT a.mid, max(b.title), avg(a.rate)
FROM ratings a join movies b on a.mid = b.mid
GROUP BY a.mid
ORDER BY avg(a.rate) DESC
LIMIT 10;
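To make clear what the query computes, here is a rough plain-Python equivalent of the join, `GROUP BY`, `avg`, `ORDER BY ... DESC` and `LIMIT` steps. It is only an illustration on a tiny in-memory sample: the movie IDs and titles come from the movies.csv excerpt, while the user IDs and rating values are invented for the example, and `top_rated` is a hypothetical helper name.

```python
from collections import defaultdict

# (mid, title) pairs taken from the movies.csv sample.
movies = [
    (131241, "Ants in the Pants (2000)"),
    (131250, "No More School (2000)"),
]

# (userid, mid, rate) rows; these values are made up for illustration.
ratings = [
    (1, 131241, 4.0),
    (2, 131241, 3.0),
    (1, 131250, 4.5),
    (3, 999999, 5.0),  # no matching movie, so the inner join drops it
]


def top_rated(ratings, movies, limit=10):
    """Average rating per movie, highest first, like the SQL query."""
    titles = dict(movies)
    sums = defaultdict(lambda: [0.0, 0])  # mid -> [total, count]
    for _userid, mid, rate in ratings:
        if mid in titles:  # JOIN ... ON a.mid = b.mid
            sums[mid][0] += rate
            sums[mid][1] += 1
    result = [(mid, titles[mid], total / count)
              for mid, (total, count) in sums.items()]
    result.sort(key=lambda row: row[2], reverse=True)  # ORDER BY avg DESC
    return result[:limit]  # LIMIT


top = top_rated(ratings, movies)
for row in top:
    print(row)
```

In the SQL, `max(b.title)` serves only to pull the single title of each group through the aggregation; the dictionary lookup plays that role here.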
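- To get the same query result, Spark (Python) took about 3 minutes while Tajo took about 1 minute, so in a plain single-node run Tajo appears to be the faster of the two
- However, both Spark and Tajo are meant to maximize performance by running on a cluster of many nodes rather than on one node, so the local run should be treated as a simple test only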
- Spark(Python) Test : http://cdecl.tistory.com/306