I. Druid design and architecture
Design docs: http://druid.io/docs/0.9.1.1/design/design.html
White paper: http://static.druid.io/docs/druid.pdf
Cluster deployment guide: https://docs.imply.io/on-premise/deploy/cluster
Five node types:
Historical: serves historical data by loading segments from deep storage; it talks to the coordinator through ZooKeeper. On a request to load a new segment it first checks its local cache; on a miss it pulls the segment from deep storage based on the metadata, and once loading finishes it announces the segment in ZooKeeper, at which point the segment becomes queryable.
Broker: accepts queries, uses the segment locations published in ZooKeeper to route each query to the right nodes, then merges the partial results and returns them.
Coordinator: coordinates segment management; decides which segments should be loaded onto which historical nodes.
Indexing Service: made up of three components: peons, middle managers, and the overlord.
Tasks are submitted through the overlord's HTTP endpoint, and middle managers assign them to peons for execution.
Realtime: real-time ingestion nodes.
Other terms:
Tranquility: helps you send real-time event streams to Druid and handles partitioning, replication, service discovery, and schema rollover, seamlessly and without downtime.
Tranquility Server: an HTTP server; with it you can push data into Druid through an HTTP API instead of writing a Java program.
External dependencies:
Deep Storage: permanent backing store for segment files, e.g. HDFS or S3.
Metadata Storage: holds segment and task metadata, typically MySQL or PostgreSQL (Derby in single-machine setups).
ZooKeeper: used for internal coordination and service discovery.
II. Installation and deployment
1. References
http://druid.io/docs/0.9.1.1/tutorials/cluster.html (each service can be started with ./bin/xxx.sh start)
http://www.open-open.com/lib/view/open1447852962978.html
2. Download Druid, the MySQL metadata extension, and Tranquility
http://druid.io/downloads.html
3. Copy mysql-metadata-storage-0.12.0.tar.gz into the extensions directory, unpack it, and create the metadata database:
create database druid DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
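If Druid connects as a dedicated MySQL user, that user also needs privileges on the new database. A minimal sketch (the user name and password here are placeholders, not from the original setup):
CREATE USER 'druid'@'%' IDENTIFIED BY 'druid_password';
GRANT ALL PRIVILEGES ON druid.* TO 'druid'@'%';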
4. Configure conf/druid/_common/common.runtime.properties; a sketch of the relevant properties follows.
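A minimal sketch assuming the MySQL metadata store from step 3 and HDFS deep storage (host names, credentials, and paths are placeholders):

# load the MySQL metadata extension and the HDFS deep-storage extension
druid.extensions.loadList=["mysql-metadata-storage","druid-hdfs-storage"]
# ZooKeeper quorum used for coordination and service discovery
druid.zk.service.host=zk-host1:2181,zk-host2:2181,zk-host3:2181
# metadata storage: the database created in step 3
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://mysql-host:3306/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=druid_password
# deep storage on HDFS (uses the Hadoop configs copied in step 5)
druid.storage.type=hdfs
druid.storage.storageDirectory=/druid/segments
# indexing-task logs also go to HDFS
druid.indexer.logs.type=hdfs
druid.indexer.logs.directory=/druid/indexing-logs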
5. Copy the Hadoop configs (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml) into conf/druid/_common.
6. Start each component (alternatively, start them with sh bin/XXX.sh start):
One instance of each of these is enough:
java `cat conf/druid/coordinator/jvm.config | xargs` -cp "conf/druid/_common:conf/druid/coordinator:lib/*" io.druid.cli.Main server coordinator
java `cat conf/druid/overlord/jvm.config | xargs` -cp "conf/druid/_common:conf/druid/overlord:lib/*" io.druid.cli.Main server overlord
Multiple instances of these can be started as needed:
java `cat conf/druid/historical/jvm.config | xargs` -cp "conf/druid/_common:conf/druid/historical:lib/*" io.druid.cli.Main server historical
java `cat conf/druid/middleManager/jvm.config | xargs` -cp "conf/druid/_common:conf/druid/middleManager:lib/*" io.druid.cli.Main server middleManager
java `cat conf/druid/broker/jvm.config | xargs` -cp "conf/druid/_common:conf/druid/broker:lib/*" io.druid.cli.Main server broker
Problems encountered:
The historical node would not start because of its direct-memory limit.
Error at startup:
12) Not enough direct memory. Please adjust -XX:MaxDirectMemorySize, druid.processing.buffer.sizeBytes, druid.processing.numThreads, or druid.processing.numMergeBuffers: maxDirectMemory[2,147,483,648], memoryNeeded[5,368,709,120] = druid.processing.buffer.sizeBytes[536,870,912] * (druid.processing.numMergeBuffers[2] + druid.processing.numThreads[7] + 1)
Following the hint, raising MaxDirectMemorySize from 2 GB to 5 GB fixed it.
https://groups.google.com/forum/#!topic/druid-user/j0sFcUIiQiE
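The required amount follows directly from the formula in the error message: 536,870,912 bytes × (2 merge buffers + 7 threads + 1) = 5 GB. A sketch of conf/druid/historical/jvm.config with the limit raised (the heap settings shown are just stock defaults, not tuned values):

-server
-Xms8g
-Xmx8g
-XX:MaxDirectMemorySize=5g
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Djava.io.tmpdir=var/tmp
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager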
III. Overview of Druid data ingestion (files vs. streams)
The files path does not depend on Tranquility; see http://druid.io/docs/latest/tutorials/tutorial-batch.html
Stream ingestion works in one of two ways:
1. Tranquility (a Druid-aware client) plus the indexing service (push)
2. Realtime nodes (not recommended; they have several limitations: http://druid.io/docs/0.9.1.1/ingestion/stream-pull.html#limitations) (pull)
So overall there are three ingestion modes: stream push, stream pull, and batch ingestion.
Stream push works in two ways:
1) Push events through a Tranquility server's HTTP API (see the curl sketch after this list):
http://druid.io/docs/0.9.1.1/tutorials/tutorial-streams.html
2) Push events into Kafka and let Tranquility Kafka forward them to Druid.
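For way 1), once a Tranquility server is running, events are simply POSTed to it. A sketch based on the streams tutorial linked above (8200 is Tranquility's default HTTP port; the pageviews datasource and events.json payload are placeholders):
curl -XPOST -H 'Content-Type: application/json' --data-binary @events.json http://localhost:8200/v1/post/pageviews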
Stream pull:
Uses realtime nodes; see:
http://druid.io/docs/latest/ingestion/stream-pull.html
IV. Importing data into Druid
1. Importing files; see http://druid.io/docs/latest/tutorials/tutorial-batch.html
curl -X POST -H 'Content-Type:application/json' -d @./wikiticker-index.json host-170.bjyz:8090/druid/indexer/v1/task
Problems encountered:
1) The peon would not start, with this error:
3) Not enough direct memory. Please adjust -XX:MaxDirectMemorySize, druid.processing.buffer.sizeBytes, druid.processing.numThreads, or druid.processing.numMergeBuffers: maxDirectMemory[1,908,932,608], memoryNeeded[2,684,354,560] = druid.processing.buffer.sizeBytes[536,870,912] * (druid.processing.numMergeBuffers[2] + druid.processing.numThreads[2] + 1)
Fix: raise the peon's direct-memory limit in druid.indexer.runner.javaOpts (2560m = 512 MB × (2 merge buffers + 2 threads + 1)):
druid.indexer.runner.javaOpts=-server -Xmx2g -XX:MaxDirectMemorySize=2560m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
Task status can be checked at:
http://host-170.bjyz.baidu.com:8090/console.html (the overlord's port; shows the state of ingestion tasks)
There you can see that the middle manager has launched a peon to run the task:
Main internal peon var/druid/task/index_hadoop_wikiticker_2018-03-27T10:07:12.646Z/task.json var/druid/task/index_hadoop_wikiticker_2018-03-27T10:07:12.646Z/d93e84a0-d9f2-40e4-8b1a-0e24072a00f3/status.json
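The same status is also available from the overlord's REST API; a sketch using the task id shown above:
curl http://host-170.bjyz.baidu.com:8090/druid/indexer/v1/task/index_hadoop_wikiticker_2018-03-27T10:07:12.646Z/status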
2) By default the MapReduce jobs are submitted to the YARN cluster's default queue. To change the queue the peon's MapReduce jobs are submitted to, set the mapreduce.job.queuename parameter in the mapred-site.xml under conf/druid/_common (it can also go into the spec's jobProperties, as shown below).
3) When the YARN cluster's Java version differs from the one Druid depends on, point the MapReduce tasks at the right JDK:
conf.set("mapred.child.env", "JAVA_HOME=/home/iteblog/java/jdk1.8.0_25");
conf.set("yarn.app.mapreduce.am.env", "JAVA_HOME=/home/iteblog/java/jdk1.8.0_25");
Reference: https://www.iteblog.com/archives/1883.html
4) Hit the warning "JNDI lookup class is not available because this JRE does not support JNDI" (safe to ignore).
Tip 2 at http://druid.io/docs/latest/operations/other-hadoop.html resolves it properly.
5) The wikiticker-index.json spec used:
{
"type" : "index_hadoop",
"spec" : {
"ioConfig" : {
"type" : "hadoop",
"inputSpec" : {
"type" : "static",
"paths" : "/smallfile/druid/quickstart/wikiticker-2015-09-12-sampled.json.gz"
}
},
"dataSchema" : {
"dataSource" : "wikiticker",
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "day",
"queryGranularity" : "none",
"intervals" : ["2015-09-12/2015-09-13"]
},
"parser" : {
"type" : "hadoopyString",
"parseSpec" : {
"format" : "json",
"dimensionsSpec" : {
"dimensions" : [
"channel",
"cityName",
"comment",
"countryIsoCode",
"countryName",
"isAnonymous",
"isMinor",
"isNew",
"isRobot",
"isUnpatrolled",
"metroCode",
"namespace",
"page",
"regionIsoCode",
"regionName",
"user"
]
},
"timestampSpec" : {
"format" : "auto",
"column" : "time"
}
}
},
"metricsSpec" : [
{
"name" : "count",
"type" : "count"
},
{
"name" : "added",
"type" : "longSum",
"fieldName" : "added"
}, {
"name" : "deleted",
"type" : "longSum",
"fieldName" : "deleted"
},
{
"name" : "delta",
"type" : "longSum",
"fieldName" : "delta"
},
{
"name" : "user_unique",
"type" : "hyperUnique",
"fieldName" : "user"
}
]
},
"tuningConfig" : {
"type" : "hadoop",
"partitionsSpec" : { "type" : "hashed",
"targetPartitionSize" : 5000000
},
"jobProperties" : {
"mapreduce.map.java.opts":"-Duser.timezone=UTC -Dfile.encoding=UTF-8",
"mapreduce.reduce.java.opts":"-Duser.timezone=UTC -Dfile.encoding=UTF-8",
"mapred.child.env":"JAVA_HOME=/home/work/.jumbo/opt/sun-java8",
"yarn.app.mapreduce.am.env":"JAVA_HOME=/home/work/.jumbo/opt/sun-java8",
"mapreduce.job.queuename":"bigJob",
"mapreduce.job.classloader": "true"
}
}
  }
}
6) Verify that the task really imported the data.
Query through the broker for the number of unique users imported:
curl -X POST 'host-175.bjyz:8082/druid/v2/?pretty' -H 'Content-Type:application/json' -d @query/useruniq.json
Contents of useruniq.json:
{
"queryType": "timeseries",
"dataSource": "wikiticker",
"granularity": "day",
"aggregations": [
{ "type": "hyperUnique", "name": "user_unique", "fieldName": "user_unique" }
],
"intervals": [ "2015-09-12T00:00:00.000/2015-09-13T00:00:00.000" ],
"context" : {
"skipEmptyBuckets": "true"
}
}
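If the import succeeded, the broker returns a single daily bucket, roughly of this shape (the value shown is a placeholder; hyperUnique returns an approximate count as a double):
[ {
  "timestamp" : "2015-09-12T00:00:00.000Z",
  "result" : { "user_unique" : 12345.6789 }
} ]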
Query the imported segments' metadata through the broker:
curl -X POST 'host-175.bjyz:8082/druid/v2/?pretty' -H 'Content-Type:application/json' -d @query/metadata.json
{
"queryType":"segmentMetadata",
"dataSource":"wikiticker",
"intervals":["2015-09-12/2015-09-13"]
}
Or use the query file bundled with the quickstart to see the most-edited pages:
curl -X POST 'host-175.bjyz:8082/druid/v2/?pretty' -H 'Content-Type:application/json' -d @wikiticker-top-pages.json
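Another quick sanity check is to ask the coordinator which datasources it is serving; a sketch assuming the coordinator runs on host-170.bjyz at its default port 8081:
curl http://host-170.bjyz:8081/druid/coordinator/v1/datasources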
2. Stream push (see http://druid.io/docs/latest/ingestion/stream-push.html)
Stream push relies on Tranquility; for an introduction see https://github.com/druid-io/tranquility/blob/master/docs/overview.md
Tranquility can ingest data in several ways:
a) Tranquility Server (HTTP API)
b) Tranquility Kafka (you push data into Kafka and Tranquility writes it into Druid)
c) Write your own JVM app against the Tranquility library
See https://github.com/druid-io/tranquility/blob/master/docs/core.md
d) Use one of the stream connectors implemented in Tranquility, e.g. writing to Druid from Spark:
https://github.com/druid-io/tranquility/blob/master/docs/spark.md
Options a) and b) depend on additional third-party services; c) and d) depend only on the Tranquility library.
Writing Spark results into Druid will be covered in a separate article.
3. Stream pull
This requires realtime nodes and is generally not recommended, so it is not explored further here.
V. Druid & Caravel
Once the data is stored in Druid, we can visualize it with Caravel.
1) Add the Druid cluster
Just fill in the coordinator and broker addresses.
After saving the configuration, refresh the Druid metadata.
Then click into the datasource and start exploring. If no data shows up, check the time range of your data (the sample imported above is from 2015, so you need to select something like "4 years ago").
2) Building dashboards
In the datasource view, choose group-by dimensions and metrics, run the query, and save the result as a slice. The slice then appears on the slices tab; create a new dashboard on the dashboards page and add the saved slice to it.