大话存储后传/次世代数据存储思维与技术 (Big Talk on Storage, the Sequel: Next-Generation Data Storage Thinking and Technology)
The book is divided into these major chapters: Flexible Data Layout; Application Awareness and Visualized Storage Intelligence; Storage Chips; Gleanings from the Sea of Storage; Clusters and Multiple Controllers; Traditional Storage Systems; Emerging Storage Systems; Big Talk on Optical Storage Systems; System Architecture; The I/O Protocol Stack and Performance Analysis; Storage Software; and Solid-State Storage. Each chapter contains multiple sections, and every section is a self-contained topic.
冬瓜哥's pursuit of technology has reached the point of obsession. Compared with ten years ago, his writing and analysis are more incisive and his technical understanding more precise. Every article on his WeChat official account is a bellwether for the storage industry.
冬瓜哥 (Zhang Dong) is currently a system architect at a semiconductor company and the author of the 《大话存储》 (Big Talk on Storage) book series. He is a technical expert and evangelist in the storage field.
Chapter 1 Flexible Data Layout
1.1 Raid1.0 and Raid1.5
1.2 Raid5EE and Raid2.0
1.3 Lun2.0/SmartMotion
Chapter 2 Application Awareness and Visualized Storage Intelligence
2.1 Application-aware, fine-grained automatic storage tiering
2.2 Application-aware, fine-grained SmartMotion
2.3 Application-aware, fine-grained QoS
2.4 Productization and visual presentation
2.5 Packaging up concepts and making PPTs
2.6 A review of Inspur's "active" storage concept
Chapter 3 Storage Chips
3.1 Channel and Raid controller architecture
3.2 SAS Expander architecture
Chapter 4 Gleanings from the Sea of Storage
4.1 Two classy storage devices you would never think of
4.2 What is inside a JBOD
4.3 The woes of the Raid4 parity disk
4.4 Why a Raid card is a little computer
4.5 Why the Raid card battery was replaced by a supercapacitor
4.6 What exactly is the difference between firmware and microcode
4.7 Is the inside of an FC hub really a loop
4.8 Why SAS and FC cost the CPU less than TCP/IP plus Ethernet
4.9 What traffic runs over the heartbeat links between dual-controller storage
Chapter 5 Clusters and Multiple Controllers
5.1 A brief talk on active-active and multipathing
5.2 A "brief" talk on disaster recovery and active-active data centers (Part 1)
5.3 A "brief" talk on disaster recovery and active-active data centers (Part 2)
5.4 An in-depth illustrated review of the evolution of cluster file system architecture
5.5 From multi-controller cache management to cluster locks
5.6 On shared versus distributed architectures
5.7 "冬瓜哥 draws a PPT": active-active is a trap
Chapter 6 Traditional Storage Systems
6.1 A few basic topics around storage systems
6.2 A chronicle of the high-end storage arena!
6.3 Astonishing! So this is how high-end storage architecture evolved!
6.4 Centralized, external data cache in traditional high-end storage: three birds with one stone
6.5 Traditional external storage is nearing dusk
6.6 Old guard versus young blood in the storage circle
6.7 Traditional storage is aging; can emerging storage shoulder the load?
Chapter 7 Next-Generation Storage Systems
7.1 An old gun still plays the next-generation storage game
7.2 The next-generation storage system with the most traditional-storage flavor
7.3 The next-generation storage system best suited to large-scale data centers
7.4 The highest-performance next-generation storage system
7.5 The most application-aware next-generation storage system
7.6 The next-generation storage system with the most flexible data management
Chapter 8 Optical Storage Systems
8.1 Basic principles of optical storage
8.2 The mysterious optical pickup head and Blu-ray technology
8.3 Dissecting a Blu-ray storage system
8.4 The optical storage ecosystem
8.5 Looking at the present from the future
Chapter 9 System Architecture
9.1 Big talk on many-core processor architecture
9.2 A tribute to Loongson! 冬瓜哥 hand-designed a CPU decoder!
9.3 The NUNA architecture lands for the first time in the InCloudRack cabinet
9.4 A review of 宏杉科技's CloudSAN architecture
9.5 Memory can be used like this?!
9.6 PCIe switching, what the heck?
9.7 A chat about FPGA/GPCPU/PCIe/Cache-Coherency
9.8 [Primer] How does a supercomputer actually compute?
Chapter 10 The I/O Protocol Stack and Performance Analysis
10.1 The most complete summary of storage system interfaces, protocols, and connection methods
10.2 Research trends in cutting-edge I/O protocol stack technology
10.3 What is the right Stripe Size for a Raid group?
10.4 Concurrent I/O: the root of system performance!
10.5 How long have you been misled about I/O latency?
10.6 How do you measure the concurrency along the whole I/O path?
10.7 What exactly is the relationship among queue depth, latency, concurrency, and throughput
10.8 Why does Raid give no speedup at all in some scenarios?
10.9 Why is performance excellent in testing but dismal in production?
10.10 What is the impact of too shallow a queue depth?
10.11 What queue depth setting is ideal?
10.12 Why does a mechanical disk's average random I/O latency drop transiently?
10.13 How exactly does data layout affect performance?
10.14 Misconceptions about synchronous I/O and blocking I/O
10.15 Atomic writes, what the heck?!
10.16 Why not build a USB Target?
10.17 A new storage technology patent of 冬瓜哥's has been officially granted
10.18 A quick walk through the iSCSI lower layers
10.19 A brief analysis of FC's four Login procedures
Chapter 11 Storage Software
11.1 Thin provisioning is a trap; whoever uses it is asking for it!
11.2 The evolution of storage system OSes
Chapter 12 Solid-State Storage
12.1 A brief look at how solid-state media are used in storage systems
12.2 Misconceptions about SSD metadata and power-loss protection
12.3 Misconceptions about host-based versus device-based flash FTL
12.4 On SSD HMB and CMB
12.5 同有科技 returns with wings spread
12.6 Crosstalk with Lao Tang: SSD performance testing, the "jade" part
12.7 How exactly should Raid be built on solid-state drives?
12.8 When Raid2.0 meets all-flash storage
12.9 Upper/lower pages, fast/slow pages, MSB/LSB: what are all these?
12.10 The steps involved when writing 0 to MSB/LSB
1.1 Raid1.0 and Raid1.5
In the era of mechanical disks, only two fundamental factors determine final I/O performance. One is the source at the top: how the application issues its I/O calls and what properties those I/Os have. The other is the source at the bottom: in what form and state the data ends up laid out, and across how many mechanical disks. How the application issues I/O is entirely outside the storage system's control, so from that end there is little a storage system can do about performance. But how the data is organized and arranged is absolutely the storage system's most important job.

That job has been evolving continuously ever since Raid was born. Take the simplest example, the progression from Raid3 to Raid4 to Raid5. Raid3 was designed to maximize the throughput of single-threaded, large-block, sequential I/O. To achieve this, its stripes were made extremely narrow, so narrow that the target address of almost every I/O issued from above landed on all of the disks. Nearly every I/O therefore had multiple disks reading or writing in parallel on its behalf, while the other I/Os had to wait. That is why we say that under Raid3 the upper layer's I/Os cannot run concurrently with one another, although a single I/O can be served concurrently by multiple disks. So if the system contains only one thread (or one user, program, or workload), and that thread issues large-block sequential I/O in pursuit of throughput, Raid3 is a very good fit. Most workloads are not like that, though; what they want is for the upper layer's I/Os to execute in parallel as fully as possible, for example I/Os issued by multiple threads or multiple users being served concurrently. That calls for enlarging the stripe to a suitable value, so that the address range of one I/O does not drag every disk in the Raid group into serving it. There is then a certain probability that one set of disks can serve several I/Os at the same time, and the more disks there are, the higher that probability. Raid4 is essentially Raid3 with an adjustable stripe, but its dedicated parity disk not only becomes a hot spot with a high failure rate, it also throttles I/Os that could otherwise have run concurrently: every I/O must update the parity block of its stripe, and since all parity blocks live on that single disk, the upper layer's I/Os can only execute one after another, never concurrently. Raid5 scatters the parity blocks across all disks in the Raid group and thereby achieves concurrent I/O. Most storage vendors expose a stripe-width setting, typically from 32KB to 128KB. Suppose an I/O request reads 16KB from a Raid5 group built on 8 disks. With a 32KB stripe, the segment on each disk is 4KB, so this I/O occupies at least 4 disks; assuming a 100% concurrency probability, the Raid group can serve two 16KB I/Os concurrently, or eight 4KB I/Os. Raise the stripe width to 128KB and, under the same 100% concurrency assumption, it can serve eight I/Os of 16KB or less concurrently. (A short calculation sketch at the end of this excerpt reproduces this arithmetic.)

By this point we can see that merely tuning the stripe width and optimizing the layout of the parity blocks already yields wildly different performance. Yet no matter how much we fiddle, I/O performance remains confined to the pitifully few disks in a Raid group, a handful or a dozen or so. Why only a handful or a dozen? Couldn't we build 100 disks into one big Raid5 group and then create every logical volume on top of it to boost each volume's performance? You would not choose to do that. The moment one disk fails and the system has to rebuild, you will regret the decision, because you will find the whole system's performance collapsing and no logical volume doing well: 99 disks are all reading data out at full speed while the system computes the xor and writes the reconstructed blocks to the hot spare. You can, of course, throttle the rebuild to relieve the online workloads' I/O, but the price is a longer rebuild, and if another disk fails within the rebuild window, all of the data is gone. So the failure domain has to be kept small, which is why a Raid group is best limited to a handful or a dozen or so disks. That is awkward, so people came up with a workaround: concatenate several small Raid5/6 groups into a large Raid0, that is, Raid50/60, and distribute the logical volumes across it. Today's storage vendors, having run out of tricks and unable to come up with anything genuinely new, like to package this big Raid50/60 as a "Pool", which misleads some people into believing that storage is innovating again and still bursting with vitality.

Here 冬瓜哥 might as well go along with the game and do a little coining of his own: call the traditional Raid group Raid1.0, and call Raid50/60 Raid1.5. There is a cyclically ascending pattern to be sensed here. In the early days disk counts were small, and tuning performance for different scenarios relied mainly on stripe width; later people figured it out: why not use Raid50 and spread the data directly across a few hundred disks? Wouldn't that be great? The upper layer's concurrent thread I/Os could then achieve massive concurrency at the bottom and reach extremely high throughput. At that point, people were carried away by success, and nobody thought about another frightening problem. As these words are being put to paper, still nobody has, at least judging from the vendors' product moves. The reason is probably another round of evolution at the bottom: solid-state media. The wheels at the bottom keep accelerating while the forms at the top come around in cycles, but sometimes the top leaps forward and skips a form that ought to have appeared, a form that is fleeting or never shows up at all; yet someone will always strike a spark, however faint. In truth, this frightening problem has been overshadowed by an even more frightening one: rebuild times have grown far too long. Take a 4TB SATA disk. Even writing at full speed during a rebuild, its rotational speed caps its throughput at roughly 80MB/s; do the math and the rebuild takes 58 hours. In practice, to preserve the performance of online workloads, rebuilds are usually limited to a medium speed of about 40MB/s, which takes 116 hours, five days and five nights. I would bet that no system administrator sleeps well during such a week.

1.2 Raid5EE and Raid2.0

Twenty years ago someone invented a technique called Raid5EE, with two goals: first, to put the normally idle hot-spare disk to work; second, to speed up rebuilds. Clearly, if the hot-spare space marked "H (hot spare)" in the figure below is scattered across all disks, just as the parity disk was, the layout becomes the one shown on the right of the figure, with every P block followed by an H block. The Raid group thereby gains one more disk of working capacity than before. Moreover, since the H space is also scattered, the rebuild after a disk failure ought to be faster, because multiple disks can now absorb the writes in parallel. In reality it is not. The system's rebuild speed was never limited by that single hot-spare disk; it is limited by all of the disks together, because the hot spare can only write the reconstructed data at full rate if every other disk is reading its data out at full rate and the system is xor-ing the results. Scattering the hot-spare space, or even replacing the hot spare with an SSD or with RAM, changes nothing about the outcome. So how can a rebuild actually be sped up? The only way is what the next figure shows: take the stripes that used to be squeezed onto 5 disks and scatter them horizontally. Note that the scattering must be done at stripe granularity; scattering a single disk is useless. Only then does the rebuild speed improve by multiples.
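The rebuild argument in Section 1.2 can also be put into a rough back-of-the-envelope model. The sketch below is a simplification of my own, not taken from the book: it assumes every disk streams reads and writes at the same rate, treats the xor computation as free, and counts only how much data the busiest surviving disk has to move, which is the bottleneck the text points at. The function name, the disk counts, and the layout labels are all hypothetical.

```python
# Toy model of the rebuild bottleneck discussed in Section 1.2 (my own
# simplification): rebuild time is taken to be proportional to the amount
# of data the busiest surviving disk must move.
def busiest_disk_load_gb(lost_gb, stripe_disks, pool_disks, layout):
    """lost_gb: data on the failed disk; stripe_disks: width of one stripe;
    pool_disks: how many disks the stripes are spread across."""
    if layout in ("dedicated_spare", "raid5ee"):
        # Whether the spare space is one dedicated disk or is distributed
        # (Raid5EE), every surviving group disk still has to read roughly a
        # full disk's worth of data so the lost blocks can be xor-ed back,
        # so the busiest disk moves about lost_gb either way.
        return lost_gb
    if layout == "stripe_scattered":
        # Raid2.0-style layout: each lost segment needs (stripe_disks - 1)
        # reads plus 1 write, but that work is spread over the whole pool,
        # so each surviving disk moves only a small share.
        return lost_gb * stripe_disks / (pool_disks - 1)
    raise ValueError(f"unknown layout: {layout}")

slow = busiest_disk_load_gb(4000, stripe_disks=9, pool_disks=9,
                            layout="raid5ee")
fast = busiest_disk_load_gb(4000, stripe_disks=9, pool_disks=100,
                            layout="stripe_scattered")
print(f"per-disk load: {slow:.0f} GB vs {fast:.0f} GB, "
      f"roughly {slow / fast:.0f}x faster rebuild")
```

Under this model, scattering only the hot-spare space (or swapping the spare for an SSD) leaves the per-disk load unchanged, while scattering whole stripes across a large pool divides it by roughly the pool size, which is the "improve by multiples" the section ends on.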
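Going back to the stripe-width example in Section 1.1, its arithmetic can be reproduced with a minimal Python sketch. This is an illustration only (the function name is made up here), and it adopts the same simplifications as the text: aligned I/O, the parity segment ignored, and the "100% concurrency probability" assumption that concurrent I/Os never land on the same disk.

```python
import math

# Reproduce the 8-disk Raid5 example from Section 1.1: how many disks one
# I/O touches, and how many such I/Os the group can serve at once in the
# best case.
def raid5_concurrency(num_disks, stripe_kb, io_kb):
    segment_kb = stripe_kb / num_disks            # per-disk segment size
    disks_per_io = math.ceil(io_kb / segment_kb)  # disks one I/O occupies
    max_concurrent_ios = num_disks // disks_per_io
    return segment_kb, disks_per_io, max_concurrent_ios

print(raid5_concurrency(8, 32, 16))   # (4.0, 4, 2): two 16KB I/Os at once
print(raid5_concurrency(8, 32, 4))    # (4.0, 1, 8): eight 4KB I/Os at once
print(raid5_concurrency(8, 128, 16))  # (16.0, 1, 8): eight 16KB I/Os at once
```

The trade-off the section describes falls straight out of the numbers: a narrow stripe spreads one I/O over many disks, while a wide stripe lets many independent I/Os proceed side by side.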