CDA: Trifacta Simplifies Data Wrangling as a Service
2016-02-07



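Trifacta, a platform that provides data analysis services, recently received venture capital funding to advance its work on making data wrangling easier for data analysts. Its goal is to collect, clean, and transform data faster and more easily than is possible today.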



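Data wrangling has long been the most time-consuming and painful part of every big data project. Today, data is fluid and heterogeneous, and its attributes keep changing as the data sources evolve. NoSQL databases have tried to address this on the storage side, with column-oriented or document-oriented stores, but the problem of collecting the data and applying semantics to it remains.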

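Trifacta approaches the problem from the user's perspective rather than the programmer's. Business analysts and data scientists will be able to clean datasets visually. Building on research at Berkeley and Stanford, the platform aims to have people and machines work together to extract insights from datasets.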

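Automated sampling from large datasets, combined with visualization, lets analysts discover interesting patterns in very little time. Trifacta can then apply machine learning algorithms to suggest ways to reorganize and tidy the information. The analyst can group the dataset into logical parts, normalize them one step at a time, and see the result in a friendly interface along the way. Generalizing to the whole dataset is the last step, which brings the semi-structured dataset into its final shape. The platform was designed from the ground up with the user experience in mind, so that data analysts can focus on working with the data without having to develop complex pipelines to clean it and load it into a data warehouse.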

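Trifacta's predecessor project, DataWrangler, and the related research paper are available online and show how Trifacta is likely to work. Because Trifacta is still in closed beta, demos can only be scheduled by invitation.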

Trifacta Seeks to Simplify Data Wrangling-as-a-Service

Trifacta, a data analysis services platform, recently received venture capital investment to advance its efforts to make data wrangling easier for data analysts. The goal is to collect, cleanse, and munge data in a fraction of the time and effort it currently takes.

Data wrangling has traditionally been the most time-consuming and painful part of every Big Data project. In our era, data is flowing and heterogeneous, and its attributes change constantly as data sources evolve. NoSQL databases have long tried to address this on the storage side by being column-based or document-based, but the problem remains of collecting the data and applying semantics to it.

Trifacta is approaching the problem from a user-centric perspective instead of a developer-centric one. Business analysts and data scientists will be able to cleanse datasets in a visually oriented way. Based on research at Berkeley and Stanford, the platform aims to make employees and machines collaborate in extracting insights from datasets.

Automated smart sampling from big datasets, together with visualization, allows the analyst to discover interesting patterns in a fraction of the time. Trifacta can then apply machine learning algorithms to suggest ways to reorganize information and get it into shape. The analyst can group the dataset into logical parts of information, normalizing it one step at a time and viewing the outcome in a user-friendly way along the course of the work. Generalizing to the whole dataset is the last step, which brings the semi-structured dataset into shape. The platform is designed from the ground up with user experience in mind, to allow data analysts to sift in depth through data without the need to develop complex pipelines to cleanse the data and bring it into the data warehouse.
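
The workflow described above (sample the data, refine a transformation one step at a time on the sample, then generalize it to the whole dataset) can be illustrated with a minimal Python sketch. This is not Trifacta's or DataWrangler's implementation or API: the record format, RAW_ROWS, and every function name are hypothetical, and the hand-written steps merely stand in for suggestions that such a tool might propose.

```python
import random
import re

# Hypothetical semi-structured input: free-form "name, phone" records.
RAW_ROWS = [
    "Alice Smith , (415) 555-0101",
    "bob jones,415.555.0102",
    "  Carol  White, 415-555-0103 ",
    "dave brown ,4155550104",
]


def sample_rows(rows, k, seed=0):
    """Draw a reproducible random sample for interactive inspection."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


def split_fields(row):
    """Step 1: split the raw line into named fields at the first comma."""
    name, phone = row.split(",", 1)
    return {"name": name, "phone": phone}


def normalize_name(record):
    """Step 2: collapse extra whitespace and capitalize each word."""
    words = record["name"].split()
    return {**record, "name": " ".join(w.capitalize() for w in words)}


def normalize_phone(record):
    """Step 3: keep digits only and format as NNN-NNN-NNNN (assumes 10 digits)."""
    digits = re.sub(r"\D", "", record["phone"])
    return {**record, "phone": f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"}


# The ordered steps an analyst settled on while looking at the sample.
STEPS = [split_fields, normalize_name, normalize_phone]


def apply_steps(row, steps=STEPS):
    """Run one raw row through every step in order."""
    result = row
    for step in steps:
        result = step(result)
    return result


def wrangle(rows):
    """Generalize: apply the steps refined on the sample to the whole dataset."""
    return [apply_steps(row) for row in rows]


if __name__ == "__main__":
    # Preview the steps on a small sample first, then run them on everything.
    for row in sample_rows(RAW_ROWS, 2):
        print("preview:", apply_steps(row))
    for record in wrangle(RAW_ROWS):
        print(record)
```

Keeping the steps as an ordered list of small functions mirrors the "one step at a time" idea: each step can be inspected on the sample before the full list is applied to every row.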

Trifacta's predecessor research project, DataWrangler, and the accompanying research paper are available online and give a sneak preview of where Trifacta is heading. Trifacta itself is still in closed beta, and demos are scheduled by invitation only.

