
How SAS gets to data in Hadoop

2015-06-04 | Source: SAS Blog | Author: Rob Collum

SAS offers a rich collection of features for working with Hadoop quickly and efficiently. This post will provide a brief run-through of the various technologies used by SAS to get to data in Hadoop and what’s needed to get the job done.

Working with text files

Base SAS software has the built-in ability to communicate with Hadoop. For example, Base SAS can work directly with plain text files in HDFS using either PROC HADOOP or the FILENAME statement. For this to happen, you need:

  • Hadoop client JAR files (location specified in SAS_HADOOP_JAR_PATH)
  • Hadoop configuration in a “merged” XML file*
  • Base SAS PROC HADOOP or the FILENAME statement

* TAKE NOTE! The “merged” XML file is manually created by you! The idea is that you must take the relevant portions of the various Hadoop “-site.xml” files (such as hdfs-site.xml, hive-site.xml, yarn-site.xml, etc.) and concatenate the contents into one syntactically correct XML file.
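
As a rough sketch of both approaches (the HDFS paths, merged configuration file location, and user name below are placeholders), the FILENAME statement with the HADOOP access method can read a delimited text file straight out of HDFS, and PROC HADOOP can issue HDFS commands:

/* Point SAS at the Hadoop client JAR files */
options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoop/jars";

/* Read a comma-delimited file directly from HDFS; CFG= names the manually
   merged configuration XML described above (all paths are hypothetical) */
filename hdfsin hadoop '/user/sasdemo/sales.csv'
   cfg='/opt/sas/hadoop/merged-hadoop-config.xml'
   user='sasdemo';

data work.sales;
   infile hdfsin dlm=',' dsd firstobs=2;
   input region :$20. amount;
run;

/* PROC HADOOP issuing an HDFS command: copy a local file up to the cluster */
proc hadoop cfg='/opt/sas/hadoop/merged-hadoop-config.xml'
            username='sasdemo' verbose;
   hdfs copyfromlocal='/tmp/sales.csv' out='/user/sasdemo/sales.csv';
run;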

Working with SPDE data

The SAS Scalable Performance Data Engine (SPDE) functionality is also built into Base SAS. SAS can write SPDE tables directly to HDFS to take advantage of its multiple I/O paths for reading and writing data to disk. You need:

  • Hadoop client JAR files (location specified in SAS_HADOOP_JAR_PATH)
  • Hadoop configuration files (location specified in SAS_HADOOP_CONFIG_PATH)
  • Base SAS LIBNAME statement specifying the SPDE engine
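
A minimal sketch of an SPDE library on HDFS follows (the library name and HDFS path are hypothetical); HDFSHOST=DEFAULT tells the engine to use the configuration found via SAS_HADOOP_CONFIG_PATH:

/* SPDE library whose data and index partitions live in HDFS */
libname spdehdfs spde '/user/sasdemo/spde' hdfshost=default;

/* Write a table out as SPDE partitions in HDFS, then read it back */
data spdehdfs.cars;
   set sashelp.cars;
run;

proc contents data=spdehdfs.cars;
run;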

Working with data in SASHDAT

SASHDAT is a SAS proprietary data format optimized for high-performance environments. The software pieces required are:

  • Hadoop client JAR files (location specified in SAS_HADOOP_JAR_PATH)
  • Hadoop configuration files (location specified in SAS_HADOOP_CONFIG_PATH)
  • SAS High-Performance Analytics Environment (distributed mode)
  • Base SAS LIBNAME statement specifying the SASHDAT engine
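
A hedged sketch of a SASHDAT library (the grid host, TKGrid install location, and HDFS path are placeholders, and exact option names can vary by release):

/* SASHDAT library pointing at an HDFS path served by the
   SAS High-Performance Analytics Environment */
libname hdat sashdat path="/user/sasdemo/hdat"
   server="gridhead.example.com"
   install="/opt/TKGrid";

/* Distribute a table across the cluster in SASHDAT format */
data hdat.cars;
   set sashelp.cars;
run;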

Working with data in Hive

Hadoop supports Hive as its data warehouse infrastructure, and SAS/ACCESS technology can get to that data. You need:

  • Hadoop client JAR files (location specified in SAS_HADOOP_JAR_PATH)
  • Hadoop configuration files (location specified in SAS_HADOOP_CONFIG_PATH)
  • Base SAS and SAS/ACCESS Interface to Hadoop
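
A sketch of a SAS/ACCESS Interface to Hadoop connection (the server, port, schema, user, and table names are placeholders):

/* Hive connection through SAS/ACCESS Interface to Hadoop */
libname hdp hadoop server="hiveserver.example.com" port=10000
   schema=default user=sasdemo;

/* Query the Hive table like any other SAS library member; eligible
   SQL is implicitly passed down to Hive for processing */
proc sql;
   select count(*) as row_count
   from hdp.web_orders;
quit;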

Working with SAS In-Database Technology

The SAS In-Database technology brings the statistics to the data, a more efficient approach for working with very large data volumes. In particular, the SAS Embedded Process is deployed into the Hadoop cluster to work directly where the data resides, performing the requested analysis and returning the results.

SAS In-Database technology for Hadoop is constantly evolving and adding new features. With SAS 9.4, the SAS Embedded Process provides SQL pass-through, code and scoring acceleration capabilities, as well as support for SAS High-Performance procedures. To get started, you need:

  • Hadoop client JAR files (location specified in SAS_HADOOP_JAR_PATH)
  • Hadoop configuration files (location specified in SAS_HADOOP_CONFIG_PATH)
  • In-Database Deployment Package for Hadoop (SAS Embedded Process)
  • Base SAS
  • SAS/ACCESS Interface to Hadoop
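
As a rough illustration of two of those capabilities (the server, table, and column names are hypothetical, and the hdp libref is the SAS/ACCESS library sketched above), explicit SQL pass-through sends HiveQL straight to the cluster, while the DS2ACCEL option asks the SAS In-Database Code Accelerator to run a DS2 thread program on the data nodes through the Embedded Process:

/* Explicit SQL pass-through: the aggregation runs in Hive, not in SAS */
proc sql;
   connect to hadoop (server="hiveserver.example.com" port=10000 user=sasdemo);
   select * from connection to hadoop
      (select region, sum(amount) as total
       from web_orders
       group by region);
   disconnect from hadoop;
quit;

/* Code accelerator: request in-cluster execution of the DS2 thread program */
proc ds2 ds2accel=yes;
   thread work.score_th / overwrite=yes;
      method run();
         set hdp.web_orders;
         /* placeholder: scoring or transformation logic would go here */
      end;
   endthread;
   run;

   data hdp.scored_orders (overwrite=yes);
      dcl thread work.score_th t;
      method run();
         set from t;
      end;
   enddata;
   run;
quit;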

But that’s not all. The SAS Embedded Process is incredibly sophisticated and can offer something else: asymmetric parallel data load to the SAS High-Performance Analytics Environment. To enable that, you also need:

  • Remote deployment of distributed LASR (or other HPAE)
  • Data stored in Hadoop using Hive, Impala, or SPDE

Note that SPDE is on that last list. Without the Embedded Process, SAS can only stream data from SPDE tables serially, through the LASR root node. When the Embedded Process is available, it coordinates the direct transfer of data from each of the Hadoop data nodes to its counterpart LASR (or other HPAE) worker node; that is, it enables concurrent, parallel data streams between the two services!
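
As a sketch of what that load looks like from the SAS side (the host, install path, port, and table names are placeholders, and the hdp libref is the Hive library sketched earlier), the parallel transfer itself happens behind the scenes once the Embedded Process is deployed:

/* Start a distributed LASR Analytic Server (placeholder host and paths) */
proc lasr create port=10010 path="/tmp";
   performance host="lasrhead.example.com" install="/opt/TKGrid" nodes=all;
run;

/* Load a Hive table into LASR; with the Embedded Process in place, the data
   moves from the Hadoop data nodes to the LASR worker nodes in parallel */
proc lasr add data=hdp.web_orders port=10010;
   performance host="lasrhead.example.com" install="/opt/TKGrid" nodes=all;
run;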
