CMR

Download | Bitbucket (see here for examples)

CMR: Rapid Data Acquisition for the Apache Spark Cluster Computing Platform

Absent the data, Spark is useless.

According to the InfoWorld article “The 80/20 data science dilemma” [1],

…currently, most data scientists spend only 20 percent of their time on actual data analysis.

This will come as a surprise to many, though probably not to the data scientists themselves.

And this is where we come in: we have developed the CMR Data Acquisition API so that data scientists who rely on the Apache Spark Cluster Computing Platform can use existing data sources in a way that feels natural in the Spark environment.

Integration is available for the following data sources:

Data Provider | Status | Availability (free / payware)
FRED Client / Federal Reserve Bank of St. Louis FRED web services | Partially working (Series, Categories, Observation) | Free, however an API key is required
World Bank / World Bank Client | Working | Free
TreasuryDirect.gov / Coherent Data Adapter: US Treasury Direct Client | Working | Free
OpenFIGI.com / Coherent Data Adapter: OpenFIGI Client Edition | Working | Free, however an API key is recommended
Quandl.com / Coherent Data Adapter: Quandl Client Edition | Not ready | Some data is free, some requires a subscription
CUSIP Global Services / Coherent Data Adapter: CUSIP Global Services Web Edition | Not ready | Subscription required
Others TBD | TBD | TBD

Compare the following:

Scenario one:
A data scientist, using Spark, wants to work with observation data for the S&P 500 from the Federal Reserve Bank of St. Louis. The data scientist can either interface with the web services directly or import a CSV file into Spark. Both options require manual effort and, over time, add up to a lot of time spent on tasks other than analysis.
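
To make the manual effort concrete, the CSV route might look roughly like the sketch below. It uses only the standard Spark CSV reader. The file path, the column names (DATE, SP500) and the use of "." as a missing-value marker are assumptions about a hand-downloaded file, not something CMR or FRED guarantees, so adjust them to the file you actually have.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ManualFredCsvImport {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("manual-fred-csv-import")
      .master("local[*]") // local mode, for experimentation only
      .getOrCreate()

    // Hypothetical path to a CSV file downloaded by hand beforehand.
    val csvPath = "/tmp/SP500.csv"

    // Assumed layout: a header row, then one (DATE, SP500) row per observation.
    // Assumed convention: "." marks a missing value, so it is read as null.
    val raw: DataFrame = spark.read
      .option("header", "true")
      .option("nullValue", ".")
      .csv(csvPath)

    // Every column arrives as a string, so cast each one explicitly
    // and drop the rows that have no observation.
    val observations: DataFrame = raw
      .select(
        col("DATE").cast("date").as("date"),
        col("SP500").cast("double").as("value"))
      .na.drop()

    observations.show(5)
    spark.stop()
  }
}
```

Each step here (download, path handling, column casting, missing-value cleanup) has to be repeated or maintained whenever the data is refreshed.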

Scenario two:

A data scientist, using Spark, who wants to work with observation data for the S&P 500 from the Federal Reserve Bank of St. Louis simply invokes a single line as follows:

val sp500ObservationsDS = cmr.fred.series.observations.withApiKey(key).withSeriesId("SP500").doGetAsObservationsDataset(spark)

The benefits to this approach are:
– Takes less than a minute to write
– Requires a few seconds to execute
– Can be easily repeated
– Makes it obvious which data is being loaded
– Keeps the data scientist focused on analysing data.
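
The one-line call in Scenario two assumes that `key` and `spark` are already in scope. A minimal sketch of that setup follows. The environment variable name FRED_API_KEY and the application name are assumptions made for illustration, and the CMR call itself is left as a comment because its import path is not given here.

```scala
import org.apache.spark.sql.SparkSession

object Sp500Setup {
  def main(args: Array[String]): Unit = {
    // Read the FRED API key from an environment variable.
    // The variable name is an assumption, not something CMR prescribes.
    val key: String = sys.env.getOrElse(
      "FRED_API_KEY",
      sys.error("Set FRED_API_KEY before running"))

    val spark = SparkSession.builder()
      .appName("cmr-sp500-example")
      .master("local[*]") // local mode, for experimentation only
      .getOrCreate()

    // With `key` and `spark` defined, the single line from Scenario two goes here:
    // val sp500ObservationsDS = cmr.fred.series.observations.withApiKey(key).withSeriesId("SP500").doGetAsObservationsDataset(spark)

    spark.stop()
  }
}
```

Keeping the key outside the source code means the same program can be rerun or shared without exposing credentials.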

References:
[1] The 80/20 data science dilemma

Summary
Software Name
CMR: Data Acquisition API for the Spark Cluster Computing Platform
Operating System
Agnostic (Java and Scala)
Software Category
Middleware
Price
USD 120