CMR: Rapid Data Acquisition for the Apache Spark Cluster Computing Platform
Absent the data, Spark is useless.
According to the InfoWorld article “The 80/20 data science dilemma”:
“…currently, most data scientists spend only 20 percent of their time on actual data analysis.”
This will come as a surprise to many, though probably not to the data scientists themselves.
And this is where we come in: we have developed the CMR Data Acquisition API for the Apache Spark Cluster Computing Platform so that data scientists who rely on Spark can use existing data sources in a way that feels natural in the Spark environment.
Integration is available for the following data sources:
|Data Provider / Client||Status||Availability (free / paid)|
|FRED Client / Federal Reserve Bank of St. Louis FRED web services||Partially working (Series, Categories, Observation)||Free, however an API key is required|
|World Bank / World Bank Client||Working||Free|
|TreasuryDirect.gov / Coherent Data Adapter: US Treasury Direct Client||Working||Free|
|OpenFIGI.com / Coherent Data Adapter: OpenFIGI Client Edition||Working||Free, however an API key is recommended|
|Quandl.com / Coherent Data Adapter: Quandl Client Edition||Not ready||Some data is free, some requires a subscription|
|CUSIP Global Services / Coherent Data Adapter: CUSIP Global Services Web Edition||Not ready||Subscription required|
Compare the following:
A data scientist using Spark wants to work with observation data for the S&P 500 from the Federal Reserve Bank of St. Louis. The data scientist can spend their time interfacing with the web services directly, or they can import a CSV file into Spark, a process that requires manual effort and, over time, adds up to a lot of time spent on something other than analysis.
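The manual CSV route might look roughly like the sketch below. It assumes the S&P 500 observations have already been downloaded by hand as a CSV file with a header row; the object name, the file path, and the local-mode session are illustrative assumptions, not part of the CMR API.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ManualCsvImport {
  // Reads a CSV file that has a header row and lets Spark infer column types.
  def loadObservations(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("manual-csv-import")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()

    // Hypothetical path to a file that was downloaded manually beforehand.
    val observations = loadObservations(spark, "/tmp/SP500.csv")
    observations.printSchema()
    observations.show(5)

    spark.stop()
  }
}
```

The code itself is short; the recurring cost is the download step outside of Spark, which has to be repeated every time fresh data is needed.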
A data scientist using Spark who wants to work with observation data for the S&P 500 from the Federal Reserve Bank of St. Louis simply invokes a single line, as follows:
val sp500ObservationsDS = cmr.fred.series.observations.withApiKey(key).withSeriesId("SP500").doGetAsObservationsDataset(spark)
The benefits of this approach are:
– Takes less than a minute to write
– Requires a few seconds to execute
– Can be easily repeated
– Makes it obvious what data is being loaded
– Keeps the data scientist focused on analysing the data.
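To illustrate why a fluent call chain keeps the loaded data obvious, here is a minimal sketch of how a builder in this style could translate its settings into a request against the FRED `series/observations` web service. `ObservationsQuery` and `toUrl` are hypothetical names introduced for illustration; this is not the CMR API's actual implementation, and it stops at building the URL rather than fetching data or producing a Spark Dataset.

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Hypothetical immutable builder: each with* call returns a modified copy.
final case class ObservationsQuery(
    apiKey: Option[String] = None,
    seriesId: Option[String] = None
) {
  def withApiKey(key: String): ObservationsQuery = copy(apiKey = Some(key))
  def withSeriesId(id: String): ObservationsQuery = copy(seriesId = Some(id))

  private def enc(s: String): String =
    URLEncoder.encode(s, StandardCharsets.UTF_8.name())

  // Builds the request URL, or reports which required setting is missing.
  def toUrl: Either[String, String] = (apiKey, seriesId) match {
    case (Some(key), Some(id)) =>
      Right(
        "https://api.stlouisfed.org/fred/series/observations" +
          s"?series_id=${enc(id)}&api_key=${enc(key)}&file_type=json"
      )
    case (None, _) => Left("missing API key")
    case (_, None) => Left("missing series id")
  }
}

object ObservationsQueryDemo {
  def main(args: Array[String]): Unit = {
    // "YOUR_KEY" is a placeholder, not a working FRED API key.
    val url = ObservationsQuery().withApiKey("YOUR_KEY").withSeriesId("SP500").toUrl
    println(url)
  }
}
```

Because every setting appears by name in the chain, the line that loads the data doubles as documentation of which series was requested.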
 The 80/20 data science dilemma