Without data, Apache Spark is pretty much useless.

According to the InfoWorld article “The 80/20 data science dilemma” [1],

…currently, most data scientists spend only 20 percent of their time on actual data analysis.

This will come as a surprise to many, though probably not to the data scientists themselves.

And this is where we come in: we have developed the CMR Data Acquisition API for the Apache Spark Cluster Computing Platform so that data scientists who rely on Spark can leverage existing data sources in a way that feels natural in the Spark environment.

Compare the following:

Scenario One

A data scientist, using Spark, wants to work with observation data for the S&P 500 from the Federal Reserve Bank of St. Louis. The data scientist can spend their time interfacing with the web services directly, or they can import a CSV file into Spark. Either way the process requires manual effort and, over time, adds up to a lot of time spent on work other than data analysis.
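To make the manual route concrete, the CSV import in Scenario One might look roughly like the sketch below. It uses only the standard Spark DataFrameReader; the temporary file, its column names (DATE, SP500) and its values are made-up placeholders standing in for a file the data scientist would first have to download by hand.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for illustration only; inside spark-shell a `spark` session already exists.
val spark = SparkSession.builder()
  .appName("manual-csv-import")
  .master("local[*]")
  .getOrCreate()

// Stand-in for a manually downloaded file.
// The column names and values below are placeholders, not real observations.
val csvPath = Files.createTempFile("sp500-sample", ".csv")
Files.write(csvPath, "DATE,SP500\n2000-01-03,100.0\n2000-01-04,101.5\n".getBytes("UTF-8"))

// The manual step: read the CSV and let Spark infer the column types.
val sp500DF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(csvPath.toString)

sp500DF.show()
```

Every refresh of the data means repeating the download and re-running the import, which is the overhead this scenario describes.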

Scenario Two

A data scientist, using Spark, who wants to work with observation data for the S&P 500 from the Federal Reserve Bank of St. Louis simply invokes a single line, as follows:

val sp500ObservationsDS = cmr.fred.series.observations.withApiKey(key).withSeriesId("SP500").doGetAsObservationsDataset(spark)

The benefits of this approach are:
– Takes less than a minute to write
– Requires a few seconds to execute
– Can be easily repeated
– Makes it obvious what data is being loaded
– Keeps the data scientist focused on analysing data

Maven Coordinates

com.coherentlogic.cmr.api:cmr-api-core:2.0.3-RELEASE
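For projects built with sbt rather than launched from the Spark shell, the coordinates above would translate into a dependency line along these lines. This is a sketch: it assumes the artifact resolves from a repository your build already uses, and a real build may also need exclusions similar to those passed to spark-shell.

```scala
// build.sbt (sketch): group, artifact and version taken from the Maven coordinates above.
// A single % is used because the coordinates carry no Scala version suffix.
libraryDependencies += "com.coherentlogic.cmr.api" % "cmr-api-core" % "2.0.3-RELEASE"
```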

Endpoint Integration Available

Integration is available for the following data sources:

Data Provider | Status | Availability (free / payware)
FRED Client / Federal Reserve Bank of St. Louis FRED web services | Partially working (Series, Categories, Observation) | Free, however an API key is required
World Bank / World Bank Client | Working | Free
TreasuryDirect.gov / Coherent Data Adapter: US Treasury Direct Client | Working | Free
OpenFIGI.com / Coherent Data Adapter: OpenFIGI Client Edition | Working | Free, however an API key is recommended
Quandl.com / Coherent Data Adapter: Quandl Client Edition | Working | Some data is free, some requires a subscription
CUSIP Global Services / Coherent Data Adapter: CUSIP Global Services Web Edition | Not ready | Subscription required
Others TBD | TBD | TBD

Tutorials

2.0.3-RELEASE Example 1.) US TreasuryDirect.gov

2.0.3-RELEASE Example 2.) Quandl

2.0.2.1-RELEASE 1.) Example 1
2.0.2.1-RELEASE 2.) Example 2

Starting CMR in the Spark Shell

For 2.0.3-RELEASE, start spark-shell, for example, as follows:

[hadoop@ip-redacted ~]$ spark-shell --packages "com.coherentlogic.cmr.api:cmr-api-core:2.0.3-RELEASE,dom4j:dom4j:1.6.1,com.fasterxml.jackson.core:jackson-databind:2.9.6" --exclude-packages "junit:junit,org.jboss.spec.javax.ws.rs:jboss-jaxrs-api_2.1_spec,org.jboss.spec.javax.servlet:jboss-servlet-api_3.1_spec,org.jboss.spec.javax.annotation:jboss-annotations-api_1.2_spec,org.jboss.logging:jboss-logging-annotations,org.jboss.logging:jboss-logging-processor,org.jboss.spec.javax.xml.bind:jboss-jaxb-api_2.3_spec,org.reactivestreams:reactive-streams,javax.activation:activation,net.jcip:jcip-annotations,javax.validation:validation-api,javax.json.bind:javax.json.bind-api,javax.ws.rs:javax.ws.rs-api"

References

[1] The 80/20 data science dilemma
