Acquire Data in Apache Spark | Coherent Logic @ McLean, VA USA

Without data, Apache Spark is pretty much useless.

According to the InfoWorld article “The 80/20 data science dilemma” [1],

“…currently, most data scientists spend only 20 percent of their time on actual data analysis.”

This will come as a surprise to many, though probably not to the data scientists themselves.

This is where we come in. We have developed the CMR Data Acquisition API for the Apache Spark Cluster Computing Platform so that data scientists who rely on Spark can leverage existing data sources in a way that feels natural in the Spark environment.

Compare the following:

Scenario One

A data scientist, using Spark, wants to work with observation data for the S&P 500 from the Federal Reserve Bank of St. Louis. The data scientist can spend their time interfacing with the web services directly, or they can import a CSV file into Spark, a process that requires manual effort and that, over time, adds up to a lot of time spent on tasks other than analysis.
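
For comparison, the manual CSV route might look roughly like the sketch below. It is only an illustration: the function name loadObservationsCsv, the column names "date" and "value", and the use of "." as a missing-value marker are assumptions about the downloaded file, not something this page specifies.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DateType, DoubleType}

// Hypothetical helper: load a hand-downloaded observations CSV into a DataFrame.
// Assumed file layout: a header row "date,value" followed by one row per observation.
def loadObservationsCsv(spark: SparkSession, path: String): DataFrame =
  spark.read
    .option("header", "true")   // first line holds the column names
    .option("nullValue", ".")   // assumption: "." marks a missing observation
    .csv(path)
    .select(
      col("date").cast(DateType).as("date"),
      col("value").cast(DoubleType).as("value")
    )
    .na.drop(Seq("value"))      // discard rows that have no observed value
```

In the spark-shell, where a SparkSession named spark already exists, this could be called as loadObservationsCsv(spark, "/path/to/SP500.csv") with a placeholder path. The file still has to be downloaded and refreshed by hand each time, which is the manual effort described above.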

Scenario Two

A data scientist, using Spark, who wants to work with observation data for the S&P 500 from the Federal Reserve Bank of St. Louis simply invokes a single line as follows:

val sp500ObservationsDS = cmr.fred.series.observations.withApiKey(key).withSeriesId("SP500").doGetAsObservationsDataset(spark)

The benefits of this approach are:
– It takes less than a minute to write
– It requires a few seconds to execute
– It can be easily repeated
– It makes obvious what data is being loaded
– It keeps the data scientist focused on analysing data.

Maven Coordinates

com.coherentlogic.cmr.api:cmr-api-core:2.0.3-RELEASE
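
In an sbt build, the coordinates above would translate to something like the following line. This is a sketch that assumes the artifact can be resolved from the repositories the build already uses. It uses a single % because the coordinates carry no Scala version suffix. Exclusions similar to those in the spark-shell command further down may also be needed.

```scala
// build.sbt (fragment): the Maven coordinates above expressed as an sbt dependency
libraryDependencies += "com.coherentlogic.cmr.api" % "cmr-api-core" % "2.0.3-RELEASE"
```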

Endpoint Integration Available

Integration is available for the following data sources:

Data Provider	Status	Availability (free / payware)
FRED Client / Federal Reserve Bank of St. Louis FRED web services	Partially working (Series, Categories, Observation)	Free, however an API key is required
World Bank / World Bank Client	Working	Free
TreasuryDirect.gov / Coherent Data Adapter: US Treasury Direct Client	Working	Free
OpenFIGI.com / Coherent Data Adapter: OpenFIGI Client Edition	Working	Free, however an API key is recommended
Quandl.com / Coherent Data Adapter: Quandl Client Edition	Working	Some data is free, some requires a subscription
CUSIP Global Services / Coherent Data Adapter: CUSIP Global Services Web Edition	Not ready	Subscription required
Others TBD	TBD	TBD

Tutorials

2.0.3-RELEASE Example 1.) US TreasuryDirect.gov

2.0.3-RELEASE Example 2.) Quandl

2.0.2.1-RELEASE 1.) Example 1
2.0.2.1-RELEASE 2.) Example 2

Starting CMR in the Spark Shell

For 2.0.3-RELEASE, start the spark-shell, for example, as follows:

[hadoop@ip-redacted ~]$ spark-shell --packages "com.coherentlogic.cmr.api:cmr-api-core:2.0.3-RELEASE,dom4j:dom4j:1.6.1,com.fasterxml.jackson.core:jackson-databind:2.9.6" --exclude-packages "junit:junit,org.jboss.spec.javax.ws.rs:jboss-jaxrs-api_2.1_spec,org.jboss.spec.javax.servlet:jboss-servlet-api_3.1_spec,org.jboss.spec.javax.annotation:jboss-annotations-api_1.2_spec,org.jboss.logging:jboss-logging-annotations,org.jboss.logging:jboss-logging-processor,org.jboss.spec.javax.xml.bind:jboss-jaxb-api_2.3_spec,org.reactivestreams:reactive-streams,javax.activation:activation,net.jcip:jcip-annotations,javax.validation:validation-api,javax.json.bind:javax.json.bind-api,javax.ws.rs:javax.ws.rs-api"

References

[1] The 80/20 data science dilemma
