Tutorial: Getting Data

Getting Data

This walkthrough is also available as a Jupyter ipynb notebook that you can run yourself.

A frequently asked question is: 'How do I get data?'

Importantly, we are just working with NodeJS,
albeit within the context of a Jupyter Lab notebook.

So essentially, any data that you can programmatically access with an NPM module is available within a Jupyter Lab notebook.

Example Data Sources:

  • Local Files
  • APIs and Cloud Services
  • Scraping
  • Generating
  • etc.

Note on Asynchronous Calls

[TODO]

We recommend using promises together with async functions and await.

We provide the utils.ijs.await function to provide await functionality within the iJavaScript kernel, so the next cell will only be executed once the asynchronous calls are done.

This allows for very simple chaining of asynchronous and synchronous calls.

Please give it a look

For example:

//-- fetch the data
//-- and do not execute the next cell until it is received.
utils.ijs.await(async ($$, console) => {
 barley = await utils.datasets.fetch('barley.json');
})
//-- use the data as though it was synchronously received

//-- group the barley records by variety and then by site
barleyByVarietySite = d3.group(barley, d => d.variety, d => d.site)
//-- now group by variety and year
barleyByVarietyYear = d3.group(barley, d => d.variety, d => d.year)
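
The same idea can be illustrated outside the kernel helper with nothing but built-in promises. This is only a sketch: `fetchLater` is a hypothetical stand-in for a real asynchronous data call and is not part of jupyter-ijavascript-utils, and the sample rows are made up.

```javascript
// Hypothetical stand-in for an asynchronous data source:
// resolves with `value` after `delayMs` milliseconds.
const fetchLater = (value, delayMs = 10) =>
  new Promise((resolve) => setTimeout(() => resolve(value), delayMs));

// Each `await` pauses this function until the promise settles,
// so the steps read top to bottom like synchronous code.
async function loadAndSummarize() {
  const rows = await fetchLater([{ amount: 27 }, { amount: 48 }, { amount: 33 }]);
  const amounts = rows.map((row) => row.amount);
  return { count: rows.length, max: Math.max(...amounts) };
}

// The caller receives a promise for the final result.
const summaryPromise = loadAndSummarize();
summaryPromise.then((summary) => console.log(summary));
```

Inside a notebook, wrapping the awaited calls in utils.ijs.await (as shown above) is what keeps the next cell from running before the promise settles.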

Library Setup

//-- this library
utils = require('jupyter-ijavascript-utils');
//-- library for parsing and manipulating html (like from fetch)
cheerio = require('cheerio');
//-- data-driven-documents (d3) library - allows for TSV/CSV/etc. files
d3 = require('d3');
//-- library for working with secrets locally
dotenv = require('dotenv');
['utils', 'cheerio', 'd3', 'dotenv'];
[ 'utils', 'cheerio', 'd3', 'dotenv' ]

Local Files

There are many NPM modules available for loading different kinds of files.

The jupyter-ijavascript-utils library includes a few simple functions for frequent situations, including functions to write files as well.

  • utils.file.pwd - prints the current directory Jupyter Lab is looking at
  • utils.file.listFiles - lists the files found within a directory

utils.file.pwd()
// /path/to/notebooks

utils.file.listFiles('.');

/*
[
  'ex_GettingData.ipynb',
  'node_modules',
  'package-lock.json',
  'package.json',
  ...
]
*/
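
For cases the helpers do not cover, Node's built-in fs module is always available. The following is a minimal sketch of writing a JSON file and reading it back; the scratch directory, file name, and sample records are made up for illustration.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical sample records and scratch location, for illustration only.
const records = [
  { name: 'alpha', value: 1 },
  { name: 'beta', value: 2 }
];
const scratchDir = fs.mkdtempSync(path.join(os.tmpdir(), 'getting-data-'));
const jsonPath = path.join(scratchDir, 'records.json');

// Write the records as pretty-printed JSON ...
fs.writeFileSync(jsonPath, JSON.stringify(records, null, 2), 'utf8');

// ... then read the file back and parse it into objects again.
const loaded = JSON.parse(fs.readFileSync(jsonPath, 'utf8'));

console.log(fs.readdirSync(scratchDir)); // lists the file just written
console.log(loaded.length);
```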

For example, checking whether your dotenv credentials file can be found:

credentialsPath = `${utils.file.pwd()}/credentials.env`;

if (utils.file.checkFile('./credentials.env')) {
    //-- credentials could not be found
    //-- let the user know what is expected
    
    console.error(`Could not find ${credentialsPath}

We use dotenv to securely store credentials
and require it to access provider XYZ.

* username {string}
* password {string}

for example:

"""
username="jdoe@example.com"
password=""
"""

Please create this file and run again.
`);
    throw Error(`credentials file not found:${credentialsPath}\nPlease read the message above and try again`);
    
    
} else {
    //-- the credentials file was found, so load it
    credentials = dotenv.config({ path: credentialsPath }).parsed;
}

//-- check we have all the information needed to move forward

if (!credentials.username || !credentials.password) {
    throw Error(`Credentials not provided
${credentialsPath}`);
}

//-- indicate success
//-- BUT always be careful NOT TO leak credentials
console.log(`credentials loaded`);
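
When more than a couple of keys are required, the final check can be generalized. The sketch below uses a hypothetical helper, `findMissingKeys` (not part of dotenv or of this library), that reports which keys are absent or empty without ever printing their values.

```javascript
// Hypothetical helper: returns the required keys that are absent or empty.
// Only key names are reported, never the secret values themselves.
function findMissingKeys(credentials, requiredKeys) {
  const source = credentials || {};
  return requiredKeys.filter((key) => !source[key]);
}

// Example input mirroring the sample credentials.env above,
// where `password` was left empty.
const exampleCredentials = { username: 'jdoe@example.com', password: '' };

const missing = findMissingKeys(exampleCredentials, ['username', 'password']);
if (missing.length > 0) {
  console.error(`Credentials missing: ${missing.join(', ')}`);
}
```

Reporting only the key names keeps the error message useful while staying clear of the leak the comment above warns about.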

APIs and Cloud Services

[TODO]

Remember, we are still using NodeJS - so you can leverage NPM packages to load data.

For example:

JSForce is a brilliant library for working with Salesforce.
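
Many services can also be reached with a plain HTTP request. The following is a minimal sketch, assuming Node 18 or later (where fetch is built in). The helper name `fetchJson` and the URL are hypothetical, and the fetch implementation is passed in as a parameter so that a stub can stand in for a real service.

```javascript
// Hypothetical helper: GET a URL and parse the JSON body.
// `fetchImpl` defaults to the global fetch built into Node 18+.
async function fetchJson(url, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url, { headers: { Accept: 'application/json' } });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}: ${url}`);
  }
  return response.json();
}

// Stub standing in for a real service, so the sketch runs offline.
const stubFetch = async () => ({
  ok: true,
  status: 200,
  json: async () => [{ id: 1, name: 'example' }]
});

// https://api.example.com/items is a placeholder, not a real endpoint.
const itemsPromise = fetchJson('https://api.example.com/items', stubFetch);
itemsPromise.then((items) => console.log(items.length));
```

In a notebook, dropping the second argument uses the real fetch, and the call would typically sit inside utils.ijs.await so later cells see the result.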

Working with Secrets

[TODO]

Working with secrets in a Jupyter notebook is similar to working with any NodeJS project.

Dotenv is a staple for working with credentials and is highly recommended.

Remember though:

  • DO NOT include credentials within any notebook
  • If using dotenv, ensure the files are properly secured (as they are outside of the notebook)
    • For example, that they are gitignored, have appropriate read access, etc.
  • As a notebook can provide summaries of data accessed through secure means, always protect the notebook as well, to avoid any security leaks.

Database Access

[TODO]

As we are working within NodeJS, there are many NPM libraries that can help with accessing databases.

For example: Sequelize is a promise-based Node.js ORM tool for Postgres, MySQL, MariaDB, SQLite, DB2 and Microsoft SQL Server. It features solid transaction support, relations, eager and lazy loading, read replication and more.

Of course, the native database libraries can always be used:

  • mssql - for working with SQL Server
  • mysql - for working with MySQL

Scraping

[TODO]

Scraping (retrieving through fetch, parsing and collating) can be done within a Jupyter Notebook.

In general, however, we recommend keeping this to simple fetches and parsing.

The jupyter-ijavascript-utils library includes two convenience functions for working with fetch, a simple shim for traditional JavaScript fetch calls from within node.

Additional libraries (for example, cheerio) can also be used to parse the data and generate datasets.

Working with Text Files

[TODO]

You can also load text files, using a path relative to the current directory.

There are many different ways to do this.

// sillySong = utils.file.load('../data/pirates.txt');

sillySong = `I am the very model of a modern Major-General
I've information vegetable, animal, and mineral
I know the kings of England, and I quote the fights Historical
From Marathon to Waterloo, in order categorical
I'm very well acquainted, too, with matters Mathematical
I understand equations, both the simple and quadratical
About binomial theorem I'm teeming with a lot o' news
With many cheerful facts about the square of the Hypotenuse
With many cheerful facts about the square of the Hypotenuse
With many cheerful facts about the square of the Hypotenuse
With many cheerful facts about the square of the Hypotepotenuse`

sillyLines = sillySong.split(/\n\s*/)        // split on each line break (and any whitespace that follows)
    .map(line => line.trim());   // trim each line
[
  'I am the very model of a modern Major-General',
  "I've information vegetable, animal, and mineral",
  'I know the kings of England, and I quote the fights Historical',
  'From Marathon to Waterloo, in order categorical',
  "I'm very well acquainted, too, with matters Mathematical",
  'I understand equations, both the simple and quadratical',
  "About binomial theorem I'm teeming with a lot o' news",
  'With many cheerful facts about the square of the Hypotenuse',
  'With many cheerful facts about the square of the Hypotenuse',
  'With many cheerful facts about the square of the Hypotenuse',
  'With many cheerful facts about the square of the Hypotepotenuse'
]
sillyLines[0]; // I am the very model of a modern Major-General,
'I am the very model of a modern Major-General'

More Complicated Example

To show both randomly creating text and then parsing it back:

errorHeader = ['INFO', 'WARNING', 'ERROR'];
errorType = ['Syntax Error', 'Uncaught Exception', 'Exception Thrown'];
errorIn = ['File_A.js', 'File_B.js', 'File_C.js'];

generateErrorLine = () => utils.random.randomInteger(0, 200);
generateErrorHeader = () => utils.random.pickRandom(errorHeader);
generateType = () => utils.random.pickRandom(errorType);
generateFile = () => utils.random.pickRandom(errorIn);

generateError = () => `[${
    generateErrorHeader()
}]: ${
    generateType()
} occurred in ${
    generateFile()
}: ${
    generateErrorLine()
}`;
[Function: generateError]

Example Error Line

generateError()
'[WARNING]: Exception Thrown occurred in File_A.js: 84'

Generate Example Error file.

Each line in the format of:

[ErrorType]: [ExceptionType] occurred in [File]: [Line Number]

errorFile = utils.array.size(10, generateError).join('\n');
console.log(errorFile);
[ERROR]: Exception Thrown occurred in File_B.js: 28
[ERROR]: Syntax Error occurred in File_C.js: 85
[INFO]: Exception Thrown occurred in File_C.js: 67
[ERROR]: Uncaught Exception occurred in File_C.js: 68
[INFO]: Exception Thrown occurred in File_A.js: 118
[ERROR]: Exception Thrown occurred in File_B.js: 101
[INFO]: Syntax Error occurred in File_C.js: 87
[ERROR]: Uncaught Exception occurred in File_A.js: 170
[ERROR]: Exception Thrown occurred in File_A.js: 47
[ERROR]: Exception Thrown occurred in File_B.js: 191

Now we can parse it

We can use the same format as the one above:

[ErrorType]: [ExceptionType] occurred in [File]: [Line Number]

errorLineRegex = /\[(.+)\]: (.+) occurred in (.+): (.+)/i;
// The `/` characters at the start and the end delimit the regex.
// The `i` after the closing `/` makes the match case insensitive.

// Each `(.+)` is a capture group: it remembers whatever it matched.

// `\[(.+)\]` looks for literal `[` and `]` characters (escaped) and remembers what is inside
// `: (.+) occurred in ` remembers everything between `: ` and ` occurred in `
// `(.+): ` remembers everything from there up to the last `: `
// and the final `(.+)` remembers anything after that `: `

parsedErrorFile = errorFile
    .split(/\n/)
    .map((line) => Array.from(line.match(errorLineRegex)).slice(1));
[
  [ 'ERROR', 'Exception Thrown', 'File_B.js', '28' ],
  [ 'ERROR', 'Syntax Error', 'File_C.js', '85' ],
  [ 'INFO', 'Exception Thrown', 'File_C.js', '67' ],
  [ 'ERROR', 'Uncaught Exception', 'File_C.js', '68' ],
  [ 'INFO', 'Exception Thrown', 'File_A.js', '118' ],
  [ 'ERROR', 'Exception Thrown', 'File_B.js', '101' ],
  [ 'INFO', 'Syntax Error', 'File_C.js', '87' ],
  [ 'ERROR', 'Uncaught Exception', 'File_A.js', '170' ],
  [ 'ERROR', 'Exception Thrown', 'File_A.js', '47' ],
  [ 'ERROR', 'Exception Thrown', 'File_B.js', '191' ]
]

Converting Arrays to Objects

We can use Array Destructuring and then construct the objects.

[ErrorType]: [ExceptionType] occurred in [File]: [Line Number]

errorData = parsedErrorFile.map(([errorType, exceptionType, file, lineNumber]) =>
    ({ errorType, exceptionType, file, lineNumber }));

utils.table(errorData).render()
errorType  exceptionType       file       lineNumber
ERROR      Exception Thrown    File_B.js  28
ERROR      Syntax Error        File_C.js  85
INFO       Exception Thrown    File_C.js  67
ERROR      Uncaught Exception  File_C.js  68
INFO       Exception Thrown    File_A.js  118
ERROR      Exception Thrown    File_B.js  101
INFO       Syntax Error        File_C.js  87
ERROR      Uncaught Exception  File_A.js  170
ERROR      Exception Thrown    File_A.js  47
ERROR      Exception Thrown    File_B.js  191
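
The matching and object-building steps can also be combined by using named capture groups, which additionally makes it easy to skip lines that do not match instead of throwing on them. This is a minimal sketch: `parseErrorLine` is a hypothetical helper name, and the sample lines are copied from the generated output above plus one deliberately malformed line.

```javascript
// Named groups label each captured part directly.
const namedErrorRegex =
  /^\[(?<errorType>[^\]]+)\]: (?<exceptionType>.+) occurred in (?<file>.+): (?<lineNumber>\d+)$/;

// Hypothetical helper: returns an object for a matching line, or null otherwise.
function parseErrorLine(line) {
  const match = line.match(namedErrorRegex);
  if (!match) return null;
  const { errorType, exceptionType, file, lineNumber } = match.groups;
  // Convert the line number from a string to a number.
  return { errorType, exceptionType, file, lineNumber: Number(lineNumber) };
}

const sampleLines = [
  '[ERROR]: Exception Thrown occurred in File_B.js: 28',
  'this line is not an error record',
  '[INFO]: Syntax Error occurred in File_C.js: 87'
];

// Parse every line and drop the ones that did not match.
const parsedSample = sampleLines.map(parseErrorLine).filter((row) => row !== null);
console.log(parsedSample);
```

Converting lineNumber to a number here is a design choice for the sketch; the destructuring version above keeps it as a string.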

Generating

[TODO]

Generating data can also be a simple option if desired.

The random module provides several sets of methods:

  • Generating Random Numbers
  • Working with Arrays
  • Simplex Noise
    • simplexGenerator(seed) - Number generator between -1 and 1 given an x/y/z coordinate
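
To make the first two categories concrete, here is a minimal sketch of what such helpers do, written with only Math.random. The functions below are illustrative stand-ins and are not the library's implementation; in particular, treating the upper bound as inclusive is an assumption made for this sketch, and the sensor readings are made up.

```javascript
// Stand-in: random integer between min and max, both inclusive (an assumption).
const randomInteger = (min, max) =>
  Math.floor(Math.random() * (max - min + 1)) + min;

// Stand-in: pick one element of an array at random.
const pickRandom = (values) => values[randomInteger(0, values.length - 1)];

// Generate five hypothetical sensor readings.
const statuses = ['ok', 'degraded', 'offline'];
const readings = Array.from({ length: 5 }, (_, index) => ({
  id: index,
  status: pickRandom(statuses),
  temperature: randomInteger(15, 30)
}));

console.log(readings);
```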

Additionally, there are many different ways of generating visualizations based on simplex noise.

From straight (red - negative / green - positive)

Screenshot of animation

To indicators with length, and rotation (negative ccw / positive cw)

Screenshot of animation