1. What’s new
The first implementation of the UI for creating a source in BDC has been completed. It lives in a separate branch on GitHub named gl_source_type. This memo describes what has been done so far for this part of the product and how to get the code from that branch.
I will also describe the design of the create source component. Understanding the design will clarify the purpose of the implementation and show what is still open for implementation. Specifically, you can see where the “Source Reader” functionality fits in the overall picture. The Source Reader will be one of the major components, and its functionality will be one of the reasons customers want to pay for this tool. More on this later.
Normally, in large software projects, developers create feature branches separate from the master branch. This lets them share a single feature with others and have the new code reviewed and modified frequently without affecting the master branch, where the more stable code resides. When the feature branch is fully developed and unlikely to change often, it is merged into master. I am following this standard development process.
As of now the gl_source_type branch is “code complete” but not yet fully tested. Once you can run it in your environment, please test it and send me suggestions. I will make modifications and commit the changes to the branch. When we are all happy with the outcome, I will merge it into master. Every feature we develop should follow this same process.
2. How to get the new code
To get the code from the gl_source_type branch, use the following commands:
cd <some dir>/BitA/bdc_site
git fetch
This copies all the branches on GitHub into your local repository as remote-tracking branches. One of them is gl_source_type. Switch to this branch using
git checkout gl_source_type
You can check your branches with “git branch”. You should see something like this:
  bit_readers
* gl_source_type
  master
The asterisk indicates that gl_source_type is your current branch. You can switch back to master with “git checkout master”.
3. How to Run the Code
There is no need to rebuild or relaunch your existing Docker containers. All changes are in bdc_site. If you have BitA on the Docker mounted volume, the changes in bdc_site are accessible within your currently running Docker container.
First, get into the running Docker container using the sh.sh script. Once inside, cd to the /data/BitA directory.
Second, run the BDC web service with:
This will load the new code in the new BDC service.
Third, start a browser and point to this address
Your screen will look like this:
If you have already created an account, click Signin and enter the username and password. Otherwise, create a user for the first time, then log in.
Once logged in, you can see that the top navigation bar contains Home, Data Sources, Dataflows, and Data Targets. The purpose of each component is as follows.
- Home: This is the front page of the site. It should include brief information about the tool on one screen. It is also desirable to include some important links, such as a link to the tutorial for the tool.
- Data Sources: This is where the user specifies information about data sources. The UI of this part is what the gl_source_type branch implements.
- Dataflows: This is where the user designs dataflows, which start from sources and end in targets.
- Data Targets: This is where the user specifies where the targets are. This is similar to creating sources.
4. Creating Sources
Click on Data Sources in the navigation bar to open the list of sources. You have not created any sources yet, so the screen is nearly empty.
To create a new data source, click on the “Create Source” button and you will see the create source screen. This is a page with three collapsible sections. Each section can be expanded and collapsed by clicking on its title. The purpose of each section is as follows.
- General: This is where general information about the source is defined. The user specifies the name and type of the source. A connection string may also be needed so that our tool can set up a connection to get real data. There is also a Description box where the user can add some documentation about this source.
- Data Type: This is where the user describes what the data looks like when it comes out of this source. The user can add attributes if the data type is Record.
- Sample Data: This is where the user can preview a portion of the data. The connection string specified in the General section is used to pull some data from the real source for display. Pulling sample data from the source will help the user design the dataflow.
Let’s enter some data.
In the General section:
- Enter EMP as the name of the source
- Select Instant Data as type
- Leave the connection string blank
- Add some descriptions
In the Data Type section:
- Select Record as data type
- Click on “Add Attribute” button to add the following attributes:
The screen looks like this
At this stage, let’s temporarily forget about sample data and go ahead and click on the Create button. This saves the data you entered in the MongoDB database, and the browser brings you back to the list of sources page. You should now see the new source you just created.
You can click on the red cross to delete this source, but don’t do it right now. Let’s see how to edit it. Click on the name of the source and it will bring you back to the source detail page. Then click on the Sample Data section and enter some records.
Currently, the sample data field only accepts data in JSON format, so the records in the following screen are in JSON format.
This is a list of records that fits the data type specified in the Data Type section. In the General section, you can see that we have set the Source Type to “Instant Data”. This type of source is not backed by a real source such as JDBC, which represents a database source.
An Instant Data source is a source whose data is entered by the BDC user directly in the Sample Data section, as in the screenshot above. Since this kind of source does not need to connect to a real data source to pull data, the data is available “instantly”, hence the name.
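As a rough illustration of what the Sample Data field accepts, the sketch below parses a JSON string into a list of records. The attribute names (empId, name, salary) and the helper parseSampleData are hypothetical, chosen only for this example; they are not taken from the gl_source_type code.

```javascript
// Hypothetical sample data for the EMP source, as a user might type it
// into the Sample Data box. The attribute names are made up for illustration.
const sampleText = `[
  {"empId": 1, "name": "Alice", "salary": 5200.5},
  {"empId": 2, "name": "Bob", "salary": 4100}
]`;

// Parse the text and make sure it is a JSON array of objects (records).
function parseSampleData(text) {
  let parsed;
  try {
    parsed = JSON.parse(text);
  } catch (err) {
    throw new Error('Sample data is not valid JSON: ' + err.message);
  }
  const isRecord = (r) => r !== null && typeof r === 'object' && !Array.isArray(r);
  if (!Array.isArray(parsed) || !parsed.every(isRecord)) {
    throw new Error('Sample data must be a JSON array of records');
  }
  return parsed;
}

const records = parseSampleData(sampleText);
console.log(records.length + ' records parsed');
```

Rejecting anything that is not an array of objects at entry time keeps later steps, such as schema validation, from having to deal with malformed input.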
Now click on the Update button. You will be back on the list of sources page, as if nothing happened. But you can click on the source name, return to the detail page, and see that the sample data you entered is still there. This means that it has been saved in MongoDB.
5. The Design of the Sources component
In bdc_site/backend/models/source_server_model.js, you can find the definition of the Source schema. This schema defines the Source collection in MongoDB. A document in the Source collection has the following attributes:
| Attribute | Description |
|---|---|
| name | A unique name for the source. |
| created | The creation time of the source. |
| lastModified | The update time of the source. |
| desc | A note on this source. |
| owner | The user who created this source. |
| sourceType | Specifies what kind of source this is, such as Google GS, BQ, JDBC, etc. These are all the data sources BDC will be able to read from. |
| connectionStr | The string used to access the source. |
| typeDef | A pointer to the data type definition of the source. A type definition is also a MongoDB object. Its class name is TypeDef, and it is defined in bdc_site/backend/models/typeDef_server_model.js. More on that shortly. |
| sample | The sample data, either pulled from the source or entered directly by the user if the Source Type is Instant Data. The sample data is a string in JSON format. In the example above, the sample data is an array of records. |
The TypeDef class has these fields:
| Field | Description |
|---|---|
| name | The name of this field. |
| type | The data type of the field. This must follow JSON format and conventions. |
| desc | A note about this field. |
When the user creates a source, the Data Type section of the UI is first used to create a TypeDef object in MongoDB. A reference to this new TypeDef object is then saved in the typeDef field of the new Source object.
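The creation order described here can be sketched with plain JavaScript objects. This is a minimal in-memory stand-in, not the actual model code: the db object, createTypeDef, createSource, the generated ids, and the form layout are all assumptions made for illustration. Only the field names come from the tables above.

```javascript
// In-memory stand-in for the two MongoDB collections (an assumption for
// this sketch; the real code stores documents through the models in
// bdc_site/backend/models).
const db = { typeDefs: new Map(), sources: new Map() };
let nextId = 1;

// Step 1: store the data type definition and return its id.
function createTypeDef({ name, type, desc }) {
  const id = 'typedef-' + nextId++;
  db.typeDefs.set(id, { name, type, desc });
  return id;
}

// Step 2: store the source, holding a reference to the TypeDef.
function createSource(form, owner) {
  if (db.sources.has(form.name)) {
    throw new Error('Source name must be unique: ' + form.name);
  }
  const typeDefId = createTypeDef(form.dataType);
  const now = new Date();
  const source = {
    name: form.name,
    created: now,
    lastModified: now,
    desc: form.desc,
    owner: owner,
    sourceType: form.sourceType,
    connectionStr: form.connectionStr,
    typeDef: typeDefId, // a reference, not an embedded copy
    sample: form.sample || '',
  };
  db.sources.set(source.name, source);
  return source;
}

// Hypothetical form contents; the attribute list is made up.
const emp = createSource({
  name: 'EMP',
  sourceType: 'Instant Data',
  connectionStr: '',
  desc: 'Employee records for testing',
  dataType: {
    name: 'EMP record',
    type: JSON.stringify({ type: 'Record', attributes: [{ name: 'empId', type: 'Number' }] }),
    desc: 'Type of the EMP source',
  },
}, 'demo-user');

console.log(emp.typeDef, db.typeDefs.get(emp.typeDef).name);
```

Keeping the type definition as a separate object means the Source only carries a pointer to it, which matches the typeDef attribute described in the table.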
If the user sets the “Source type” of a Source to “Instant Data”, the user can enter data directly in the “Sample Data” field. However, if the Source type is not Instant Data, the sample data must be pulled from the real data source, using the connection string from the General section.
As of this writing, pulling sample data from a real source has not been implemented; only Instant Data works. Instant Data was done first because it lets us start working on the create dataflow component. The instant data we enter is more than enough to drive dataflow testing.
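The rule for where sample data comes from can be expressed as a small function. This is only a sketch: resolveSample and the pullFromSource callback are hypothetical names, and the callback stands in for the reader that does not exist yet.

```javascript
const INSTANT_DATA = 'Instant Data';

// Decide where a source's sample data comes from.
// `pullFromSource` is a hypothetical callback standing in for the
// not-yet-implemented reader; it receives the connection string.
function resolveSample(source, pullFromSource) {
  if (source.sourceType === INSTANT_DATA) {
    // Entered by the user directly in the Sample Data section.
    return source.sample;
  }
  if (typeof pullFromSource !== 'function') {
    throw new Error('Pulling from "' + source.sourceType + '" is not implemented yet');
  }
  return pullFromSource(source.connectionStr);
}

const instant = { sourceType: INSTANT_DATA, connectionStr: '', sample: '[{"empId": 1}]' };
console.log(resolveSample(instant));
```

For any other source type the function refuses to guess and fails loudly until a reader is supplied, which mirrors the current state of the branch.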
6. Design of Reading Sample Data from Real Source
(This part is still to be completed…)
If the source type is not Instant Data, samples must be pulled from the real source. The UI will look like this:
The user specifies how many rows to pull from the source, then clicks on the “Get Data” button. The actions that follow the click are like what is shown in this diagram.
The handler of the click event in AngularJS sends a GET request to the NodeJS backend for the source data. The number of rows to pull from the source is passed as a parameter of this GET request.
The backend spawns a child process and runs Java in it. The Java code is the source reader code currently being developed in the bit_readers branch. After successfully accessing the source and pulling the data, the child process finishes and returns the records in JSON format to the NodeJS backend.
The backend then returns an HTTP response back to the frontend, which in turn sends it to the browser for display.
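The backend step could look roughly like the sketch below, which uses Node's child_process.spawn and wraps the result in a promise. Since this part is not implemented, everything here is an assumption: readSample is a hypothetical helper, and the command and arguments are placeholders because the Java reader's class name and flags are not defined yet. In the demo, Node itself plays the role of the reader so the sketch runs without Java.

```javascript
const { spawn } = require('child_process');

// Run a reader command in a child process and resolve with the parsed records.
// `command` and `args` are placeholders for the future Java reader invocation.
function readSample(command, args) {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args);
    let out = '';
    let err = '';
    child.stdout.setEncoding('utf8');
    child.stderr.setEncoding('utf8');
    child.stdout.on('data', (chunk) => { out += chunk; });
    child.stderr.on('data', (chunk) => { err += chunk; });
    child.on('error', reject); // e.g. the command could not be started
    child.on('close', (code) => {
      if (code !== 0) {
        reject(new Error('Reader exited with code ' + code + ': ' + err));
        return;
      }
      try {
        resolve(JSON.parse(out)); // records in JSON format
      } catch (e) {
        reject(new Error('Reader did not return valid JSON'));
      }
    });
  });
}

// Demo: a tiny Node script stands in for the reader and prints `n` rows.
const fakeReader = 'const n = Number(process.argv[1]);' +
  'console.log(JSON.stringify(Array.from({ length: n }, (_, i) => ({ row: i + 1 }))));';

readSample(process.execPath, ['-e', fakeReader, '2'])
  .then((records) => console.log(records))
  .catch((e) => console.error(e.message));
```

Collecting stdout until the process closes, rather than parsing each chunk, avoids trying to parse a partial JSON document when the output arrives in pieces.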
These interactions involve two promise objects. One is in the browser, which waits on a promise until it is fulfilled with records or fails. The other is in the AngularJS frontend, which uses a promise to wait for the records to return from the backend.
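On the frontend side, the promise-based request might be shaped like this. The sketch does not depend on AngularJS: httpGet is an injected function that stands in for AngularJS's $http.get (a URL plus a config object with params, returning a promise for a response with a data field). The fetchSample name and the URL path are hypothetical.

```javascript
// `httpGet` stands in for AngularJS's $http.get. The URL path is made up.
function fetchSample(httpGet, sourceName, rows) {
  const url = '/api/sources/' + encodeURIComponent(sourceName) + '/sample';
  // The number of rows travels as a query parameter of the GET request.
  return httpGet(url, { params: { rows: rows } })
    .then((response) => response.data); // fulfilled with the records
}

// Fake backend so the sketch runs on its own.
function fakeHttpGet(url, config) {
  const data = Array.from({ length: config.params.rows }, (_, i) => ({ row: i + 1 }));
  return Promise.resolve({ data: data });
}

fetchSample(fakeHttpGet, 'EMP', 3)
  .then((records) => console.log('got', records.length, 'records'))
  .catch((err) => console.error('request failed:', err.message));
```

Passing the HTTP function in as an argument keeps the handler easy to test without a running backend.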
7. Validate the Sample Data
(This part is still to be completed…)
Immediately below the JSON data, there will be another button, “Validate Data”. This button checks whether the sample data agrees with the schema defined in the Data Type section.
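One possible shape for this check is sketched below. It assumes the Data Type section yields a list of attributes of the form { name, type } and that the type names are String, Number, and Boolean; neither assumption comes from the current code, and validateRecords is a hypothetical helper.

```javascript
// Type predicates for the assumed attribute type names.
const typeChecks = {
  String: (v) => typeof v === 'string',
  Number: (v) => typeof v === 'number' && Number.isFinite(v),
  Boolean: (v) => typeof v === 'boolean',
};

// Return a list of human-readable problems; an empty list means the
// sample data agrees with the attribute list.
function validateRecords(records, attributes) {
  const errors = [];
  records.forEach((record, index) => {
    if (record === null || typeof record !== 'object' || Array.isArray(record)) {
      errors.push('record ' + index + ': not a record');
      return;
    }
    for (const attr of attributes) {
      const check = typeChecks[attr.type];
      if (!check) {
        errors.push('attribute "' + attr.name + '": unknown type ' + attr.type);
      } else if (!(attr.name in record)) {
        errors.push('record ' + index + ': missing attribute "' + attr.name + '"');
      } else if (!check(record[attr.name])) {
        errors.push('record ' + index + ': "' + attr.name + '" is not a ' + attr.type);
      }
    }
    for (const key of Object.keys(record)) {
      if (!attributes.some((attr) => attr.name === key)) {
        errors.push('record ' + index + ': unexpected attribute "' + key + '"');
      }
    }
  });
  return errors;
}

// Made-up attributes and sample data for illustration.
const attributes = [{ name: 'empId', type: 'Number' }, { name: 'name', type: 'String' }];
const sample = JSON.parse('[{"empId": 1, "name": "Alice"}, {"empId": "2"}]');
console.log(validateRecords(sample, attributes));
```

Returning every problem at once, instead of stopping at the first one, lets the UI show the user all the records that need fixing in a single pass.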
8. Possible More Bells and Whistles
(This part is still to be completed…)
The above description covers the basic Source access front end. More advanced features may be added later. The following are possible extensions.
- Let the user define filters. For example, the user could specify a simple filter condition like “price > 0.5 and numItems > 100”.
- Let the user specify more source-type-specific properties. For example, if the source type is JDBC, the user needs to give a SQL statement in addition to the connection string. This additional property is not needed for other types of sources. A Hadoop file, for instance, could normally be specified by hdfs://host:port/path/file.txt, where both how to connect and where the file is are given in one connection string. Clearly, the UI needs to adapt to the source type.
- Let the user specify where to store data that does not fit the schema defined in the Data Type section. This will be a must-have before this product can be released. In a big data environment, the high velocity of data means that the source data may not have been carefully prepared for our tool, or there may temporarily be bad data. The tool must be resilient to these abnormalities, so we must allow the user to specify where to direct bad records that do not fit the schema.
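The filter and bad-record extensions can be pictured together as one routing step. The sketch below is an assumption about how such a step might work, not a design decision: routeRecords is a hypothetical helper, and both predicates are supplied by the caller, so how they would be built from the UI is left open.

```javascript
// Split incoming records into three groups: accepted, filtered out, and rejected.
function routeRecords(records, fitsSchema, filter) {
  const accepted = [];
  const filteredOut = [];
  const rejected = []; // bad records, to be written to a user-chosen location
  for (const record of records) {
    if (!fitsSchema(record)) {
      rejected.push(record);
    } else if (filter(record)) {
      accepted.push(record);
    } else {
      filteredOut.push(record);
    }
  }
  return { accepted, filteredOut, rejected };
}

// A stand-in schema check for two numeric attributes.
const fitsSchema = (r) => typeof r.price === 'number' && typeof r.numItems === 'number';
// The filter from the example above: price > 0.5 and numItems > 100
const filter = (r) => r.price > 0.5 && r.numItems > 100;

const result = routeRecords([
  { price: 0.9, numItems: 250 },
  { price: 0.2, numItems: 500 },
  { price: 'n/a', numItems: 120 },
], fitsSchema, filter);
console.log(result);
```

Checking the schema before applying the filter means a bad record is never silently dropped by a filter condition it cannot be evaluated against.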