(disclaimer: this is a rant. no real insight should be expected)
Came across this well-known mathematical/philosophical phrase by chance (sometimes: "one plus one equals one" or "one plus two equals zero").
And this is what happened:
We came up with a product that provides a kind of ETL pipeline over data using Spark. The pipeline gives a simplified read-process-write flow from multiple data sources, through multiple processing scripts, into any number of sink data targets.
For the initial POC release we based our work on an existing product that is used to share data between organizations. This sharing product (being in an early version itself) provided only one share method, called "Copy-Based".
Copy-Based looks like this:
- I share a "pointer" to my data with you
- When you want it, you create a "snapshot"
- You ask for a copy of this snapshot into your own storage.
- Now you have a copy, and you can sell it.
This method gives you an authentic copy of the data, and is simpler to secure on the backend. Disadvantages: it gives you stale data by definition, requires storage, and is pretty slow.
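To make the staleness point concrete, here is a minimal sketch of the Copy-Based flow. Everything in it is hypothetical: the dictionaries stand in for the two organizations' storage, and `create_snapshot` and `copy_snapshot` are made-up names, not the share product's API.

```python
import copy

# Hypothetical in-memory stand-ins for the two organizations' storage.
provider_storage = {"sales": [1, 2, 3]}
consumer_storage = {}

def create_snapshot(storage, name):
    """Freeze the provider's data as it is right now."""
    return copy.deepcopy(storage[name])

def copy_snapshot(snapshot, target, name):
    """Copy the snapshot into the consumer's own storage."""
    target[name] = copy.deepcopy(snapshot)

snapshot = create_snapshot(provider_storage, "sales")
copy_snapshot(snapshot, consumer_storage, "sales")

# The provider keeps writing; the consumer's copy does not follow.
provider_storage["sales"].append(4)
```

After the last line the consumer still holds the three original values, which is exactly the "stale by definition" part.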
A more advanced option is the "In-Place" method, which goes a bit like this:
- I "allow" you to read my data.
- You access the shared "pointer" and read my data "directly"
- You are smarter now.
The second method saves storage cost, limits traffic cost to when you need it, and ensures the data you read is always up to date. It is also inherently faster, because you access the data as soon as it is available. Disadvantages: it is pretty hard to secure on the backend, and it loses the inherent snapshot capability of the Copy-Based method.
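A rough sketch of the In-Place idea, under the same kind of assumptions: `provider_storage` and the `SharedPointer` class are invented for illustration and say nothing about how the real product grants or secures access.

```python
# Hypothetical in-memory stand-in for the provider's storage.
provider_storage = {"sales": [1, 2, 3]}

class SharedPointer:
    """A read-only handle to the provider's data; nothing is stored on the consumer side."""

    def __init__(self, storage, name):
        self._storage = storage
        self._name = name

    def read(self):
        # Every read goes to the source, so it sees the latest data.
        return tuple(self._storage[self._name])

pointer = SharedPointer(provider_storage, "sales")
first_read = pointer.read()
provider_storage["sales"].append(4)
second_read = pointer.read()
```

The second read picks up the new value without any copy step, and there is no snapshot to go back to if you wanted the old state.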
When we leveraged this product, the Copy-Based method fit our needs perfectly for the early version, and this is how we used it:
- The user gives us the simplified pipeline.
- We identify all the data it consumes, and copy it into our processing environment.
- We process away like there is no tomorrow.
- We copy the results out into the users' storage.
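The four steps above can be sketched roughly like this. It is a toy model with assumed names: the real thing ran on Spark, while here `user_storage`, `processing_env`, and the `pipeline` dictionary are plain Python stand-ins.

```python
import copy

# Hypothetical stand-ins: the user's storage, our processing environment,
# and a "pipeline" reduced to source names, one script, and a sink name.
user_storage = {"orders": [10, 20, 30], "refunds": [5]}
processing_env = {}
pipeline = {
    "sources": ["orders", "refunds"],
    "script": lambda data: sum(data["orders"]) - sum(data["refunds"]),
    "sink": "net_revenue",
}

def run_pipeline(pipeline, user_storage, processing_env):
    # 1. Identify the consumed data and copy it into our environment.
    for name in pipeline["sources"]:
        processing_env[name] = copy.deepcopy(user_storage[name])
    # 2. Process the copies, never the originals.
    result = pipeline["script"](processing_env)
    # 3. Copy the result out into the user's storage.
    user_storage[pipeline["sink"]] = copy.deepcopy(result)
    return result

result = run_pipeline(pipeline, user_storage, processing_env)
```

Note that the data gets copied twice, once on the way in and once on the way out, before the user sees anything.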
Then we said "Yes, we can do all that, but we are using the Copy-Based method, and maybe some day we'll do the In-Place fandango."
After a while we had some kind of architectural review with someone, and he looked at our product, and then he looked some more, and then he coughed and said:
"You are doing In-Place, dude. As soon as the magik-pipeline is triggered, it pulls the data, processes it, and spits out a result. This is what In-Place is for!"
We were shocked and bewildered. It meant that we would have to rename all our Enums! But it also meant something else.
It was funny.
Our product provided In-Place processing that needs to create a snapshot of the source data, copy it for processing, and use storage to save the result before you can access it, and by then it is already stale. Because it takes forever.
In other words, we managed to get all the disadvantages of Copy-Based and In-Place in one go!
Then I realized that in this case, One plus One equals Zero.