Moved

Moved. See https://slott56.github.io. All new content goes to the new site. This is a legacy, and will likely be dropped five years after the last post in Jan 2023.

Tuesday, June 21, 2016

Why Python? (Sad Follow-up)

In "Why Python?" I linked to a deep and sophisticated analysis of programming languages. Anyway, I thought it was a deep and sophisticated analysis.

I got a reply that shows how wrong I was. Here's the quote:
The point is that the Python ecosystem has a lot to offer. We could argue about the language design choices. However, why bother? Why not just take advantage of what the ecosystem has to offer.
Ah. Discussing the language is just "arguing". I guess the points are all debatable and my comparison of Python to any benchmark is just the seed for an argument. A religious war, perhaps. I guess this wasn't compelling. It was a "why bother?"

Why bother pointing out the strong points of the language?

The email emphasized the "ecosystem" with a cool, but short example of how scipiy.spatial.KDTree works. 

It appears that -- for some people -- "Python code actually works" is a useful response to "why python?" 

I would have thought that "Python code actually works" was a precondition to even discussing the value proposition behind Python.

But -- clearly -- I was wrong.  The mere fact of a working example is a Very Important Thing™.

What does this mean?
  1. There are people who use software that doesn't actually work. When they see software that works, it's important. Very important.
  2. When software actually works, these people find this simple fact to be a compelling and substantial argument for placing a high value on the software.
  3. Other considerations like clarity and simplicity aren't relevant. If these poor souls are suffering software that doesn't actually work, then broken and obscure is still broken. Other parts of the long discussion from Wirth are just arguing points.
The email included "consider amending the why python? blog w/ the other big pro: ecosystem" I'm not sure I actually understand the request. When code that works is a "big pro", this comes from a world I can't pretend to understand.

Also. The example code used xrange(). Which is a Python 2 smell. Those days are passed.

Tuesday, June 14, 2016

Continuous Data Migration

See http://slott-softwarearchitect.blogspot.com/2013/07/database-conversion-or-schema-migration.html

People talk about CI/CD (Continuous Integration/Continuous Deployment).

They also need to talk about CM (Continuous Migration).

"Wait, what?" you ask.

When we roll out a new version of the software (CD) there are three common situations.
  1. The new software uses the existing data model with no changes. This is a "minor version change": from v3.2 to v3.3.
  2. The new software requires a tweak to the schema, but it's backward compatible. This, too, is a minor version change. In a SQL context, we might have used an ALTER TABLE to add a nullable column. If there are no SELECT * statements in the code, this change is essentially transparent to legacy code.
  3. The new software involves a new schema that's not backwards compatible. This is a major version change. From v3.2 to v4.0. This is difficult. Really difficult.
Clearly, the first two can be done with the data in place. New software is installed, the servers are restarted, and away we go. In a big environment, there may be a rolling deployment. There may be a canary release that will get converted first, then others will be brought online.

A change of the Second Kind does involve the one-time database transformation script. This may lead to some down-time. Or it may lead to a feature toggle so that the new software can work with the old database until the script is run.

In a NoSQL context, a change of the Second Kind doesn't require the one-time script. The new documents have new fields that old documents don't have. NoSQL apps -- in general -- must be able to cope with data model variations.

A change of the Third Kind is trouble.

Big trouble.

We have two schema: the v3 schema and the v4 schema. We have two sets of software: the v3.2 release and the v4.0 release. We'd like to have just one valid set of data. How do we deal with this?

How can we do schema migration badly?

We can't easily have a single software release that includes one set of data in both schema. It's technically possible Anything that doesn't involve time travel, anti-gravity or perpetual motion is technically possible. But it rapidly becomes so complex that we have to set this uber-version idea aside.

We have to do more deployment work to have both v3.2 and v4.0 installed in parallel. v3.2 will use data in the v3 schema, v4.0 will use data in the new schema.

How do we migrate the data from the old schema to the new schema?

This can be tricky. There are proven bad ideas out there. Really epically bad ideas.

One Very Bad Idea (VBI™) is the one-time-only data migration. Back in the olden days, we couldn't afford enough storage to have two copies of the database. Seriously. When a company owned exactly one computer (before PC's -- a Very Long Time Ago) the conversion had to be done by making special backups and restoring the backups into the new schema.

This VBI is still with us today.  Lots of places want to do one-time-only data migrations because it's the traditional approach. If they can't done a one-time conversion (over a long weekend) they complain. Loudly.

BTW. This never worked well. The one-time-only conversion software was never tested carefully, and therefore rarely worked the one time it was needed. Also, data profiling was never done, so edge and corner cases were found during conversion. These often called the new software's features into question, leading to larger and larger problems.

Continuous Migration

The ideas behind continuous migration are these.
  1. We're always going to be migrating the data. Always.
  2. Storage is cheaper than labor. When in doubt, buy more storage.
  3. Data outside the database (in CSV files or YAML documents) is smaller than data inside the database. Don't be afraid to export.
  4. Data outside the database is inaccessible. Be cautious of the implied down-time during exports and imports.
  5. ABP. Always Be Profiling. If you don't have a data profiler in place right now, that's the first thing to build. There are schema definition tools and schema checking tools. Look at JSON-Schema.org. Write schema definitions and use a data profiler to examine all rows and check all rules. All. Seriously. All. In a SQL DB, actually check the foreign keys to be sure the referenced row exists; you'll be surprised.
  6. We're moving forward. We're not milling around; we're not supporting the old version except for the purposes of a parallel test or a fall-back in the event the next version doesn't work. There's no long-term coexistence strategy. Preserve the data; upgrade the software.
Here's the central data migration requirement:

Be able to migrate to the new schema as many times as needed.

I'll repeat that. As. Many. Times. As. Needed.

Migration is not a one-time thing. You do it all the time.
  1. Migrating (and possibly sanitizing or subsetting) production data into the development environment.
  2. Migrating production data for QA testing.
  3. Migrating production data for integration testing.
  4. Migrating production data for performance testing.
  5. Migration production data for the production upgrade.
These are all the same activity. 

I'll repeat that. The. Same. Activity. Sometimes with mappings. Sometimes with filters. 

Since you'll do many, many migrations, your data migration programming is as important as your application programming. Perhaps more important than the application code because it's what preserves the data, and the data is the only thing of value. Applications come and go. Data is forever.

Having real data available permits seamless, silent, and automatic parallel testing. We can easily do a parallel test with v3.2 and v4.0 release candidates by simply running the migration (or migration with subset filter) to gather some data for the parallel test. If the release candidate has problems, we can fix v4.0 to create the next release candidate, re-migrate the data, and try the parallel test again.

At some point the v4.0 release is final, and we need to migrate all of the data. This (usually) involves some feature toggles to put v3.2 into a special "end-of-life" mode where the keys for records which change are logged separately. After turning off v3.2 and turning on v4.0, a second phase of migration will process these end-of-life rows through the migration mill.

Software and Schema Design Consequences

This has an important consequence.

Your software must be explicitly bound to a specific schema by major version number.

Explicitly bound. In a SQL context, you can use the "schema" construct an include the version number in the schema name. "myapp_v3" vs. "myapp_v4". This becomes a ubiquitous qualifier on all table names. SELECT col FROM myapp_v4.some_table AS st.

Yes. Do this Everywhere. Do it Now. 

If you're using mybatis or SQLAlchemy to get the SQL out of your application, then this kind of thing is a trivial change. If you have SQL in your application code, well, you have two problems to solve. First, get the SQL out of your application. Then make the schema version explicit.

In a NoSQL context, you can include the schema version as part of a collection name. "collection_v3" or "collection_v4".

This should be present everywhere.

Then, you'll need data validation apps and data migration apps. The validation apps will use your favorite schema definition and schema validation framework. Start running this as soon as you think you might need to make a major version change.

Finally, you'll need the data migration tool set. This will involve filter rules and sanitizing rules. These are not sophisticated "rules engine" kind of things with unbounded complexity. They're usually if statements and simple computations. But they come and go pretty freely, so design the software in a way that makes the filter and sanitizing code obvious.

Now you can -- trivially -- migrate data between schema versions inside the same database. You can have v3.2 and v4.0 running side-by-side. You can migrate the data early and often. You can profile and validate the data. You have a formal schema for the data validation. 

Tuesday, June 7, 2016

Why rewrite a shell script in Python?

Here's the actual quote:

Why would you need to rewrite a working script in python ? Was there any business direction towards this ?

This was an unexpected response. And unwelcome. I guess I called their baby ugly.

The short answer is that the shell script language is perhaps one of the worst programming languages ever invented. Okay. I suppose it's better than whitespace.  Okay it's better than many others. See https://en.wikipedia.org/wiki/Esoteric_programming_language

The longer answer is this:
  • There are (at least) three ksh scripts involved, two of which are over 1,000 lines long. It's not perfectly clear precisely what's involved. It's ksh. Code could come from a variety of places through very obscure paths; e.g. the source command and it's synonym, ..
  • There are no comments other than #!/usr/bin/ksh and a few places where code is commented out.  Without explanation.
  • There is no other documentation. The author had sent a email describing the github repo. The repo lacked a README. It took two tries to get them to understand that any email describing a repo should have been the README in the repo. There is barely even a command-line synopsis. (Eventually, I found it in the parser for command-line options.)
  • No tests of any kind.
The last point is the one that I find shocking. And I find it shocking on a regular basis.

Folks are able and willing to write 1,000's of lines of shell script without a single unit test, integration test, system test, performance test, anything test. How do they know if this works? Why am I supposed to trust it?

More importantly, how can I meaningfully wrap this into a RESTful API if I'm not even sure what the command-line interface really is? It's the shell. It could use environment variables that are otherwise undocumented. They would be discovered when they cause a crash at run time. Crashes that become an HTTP 500 status code and a traceback error message in the web log.

The "business direction" sounds like an attempt to trump the technical discussion with a business consideration like "cost" or "benefit". It should be pretty self-evident that 1,000's of lines of shell script code is technical debt.

The minimally viable replacement will probably be a similarly-sized of Python script that mindlessly mirrors the original shell script. It's sometimes quite hard to tell what purpose a shell function really serves. The endless use (and re-use) of global variables can make state change hard to follow. Also, the use of temporary files which are parsed (and reparsed) as a way to set state is a serious problem.

What's important is that the various OS services used by the shell script are mockable. Which means that each of the 100 or so individual functions within the script can be tested in isolation. Once that's out of the way, refactoring becomes plausible.

Let's savor those words for a moment: Tested. In. Isolation.

Ahhh.

The better replacement is (I think) about 250 lines of Python -- perhaps less -- that perform the real 8-step process that we're automating.  Getting rid of bash language cruft is challenging, but essential.