Thursday, December 25, 2014

Intro to Python Tutorial

http://sophieclayton.github.io/2015-01-15-uw/novice/python/index.html

A very nice tutorial. It's focused on a specific problem, and it covers the solution technology in some depth. I think the focus and depth are important. It's often tempting to cover the technical features without really solving a problem.

In the "real world," we're often pressured to put the first MVP into production and move on to the next problem. I put "real world" in "scare quotes" because this approach is as dumb as a bag of hammers. Managers who insist on installing or shipping the first Minimally Viable Product are essentially purchasing technical debt instead of a solution.

I like the tutorial because it includes additional aspects like quality assurance. It's called "Defensive Programming," but it's really QA. I like to call it "fit and finish." The job's not over until there are automated tests to demonstrate that it's over.

The Software Carpentry site as a whole looks quite good. It seems to have numerous high-quality tutorials.

Tuesday, December 23, 2014

Packt Deals

Okay. This seems shameless. But.

Here's the link http://bit.ly/1zg0mpA straight to my book information page on www.PacktPub.com


I'm slowly coming to grips with the reality of marketing.

Friday, December 19, 2014

Dev of the Week

http://java.dzone.com/articles/dev-week-steven-lott

Yes. Everyone is famous for 15 minutes.

And. "On the Web, everyone will be famous to fifteen people."

Thursday, December 18, 2014

Making Learning Accessible

Visit Packt Publishing today for the $5 eBook Bonanza.  https://www.packtpub.com.

eBooks and videos at a discount through something like the 6th of January.

We autodidacts are rejoicing.

Specifically, I can look at some of the Scala and Hadoop titles. I'm working with folks who have Hadoop but I've heard rumors that they're leaning toward Scala, also. Does that mean Apache Spark? Or does it mean Scalding?

I'm biased toward using Python with Hadoop; but I appear to be in the minority on this. Time to do some additional learning.


Tuesday, December 16, 2014

The Getting Started Problem

How does one get started developing software? What's the first step?

When you come to this craft -- or sullen art -- without a background except as a user, how do you get started writing code?

It's not easy. Indeed, developing software may be one of the hardest things there is. Really, really hard.

Why? Consider the orders of magnitude involved. From sub-microsecond clock ticks (around 10⁻⁹ seconds on modern hardware) to software that's supposed to continue running for roughly 8,766 hours a year without interruption. That's about 31,557,600 seconds, or 3×10⁷. The span from 10⁻⁹ to 3×10⁷ is about 16 orders of magnitude.

Or consider the scope of storage. We wrangle over individual bytes in a dataset that spans terabytes. That's 12 orders of magnitude.

When engineers build a 13,000' long bridge, are they looking at it from scales of 10⁻⁵ to 10⁺⁵? Do they even care what's 21 miles away? They might care about things at the scale of 10⁻⁵, since that's about an inch. But 10⁻⁷? A 100th of an inch? I could be wrong, but I have doubts.

I won't go so far as to say bridge building is particularly easy. It's safety critical work. People die when things go wrong. Consequently, it's regulated by civil engineering standards. Bridge designs are limited to proven patterns. You can't spring something new on the world and expect anyone to pay money for it or trust their life to it.

If you're with me so far, you see my point: software is different. And that makes it particularly hard. People do learn elements of it. How does this happen?

Two Paths Diverge

I see two separate paths:

  • More formal, and
  • Less formal.
The more formal path includes the kind of curriculum you find at big CS schools. Formal treatment of algorithms and data structures. Logic and Computable Functions. The essentials of Turing Completeness.


The less formal path starts with -- essentially -- random hacking around, trying to get stuff to work. Some folks argue that a curriculum of structured exercises isn't "random" hacking around. I suggest that a curriculum of structured exercises can be the formal path concealed under a patina of hackeriness. On the other hand, a set of exercises can be successful at training programmers; if it doesn't follow a formalized structure, it's merely a small step from random. 

[Random doesn't mean "bad;" it means "informal" and "unstructured."]

Some folks learn well in a formal, structured approach. They like axiomatic definitions of computability, and they can get a grip on how to map the abstractions of computing to specific languages and problem domains. They read content at http://www.algorist.com and see applications of principles.

Other folks can be shown the formal background that makes their random hacking fit into a larger pattern. When shown how some things fit a larger pattern, they're often happy to work in a new context with an expanded repertoire of data structures and algorithms. They read content at http://www.algorist.com and look for solutions to problems; the formal patterns will emerge eventually.

Not all folks respond well to having their informal notions challenged. Some folks have ingrained bad habits and prefer to fight to the death to avoid change. A sad state of affairs, but remarkably common. They didn't understand linked lists at some point and steadfastly refuse to use the java.util.LinkedList class. This is what software religious wars are about. Some trolls truly and deeply love an uninformed religious war.

Chickens and Eggs

Is this a chicken-and-egg problem? 
  • You can't really appreciate the formal foundations until you have some hands-on coding experience.
  • You shouldn't dirty your hands with implementation details until you have the proper theoretical foundations.
That seems potentially reductionist and uninformative. Or. Perhaps there is a nugget of truth in this. Perhaps one is actually foundational.

Eggs, to be specific, show the fresh mutations. The egg comes first from a chicken-like precursor that's not properly a chicken. 

What's that precursor to programming in Python? CS Fundamentals? Hacking around? I suggest that the way we acquire languages is important here.

Language Skills

Software languages are a small step from natural languages. As with learning natural languages, formal grammar may not be as helpful as engaging in conversations. Indeed, for natural languages, formal grammars are an afterthought. They're something we discover about a corpus. We impose the discovered grammar rules on ourselves (and others) to be understood in a context of other writing (and speaking.) 

Natural language grammar isn't timeless and immutable. People throw their hands up in despair at the erosion of grammar and language. They're -- of course -- just being reactionary. Language evolves. The loudest complainers are the ones who didn't pay attention for a long time and suddenly (somehow) realized they don't know what "WTF" means. LOL.

With an artificial language, the grammar is formalized. It has a first-class existence in compilers, interpreters and other tools. 

However, I think the bits of our brain that assimilate grammar work best from concrete examples. A formal grammar definition -- while helpful -- isn't the way to start. I think that a less formal, "try this" suite of exercises is perhaps the best way to learn to program.

As an author, I'm beholden to my publisher's notions of what sells. Examples sell. See almost everything from Packt. Working examples are solid gold. 

These are not necessarily problems for the reader to tackle and solve. They're examples to study.

The conundrum with attempting to solve problems is the attempting part. It's hard to set out a list of "solve these problems and master programming" problems and hope folks get through them. What if they fail? Clearly, you'd provide answers. In that case, you'd be back at examples to study. Hmm.

I have intermittent interest in my older Building Skills in Python book. Partly because it's got extensive exercises in each chapter. I get donations. I get inquiries. The exercises seem to resonate in a small way.

I've done about 22 levels of the Python Challenge (I'll write about that separately.) It's not a great way to learn from scratch. You need to know a lot. And you need a lot of hints. 

I've done almost 70 levels of Project Euler. It might be a better way to learn programming because the easy problems are really easy. No guesswork. No riddles. No steganography. The answers are totally cut-and-dried, unambiguous, and absolute. However, there's no easy guidance for learners. Either you have an answer, and want help on improving it, or ... well ... you're stuck and frustrated. 

Structured Sequence of Exercises

What strikes me as a possibility here is a structured series of exercises that lay out the foundations of computer science as realized in a specific programming language.

Puzzle-style. With extensive hints. Background readings, too. But with absolutely right answers. And a score-keeping system to show where you stand. 

No tricky riddles. No quizzes to proceed. You could go on to advanced material without mastering the foundations, if you wanted.

I've got a bunch of exercises and examples in my Building Skills books. Plus some of the examples in my Packt books can be modified and repurposed. Plus. Projects like HamCalc contain a wealth of simple applications that can be adjusted to show CS fundamentals.

Perhaps relevant is this: https://www.google.com/edu/programs/exploring-computational-thinking/. I'm not sure precisely how it fits, since it seems to be more aimed at providing a general background, rather than teaching programming language skills. They decompose the skills into four specific techniques:
  • Decomposition: Breaking a task or problem into steps or parts.
  • Pattern Recognition: Make predictions and models to test.
  • Pattern Generalization and Abstraction: Discover the laws, or principles that cause these patterns.
  • Algorithm Design: Develop the instructions to solve similar problems and repeat the process.
Perhaps this is relevant: http://interactivepython.org/courselib/static/pythonds/index.html.  I haven't read this carefully, but it seems to be expository rather than exploratory.  It's really thorough. It has quizzes and self-checks. 

I think there's a big space for publishing lots of simple recreational programming exercises as teaching tools.

Thursday, December 11, 2014

Wow. Two-Word Question. Profound Insight.

I'm working on yet another Python book. This one looks at functional programming in Python. It doesn't really go with Mastering Object-Oriented Python and Python for Secret Agents because the focus isn't on Python's strong suit.

In chapter one, a reviewer had this two-word question:

"yield from?"

What? What does "yield from" mean?

Oh.

Wow.

https://docs.python.org/3/whatsnew/3.3.html#pep-380-syntax-for-delegating-to-a-subgenerator

I had utterly missed this profound, important feature.

I guess I have been too blasé in skimming the release notes.

That's embarrassing.  And it only took two words to reveal my mistake.

I then had to review all 113 yield statements in 72 files of examples that go with the book. That means most chapters will get touched to revise an example to show yield from iter instead of the older for x in iter: yield x template.
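For context, here's the shape of that change as a minimal sketch (the flatten functions are hypothetical examples, not from the book):

# The older template: an explicit loop that re-yields each item.
def flatten_old(iterables):
    for iterable in iterables:
        for item in iterable:
            yield item

# The PEP 380 form, Python 3.3 and later: delegate to the sub-iterable.
def flatten_new(iterables):
    for iterable in iterables:
        yield from iterable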

This also changes the Tail Call Optimization material. The explicit for was actually kind of nice for showing how TCO is implemented in Python. The yield from makes it a little less clear.

Some reviewers consider TCO so fundamental that it belongs in chapter 1. The omission of detailed analysis of Python's TCO approach was considered a significant flaw. Other reviewers seemed happy setting discussion of TCO aside for later.

The Functional Python Conundrum

This book is going to be difficult. The ratings from the reviewers were low. Really low. It looks like I've got a lot of work to do. Finding the target audience will be difficult.

One reviewer asked -- in effect -- why would someone who knew functional LISP ever use Python? I don't think there's a big audience of disgruntled LISP programmers, so that's not a relevant question.

Viewed from the other direction, it's hugely important. Why would a Python programmer adopt functional design patterns? That's the question that needs to be answered clearly.

And from the reviews of chapter 1, it wasn't addressed clearly enough.

Thursday, December 4, 2014

Architectural Principles, Spring Framework, and Jersey JAX-RS

See this: http://www.moschetti.org

Attended a meeting with Buzz. Not stated in his blog (in an obvious way) was something he said about not being a fan of big frameworks. I didn't write down his punchline, but it was a pretty pithy summary of the framework tradeoff.

IIRC, it was essentially this: you can wrestle with one or both of these technical problems.
  • Boilerplate Code
  • A Framework's Conceptual Model
Either you have to create your own libraries or you have to learn someone else's. This is in addition to wrestling with the business problem you're supposed to be solving.

Buzz's point seemed to be that you can often manage your own boilerplate more easily than you can come to grips with a framework. If one member of your sprint team handles reusable services, you can just ask them for a feature. You don't have to spend an hour reading other people's struggles.

After spending three months getting my brain wrapped around Spring Framework, I'm inclined toward partial, qualified agreement. Frameworks seem to have limited value until you're an expert in using them.

Layers and Layers

When wrestling with a new feature, you are forced to assume that you've understood its semantics. When you mock a framework element for test purposes, you're reduced to hoping that your unit tests are sufficient. A unit test of a mocked framework element only tests your assumptions. If you're not using the element's API correctly, your tests can't show that the framework will break or raise exceptions.

For new technology, you need to start with a technical spike to understand the framework. Then you can write unit tests that test against known framework behavior. Then you can write the real code that's based on the unit tests that are based on a spike that shows how the framework really works.

Using a technical spike for discovery and debugging can be challenging. You don't want to drag around your entire application just to create a spike. But you don't want to drop back to a trivial "hello world" spike that doesn't really apply to your context. You have to balance simplicity against realism.

For example, making JAX-RS requests to web services is aggravating to debug. You can spend many hours looking at boilerplate 401 and 404 errors wondering what's missing. You can't write the unit tests until you finally get something to work. Once you have something, you can replace real objects with mock objects.

If you already know JAX-RS features, it's easy. If you already know the RESTful service, it's not too bad. If you know neither JAX-RS nor the service, you don't have any clue which direction to turn. Did I misuse JAX-RS? Is something wrong in the request? Am I missing a required header? Did I leave something off the Accept header?

I finally had to give up creating spikes and debugging RESTful requests in Java. It turned out to be simpler to write a version of the REST client in Python. I used this to figure out how the real service really worked. Given a working Python spike, I could then save those interactions for WireMock.
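A minimal sketch of that kind of Python spike, using only the standard library; the host, path, and header values here are hypothetical placeholders, not the real service:

import http.client
import json

conn = http.client.HTTPSConnection("service.example.com")
conn.request("GET", "/api/v1/things/42",
    headers={"Accept": "application/json"})
response = conn.getresponse()
# Status, headers, and body are exactly the details needed
# to record the interaction for WireMock later.
print(response.status, response.reason)
print(response.headers)
print(json.dumps(json.loads(response.read().decode("utf-8")), indent=2))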

Once I had a clue how the service worked, I could also write a mock server for some more sophisticated experiments. This was useful for debugging problems based on a failure to understand JAX-RS.

Yes. Rather than struggle with the framework, I wrote the client once in Python and then rewrote the client again in Java. It seemed quicker than trying to debug it in Java.

One contributing factor is the 1m 30s build time in Maven. Compare that with interactive Python at the >>> prompt.

Perhaps a smaller framework would have been better.

Thursday, November 20, 2014

MongoDB and Schema Validation

One part of the MongoDB value proposition is being freed from the constraints of a database schema.

There's a "baby and bathwater" issue here. While a schema can become a low-value constraint, we have to be careful about throwing out the baby when we throw out the bathwater. A schema isn't inherently evil. A schema that's hard to modify can become more cost than benefit.

When working with document databases like MongoDB or CouchDB, we're freed from the constraints of a schema.

But.

Do we really want the kind of freedom that can devolve to anarchy?

Or.

Do we want some kind of constraint checking capability to provide some additional run-time assurance that the applications are using the database properly?

Read this http://realprogrammer.wordpress.com/tag/json-schema/ and this http://www.litixsoft.de/english/mms-json-schema/.

My thesis is that some schema validation may have some value.

My plan is this.

1. Define the essential collections for the various documents using ordinary document design practices.

2. For each document class, we'll have two closely associated collections:

The primary collection, call it "class" because it matches one of the application classes.
  • An additional "class.schema" collection. This collection will contain JSON-schema documents. See http://json-schema.org for more information.
  • For audit, and sequential key generation, we may have some additional associated collections.
Because JSON schema documents have a "$schema" field, we can replace the "$" with "\uFF04", the "FULLWIDTH DOLLAR SIGN" character, when saving the JSON-schema document into a MongoDB database. We can do the inverse operation when finding the schema documents in the database.
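Here's a minimal sketch of that substitution, assuming dict-style documents as they come from pymongo; it touches only field names, not values, and leaves nested lists as an exercise:

def encode_schema(doc):
    # Replace a leading "$" in field names before saving to MongoDB.
    return {
        (k.replace("$", "\uFF04", 1) if k.startswith("$") else k):
            (encode_schema(v) if isinstance(v, dict) else v)
        for k, v in doc.items()
    }

def decode_schema(doc):
    # The inverse: restore "$"-prefixed names after a find().
    return {
        (k.replace("\uFF04", "$", 1) if k.startswith("\uFF04") else k):
            (decode_schema(v) if isinstance(v, dict) else v)
        for k, v in doc.items()
    }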

3. Use a tool like https://github.com/Julian/jsonschema to validate the schema. The document-level validation could be embedded in the application for each transaction. However, it seems better to trust the code and the unit testing of the code to enforce schema rules. We'd use this validation periodically to check the schema. Significant events should include a validation pass: for example, before and after any schema changes. This way we can be sure that things are continuing to go properly.

It would be strictly an additional layer of checking.
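A minimal sketch of that additional checking layer, reusing the decode_schema() sketch above and assuming pymongo plus the jsonschema package; the database and collection names are hypothetical:

from pymongo import MongoClient
from jsonschema import validate, ValidationError

db = MongoClient()["app_db"]

# Recover the stored JSON-schema document, restoring its "$" field names.
schema = decode_schema(db["invoice.schema"].find_one())

for document in db["invoice"].find():
    try:
        validate(document, schema)
    except ValidationError as err:
        print("invalid:", document.get("_id"), err.message)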

Thursday, November 13, 2014

Declarative Programming

I know that some folks swear by declarative programming. They like the ideas behind ant (and make) and SCons and related examples.

You can google for "ant v. maven v. gradle" and find people griping about which is more declarative. The point of the whining is that more declarative == good and any trace of procedural or imperative programming == bad.

All, of course, without any really good justification of why declarative is better. It's assumed that declarative simply has innumerable advantages. And yes, I've started with http://en.wikipedia.org/wiki/Declarative_programming. The issue isn't simply moot; the justification is weak.

Perhaps there's an awful bias toward imperative and functional programming. After all, the big thinkers in computer science tend to favor the imperative and functional schools of thought. Maybe declarative suffers from some bias.

Or maybe declarative has limited utility.

There. I said it. Limited utility.

I think a functional approach might be better, faster and simpler.

Side-bar Ranting

The code is below. You can skip down to the "The Functional Build System" section and not miss much.

Declarative programming seems applicable to the cases where the ordering of operations can be easily deduced. It seems like the significant value of declarative programming is to rely on an optimizing compiler to rearrange the declarations into properly-ordered imperative steps. From this viewpoint, it seems like ant/maven/gradle are optimizers that look at the dependencies among transformation functions and then apply the functions in the proper order.

It seems like we're writing expressions like these:

x.class = java(x.java)
xyz.jar = jar(x.class, y.class, z.class, ... )
app.war = war(xyz.jar, abc.jar, ... )

and then turning them over to a clever compiler (like Haskell) to work out a total order among the expressions that will build the right thing for us.

There's a potential difference between manually structuring a script to get all of the steps in order and allowing the compiler to arrange things properly based on some formal semantics behind each expression.

It's a potential difference because most folks who deal with ant/maven/gradle tend to put things in more-or-less the right order so that others can figure out what the hell is going on. In the trivial cases where we're building simple web sites, the default rules have evolved to the point where they work in almost all cases, so we don't even look at the configuration of the tools. We hit Ctrl+B knowing that it's all set up properly.

Some Requirements

A number of applications have ant-like (or make-like) aspects but don't really cry out for ant with customized actions. We might be doing data warehouse loads which involve an ant-like sequence of processing steps to do transformations, loads, and produce final summaries and confirmations. We can, of course, write this all in first-class Java code. The hard way.

It's not terribly complex. A class to define a dependency. A suite of plug-in strategies. Some static definitions of the actual rules. Been there. Done that.

Pragmatically, the declarative style suffers from a limitation of being rather rigid in applying a fixed set of rules. A more script-like implementation can be more helpful to support reruns, debugging, problem-solving and the inevitable special cases and exceptions. After a storage failure -- and the reruns required to get the warehouse back up-to-date -- one sees more need for script-like flexibility and less need for overly simplistic rigidity.

Another end of the spectrum is individual steps all manually coordinated with a tool like BMC's Control-M. This requires endless manual intervention to make sure all the various tasks are defined properly in Control-M.

Somewhere near the middle is a configurable application with some processing rules to give it flexibility. But some defined structure to remove the need for carefully planned manual intervention and deep expertise.

The Functional Build System

We can imagine an ant-like build system defined functionally.

The core is a function that implements build-if-needed rules:

# Shared imports for the examples that follow.
import datetime, glob, logging, os, subprocess
from functools import partial

def build_if_needed( builder, target_file, *source ):
    if target_ok( target_file, *source ):
        return "ok({0},...)".format(target_file)
    builder( target_file, *source )
    # functools.partial objects have no __name__; fall back to the class name.
    name = getattr( builder, '__name__', builder.__class__.__name__ )
    return "{0}({1},...)".format( name, target_file )


We can use this function to define the essential dependency: use a builder function to create some target if it's out-of-date with respect to the sources. The return value forms a kind of audit log.

This relies on some helper functions: target_ok() checks the modification times of files. The various builders do the various kinds of operations required to make the target from the sources.

Here's the target_ok() function

def target_ok( target_file, *source_list, logger=logging ):
    try:
        mtime_target= datetime.datetime.fromtimestamp(
            os.path.getmtime( target_file ) )
    except Exception:
        return False
    # If a source doesn't exist, we throw an exception.
    times = (datetime.datetime.fromtimestamp(
            os.path.getmtime( source ) ) for source in source_list)
    return all(mtime_target > mtime_source for mtime_source in times)


I think this function is what started me thinking about a functional approach. It could be a method of a class. But. It seems like a very functional design. It could be reduced to a single (long) expression.

The builders are composite functions. They need to combine the subprocess.check_call() with a function that builds the command. We can do functional composition several ways in Python: we can combine functions via decorators. We can also combine functions via Callables. We could write a higher-order function that combines the check_call() with a function to create the command.

We'll opt for the higher-order function and create partially evaluated forms using functools.partial().

Here's a typical case:


def subprocess_builder( make_command, target_file, *source_list ):
    command= make_command( target_file, *source_list )
    subprocess.check_call( command )


This is a generic function: it requires a function (or lambda) to build the actual command. We might do something like this to create a specific builder.


def command_rst2html( output, *input ):
    return ["rst2html.py", "--syntax-highlight=long", "--input-encoding=utf-8", input[0], output]

rst2html= partial( subprocess_builder, command_rst2html )


This rst2html() function can be used to define a dependency rule. We might have something like this:


files_txt = glob.glob( "*.txt" )
for f in files_txt:
    build_if_needed( rst2html, ext_to(f,'.html'), f )


This rule specifies that *.html files depend on *.txt files; when needed, use the rst2html() function to build the required html file when the txt file is newer.

The ext_to() function is a two-liner that changes the extension on a filename. This helps us write "template" build rules rather than exhaustively enumerating a large number of similar files.


def ext_to( filename, new_ext ):
    name, ext = os.path.splitext( filename )
    return name + new_ext


What we've done here is define a few generic functions that form the basis for a functional build system that can compete against ant, make or scons. The system is not even close to declarative. However, we only need to assure that our final build_if_needed() functions have a sensible ordering, something that's rarely a towering intellectual burden.

The individual customizations are the build commands like rst2html() where we created the command-line list of strings for subprocess.check_call(). We can just as easily build functions which run entirely in the process or functions which farm the work out to separate processes via queues or RESTful web services.

Bottom Lines

It appears that declarative programming isn't terribly helpful. There may be a niche, but it seems to be a small niche to me.

I'm sure that an object-oriented approach to this problem isn't any better. I've written a shabby-make version of this, and it's bigger. There's just more code and it's not significantly more clear what's going on. Inheritance can be difficult to suss out.

Python seems to be a good functional programming language. It did this very nicely.

Thursday, November 6, 2014

Hard Copy Books

I've now got my actual souvenir hard copies of my two Packt books:

https://www.packtpub.com/application-development/mastering-object-oriented-python

https://www.packtpub.com/hardware-and-creative/python-secret-agents

So far, so good. I've got one more title in the works. After that, I think I'll have to take a small break and do some development work and learn more new stuff.

I've been advised to square away my Amazon.com author's page.

http://amazon.com/author/steven_f_lott

I think this will work to help folks post questions, comments, and suggestions.

Thursday, October 30, 2014

My First Webcast

http://www.oreilly.com/pub/e/3255

I'm a pretty good public speaker. But I've avoided webcasting and podcasting because it's kind of daunting. In a smaller venue, the audience members are right there, and you can tell if you're not making sense. In a webcast, the feedback will be indirect. In a podcast, it seems like it would be nonexistent.

Also, I find that programming is an intensely literate experience. It's about reading and writing. A podcast -- listening and watching -- seems very un-programmerly to me. Perhaps I'm just being an old "get-off-my-lawn-you-kids" fart.

But I'll see how the webcast thing goes in January, and perhaps I'll try to do some podcasts.

Thursday, October 23, 2014

Currying and Partial Function Evaluation

Old. But still interesting.

Partial Function Application is not Currying

It seems like hair-splitting. However, the distinction between bound variables and curried functions does have some practical implications.
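A minimal sketch of the distinction in Python; volume() is a hypothetical example:

from functools import partial

def volume(length, width, height):
    return length * width * height

# Partial application: bind some arguments now, supply the rest later.
box_10 = partial(volume, 10)
print(box_10(2, 3))              # 60

# Currying: a chain of one-argument functions.
def curried_volume(length):
    return lambda width: lambda height: length * width * height

print(curried_volume(10)(2)(3))  # 60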

I'm looking closely at PyMonad and the built-in functools library.

I'm finding some benefits in understanding functional programming and how to apply functional design patterns in Python. I'm also seeing the important differences between compiled -- and optimized -- languages and Python's approach. I'm slowly coming to understand how a (simple) recursive design is flattened into a for loop as part of manual tail-recursion optimization.
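Here's a minimal sketch of that flattening, using a hypothetical tail-recursive sum:

# The tail-recursive design: hits Python's recursion limit for large n.
def sum_to(n, acc=0):
    if n == 0:
        return acc
    return sum_to(n - 1, acc + n)  # the tail call

# The manually optimized version: the tail call becomes a for loop.
def sum_to_flat(n):
    acc = 0
    for i in range(n, 0, -1):
        acc += i
    return acc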

The functional programming goodness is giving me first-class headaches when trying to apply the lessons learned to Java, however. I suppose I should look closely at http://www.functionaljava.org and https://code.google.com/p/functionaljava/. There are claims that it's dangerously inefficient. Also, the customer who insists on Java has a (very) limited set of allowed libraries; if this isn't on the list, then the whole concept is a non-starter.

Thursday, October 16, 2014

Using Bottle as a miniature demo server

Let's talk small.

When writing API's, it sometimes helps to have a small demo web site to show the API in a context that's easy to visualize. API's are sometimes abstract, and without an application to provide some context, it can be unclear why the path looks like that or why the JSON document has those fields.

I want to emphasize the "small" part of the small demo. A small page or two with some HTML forms and a submit button. Really small.

The actual customer-facing apps (mobile, mobile web, and full web site) are being built by lots of other people. Not us. They're big. We build the API's (there are a lot) that support the data structures that support the processing that supports the user experience.

Building fake mobile apps is right out. We're not going to lard on Android SDK or Xcode development environments to our already overburdened laptops. We build backend API's.

Building a fake mobile web or full web site is appealing. What makes it complex is that the UX folks are building everything in Angular.js. If we want to properly implement a form, we would have to master Angular just to do a demo for the product owner.

No thanks. Still too far afield for API developers. We're focused on mongo and JSON and performance and scalability. Not Angular.js and the UX.

What we want to do is build a small web server which runs just a few pages plucked out of the UX demo code so that we can show how interactions with a web page put stuff in a database. And vice-versa: stuff in the database will show up on a web page.

"Really?" we get asked. Some folks look askance at us for wanting to put a small demo site together.

"Yes," we answer. "Our product owner has a big vision and we're breaking that into a bunch of little API's. It's not perfectly clear how we're building up to that vision."

It's not perfectly clear how some of this works. Folks outside the scrum team have distracting questions. We want to have a page or two where we can fill in a form and click submit and stuff happens. This is far easier to explain than showing them Postman or SoapUI and claiming that this will support some user stories.

And as we grow toward the epic, the workflow aspects of this will grow. The stuff that admin "A" does after user "U" has made an initial request. Or the stuff that internal user "I" does after external user "X" has done something. But really, it's just a few small web pages. Small.

Imagine the demo. On laptop #1, we'll show user "X". On laptop #2, we're running a Mongo shell to query what's in the db. On laptop #3 we're showing user "I". The focus is really the API's. And how the API's add up to an epic collection of stories.

Serving some HTML pages

Just to make it painful, we can't simply grab the demo web pages out of the UX team's SVN repository. Why not? First, it's an Angular app. We can't just grab some HTML and go. The demo pages are served via node.js with Bower, so it's not even clear (to us) what the complete deployment looks like.

So. We cheated. We took a screen shot. We trimmed the edges of the page as .PNG files. We wrote our own form and cobbled in enough CSS to make it look close. We're not here to fake out the UX. We just want to enter some data and have it tickle our API. (Indeed, we have a "Not The Real Experience" on some pages.)

Initially, some of the team members tried serving these small pages with WebLogic. Then Jetty. It's not bad. But it's Java. It takes forever to build and deploy something after a trivial change. There are a lot of moving parts even with Jetty, and not all of them are obvious.

Since we're building "enterprise" API's, we're deeply enmeshed with every feature of the Spring Framework. Our STS/Eclipse environments are fat with add-ons and features.

While the Spring Framework ideal is to allow a developer to focus on relevant details and have the irrelevant details handled automagically, the magic almost gets in the way. These are small applications that are little more than a few static pages with forms and a submit button. Spring can do it, of course. But we're often testing out the actual API's in a Jetty server (or two). If the demo site requires yet another instance of Jetty with yet another configuration, our ability to cope diminishes.

How can we get back to small?

Python and Bottle

Python has several web servers built-in. We can use http.server. We can use wsgiref. Both of these are almost OK for what we want to do.

We can do better with two small downloads: Bottle and Jinja2. With these, we can build simple HTML pages that show some data. We can build simple servers that collect form data, use http.client to make API requests, and write copious logging details. We can write little bottle apps that handle just GETs and POSTs simply.
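A minimal sketch of the kind of tiny server this enables; the route, form, and port are hypothetical placeholders:

from bottle import Bottle, request, run

app = Bottle()

FORM = '''
<form method="POST" action="/demo">
  <input name="name"/><button type="submit">Submit</button>
</form>
'''

@app.get('/demo')
def show_form():
    return FORM

@app.post('/demo')
def submit():
    name = request.forms.get('name')
    # Here we'd use http.client to tickle the real API and log the details.
    return "Submitted: {0}".format(name)

run(app, host='localhost', port=8080, debug=True, reloader=True)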

This is suitably small.

We can share the module with the Bottle object and the HTML mock-up pages. We can fire up the app in an instant on anyone's laptop, no matter what else they're running. We can tweak the server to adjust the logging or the API request or the form.

We actually run the server from within IDLE. Make a change and hit F5 to redeploy. It's small. It's fast. And it doesn't involve the huge complexities associated with Java.

Bottle doesn't do much. But what little it does do is a pretty tidy fit with tiny little demonstrations of super-simple HTML interactions.

Thursday, October 9, 2014

Scipy.optimize.anneal Problems

Well, not really "problems" per se. More of a strange kind of whining than a solvable problem.

Here's the bottom line. Two real quotes. Unedited.

Me: "> There's a way to avoid the religious nature of the argument. "
Them: "Please suggest away."

Really. Confronted with choices between anneal and basin hopping, they could only resort to hand-waving and random utterances.

The tl;dr summary is this:
  • "scipy.optimize.anneal only has three hard-wired schedule variants: ‘fast’, ‘cauchy’ or ‘boltzmann’."
  • My initial response was "And..."? 
  • "Not being able to specify my own cooling schedule severely limits the usability of the code"
A complaint that causes me deep pain: "severely limits" with no actual evidence. And no plan to get evidence beyond a religious wars style argument.

There may have been a technical question on the class definitions inside scipy. But that question was overshadowed by the essential problems with what they were doing. Or, more properly, what they were whining about.

Did they really have a problem with a state of the art solution to optimization problems? More specifically:

1. Did they read the "Deprecated" part of the scipy documentation? This is a hint that there are better solutions available. Perhaps they could start there instead of whining.
2. Did they actually read the details of the three schedules in the "Notes" section? Do they seriously think they've got a new approach that does not fit any of the various parameters of the three installed algorithms? I don't mean to be too rude, but... Do they really think they're that scale of genius?
3. Do they have any evidence that their problem is so unlike the typical case handled by basin hopping?
4. Do they have any evidence that their solution totally crushes the already-built code?

I think the answers to all four questions were "no".

I'm not even certain that I could help them with some of the Python technology required to extend scipy. But, I'm sure I cannot actually do anything of value under the circumstances that (a) they have not really tried the established algorithms and (b) they're already sure that the established algorithms can't work based on religious-wars arguments.

It was clear that they never read the "Notes" section on this SciPy page: http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.anneal.html#scipy.optimize.anneal
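For the record, the non-deprecated alternative is only a few lines. A minimal sketch using scipy.optimize.basinhopping, with the one-dimensional objective function taken from the scipy documentation:

import numpy as np
from scipy.optimize import basinhopping

def f(x):
    return np.cos(14.5 * x - 0.3) + (x + 0.2) * x

result = basinhopping(f, x0=1.0, niter=100)
print(result.x, result.fun)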

One of the emails in the exchange had a kind of hand-waving justification for the problem domain being somehow unique. Lacking any actual evidence, I'm inclined to believe they were just hoping that their problem domain was unique, allowing them to dismiss the available Python solution and do something uniquely bad. 

(Optimization is not my area of expertise. Perhaps I'm way off base; perhaps the existing solutions are so problem-domain specific that everyone has to invent new technology. Maybe established solutions really don't work.)

More importantly: there was no actual evidence that the existing optimization (either annealing or basin hopping) failed to solve their problem.

But the worst part was this:

"From, a business perspective, I need to know about SA because our competitor stole our biggest client using it."

They don't actually want to innovate. They only want to try and catch up by making religious war arguments over the deprecated simulated annealing vs. basin hopping.

Thursday, October 2, 2014

Not sure what went wrong, but...

Read this: http://quantlabs.net/blog/2014/09/here-is-why-i-gave-up-on-python-aka-dogs-breakfeast-of-a-so-called-programming-language/

Not sure what's going on here.

"Script I want to run" seemed clear. http://vispy.org/examples/basics/scene/surface_plot.html#sthash.kIzbd33O.dpuf

The rest seemed like ill-advised trips down numerous ratholes. In particular, anything that involved Python Tools for Visual Studio seems like a waste of time and brain calories.

It's not clear at all what's not working. That's perhaps the most frustrating thing about this kind of post.

The final note, "Decisions decisions..." pointed out a simple confusion that befuddles the technically-minded. Too many details.

The decision between Python2 and 3 is trivial. There are a lot of details, but they're irrelevant for making the decision.

What package are you trying to use? It's hard to tell, but it looks like it's vispy. If so, that's all that matters. vispy works with Python3.3, requires numpy, and a "backend." Install just that and nothing more. In particular, avoid junk like Visual Studio.

The Dog's Breakfast seems to be the result of chasing down lots of details that aren't too relevant. It's hard to tell. But a scatter-shot post claiming "all this is broken" is a hint that the author wasn't simply following the vispy installation instructions. It could be that they turned something simple into a dog's breakfast by chasing irrelevant technologies all around the garden.

Thursday, September 25, 2014

PyCrypto Experience

Let me start with a wow. PyCrypto is very nice.

Let me emphasize the add-ons that go with PyCrypto. These are as valuable as the package itself.

Here's the story. I was working with a Java-based AES encrypter that used the "PBKDF2WithHmacSHA1" key generator algorithm. This was part of a large, sophisticated web application framework that was awkward to unit test because we didn't have a handy client to encode traffic.

We could run a second web application server with some client-focused software on it. But that means tying up yet another developer laptop running a web server just to encode message traffic. Wouldn't it be nicer to have a little Python app that the testers could use to spew messages as needed?

Yes. It would be nice. But what the heck is the PBKDF2WithHmacSHA1 algorithm?

The JDK says this "Constructs secret keys using the Password-Based Key Derivation Function found in PKCS #5 v2.0." One can do a lot of reading when working with well-designed crypto algorithms.

After some reading, I eventually wound up here: https://www.dlitz.net/software/python-pbkdf2/ Perfect. A trustable implementation of a fairly complex hash to create a proper private key from a passphrase. An add-on to PyCrypto that saved me from attempting to implement this algorithm myself.

The final script, then, was one line of code to invoke the pbkdf2 with the right passphrase, salt, and parameters to generate a key. Then another line of code to use PyCrypto's AES implementation to encrypt the actual plaintext using starting values and the generated key.

Yep. Two lines of working code. Layer in the two imports, a print(), and a bit more folderol because of the character-set issues and URL form encoding. We're still not up to anything more than a tiny script with a command-line interface. "encrypt.py this" solved the problem.
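A minimal sketch of the approach, assuming PyCrypto plus the pbkdf2 add-on; the passphrase, salt, IV, and naive padding here are hypothetical placeholders, not the real parameters:

from pbkdf2 import PBKDF2
from Crypto.Cipher import AES

# Line 1: derive a 16-byte AES key from the passphrase via PBKDF2-HMAC-SHA1.
key = PBKDF2("a passphrase", b"salt1234", iterations=1000).read(16)

# Line 2: AES-CBC encrypt the (padded) plaintext.
cipher = AES.new(key, AES.MODE_CBC, b"16-byte-init-vec")
plaintext = "this".encode("utf-8")
padded = plaintext + b" " * (-len(plaintext) % 16)  # naive space padding
print(cipher.encrypt(padded))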

At first we were a little upset that the key generation was so slow. Then I read some more and learned that slow key generation is a feature. It makes probing with a dictionary of alternative pass phrases very expensive.

The best part?

PyCrypto worked the first time. The very first result matched the opaque Java implementation.

The issue I have with crypto is that it's so difficult to debug. Suppose our Python-generated messages didn't match the Java-generated messages. Well. Um. What went wrong? Which of the various values weren't salted or padded or converted from Unicode to bytes or bytes to Unicode properly? And how can you tell? The Java web app was a black box because we couldn't -- easily -- instrument the code to see intermediate results.

In particular, the various values that go into PBKDF2WithHmacSHA1 were confusing to someone who's new to crypto. And private key encryption means that the key doesn't show up anywhere in the application logs: it's transient data that's computed, used and garbage collected. It would have been impossible for us to locate a problem with the key generator.

But PyCrypto and the add-on pbkdf2 did everything we wanted.

Thursday, September 4, 2014

API Testing: Quick, Dirty, and Automated

When writing RESTful API's, the process of testing can be simple or kind of hideous.

The Postman REST Client is pretty popular for testing an API. There are others, I'm sure, but I'm working with folks who like Postman.

Postman 10 has some automation capabilities. Some.

However. (And this is important.)

It doesn't provide much help in framing up a valid complex JSON message.

When dealing with larger and more complex API's with larger and more complex nested and repeating structures, considerably more help is required to frame up a valid request and do some rational evaluation of the response.

Enter Python, httplib and json. While Python3 is universally better, these libraries haven't changed much since Python2, so either version will work.

The idea is simple.
  1. Create templates for the eventual class definitions in Python. This can make it easy to build the JSON structures. It can save a lot of hoping that the JSON content is right. It can save time in "exploratory" testing when the JSON structures are wrong. 
  2. Build complex messages using the template class definitions.
  3. Send the message with httplib. Read the response.
  4. Evaluate the responses with a simple script.
Some test scripting is possible in Postman. Some. In Python, you've got a complete programming language. The "some" qualifier evaporates.

When it comes to things like seeding database data, Python (via appropriate database drivers) can seed integration test databases, also.

Further, you can use the Python unittest framework to write elegant automated script libraries and run the entire thing from the command line in a simple, repeatable way.

What's important is that the template class definitions aren't working code. They won't evolve into working code. They're placeholders so that we can work out API concepts quickly and develop relatively complete and accurate pictures of what the RESTful interface will look like.

I had to dig out my copy of https://www.packtpub.com/application-development/mastering-object-oriented-python to work out the metaclass trickery required.

The Model and Meta-Model Classes

The essential ingredient is a model class that we can use to build objects. The objective is not a complete model of anything. The objective is just enough model to build a complex object.

Our use case looks like this.


>>> class P(Model):
...    attr1= String()
...    attr2= Array()
...
>>> class Q(Model):
...    attr3= String()
...
>>> example= P( attr1="this", attr2=[Q(attr3="that")] )

Our goal is to trivially build more complex JSON documents for use in API testing.  Clearly, the class definitions are too skinny to have much real meaning. They're handy ways to define a data structure that provides a minimal level of validation and the possibility of providing default values.

Given this goal, we need a model class and descriptor definitions. In addition to the model class, we'll also need a metaclass that will help build the required objects. One feature that we really like is keeping the class-level attributes in order. Something Python doesn't do automatically. But something we can finesse through a metaclass and a class-level sequence number in the descriptors.

Here's the metaclass to clean up the class __dict__. This is the Python 2.7 version because that's what we're using.


class Meta(type):
    """Metaclass to set the ``name`` attribute of each Attr instance and provide
    the ``_attr_order`` sequence that defines the original order.
    """
    def __new__( cls, name, bases, dict ):
        attr_list = sorted( (a_name
            for a_name in dict
            if isinstance(dict[a_name], Attr)), key=lambda x:dict[x].seq )
        for a_name in attr_list:
            setattr( dict[a_name], 'name', a_name )
        dict['_attr_order']= attr_list
        return super(Meta, cls).__new__( cls, name, bases, dict )

class Model(object):
    """Superclass for all model class definitions;
    includes the metaclass to tweak subclass definitions.
    This also provides a ``to_dict()`` method used for
    JSON encoding of the defined attributes.

    The __init__() method validates each keyword argument to
    assure that they match the defined attributes only.
    """
    __metaclass__= Meta
    def __init__( self, **kw ):
        for name, value in kw.items():
            if name not in self._attr_order:
                raise AttributeError( "{0} unknown".format(name) )
            setattr( self, name, value )
    def to_dict( self ):
        od= OrderedDict()
        for name in self._attr_order:
            od[name]= getattr(self, name)
        return od

The __new__() method assures that we have an additional _attr_order attribute added to each class definition. The __init__() method allows us to build an instance of a class with keyword parameters that have a minimal sanity check imposed on them. The to_dict() method is used to convert the object prior to making a JSON representation.

Here is the superclass definition of an Attribute. We'll extend this with other attribute specializations.


class Attr(object):
    """A superclass for Attributes; supports a minimal
    feature set. Attribute ordering is maintained via
    a class-level counter.

    Attribute names are bound later via a metaclass
    process that provides names for each attribute.

    Attributes can have a default value if they are
    omitted.
    """
    attr_seq= 0
    default= None
    def __init__( self, *args ):
        self.seq= Attr.attr_seq
        Attr.attr_seq += 1
        self.name= None # Will be assigned by metaclass ``Meta``
    def __get__( self, instance, something ):
        return instance.__dict__.get(self.name, self.default)
    def __set__( self, instance, value ):
        instance.__dict__[self.name]= value
    def __delete__( self, *args ):
        pass

We've done the minimum to implement a data descriptor.  We've also included a class-level sequence number which assures that descriptors can be put into order inside a class definition.

We can then extend this superclass to provide different kinds of attributes. There are a few types which can help us formulate messages properly.


class String(Attr):
    default= ""

class Array(Attr):
    default= []

class Number(Attr):
    default= None

The final ingredient is a JSON encoder that can handle these class definitions.  The idea is that we're not asking for much from our encoder. Just a smooth way to transform these classes into the required dict objects.


class ModelEncoder(json.JSONEncoder):
    """Extend the JSON Encoder to support our Model/Attr
    structure.
    """
    def default( self, obj ):
        if isinstance(obj,Model):
            return obj.to_dict()
        return super(ModelEncoder,self).default(obj)

encoder= ModelEncoder(indent=2)


The Test Cases

Here is an all-important unit test case. This shows how we can define very simple classes and create an object from those class definitions.


>>> class P(Model):
...    attr1= String()
...    attr2= Array()
...
>>> class Q(Model):
...    attr3= String()
...
>>> example= P( attr1="this", attr2=[Q(attr3="that")] )
>>> print( encoder.encode( example ) )
{
  "attr1": "this", 
  "attr2": [
    {
      "attr3": "that"
    }
  ]
}


Given two simple class structures, we can get a JSON message which we can use for unit testing. We can use httplib to send this to the server and examine the results.
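A minimal sketch of that last step, in the same Python 2.7 style as the examples above; the host and path are hypothetical placeholders:

import httplib

conn = httplib.HTTPConnection("localhost", 8080)
conn.request("POST", "/api/things", encoder.encode(example),
    {"Content-Type": "application/json"})
response = conn.getresponse()
print response.status, response.reason
print response.read()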

Thursday, August 21, 2014

Permutations, Combinations and Frustrations

The issue of permutations and combinations is sometimes funny.

Not funny weird. But, funny "haha."

I received an email with 100's of words and 10 attachments. (10. Really.) The subject was how best to enumerate 6! permutations of something or other. With a goal of comparing some optimization algorithm with a brute force solution. (I don't know why. I didn't ask.)

Apparently, the programmer was not aware that permutation creation is a pretty standard algorithm with a standard solution. Most "real" programming languages have libraries which already solve this in a tidy, efficient, and well-documented way.

For example

https://docs.python.org/3/library/itertools.html#itertools.permutations

I suspect that this is true for every language in common use.

In Python, this doesn't even really involve programming. It's a first-class expression you enter at the Python >>> prompt.

>>> import itertools
>>> list(itertools.permutations("ABC"))
[('A', 'B', 'C'), ('A', 'C', 'B'), ('B', 'A', 'C'), ('B', 'C', 'A'), ('C', 'A', 'B'), ('C', 'B', 'A')]

What's really important about this question was the obstinate inability of the programmer to realize that their problem had a tidy, well understood solution. And has had a good solution for decades. Instead they did a lot of programming and sent 100's of words and 10 attachments (10. Really.)

The best I could do was provide this link:

Steven Skiena, The Algorithm Design Manual

It appears that too few programmers are aware of how much already exists. They plunge ahead creating a godawful mess when a few minutes of reading would have provided a very nice answer.

Eventually, they sent me this:

http://en.wikipedia.org/wiki/Heap's_algorithm

As a grudging acknowledgement that they had wasted hours failing to reinvent the wheel.

Saturday, August 9, 2014

Some Basic Statistics

I've always been fascinated by the essential statistical algorithms. While there are numerous statistical libraries, the simple measures of central tendency (mean, median, mode, standard deviation) have some interesting features.

Well.  Interesting to me.

First, some basics.


def s0( samples ):
    return len(samples) # sum(x**0 for x in samples)

def s1( samples ):
    return sum(samples) # sum(x**1 for x in samples)

def s2( samples ):
    return sum( x**2 for x in samples )

Why define these three nearly useless functions? It's the cool factor of how they're so elegantly related.

Once we have these, though, the definitions of mean and standard deviation become simple and kind of cool.

def mean( samples ):
    return s1(samples)/s0(samples)

def stdev( samples ):
    N= s0(samples)
    return math.sqrt((s2(samples)/N)-(s1(samples)/N)**2)

It's not much, but it seems quite elegant. Ideally, these functions could work from iterables instead of sequence objects, but that's impractical in Python. We must work with a materialized sequence even if we replace len(X) with sum(1 for _ in X).

The next stage of coolness is the following version of Pearson correlation. It involves a little helper function to normalize samples.


def z( x, μ_x, σ_x ):
    return (x-μ_x)/σ_x


Yes, we're using Python 3 and Unicode variable names.

Here's the correlation function.

def corr( sample1, sample2 ):
    μ_1, σ_1 = mean(sample1), stdev(sample1)
    μ_2, σ_2 = mean(sample2), stdev(sample2)
    z_1 = (z(x, μ_1, σ_1) for x in sample1)
    z_2 = (z(x, μ_2, σ_2) for x in sample2)
    r = sum( zx1*zx2 for zx1, zx2 in zip(z_1, z_2) )/len(sample1)
    return r

I was looking for something else when I stumbled on this "sum of products of normalized samples" version of correlation. How cool is that? The more text-book versions of this involve lots of sigmas and are pretty bulky-looking. This, on the other hand, is really tidy.

Finally, here's least-squares linear regression.


def linest( x_list, y_list ):
    r_xy= corr( x_list, y_list )
    μ_x, σ_x= mean(x_list), stdev(x_list)
    μ_y, σ_y= mean(y_list), stdev(y_list)
    beta= r_xy * σ_y/σ_x
    alpha= μ_y - beta*μ_x
    return alpha, beta


This, too, was buried at the end of the Wikipedia article. But it was such an elegant formulation for least squares based on correlation. And it leads to a tidy piece of programming. Very tidy.
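A quick sanity check of these functions on a hypothetical exact line, y = 2x + 1:

x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]   # y = 2x + 1 exactly
print(corr(x, y))          # approximately 1.0
print(linest(x, y))        # approximately (1.0, 2.0): alpha, beta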

I haven't taken the time to actually measure the performance of these functions and compare them with more commonly used versions.

But I like the way the Python fits well with the underlying math.

Not shown: The doctest tests for these functions. You can locate sample data and insert your own doctests. It's not difficult.

Thursday, July 24, 2014

Building Probabilistic Graphical Models with Python

A deep dive into probability and scipy: https://www.packtpub.com/building-probabilistic-graphical-models-with-python/book

I have to admit up front that this book is out of my league.

The Python is sensible to me. The subject matter -- graph models, learning and inference -- is above my pay grade.

Asking About a Book

Let me summarize before diving into details.

Asking someone else if a book is useful is really not going to reveal much. Their background is not my background. That they found it helpful/confusing/incomplete/boring isn't really going to indicate anything about how I'll find it.

Asking someone else for a vague, unmeasurable judgement like "useful" or "appropriate" or "helpful" is silly. Someone else's opinions won't apply to you.

Asking if a book is technically correct is more measurable. However. Any competent publisher has a thorough pipeline of editing. It involves at least three steps: Acceptance, Technical Review, and a Final Review. At least three. A good publisher will have multiple technical reviewers. All of this is detailed in the front matter of the book.

Asking someone else if the book was technically correct is like asking if it was reviewed: a silly question. The details of the review process are part of the book. Just check the front matter online before you buy.

It doesn't make sense to ask judgement questions. It doesn't make sense to ask questions answered in the front matter. What can you ask that might be helpful?

I think you might be able to ask completeness questions. "What's omitted from the tutorial?" "What advanced math is assumed?" These are things that can be featured in online reviews.

Sadly, these are not questions I get asked.

Irrational Questions

A colleague had some questions about the book named above, some of which were irrational. I'll try to tackle the rational questions, since they emphasize my point about ways not to ask questions about books.

2.  Is the Python code good at solidifying the mathematical concepts? 

This is a definite maybe situation. The concept of "solidifying" as expressed here bothers me a lot.

Solid mathematics -- to me -- means solid mathematics. Outside any code considerations. I failed a math course in college because I tried to convert everything to algorithms and did not get the math part. A kindly professor explained that "F" very, very clearly. A life lesson. The math exists outside any implementation.

I don't think code can ever "solidify" the mathematics. It goes the other way: the code must properly implement the mathematical concepts. The book depends on scipy, and scipy is a really good implementation of a great deal of advanced math. The implementation of the math sits squarely on the rock-solid foundation of scipy. For me, that's a ringing endorsement of the approach.

If the book reinvented the algorithms available in scipy, that would be reason for concern. The book doesn't reinvent that wheel: it uses scipy to solve problems.

4. Can the code be used to build prototypes? 

Um. What? What does the word prototype mean in that question? If we use the usual sense of software prototype, the answer is a trivial "Yes." The examples are prototypes in that sense. That can't be what the question means.

In this context the word might mean "model". Or it might mean "prototype of a model". If we reexamine the question with those other senses of prototype, we might have an answer that's not trivially "yes." Might.

When they ask about prototype, could they mean "model?" The code in the book is a series of models of different kinds of learning. The models are complete, consistent, and work. That can't be what they're asking.

Could they mean "prototype of a model?" It's possible that we're talking about using the book to build a prototype of a model. For example, we might have a large and complex problem with several more degrees of freedom than the text book examples. In this case, perhaps we might want to simplify the complex problem to make it more like one of the text book problems. Then we could use Python to solve that simplified problem as a prototype for building a final model which is appropriate for the larger problem.

In this sense of prototype, the answer remains "What?"  Clearly, the book solves a number of simplified problems and provides code samples that can be expanded and modified to solve larger and more complex problems.

To get past the trivial "yes" for this question, we can try to examine this in a negative sense. What kind of thing is the book unsuitable for? It's unsuitable as a final implementation of anything but the six problems it tackles. It can't be that "prototype" means "final implementation." The book is unsuitable as a tutorial on Python. It's not possible this is what "prototype" means.

Almost any semantics we assign to "prototype" lead to an answer of "yes". The book is suitable for helping someone build a lot of things.

Summary

Those two were the rational questions. The irrational questions made even less sense.

Taken together with the other irrational questions, it appears that the real question might have been this.

Q: "Can I learn Python from this book?"

A: No.

It's possible that the real question was this:

Q: "Can I learn advanced probabilistic modeling with this book?"

A: Above my pay grade. I'm not sure I could learn probabilistic modeling from this book. Maybe I could. But I don't think that I have the depth required.

It's possible that the real question was this:

Q: "Can I learn both Python and advanced probabilistic modeling with this book?"

A: Still No.

Gaps In The Book

Here's what I could say about the book.

You won't learn much Python from this book. It assumes Python; it doesn't tutor Python. Indeed, it assumes some working scipy knowledge and a scipy installation. It doesn't include a quick-start tutorial on scipy or any of that other hand-holding.

This is not even a quibble with the presentation; it's just an observation: the examples are all written in Python 2, and small changes are required for Python 3. Scipy works with Python 3: http://www.scipy.org/scipylib/faq.html#do-numpy-and-scipy-support-python-3-x. Reworking the examples seems to involve only small changes, mostly replacing print statements. In that respect, the presentation is excellent.
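For instance, the change is typically no more than this (a representative line, not one copied from the book):

result = 42  # stand-in for a value computed in one of the book's examples

# Python 2 spelling: print result
print(result)  # Python 3: print is a function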




Thursday, July 17, 2014

New Focus: Data Scientist

Read this: http://www.forbes.com/sites/emc/2014/06/26/the-hottest-jobs-in-it-training-tomorrows-data-scientists/

Interesting subject areas: Statistics, Machine Learning, Algorithms.

I've had questions about data science from folks who (somehow) felt that calculus and differential equations were important parts of data science. I couldn't figure out how they decided that diffeq's were important. Their weird focus on calculus didn't seem to involve using any data. Odd: wanting to be a data scientist, but being unable to collect actual data.

Folks involved in data science seem to think otherwise. Calculus appears to be a side-issue at best.

I can see that statistics are clearly important for data science. Correlation and regression-based models appear to be really useful. I think, perhaps, that these are the linchpins of much data science. Use a sample to develop a model, confirm it over successive samples, then apply it to the population as a whole.

Algorithms become important because doing dumb statistical processing on large data sets can often prove to be intractable. Computing the median of a very large set of data can be essentially impossible if the only algorithm you know is to sort the data and find the middle-most item.
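For example, a quickselect approach finds the middle-most item in expected linear time, with no full sort. A minimal sketch, assuming an odd-length, in-memory sample:

import random

def quickselect(data, k):
    # The k-th smallest item (0-based), in expected O(n) time.
    pivot = random.choice(data)
    below = [x for x in data if x < pivot]
    equal = [x for x in data if x == pivot]
    above = [x for x in data if x > pivot]
    if k < len(below):
        return quickselect(below, k)
    elif k < len(below) + len(equal):
        return pivot
    else:
        return quickselect(above, k - len(below) - len(equal))

def median(data):
    data = list(data)
    return quickselect(data, len(data)//2)

For data that won't fit in memory at all, even this isn't enough; that's where smarter algorithms and approximations earn their keep.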

Machine learning and pattern detection may be relevant for deducing a model that offers some predictive power. Personally, I've never worked with this. I've only worked with actuaries and other quants who have a model they want to confirm (or deny, or improve).

Thursday, July 10, 2014

The Permissions Issue

Why?

Why are Enterprise Computers so hard to use? What is it about computers that terrifies corporate IT?

They're paying lots of money to have me sit around and wait for mysterious approver folks to decide if I can be given permission to install development tools. (Of course, the real work is done by off-shore subcontractors who are (a) overworked and (b) simply reviewing a decision matrix.)

And they ask, "Are you getting everything you need?"

The answer is universally "No, I'm not getting what I need." Universally. But I can't say that.

You want me to develop software. And you simultaneously erect massive, institutional roadblocks to prevent me from developing software.

I have yet to work somewhere without roadblocks that effectively prevent development.

And I know that some vague "security considerations" trump any productive approach to doing software development. I know that there's really no point in trying to explain that I'm not making progress because I can't actually do anything. And you're stopping me from doing anything.

My first two weeks at every client:

The client tried to "expedite" my arrival by requesting the PC early, so it would be available on day 1. It wasn't. A temporary PC is -- of course -- useless. But that's the balance of days 1-5: piddling around with the temporary PC, waiting for the real one that was ordered two weeks earlier.

Day 6 begins with the real PC. It's actually too small for serious development, thanks to an oversight: I was brought on as a developer, but a developer's PC wasn't ordered for me. I'll deal. Things will be slow. That's okay. Some day, you'll discover that I'm wasting time waiting for each build and unit test suite. Right now, I'm doing nothing, so I have no basis to complain.

Day 7 reveals that I need to fill in a form to have the PC you assigned me "unlocked." Without this, I cannot install any development tools.

In order to fill in the form, I need to run an in-house app. Which is known by several names, none of which appear on the intranet site. Day 8 is lost to searching, making some confused phone calls, and waiting for someone to get back to me with something.

Oh. And the email you sent on Day 9 had a broken link. That's not the in-house app anymore. It may have been in the past. But it's not.

Day 10 is looking good. The development request has been rejected because I -- as an outsider -- can't make the request to unlock a PC directly. It has to be made by someone who's away visiting customers or off-shore developers or something.

Remember. This is the two weeks I'm on site. The whole order started 10 business days earlier with the request for the wrong PC without appropriate developer permissions.

Thursday, July 3, 2014

Project Euler

This is (was?) an epic web site:

http://projecteuler.net/about

Currently, they're struggling with a security problem.

http://forum.projecteuler.net/viewtopic.php?f=5&t=3591

Years ago, I found the site and quickly reached Level 2 by solving a flood of easy problems.

Recently, a recruiter strongly suggested reviewing problems on Project Euler as preparation for a job interview.

It was fun! I restarted my quest for being a higher-level solver.

Then they took the solution checking (and score-keeping) features off-line.

So now I have to content myself with cleaning up my previous solutions to make them neat and readable and improve the performance in some areas.

I -- of course -- cannot share the answers. But, I can (and will) share some advice on how to organize your thinking as you tackle these kinds of algorithmically difficult problems.

My personal preference is to rewrite the entire thing in Django. It would probably take a month or two. Then migrate the data. That way I could use RST markup for the problems and the MathJax add-on that docutils uses to format math. But. That's just me.

I should probably take a weekend and brainstorm the functionality that I can recall and build a prototype. But I'm having too much fun solving the problems instead of solving the problem of presenting the problems.

Thursday, June 26, 2014

Package Deal for Learning Python

If you're very new to programming in general, Python's a great place to start.

There are many, many tutorials. I won't even try to summarize them. They're generally good. And the more you read, the more you learn.

Moving past the n00bz needs, there are some more advanced books. Here's a collection for generalists:


My suggestion is to master the general features of the language overall.

A focus on specific things (Django, NLTK, SciPy, Maya, Scrapy, MatPlotLib, etc.) can follow.

I worry that early exposure to some of the details of Python-based packages may obscure the fundamentals of using the language properly. Perhaps that worry is misplaced. I know that the NLTK Book has numerous good examples of Python which are independent of the NLTK focus.

Thursday, June 19, 2014

The Swift Programming Language

https://developer.apple.com/swift/

This lowers the bar for entry to the iOS market.

Does it also lower the bar for Mac OS X?

Can it be used to write command-line applications ("scripts")? It has a REPL, which means it can do a kind of "just-in-time" compile and run. This is how Python works, so perhaps this is a viable mode for using Swift.

Via the Objective-C and C compatibility, it has full access to the POSIX libraries, as well as Cocoa, so it can clearly be used to build command-line apps. It might lack the flexibility of Python, since it's compiled. But C (C++, Objective-C) with automated memory management is still a gigantic victory for writing fast and reliable programs.

Can it be plugged into Apache to write backend applications? It's compiled, and compatible with C and Objective-C. So, one can imagine that a mod-swift component in Apache might be possible. It might be better to work through existing FCGI interfaces and write stand-alone Swift back-ends. This would require a bunch of libraries for database API's, template rendering, request and response processing, and the various bits and pieces that make up a rich web development environment. But this is largely available for C and C++, making it available to Swift-based backends.

Is one language even a desirable goal?

The idea of having one official version of the class definitions seems very helpful for capturing knowledge and managing the intellectual property that is embodied in application logic.

Thursday, June 12, 2014

TDD, API Design and Refactoring

See this short discussion on a Stingray Reader feature:
https://sourceforge.net/p/stingrayreader/discussion/COBOL/thread/d2132851/?limit=25#2a3a

This turned into an exercise in pure TDD.

<rant>
I'm not a fan of applying TDD in a strict, death-march fashion.

I see the comments on Stack Overflow that indicate that some folks feel strongly that strict TDD is somehow helpful. While "test before code" is laudable and often helpful, there's no royal road to good software.

Design involves a great deal of back and forth between code and test. A great deal.

It's logically impossible to write a test without having thought about the code. In order to write the test first, there must be a notional API against which the test is written. Anyone who requires that the test file must be written before the notional class or module is just playing at petty tyranny.

The notional design -- the rough outline of the class or module -- can be written into a file before any tests. It's okay. It is still test-driven because the considerations of testability drove the design process.

In particular, when starting "from scratch" -- with nothing -- writing tests first is senseless. Some module or package structure must exist for the test modules to import.

</rant>
Having ranted: it remains true that, under some circumstances, the tests do come before any code.

In this case, the requested functionality was quite difficult to visualize. However, it was possible to cobble together a test case that simplified the problem down to something like this:


01 Some-Record.
     05 Header PIC XXX.
     05 Body PIC X(17).

01 ABC-Segment.
     05 Field-ABC PIC X(17).

01 DEF-Segment.
     05 Field-DEF PIC X(17).


In COBOL, the program would use logic like IF Header EQUALS "ABC" THEN MOVE Body TO ABC-Segment. We need a way to handle something like this in Python so that we can parse the EBCDIC COBOL data.

This summarized example allowed construction of a test case that made use of an API that might have existed. I was pretty sure I had a test case that showed an approach.
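Stripped of the real Stingray machinery, the notional approach can be sketched in plain Python. The names below are hypothetical, not the library's actual API; the idea is to choose a layout for the 17-byte Body based on the 3-byte Header:

# Hypothetical sketch -- not the actual Stingray API.
def parse_record(record_bytes):
    header = record_bytes[:3].decode("cp037")   # EBCDIC code page 37
    body = record_bytes[3:20]                   # the 17-byte Body
    if header == "ABC":
        return {"Field-ABC": body.decode("cp037")}
    elif header == "DEF":
        return {"Field-DEF": body.decode("cp037")}
    raise ValueError("unexpected Header {0!r}".format(header))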

What Actually Happened

Since the application already had 178 unit tests, there was plenty of structure that worked.

The single new unit test relied on a notional API that wasn't really in place. The new test bombed grotesquely.

There are two solutions:

  • Modify the test.
  • Fix the notional API so that it works properly.

I started out chasing the second option. I tweaked some things. More tests failed. I tweaked some more things. The new test finally passed, but another test was failing.

Some careful study of the failing test revealed that my approach was wrong. Way wrong.

The notional API was a bad idea.

The tweaks to make it work were a worse idea.

Back to the Lab Bench

At this point, I had made enough changes that the only thing to do was copy the new test and revert the local changes in Git to unwind the awful mistakes.

Starting again, I had a slightly better grip on the relevant code. I had a failing test. I tried a different approach that wasn't quite so inventive. This meant modifying the test.

I actually went through a few iterations of the test, using the test method as a kind of lab bench.

A more Pythonic approach to the lab bench is to work from the >>> prompt. I think that all of the exemplary projects use the >>> prompt examples in their documentation. This is a way to narrow and clarify the API. As projects get big, they can sprawl. New features can wind up with many imports to pick and choose elements from existing modules.

When it becomes difficult to use the >>> prompt as the lab bench, that's a sign that the API is too complex. Refactoring must happen.

Using the unit test framework as the lab bench was a hint that something had drifted out of tolerance.

However. I did get a test which passed. Yay. Sort of.

The test code was hideous.

TDD and API Design

The point of TDD, however, is that we have a working suite of tests. Refactoring won't break anything.

The point was that the hideous API could be rewritten into something that both

  • Passed all the tests, and
  • Was usable at the >>> prompt.

It's difficult to express how valuable the Python >>> prompt is to help clarify API design issues.

The rule is this:

If the API doesn't make sense at the >>> prompt, it's incomprehensible.
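As an illustration -- a made-up session, with invented names rather than Stingray's actual API -- the goal is an API that reads as cleanly as this:

>>> from mylib import reader          # hypothetical module and names
>>> doc = reader.open("sample.dat")
>>> doc.record_count()
42
>>> doc.close()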

Sadly, Java doesn't have this kind of boundary. Java programming can spin into quite complex API's, limited only by the laziness of the programmer who avoids refactoring.

Or the malice of the programmer's manager in not allowing time to refactor.

Thursday, May 29, 2014

Stingray 4.4 Update -- the Posix split command applied to COBOL files

Here's an interesting problem. Implement the split command for mainframe COBOL EBCDIC files with their BDW and RDW headers.

The conventional split can't handle COBOL EBCDIC files because they don't have sensible \n line breaks. Translating an EBCDIC file to ASCII is high-risk because COMP and COMP-3 fields will be trashed by the translation.

If the files include Occurs Depending On, then the FTP transfer should include the RDW/BDW headers. The SITE RDW (or LOCSITE RDW) settings are essential. It's much faster to include this overhead: Stingray can process files without the headers, but it's slower.

There are two essential Python techniques for building file splitters that involve parsing.
  • The itertools.groupby() function.
  • The with statement.
Along with this, we need an iterator over the underlying records. For example, the stingray.cobol.RECFM subclasses will parse the various mainframe RECFM options and iterate over records, records with RDW headers, or blocks (BDW headers plus records with RDW headers).

The itertools.groupby() function can break a record iterator into groups based on some group-by criteria. We can use this to break into sequential batches.

itertools.groupby( enumerate(reader), lambda x: x[0]//batch_size )

This expression will break the iterable, reader, into groups, each of which has batch_size records. The last group will have total % batch_size records (or a full batch_size records, when the total is an exact multiple).
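The behavior is easy to confirm with plain integers standing in for records:

import itertools

batch_size = 4
batches = itertools.groupby( enumerate(range(10)), lambda x: x[0]//batch_size )
for group, group_iter in batches:
    print( group, [row for _, row in group_iter] )

This prints 0 [0, 1, 2, 3], then 1 [4, 5, 6, 7], then the short final batch, 2 [8, 9].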

The with statement allows us to make each individual group into a separate context. This assures that each file is properly opened and closed no matter what kinds of exceptions are raised.

Here's a typical script.

    import itertools
    import stingray.cobol
    import collections
    import pprint

    batch_size= 1000
    counts= collections.defaultdict(int)
    with open( "some_file.schema", "rb" ) as source:
        reader= stingray.cobol.RECFM_VB( source ).bdw_iter()
        batches= itertools.groupby( enumerate(reader), lambda x: x[0]//batch_size )
        for group, group_iter in batches:
            # Each batch becomes its own file, opened and closed by the with statement.
            with open( "some_file_{0}.schema".format(group), "wb" ) as target:
                for _, row in group_iter:
                    target.write( row )
                    counts['rows'] += 1
                    counts[str(group)] += 1
    pprint.pprint( dict(counts) )

There are several possible variations on the construction of the reader object.

  • cobol.RECFM_F( source ).record_iter() -- result is RECFM_F
  • cobol.RECFM_F( source ).rdw_iter() -- result is RECFM_V; RDW's have been added. 
  • cobol.RECFM_V( source ).rdw_iter() -- result is RECFM_V; RDW's have been preserved. 
  • cobol.RECFM_VB( source ).rdw_iter() -- result is RECFM_V; RDW's have been preserved; BDW's have been discarded. 
  • cobol.RECFM_VB( source ).bdw_iter() -- result is RECFM_VB; BDW's and RDW's have been preserved. The batch size is the number of blocks, not the number of records.
This should allow slicing up a massive mainframe file into pieces for parallel processing.

Thursday, May 22, 2014

Python Package Design, Refactoring and the Stingray Reader Project

We'll be digging into Mastering Object-Oriented Python. Chapter 17, specifically.

We'll also be looking at a big refactoring of the Stingray Schema-Based File Reader.

We can identify three species of packages.

One common design is a Simple Package. A directory with an empty __init__.py file. This package name becomes a qualifier for the internal module names. The package is simply a namespace for modules. We’ll use the package with something like this:

import package.module


Another common design is the Module-Package. This is a package which appears to be a module. It will have a larger and more sophisticated __init__.py that is, effectively, a module definition. There are two variations on this theme. Sometimes we'll use this during the early stages of development because we don't know if the package will get really big or stay small. If we start out with a package and all the code is in the __init__.py, we can refactor down to a module.

The more common use for a module-package is to have the __init__.py import objects or other modules from the package directory. Or, it can stand as a part of a larger design that includes the top-level module and the qualified sub-modules. We’ll use the package with something like this:

import package

or perhaps

from package import thing


The third common pattern is a package where the __init__.py selects among alternative implementations. The os module is a good example of this. We’ll use the package with something like this:


import package


Knowing that it does something roughly like the following for us:

import package.implementation as package
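A sketch of such an __init__.py, with invented implementation-module names (the os module's actual selection logic is more involved):

# package/__init__.py -- a sketch; the implementation names are invented.
import sys

if sys.platform.startswith("win"):
    from package.win_implementation import *
else:
    from package.posix_implementation import *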


Refactoring Module to Package

The Stingray angle on this is the need to add iWork '13 Numbers to the collection of spreadsheets which it can parse. The iWork '13 format is unique.

Previously, all of the spreadsheets fell into three well-understood families.

iWork '13 uses Snappy compression and Protobuf serialization. Without some documentation, the files would be incomprehensible. See https://github.com/obriensp/iWorkFileFormat. Brilliant.

The previous releases of Stingray had a single, large module to handle a variety of workbook formats. Folding iWork '13 into this module would have been lunacy. It was already large to the point of being painful to understand.

The original module will be transparently turned into a Module-Package. The API (import stingray.workbook or from stingray.workbook import SomeClass) will remain the same.

However.

The implementation will involve a package with each workbook format as a separate module inside that package. At the top, the __init__.py will include code like the following.

    from stingray.workbook.csv import CSV_Workbook
    from stingray.workbook.xls import XLS_Workbook
    from stingray.workbook.xlsx import XLSX_Workbook
    from stingray.workbook.ods import ODS_Workbook
    from stingray.workbook.numbers_09 import Numbers09_Workbook
    from stingray.workbook.numbers_13 import Numbers13_Workbook
    from stingray.workbook.fixed import Fixed_Workbook

This has the advantage of allowing us to include additional parsing cruft in each module that's not part of the exposed API in the workbook package.

The Mastering Object-Oriented Python book has more details on this kind of design.