Tuesday, August 8, 2017

Refusing to Code. Or. How to help the incurious?

The emphasis on code is important. Code defines the behavior of systems -- for the most part Once upon a time, we used clever mechanical designs, or discrete electronic components. The InternetofThings idea exists because high-powered general-purposes CPU's are ubiquitous.

A DevOps mantra is "infrastructure as code". The entire deployment is automated, from the allocation of processors and storage down to pining the health-check endpoint to be sure it's live. Blue-Green deployments, traffic switching, etc., and etc. These all require lots of code and as little manual intervention as possible. 

The gold standard is to use tools to visualize state, make a decision, and use tools to take action. Lots of code.

When I meet the anti-code people, it's confusing.

Outside my narrow realm of tech, anti-code is fine. I have a sailboat, I meet lots of non-tech people who can't code, won't code, and aren't sure what code is.

But when I meet people who claim they want to be data science folks but refuse to code, I'm baffled.

Step 1 was to "learn more" about data science or something like that. I suggested some of the ML tutorials available for Python. Why? It appears that Scikit Learn is the gold standard for ML applications. http://scikit-learn.org/stable/tutorial/index.html

Because they didn't want to code, they insisted on doing things in Excel. Really.

Step 2 was to figure out some simulated annealing process -- in Excel. They had one of the central textbooks on ML algorithms. And they had a spreadsheet. They had some question that can only arise from avoiding open-source code. I suggested they use the open source code available to everyone. Or perhaps find a more modern tutorial like this: http://katrinaeg.com/simulated-annealing.html

Because they don't want to code, they used the fact that scipy.optimize.anneal() was deprecated to indict Python. I almost wish I'd saved all the emails over why basin hopping was unacceptable. The reasoning involved having an old textbook that covered annealing in depth, and not wanting to actually read the code for basin hopping. Or something. 

Step 3 was to grab a Kaggle problem and start working on it. This is too large for a spreadsheet. Indeed, the data sets push the envelope on what can be done on a Windows laptop because the dataframes tend to be quite large. It requires installing Scikit learn, which means installing Anaconda from Continuum. There's no reasonable alternative.

The Kaggle exercise may also involve buying a new laptop or renting time on a cloud-based server that's big enough to handle the data set. ML processing takes time, and GPU acceleration can be a huge help. All of this, however, presumes that there's code to run.

Because they don't want to code, this bled into an amazing number of unproductive directions.  There's some kind of classic "do everything except what you need to do" behavior. I'm sure it has a name. It's more than "work avoidance." It's a kind of active negation of the goals. It was impossible to discern what was actually going on or how I was supposed to help.

I suggested a Trello board. 

The Trello board devolved into dozens of individual lists, each list had one card. Seriously. The card/list thing became a way of avoiding progress. There were cards for considering the implications of installing Anaconda. The cards turned into hand-wringing discussions and weird status updates and memo-to-self notes, instead of actual actions.

Bottom line? 

No code. 

In the middle of the Kaggle something-or-other board, a card appeared asking for comments on some code. :yay2: Something I can actually help with.

The code was bad. And precious. I blogged about this phenomenon earlier. The code can't be changed because it was so hard to create. It was really bad, and riddled with bizarre things that make it look like they'd never seen code before.

Use pylint? This got a grudging kind of reluctant cleanup. But huge_variable_names_with_lots_of_useless_clauses aren't flagged by Pylint. They're still bad, and reading other code would show how atypical these names are. Unless, of course, you hate code; then reading code is not going to happen.

My new model for their behavior? They hate code. So when they do it, they do it badly. Intentionally badly. And because it was so painful, it's precious. (I'm probably wrong, and there's probably a lot more to this, but it seems to fit the observed behavior.)

It gets worse (or better, depending on your attitude.)

Another Trello card appears wondering what [a, b] * 2 or some such Pythonic thing might mean. Um. What?

It appears that they can't find the Standard Library description of the built-in data types and their operators. As if chapter four was deleted from their copy, or something.

The "can't find" seems unlikely. It's pretty prominent. I would think that anyone aspiring to learn Python would see the "keep this under your pillow" admonition on the standard library docs and perhaps glance through the first five sections to see what the fuss was about. Unless they hate code.

I'm left with "won't find."  Perhaps they're refusing to use the documentation? Are they also refusing to use Python's internal help? It's not great, but you can try a bunch of things and get steered around from topic to topic, eventually, you have to find something useful.

Apply my new model: they hate code and Python help() is code.

Do they really hate code that much? I now think they do. I think they truly and deeply hate losing manual, personal. hands-on control over things. If it's not a spreadsheet -- where they typed each cell personally -- it's reviled. (Or feared? Let's not go too far here.)

Test the hypothesis. Ask if they used help().

Answer: Yes. They had tried three things (exactly three) and none of those three had a satisfactory explanation. The help() function did not work. Indeed, two of the things they tried had the same result, and the third reported a syntax error. So they stopped.

They tried three things and stopped.

Okay, then. They hate code. And -- Bonus! -- They refuse to explore. Somehow they're also able to insist they must learn to code. Will the self-beatings continue until the attitude improves?

It's difficult to offer meaningful help under these circumstances. I don't see the value in being someone's personal Google, since that only reinforces the two core refusals to use code or explore by typing code to see what happened.

I like to think that coding is a core life skill. Like cooking. You don't have to become a chef, but you have to know how to handle food. You don't have to create elaborate, scalable meshes of microservices. But you have to be able to find the data types and operators on your own.

And I don't know how to coach someone who is so incurious that three attempts with help() is the limit. Done at three. Count it as a failure and stop trying. "Try something different" seems vague, but it's all I've got. Anything more feels isomorphic to "Here's the link, attached is an audio file of me reading the words out loud for you." 

Other Entries Other Blogs


Plus, of course, lots of other stuff from lots of other folks. Enjoy.

Tuesday, August 1, 2017

JSON vs. XML: The battle for format supremacy may be wasted energy - SD Times


This article seems silly. Perhaps I missed something important.

I'm not sure who's still litigating the JSON vs. XML, but it seems like it's more-or-less done.

XHTML/XML for HTML things.

JSON for everything else.

Maybe there are people still wringing their hands over this. AFAIK, the last folks using SOAP/XML services are commercial and governmental agencies where change tends to happen very slowly.

I remember when Sun Microsystems was a company and had the Java Composite Applications Suite. Very XML. That was -- perhaps -- ten years ago. Since then, I think the problem has been solved. I'm not sure who's battling for supremacy or why.

Tuesday, July 25, 2017

The "My Code Is Precious To Me" Conundrum

I suspect some people sweat so hard over each line of code that it becomes precious. Valuable. An investment wrung from their very soul. Or something.

When they ask for comments, it becomes difficult.

The Pull Request context can be challenging. There the code is, beaten into submission after Herculean toils, and -- well -- it's not really very good. The review isn't a pleasant validation with some suggested rewrites of the docstrings to remove dangling participles (up with which I will not put.) Perhaps the code makes a fundamentally flawed assumption and proceeds from there to create larger and larger problems until it's really too awful to save.

How do you break the news?

I get non-PR requests for code reviews once in a while. The sincere effort at self-improvement is worthy of praise. It's outside any formal PR process; outside formal project efforts. It's good to ask for help like that.

The code, on the other hand, has to go.

I'm lucky that the people I work with daily can create -- and discard -- a half-dozen working examples in the space of an hour.

I'm unlucky that people who ask for code review advice can't even think rationally about starting again with different assumptions. They'd rather argue about their assumptions than simply create new code based on different (often fewer) assumptions.

I've seen some simple unit conversion problems turned into horrible messes. The first such terrifying thing was a data query filter based on year-month with a rolling 13-month window. Somehow, this turned into dozens and dozens of lines of ineffective code, filled with wrong edge cases.

Similar things happen with hour-minute windows. Lots of wrong code. Muddled confusion. Herculean efforts doing the wrong thing. Herculean.

Both year-month and hour-minute problems are units conversion. Year-month is months in base 12. Hour-minute is minutes in base 60. Technically, they're mixed bases, simple polynomials in two terms. It's a multiply and an add. 12y+m, where 0 ≤ m < 12. Maybe an extra subtract 1 is involved.

The entire algorithm is a multiply and an add. There shouldn't very many lines of code involved. In some cases, there's an additional conversion from integer minutes to float hours. Which is a multiply by a constant (1/720.) Or integer months to float years after an epochal year (another add with a negative number and multiply by 1/12.)

I think it's common that ineffective code need to be replaced. Maybe it's sad that it has to get replaced *after* being written? I don't think so. All code gets rewritten. Some just gets written sooner.

I think that some people may need some life-coaching as well as code reviews.

Perhaps they should be encouraged to participate in a design walk-through before sweating their precious life's blood into code that doesn't solve the problem at hand.

Tuesday, July 18, 2017

Yet Another Python Problem List

This was a cool thing to see in my Twitter feed:

Dan Bader (@dbader_org)
"Why Python Is Not My Favorite Language" zenhack.net/2016/12/25/why…

More Problems with Python. Here's the short list.

1. Encapsulation (Inheritance, really.)
2. With Statement
3. Decorators
4. Duck Typing (and Documentation)
5. Types

I like these kinds of posts because they surface problems that are way, way out at the fringes of Python. What's important to me is that most of the language is fine, but the syntaxes for a few things are sometimes irksome. Also important to me is that it's almost never the deeper semantics; it seems to be entirely a matter of syntax.

The really big problem is people who take the presence of a list like this as a reason dismiss Python in its entirety because they found a few blog posts identifying specific enhancements. That "Python must be bad because people are proposing improvements" is madding. And dismayingly common.

Even in a Python-heavy workplace, there are Java and Node.js people who have opinions shaped by little lists like these. The "semantic whitespace" argument coming from JavaScript people is ludicrous, but there they are: JavaScript has a murky relationship with semi-colons and they're complaining about whitespace. Minifying isn't a virtue. It's a hack. Really.

My point in general is not to say this list is wrong. It's to say that these points are minor. In many cases, I don't disagree that these can be seen as problems. But I don't think they're toweringly important.

1. The body of first point seems to be more about inheritance and accidentally overiding something that shouldn't have been overridden. Java (and C++) folks like to use private for this. Python lets you read the source. I vote for reading the source.

2. Yep. There are other ways to do this. Clever approach. I still prefer with statements.

3. I'm not sold on the syntax change being super helpful.

4. People write bad documentation about their duck types. Good point. People need to be more clear.

5. Agree. A lot of projects need to implement type hints to make it more useful.

Tuesday, July 11, 2017

Extracting Data Subsets and Design By Composition

The request was murky. It evolved over time to this:
Create a function file_record_selection(train.csv, 2, 100, train_2_100.csv)
First parameter: input file name (train.csv)
Second parameter: first record to include (2)
Third parameter: last record to include (100)
Fourth parameter: output file name (train_2_100.csv)
Fundamentally, this is a bad way to think about things. I want to cover some superficial problems first, though.

First superficial dig. It evolved to this. In fairness to people without a technical background, getting to tight, implementable requirements are is difficult. Sadly the first hand-waving garbage was from a DBA. It evolved to this. The early drafts made no sense.

Second superficial whining. The specification -- as written -- is extraordinarily shabby. This seems to be written by someone who's never read a function definition in the Python documentation before. Something I know is not the case. How can someone who is marginally able to code also unable to write a description of a function? In this case, the "marginally able to code" may be a hint that some folks struggle with abstraction: the world is a lot of unique details; patterns don't emerge from related details.

Third. Starting from record 2, seems to show that they don't get the idea that indexes start with zero. They've seen Python. They've written code. They've posted code to the web for comments. And they are still baffled by the start value of indices.

Let's move on to the more interesting topic, functional composition. 

Functional Composition

The actual data file is a .GZ archive. So there's a tiny problem with looking at .CSV extracts from the gzip. Specifically, we're exploding a file all over the hard drive for no real benefit. It's often faster to read the zipped file: it may involve fewer physical I/O operations. The .GZ is small; the computation overhead to decompress may be less than the time waiting for I/O.

To get to functional composition we have to start by decomposing the problem. Then we can build the solution from the pieces. To do this, we'll borrow the interface segregation (ISP) design principle from OO programming.

Here's an application of ISP: Avoid Persistence. It's easier to add persistence than to remove it. This leads peeling off three further tiers of file processing: Physical Format, Logical Layout, and Essential Entities.

We shouldn't write a .CSV file unless it's somehow required. For example, if there are multiple clients for a subset. In this case, the problem domain is exploratory data analysis (EDA) and saving .CSV subsets is unlikely to be helpful. The principle still applies: don't start with persistence in mind. What are the Essential Entities?

This leads away from trying to work with filenames, also. It's better to work with files. And we shouldn't work with file names as strings, we should use pathlib.Path. All consequences of peeling off layers from the interfaces.

Replacing names with files means the overall function is really this. A composition. 

file_record_selection = (lambda source, start, stop, target: 
    file_write(target, file_read_selection(source, start, stop))

We applied the ISP again, to avoid opening a named .CSV file. We can work with an open file-like objects, instead of a file names. This doesn't change the overall form of the functions, but it changes the types. Here are the two functions that are part of the composition:

from typing import *
import typing
Record = Any
def file_write(target: typing.TextIO, records: Iterable[Record]):
def file_read_selection(source: csv.DictReader, start: int, stop: int) -> Iterable[Record]:

We've left the record type unspecified, mostly because we don't know what it just yet. The definition of Record reflects the Essential Entities, and we'll defer that decision until later. CSV readers can produce either dictionaries or lists, so it's not a complex decision; but we can defer it.

The .GZ processing defines the physical format. The content which was zipped was a .CSV file, which defines the logical layout.

Separating physical format, logical layout, and essential entity, gets us code like the following:

with gzip.open('file.gz') as source:
    reader = csv.DictReader(source)  # Iterator[Record]
    for line in file_read_selection(reader, start, stop):

We've opened the .GZ for reading. Wrapped a CSV parser around that. Wrapped our selection filter around that. We didn't write the CSV output because -- actually -- that's not required. The core requirement was to examine the input.

We can, if we want, provide two variations of the file_write() function and use a composition like the file_record_selection() function with the write-to-a-file and print-to-the-console variants. Pragmatically, the print-to-the-console is all we really need.

In the above example, the Record type can be formalized as  List[Text].  If we want to use csv.DictReader instead, then the Record type becomes Dict[Text, Text].

Further Decomposition

There's a further level of decomposition: the essential design pattern is Pagination. In Python parlance, it's a slice operation. We could use itertools to replace the entirety of file_read_selection() with itertools.takewhile() and itertools.dropwhile(). The problem with these methods is they don't short-circuit: they read the entire file.

In this instance, it's helpful to have something like this for paginating an iterable with a start and stop value.

for n, r in enumerate(reader):
    if n < start: continue
    if n = stop: break
    yield r

This covers the bases with a short-circuit design that saves a little bit of time when looking at the first few records of a file. It's not great for looking at the last few records, however. Currently, the "tail" use case doesn't seem to be relevant. If it was, we might want to create an index of the line offsets to allow arbitrary access. Or use a simple buffer of the required size.

If we were really ambitious, we'd use the Slice class definition to make it easy to specify start, stop, and step values. This would allow us to pick every 8th item from the file without too much trouble.

The Slice class doesn't, however support selection of a randomized subset. What we really want is a paginator like this:

def paginator(iterable, start: int, stop: int, selection: Callable[[int], bool]):
    for n, r in enumerate(iterable):
        if n < start: continue
        if n == stop: break
        if selection(n): yield r

file_read_selection = lambda source, start, stop: paginator(source, start, stop, lambda n: True)

file_read_slice = lambda source, start, stop, step: paginator(source, start, stop, lambda n: n%step == 0)

The required file_read_selection() is built from smaller pieces. This function, in turn, is used to build file_record_selection() via functional composition. We can use this for randomized selection, also.

Here are functions with type hints instead of lambdas.

def file_read_selection(source: csv.DictReader, start: int, stop: int) -> Iterable[Record]:
    return paginator(source, start, stop, lambda n: True)

def file_read_slice(source: csv.DictReader, start: int, stop: int, step: int)  -> Iterable[Record]:
    return paginator(source, start, stop, lambda n: n%step == 0)

Specifying type for a generic iterable and the matching result iterable seems to require a type variable like this:

T = TypeVar('T')
def paginator(iterable: Iterable[T], ...) -> Iterable[T]:

This type hint suggests we can make wide reuse of this function. That's a pleasant side-effect of functional composition. Reuse can stem from stripping away the various interface details to decompose the problem to essential elements.


What's essential here is Design By Composition. And decomposition to make that possible.

We got there by stepping away from file names to file objects. We segregated Physical Format and Logical Layout, also. Each application of the Interface Segregation Principle leads to further decomposition. We unbundled the pagination from the file I/O. We have a number of smaller functions. The original feature is built from a composition of functions.

Each function can be comfortably tested as a separate unit. Each function can be reused.

Changing the features is a matter of changing the combination of functions. This can mean adding new functions and creating new combinations. 

Tuesday, July 4, 2017

Python and Performance

Real Question:

One of the standard problems that keeps coming up over and over is the parsing of url's. A sub-problem is the parsing of domain and sub-domains and getting a count.

For example

It would be nice to parse the received file and get counts like

.com had 15,323 count
.google.com had 62 count
.theatlantic.com had 33 count

The first code snippet would be in Python and the other code snippet would be in C/C++ to optimize for performance.


Yes. They did not even try to look in the standard library for urllib.parse. The general problem has already been solved; it can be exploited in a single line of code.

The line can be long-ish, so it can help to use a lambda to make it a little easier to read. The code is below.

The C/C++ point about "optimize for performance" bothers me to no end. Python isn't very slow. Optimization isn't required.

I made 16,000 URL's. These were not utterly random strings, they were random URL's using a pool of 100 distinct names. This provides some lumpiness to the data. Not real lumpiness where there's a long tail of 1-time-only names. But enough to exercise collections.Counter and urllib.parse.urlparse().

Here's what I found. Time to parse 16,000 URLs and pluck out the last two levels of the name?

CPU times: user 154 ms, sys: 2.18 ms, total: 156 ms
Wall time: 157 ms


CPU times: user 295 ms, sys: 6.87 ms, total: 302 ms
Wall time: 318 ms

At that pace, why use C?

I suppose one could demand more speed just to demand more speed.

Here's some code that can be further optimized.

top = lambda netloc: '.'.join(netloc.split('.')[-2:])
random_counts = Counter(top(urllib.parse.urlparse(x).netloc) for x in random_urls_32k)

The slow part of this is the top() function. Using rsplit('.', maxsplit=2) might be better than split('.'). A smarter approach might be find all the "." and slice the substring from the next-to-last one. Something like this, netloc[findall('.', netloc)[-2]:], assuming a findall() function that returns the locations of all '.' in a string.

Of course, if there is a problem, using a numpy structure might speed things up. Or use dask to farm the work out to multiple threads.

Tuesday, June 27, 2017

OOP and FP -- Objects vs. Functional -- Avoiding reductionist thinking

Real Quote (lightly edited to remove tangential nonsense.)
Recently, I watched a video and it stated that OO is about nouns and Functional programming is about the verbs. Also, ... Aspect-Oriented Programming with the e Verification Language  by David Robinson 
It would be nice to have a blog post which summarized the various mindset associated w/ the various paradigms.

I find the word "mindset" to be challenging.

Yes. All Turing Complete programming languages do have a kind of fundamental equivalence at the level of computing stuff represented as numbers. This, however, seems reductionist.

["All languages were one language to him. All languages were 'woddly'." Paraphrased from James Thurber's "The Great Quillow", a must-read.]

So. Beyond the core Turing Completeness features of a language, the rest is reduced to a difference in "mindset"? The only difference is how we pronounce "Woddly?"

"Mindset" feels reductionist. It replaces a useful summary of language features with a dismissive "mindset" categorization of languages. In a way, this seems to result from bracketing technology choices as "religious wars," where the passion for a particular choice outweighs the actual relevance; i.e., "All languages have problems, so use Java."

In my day job, I work with three kinds of Python problems:
  • Data Science
  • API Services
  • DevOps/TechOps Automation
In many cases, one person can have all three problems. These aren't groups of people. These are problem domains.

I think the first step is to replace "mindset" with "problem domain". It's a start, but I'm not sure it's that simple.

When someone has a data science problem, they often solve it with purely function features of Python. Generally they do this via numpy, but I've been providing examples of generator expressions and comprehensions in my Code Dojo webinars. Generator expressions are an elegant, functional approach to working with stateless data objects.

In Python 3, the following kind of code doesn't create gigantic intermediate data structures. The functional transformations are applied to each item generated by the "source".

x = map(transform, source)
y = filter(selector_rule, x)
z = Counter(y)

I prefer to suggest that a fair amount of data analysis involves little or no mutation of state. Functional features of a language seem to work well with immutable data.

There is state change, but it's at a macro level. For example, the persistence of capturing data is a large-scale state change that's often implemented at the OS level, not the language level.

When someone's building an API, on the other hand, they're often working with objects that have a mutating state. Some elements of an API will involve state change, and objects model state change elegantly. RESTful API's can deserialize objects from storage, make changes, and serialize the modified object again.

[This summary of RESTful services is also reductionist, and therefore, possibly unhelpful.]

When there's mutability, then objects might be more appropriate than functions.

I'm reluctant to call this "mindset." It may not be "problem domain." It seems to be a model that involves mutable or immutable state.

When someone's automating their processing, they're wrestling with OS features, and OS management of state change. They might be installing stuff, or building Docker images, or gluing together items in a CI/CD pipeline, setting the SSL keys, or figuring out how to capture Behave output as part of Gherkin acceptance testing. Lots of interesting stuff that isn't the essential problem at hand, but is part of building a useful, automated solution to the problem.

The state in these problems is maintained by the OS. Application code may -- or may not -- try to model that state.

When doing Blue/Green deployments, for example, the blueness and greenness isn't part of the server farm, it's part of an internal model of how the servers are being used. This seems to be stateful; object-oriented programming might be helpful. When the information can be gleaned from asset management tools, then perhaps a functional processing stream is more important for gathering, deciding, and taking action.

I'm leaning toward the second view-point, and suggesting that some of the OO DevOps programming might be better looked at as functional map-filter-reduce processing. Something like

action_to_take = some_decision_pipeline(current state, new_deployment)

This reflects the questions of state change. It may not be the right abstraction though, because carrying out the action is, itself, a difficult problem that involves determining the state of the server farm, and then applying some change to one or more servers.

We often think of server state change as imperative in nature. It feels like object-oriented programming. There are steps, the object models those steps. I'm not sure that's right. I think there's a repeated "determine next action" cycle here. Sometimes it involves waiting for an action to finish. Yes, it sounds like polling the server farm. I'm not sure that's wrong. How else do you know a server is crashed except by polling it?

I think we've moved a long way from "mindset."

I think it's about fitting language features to a problem in a way that creates the right abstraction to capture (and eventually) solve the problem.

I haven't mentioned Aspect-Oriented Programming because it seems to cut across the functional/object state management boundary. It's a distinctive approach to organizing reusable functionality. I don't mean to dismiss it as uninteresting. I mean to set it aside as orthogonal to the "mutable state" consideration that seems to to be one of the central differences between OOP and FP.

In response to the request: "No. I won't map mindset to paradigm."

Tuesday, June 20, 2017

NMEA Data Acquisition -- An IoT Exercise with Python

Here's the code: https://github.com/slott56/NMEA-Tools. This is Python code to do some Internet of Things (IoT) stuff. Oddly, even when things connected by a point-to-point serial interface, it's still often called IoT. Even though there's no "Internetworking."

Some IoT projects have a common arc: exploration, modeling, filtering, and persistence. This is followed by the rework to revise the data models and expand the user stories. And then there's the rework conundrum. Stick with me to see just how hard rework can be.

What's this about? First some background. Then I'll show some code.

Part of the back story is here: http://www.itmaybeahack.com/TeamRedCruising/travel-2017-2018/that-leaky-hatch--chartplot.html

In the Internet of Things Boaty (IoT-B) there are devices called chart-plotters. They include GPS receivers, displays, and controls. And algorithms. Most important is the merging of GPS coordinates and an active display. You see where your boat is.

Folks with GPS units in cars and on their phones have an idea core feature set of a chart plotter. But the value of a chart plotter on a boat is orders of magnitude above the value in a car.

At sea, the hugeness and importance of the chartplotter is magnified. The surface of the a large body of water is (almost) trackless. Unless you're really familiar with it, it's just water, generally opaque. The depths can vary dramatically. A shoal too shallow for your boat can be more-or-less invisible and just ahead. Bang. You're aground (or worse, holed.)

A chart -- and knowledge of your position on that chart -- is a very big deal. Once you sail out of sight of land, the chart plotter becomes a life-or-death necessity. While I can find the North American continent using only a compass, I'm not sure I could find the entrance to Chesapeake Bay without knowing my latitude. (Yes, I have a sextant. Would I trust my life to my sextant skills?)

Modern equipment uses modern hardware and protocols. N2K (NMEA 2000), for example, is powered Ethernet connectivity that uses a simplified backbone with drops for the various devices. Because it's Ethernet, they're peers, and interconnection is simplified. See http://www.digitalboater.com for some background.

The Interface Issue

The particularly gnarly problem with chart plotters is the lack of an easy-to-live-with interface.

They're designed to be really super robust, turn-it-on-and-it-works products. Similar to a toaster, in many respects. Plug and play. No configuration required.

This is a two-edged sword. No configuration required bleeds into no configuration possible.

The Standard Horizon CP300i uses NT cards. Here's a reader device. Note the "No Longer Available" admonition. All of my important data is saved to the NT card. But. The card is useless except for removable media backup in case the unit dies.

What's left? The NMEA-0183 interface wiring.

NMEA Serial EIA-422

The good news is that the NMEA wiring is carefully documented in the CP300i owner's manual. There are products like this NMEA-USB Adaptor. A few wire interconnections and we can -- at least in principle -- listen to this device.

The NMEA standard was defined to allow numerous kinds of devices to work together. When it was adopted (in 1983), the idea was that a device would be a "talker" and other devices would be "listeners." The intent was to have a lot of point-to-point conversations: one talker many listeners.

A digital depth meter or wind meter, for example, could talk all day, pushing out message traffic with depth or wind information. A display would be a listener and display the current depth or wind state.

A centralized multiplexer could collect from multiple listeners and then stream the interleaved messages as a talker. Here's an example. This would allow many sensors to be combined onto a single wire. A number of display devices could listen to traffic on the wire, pick out messages that made sense to them, and display the details.

Ideally, if every talker was polite about their time budget, hardly anything would get lost.

In the concrete case of the CP300i, there are five ports. usable in various combinations. There are some restrictions that seem to indicate some hardware sharing among the ports. The product literature describes a number of use cases for different kinds of interconnections including a computer connection.

Since NMEA is EIA-422 is RS-232, some old computer serial ports could be wired up directly. My boat originally had an ancient Garmin GPS and an ancient windows laptop using an ancient DB-9 serial connector. I saved the data by copying files off the hard drive and threw the hardware away.

A modern Macintosh, however, only handles USB. Not direct EAI-422 serial connections. An adaptor is required.

What we will have, then, is a CP300i in talker mode, and a MacBook Pro (Retina, 13-inch, Late 2013) as listener.

Drivers and Infrastructure

This is not my first foray in the IoT-B world. I have a BU-353 GPS antenna. This can be used directly by the GPSNavX application on the Macintosh. On the right-ish side of the BU-353 page are Downloads. There's a USB driver listed here. And a GPS Utility to show position and satellites and the NMEA data stream.

Step 1. Install this USB driver.

Step 2. Install the MAC OS X GPS Utility. I know the USB interface works because I can see the BU-353 device using this utility.

Step 3. Confirm with GPSNavX. Yes. The chart shows the little boat triangle about where I expect to be.

Yay! Phase I of the IoT-B is complete. We have a USB interface. And we can see an NMEA-0183 GPS antenna. It's transmitting in standard 4800 BAUD mode. This is the biggest hurdle in many projects: getting stuff to talk.

In the project Background section on Git Hub, there's a wiring diagram for the USB to NMEA interface.

Also, the Installation section says install pyserial. https://pypi.python.org/pypi/pyserial. This is essential to let Python apps interact with the USB driver.

Data Exploration

Start here: NMEA Reference Manual. This covers the bases for the essential message traffic nicely. The full NMEA standard has lots of message types. We only care about a few of them. We can safely ignore the others.

As noted in the project documentation, there's a relatively simple message structure. The messages arrive more-or-less constantly. This leads to an elegant Pythonic design: an Iterator.

We can define a class which implements the iterator protocol (__iter__() and __next__()) that will consume lines from the serial interface and emit the messages which are complete and have a proper checksum. Since the fields of a message are comma-delimited, might as well split into fields, also.

It's handy to combine this with the context manager protocol (__enter__() and __exit__()) to create a class that can be used like this.

    with Scanner(device) as GPS:
        for sentence_fields in GPS:

This is handy for watching the messages fly past. The fields are kind of compressed. It's a light-weight compression, more like a lack of useful punctuation than proper compression.

Consequently, we'll need to derive fields from the raw sequences of bytes. This initial exploration leads straight to the next phase of the project.


We can define a data model for these sentences using a Sentence class hierarchy. We can use a simple Factory function to emit Sentence objects of the appropriate subclass given a sequence of fields in bytes. Each subclass can derive data from the message.

The atomic fields seem to be of seven different types.
  • Text. This is a simple decode using ASCII encoding.
  • Latitude. The values are in degrees and float minutes.
  • Longitude. Similar to latitude.
  • UTC date. Year, month, and day as a triple.
  • UTC time. Hour, minute, float seconds as a triple.
  • float. 
  • int.
Because fields are optional, we can't naively use the built-in float() and int() functions to convert bytes to numbers. We'll have to have a version that works politely with zero-length strings and creates None objects.

We can define a simple field definition tuple, Field = namedtuple('Field', ['title', 'name', 'conversion']). This slightly simplifies definition of a class.

We can define a class with a simple list of field conversion rules.

class GPWPL(Sentence):
    fields = [
        Field('Latitude', 'lat_src', lat),
        Field('N/S Indicator', 'lat_h', text),
        Field('Longitude', 'lon_src', lon),
        Field('E/W Indicator', 'lon_h', text),
        Field("Name", "name", text),        

The superclass __init__() uses the sequence of field definitions to apply conversion functions (lat(), lon(), text()) to the bytes, populating a bunch of attributes. We can then use s.lat_src to see the original latitude 2-tuple from the message. A property can deduce the actual latitude from the s.lat_src and s.lat_h fields.

For each field, apply the function to the value, and set this as an attribute.

        for field, arg in zip(self.fields, args[1:]):
                setattr(self, field.name, field.conversion(arg))
            except ValueError as e:
                self.log.error(f"{e} {field.title} {field.name} {field.conversion} {arg}")

This sets attributes with useful values derived from the bytes provided in the arguments.

The factory leverages a cool name-to-class mapping built by introspection.

    sentence_class_map = {
        class_.__name__.encode('ascii'): class_ 
        for class_ in Sentence.__subclasses__()
    class_= self.sentence_class_map.get(args[0])

This lets us map a sentence header (b"GPRTE") to a class (GPRTE) simply. The get() method can use an UnknownSentence subclass as a default.

Modeling Alternatives

As we move forward, we'll want to change this model. We could use a cooler class definition style, something like this. We could then iterate of the keys in the class __dict__ to set the attribute values.

class GPXXX(Sentence):
    lat_src = Latitude(1)
    lat_h = Text(2)
    lon_src = Longitude(3)
    lon_h = Text(4)
    name = Text(5)

The field numbers are provided to be sure the right bunch of bytes are decoded.

Or maybe even something like this:

class GPXXX(Sentence):
    latitude = Latitude(1, 2)
    longitude = Longitude(3, 4)
    name = Text(5)

This would combine source fields to create the useful value. It would be pretty slick. But it requires being *sure* of what a sentence' content is. When exploring, this isn't the way to start. The simplistic list of field definitions comes right off web sites without too much intermediate translation that can lead to confusion.

The idea is to borrow the format from the SiRF reference and start with Name, Example, Unit, and Description in each Field definition. That can help provide super-clear documentation when exploring. The http://aprs.gids.nl/nmea/ information has similar tables with examples. Some of the http://freenmea.net/docs examples only have names.

The most exhaustive seems to be http://www.catb.org/gpsd/NMEA.html. This, also, only has field names and position numbers. The conversions are usually pretty obvious.


A talker -- well -- talks. More or less constantly. There are delays to allow time to listen and time for multiplexers to merge in other talker streams.

There's a cycle of messages that a device will emit. Once you've started decoding the sentences, the loop is obvious.

For an application where you're gathering real-time track or performance data, of course, you'll want to capture the background loop. It's a lot of data. At about 80 bytes times 8 background messages on a 2-second cycle, you'll see 320 bytes per second, 19K per minute, 1.1M per hour, 27.6M per day. You can record everything for 38 days to and be under a Gb.

The upper bound for 4800 BAUD is 480 bytes per second. 41M per day. 25 days to record a Gb of raw data.

For my application, however, I want to capture the data not in the background loop.

It works like this.
  1. I start the laptop collecting data.
  2. I reach over to the chartplotter and push a bunch of buttons to get to a waypoint transfer or a route transfer.
  3. The laptop shows the data arriving. The chartplotter claims it's done sending.
  4. I stop collecting data. In the stream of data are my waypoints or routes. Yay!
A reject filter is an easy thing: Essentially it's filter(lambda s: s._name not in reject_set, source). A simple set of names to reject is the required configuration for this filter.


How do we save these messages?

We have several choices.
  1. Stream of Bytes. The protocol uses \r\n as line endings. We could (in principle) cat /dev/cu.usbserial-A6009TFG >capture.nmea. Pragmatically, that doesn't always work because the 4800 BAUD setting is hard to implement. But the core idea of "simply save the bytes" works.
  2. Stream of Serialized Objects. 
    1. We can use YAML to cough out the objects. If the derived attributes were all properties, it would have worked out really well. If, however, we leverage __init__() to set attributes, this becomes awkward.
    2. We can work around the derived value problems by using JSON with our own Encoder to exclude the derived fields. This is a bit more complex, than it needs to be. It permits exploration though.
  3. GPX, KML, or CSV. Possible, but. These seems to be a separate problem.
When transforming data, it's essential to avoid "point-to-point" transformation among formats. It's crucial to have a canonical representation and individual converters. In this case, we have NMEA to canonical, persist the canonical, and canonical to GPX (or KML, or CSV.)


Yes. There's a problem here.  Actually there are several problems.
  1. I got the data I wanted. So, fixing the design flaws isn't essential anymore. I may, but... I should have used descriptors.
  2. In the long run, I really need a three-way synchronization process between computer, new chart plotter and legacy chart plotter. 
Let's start with the first design issue: lazy data processing.

The core Field/Sentence design should have looked like this:

class Field:
    def __init__(self, position, function, description):
        self.position = position
        self.function = function
        self.description = description
    def __get__(self, object, class_):
        print(f"get {object} {class_}")
        transform = self.function
        return transform(object.args[self.position])
class Sentence:
    f0 = Field(0, int, "Item Zero")
    f1 = Field(1, float, "Item One")
    def __init__(self, *args):
        self.args = args

This makes all of the properties into lazy computations. It simplifies persistence because the only real attribute value is the tuple of arguments captured from the device.

>>> s = Sentence(b'1', b'2.3')
>>> s.f1
>>> s.f2

That would have been a nicer design because serialization would have been trivial. Repeated access to the fields might have become costly. We have a tradeoff issue here that depends on the ultimate use case. For early IoT efforts, flexibility is central, and the computation costs don't matter. At some point, there may be a focus on performance, where extra code to save time has merit.

Synchronization is much more difficult. I need to pick a canonical representation. Everything gets converted to a canonical form. Differences are identified. Then updates are created: either GPX files for the devices that handle that, or NMEA traffic for the device which updated over the wire.


This IoT project followed a common arc: Explore the data, define a model, figure out how to filter out noise, figure out how to persist the data. Once we have some data, we realize the errors we made in our model.

A huge problem is the pressure to ship an MVP (Minimally Viable Product.) It takes a few days to build this. It's shippable.

Now, we need to rework it. In this case, throw most of the first release away. Who has the stomach for this? It's essential, but it's also very difficult.

A lot of good ideas from this blog post are not in the code. And this is the way a lot of commercial software happens: MVP and move forward with no time for rework.

Friday, June 16, 2017

Python Resources


Red Hat was an early adopter, putting Python-based tools into their distros.

Sadly, they've also lagged behind. They haven't gotten much beyond Python 2.6 or maybe 2.7.

RHEL is really good. Except for the Python problem.

Tuesday, June 13, 2017

Another "Problems With Python" List

This: https://darkf.github.io/posts/problems-i-have-with-python.html

  1. Slow.
  2. Threads.
  3. Legacy Python 2 Code.
  4. Some Inconsistencies.
  5. Functional Programming.
  6. Class Definitions Don't Have Enough Features.
  7. Switch Statement.

  1. Find the 20% that needs a speedup and write that in C. Most of the time that's already available in numpy. The rest of the time you may have found something useful.
  2. Generally, most folks make a hash of threaded programming. A focus on process-level parallelism is simpler and essentially guarantees success by avoiding shared data structures.
  3. Maybe stop using the legacy projects?
  4. Yup.
  5. Proper functional programming would requires an optimizer, something that doesn't fit well with easy-to-debug Python. It would also require adding some features to cope the optimization of functional code (e.g., monads.) It seems to be a net loss. And http://coconut-lang.org.
  6. Consider a metaclass that provides the missing features?
  7. I can't figure out why 'elif' is considered hard to use. The more complex matching rules are pretty easy to implement, but I guess this falls into the "awful hacks" category.

What causes me to write this is the lack of concrete "do this instead" for most of the points. It sounds too much like empty complaining.

I hope for some follow-on from this on Twitter:
But I'm not optimistic. It's too easy to complain and too hard to solve.

Tuesday, June 6, 2017

An Epic Fail Example

What's the most Epic Fail I've ever seen?

I was a traveling consulting for almost 35 years. I saw a lot. I did learn from epic fail scenarios. But. I haven't really spent a lot of time thinking about the lessons learned there. I never have a glib answer to this question. Mostly because the stories are incomplete: I came in during a awful mess and left and it was still an awful mess. No arc. No third act. No punchline.

These aren't really stories as much as they're vignettes -- just sad fragments of some larger tragedy. Consequently, they don't leap to front of mind quickly.

One example is a smallish company that had built some pretty cool software in MS-Access. They had created something that was narrowly focused on a business problem and they were clever, so it worked. And worked well.

They leveraged this success, solving another major business problem. In MS-Access. Clever. Focused on real user's real needs. It doesn't get any better than that.

Well, of course, it does get better than that.

They replicated their success seven times. Seven interlocking MS-Access databases. They had subsumed essentially all of the company's information and data processing. Really. It's not that hard to do. Companies buy General Ledger software that doesn't really do very much. You can write a perfectly serviceable ledger application yourself. (Many people ask "why bother?")

When I talked with them they had finally been swamped by the inevitable scalability problem. They had done all the hackarounds they could do. Their network of MS-Access servers and interlocked cluster of databases had reached it's limit of growth.


Questions I did not ask at the time: Who let this happen? Who closed their eyes to the scalability problem and let this go forward? Who avoided the idea of contingency planning? How do you back this up and restore it to a consistent state?

They were in a world of trouble. I told them what they had to do and never saw them again. End of vignette.

(In case you want to know... I told them to get a real server, install SQL-Server, and migrate each individual MS-Access table to that central SQL-Server database, replacing the Access table with an ODBC connection to the central DB. This would take months. Once every single database was expunged from MS-Access, they could start to look at a web-based front-end to replace the Access front-end.)

There are others. I'll have to ransack my brain to see if I've got other examples.

Tuesday, May 30, 2017

Ancient Software and A Small Value of "Works"

I'm not sold about this argument at all.

The "constant replacement" issue has two sides to it. If you follow the thread, there's a certain amount of (appropriate) bashing of the acquisitiveness that -- when exploited ruthlessly -- is a damning indictment of capitalism. There's a lot of value in recognizing the core capitalistic "buy more stuff" part of this.

But there's more.

One problem with this tweet is the threshold for "things that work."

Consumer software tends to be rather complex. It's often fat with features. If the feature you use aren't obviously broken, then you can say the software "works." But it's heavily qualified. Your interesting feature set may be rather small when compared to the whole.

Also, the technology stack tends to be quite tall. Your consumer software sits on top of a consumer OS and consumer-friendly libraries. All of which have to "work" to claim that your software "works."

As with the app, you're only using a subset of OS and runtime library features. This wraps the threshold for "works" in more and more layers of qualification. Smaller and smaller percentages of the code are involved in "works."

I'd suggest there's no other context where things are quite so complex and interdependent as software. The narrow feature set that appears to "work" may be adjacent to numerous security flaws and bugs and unintended consequences. The constant replacement may be necessary bug fixes.

There's more.

Part of the "constant replacement" situation stems from the cost and complexity of innovation. I think that some companies have a rentier mind-set. Once the software is written, they hope to derive ongoing revenue from the software. This doesn't happen because other companies innovate, the product becomes obsolete. So they scramble around trying to make money without doing too much real work.

There are three scenarios where ancient software gets replaced:
  • A new version include bug fixes. See above; the previous version didn't really "work" for large values of work. Blame for using ancient software is deserved. Keeping the old security flaws is not a virtue.
  • A new version is incremental feature creep. This is (potentially) rampant capitalism. Add a little something and sell the product as "new and improved." Keeping the old software because the new features aren't helpful makes sense.
  • Some businesses have a rentier mind-set. They want an ongoing revenue stream. There aren't any material improvements. Keep the old software. Make these people do real work.
Traditional manufacturing business models rely on things wearing out. In the world of atoms, things need to be replaced. A good product has a long future because of parts. And Service. Possibly even customization. I lived on a sailboat made in 1982. I've been carefully rebuilding and replacing pieces all over the boat.

A software "platform" (e.g. OS, database, etc.) can have good long-term value. A software product, however, lives in a hyper-competitive marketplace where improvements appear constantly. The lazy route of non-innovative upgrades is tempting.

I think that there is a place to blame people for having too low a threshold for "works."

Tuesday, May 23, 2017

Python under XCode?!? Cool. Sort of.

Dan Bader (@dbader_org)
"Running Python in Xcode: Step by Step" ericasadun.com/2016/12/04/run…

Thanks for the link. There's a fair amount of "this doesn't seem to be what Xcode was designed to do." But it does seem to work.

I'll stick with Komodo Edit.

Tuesday, May 16, 2017

Needless Complexity -- Creating Havoc Leads to Mistakes

I received the worst code example ever. The. Worst.

Here's the email.
I have hurriedly created a blog post titled [omitted] at the url below
[also omitted]
Unfortunately, I am neither an algorithm expert or a Python expert. However, I am willing to jump in. 
Please review the Python code snippets. I know that they work because I ran them using Python 2.7.6. It was the environment available on my work PC. Speed to get something to the group so that it does not disband outweighs spending time on environments. The goal is not to be Pythonic but to have anyone that has written any code follow the logic.
Also, please review the logic. Somehow, I managed to get the wrong answer. The entire blog post is a build to provide a solution to CLRS exercise 2.3-7. My analysis gave me the answer of O( {n [log n]}**2 ) and the CLRS answer is O(n [log n] ). Where did I screw up my logic?

The referenced blog post is shocking. The "neither an algorithm expert or a Python expert" is an understatement. The "willing to jump in" is perhaps a bad thing. I sent several comments. They were all ignored. I asked for changes a second time. That was also ignored. Eventually, changes were made reluctantly and only after a distressing amount of back-and-forth.

Havoc was created through a process of casually misstating just about everything that can possibly be  misstated. It transcended mere "wrong" and enters that space where the whole thing can't even be falsified. It was beyond simply wrong.

My point here is (partially) to heap ridicule on the author. More importantly, want to isolate a number of issues to show how simple things can become needlessly complex and create havoc.

The Problem

The definition of the problem 2.3-7 seems so straightforward.
"Describe a $\Theta(n \log n)$-time algorithm that, given a set $S$ of $n$ integers and another integer $x$, determines whether or not there exist two elements in $S$ whose sum is exactly $x$."
This brings us to problem #1. The blog post is unable to actually repeat the problem. Here's the quote:
x + y = x0
x ∈ N and y ∈ N for some finite set of integer values N
x0: the integer to which x and y sum
x != y
It's not at all clear what's going on here. What's x0? What's N? Why is x!=y even introduced?

This is how havoc starts. The requirements have been restated in a way that makes them more confusing. The original terminology was dropped in favor of random new terminology. There's no reason for restating the problem. The consequence of the bad restatement is to introduce needless features and create confusion.

The Python Nonsense

A quote:
Concepts are demonstrated via code snippets. They code snippets were executed using Python 2.7.6. They were written in such a way that anyone with basic coding skills could read the code. In other words, the goal was not to be Pythonic.
Python 2.7.6 has been obsolete since May of 2014. At the very least, use a current release.

The goal of using Python without being Pythonic seems to be -- well -- schizophrenic. Or it's intentional troll-bait. Hard to say. 

Another paragraph says this.
The Python community will be annoyed because I am using Python 2 and not 3. Their annoyance is appropriate. Unfortunately, I only have Windows machines and can't afford to screw them up at this point in time.
What? That makes no sense at all. It's trivial to install Python 3.6 side-by-side with Python 2. Everyone should. Right now. I'll wait. See https://conda.io/docs/. Start here: https://conda.io/miniconda.html.

If you're going to insist on using the quirky and slow Python 2, you absolutely must use this in all of your code:

from __future__ import print_function, division, absolute_import, unicode_literals

Python 2 code without this is wrong. If you're still using Python 2, add this to all your code, right now. Please. You'll have to fix stuff that breaks; but we'll all thank you for it. pylint --py3k will help you locate and fix this.

The code with a -2/10 pylint score

I'm trying to reproduce this faithfully. It's hard, because the original blog post has issues with layout.

SomeIntegerList = [1, 2, 3, 4, 5, 6]
DesiredSumOfIntegers = 11
for SomeIntegerA in SomeIntegerList:
 for SomeIntegerB in SomeIntegerList:
  if SomeIntegerA == SomeIntegerB: continue
        SumOfIntegers = SomeIntegerA + SomeIntegerB
        print "SomeInteger A = ", SomeIntegerA, ", SomeInteger B = ", SomeIntegerB, ", Sum of Integers = ", SumOfIntegers
  if DesiredSumOfIntegers == SumOfIntegers:
      print "DesiredSumOfIntegers = ", DesiredSumOfIntegers, " was found"

(The original really could not be copied and pasted to create code that could even be parsed. I may have accidentally fixed that. I hope not.)

Almost every line of code has a problem. It gets worse, of course.

There's output in the original blog post that provides a hint as to what's supposed to be happening here.

Addition is Commutative

Yes. There is an entire paragraph plus a spreadsheet which proves that addition is commutative. An. Entire. Paragraph. Plus. A. Spreadsheet.

Meanwhile, factorial, multiplication, and division aren't mentioned. Why do we need a spreadsheet to show that addition is commutative, yet, all other operators are ignored? No clue. Moving on.


A quote:
Now, let's talk about the number of computations involved in using nested for loops to examine all the possible addition permutations. Here I am using the term permutation as it is strictly defined in mathematics.
First. The algorithm uses all combinations. $\textbf{O}(n^2)$.

Second. "as it is strictly defined in mathematics" should go without saying. If you feel the need to say this, it calls the entire blog post into question.

It's like "honestly." Anyone who has to establish their honesty with "can I be honest with you?" is still lying.

If we're being strict here, are we not being strict elsewhere? If we're not being strict, why not?

The algorithm enumerates all combinations of n things taken 2 at a time without replacement. For reasons that aren't clear. The original problem statement permits replacement. The restatement of the problem doesn't permit replacement.

The n things taken r or 2 at a time problem

There's a table with values for $\frac{n!}{(n-r)!}$

No hint is given as to what this table is or why it's here.  I think it's supposed to be because of this:

$\frac{n!}{r!(n-r)!} \text{ with } r=2 \equiv \frac{n!}{2(n-2)!} \equiv \frac{n\times(n-1)}{2}$

It's hard to say why commutativity of addition gets a paragraph, but this gets no explanation at all. To me, it shows a disregard for the reader: the reader doesn't understand addition, but they totally get factorial.

Another Perspective

A quote
Another perspective is to note that the nested for loops result in O(n^2). Clearly, the above approach is not scalable.
That's not "another perspective." That's. The. Point. The entire point of the exercise is that the brute force algorithm isn't optimal.

The Worst Code Snippet Ever

This is truly and deeply shocking.

SomeIntegerList = [1, 2, 3, 4, 5, 6]
DesiredSumOfIntegers = 11
i = 0
for SomeIntegerA in SomeIntegerList:
    i = i + 1
    j = 0
 for SomeIntegerB in SomeIntegerList:
        j = j + 1
        if j > i:
            print "i = ", i, ", j = ", j
            SumOfIntegers = SomeIntegerA + SomeIntegerB
            print "SomeInteger A = ", SomeIntegerA, ", SomeInteger B = ", SomeIntegerB, ", Sum of Integers = ", SumOfIntegers
      if DesiredSumOfIntegers == SumOfIntegers:
          print "DesiredSumOfIntegers = ", DesiredSumOfIntegers, " was found"

This is what drove me over the edge. This is unconscionably evil programming. It transcends mere "non-Pythonic" and reaches a realm of hellish havoc that can barely be understood as rational. Seriously. This is evil incarnate.

This is the most baffling complex version of a half-matrix iteration that I think I've ever seen. I can only guess that this is written by someone uncomfortable with thinking. They copied and pasted a block of assembler code changing the syntax to Python. I can't discern any way to arrive at this code.

The Big-O Problem

This quote:
Even though the number of computations is cut in half
The rules for Big-O are in the cited CLRS book.  $\textbf{O}(\frac{n^2}{2}) = \textbf{O}(n^2)$.

The "cut in half" doesn't count when describing the overall worst-case complexity. It needs to be emphasized that "cut in half" doesn't matter. Over and over again.

This code doesn't solve the problem. It doesn't advance toward solving the problem. And it's unreadable. Maybe it's a counter-example? An elaborate "don't do this"?

The idea of for i in range(len(S)): for j in range(i): ... seems to be an inescapable approach to processing the upper half of a matrix, and it seems to be obviously $\textbf{O}(n^2)$.

The Binary Search

This quote is perhaps the only thing in the entire blog post that's not utterly wrong.
we can compute the integer value that we need to find. We can than do a search over an ordered list for the integer that we need to find.
Finally. Something sensible. Followed by more really bad code.

The code starts with this

def binarySearch(alist, item):

instead of this

from bisect import bisect

Why does anyone try to write code when Python already provides it?

There's more code, but it's just badly formatted and has a net pylint score that's below zero. We've seen enough.

There's some further analysis that doesn't make any sense at all:
Since the integers that sum must be distinct, the diagnol on the matrix have values of N/A
And this:
Secondly, we should remove the integer that we are on from the binary search
This is a consequence of the initial confusion that decided that $x \neq y$ was somehow part of the problem. When it wasn't. These two sentences indicate a level of profound confusion about the essential requirements. Which leads to havoc.

Added Complication

The whole story is pretty badly confused. Then this arrives.
Complicate Problem by Having Integer List Not Sorted
It's not clear what this is or why it's here. But there it is. 

It leads eventually to this, which also happens to be true. 
The total computation complexity is O(2 * n [log n] ) = O(n [log n] )
That's not bad. However. The email that asked for help claimed O( {n [log n]}**2 ). I have no idea what the email is talking about. Nor could I find out what any of this meant. 

The Kicker

The kicker is some code that solves the problem in $\textbf{O}(n)$ time. Without using a set, which is interesting.

This was not part of the CLRS exercise 2.3-7. I suppose it's just there to point out something about something. Maybe it's a "other people are smarter than CLRS"? Or maybe it's a "just google for the right answer without too much thinking"? Hard to say.

A sentence or two of introduction might be all that's required to see why the other result is there.

Lessons Learned

Some people like to add complexity to the problem. The $x \neq y$ business is fabricated from thin air. It adds to the code complexity, but is clearly not part of the problem space.

This creates havoc. Simple havoc.

Some people appear to act like they're asking for help. But they're not. They may only want affirmation. A nice pat on the head. "Yes, you've written a blog post." Actual criticism isn't expected or desired. This is easy to detect by the volume and vehemence of the replies.

Given a list of numbers, S, and a target, x, determine of two values exist in the set that sum to x.

>>> S = [1,2,3,4,5,6]
>>> x=11
>>> [(n, x-n) for n in S if (x-n) in S]
[(5, 6), (6, 5)]
>>> bool([(n, x-n) for n in S if (x-n) in S])

This follows directly from the analysis. It doesn't add anything new or different. It just uses Python code rather than indented assembler.

This first example is $\textbf{O}(n^2)$ because the in operator is applied to a list. We can, however, use bisect() instead of the in operator.

>>> [(n, x-n) for n in S if S[bisect(S, (x-n))-1] == x-n]
[(5, 6), (6, 5)]
>>> x=13
>>> [(n, x-n) for n in S if S[bisect(S, (x-n))-1] == x-n]

This achieves the goal -- following the parts of the analysis that aren't riddled with errors -- without so much nonsensical code.

This does require some explanation for what bisect(S, i) does. It's important to note that the bisect() function returns the position at which we should insert a new value to maintain order. It doesn't return the location of a found item. Indeed, if the item isn't found, it will still return a position into which a new item should be inserted.

If we want this to be $\textbf{O}(n)$, we can use this:

>>> S = [1,2,3,4,5,6]
>>> S_set = set(S)
>>> x=11
>>> bool([(n, x-n) for n in S_set if (x-n) in S_set])

This replaces the linear list with a set, S_set. The (x-n) in S_set operation is $\textbf{O}(1)$, leading to the overall operation being $\textbf{O}(n)$.

If you want to shave a little time, you can use any() instead of bool([]). If you're not returning the pairs, you can reduce it to any(x-n in S_set for n in S_set). Try it with timeit to see what the impact is. It's surprisingly small.

Tuesday, May 9, 2017

Fizz Buzz Overthought, Project Euler #1, and Unit Tests

This. http://www.tomdalling.com/blog/software-design/fizzbuzz-in-too-much-detail/

And many other thoughts on overthinking fizz buzz. I'm going to overthink it, also. Why not?

This is a problem where the obvious unit test may not cover the cases properly. See https://projecteuler.net/problem=1 for a unit test case with a subtle misdirection built in. I love this problem statement deeply because the happy path is incomplete.

Part of my overthinking is overthinking this as a "classification" exercise, where we have simple classifiers that we can apply as functions of an input value. A higher-level function (i.e., map or a generator expression) applies all of these functions to the input. Some match. Some don't.

It shakes out like this.

The core classifier is a function that requires flexible parameter binding.

query = lambda n, t: lambda x: t if x%n == 0 else None

This is a two-step function definition. The outer function binds in two parameters, n, and t. The result of this binding is the inner function. If an argument value is a multiplier of n, return the text, t. We can think of the result of the outer lambda as a partial function, also, with some parameters defined, but not all.

We have two concrete results of this query() function:

fizz = query(3, 'fizz') 
buzz = query(5, 'buzz')

We've bound in a number and some text. Here's how the resulting functions work.

>>> fizz(3) 
>>> fizz(2)

The "trick" in the fizz buzz problem space is recognizing that we're working with the power set of these two rules. There are actually four separate conditions. This is remarkably easy to get wrong even though the sample code may pass a unit test like the Project Euler #1 sample data.

Here's the power set that contains the complete set of all subsets of the rules.

rule_groups = set(powerset([fizz, buzz]))

"Um," you say, "Is that necessary? And powerset() isn't built-in."

Try to add a third or fourth rule and the $\textbf{O}(2^n)$ growth in complexity of checking all combinations of the rules will become readily apparent. For two rules, $4 = \lvert\mathcal{P}(\{q(3), q(5)\})\rvert$. For a general set of rules, $r$, it's $2^{\lvert r \rvert} = \lvert \mathcal{P}(r)\rvert$. Four rules? sixteen outcomes. It sure seems like the power set is absolutely necessary; it describes the domain of possible outcomes. How could it not be necessary?

Also. There's a nice definition of powerset in the itertools recipes section of the standard library. It's *almost* built-in. 

The domain of possible responses form a power set. However, it's also clear that we aren't obligated to actually enumerate that set for each value we're testing. We do need to be aware that the complexity of the classification output is $\textbf{O}(2^n)$ where $n$ is the number of rules.

The processing to build each set of classifications, however, is $\textbf{O}(n)$. Here's how it looks.

for n in range(20):
    m = set(filter(None, (r(n) for r in [fizz, buzz])))
    print(m if m else n)

This locates the set of all matches, m. We apply the rules, fizz() and buzz(), to the given value. The result of applying the rules is filtered to remove falsy values. The resulting set, m, has the values from all rules which matched. This will be one of the sets from the power set of applying the rules to each value. The match, $m$, is an element of $\mathcal{P} (\{q(3), q(5)\})$.

I'm delighted that Python has some support for creating partial functions in a variety of ways. When things are complex, we can use the def statement. We can use functools partial(). When things are simple, we can even use lambdas.

Tuesday, May 2, 2017

Functional Python, Literate Programming & Trello Board Analysis

The general advice to people using Kanban/Agile Project boards of various kinds is this:

Stop Starting -- Start Finishing


There's a lot of this advice. Some of it is helpful.

Many tools have various dashboards and metrics computations.


The basic velocity calculations -- starts v. finishes -- is pretty straight-forward. The rules to classify a Trello action as "start" or "finish" are actually nice examples of simple functions or lambdas. Which also means that the basic pipeline required to gather the data can be written as a lazy, functional process.

Which leads to writing a Literate Programming version of a small program that gathers data from a Trello board.


It's a kind of deep-dive into some aspects of Python functional-style programming. It's also a dive into Literate Programming via a longish example. And it has a fair number of type hints. It's not perfectly clean from MyPy-'s analysis. So there's some more to do on that front.

Tuesday, April 25, 2017

Modules vs. Monoliths vs. Microservices:

Dan Bader (@dbader_org)
Worth a read: "Modules vs. Microservices" (and how to find a middle ground) oreilly.com/ideas/modules-…

"don't trick yourself into a microservices-only mindset"

Thanks for sharing.

The referenced post gives you the freedom to have a "big-ish" microservice. My current example has four very closely-related resources. There's agony in decomposing these into separate services. So we have several distinct Python modules bound into a single Flask container.

Yes. We lack the advertised static type checking for module boundaries. The kind of static type checking that doesn't actually solve any actual problems, since the issues are always semantic and can only be found with unit tests and integration tests and Gherkin-based acceptance testing (see Python BDD: https://pypi.python.org/pypi/pytest-bdd and https://pypi.python.org/pypi/behave/1.2.5).

We walk a fine line. How tightly coupled are these resources? Can they actually be used in isolation? What do the possible future changes look like? Where is the swagger.json going to change?

It's helpful to have both options on the table.

Wednesday, April 19, 2017

AWS "Serverless" Architecture Update -- Python 3.6 News

This: https://aws.amazon.com/about-aws/whats-new/2017/04/aws-lambda-supports-python-3-6/

You can now use Python 3.6 Lambdas. This changes things dramatically. We can now write faster, simpler, less quirky code using the latest-and-greatest Python.

If you want to configure a server in the cloud, consider this: https://wiki.ubuntu.com/Python. Use Ubuntu as the base image. Faster. Cleaner. Less Quirky.