Associated Press going too far?

Recently the widely used wire service has started charging for excerpts using more than 4 words. Under fair use rules and established copyright law authors of editorials and reviews are ordinarily permitted to excerpt from a work without permission from the copyright holder as long as the source is properly identified.

In a NY Times article dated July 23, 2009:

“In an interview, [AP’s chief executive Tom Curley] specifically cited references that include a headline and a link to an article, a standard practice of search engines like Google, Bing and Yahoo, news aggregators and blogs.” —Richard Perez-Pena, NY Times

AP has been on a road to protect its content and revenue stream for a while taking what some see as overreaching measures. Last year, the Drudge Report’s liberal opposite Drudge Retort was at the center of a publicized dispute (about the case) over content posted by its contributors. AP’s legal eagles have been reportedly going after a number of sites and bloggers, and a form of automated DRM (AP’s press release on the subject) is also being rolled out.

In the last couple of days, many are pointing to James Grimmelmann’s experiment where he was charged to license words clearly in the public domain (Grimmelmann’s blog post). Mr. Grimmelmann was refunded his money and the license rescinded having no implications on his use of them given that the words – a quote by Thomas Jefferson – are already in public domain. AP likened the experiment to running an item you’ve already paid for through a self-serve grocery store checkout station.

The iCopyright system which Grimmelmann used to license the content runs on the honor system, and according to others licensees can copy paste text into it that they intend to use. The system is relatively simple and apparently at least in this case did not check the public domain status of the content. Given that we are dealing with a system that already provides the user some discretion about what’s entered, is the Grimmelmann story a minor procedural flaw and not really a big deal?

As AP points out, reporters should be paid for their work. Why should bloggers – especially those few who turn a profit – be freeloading on a hardworking industry that requires real people and real infrastructure on the ground asking questions and going places. Furthermore, in today’s era of sound bytes, sometimes just a few words tell the story.

Traditional media’s revenue sources and business models are on shaky ground with print – a big part of AP’s customer base – in the most precarious position. When content can be syndicated, rehashed or even originated by so many people, will centralized sources large enough to generate real advertising and subscriber revenue survive? Can AP, Reuters, CNN or any other large privately funded news organization survive in the coming climate?

Some say what’s most alarming about the situation is the notion that fair use is per policy being thrown out altogether which could set a dangerous precedent.

LINK: Interview about Data Grids by Ryan Slobojan w/ Cameron Purdy, VP Development at Oracle

Interview with Cameron Purdy, VP Development at Oracle, about data grids. Interesting insights, and several things I’ve been saying for a good while. 🙂

Innovation Process: Limitations of Schemas

Once upon a time, I took a college class on interpersonal communications. We discussed schemas upon which the brain operates. Interestingly, in marketing – the subject designed among other things to manipulate or aid in the manipulation of the human psyche for increased profit – we discussed schemas upon which the brain operates.

Then, in a class on neural networks we discussed why brains both organic and artificial tend to remember the first and last things they learned about a specific topic. Furthermore, we talked about how schemas within these brains operate.

Speaking to a technical crowd: SQL operates upon very rigidly defined schemas. Ordinarily, we have tables with columns defining things like people’s names and addresses and telephone numbers and dates of birth and gender and what have you.

Schemas are wonderfully robotic – if by robotic you mean those old conceptions of robots from 1950’s sci-fi. Simplistic notions of schemas tend to dictate that we approach the world very deterministically, very discretely (and I don’t mean privately) and logically. I say, wrong!

Schemas mean patterns. We and most organisms with neurons learn by association. We start with some hard wired axioms and go from there. Break the pattern, and things become difficult to understand. While most “out of the box” thinking is I might argue pretty boxed in, the theoretical ideal of “out of the box” operation is to go beyond the schemas. Is this possible? I don’t know. But maybe we can combine schemas.

Most attempts at productivity are based on refining operations into consistent, easy to follow schemas. In software design, we use design patterns to enforce models that we can wrap our brains around – or at least – having spent much time banging our heads against walls now have a particular schema thoroughly beaten in … and might as well recycle.

Consistent, reusable schemas are absolutely wonderful for Model T’s, Model F’s and many things that churn down an assembly line. Plenty of simple database-driven software can be built perfectly well with a lot of recycled thought.

Now, there is an antiquated saying in research with words to the effect: Before wasting your time going down a road much travelled to re-invent the wheel, the donut, what have you … see if somebody else has done it first and better. If you’ve got something on a shelf, pull it off and use it. Great. This works 99% of the time when you’re not producing new schemas. There’s a ton of value in evolutionary steps and applying something from one schema into another.

However, once in a while we want to do something revolutionary. We don’t start from zero. We are surrounded by many good schemas; old solutions to old problems should often prevail. Then there comes a time when we must come up with a schema we believe to be genuinely new. New? Is there such a thing as a new schema? I have no idea. I would venture to say there likely is not; all schemas are combinations of others in some way; everything is based upon association of one form or another. I don’t care. I’ll leave this subtle point for the philosophers.

To me, I care about not being constrained by old schemas. The less I know sometimes the better. The less structure I have sometimes the better. I want to look at my problem, flail about, come up with a half-baked solution and then plug the holes with somebody’s tried and true schema.

If I’m operating under tremendous structure, I can’t do this. The wonder of iterative design is in some sense a means to apply my very semi-structured process. Iterative improvement allows one to drift about for a solution, come up with something new and then not waste too much time dawdling on unnecessary details.

That’s my 3.5 + rand( rand(34) )^rand(2/rand(5)) cents. Ironically, this article itself is bound by structure. Go figure.

My Project – Better Information: It’s Coming

As many close to me know, I have spent the last few years working on a largely stealth project. The original idea hatched in late 2005 on a 25 hour journey to visit a friend in Singapore.

The project remains mostly in stealth, but I will make some public comments.

Broadly speaking, today’s information suffers from intentional and unintentional inaccuracy, bias, incompleteness, inconsistency, inefficient presentation and other problems.

I look to bridge the gap between masses of loosely structured information and usable knowledge. Raw data needs to go to real wisdom in your brain … faster.

To this end, my team has explored many solutions both technological and non-technological.

Stay tuned.

Metered Broadband? It’s Not Particularly New or Totally Evil – A Brief Introduction to Commercial Bandwidth Services Pricing

Many consumers are in arms over announcements by several providers that they will begin charging overage rates or limiting data transferred. In fact, much of the hosting industry and higher end commercial solutions provide Internet connectivity on a basis of (1) the physical line and (2) the amount of bandwidth actually used.

For example, a provider might charge $20 a month for a network connection that might have a capacity of a 100 megabits or 1000 megabits per second. The provider might then charge a separate fee depending on how much of that connection is used. Bandwidth is often metered on a megabit per second (8 megabits in a megabyte) or based on the total amount transferred often measured in gigabytes.

When bandwidth is sold on a megabit per second (often abbreviated “mbps”) utilization rate, it is often metered by reading actual bandwidth flowing through the connection every so many minutes. In industry standard 95th percentile billing, the highest 5% of those readings are thrown out. The customer is then billed based on the “sustained 95th percentile”.

Under 95th percentile assuming a monthly billing cycle, the customer could in principle use much more bandwidth than usual for up to about 36 hours and would not be billed for the increased amount. So, the 5% in 95th percentile lets customers retain some flexibility for less frequent “bursts”.

Per “bucket” or “data transferred” billing is just so much money per gigabyte (or other amount).

Customers typically pay for:
(1) the line –
Physical line or uplink to provider.
(2) commit –
The amount of bandwidth for which the customer agrees typically over some contract term to purchase. This bandwidth is sold at a “commit rate” which is often less expensive than the overage rate.
(3) overage –
The amount of bandwidth over the commit rate. Overage bandwidth is sold at an “overage rate” which is often double or so the commit rate.

Higher overage vs. commit rates encourage customers to take on larger commits ensuring ISPs can better plan their infrastructure. Higher overage rates also account for the inherently often higher cost and over provisioning necessary to provide services when demand is less predictable.

A service provider that has unpredictable bandwidth utilization must choose among (1) over provisioning infrastructure and charging more money for services or (2) providing a lower quality of service particularly at peak times and likely cutting corners elsewhere.

A service provider that has many customers all paying the same rate but using very different amounts of bandwidth must (1) charge all customers higher rates or (2) deliver a lower overall quality of service to all customers.

Your power is metered. Your cell phone is metered. You pay for the gasoline you burn in a car. You choose whether to buy expensive or inexpensive products. You choose the nature and quality of what you consume often based on what you’re willing to pay.

In spite of some of the uproar, I believe that charging for or even capping bandwidth based on usage is in fact fair. Implemented properly, such efforts could result in a higher quality of service for all consumers.

The key issue should be whether prices charged for overage and larger commits are fair.

Unapologetically Embracing the Term: Artificial Intelligence

In a college course on neural networks, a professor once described to the class how the reputation of artificial intelligence had taken a nose dive in the 1980’s. A divided community and its pundits had built up a perception that C-3PO-like robots and talking, thinking computers were not far off. AI’s visionaries over promised and under delivered.

To this very day, entrepreneurs hesitate to utter the words “artificial intelligence” for fear of losing credibility. Various systems are often called by more specific names whether it be “Bayesian classifier”, “prediction system”, “search engine”, “knowledge base”, etc. These terms all have various meanings known well to the AI community, but we dare not lump them together and utter the words “artificial intelligence.”

There are plenty who would say I am bastardizing terminology. Artificial intelligence’s very definition is gray. Is a car engine that employs a neural network to manage a fuel air mixture actually intelligent? Is Google intelligent? At what point is information retrieval AI? Is a spell checker AI? As many others have said before me, I take the viewpoint that AI (or oftentimes things that apply AI) is a continuum without clearly defined boundaries.

Rather than trying to carefully classify certain algorithms, I devise solutions that make use of various methods that might be borrowed from an AI textbook, might arise from mathematics or that simply come from my own ideas. If the approach is particularly probabilistic without adhering to well defined mathematics or relies on certain kinds of innovations employing non-deterministic or difficult to predict behavior, I tend to call it AI.

During the course of applying or developing AI, I rarely use such words as “artificial” or “intelligent”. Afterall, to me I’m just building a program in a way that makes sense to me.

The most difficult to solve problems in practical applications tend to be those with many possible answers or no exact answers. We run into cases where we cannot build a computer program to solve the problem with a reasonable amount of time and computing resources. Other times, given infinite time and resources the problem is still unsolvable. In computer science, these are often problems said to have “non-polynomial” solutions. For such problems, we can not solve them at all or we must devise a solution that provides an approximate answer.

Approximate answers to hard problems very often involve smart solutions — artificially intelligent solutions. Much of AI is about reducing a problem to what matters most and then pumping out a best guess … just like real human beings semi-solving real problems.

As we approach more human or more intelligent approaches, I’m unafraid to call these solutions “artificial intelligence”.

With vast amounts of computing power and more creative approaches to problems, I believe our constraints to building pretty good solutions are more and more just the limitations of our own minds. And even there, plenty of AI algorithms do things their own creators (including myself) don’t fully comprehend.

I don’t think about “am I solving a problem logically and intelligently” so much as I try to approach all problems logically and intelligently. But if you ask whether I’m building AI … most of the time in these situations my answer will be “Yes, in which shade of gray?”

OpenCL – common framework for CPU+GPU computing

Very interesting:

“OpenCL is a programming framework that allows software to run on both the CPU and the graphics processor of the computer.”

“…earlier this year Apple offered OpenCL to the Khronos Group, a standards-setting organization, and Intel, Nvidia and AMD joined forces to create a standard that would work on multiple chips.”


Thanks JranDe for the heads up on this.

Wikipedia has a code example:

Introducing COHESION – highly automated open source ORM for Java — CALLING CONTRIBUTORS!

My first open source project …

Think features of ActiveRecord + Hibernate + a little more – limitations on some data structures. Designed for maximum ease and speed of development for common applications.

The goal: classInstance1 );
// boom! – cohesion creates a table(s) if need be
// record gets saved

// look ups by example class
classInstanceExample.setName( “Sparky” );
classInstance2 = orm.load( classInstanceExample );

// or by field names and values
Map m = new HashMap();
m.put( “name”, “Sparky” );
classInstance3 = orm.find_by_example( m );

Looking to bring ActiveRecord-like functionality to a Java platform with the added Hibernate style bonus of being able to generate a schema automagically … but even better … on the fly from the class using reflection. Unlike Hibernate – no annotations or schema definitions necessary.

Cohesion is not a clone or port of ActiveRecord or Hibernate but meant to provide similar functionality drawing on the strengths and lessons of each of these very important and powerful projects. At least for a good while, I do not anticipate Cohesion will provide the same performance as more mature products like Hibernate, but it will be easier to code with.

I have preliminary code for doing lists using joins and definitely borrowing some ideas from Hibernate. Barring some pretty big contributions from others, I expect some limitations on more complex data structures at least in any early versions.

SourceForge project:

Browse source:

This is code I started on last year and decided recently to open source currently under an Apache 2 license. I’m open to some discussion on licenses.

If this project makes it to maturity it could provide a very widely used fundamental building block for a lot of development and improve productivity in a lot of places.

Also a founding member, Matthew Molinyawe will be working on this project with me.

Please do comment/drop me or Matt a line if you wish to contribute.