The Web is Dead: Long Live Pew

I love the Pew Internet Research group. Great objective data that is generally free to access. I trust them.

They just released this report on the future of the Web and Apps (on mobile devices). Here is the report.

This debate is incredibly entertaining, in large part because it is so misconstrued. But first, let’s get some definitions out there:

The Web: Accessing applications and content through a web browser.

An App: Accessing applications and content through a dedicated application.

The Internet: The network infrastructure, or plumbing, used to drive most Web and App interactions.

Most of the debate centers on whether we are losing the free and open Web to the closed and controlling Apps.  This is also where it all breaks down. The Web and Apps are generally both built on the Internet. The Internet, current attempts notwithstanding, is still essentially a free and open infrastructure. The power lies with the infrastructure, not with how we access the infrastructure.

Businesses have a choice in how they deliver their applications and content. Consumers have a choice in how they consume applications and content. To the extent that we are losing the ability to consume via free and open interfaces it is by consumer choice, voting with the wallet… but we have not lost the underlying free and open infrastructure.

In reality we are not seeing anything new. What is the difference between a web based pay wall and a mobile app, both of which deliver news? They are both in effect closed. Or an App that is simply a web browser? It is open. We can build a closed Web experience and an open App experience. We do not lose the open infrastructure. And we should all rest assured that we will see some other form of the Web vs. App debate in the coming years as the consumption layer continues to evolve.

Don’t worry about the Web and Apps, worry about the underlying pipes, the Internet and the associated protocols that make this fun debate possible.


Dedicated Devices or Services: Will the Real Value-Add Please Stand Up?

I like to cycle in the Bay Area. Originally I had a simple “cat eye” cycling computer that tracked my speed and mileage by wiring a sensor to the wheel. Sometimes it would get wonky and think I was going half speed or not moving at all. Today I have a Garmin cycling computer with GPS and a configurable display of real-time data. It keeps track of, well, pretty much everything: route map, heart rate, moving time averages, total time averages, temperature, top speed… the list goes on. It can get wonky too, it sometimes loses the satellite connection, but it can usually recreate the route based on map data. When I get home I plug it in and I can see all of my ride information on a dedicated Garmin website. One entry for each ride. I can even see the same details of the workouts that other folks are doing. Good stuff.

So what does this have to do with mobile devices, cloud based services, and where the real value lies? A good friend of mine just introduced me to Strava. While Garmin is providing a device (the cycling computer) and a service (the web site), Strava only provides the service. They let me upload data from my Garmin device as well as iPhones and Androids.

Garmin’s site is pretty nice. It works well, the data is presented intelligently, and I can store as many rides as I want. Strava has only the service to hang their hat on, and you can tell. They offer everything Garmin offers and much more. When I upload my information the ride is broken into segments, such as a really hard hill climb. It then shows me my time on that segment and compares me to other riders. If I’ve done well enough it awards me a badge showing off my time and prompting others to beat it. Strava has taken a good service and made it outstanding while Garmin is balancing their outstanding device with a pretty good service.

So wait a minute, why is Garmin spending time manufacturing devices when they are being outdone on the service side and the iPhone or Android can do it all as a device? Well, the iPhone doesn’t have a heart rate monitor, it doesn’t support the ANT standard that the specialized devices support for additional input (e.g. power output), and the battery life is inferior. It looks like the Garmin devices have enough differentiation, or do they? A quick survey finds DigiFit has a pluggable heart rate monitor based on the ANT+ standard for the iPhone. Another company, Wahoo, has a similar offering. Cool, now my music comes with my cycling computer too! I imagine the Garmin folks are not very happy watching this unfold.

Anyone spending resources on developing hardware devices should be asking themselves hard questions right now. The proliferation of mobile devices is a proliferation of small general purpose computers that can take on almost any task. Witness the fall of the Flip, a dedicated mobile video device, when suitable video recorders were included in the latest generation of smart phones. Look at Square, which Visa recently invested in. They added a small credit card reader to enable iPhone based credit card purchases. The hardware that was once specialized is now offered built-in or via modules that can be added later. The general purpose mobile device is like a Swiss Army knife that can be expanded to add the latest blade. This approach is nothing new in the world of desktops and laptops, but it is disrupting the mobile device market in a much grander fashion because we already have so many dedicated devices that are being replaced: cell phone, point and shoot camera, video recorder, music player, portable game player, watch. It doesn’t make sense to attach a laptop with a GPS card to my bike, but I’m more than happy to bring along an iPhone, in my back pocket or directly attached to my bike.

So what questions should the niche mobile device vendors be asking? Questions focused on the service. The physical device is becoming a platform in most cases, a commodity part of the solution. It is not THE solution in much the same way a browser is part of but not THE solution when visiting a web site. If Flip had a more compelling service behind it I might be using Flip for the iPhone, instead I’m using iTunes. As a consumer I don’t want another device to worry about, I want an outstanding service that will stick with me across devices.


BigData: when transactions, analytics, and search collide

This is a longer piece on BigData that came together over the last six months. Had to wait a bit for this stripped down version to appear in DBTA (thanks for publishing DBTA!)…

What happens when the information your business depends on is too big, slow, and brittle to keep up? Do you have strategies in place to deal with massive information sets and new types of complex and variably structured information? These are the types of questions many of the customers I’ve worked with are facing. I am going to provide a view into the problems I see customers running into and the technology trends I expect to see as more businesses grapple with these problems. The answers to these questions are bigger than any one solution, which is why it is an incredibly exciting time to be involved in information management technology. Something strange is afoot at the Circle K! There is one term in particular that attempts to capture a lot of this excitement: BigData.

The term BigData has been barnstorming the IT world as solutions from the biggest, those in between, and the shiny new things are getting their arms around BigData. However, talk to two people involved in information management and you will likely get two different definitions of BigData and the associated problems. It has so far defied any consistent definition. And while analysts are happily taking the cue, they are creating more rigorous definitions with different names: Extreme Data (Gartner), Total Data (451 Group), and Total Information all center around the same issues of managing information. Ah, well, this is enterprise software after all, where, as an industry, many thrive on incredibly precise but inconsistent definitions and “standards”.

More recently there has been some convergence around the 3 V’s definition of BigData: Volume, Velocity, and Variety. I like this definition. Gartner has been and appears to still be keen on it. IBM very recently got right in front of this particular parade. Kudos to them for jumping in front. Someone needed to.

What these terms and definitions have in common is that they are attempting to encompass the issues businesses are running into as they manage increasingly large sets of information. To a certain extent the issues are relative: what is a large data set to one business is an hour’s worth of activity to another. While the scale may be different between organizations, as pressure is applied to any given information management system the same set of problems tends to appear, sometimes just one, sometimes a whole bunch. Rather than attempt to define BigData or create a new term, which I happily leave to the pros, I’ll talk about the issues and trends that I have seen when working with customers and tracking different approaches to these problems. These customers routinely address information management problems in the terabyte range, are actively solving petabyte issues, and use the term exabyte in all seriousness when doing longer term planning.

First, a fun example that has all the hallmarks of a pressure filled information management environment. Cantor Gaming is in the business of making bets. More importantly they offer a specific type of sports wager called in-running which allows people to place bets on real time game outcomes like whether a baseball batter will strikeout during a given at bat or a football team will turn the ball over on the current drive. This is not a customer I have worked with but Wired did a fantastic write-up of this operation and described their custom built system for managing their information. Similar to the Wall Street systems it was modeled after, the Cantor system is managing information in a pressure cooker: The system churns through historical sports data, feeds in live information, sets betting lines, and manages bet transactions, all in real-time, and all of which is fed back into the system.

As the article notes, this is similar to what Wall Street has been doing with their custom transactional systems for years. Wall Street routinely works with huge rapidly changing data sets. They also have immense incentives to leverage this information as quickly as possible. They have, of course, been cutting their teeth on the problems these data sets create for years. Seemingly simple tasks, such as loading historic data sets or recovering lost data, are problematic once the data sets become too large.

This is important because the rest of the world is catching up in terms of their information needs. But the rest of the world cannot always dedicate, or even find, a team capable of developing and managing this type of system. As more businesses need to generate, manage, and analyze BigData they are increasingly looking to commercial vendors and open source projects to provide a stable foundation for their solution. Twitter and Facebook may be high profile examples but they are not alone. Just look at what Zynga, FourSquare, and others are doing these days using commercial and open source software. The information is overwhelming yet extremely valuable. A critical requirement for these businesses is to be able to scale. And the information just keeps growing: 10 fold in 5 years according to some estimates, such as a BAML BigData report.
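To put the 10 fold in 5 years estimate in perspective, here is a quick back-of-the-envelope calculation of the yearly growth it implies. It assumes the growth compounds evenly each year, which the estimate itself does not claim, and the function name is made up for illustration.

```python
import math

def implied_annual_growth(total_factor: float, years: float) -> float:
    """Return the constant yearly growth rate that compounds to total_factor over years."""
    return total_factor ** (1.0 / years) - 1.0

# Assumption: growth is spread evenly across the five years.
rate = implied_annual_growth(10, 5)
doubling_years = math.log(2) / math.log(1 + rate)

print(f"Implied growth per year: {rate:.1%}")
print(f"Years to double: {doubling_years:.2f}")
```

Under that assumption the data grows by roughly 58% per year, which means it doubles about every year and a half.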

Managing this data creates a host of challenges, some new and some old. These are the key pressure points I see businesses grappling with:

Capacity (Volume): The capacity runway has been used up.  Incremental improvements and the onward march of CPU/RAM/Storage capacity will help but information growth is fast outpacing those improvements. The impact can be seen across the infrastructure stack with transactional data stores, commodity search engines, and data warehouses unable to simply keep up with the total volume of information.

Mixed Data (Variety): Customers are working with many different types of information from raw text and precise tables to complex denormalized structures and huge video files. Traditional systems can usually store this information but can’t leverage it to create value.

Throughput (Velocity): Simply ingesting or routing new information is a challenge in and of itself. Consider the updates of 500 million Facebook users. Traditional transactional systems are not designed to ingest or update at these rates.

Real-Time: It is no longer acceptable to wait hours for database updates to reach a search index because the entire index needs to be rewritten. Navigational interfaces like facets need to reflect the information that is available right now. New information needs to immediately show up in searches, be included in analytics, and shared with other systems in real-time.

Load and Restore: Information sets are getting so large that customers cannot load them into other systems in a timely manner or, worse, restore them if the primary system is corrupted. As systems are pushed to the brink simply ingesting data, some customers have actually abandoned the notion of restoring the information.

People: Often overlooked when talking about technology solutions, talented and well-trained people are required to run traditional systems. This pool is already limited. Systems that can manage massive information sets are often built from scratch which can naturally lead to a very small number of people that actually know how the system works. This is a very real and scary proposition for teams running their business on these systems. The developer meeting the proverbial bus is generally not replaceable for months or years, the time it takes to hire and train an expert.

Time to Rethink Some Basics

These issues are pushing us to rethink core principles of information management systems. Much of this work has started but it is far from complete. Following are the trends that I am seeing today and believe will be critical to successfully managing increasingly large information sets.

Scale and Performance: Terabytes and petabytes of information are becoming common. As we start to actually use this information and plot the growth curve, exabytes are a clear part of the near future. New architectures are required to support the basic storage and retrieval of these information sets. Massively Parallel Processing (MPP) architectures seem to be the common approach. But even the most advanced MPP clusters will max out before they can reasonably address the storage needs we see coming over the next several years. I doubt we will see wholesale architecture shifts but we will see significant modifications to existing MPP systems such as vertical scaling strategies and dedicated task specific sub-clusters, that, along with incremental improvements in CPU, RAM, and disk drive performance, will help us reach these scaling and performance needs.

Storing and accessing information is only the beginning. Complex queries, sorting, information analytics, and information transformation must also occur in sub-second time in order to support the systems being developed today. MPP architectures can do some of this through parallelization but this can only take us so far. I expect to see specialized implementations of common algorithms, such as sorting or facet generation, that use near real time strategies such as caching to bring performance and data freshness into acceptable real-time ranges. Other architectural strategies will include dependence on in-memory processing and integration of specialized processing systems. For example, MapReduce has been integrated into multiple database systems in order to extend the processing capabilities on large information sets. I believe MapReduce has a future as a standard feature of MPP information management systems as well as a standalone dedicated system.
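As a rough illustration of the facet generation idea, the sketch below keeps facet counts up to date as each document arrives instead of recomputing them from the whole data set. It is a toy, single-process Python example; the class and method names (FacetIndex, add, remove, counts) and the sample records are made up for illustration and are not taken from any particular product.

```python
from collections import Counter, defaultdict

class FacetIndex:
    """Toy facet index that updates its cached counts on every ingest."""

    def __init__(self, facet_fields):
        self.facet_fields = list(facet_fields)
        # One running Counter per facet field: field -> value -> document count
        self._counts = defaultdict(Counter)
        self._docs = {}

    def add(self, doc_id, doc):
        """Ingest or replace a document and adjust the cached counts."""
        if doc_id in self._docs:
            self.remove(doc_id)
        self._docs[doc_id] = doc
        for field in self.facet_fields:
            if field in doc:
                self._counts[field][doc[field]] += 1

    def remove(self, doc_id):
        """Delete a document and back its values out of the counts."""
        doc = self._docs.pop(doc_id)
        for field in self.facet_fields:
            if field in doc:
                counter = self._counts[field]
                counter[doc[field]] -= 1
                if counter[doc[field]] == 0:
                    del counter[doc[field]]

    def counts(self, field):
        """Return the current facet counts without scanning the documents."""
        return dict(self._counts[field])


index = FacetIndex(["sport", "league"])
index.add(1, {"sport": "baseball", "league": "MLB"})
index.add(2, {"sport": "football", "league": "NFL"})
index.add(3, {"sport": "baseball", "league": "MLB"})
index.add(2, {"sport": "baseball", "league": "MLB"})  # an update replaces doc 2
print(index.counts("sport"))
```

The trade-off is the usual one with caching: a little more work on every write in exchange for facet reads that never touch the full data set. A real MPP system would also have to merge these counts across nodes, which this sketch ignores.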

Flexibility and Specialization: Massive information sets are often comprised of many types of information, typically denormalized. The information can range from pure text to binary videos and everything in between. This is sometimes referred to as unstructured or semistructured information, but these names can be confusing because they typically imply all information that is not relational.

What is important is that the system can support changing data types. Whether it is a new field or an entirely new type, managing these changes in the broad BigData context at large scale breaks down. These systems must be able to flexibly handle information of varying degrees of complexity. The onus will not be on the developer or administrator to preview every piece of information and adjust the system accordingly, it will be on the tools to work with the information as is and allow the developer to incrementally master the information, to discover how the information can be used over time rather than expecting to know a priori. What if you could assign relationships, or even discover relationships, within your information throughout the lifecycle of the information without doing invasive information surgery?

With such large data sets in place there is pressure to do more with the information. Creating multiple systems and moving the data around is a non-starter in many cases. This means we expect to do more with the information within a single system. While I don’t believe specialized information systems such as OLTP will be disappearing, I do believe we will see a class of systems that claim their specialty is scale. These systems will need to provide top tier support for transactions, queries, search, analytics, and delivery. Real-time in situ processing of information at 100’s of Terabytes and Petabytes will become the norm.

Different types of information also demand different ways to access and manipulate them. Many of the NoSQL systems are implementing SQL. This is not because they want to create the next great RDBMS, they simply recognize that for some tasks within their environment SQL is the best choice of language. Following from the idea that modern systems will support multiple modes of information processing, I expect to see an increasing number of databases that are not tied to one specific language for manipulating the information but support many languages. These systems will provide many lenses through which to view the contents.

An example of a project that is driving this type of convergence is a customer that needs to search document contents alongside semantic relationships. This required creating a data store to hold hundreds of millions of documents and billions of semantic triples. It needed the flexibility to store multiple data types, some unknown, and query and search across the contents. The system also streamed new documents in, extracting semantic relationships while ingesting. While this type of project could in principle be pulled off with traditional systems, the complexity, time, and cost of implementing it with a relational database made that infeasible.
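To make the idea of searching document contents alongside semantic relationships a bit more concrete, here is a very small in-memory sketch. It is not how the customer system was built; the names (TinyStore, ingest, match, search) and the sample data are hypothetical, the triples are supplied by hand rather than extracted during ingestion, and a real system would work at a vastly larger scale.

```python
class TinyStore:
    """Toy store holding documents plus (subject, predicate, object) triples."""

    def __init__(self):
        self.docs = {}        # doc_id -> text
        self.triples = set()  # (subject, predicate, object)

    def ingest(self, doc_id, text, triples=()):
        """Store a document together with triples supplied by the caller."""
        self.docs[doc_id] = text
        self.triples.update(triples)

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return [
            t for t in self.triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        ]

    def search(self, word, p, o):
        """Find docs containing `word` whose id is the subject of (?, p, o)."""
        subjects = {t[0] for t in self.match(p=p, o=o)}
        return sorted(
            doc_id for doc_id, text in self.docs.items()
            if doc_id in subjects and word.lower() in text.lower().split()
        )


store = TinyStore()
store.ingest("doc1", "Quarterly earnings rose sharply",
             [("doc1", "mentions", "AcmeCorp")])
store.ingest("doc2", "Earnings fell after the recall",
             [("doc2", "mentions", "OtherCo")])
store.ingest("doc3", "New product launch announced",
             [("doc3", "mentions", "AcmeCorp")])

# Text search ("earnings") combined with a semantic constraint (mentions AcmeCorp)
print(store.search("earnings", "mentions", "AcmeCorp"))
```

The point of the sketch is that one query combines a text condition with a relationship condition in a single store, rather than joining results from a search engine and a separate database.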

People and Ease of Use: The number of systems that are generating huge amounts of data is growing faster than the number of people with the skills to manage them. While many of these are still custom systems, the output of this custom development (e.g. Voldemort) is making it into the public light along with commercial systems focused on the same issues. While I expect the number of systems to consolidate, and therefore consolidate areas of expertise, I believe we will have a shortage of skilled administrators and developers over the next several years.

Part of the reason these systems have become popular is that it is quite easy to get started with them. This doesn’t mean it is easy to develop a massive scale production application with them though. In order to ride this wave of folks experimenting and trying out these databases, I believe vendors will continue to focus on making it as easy as possible to get started with their technologies while requiring heavy lifting tasks of those that are pushing their technology to the limits.

What does it all mean?

Just imagine: A system that can scale to petabytes, supports almost any type of information, allows ad hoc queries and analytics, is transactional and searchable in real-time, and is easy to work with… and is of course low latency. Some might claim this system is here today. I believe it may be close in some cases but most likely will need to bring together specialized systems in new ways.

A lot of work to be done, and it is exciting. What is the next architectural step that will get us to petabytes and exabytes? What type of specialization or commoditization will we see with this technology? Who can we call on to develop and manage these systems?

We’ll start to see high profile answers to these questions in 2011 and 2012. I believe we will see not only significant investments in BigData from the name brands to drive these answers, but we will also see some breakthroughs in architectures. Whether integrating existing specialized systems in novel ways or the invention of new platforms from the less established brands, the battle for BigData dominance, or perhaps even participation, is just getting started.

Gone Corporate

Well, it happened. I got sucked into the dubious world of the corporate blog. Thankfully we have a liberal policy: “Write about what you want to write about.” And our team has been faithful in supporting this; check out this post from our VP of People, I love it. And I love the marketing team supporting this effort from MarkLogic, professionals that “get it.”

On top of that I think our corporate blog team has one of the best, if not the best, role models in corporate blogging: MarkLogic’s CEO for 6+ years, David Kellogg and his incredibly high quality Kellblog which is less a corporate blog these days, given that he has moved on from MarkLogic, but still a fantastic platform for the always thinking, to the point, and goinggoinggoing DK.

All of this is really to say that while I have not posted here recently I have a massive post on BigData that is in limbo because it may become a corporate contributed article. I hope some form of it will see the light of this blog. In addition I’ve done three corporate posts on BigData, BigETL, and just today the Strata 2011 Executive Summit.

Hello iPad

I work with publishers, where iPad enthusiasm can feel boundless. I’ve used them briefly a few times. I’ve read many articles and reviews. I’ve waited for the hype to subside, but it seems to be increasing. I’ve waited to see if it would be useful for work, and several folks at work have replaced their laptops in meetings. Last week I spent 6 hours on a cross country flight jealous, not of the folks in first class but of the folks with that magical device. When I got home I broke down and bought an iPad.

So for the last couple days we’ve had our very own iPad at home. This is a bit of what I found.

I started off resisting and ended up a believer, in just a couple of hours. It wasn’t one particular app that got me there. It was the total experience of the device today, the potential it holds, and the gestalt or gist or essence or zeitgeist that it represents. It’s like the sea change when you become a parent. Everything changes, forever, irrefutably. When you don’t have kids your friends that do are happy to explain to you how wonderfully different everything is. A comparison of social calendars though illustrates quite the opposite! But then you have kids and you realize, oh, this is what they mean. In some ways it’s hard to even grasp what it means, what is different, what will be different. As you move through the experience you start to see not just the obvious but the subtle things that have changed, and they are everywhere. And there is no going back.

immersive intimate – I hear these two words a lot when friends describe the iPad. They are accurate. There are many great things about the device that are just different enough. Just the right screen size, no cords to mess with, no case to open, usually on, lightweight, simple to use… but the touch screen at this scale is transformative. The mouse and keyboard are no longer in the way. They become options, tools to use if you need them. In the same way using a mouse to point and click is more intuitive than using a command line to run archaic commands, touching a screen to interact is not just intuitive, it’s native to humans, it is how we’ve interacted with the world since day one, direct touch.

I’ve witnessed very few 2 year olds effectively manipulate a mouse but every 2 year old I know (and I know quite a few these days) that has been exposed to an iPhone knows how to make repeatable things happen with the device and, by the way, expects every screen to have a touch interface. There is no hesitation, the learning curve is so fast it feels instantaneous, and the engagement is deep. The ability to interact directly with the items on the screen brings us closer and lends a major, if not primary, hand to the sense of immersion.

Unlike every other computer, laptop, or mobile device I’ve used, the iPad is comfortable to use everywhere. Office desk, home couch, standing, sitting; it fits your needs. I can flip through it as a magazine, lay it flat to play a boardgame, prop it up in the kitchen to refer to a recipe, pop it on for a second to look up a movie reference, or scribble down the latest “great idea.” I’ve spent a lot of time with computers but I’ve never really wanted to curl up with one on the couch. Ok, maybe I came close with the Amiga but I was only 13, what did I know?

The iPad is comfortable to use anywhere for a broader array of activities than other devices; it expands the contexts where it makes sense to use a computer. With that simple step it has permission to play a broader role than desktop, laptop, or even cell phone. In this way it is a more intimate device, it plays a part in our lives throughout the day.

physically social – The iPad is physically social in a new way. With a mobile phone the screen is so small it is basically a single person at a time device. With a desktop or laptop there is only ever one person steering the keyboard and mouse. With a touch screen the size of an iPad multiple people can interact directly with the device simultaneously. It reminds me of the Microsoft “Surface” demos or the awesome Reactable demo. One big difference though: those tabletop touch screens are all about themselves, I have to go to them. The iPad is about me, portable social computing that can be driven by multiple people at the same time.

And as I am just starting to discover there are applications that can tether iPhones to the iPad. The first form of this I saw was Scrabble for the iPad. It allows you to use your iPhone to hold your letters while the iPad acts as the board. Genius! And clearly this is just the beginning for this type of in person and physical social interaction with the iPad. I can imagine applications that treat the iPad as an instant on local server allowing physically present as well as virtually present people to interact, like a temporary virtual flashmob.

What’s Missing? The iPad, coupled with the enormous number of mobile devices, feels like it signifies the early phase of a major shift in the way we interact with computers. Of course this implies there are shortcomings today. There are:

  • application interconnectedness is weak. Very few applications can work with each other. I want to edit text documents stored in dropbox with the notes app.
  • it’s a new type of social device that doesn’t support the social environment it exists in, yet. Just like a mobile phone I see people loading an image or video and handing their iPad to other folks so they can have the same experience. Cell phones generally make it back to the user. Not so with the iPad. We’ve had it in our home for less than 24 hours and it clearly belongs to the family. Even the idea that I needed to tie the iPad and applications to a specific user when configuring it seemed weird to me. I want two instances of flipboard, one for me, one for my wife. I want to flip it into “kid mode” and let my daughter play with a set of apps that I have filtered. This is the first computing device that I have experienced that demands to be physically shared and I don’t think it knows that about itself quite yet.
  • Multitasking is blatantly missing. Thankfully the fine folks at Apple have been on the case for some time.
  • Typing is still awkward. I think this is more my own shortcoming but I don’t find myself wanting to write anything on it. I also don’t want to plug a keyboard into it. Hmph.

I fought it but I finally broke down. I’m glad I did. I hated hearing Apple’s boss call it a “magical and revolutionary product” but I think there might be something to that description.

Some Other Notes

  • Something I read that helped push me over the edge, the most interesting look at the iPad that I have found, was this in depth post from John Borthwick about his experience over 11 weeks of exploring the iPad.
  • I should be writing a letter to United that goes something like this: “Dear United, Handing out media players in Business Class? Installing video in the seat front? Forget about it, give us iPads loaded with HD content…”
  • The day the iPads came out I heard a 20 something remark publicly about his friend, I assume also 20 something years old, that had gotten an iPad: “Is she a grandma now?!” Intimating that she couldn’t see well enough to use a cell phone. It seemed wrong for the long run given that I know at least one real life grandma who has more apps and coolness on her iPhone than most 30 somethings. And she has an iPad. I include this because I couldn’t get this generation gap comment out of my mind; the iPad may not be cool for the younger generations.
  • The apps people are building on the iPad remind me of the immersive nature of the primer in the Diamond Age by Neal Stephenson; an educational interactive multimedia “book” for children.