2010-11-05

The Wrong Stuff

Imagine for a moment that Tim Berners-Lee had been a complete moron. Imagine he had taken the single most important innovation of the web, namely the URL, and split it into two parts. One part being the server address

someserver.example.com

and the other part being the path.

/my/file/here.html

Now imagine that the server part would have been part of the browser configuration -- so that the web browser knows about a small but configurable set of web servers. Web page addresses would simply be paths, which, when combined with the browser configuration, would form what we today know as a URL.

But of course, unlike the URL, this address would only exist in the browser, after combining the browser configuration with the page address (the path) in question.

Next, imagine that any path given to the browser is blindly probed by the browser until it hits a server that actually has the page you refer to. So for the page /my/page.html you would have a probing sequence of this type:

someserver.example.com/my/page.html
otherserver.example.com/my/page.html
thatserver.example.com/my/page.html

Not the brightest resolution scheme there is, right? In light of how the web we know and love works, this seems like a pretty stupid idea today. What if /my/page.html exists on multiple of these servers? What if there were millions of servers and we had to wait for the browser to resolve the page by asking, on average, half of them before scoring a hit?

(Don't laugh yet: at the time I set up my first web server there were about 50 well-known web servers in the world, so this isn't that far-fetched. If you account for the existence of unimaginative people, that is.)

To mitigate some of the pain of downloading large pages, you introduce a cache locally on your machine: a caching layer that doesn't really differentiate between pages you have stuffed into the cache yourself and pages you have downloaded from one of the servers -- or, in practice, keep track of which pages were downloaded from which server.

You might think that nobody would be stupid enough to come up with a lame scheme like that.  Especially not in a post-web world.

Well, you can stop laughing. Someone has come up with a scheme like that for a similar problem, in post-web times. It is called "Apache Maven" and it is used for dealing with software artifacts. So the next time you come across some developer stating that Maven is a great solution for sorting out software dependencies, you have my permission to point and laugh. Even if it is usually socially unacceptable to poke fun at the intellectually challenged.
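
For the record, the Maven analog of the probing sequence above is a pom that lists multiple repositories, which are then tried in turn when an artifact is resolved. A rough sketch, reusing the hypothetical server names from the probing example (the ids and URLs are made up):

<repositories>
  <!-- made-up repositories; resolution tries them in order -->
  <repository>
    <id>someserver</id>
    <url>http://someserver.example.com/maven2</url>
  </repository>
  <repository>
    <id>otherserver</id>
    <url>http://otherserver.example.com/maven2</url>
  </repository>
  <repository>
    <id>thatserver</id>
    <url>http://thatserver.example.com/maven2</url>
  </repository>
</repositories>

Ask for an artifact that only one of these hosts, and Maven will knock on the doors in order until it scores a hit.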

Maven was well intended, but designed and implemented by people who were The Wrong Stuff.

9 comments:

  1. I don't think the comparison is quite correct.

    A URL uniquely identifies content on the web. The Maven equivalent is the combination of group id, artifact id and version number (so it's more like classes in Java than the web).
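
    For readers who haven't seen it, such a coordinate is declared in the pom along these lines (group, artifact and version here are made up):

        <dependency>
          <!-- made-up coordinate, for illustration only -->
          <groupId>com.example</groupId>
          <artifactId>some-library</artifactId>
          <version>1.0</version>
        </dependency>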

    Maven takes this a step further by storing MD5 hashes of all cached content, to give an extra level of comfort that the content is what you think it is.
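
    A coordinate also maps to a predictable path in any repository, and the checksums are stored next to each file. For the made-up coordinate above that would be something like:

        com/example/some-library/1.0/some-library-1.0.pom
        com/example/some-library/1.0/some-library-1.0.jar
        com/example/some-library/1.0/some-library-1.0.jar.md5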

    The Maven repositories, which you seem to take issue with, are more like a caching proxy on the web. Your cache is specific to your network, and *should* be part of your local configuration. Maven differs from the web in that you have to specify the caching proxy for remote sources of content. But there are fewer Maven repositories than web sites, and I want much tighter control over what is allowed in my code base than over what my web browser accesses.

    Apologies if I've misunderstood, but your analogy seems flawed.

  2. This comment has been removed by the author.

  3. I think that is the crux of the matter: a URL provides a stronger identity than a Maven coordinate. Thus a resource on the web has an authoritative source, while a Maven artifact does not.

    This is not a problem in trivial cases: for instance, where we imagine that there is only one central authority and a lot of caches of that authority. And for some, the trivial case is the reality they live in, and they rarely see problems.

    The problem is when you have multiple repositories and need to deal with artifacts that are not in the central repository, which can be the case for third-party artifacts that are not in Maven Central. (We have about 5 or 6 such artifacts in our current project, and there *has* been confusion over "where do we get this artifact from?" and build breakages, because people follow the same rules for choosing coordinates for slightly different SVN versions of the same artifact.)

    The local repository in .m2/repository now becomes a problem, because it can mask the fact that you have artifacts that are not resolvable, and it becomes hard to manage as a cache because Maven treats all artifacts the same regardless of where they came from (a cache without an authoritative source).

    Another risk is that people can do bad things to published artifacts by planting a bogus artifact in a repository that is in the probing sequence. In some scenarios you could use this to inject a tampered-with artifact with a back door into someone's build.

  4. The best practice is to have one repository, often called your company repository (although some open source projects also run their own dedicated repositories).

    Probing through several repositories for the artifacts you need is a really bad idea.

    If you want artifacts that are outside your repository, you either have to import/deploy the artifacts manually, or proxy a trusted repository that contains the artifacts.
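
    For the manual import, the standard incantation is something along these lines (file, coordinate, URL and repository id are all placeholders for your own values):

        mvn deploy:deploy-file -Dfile=some-library-1.0.jar \
            -DgroupId=com.example -DartifactId=some-library \
            -Dversion=1.0 -Dpackaging=jar \
            -Durl=http://repo.example.com/thirdparty \
            -DrepositoryId=company-thirdparty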

    If you are worried about the risks, you tighten the control of your repository with signed artifacts and auth.

    These principles apply no matter which software assembly tool you're talking about, whether it's Maven, Subversion, Debian packages, whatever. It's as risky as you make it.

    I recommend the free online Nexus book -- it contains all the information you need to set up a Maven repo as strict as you need it to be.

  5. Thomas: that is only a best practice because the severe limitations in how dependencies are resolved pretty much force you to do it -- not because it is a good idea.

    It is easy to forget why one does something when one does not consider the root cause of the problem. And the root cause in this case is bad design. Spending time configuring the repository manager to make up for shortcomings in the design is a kludge and a workaround.

    Sadly, just since I wrote that blog entry, a coworker has wasted almost an entire day trying to track down a problem related to the broken way in which Maven resolves transitive dependencies. He should not have to *know* how Maven is designed in order to use it, and I should not have to waste my time looking into it just because I am one of the people on the project with a bit more experience with Maven. If well designed, Maven would have handled this in a reasonable way.

    (Someone forgot to mirror a repository that is used by a transitive dependency. Sure, that someone could have realized that repository entries are not picked up from the pom of dependencies, but seriously: it would be reasonable to expect Maven to do this, or at least to warn about it.)

  6. Maven's default is to go to the central global repo. This approach really shines when building an open source project (like Maven itself, and all the other projects using Maven out there). This ease of use is what gave Maven such huge adoption in the first place.

    Of course, when you want to use it in a corporate context, you'll have to override the normal behavior, and Maven allows this; it is even well documented.

    Dependency resolution and artifact repositories are by nature conceptually tricky things. You are trying to do something complex, so the tools you apply will probably be complex as well.

    If you prefer tackling this complexity by copying all your binary dependencies into the SVN repo and building a set of homegrown Ant scripts and conventions for building, go ahead. This works, but people who have been through that (in most cases dreadful) experience see that Maven has a sensible design and approach. It's not perfect, but it's the best thing we've got (IMHO).

  7. Thomas: There is nothing that says you can't have a default value for the repository of an artifact (i.e. defaulting to Maven Central). This would give you the "ease of use" for the trivial case.

    However, this is not how Maven is used by nontrivial projects that have dependencies outside Maven Central.

    Optimizing everything around the trivial case while making relatively typical scenarios hard to manage properly is very bad design.

    Dependency resolution in Maven is made unnecessarily difficult because it involves multiple disjoint configurations: pom, parent pom, settings.xml, local repository config, etc. Not only are they disjoint, but when you use a repository manager you can't know where an artifact came from unless you explicitly know the configuration of the server (including the evaluation order of repositories).

    In some cases the repository configuration can be hard for a given developer to modify. Depending on project policy, the repository may be configurable by only one person (who then becomes a bottleneck).

    It can also be hard to figure out exactly how dependencies are resolved in the repository manager if you have multiple levels of virtual repositories. (To add insult to injury: certain errors can be masked by the repository manager and mvn, so what was a checksum error can look like a resolution failure.)

    Of course, all this becomes rather academic if you are going to share your project with people outside your organization who use a different repository. Then you either have to publish your repository setup, or you have to pollute your pom with external repositories.

    A good example of the latter is Apache Mahout, which forces you to discover how its transitive dependencies are resolved and then MANUALLY copy that information into your pom, your settings.xml or your repository manager. All of which is rather silly, because Maven should be able to infer this information.

    Which means that the repository is, in effect, an integral part of the identity of an artifact.

    I did not advocate checking binary deps into SVN, so I am not sure why you brought that up.

  8. I don't think Maven's out-of-the-box conventions are hard to manage, but I think further discussion on that point is in danger of getting subjective.

    Dependency resolution in Maven is a very powerful mechanism, and yes, you can land yourself in a mess if it's not done right. Again, this isn't special to Maven. Depending on things that in turn depend on other things is tricky stuff. Maven has a mechanism for this that weighs which dependency is "closest", and if that doesn't suit you, you override the transitive dependency with a "local" one.
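
    For example, if version 1.0 of a library comes in transitively and you need 1.1, declaring it directly in your own pom wins by the "closest" rule (coordinate made up):

        <dependency>
          <!-- made-up coordinate; a direct declaration beats a transitive one -->
          <groupId>com.example</groupId>
          <artifactId>some-library</artifactId>
          <version>1.1</version>
        </dependency>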

    Doing dependencyManagement in parent poms is an extremely useful feature that can reduce a lot of scattered dependency configuration.
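
    In a parent pom that looks roughly like this (coordinate made up); child modules then declare the dependency without a version and inherit the managed one:

        <dependencyManagement>
          <dependencies>
            <dependency>
              <!-- made-up coordinate; fixes the version for all child modules -->
              <groupId>com.example</groupId>
              <artifactId>some-library</artifactId>
              <version>1.1</version>
            </dependency>
          </dependencies>
        </dependencyManagement>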

    Keeping settings.xml small, staying away from big profiles, avoiding multiple repositories -- these are things that your Maven experts/admins need to know.

    At our place, we have a standard settings.xml that everyone has to use. It just contains our company repo, and says it mirrors "*" (all repos). If people don't use this settings.xml, Maven will not work for them, end of story.
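
    The relevant part of such a settings.xml looks roughly like this (id and URL made up):

        <mirrors>
          <mirror>
            <!-- made-up mirror; redirects all repository traffic to one server -->
            <id>company-repo</id>
            <mirrorOf>*</mirrorOf>
            <url>http://repo.example.com/maven2</url>
          </mirror>
        </mirrors>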

    Beyond that, we have not had any problems with 3rd-party artifacts in our repository that I can remember. We only proxy repositories that we trust, which means Maven Central and not many others. If we need artifacts that are only available in "shady" repositories, we download them and deploy them into our own 3rd-party repository.

    If I were going to share artifacts with another company, I would probably arrange to set up a new shared repository, and then proxy this repository in our internal company repo.

    I cannot vouch for Apache Mahout, but if you are trying to build something with dependencies that are not available in a repository you trust (i.e. one proxied in your company repo), you're bound for trouble. You can't build a car if you don't have the parts.

    I just brought up deps in SVN as the typical alternative to using Maven and a repository manager. If you have some concrete alternative to Maven, its dependency resolution mechanism, or the architecture of Maven repos, I'd love to hear it.

  9. Thomas: if by "powerful" you mean "configurable", I would agree. The problem is that "configurable" isn't necessarily good when what you really need is "intuitive". If by "powerful" you mean "solves the problem with a minimum of hassle", then it would be hard to get further from observed reality. Maven is a mountain, and I have yet to talk to any developer who has used it on nontrivial projects without experiencing a lot of pain.

    Also, people who do know Maven well sometimes have a tendency to mistake their own coping with Maven for ease of use. They tend to forget that they had to spend a lot of time learning it.

    (BTW, I generally never use the term "powerful" about software, because it conveys very little unambiguous meaning.)

    There are no real alternatives to Maven today in terms of dependency management. It seems that Gradle and Ant+Ivy are pretty much stuck with the Maven model for dependencies, which is a bit sad, because I would much rather spend time on build tools that solve more interesting problems than on mopping up designs that run contrary to common usage patterns.

    Actually, I'd rather not spend time on build tools at all. Unfortunately, Maven requires you to spend a significant amount of time learning it, and then maintaining things so that other developers don't trip up. (It does stand out. I've used perhaps 10-12 different build systems in my career, some of them in-house systems that varied from "crap" to "hallelujah". Maven is by far the one that has burned the most time, and I have only been using it for about 2 years now.)
