
7 January 2020

Visualising Architecture: GraphML Charting Module Dependencies

When working with existing code, analysing module dependencies helps me understand the bigger picture. For well-defined modules like Eclipse OSGi or Maven there are tools to show all modules and their interdependencies, e.g. the Eclipse Dependency Visualization tool or the Maven Module Dependency Graph Creator. For legacy technologies like COBOL and NATURAL - and probably some other platforms - we struggle due to the lack of such tools. Then the best approach is to extract the necessary information and put it into a modern, well-known format. I have written about such an approach in the past, e.g. loading references from a CSV into Neo4j for easy navigation and visualisation. Today I want to sketch another idea I have been using for many years: using GraphML to chart module dependencies. For example, the following graph shows the (filtered) modules of a NATURAL application together with their dependencies in an organic layout.

Some modules of a NATURAL application in an organic layout, 2017
Process
GraphML is an XML-based file format for graphs. [...] It uses an XML-based syntax and supports the entire range of possible graph structure constellations including directed, undirected, mixed graphs, hypergraphs, and application-specific attributes. That should be more than enough to visualise any module dependencies. In fact I just need nodes and edges. My process to use GraphML is always the same:
  1. Start with defining or extracting the modules and their dependencies.
  2. Create a little script which loads the extracted data and creates a raw GraphML XML file.
  3. Use an existing graph editor to tweak and layout the diagram.
  4. Export the diagram as PDF or as an image.
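To illustrate the format, here is a hand-written, minimal GraphML file declaring two modules where A depends on B - just nodes and edges, no yEd styling attributes yet:
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="directed">
    <node id="A"/>
    <node id="B"/>
    <edge source="A" target="B"/>
  </graph>
</graphml>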
Get the raw data
Extracting the module information is highly specific to the programming language. For Java I used my JavaClass class file parser. For NATURAL I used a CSV of all references which was provided by my client. Usually regular expressions work well to parse import, using or include statements. For example, to analyse plain C, scan for #include statements (as done in Schmidrules for Application Architecture; unfortunately there is not much documentation available about Schmidrules, which I plan to fix in the future).
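As a sketch of that idea, here is how scanning plain C sources for #include statements could look in Ruby - the folder layout and output format are made up for this example:
# Sketch: list the raw include dependencies of all C files under src.
Dir.glob('src/**/*.{c,h}').each do |file|
  File.readlines(file).each do |line|
    # matches #include "module.h" as well as #include <module.h>
    next unless line =~ /\A\s*#\s*include\s*["<]([^">]+)[">]/
    puts "#{File.basename(file)} -> #{$1}"
  end
end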

Create the XML
The next task is to create the XML. I use a little Ruby script to serialize a graph of nodes into GraphML. The Node class used here needs a name and a list of dependencies, which are Nodes again. save stores the graph as GraphML.
require 'rexml/document'

class GraphmlSerializer
  include REXML

  # Save the _graph_ to _filename_ .
  def save(filename, graph)
    File.open(filename + '.graphml', 'w') do |f|
      doc = graph_to_xml(graph)
      # REXML appends the serialized document to the given string, i.e. after the XML declaration.
      doc.write(out_string = '<?xml version="1.0" encoding="UTF-8" standalone="no"?>')
      f.print out_string
    end
  end

  # Return an XML document of the GraphML serialized _graph_ .
  def graph_to_xml(graph)
    doc = create_xml_doc
    container = add_graph_element(doc)
    graph.to_a.each { |node| add_node_as_xml(container, node) }
    doc
  end

  private

  def create_xml_doc
    REXML::Document.new
  end

  def add_graph_element(doc)
    root = doc.add_element('graphml',
      'xmlns' => 'http://graphml.graphdrawing.org/xmlns',
      'xmlns:y' => 'http://www.yworks.com/xml/graphml',
      'xmlns:xsi' => 'http://www.w3.org/2001/XMLSchema-instance',
      'xsi:schemaLocation' => 'http://graphml.graphdrawing.org/xmlns http://www.yworks.com/xml/schema/graphml/1.1/ygraphml.xsd')
    root.add_element('key', 'id' => 'n1',
                     'for' => 'node',
                     'yfiles.type' => 'nodegraphics')
    root.add_element('key', 'id' => 'e1',
                     'for' => 'edge',
                     'yfiles.type' => 'edgegraphics')

    root.add_element('graph', 'edgedefault' => 'directed')
  end

  public

  # Add the _node_ as XML to the _container_ .
  def add_node_as_xml(container, node)
    add_node_element(container, node)

    node.dependencies.each do |dep|
      add_edge_element(container, node, dep)
    end
  end

  private

  def add_node_element(container, node)
    elem = container.add_element('node', 'id' => node.name)
    shape_node = elem.add_element('data', 'key' => 'n1').
      add_element('y:ShapeNode')
    shape_node.
      add_element('y:NodeLabel').
      add_text(node.to_s)
  end

  def add_edge_element(container, node, dep)
    edge = container.add_element('edge')
    edge.add_attribute('id', node.name + '.' + dep.name)
    edge.add_attribute('source', node.name)
    edge.add_attribute('target', dep.name)
  end

end
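The serializer expects the graph to answer to_a with all its nodes, and each node to provide name, dependencies and to_s. Here is a minimal Node class together with its usage - my sketch for illustration, not the original class from JavaClass:
class Node
  attr_reader :name, :dependencies

  def initialize(name)
    @name = name
    @dependencies = [] # the Nodes this Node depends on
  end

  def to_s
    name
  end
end

# Usage: module A depends on module B.
a = Node.new('A')
b = Node.new('B')
a.dependencies << b
GraphmlSerializer.new.save('modules', [a, b]) # writes modules.graphml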
Layout the diagram
I do not try to lay out the graph when creating it. Existing tools do a much better job. I use the yEd Graph Editor. yEd is a Java application; just download and uncompress the zipped archive. To get the final diagram:
  1. I load the GraphML file into the editor. (If the graph is huge, yEd needs more memory. As yEd is just an executable JAR - Java Archive - it can be given more heap on startup using java -XX:+UseG1GC -Xmx800m -jar ./yed-3.16.2.1/yed.jar.)
  2. Then I select all nodes and apply the menu command Tools/Fit Node to Label, because the default node size does not match the length of the node names.
  3. Finally I apply the menu command Layout/Hierarchical or maybe Layout/Organic/Smart. In the end it takes some trial and error to find the proper layout.
Conclusion
This approach is very flexible. The nodes of the graph can be anything: a "module" can be a file, a folder or a bunch of folders. In the example shown above, the nodes were NATURAL modules (aka files), i.e. functions, subroutines, programs or includes. In another example, shipped with JavaClass, the nodes are components, i.e. source folders similar to Maven multi-module projects. The diagram below shows the components of a product family of large Eclipse RCP applications with their dependencies in a hierarchical layout. Pretty crazy, isn't it? ;-)

Components of an Eclipse RCP application in hierarchical layout, 2012

30 August 2019

Visualising Architecture: Neo4j vs. Module Dependencies

Last year I wrote about some work I did for a client using NATURAL (an application development and deployment environment using a proprietary language), for example using NATstyle, adding custom rules and creating custom reports. Today I will share some things I used to visualise the architecture. Usually I want to get the bigger picture of the architecture before I change it.

Dependencies are killing us
Let's start with some context: this is banking with some serious legacy. Groups of "solutions" are bundled together as "domains". Each solution contains 5,000 to 10,000 modules (files), which are either top level applications (executable modules) or subroutines (internal modules). Some modules call code of other solutions. There are some system "libraries" which bundle commonly used modules similar to solutions. A recent cross check listed more than 160,000 calls crossing solution boundaries. Nobody knows which modules outside of one's own solution are calling in, and changes are difficult because all APIs are potentially public. As usual - dependencies are killing us.

Graph Database
To get an idea what was going on, I wanted to visualise the dependencies. If the data could be converted and imported into standard tools, things would be easier, but there were way too many data points. I needed a database, a graph database to be precise, which should be able to deal with hundreds of thousands of nodes, i.e. the modules, and their edges, i.e. the directed dependencies (call or include).

Extract, Transform, Load (ETL)
While ETL is a concept from data warehousing, it was exactly what we needed: to "copy data from one or more sources into a destination system which represented the data differently from the source(s) or in a different context than the source(s)." The first step was to extract the relevant information, i.e. the call site modules and destination modules together with more architectural information like "solution" and "domain". I got this data as CSV from a system administrator. The data needed to be transformed into a format which could be easily loaded. Use your favourite scripting language or some sed&awk-fu. I used a little Ruby script,
require 'zip' # rubyzip gem

Zip::File.open('CrossReference.zip').
  read('CrossReference.csv').
  split(/\n/).
  map { |line| line.chomp }.
  map { |csv_line| csv_line.split(/;\s*/, 9) }.
  map { |values| values[0..7] }. # drop irrelevant columns
  map { |values| values.join(',') }. # use default field terminator ,
  each { |line| puts line }
to uncompress the file, drop irrelevant data and replace the column separator.
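The result was references.csv, whose header row matched the column names used in the Cypher script below. A row might have looked like this - the domain and solution values are invented for illustration, further columns omitted:
source_domain,source_solution,source_module,target_domain,target_solution,target_module
DOMA,SOLA,YRWBAN01,DOMB,SOLB,XGHLEP10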

Loading into Neo4j
Neo4j is a well-known graph platform with a large community. I had never used it and this was the perfect excuse to start playing with it ;-) It took me around three hours to understand the basics and load the data into a prototype. It was easier than I thought. I followed Neo4j's Tutorial on Importing Relational Data. A word of warning: I had no idea how to use Neo4j, so likely I used it wrongly and this is not a good example.
CREATE CONSTRAINT ON (m:Module) ASSERT m.name IS UNIQUE;
CREATE INDEX ON :Module(solution);
CREATE INDEX ON :Module(domain);

// left column
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///references.csv" AS row
MERGE (ms:Module {name:row.source_module})
ON CREATE SET ms.domain = row.source_domain, ms.solution = row.source_solution;

// right column
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///references.csv" AS row
MERGE (mt:Module {name:row.target_module})
ON CREATE SET mt.domain = row.target_domain, mt.solution = row.target_solution;

// relation
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///references.csv" AS row
MATCH (ms:Module {name:row.source_module})
MATCH (mt:Module {name:row.target_module})
MERGE (ms)-[r:CALLS]->(mt)
ON CREATE SET r.count = toInt(1);
This was my Cypher script. Cypher is Neo4j's declarative query language used for querying and updating the graph. I loaded the list of references and created all source modules, then all target modules. Then I loaded the references again, adding the relation between source and target modules. Sure, loading the huge CSV three times was wasteful, but it did the job. Remember, I had no idea what I was doing ;-)
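A single pass over the CSV should work as well. Here is an untested sketch using the same header names, merging both modules and the relation per row:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///references.csv" AS row
MERGE (ms:Module {name:row.source_module})
ON CREATE SET ms.domain = row.source_domain, ms.solution = row.source_solution
MERGE (mt:Module {name:row.target_module})
ON CREATE SET mt.domain = row.target_domain, mt.solution = row.target_solution
MERGE (ms)-[:CALLS]->(mt);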

Querying and Visualising
Cypher is a query language. For example, which modules have the most cross-solution dependencies? The query MATCH (mf:Module)-[c:CALLS]->() RETURN mf.name, count(distinct c) as outgoing ORDER BY outgoing DESC LIMIT 25 returned
YYI19N00 36
YXI19N00 36
YRWBAN01 34
XGHLEP10 34
YWI19N00 32
XCNBMK40 31
and so on. (By the way, I just loved the names. Modules in NATURAL can only be named using eight characters. Such fun, isn't it?) Now it got interesting. Usually visualisation is a major issue; with Neo4j it was a no-brainer. The Neo4j Browser comes out of the box, runs Cypher queries and displays the results neatly. For example, here are the 36 (external) dependencies of module YYI19N00:

Called by module YYI19N00
As I said, the whole thing was a prototype. It got me started. For an in-depth analysis I would need to traverse the graph interactively or scripted, like in a Jupyter notebook. In addition, there are several visualisation tools for Neo4j to make sense of and see how the data is connected - exactly what I would want to know about the dependencies.

5 June 2015

Choose Your Development Services Wisely

URIs Should Not Change
Modern software development relies a great deal on the web. Our source code is hosted on GitHub, we download necessary libraries from Maven Central, Ruby Gems or NPM. We communicate and discuss using issue trackers and forums and our software is built on hosted Continuous Integration servers. We rely a lot on SaaS infrastructure. But the Internet is constantly in motion. Pages appear and vanish. New services get created, moved around and shut down again. When Tim Berners-Lee created the web, he wished for URIs that would not change, but unfortunately the reality is different.

I hate dead links. While dead links are usually not a big deal - you just use an Internet search to find the content's new home - I still find them annoying. It is impossible to update all dead links on my blog and in my notes, but at least I want the cross references of my own stuff to be correct. This means that I have to migrate and update something at least once a year. The recent shutdown of Google Code affected me and caused a lot of extra work. This made me think.

Learning: Host as much of your content under your own control, e.g. your personal web space.

To deny the usage of today's powerful and often free services like GitHub would be foolish, but still I believe we (developers) should consider them dependencies and as dependencies they are also liabilities. We should try to reduce coupling and be aware of potential migration costs.

Learning: When choosing a SaaS service, think about its benefits versus the impact of it not being available any more.
Learning: When you pay for it, it is more stable.

Personal Development Process
The impact can be internal, which means it affects only you and your work, or it can be external, which means it affects others, your users or people using your code or documentation. For example, let us consider CloudBees. I started using it in the early beta and it was great. Over time I moved all my private, kata and open source projects there and had them built on each commit. It was awesome. But last year they removed their free plan and I did not want to pay, so I stopped using it. (This is no criticism. CloudBees is a company and needs to make money and reduce cost.) The external impact was zero as the Jenkins instance was private. The internal impact seemed huge: I had lost my CI. I looked for alternatives like Travis, but was too lazy to configure it for all my projects. Then I used the Jenkins Job Import Plugin to copy my jobs to a local instance and got it sort of running. (I had to patch the plugin, wasting hours...) Still I needed to touch every project configuration and in the end I abandoned CI. In reality the internal impact was also low, as I am not actively developing any open source right now and I am not working on real commercial software where CI is a must. Now I just run my local builds more often. It was cool to use CloudBees, but I can live without it.

Learning: Feel free to use third party SaaS for convenience, i.e. for anything that you can live without easily.

Information Sharing
Another example is about written information: the Hackergarten wiki. When I started Hackergarten Vienna in 2011, I collected material on how to run it and put it into the stub wiki someone had created for Hackergarten. I did not think about it and just used the existing wiki at Wikispaces. It seemed the right place. There were a few changes by other users, but not many changes at all. Two years later Wikispaces removed their free plan. Do you see a pattern? The internal impact was zero, but the external impact was high, as I wanted to keep the information about running a Hackergarten available to other hackers. Still I did not want to spend $50 to keep my three pages alive. Fortunately Wikispaces offered a raw download of all wiki pages. I used this accessible copy of my work and converted the wiki pages into blog pages in no time. As I changed the pages rarely, the extra overhead of working in HTML versus Creole was acceptable. Of course I had to update several links, a few blog posts and two slide decks, sigh. (And it increased my dependency on Google Blogger.)

Learning: When choosing a SaaS service, check its ways of migrating. Avoid lock-in.
Learning: Use static pages for data that rarely changes.

Code Repositories
Moving code repositories is always a pain. My JavaClass Ruby gem started out on Rubyforge, later I moved it to Google Code. With Google Code shutting down I had to migrate it again, together with seven other projects. The raw code and history were no problem, hg convert dealt with that. But there were a lot of small things to take care of. For example, different version control systems use different ignore syntaxes. The Google Code project description was proprietary and needed to be copied manually. The same was true for wiki pages, issues and downloads.

I had to change many incoming links. The first URL to change was the source repository location in all migrated projects' descriptors, e.g. Maven's pom.xml, Ruby's gemspec, Node's package.json and so on. Next were the links to and from project wiki pages, and finally I updated many blog posts and several slide decks. And all the project documentation, e.g. Maven sites or RDoc API pages, needed to be regenerated to reflect the new locations. While this would be no big deal for a single project, it was a lot of work for all of them. I full-text-searched my hard disc for obsolete URLs and kept finding them again and again.

Maybe I should not cross link my stuff that much, and I am not even sure I link that much at all. But instead of putting the GitHub URL of the code kata we will be working on in a Coding Dojo directly into the slides, I could just write the URL on a flip chart at the beginning of the dojo. The information about the kata seems to be more stable than the location of the source code. Also I might use the same slides when working on code in different languages, which might be stored in different repositories. But on the other hand, if I bother to write documentation and I reference something related, I expect it to be linked for fast navigation. That is the essence of hypertext, isn't it?

Learning: (maybe) Do not cross link too much.
Learning: (maybe) Do not link from stable resources to less stable ones.

Generated Artefacts
Next to source code I had to migrate generated artefacts like my Maven repository. I had used a Google Code feature where a repository was accessible in raw mode: I would just push to my Maven repository repository (recursion, yeah ;-) and the newly released artefacts would show up. That was very convenient. Unfortunately Bitbucket could not do that. I had a look into Bitbucket pages, but really did not feel like changing the layout of the repository. I was getting tired of all this. In the end I just uploaded the whole thing to my public web space. Static web pages and binary files, e.g. compressed archives, can be hosted on any web server and I should have put them there in the very beginning. Again I had to update site locations, repository URLs and incoming links in several projects and blog posts. As I updated my parent POM, I had to release new versions of several projects. I started to hate hyperlinks.

Learning: Host static data on regular (personal) web spaces.

You might argue that Maven Central would be a better place for Maven artefacts and I totally agree. I consider Maven Central much more stable than my personal web space, but I did not bother to go through the process of getting access to a service that would mirror my releases to Central. Anyway, this mirroring service, like Sonatype's, feels less stable than Central itself.

Learning: Host your stuff on the most stable option available.

Now all my repositories are hosted on Bitbucket. If its services stop working some day - and they surely will at some point in the future - I will stop using hosted repositories for my projects. I will not migrate everything again. I am done. Update January 2020: Bitbucket stops supporting Mercurial this year. I am done. :-(

Learning: (maybe) Do not bother with dead links or losing your stuff. Who cares?

CNAMEs
For some time Bitbucket offered a CNAME feature that allowed you to associate a domain or sub-domain with an account. That was really nice: instead of bitbucket.org/pkofler/project-x I used hg.code-cop.org/project-x. I liked it, or so I thought. Of course Bitbucket decided to disable this feature this July and I ended up - again - updating URLs everywhere. While the change was minimal, as paths and parameters of the URLs stayed the same, I had to touch almost every source repository to change its public repository location, fix 20+ blog posts and update several Coding Dojo slide decks, which in turn needed to be uploaded to SlideShare again.

Learning: Only use the minimal, strictly necessary features of a SaaS.
Learning: Even when offered, turn off all features you do not need, e.g. wiki, issues, etc.
Learning: Do not use anything just because it is handy or "cool". It increases your coupling.

URLs Change All The Time
Maybe I should not link too much in this post, as eventually I will have to change all the links again, grrrr. I am considering a URL service like Bitly. Maybe not exactly like Bitly, because its purpose is the shortening and marketing aspect of links, and I do not see a way to change the actual link once it has been created. And I would depend on yet another service, which eventually would go away. So I would need to host the service myself, like J. B. Rainsberger does. I like the idea of updating all my links with a single update statement in the link database. I wish I had used such a thing. It would increase the work when creating links, but would reduce the work when changing them. Like with code, it seems that my links are far more often changed than created, at least some of them.

I could not find any service, free or otherwise, providing this functionality, so I would have to create my own. Also I do not know the impact of the extra redirect on search engines and other automatic consumers of my content. And still some manual changes are inevitable: if the repository location moves, the SCM settings need to be changed. So I will just wait until the next feature or whole service is discontinued and I have to start over.

Thanks to Thomas Sundberg for proof-reading this article.

2 October 2014

Visualising Architecture: Maven Dependency Graph

Last year, during my CodeCopTour, I pair programmed with Alexander Dinauer. During a break he mentioned that he had just created a Maven Module Dependency Graph Creator. How awesome is that? The Graph Creator is a Gradle project that allows you to visually analyse the dependencies of a multi-module Maven project. At first it sounds a bit weird - a Gradle project to analyse Maven modules. It generates a graph in the Graphviz DOT language and an HTML report using Viz.js. You can use the dot tool, which comes with Graphviz, to transform the resulting graph file into JPG, SVG etc.
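For example, assuming the generated graph file is named modules.dot (the file names here are made up), dot -Tsvg modules.dot -o modules.svg renders it as an SVG.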

In a regular project there is not much use for the graph creator tool, but in super large projects there definitely is. I like to use it when I run code reviews for my clients, which involve codebases of around one million lines of code or more. Such projects can contain up to two hundred Maven modules. Maven prohibits cyclic dependencies, but projects of such size often show a crazy internal structure nevertheless. Here is a (down-scaled) example from one of my clients:

Maven dependencies in large project
Looks pretty wild, doesn't it? The graph is the first step in my analysis. I need to load it into a graph editor and cluster the modules into their architectural components. Then outliers and architectural anomalies become more explicit.

15 November 2011

Visualising Architecture: Eclipse Plug-in Dependencies

In my job at the Blue Company I am working on a product family of large Eclipse RCP applications. Each application consists of a number of OSGi plug-ins, which are the main building block of Eclipse RCP. There are dedicated as well as shared plug-ins. I am new to the project and struggle to understand which applications depend on which plug-ins. I would like to see a component diagram showing the dependencies between modules, i.e. on the OSGi level.

There is an Eclipse Dependency Visualization tool inside the Eclipse Incubator project: plug-in org.eclipse.pde.visualization.dependency_1.0.0.20110531, which displays dependencies of plug-ins. This is exactly what I need. Here is the visualization of one of the products:

Plug-in Dependency Analysis
(The yellow and green bounding boxes were added manually and indicate parts of the architecture.) The tool's main use is to identify unresolved plug-ins, a common problem in large applications. While you would find unresolved plug-ins by running the application, the tool helps even before the application is launched. For more information see this article on developerWorks about the use of the Dependency Visualization. The latest version of the plug-in, including a download for Eclipse 3.5.x, is available on testdrivenguy's blog.

27 July 2008

Viewing Dependencies with Eclipse

Dependencies between classes are somewhat double-faced. Classes need other classes to do their work, because code has to be broken down into small chunks to be easier to understand. On the other hand, growing quadratically with the number of classes (n nodes can define up to n*(n-1) directed edges in a graph), they can be a major source of code complexity: tight coupling, Feature Envy and cyclic dependencies are only some of the problems caused by relations between classes "going wild".

I try to avoid (too many) usages of other classes in my source. But it is not that easy. Consider a class in some legacy code base that shows signs of Feature Envy for classes in another package. Should I move it there? To be sure, I need to know all classes my class depends upon. I mean all classes, so checking imports does not help. Even if I forbid wildcard imports (which I do), imports still do not list all classes: classes in the same package need no import, and they are rarely shown by tools. Even my favourite Code Analysis Plugin (CAP) for Eclipse does not show them.

So here is my little, crappy, amateur Java Dependency View Plugin for Eclipse. Unpack it and put the jar into Eclipse's plugins directory. Restart Eclipse and open the Dependency View.
Open the Dependency View in Eclipse with Open View
(The small blue icon is my old coding-logo, a bit skewed.) In the view you will see all kinds of outgoing references: inner classes, classes in the same package, other classes, generic types and sometimes - a bug in the view.
Dependency View in Eclipse showing all outgoing references of java.util.ArrayList
The plugin was developed with Eclipse 3.2 and tested in 3.3. Source code is included in the jar.

A nice feature that I was not able to implement would have been reverse search, i.e. a list of all classes depending on the currently selected one. If you happen to know how to do that, i.e. how to search for all classes that refer to the current class, drop me a note.