Proposal for improving the Patent2Net code base

Introduction

We would like to outline the direction in which we would like to see the code base of the Patent2Net project evolve.

Status quo

After digging into the details and refreshing our knowledge of P2N (our last contact was just a brief visit to git clone it and run one or two scripts), we recognize that the overall architecture is batch-processing-centric. The input parameters come from a configuration file (requete.cql), and data moves through different processing steps in the form of separate scripts to produce different output formats.

Thoughts

While we appreciate the command line approach and also see it as a first-class citizen, we would also like to be able to use Patent2Net as a library.

Approach

To get there, we would like to work on the code base and gradually move it in this direction, without breaking anything in the current script-based API. There are various things we want to achieve:

  • Introduce a setup.py to be able to use Patent2Net as a native Python module and as a dependency of other projects.
  • Provide a reasonably sane library API to the core analytic functions by decoupling the configuration and storage aspects from the actual data processing steps.
    Remark: While we would like this to live inside the “patent2net” namespace in the long run, we would like to start with this new infrastructure inside a separate folder/namespace called “p2n”, because of case-sensitivity issues with the “Patent2Net” folder. This is just a minor thing and things might move back to a “patent2net” (lower-case) folder after all refactoring has taken place.
  • Provide a command line script p2n as a single entry point for performing all steps, as a kind of shell / command-line-based API wrapper around the current script-based architecture. This should act as a convenient and stable CLI while pieces under the hood might move around.
  • Refactor common code from the different scripts into respective configuration and utility modules.
  • Care for Windows compatibility: While we are working on macOS (workstation) and Linux (server), we recognize that the P2N user base comes from the Windows world. We will of course try to keep this platform a first-class citizen when applying our changes, but we might need some help with that.
  • Improve the documentation: We would like to introduce the Sphinx documentation generator, which originates from the Python project. We use it with almost every project / code base we work on, and it is really a pleasure to have high-fidelity documentation at your fingertips, whether built during local development or automatically updated and published after pushing to git origin. To do this, we would gradually move things from Markdown to reStructuredText, as this is the more powerful markup language and the one native to Sphinx.
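
To make the library API and the single p2n entry point a bit more tangible, here is a minimal sketch of how the two could fit together. Everything in it is an assumption for illustration only: the function names count_by_field, build_parser and run, the "count" subcommand, the "--field" option and the sample records are hypothetical and do not exist in the current code base. It also assumes Python 3.7 or later. The point is only that the analytic function takes plain data and returns plain data, while the command line layer is a thin wrapper around it.

```python
import argparse
from collections import Counter


def count_by_field(records, field):
    """Hypothetical core function: plain data in, plain data out.

    It reads no configuration file and writes nothing to disk, so it can
    be called from a script, a notebook, or another project alike.
    """
    return dict(Counter(record.get(field, "unknown") for record in records))


def build_parser():
    """Build the parser for a hypothetical single `p2n` entry point."""
    parser = argparse.ArgumentParser(prog="p2n")
    subcommands = parser.add_subparsers(dest="command", required=True)
    count = subcommands.add_parser("count", help="count records per field value")
    count.add_argument("--field", default="country")
    return parser


def run(argv, records):
    """Parse the command line and dispatch to the library function."""
    args = build_parser().parse_args(argv)
    if args.command == "count":
        return count_by_field(records, args.field)
    raise ValueError("unknown command: %s" % args.command)


if __name__ == "__main__":
    # Made-up sample data standing in for records loaded by a storage layer.
    sample = [{"country": "FR"}, {"country": "DE"}, {"country": "FR"}]
    print(run(["count", "--field", "country"], sample))
```

With a split like this, the existing scripts could keep reading requete.cql and then call the library function, while other projects could import the same function directly without touching any configuration file.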

The first wave of changes in the direction outlined above can be reviewed through Commits · Patent2net/P2N · GitHub.