XMP metadata in PDF documents from European publication server

Introduction

Following a recent update, "Use EPO publication server for obtaining PDFs of EP publications (#12)" (ip-tools/ip-navigator@525f5f4 on GitHub; thanks again, @aghster!), PDF documents can now be fetched directly from the European publication server.

On closer inspection, these documents are produced to a high standard. Besides carrying real, machine-readable text (rather than just scanned page images stitched together), they also embed XMP metadata serialized as RDF/XML.
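To look at the embedded metadata, the XMP packet first has to be located inside the PDF file. The following is a minimal sketch, assuming the metadata stream is stored unfiltered (uncompressed), which is typically the case for PDF/A-1 files; for compressed streams a real PDF library would be needed. The function name `extract_xmp_packet` and the sample bytes are made up for illustration and are not part of PatZilla.

```python
import re
from typing import Optional

# Matches one XMP packet: from the "<?xpacket begin=" header up to and
# including the "<?xpacket end=...?>" trailer. DOTALL lets "." span newlines.
XPACKET_RE = re.compile(rb"<\?xpacket begin=.*?<\?xpacket end=.*?\?>", re.DOTALL)


def extract_xmp_packet(pdf_bytes: bytes) -> Optional[bytes]:
    """Return the first XMP packet found in the raw bytes, or None."""
    match = XPACKET_RE.search(pdf_bytes)
    return match.group(0) if match else None


# Synthetic stand-in for the contents of a PDF file (not a real PDF).
sample = (
    b"%PDF-1.4\n...binary data...\n"
    b'<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/"></x:xmpmeta>'
    b'<?xpacket end="w"?>\n%%EOF\n'
)
packet = extract_xmp_packet(sample)
```

With a downloaded document, the same function could be applied to the bytes read from the file, e.g. `extract_xmp_packet(open(path, "rb").read())`, where `path` points to the PDF.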

This comes out of the EPO initiative to encode and publish data as linked open data; see also "EPO - Linked open EP data". We had already planned to unlock this for PatZilla, but have not had the chance to try it yet. Still, it is great to see this in the wild already.

About XMP

Adobe’s Extensible Metadata Platform (XMP) is a file labeling technology that lets you embed metadata into files themselves during the content creation process. With an XMP enabled application, your workgroup can capture meaningful information about a project (such as titles and descriptions, searchable keywords, and up-to-date author and copyright information) in a format that is easily understood by your team as well as by software applications, hardware devices, and even file formats. Best of all, as team members modify files and assets, they can edit and update the metadata in real time during the workflow.

With XMP, desktop applications and back-end publishing systems gain a common method for capturing, sharing, and leveraging this valuable metadata. Adobe has taken the “heavy lifting” out of metadata integration, offering content creators an easy way to embed meaningful information about their projects and providing industry partners with standards-based building blocks to develop optimized workflow solutions.

Details

By providing a standard way of tagging files with metadata across products from Adobe and other vendors, XMP is a powerful solution enabler. As an open source technology, it is freely available to developers, which means that the user community benefits from the innovations contributed by developers worldwide. Furthermore, XMP is extensible: it can accommodate existing metadata schemas, so systems don’t need to be rebuilt from scratch. A growing number of third-party applications now support XMP.

XMP has also been an ISO standard (ISO 16684-1) since early 2012.

The most commonly used serialization format is a subset of the W3C RDF/XML syntax, which expresses a Resource Description Framework (RDF) graph in XML. The same XMP packet can be serialized in RDF/XML in several equivalent ways.
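As a small illustration of such equivalent serializations, a simple property can be written either as a child element of `rdf:Description` or as an attribute on it. The sketch below reads both forms with Python's standard `xml.etree.ElementTree`; the helper `simple_property` and the two snippets are made up for this example and only cover simple, unqualified values.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"

# Form 1: the property is a child element of rdf:Description.
AS_ELEMENT = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:dc="{DC}">
  <rdf:Description rdf:about="">
    <dc:format>application/pdf</dc:format>
  </rdf:Description>
</rdf:RDF>
""".strip()

# Form 2: the same property written as an attribute of rdf:Description.
AS_ATTRIBUTE = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:dc="{DC}">
  <rdf:Description rdf:about="" dc:format="application/pdf"/>
</rdf:RDF>
""".strip()


def simple_property(rdf_xml, namespace, name):
    """Return a simple property value regardless of which form was used."""
    root = ET.fromstring(rdf_xml)
    key = f"{{{namespace}}}{name}"  # ElementTree's "{uri}local" notation
    for description in root.iter(f"{{{RDF}}}Description"):
        if key in description.attrib:   # attribute form
            return description.attrib[key]
        child = description.find(key)   # element form
        if child is not None:
            return child.text
    return None


from_element = simple_property(AS_ELEMENT, DC, "format")
from_attribute = simple_property(AS_ATTRIBUTE, DC, "format")
```

Both calls yield the same value, which is why a consumer should not rely on one particular layout of the packet.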

Example for EP0666666A2

RDF embedded in <x:xmpmeta>
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.0-c320 44.290368, Mon Jun 11 2007 09:18:48">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:xap="http://ns.adobe.com/xap/1.0/">
         <xap:ModifyDate>2010-12-01T05:48:25+01:00</xap:ModifyDate>
         <xap:CreateDate>2010-10-14T02:40:28+01:00</xap:CreateDate>
         <xap:MetadataDate>2010-12-01T05:48:25+01:00</xap:MetadataDate>
         <xap:CreatorTool>Jouve S.A., EPO - Publication - KB, TaggedPDF v1.27</xap:CreatorTool>
         <xap:Identifier>
            <rdf:Bag>
               <rdf:li>EP-0666666-A2-19950809</rdf:li>
            </rdf:Bag>
         </xap:Identifier>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:xapMM="http://ns.adobe.com/xap/1.0/mm/">
         <xapMM:DocumentID>uuid:c5973d2e-1dd1-11b2-0a00-848e67090100</xapMM:DocumentID>
         <xapMM:InstanceID>uuid:ad8177c0-1dd1-11b2-0a00-985d70093600</xapMM:InstanceID>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/">
         <dc:format>application/pdf</dc:format>
         <dc:identifier>EP-0666666-A2-19950809</dc:identifier>
         <dc:description>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">A non-quota access indicator is circulated among nodes in a multi-node quota based communication system with a shared resource, indicating maximum possible non-quota access to the shared resource to a given node receiving same. Upon arrival at a node, the indicator is saved and then updated to reflect the current status of that node as either starved or satisfied, the former being a condition of currently having quota remaining and a shared resource access requirement, and the latter being a condition of either currently having no remaining quota or having no current shared resource access requirement. After updating, the node immediately propagates the indicator to the next node in the system. When a node without quota requires access to the shared resource, it compares its requirement to the last stored indicator and accesses the shared resource if the stored indicator is equal to or greater than the access requirement.</rdf:li>
               <rdf:li xml:lang="en">A non-quota access indicator is circulated among nodes in a multi-node quota based communication system with a shared resource, indicating maximum possible non-quota access to the shared resource to a given node receiving same. Upon arrival at a node, the indicator is saved and then updated to reflect the current status of that node as either starved or satisfied, the former being a condition of currently having quota remaining and a shared resource access requirement, and the latter being a condition of either currently having no remaining quota or having no current shared resource access requirement. After updating, the node immediately propagates the indicator to the next node in the system. When a node without quota requires access to the shared resource, it compares its requirement to the last stored indicator and accesses the shared resource if the stored indicator is equal to or greater than the access requirement.</rdf:li>
            </rdf:Alt>
         </dc:description>
         <dc:language>
            <rdf:Bag>
               <rdf:li>en</rdf:li>
            </rdf:Bag>
         </dc:language>
         <dc:publisher>
            <rdf:Bag>
               <rdf:li>European Patent Office</rdf:li>
            </rdf:Bag>
         </dc:publisher>
         <dc:subject>
            <rdf:Bag>
               <rdf:li>EP0666666</rdf:li>
               <rdf:li>EP 0666666</rdf:li>
               <rdf:li>H04L 12/24</rdf:li>
               <rdf:li>H04L 12/42</rdf:li>
               <rdf:li>H04L 29/00</rdf:li>
               <rdf:li>Method and apparatus for improved throughput in a multi-node communication system with a shared resource</rdf:li>
            </rdf:Bag>
         </dc:subject>
         <dc:title>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">Method and apparatus for improved throughput in a multi-node communication system with a shared resource - European Patent Office - EP 0666666 A2</rdf:li>
               <rdf:li xml:lang="en">Method and apparatus for improved throughput in a multi-node communication system with a shared resource - European Patent Office - EP 0666666 A2</rdf:li>
            </rdf:Alt>
         </dc:title>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
         <pdf:Producer>Adobe PDF Library 8.0</pdf:Producer>
         <pdf:Keywords>"EP0666666"; "EP 0666666"; "H04L 12/24"; "H04L 12/42"; "H04L 29/00"; "Method and apparatus for improved throughput in a multi-node communication system with a shared resource"</pdf:Keywords>
         <pdf:PDFVersion>1.4</pdf:PDFVersion>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
         <pdfaid:part>1</pdfaid:part>
         <pdfaid:conformance>B</pdfaid:conformance>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:patent="http://www.epo.org/patent-bibliographic-data/1.0/">
         <patent:TotalNumberOfPages>8</patent:TotalNumberOfPages>
         <patent:Publication rdf:parseType="Resource">
            <patent:CountryCode>EP</patent:CountryCode>
            <patent:Number>0666666</patent:Number>
            <patent:KindCode>A2</patent:KindCode>
            <patent:Date>1995-08-09</patent:Date>
         </patent:Publication>
         <patent:Application rdf:parseType="Resource">
            <patent:Number>95480005</patent:Number>
            <patent:Date>1995-01-24</patent:Date>
         </patent:Application>
         <patent:Priority>
            <rdf:Bag>
               <rdf:li rdf:parseType="Resource">
                  <patent:CountryCode>US</patent:CountryCode>
                  <patent:Number>192884</patent:Number>
                  <patent:Date>1994-02-07</patent:Date>
               </rdf:li>
            </rdf:Bag>
         </patent:Priority>
         <patent:Classification>
            <rdf:Bag>
               <rdf:li>H04L 12/24</rdf:li>
               <rdf:li>H04L 12/42</rdf:li>
               <rdf:li>H04L 29/00</rdf:li>
            </rdf:Bag>
         </patent:Classification>
         <patent:Applicant>
            <rdf:Bag>
               <rdf:li>International Business Machines Corporation</rdf:li>
            </rdf:Bag>
         </patent:Applicant>
         <patent:Inventor>
            <rdf:Bag>
               <rdf:li>Cidon, Israel</rdf:li>
               <rdf:li>Georgiadis, Leonidas</rdf:li>
               <rdf:li>Guerin, Roch Andre</rdf:li>
               <rdf:li>Shavitt, Yuval Yitzchak</rdf:li>
               <rdf:li>Slater, Andrew Emlyn</rdf:li>
            </rdf:Bag>
         </patent:Inventor>
         <patent:Representative>
            <rdf:Bag>
               <rdf:li>Therias, Philippe</rdf:li>
            </rdf:Bag>
         </patent:Representative>
         <patent:Title>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">Method and apparatus for improved throughput in a multi-node communication system with a shared resource</rdf:li>
               <rdf:li xml:lang="de">Verfahren und Vorrichtung für verbesserten Durchfluss in einem Vielfachknoten-Kommunikationssystem mit einem gemeinsamen Betriebsmittel</rdf:li>
               <rdf:li xml:lang="en">Method and apparatus for improved throughput in a multi-node communication system with a shared resource</rdf:li>
               <rdf:li xml:lang="fr">Méthode et dispositif pour débit amélioré dans un système de communication à plusieurs noeuds avec une ressource partagée</rdf:li>
            </rdf:Alt>
         </patent:Title>
         <patent:Abstract>
            <rdf:Seq>
               <rdf:li xml:lang="en">A non-quota access indicator is circulated among nodes in a multi-node quota based communication system with a shared resource, indicating maximum possible non-quota access to the shared resource to a given node receiving same. Upon arrival at a node, the indicator is saved and then updated to reflect the current status of that node as either starved or satisfied, the former being a condition of currently having quota remaining and a shared resource access requirement, and the latter being a condition of either currently having no remaining quota or having no current shared resource access requirement. After updating, the node immediately propagates the indicator to the next node in the system. When a node without quota requires access to the shared resource, it compares its requirement to the last stored indicator and accesses the shared resource if the stored indicator is equal to or greater than the access requirement.</rdf:li>
            </rdf:Seq>
         </patent:Abstract>
         <patent:DocumentStructure>
            <rdf:Seq>
               <rdf:li rdf:parseType="Resource">
                  <patent:DocumentSection>bibliography</patent:DocumentSection>
                  <patent:StartPage>1</patent:StartPage>
                  <patent:NumberOfPages>1</patent:NumberOfPages>
               </rdf:li>
               <rdf:li rdf:parseType="Resource">
                  <patent:DocumentSection>description</patent:DocumentSection>
                  <patent:StartPage>2</patent:StartPage>
                  <patent:NumberOfPages>4</patent:NumberOfPages>
               </rdf:li>
               <rdf:li rdf:parseType="Resource">
                  <patent:DocumentSection>claims</patent:DocumentSection>
                  <patent:StartPage>5</patent:StartPage>
                  <patent:NumberOfPages>2</patent:NumberOfPages>
               </rdf:li>
               <rdf:li rdf:parseType="Resource">
                  <patent:DocumentSection>drawings</patent:DocumentSection>
                  <patent:StartPage>7</patent:StartPage>
                  <patent:NumberOfPages>2</patent:NumberOfPages>
               </rdf:li>
               <rdf:li rdf:parseType="Resource">
                  <patent:DocumentSection>abstract</patent:DocumentSection>
                  <patent:StartPage>1</patent:StartPage>
                  <patent:NumberOfPages>1</patent:NumberOfPages>
               </rdf:li>
            </rdf:Seq>
         </patent:DocumentStructure>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:pdfaExtension="http://www.aiim.org/pdfa/ns/extension/"
            xmlns:pdfaSchema="http://www.aiim.org/pdfa/ns/schema#"
            xmlns:pdfaProperty="http://www.aiim.org/pdfa/ns/property#"
            xmlns:pdfaType="http://www.aiim.org/pdfa/ns/type#"
            xmlns:pdfaField="http://www.aiim.org/pdfa/ns/field#">
         <pdfaExtension:schemas>
            <rdf:Bag>
               <rdf:li rdf:parseType="Resource">
                  <pdfaSchema:namespaceURI>http://www.epo.org/patent-bibliographic-data/1.0/</pdfaSchema:namespaceURI>
                  <pdfaSchema:prefix>patent</pdfaSchema:prefix>
                  <pdfaSchema:schema>Patent Bibliographic Data Schema V. 1.0</pdfaSchema:schema>
                  <pdfaSchema:property>
                     <rdf:Seq>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>external</pdfaProperty:category>
                           <pdfaProperty:name>Publication</pdfaProperty:name>
                           <pdfaProperty:valueType>DocId</pdfaProperty:valueType>
                           <pdfaProperty:description>Contains country code, publication number/date, correction code, and kind of the published&#xA;patent document</pdfaProperty:description>
                        </rdf:li>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>external</pdfaProperty:category>
                           <pdfaProperty:name>InternationalPublication</pdfaProperty:name>
                           <pdfaProperty:valueType>DocId</pdfaProperty:valueType>
                           <pdfaProperty:description>Contains country code, publication number/date, and kind of the international published patent&#xA;document</pdfaProperty:description>
                        </rdf:li>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>external</pdfaProperty:category>
                           <pdfaProperty:name>Application</pdfaProperty:name>
                           <pdfaProperty:valueType>DocId</pdfaProperty:valueType>
                           <pdfaProperty:description>Contains number and filing date of the patent document</pdfaProperty:description>
                        </rdf:li>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>external</pdfaProperty:category>
                           <pdfaProperty:name>InternationalApplication</pdfaProperty:name>
                           <pdfaProperty:valueType>DocId</pdfaProperty:valueType>
                           <pdfaProperty:description>Contains number and filing date of the international patent document</pdfaProperty:description>
                        </rdf:li>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>external</pdfaProperty:category>
                           <pdfaProperty:name>Priority</pdfaProperty:name>
                           <pdfaProperty:valueType>Bag DocId</pdfaProperty:valueType>
                           <pdfaProperty:description>Contains list of priorities (country code, publication number and date) of the patent&#xA;document</pdfaProperty:description>
                        </rdf:li>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>external</pdfaProperty:category>
                           <pdfaProperty:name>Classification</pdfaProperty:name>
                           <pdfaProperty:valueType>Bag Text</pdfaProperty:valueType>
                           <pdfaProperty:description>Contains list of Patent Classification symbols of the patent document - XML ST36 element:&#xA;classification-ipcr/text, or B511, B512 - See WIPO ST8 for content structure description</pdfaProperty:description>
                        </rdf:li>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>external</pdfaProperty:category>
                           <pdfaProperty:name>Applicant</pdfaProperty:name>
                           <pdfaProperty:valueType>Bag Text</pdfaProperty:valueType>
                           <pdfaProperty:description>Contains list of applicants of the patent document</pdfaProperty:description>
                        </rdf:li>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>external</pdfaProperty:category>
                           <pdfaProperty:name>Inventor</pdfaProperty:name>
                           <pdfaProperty:valueType>Bag Text</pdfaProperty:valueType>
                           <pdfaProperty:description>Contains list of inventors of the patent document</pdfaProperty:description>
                        </rdf:li>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>external</pdfaProperty:category>
                           <pdfaProperty:name>Proprietor</pdfaProperty:name>
                           <pdfaProperty:valueType>Bag Text</pdfaProperty:valueType>
                           <pdfaProperty:description>Contains list of proprietors of the patent document</pdfaProperty:description>
                        </rdf:li>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>external</pdfaProperty:category>
                           <pdfaProperty:name>Representative</pdfaProperty:name>
                           <pdfaProperty:valueType>Bag Text</pdfaProperty:valueType>
                           <pdfaProperty:description>Contains list of representatives of the patent document</pdfaProperty:description>
                        </rdf:li>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>external</pdfaProperty:category>
                           <pdfaProperty:name>Title</pdfaProperty:name>
                           <pdfaProperty:valueType>Lang Alt</pdfaProperty:valueType>
                           <pdfaProperty:description>Contains title or an alternative list of titles of the patent document - XML ST36 element: B541&#xA;(title language) and B542 - default value depends on the value of the attribute @lang in the element ep-patentdocument</pdfaProperty:description>
                        </rdf:li>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>external</pdfaProperty:category>
                           <pdfaProperty:name>Abstract</pdfaProperty:name>
                           <pdfaProperty:valueType>Seq Text</pdfaProperty:valueType>
                           <pdfaProperty:description>Contains abstract or an ordered list of abstracts of the patent document - XML ST36 element:&#xA;abstract/@lang and abstract/p - first abstract in the ordered list depends on the value of the attribute @lang in the element ep-patent-document</pdfaProperty:description>
                        </rdf:li>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>external</pdfaProperty:category>
                           <pdfaProperty:name>TotalNumberOfPages</pdfaProperty:name>
                           <pdfaProperty:valueType>Real</pdfaProperty:valueType>
                           <pdfaProperty:description>Total number of pages</pdfaProperty:description>
                        </rdf:li>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>external</pdfaProperty:category>
                           <pdfaProperty:name>DocumentStructure</pdfaProperty:name>
                           <pdfaProperty:valueType>Seq Bookmark</pdfaProperty:valueType>
                           <pdfaProperty:description>Contains an ordered list of bookmark-related data (name, number of the first page, number of&#xA;pages)</pdfaProperty:description>
                        </rdf:li>
                     </rdf:Seq>
                  </pdfaSchema:property>
                  <pdfaSchema:valueType>
                     <rdf:Seq>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaType:type>DocId</pdfaType:type>
                           <pdfaType:namespaceURI>http://www.epo.org/patent-bibliographic-data/1.0/</pdfaType:namespaceURI>
                           <pdfaType:prefix>patent</pdfaType:prefix>
                           <pdfaType:description>Provides a structure for document identification related data</pdfaType:description>
                           <pdfaType:field>
                              <rdf:Seq>
                                 <rdf:li rdf:parseType="Resource">
                                    <pdfaField:name>CountryCode</pdfaField:name>
                                    <pdfaField:valueType>Text</pdfaField:valueType>
                                    <pdfaField:description>Country Code - XML ST36 element:&#xA;- B190 for a publication or B871/ctry for an international publication&#xA;- B330/ctry for a priority</pdfaField:description>
                                 </rdf:li>
                                 <rdf:li rdf:parseType="Resource">
                                    <pdfaField:name>Number</pdfaField:name>
                                    <pdfaField:valueType>Text</pdfaField:valueType>
                                    <pdfaField:description>Publication or Application or Priority Number - XML ST36 element:&#xA;- B110 for a publication or B871/dnum/pnum for an international publication&#xA;- B210 for an application or B861/dnum/anum for an international application&#xA;- B310 for a priority</pdfaField:description>
                                 </rdf:li>
                                 <rdf:li rdf:parseType="Resource">
                                    <pdfaField:name>KindCode</pdfaField:name>
                                    <pdfaField:valueType>Text</pdfaField:valueType>
                                    <pdfaField:description>Kind Code - XML ST36 element: B130 for a publication or B871/kind for an international publication</pdfaField:description>
                                 </rdf:li>
                                 <rdf:li rdf:parseType="Resource">
                                    <pdfaField:name>CorrectionCode</pdfaField:name>
                                    <pdfaField:valueType>Text</pdfaField:valueType>
                                    <pdfaField:description>Correction Code - XML ST36 element: B151+B132EP - See WIPO ST50 for correction code definition</pdfaField:description>
                                 </rdf:li>
                                 <rdf:li rdf:parseType="Resource">
                                    <pdfaField:name>Date</pdfaField:name>
                                    <pdfaField:valueType>Date</pdfaField:valueType>
                                    <pdfaField:description>Publication or Application or Priority Date - XML ST36 element:&#xA;- B140/date for a publication or B871/date for an international publication&#xA;- B220/date for an application or B861/date for an international application&#xA;- B320/date for a priority</pdfaField:description>
                                 </rdf:li>
                              </rdf:Seq>
                           </pdfaType:field>
                        </rdf:li>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaType:type>Bookmark</pdfaType:type>
                           <pdfaType:namespaceURI>http://www.epo.org/patent-bibliographic-data/1.0/</pdfaType:namespaceURI>
                           <pdfaType:prefix>patent</pdfaType:prefix>
                           <pdfaType:description>Provides a structure for describing bookmarks of a patent document</pdfaType:description>
                           <pdfaType:field>
                              <rdf:Seq>
                                 <rdf:li rdf:parseType="Resource">
                                    <pdfaField:name>DocumentSection</pdfaField:name>
                                    <pdfaField:valueType>Text</pdfaField:valueType>
                                    <pdfaField:description>Possible values are: bibliography, abstract, description, claims, drawings, search-report, amendment</pdfaField:description>
                                 </rdf:li>
                                 <rdf:li rdf:parseType="Resource">
                                    <pdfaField:name>StartPage</pdfaField:name>
                                    <pdfaField:valueType>Real</pdfaField:valueType>
                                    <pdfaField:description>Number of the first page of a given DocumentSection</pdfaField:description>
                                 </rdf:li>
                                 <rdf:li rdf:parseType="Resource">
                                    <pdfaField:name>NumberOfPages</pdfaField:name>
                                    <pdfaField:valueType>Real</pdfaField:valueType>
                                    <pdfaField:description>Number of pages of a given DocumentSection</pdfaField:description>
                                 </rdf:li>
                              </rdf:Seq>
                           </pdfaType:field>
                        </rdf:li>
                     </rdf:Seq>
                  </pdfaSchema:valueType>
               </rdf:li>
               <rdf:li rdf:parseType="Resource">
                  <pdfaSchema:namespaceURI>http://ns.adobe.com/xap/1.0/mm/</pdfaSchema:namespaceURI>
                  <pdfaSchema:prefix>xapMM</pdfaSchema:prefix>
                  <pdfaSchema:schema>xapMM Schema</pdfaSchema:schema>
                  <pdfaSchema:property>
                     <rdf:Seq>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>internal</pdfaProperty:category>
                           <pdfaProperty:name>InstanceID</pdfaProperty:name>
                           <pdfaProperty:valueType>URI</pdfaProperty:valueType>
                           <pdfaProperty:description>InstanceID Property</pdfaProperty:description>
                        </rdf:li>
                     </rdf:Seq>
                  </pdfaSchema:property>
               </rdf:li>
            </rdf:Bag>
         </pdfaExtension:schemas>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
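To give an idea of how the `patent:` properties in the listing above could be consumed, here is a rough sketch using Python's standard `xml.etree.ElementTree`. It works on an abbreviated excerpt of the packet (two inventors and two document sections only, without the `<?xpacket?>` wrapper). The function name `parse_patent_metadata` and the shape of the returned dictionary are assumptions for illustration, not an existing PatZilla API.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "patent": "http://www.epo.org/patent-bibliographic-data/1.0/",
}

# Abbreviated excerpt of the packet shown above.
EXCERPT = """\
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:patent="http://www.epo.org/patent-bibliographic-data/1.0/">
      <patent:Publication rdf:parseType="Resource">
        <patent:CountryCode>EP</patent:CountryCode>
        <patent:Number>0666666</patent:Number>
        <patent:KindCode>A2</patent:KindCode>
        <patent:Date>1995-08-09</patent:Date>
      </patent:Publication>
      <patent:Inventor>
        <rdf:Bag>
          <rdf:li>Cidon, Israel</rdf:li>
          <rdf:li>Georgiadis, Leonidas</rdf:li>
        </rdf:Bag>
      </patent:Inventor>
      <patent:DocumentStructure>
        <rdf:Seq>
          <rdf:li rdf:parseType="Resource">
            <patent:DocumentSection>description</patent:DocumentSection>
            <patent:StartPage>2</patent:StartPage>
            <patent:NumberOfPages>4</patent:NumberOfPages>
          </rdf:li>
          <rdf:li rdf:parseType="Resource">
            <patent:DocumentSection>claims</patent:DocumentSection>
            <patent:StartPage>5</patent:StartPage>
            <patent:NumberOfPages>2</patent:NumberOfPages>
          </rdf:li>
        </rdf:Seq>
      </patent:DocumentStructure>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
"""


def parse_patent_metadata(xmp_xml):
    """Collect publication fields, inventors and section page ranges."""
    root = ET.fromstring(xmp_xml)

    # Struct fields of patent:Publication, keyed by local name.
    pub = root.find(".//patent:Publication", NS)
    publication = {child.tag.split("}", 1)[1]: child.text for child in pub}

    # Items of the unordered rdf:Bag below patent:Inventor.
    inventors = [
        li.text for li in root.findall(".//patent:Inventor/rdf:Bag/rdf:li", NS)
    ]

    # Each rdf:Seq item becomes an inclusive (first_page, last_page) pair.
    sections = {}
    for li in root.findall(".//patent:DocumentStructure/rdf:Seq/rdf:li", NS):
        name = li.findtext("patent:DocumentSection", namespaces=NS)
        start = int(li.findtext("patent:StartPage", namespaces=NS))
        count = int(li.findtext("patent:NumberOfPages", namespaces=NS))
        sections[name] = (start, start + count - 1)

    return {"publication": publication, "inventors": inventors, "sections": sections}


result = parse_patent_metadata(EXCERPT)
pub = result["publication"]
# Assumption: joining the fields like this reproduces dc:identifier; this is
# only observed for this one sample document.
identifier = "-".join(
    [pub["CountryCode"], pub["Number"], pub["KindCode"], pub["Date"].replace("-", "")]
)
```

For this sample, the joined identifier coincides with the `dc:identifier` value in the listing. Whether that holds for every document on the publication server has not been verified here.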
