For one of our clients, we are looking for a Consultant (m/f/d), Crawler for MS SharePoint and OneNote.
The service is requested as part of the project AI@SSC. The purpose of the project is to build a crawler that periodically, or on demand, fetches content from certain websites on Henkel’s SharePoint Server and dumps a structured image of the fetched content to a given server location, so that the dumped image can be used as a mirror of the original content.
1. Crawling File Lists (Site Libraries and Lists/Documents)
Most of the SharePoint websites are simply lists of files, such as MS Office Word, Excel, PDF, Access and so on. In these cases, the crawler:
• downloads the files;
• provides simple metadata about them:
o the metadata recorded in SharePoint (author, date, etc.);
o most importantly, the URL of each file. The provided URLs shall uniquely identify the crawled content (as far as possible; this can be discussed) and shall be usable to access or download the files from their original SharePoint site or to view them in a standard web browser (e.g. the URL of the main SharePoint site alone would be useless).
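The per-file URL requirement can be illustrated with a minimal sketch. It assumes the crawler already knows the site origin and the server-relative path of each file; the function names, the field names and the example host are hypothetical and only serve as illustration.

```python
from urllib.parse import quote, urljoin


def direct_file_url(site_origin: str, server_relative_path: str) -> str:
    """Build a URL that points at one file, not at the site root.

    Both arguments are assumed inputs: the origin of the SharePoint
    server and the path of the file relative to that server.
    """
    # Percent-encode spaces and other unsafe characters, keep "/" separators.
    encoded_path = quote(server_relative_path, safe="/")
    return urljoin(site_origin, encoded_path)


def file_record(site_origin: str, server_relative_path: str,
                author: str, modified: str) -> dict:
    """Collect the simple metadata for one crawled file (hypothetical fields)."""
    return {
        "origin_uri": direct_file_url(site_origin, server_relative_path),
        "author": author,
        "modified": modified,
    }
```

A record built this way carries a link that opens the individual file in a browser, which is what distinguishes it from the URL of the main site.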
2. Crawling Pages of SharePoint Team and Communications Sites
In case a SharePoint site contains rich content pages, the crawler shall be able to:
• read SharePoint communication/team websites;
• identify logical segments such as web pages and their related content (including attachments), and convert and dump them in HTML format (such that they can be used as a mirror of the original content);
• assign a set of metadata to every identified segment, of which a working URL is mandatory.
3. Reading OneNote Files and Generating HTML Pages
OneNote files are a special case that requires further processing:
• Logical segments, such as sections/subsections, in each notebook shall be recognized and then converted into one or more HTML files;
• Each logical segment shall carry the metadata available in OneNote, e.g. title, author and date, as well as a URL that can be used in a standard browser to directly access the identified page/segment;
• The attachments in the identified segments, such as Excel, PDF and so on, shall be downloaded and repackaged properly together with the main HTML files;
• The generated HTML files shall contain proper links to the above-mentioned attachments;
• The generated HTML file and its attachments shall be self-contained and complete, in the sense that they mirror the content of their original OneNote segment.
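As an illustration of the linking requirement, the following sketch renders one segment as an HTML page whose attachment links are relative, so the page and its attachment folder can be moved together. It assumes the segment body has already been converted to HTML and that the attachments are stored in a subfolder next to the page; the function name and the folder name "attachments" are hypothetical.

```python
from html import escape
from pathlib import PurePosixPath
from urllib.parse import quote


def render_segment(title: str, body_html: str, attachment_names: list,
                   attachment_dir: str = "attachments") -> str:
    """Return a self-contained HTML page for one logical segment."""
    links = []
    for name in attachment_names:
        # Relative, percent-encoded link into the local attachment folder.
        href = quote(str(PurePosixPath(attachment_dir) / name))
        links.append(f'<li><a href="{href}">{escape(name)}</a></li>')
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{escape(title)}</title></head>\n'
        f"<body><h1>{escape(title)}</h1>\n{body_html}\n"
        "<ul>\n" + "\n".join(links) + "\n</ul>\n</body></html>\n"
    )
```

Keeping the links relative is one possible design choice; absolute local paths would break as soon as the dump is copied to another location.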
In the above descriptions, we assume the metadata consists of at least two URIs, plus further fields:
a) A local URI, which is simply the location in which the file(s) of the logical document generated by the crawler are stored (e.g., /var/crawler/filexyz).
b) The URI of origin (the web URL of the resource).
c) These two URIs together are taken as the unique key. Presumably, the mapping of (a) to (b) (local logical documents to web documents) is one-to-one, but this may not always be feasible (perhaps an additional # mark can be added to the URL).
d) Other metadata, such as the time of crawling, the creation date of the file/document, its authors and so on, will be stored alongside the key.
e) The metadata will preferably be stored in a format such as YAML, saved individually in location (a), or, less desirably, in a simple database table.
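A minimal sketch of such a metadata record follows. It assumes a flat record with simple key names and writes it as a sidecar file inside the local location (a). To stay within the standard library, values are emitted as JSON-quoted strings, which YAML parsers generally accept as double-quoted scalars; a real implementation would more likely use a YAML library. The file name metadata.yaml and all field names are hypothetical.

```python
import json
from pathlib import Path


def to_yaml(record: dict) -> str:
    """Serialize a flat mapping as 'key: "value"' lines.

    Assumes keys are plain identifiers; nested structures are not handled.
    """
    return "".join(
        f"{key}: {json.dumps(str(value), ensure_ascii=False)}\n"
        for key, value in record.items()
    )


def write_sidecar(local_uri: str, record: dict) -> Path:
    """Store the record next to the crawled files in location (a)."""
    target = Path(local_uri) / "metadata.yaml"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(to_yaml(record), encoding="utf-8")
    return target


# Example record: the pair (local_uri, origin_uri) acts as the unique key.
example = {
    "local_uri": "/var/crawler/filexyz",
    "origin_uri": "https://sharepoint.example.com/sites/team/page.aspx#section-1",
    "crawled_at": "2021-03-01T10:00:00Z",
    "author": "Jane Doe",
}
```

The # fragment in the example origin URI shows one way to keep the mapping one-to-one when several logical segments share the same web page.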
The above crawler software should comply with the security restrictions set by general Henkel policy.
The software should run in a Linux environment (Ubuntu/Debian) and preferably have a Python API; however, in an exceptional situation, the crawler can be an MS Windows app (this imposes additional maintenance costs, which should be taken into consideration).
Ideally, the crawler detects unchanged documents and skips recrawling them.
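One possible way to detect unchanged documents is sketched below. It assumes the metadata stored at the previous crawl contains the last-modified timestamp reported by the source and a SHA-256 digest of the downloaded content; both field names ("modified", "sha256") are hypothetical, and other change signals could be used instead.

```python
import hashlib
from typing import Optional


def content_digest(data: bytes) -> str:
    """SHA-256 hex digest of the downloaded bytes."""
    return hashlib.sha256(data).hexdigest()


def needs_recrawl(stored: Optional[dict], remote_modified: str) -> bool:
    """Cheap check before downloading: compare last-modified timestamps."""
    if stored is None:
        return True  # never crawled before
    return stored.get("modified") != remote_modified


def content_changed(stored: Optional[dict], data: bytes) -> bool:
    """Stricter check after downloading: compare content digests."""
    return stored is None or stored.get("sha256") != content_digest(data)
```

The timestamp check avoids the download entirely, while the digest check catches cases where the timestamp changed but the content did not.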
Compliance with GDPR is handled by the client and the content provider, i.e. it is assumed that crawling the SharePoint sites does not violate any rules regarding the protection of personal information in the EU.
Duration: 6 weeks