INTRODUCTION
------------

Search API Attachments

This module is an add-on to the Search API that allows indexing and searching
of attachments.

The extraction can be done using one of:

- Apache Tika library
- The Solr built-in extractor
- Acquia Search
- pdftotext (for PDFs)
- Python pdf2txt (for PDFs)

REQUIREMENTS
------------

- The ability to run Java on your server and an installation of the Apache
  Tika library, if you don't want to use the Solr built-in extractor.
- PHP-iconv, to index txt files.

MODULE INSTALLATION
-------------------

- Copy search_api_attachments into your modules folder.
- Install the search_api_attachments module in your Drupal site.
- Go to the configuration page: admin/config/search/search_api/attachments
- Choose an extraction method and follow the instructions under the
  respective heading below.

EXTRACTION CONFIGURATION (Tika)
-------------------------------

- Install Java.
- Download the Apache Tika library: http://tika.apache.org/download.html
  The downloaded file is named something like tika-app-y.x.jar.
- In admin/config/search/search_api/attachments, enter the parent directory
  and the file name of the .jar file.

- Hidden settings
  search_api_attachments_java: By changing this variable, you can set the
  path to your Java executable. The default is 'java'.

EXTRACTION CONFIGURATION (Solr)
-------------------------------

This requires the Solr search (search_api_solr) module and the Solr config
files that come with it. Please follow the Solr search module instructions
for configuring a Search API Solr server.

EXTRACTION CONFIGURATION (Pdftotext)
------------------------------------

Pdftotext is a command-line utility included by default on many Linux
distributions.
See the Wikipedia page for more info: https://en.wikipedia.org/wiki/Pdftotext

EXTRACTION CONFIGURATION (python Pdf2txt)
-----------------------------------------

Install Pdf2txt (tested with package version 20110515+dfsg-1 and
Python 2.7.9):
> sudo apt-get install python-pdfminer

SUBMODULES
----------

For each of these, find more details in the contrib folder.

- search_api_attachments_comment
- search_api_attachments_commerce_product_reference
- search_api_attachments_entityreference
- search_api_attachments_field_collections
- search_api_attachments_links
- search_api_attachments_multifield
- search_api_attachments_multiple_entities
- search_api_attachments_paragraphs
- search_api_attachments_references
- search_api_attachments_user_content

CACHING
-------

Extracting the content of files can take a long time, and it may not be
necessary to repeat it each time a node is reindexed.
search_api_attachments has a cache bin that stores the extracted content of
all files: this is the cache_search_api_attachments table.
Cache keys have the form 'cached_extraction_[fid]', where [fid] is the
file id.
When a file is deleted or updated, its cached extraction is dropped.
When the sitewide cache is cleared (drush cc all, for example), all cached
extractions are dropped only if the 'Preserve cached extractions across cache
clears.' option is unchecked in the configuration form of the module.

DEVELOPMENT
-----------

On the admin form of Search API Attachments, you can enable the debug
feature. It adds a lot of information to the watchdog while indexing.

HIDDEN FEATURES
---------------

This module offers a Views filter that lets you choose whether or not to
search in attachment files as well.

HOOKS
-----

This module provides hook_search_api_attachments_indexable.
See more details in search_api_attachments.api.php
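
To check from the command line that one of the extraction tools described in
the EXTRACTION CONFIGURATION sections works before selecting it in the
configuration form, a small helper like the one below may be useful. It is
only a sketch and not part of the module: the function name
extraction_command, the JAVA_BIN and TIKA_JAR variables, the /opt/tika path
and the sample file path are all hypothetical, and the commands are not
necessarily the ones the module itself runs. It assumes the Tika app's -t
(plain text) option, pdftotext writing to standard output when the output
file is given as "-", and a pdf2txt command as installed by the
python-pdfminer package (on some systems it may be named pdf2txt.py).

```shell
#!/bin/sh
# Hypothetical helper, not part of search_api_attachments: prints the
# command line that would extract plain text from a file with a given tool.

# Assumed locations; adjust them to your server.
JAVA_BIN="${JAVA_BIN:-java}"                        # same default as the module
TIKA_JAR="${TIKA_JAR:-/opt/tika/tika-app-y.x.jar}"  # placeholder jar name

extraction_command() {
  method="$1"
  file="$2"
  case "$method" in
    # Tika app: -t asks for plain-text output.
    tika)      printf '%s -jar %s -t %s\n' "$JAVA_BIN" "$TIKA_JAR" "$file" ;;
    # pdftotext: "-" as the output file means standard output.
    pdftotext) printf 'pdftotext %s -\n' "$file" ;;
    # pdf2txt from pdfminer prints the text to standard output by default.
    pdf2txt)   printf 'pdf2txt %s\n' "$file" ;;
    *)         printf 'unknown method: %s\n' "$method" >&2; return 1 ;;
  esac
}

# Print the command for each tool, using a hypothetical sample file.
extraction_command tika /tmp/sample.pdf
extraction_command pdftotext /tmp/sample.pdf
extraction_command pdf2txt /tmp/sample.pdf
```

Running one of the printed commands by hand on a sample PDF should print the
extracted text if the corresponding tool is installed. The helper only prints
commands, so it is safe to run on a server where some tools are missing.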