README.md

   1 # tap-google-sheets
   2
   3 This is a [Singer](https://singer.io) tap that produces JSON-formatted data
   4 following the [Singer
   5 spec](https://github.com/singer-io/getting-started/blob/master/SPEC.md).
   6
   7 This tap:
   8
   9 - Pulls raw data from the [Google Sheets v4 API](https://developers.google.com/sheets/api)
  10 - Extracts the following endpoints:
  11   - [File Metadata](https://developers.google.com/drive/api/v3/reference/files/get)
  12   - [Spreadsheet Metadata](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/get)
  13   - [Spreadsheet Values](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/get)
  14 - Outputs the following metadata streams:
  15   - File Metadata: Name, audit/change info from Google Drive
  16   - Spreadsheet Metadata: Basic metadata about the Spreadsheet: Title, Locale, URL, etc.
  17   - Sheet Metadata: Title, URL, Area (max column and row), and Column Metadata
  18     - Column Metadata: Column Header Name, Data type, Format
  19   - Sheets Loaded: Sheet title, load date, number of rows
  20 - For each Sheet:
  21   - Outputs the schema for each resource (based on the column header and datatypes of row 2, the first row of data)
  22   - Outputs a record for all columns that have column headers, and for each row of data
  23   - Emits a Singer ACTIVATE_VERSION message after each sheet is complete. This forces hard deletes on the data downstream if fewer records are sent.
  24   - Primary Key for each row in a Sheet is the Row Number:  `__sdc_row`
  25   - Each Row in a Sheet also includes Foreign Keys to the Spreadsheet Metadata, `__sdc_spreadsheet_id`, and Sheet Metadata, `__sdc_sheet_id`.
  26
  27 ## API Endpoints
  28 [**file (GET)**](https://developers.google.com/drive/api/v3/reference/files/get)
  29 - Endpoint: https://www.googleapis.com/drive/v3/files/${spreadsheet_id}?fields=id,name,createdTime,modifiedTime,version
  30 - Primary keys: id
  31 - Replication strategy: Incremental (GET file audit data for spreadsheet_id in config)
  32 - Process/Transformations: Replicate Data if Modified
  33
  34 [**metadata (GET)**](https://developers.google.com/drive/api/v3/reference/files/get)
  35 - Endpoint: https://sheets.googleapis.com/v4/spreadsheets/${spreadsheet_id}?includeGridData=true&ranges=1:2
  36 - This endpoint eturns spreadsheet metadata, sheet metadata, and value metadata (data type information)
  37 - Primary keys: Spreadsheet Id, Sheet Id, Column Index
  38 - Foreign keys: None
  39 - Replication strategy: Full (get and replace file metadata for spreadshee_id in config)
  40 - Process/Transformations:
  41   - Verify Sheets: Check sheets exist (compared to catalog) and check gridProperties (available area)
  42     - sheetId, title, index, gridProperties (rowCount, columnCount)
  43   - Verify Field Headers (1st row): Check field headers exist (compared to catalog), missing headers (columns to skip), column order/position, and column name uniqueness
  44   - Create/Verify Datatypes based on 2nd row value and cell metadata
  45       - First check:
  46         - [effectiveValue: key](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/other#ExtendedValue)
  47           - Valid types: numberValue, stringValue, boolValue
  48           - Invalid types: formulaValue, errorValue
  49       - Then check:
  50         - [effectiveFormat.numberFormat.type](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/cells#NumberFormatType)
  51           - Valid types: UNEPECIFIED, TEXT, NUMBER, PERCENT, CURRENCY, DATE, TIME, DATE_TIME, SCIENTIFIC
  52           - Determine JSON schema column data type based on the value and the above cell metadata settings.
  53           - If DATE, DATE_TIME, or TIME, set JSON schema format accordingly
  54
  55 [**values (GET)**](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/get)
  56 - Endpoint: https://sheets.googleapis.com/v4/spreadsheets/${spreadsheet_id}/values/'${sheet_name}'!${row_range}?dateTimeRenderOption=SERIAL_NUMBER&valueRenderOption=UNFORMATTED_VALUE&majorDimension=ROWS
  57 - This endpoint loops through sheets and row ranges to get the [unformatted values](https://developers.google.com/sheets/api/reference/rest/v4/ValueRenderOption) (effective values only), dates and datetimes as [serial numbers](https://developers.google.com/sheets/api/reference/rest/v4/DateTimeRenderOption)
  58 - Primary keys: _sdc_row
  59 - Replication strategy: Full (GET file audit data for spreadsheet_id in config)
  60 - Process/Transformations:
  61   - Loop through sheets (compared to catalog selection)
  62     - Send metadata for sheet
  63   - Loop through ALL columns for columns having a column header
  64   - Loop through ranges of rows for ALL rows in sheet available area max row (from sheet metadata)
  65   - Transform values, if necessary (dates, date-times, times, boolean).
  66     - Date/time serial numbers converted to date, date-time, and time strings. Google Sheets uses Lotus 1-2-3 [Serial Number](https://developers.google.com/sheets/api/reference/rest/v4/DateTimeRenderOption) format for date/times. These are converted to normal UTC date-time strings.
  67   - Process/send records to target
  68
  69 ## Authentication
  70 The [**Google Sheets Setup & Authentication**](https://drive.google.com/open?id=1FojlvtLwS0-BzGS37R0jEXtwSHqSiO1Uw-7RKQQO-C4) Google Doc provides instructions show how to configure the Google Cloud API credentials to enable Google Drive and Google Sheets APIs, configure Google Cloud to authorize/verify your domain ownership, generate an API key (client_id, client_secret), authenticate and generate a refresh_token, and prepare your tap config.json with the necessary parameters.
  71 - Enable Googe Drive APIs and Authorization Scope: https://www.googleapis.com/auth/drive.metadata.readonly
  72 - Enable Google Sheets API and Authorization Scope: https://www.googleapis.com/auth/spreadsheets.readonly
  73 - Tap config.json parameters:
  74   - client_id: identifies your application
  75   - client_secret: authenticates your application
  76   - refresh_token: generates an access token to authorize your session
  77   - spreadsheet_id: unique identifier for each spreadsheet in Google Drive
  78   - start_date: absolute minimum start date to check file modified
  79   - user_agent: tap-name and email address; identifies your application in the Remote API server logs
  80
  81 ## Quick Start
  82
  83 1. Install
  84
  85     Clone this repository, and then install using setup.py. We recommend using a virtualenv:
  86
  87     ```bash
  88     > virtualenv -p python3 venv
  89     > source venv/bin/activate
  90     > python setup.py install
  91     OR
  92     > cd .../tap-google-sheets
  93     > pip install .
  94     ```
  95 2. Dependent libraries
  96     The following dependent libraries were installed.
  97     ```bash
  98     > pip install target-json
  99     > pip install target-stitch
 100     > pip install singer-tools
 101     > pip install singer-python
 102     ```
 103     - [singer-tools](https://github.com/singer-io/singer-tools)
 104     - [target-stitch](https://github.com/singer-io/target-stitch)
 105
 106 3. Create your tap's `config.json` file. Include the client_id, client_secret, refresh_token, site_urls (website URL properties in a comma delimited list; do not include the domain-level property in the list), start_date (UTC format), and user_agent (tap name with the api user email address).
 107
 108     ```json
 109     {
 110         "client_id": "YOUR_CLIENT_ID",
 111         "client_secret": "YOUR_CLIENT_SECRET",
 112         "refresh_token": "YOUR_REFRESH_TOKEN",
 113         "spreadsheet_id": "YOUR_GOOGLE_SPREADSHEET_ID",
 114         "start_date": "2019-01-01T00:00:00Z",
 115         "user_agent": "tap-google-sheets <api_user_email@example.com>"
 116     }
 117     ```
 118
 119     Optionally, also create a `state.json` file. `currently_syncing` is an optional attribute used for identifying the last object to be synced in case the job is interrupted mid-stream. The next run would begin where the last job left off.
 120     Only the `performance_reports` uses a bookmark. The date-time bookmark is stored in a nested structure based on the endpoint, site, and sub_type.
 121
 122     ```json
 123     {
 124         "currently_syncing": "file_metadata",
 125         "bookmarks": {
 126             "file_metadata": "2019-09-27T22:34:39.000000Z"
 127         }
 128     }
 129     ```
 130
 131 4. Run the Tap in Discovery Mode
 132     This creates a catalog.json for selecting objects/fields to integrate:
 133     ```bash
 134     tap-google-sheets --config config.json --discover > catalog.json
 135     ```
 136    See the Singer docs on discovery mode
 137    [here](https://github.com/singer-io/getting-started/blob/master/docs/DISCOVERY_MODE.md#discovery-mode).
 138
 139 5. Run the Tap in Sync Mode (with catalog) and [write out to state file](https://github.com/singer-io/getting-started/blob/master/docs/RUNNING_AND_DEVELOPING.md#running-a-singer-tap-with-a-singer-target)
 140
 141     For Sync mode:
 142     ```bash
 143     > tap-google-sheets --config tap_config.json --catalog catalog.json > state.json
 144     > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
 145     ```
 146     To load to json files to verify outputs:
 147     ```bash
 148     > tap-google-sheets --config tap_config.json --catalog catalog.json | target-json > state.json
 149     > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
 150     ```
 151     To pseudo-load to [Stitch Import API](https://github.com/singer-io/target-stitch) with dry run:
 152     ```bash
 153     > tap-google-sheets --config tap_config.json --catalog catalog.json | target-stitch --config target_config.json --dry-run > state.json
 154     > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
 155     ```
 156
 157 6. Test the Tap
 158
 159     While developing the Google Search Console tap, the following utilities were run in accordance with Singer.io best practices:
 160     Pylint to improve [code quality](https://github.com/singer-io/getting-started/blob/master/docs/BEST_PRACTICES.md#code-quality):
 161     ```bash
 162     > pylint tap_google_sheets -d missing-docstring -d logging-format-interpolation -d too-many-locals -d too-many-arguments
 163     ```
 164     Pylint test resulted in the following score:
 165     ```bash
 166     Your code has been rated at 9.78/10
 167     ```
 168
 169     To [check the tap](https://github.com/singer-io/singer-tools#singer-check-tap) and verify working:
 170     ```bash
 171     > tap-google-sheets --config tap_config.json --catalog catalog.json | singer-check-tap > state.json
 172     > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
 173     ```
 174     Check tap resulted in the following:
 175     ```bash
 176     The output is valid.
 177     It contained 3881 messages for 13 streams.
 178
 179         13 schema messages
 180       3841 record messages
 181         27 state messages
 182
 183     Details by stream:
 184     +----------------------+---------+---------+
 185     | stream               | records | schemas |
 186     +----------------------+---------+---------+
 187     | file_metadata        | 1       | 1       |
 188     | spreadsheet_metadata | 1       | 1       |
 189     | Test-1               | 9       | 1       |
 190     | Test 2               | 2       | 1       |
 191     | SKU COGS             | 218     | 1       |
 192     | Item Master          | 216     | 1       |
 193     | Retail Price         | 273     | 1       |
 194     | Retail Price NEW     | 284     | 1       |
 195     | Forecast Scenarios   | 2681    | 1       |
 196     | Promo Type           | 91      | 1       |
 197     | Shipping Method      | 47      | 1       |
 198     | sheet_metadata       | 9       | 1       |
 199     | sheets_loaded        | 9       | 1       |
 200     +----------------------+---------+---------+
 201     ```
 202 ---
 203
 204 Copyright &copy; 2019 Stitch