From 89643ba6fa98db82efd3246805ef801a8bfb5c81 Mon Sep 17 00:00:00 2001 From: Jeff Huth Date: Wed, 13 Nov 2019 17:03:56 -0800 Subject: Initial commit Discovery mode works. Still working on normal sync. --- README.md | 173 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 172 insertions(+), 1 deletion(-) (limited to 'README.md') diff --git a/README.md b/README.md index db00e68..6fcb0b9 100644 --- a/README.md +++ b/README.md @@ -1 +1,172 @@ -tap-google-sheets +# tap-google-sheets + +This is a [Singer](https://singer.io) tap that produces JSON-formatted data +following the [Singer +spec](https://github.com/singer-io/getting-started/blob/master/SPEC.md). + +This tap: + +- Pulls raw data from the [Google Sheets v4 API](https://developers.google.com/sheets/api) +- Extracts the following endpoints: + - [Spreadsheet File](https://developers.google.com/drive/api/v3/reference/files/get) + - [Spreadsheet Metadata, Sheets, Headers](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/get) + - [Spreadsheet Values](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/get) +- For each Sheet: + - Outputs the schema for each resource (based on the column header and datatypes of first row of data) + - Outputs a record for all columns with column headers, and for each row of data until it reaches an empty row + +## API Endpoints +[**file (GET)**](https://developers.google.com/drive/api/v3/reference/files/get) +- Endpoint: https://www.googleapis.com/drive/v3/files/${spreadsheet_id}?fields=id,name,createdTime,modifiedTime,version +- Primary keys: id +- Replication strategy: Full (GET file audit data for spreadsheet_id in config) +- Process/Transformations: Replicate Data if Modified + +[**metadata (GET)**](https://developers.google.com/drive/api/v3/reference/files/get) +- Endpoint: https://sheets.googleapis.com/v4/spreadsheets/${spreadsheet_id}?includeGridData=true&ranges=1:2 +- This endpoint eturns spreadsheet metadata, sheet metadata, and value metadata (data type information) +- Primary keys: spreadsheetId, title, field_name +- Foreign keys: None +- Replication strategy: Full (get and replace file metadata for spreadshee_id in config) +- Process/Transformations: + - Verify Sheets: Check sheets exist (compared to catalog) and check gridProperties (available area) + - sheetId, title, index, gridProperties (rowCount, columnCount) + - Verify Field Headers (1st row): Check field headers exist (compared to catalog), missing headers (columns to skip), column order/position, and column uniqueness + - Header's field_name, position: data.rowData[0].values[i].formattedValue + - Create/Verify Datatypes (2nd row): + - Row 2's datatype, format: data.rowData[1].values[i] + - First check: + - [effectiveValue: key](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/other#ExtendedValue) + - Valid types: numberValue, stringValue, boolValue + - Invalid types: formulaValue, errorValue + - Then check: + - [effectiveFormat.numberFormat.type](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/cells#NumberFormatType) + - Valid types: UNEPECIFIED, TEXT, NUMBER, PERCENT, CURRENCY, DATE, TIME, DATE_TIME, SCIENTIFIC + - If DATE or DATE_TIME, set JSON schema datatype = string and format = date-time + - [effectiveFormat.numberFormat.pattern](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/cells#NumberFormat) + +[**values (GET)**](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/get) +- Endpoint: https://sheets.googleapis.com/v4/spreadsheets/${spreadsheet_id}/values/'${sheet_name}'!${row_range}?dateTimeRenderOption=SERIAL_NUMBER&valueRenderOption=UNFORMATTED_VALUE&majorDimension=ROWS +- This endpoint loops through sheets and row ranges to get the [unformatted values](https://developers.google.com/sheets/api/reference/rest/v4/ValueRenderOption) (effective values only), dates and datetimes as [serial numbers](https://developers.google.com/sheets/api/reference/rest/v4/DateTimeRenderOption) +- Primary keys: row +- Replication strategy: Full (GET file audit data for spreadsheet_id in config) +- Process/Transformations: + - Loop through sheets (compared to catalog selection) + - Send metadata for sheet + - Loop through ranges of rows until reaching empty row or area max row (from sheet metadata) + - Transform values, if necessary (dates, date-times, boolean, integer, numers) + - Process/send records + +## Authentication +The [**Google Sheets Setup & Authentication**](https://drive.google.com/open?id=1FojlvtLwS0-BzGS37R0jEXtwSHqSiO1Uw-7RKQQO-C4) Google Doc provides instructions show how to configure the Google Cloud API credentials to enable Google Drive and Google Sheets APIs, configure Google Cloud to authorize/verify your domain ownership, generate an API key (client_id, client_secret), authenticate and generate a refresh_token, and prepare your tap config.json with the necessary parameters. +- Enable Googe Drive APIs and Authorization Scope: https://www.googleapis.com/auth/drive.metadata.readonly +- Enable Google Sheets API and Authorization Scope: https://www.googleapis.com/auth/spreadsheets.readonly +- Tap config.json parameters: + - client_id: identifies your application + - client_secret: authenticates your application + - refresh_token: generates an access token to authorize your session + - spreadsheet_id: unique identifier for each spreadsheet in Google Drive + - start_date: absolute minimum start date to check file modified + - user_agent: tap-name and email address; identifies your application in the Remote API server logs + +## Quick Start + +1. Install + + Clone this repository, and then install using setup.py. We recommend using a virtualenv: + + ```bash + > virtualenv -p python3 venv + > source venv/bin/activate + > python setup.py install + OR + > cd .../tap-google-sheets + > pip install . + ``` +2. Dependent libraries + The following dependent libraries were installed. + ```bash + > pip install target-json + > pip install target-stitch + > pip install singer-tools + > pip install singer-python + ``` + - [singer-tools](https://github.com/singer-io/singer-tools) + - [target-stitch](https://github.com/singer-io/target-stitch) + +3. Create your tap's `config.json` file. Include the client_id, client_secret, refresh_token, site_urls (website URL properties in a comma delimited list; do not include the domain-level property in the list), start_date (UTC format), and user_agent (tap name with the api user email address). + + ```json + { + "client_id": "YOUR_CLIENT_ID", + "client_secret": "YOUR_CLIENT_SECRET", + "refresh_token": "YOUR_REFRESH_TOKEN", + "spreadsheet_id": "YOUR_GOOGLE_SPREADSHEET_ID", + "start_date": "2019-01-01T00:00:00Z", + "user_agent": "tap-google-sheets " + } + ``` + + Optionally, also create a `state.json` file. `currently_syncing` is an optional attribute used for identifying the last object to be synced in case the job is interrupted mid-stream. The next run would begin where the last job left off. + Only the `performance_reports` uses a bookmark. The date-time bookmark is stored in a nested structure based on the endpoint, site, and sub_type. + + ```json + { + "currently_syncing": "file_metadata", + "bookmarks": { + "file_metadata": "2019-09-27T22:34:39.000000Z", + "sheets_loaded": "2019-09-27T21:55:42.000000Z" + } + } + ``` + +4. Run the Tap in Discovery Mode + This creates a catalog.json for selecting objects/fields to integrate: + ```bash + tap-google-sheets --config config.json --discover > catalog.json + ``` + See the Singer docs on discovery mode + [here](https://github.com/singer-io/getting-started/blob/master/docs/DISCOVERY_MODE.md#discovery-mode). + +5. Run the Tap in Sync Mode (with catalog) and [write out to state file](https://github.com/singer-io/getting-started/blob/master/docs/RUNNING_AND_DEVELOPING.md#running-a-singer-tap-with-a-singer-target) + + For Sync mode: + ```bash + > tap-google-sheets --config tap_config.json --catalog catalog.json > state.json + > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json + ``` + To load to json files to verify outputs: + ```bash + > tap-google-sheets --config tap_config.json --catalog catalog.json | target-json > state.json + > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json + ``` + To pseudo-load to [Stitch Import API](https://github.com/singer-io/target-stitch) with dry run: + ```bash + > tap-google-sheets --config tap_config.json --catalog catalog.json | target-stitch --config target_config.json --dry-run > state.json + > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json + ``` + +6. Test the Tap + + While developing the Google Search Console tap, the following utilities were run in accordance with Singer.io best practices: + Pylint to improve [code quality](https://github.com/singer-io/getting-started/blob/master/docs/BEST_PRACTICES.md#code-quality): + ```bash + > pylint tap_google_sheets -d missing-docstring -d logging-format-interpolation -d too-many-locals -d too-many-arguments + ``` + Pylint test resulted in the following score: + ```bash + TBD + ``` + + To [check the tap](https://github.com/singer-io/singer-tools#singer-check-tap) and verify working: + ```bash + > tap-google-sheets --config tap_config.json --catalog catalog.json | singer-check-tap > state.json + > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json + ``` + Check tap resulted in the following: + ```bash + TBD + ``` +--- + +Copyright © 2019 Stitch -- cgit v1.2.3