aboutsummaryrefslogtreecommitdiffhomepage
path: root/README.md
blob: 6fcb0b95d86e7743fc1ce251825ac5dcc4082ba3 (plain) (blame)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
# tap-google-sheets

This is a [Singer](https://singer.io) tap that produces JSON-formatted data
following the [Singer
spec](https://github.com/singer-io/getting-started/blob/master/SPEC.md).

This tap:

- Pulls raw data from the [Google Sheets v4 API](https://developers.google.com/sheets/api)
- Extracts the following endpoints:
  - [Spreadsheet File](https://developers.google.com/drive/api/v3/reference/files/get)
  - [Spreadsheet Metadata, Sheets, Headers](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/get)
  - [Spreadsheet Values](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/get)
- For each Sheet:
  - Outputs the schema for each resource (based on the column header and datatypes of first row of data)
  - Outputs a record for all columns with column headers, and for each row of data until it reaches an empty row

## API Endpoints
[**file (GET)**](https://developers.google.com/drive/api/v3/reference/files/get)
- Endpoint: https://www.googleapis.com/drive/v3/files/${spreadsheet_id}?fields=id,name,createdTime,modifiedTime,version
- Primary keys: id
- Replication strategy: Full (GET file audit data for spreadsheet_id in config)
- Process/Transformations: Replicate Data if Modified

[**metadata (GET)**](https://developers.google.com/drive/api/v3/reference/files/get)
- Endpoint: https://sheets.googleapis.com/v4/spreadsheets/${spreadsheet_id}?includeGridData=true&ranges=1:2
- This endpoint eturns spreadsheet metadata, sheet metadata, and value metadata (data type information)
- Primary keys: spreadsheetId, title, field_name
- Foreign keys: None
- Replication strategy: Full (get and replace file metadata for spreadshee_id in config)
- Process/Transformations:
  - Verify Sheets: Check sheets exist (compared to catalog) and check gridProperties (available area)
    - sheetId, title, index, gridProperties (rowCount, columnCount)
  - Verify Field Headers (1st row): Check field headers exist (compared to catalog), missing headers (columns to skip), column order/position, and column uniqueness
    - Header's field_name, position: data.rowData[0].values[i].formattedValue
  - Create/Verify Datatypes (2nd row): 
    - Row 2's datatype, format: data.rowData[1].values[i]
      - First check:
        - [effectiveValue: key](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/other#ExtendedValue)
          - Valid types: numberValue, stringValue, boolValue
          - Invalid types: formulaValue, errorValue
      - Then check:
        - [effectiveFormat.numberFormat.type](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/cells#NumberFormatType)
          - Valid types: UNEPECIFIED, TEXT, NUMBER, PERCENT, CURRENCY, DATE, TIME, DATE_TIME, SCIENTIFIC
          - If DATE or DATE_TIME, set JSON schema datatype = string and format = date-time
        - [effectiveFormat.numberFormat.pattern](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/cells#NumberFormat)

[**values (GET)**](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/get)
- Endpoint: https://sheets.googleapis.com/v4/spreadsheets/${spreadsheet_id}/values/'${sheet_name}'!${row_range}?dateTimeRenderOption=SERIAL_NUMBER&valueRenderOption=UNFORMATTED_VALUE&majorDimension=ROWS
- This endpoint loops through sheets and row ranges to get the [unformatted values](https://developers.google.com/sheets/api/reference/rest/v4/ValueRenderOption) (effective values only), dates and datetimes as [serial numbers](https://developers.google.com/sheets/api/reference/rest/v4/DateTimeRenderOption)
- Primary keys: row
- Replication strategy: Full (GET file audit data for spreadsheet_id in config)
- Process/Transformations:
  - Loop through sheets (compared to catalog selection)
    - Send metadata for sheet
  - Loop through ranges of rows until reaching empty row or area max row (from sheet metadata)
  - Transform values, if necessary (dates, date-times, boolean, integer, numers)
  - Process/send records

## Authentication
The [**Google Sheets Setup & Authentication**](https://drive.google.com/open?id=1FojlvtLwS0-BzGS37R0jEXtwSHqSiO1Uw-7RKQQO-C4) Google Doc provides instructions show how to configure the Google Cloud API credentials to enable Google Drive and Google Sheets APIs, configure Google Cloud to authorize/verify your domain ownership, generate an API key (client_id, client_secret), authenticate and generate a refresh_token, and prepare your tap config.json with the necessary parameters.
- Enable Googe Drive APIs and Authorization Scope: https://www.googleapis.com/auth/drive.metadata.readonly
- Enable Google Sheets API and Authorization Scope: https://www.googleapis.com/auth/spreadsheets.readonly
- Tap config.json parameters:
  - client_id: identifies your application
  - client_secret: authenticates your application
  - refresh_token: generates an access token to authorize your session
  - spreadsheet_id: unique identifier for each spreadsheet in Google Drive
  - start_date: absolute minimum start date to check file modified
  - user_agent: tap-name and email address; identifies your application in the Remote API server logs

## Quick Start

1. Install

    Clone this repository, and then install using setup.py. We recommend using a virtualenv:

    ```bash
    > virtualenv -p python3 venv
    > source venv/bin/activate
    > python setup.py install
    OR
    > cd .../tap-google-sheets
    > pip install .
    ```
2. Dependent libraries
    The following dependent libraries were installed.
    ```bash
    > pip install target-json
    > pip install target-stitch
    > pip install singer-tools
    > pip install singer-python
    ```
    - [singer-tools](https://github.com/singer-io/singer-tools)
    - [target-stitch](https://github.com/singer-io/target-stitch)

3. Create your tap's `config.json` file. Include the client_id, client_secret, refresh_token, site_urls (website URL properties in a comma delimited list; do not include the domain-level property in the list), start_date (UTC format), and user_agent (tap name with the api user email address).

    ```json
    {
        "client_id": "YOUR_CLIENT_ID",
        "client_secret": "YOUR_CLIENT_SECRET",
        "refresh_token": "YOUR_REFRESH_TOKEN",
        "spreadsheet_id": "YOUR_GOOGLE_SPREADSHEET_ID",
        "start_date": "2019-01-01T00:00:00Z",
        "user_agent": "tap-google-sheets <api_user_email@example.com>"
    }
    ```
    
    Optionally, also create a `state.json` file. `currently_syncing` is an optional attribute used for identifying the last object to be synced in case the job is interrupted mid-stream. The next run would begin where the last job left off.
    Only the `performance_reports` uses a bookmark. The date-time bookmark is stored in a nested structure based on the endpoint, site, and sub_type.

    ```json
    {
        "currently_syncing": "file_metadata",
        "bookmarks": {
            "file_metadata": "2019-09-27T22:34:39.000000Z",
            "sheets_loaded": "2019-09-27T21:55:42.000000Z"
        }
    }
    ```

4. Run the Tap in Discovery Mode
    This creates a catalog.json for selecting objects/fields to integrate:
    ```bash
    tap-google-sheets --config config.json --discover > catalog.json
    ```
   See the Singer docs on discovery mode
   [here](https://github.com/singer-io/getting-started/blob/master/docs/DISCOVERY_MODE.md#discovery-mode).

5. Run the Tap in Sync Mode (with catalog) and [write out to state file](https://github.com/singer-io/getting-started/blob/master/docs/RUNNING_AND_DEVELOPING.md#running-a-singer-tap-with-a-singer-target)

    For Sync mode:
    ```bash
    > tap-google-sheets --config tap_config.json --catalog catalog.json > state.json
    > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
    ```
    To load to json files to verify outputs:
    ```bash
    > tap-google-sheets --config tap_config.json --catalog catalog.json | target-json > state.json
    > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
    ```
    To pseudo-load to [Stitch Import API](https://github.com/singer-io/target-stitch) with dry run:
    ```bash
    > tap-google-sheets --config tap_config.json --catalog catalog.json | target-stitch --config target_config.json --dry-run > state.json
    > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
    ```

6. Test the Tap
    
    While developing the Google Search Console tap, the following utilities were run in accordance with Singer.io best practices:
    Pylint to improve [code quality](https://github.com/singer-io/getting-started/blob/master/docs/BEST_PRACTICES.md#code-quality):
    ```bash
    > pylint tap_google_sheets -d missing-docstring -d logging-format-interpolation -d too-many-locals -d too-many-arguments
    ```
    Pylint test resulted in the following score:
    ```bash
    TBD
    ```

    To [check the tap](https://github.com/singer-io/singer-tools#singer-check-tap) and verify working:
    ```bash
    > tap-google-sheets --config tap_config.json --catalog catalog.json | singer-check-tap > state.json
    > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
    ```
    Check tap resulted in the following:
    ```bash
    TBD
    ```
---

Copyright &copy; 2019 Stitch